Int J Biostat. Jan 1, 2010; 6(1): Article 17.
Published online May 17, 2010. doi:  10.2202/1557-4679.1181
PMCID: PMC2898626

Collaborative Double Robust Targeted Maximum Likelihood Estimation*

Abstract

Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over the standard targeted maximum likelihood estimators (TMLE) of a pathwise differentiable parameter of a data generating distribution in a semiparametric model introduced in van der Laan and Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. The TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.

In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state-of-the-art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLEs of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline identifiable.

We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, provided that g solves a specified score equation implied by the difference between Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.

We also establish an asymptotic linearity theorem for the collaborative double robust TMLE (C-DR-TMLE) of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth and, as a consequence, can even be super efficient if the first-stage density estimator is itself an excellent estimator with respect to the target parameter.

This research provides a template for targeted, efficient, and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.

Keywords: asymptotic linearity, coarsening at random, causal effect, censored data, cross-validation, collaborative double robust, double robust, efficient influence curve, estimating function, estimator selection, influence curve, G-computation, locally efficient, loss-function, marginal structural model, maximum likelihood estimation, model selection, pathwise derivative, semiparametric model, sieve, super efficiency, super-learning, targeted maximum likelihood estimation, targeted nuisance parameter estimator selection, variable importance

1. Introduction

Researchers acknowledge that questions about our infinite-dimensional, semi-parametric world are not well-addressed by parametric models. More sophisticated tools are needed to wrest meaning from data. We can and should develop and utilize methods specifically designed to estimate a relatively low-dimensional, precisely specified parameter within such a semiparametric model that is identifiable from the data. The ideal method would be entirely a priori specified, have desirable statistical properties, avoid reliance on ad hoc or arbitrary specifications, and be computationally feasible.

Suppose one observes a sample of independent and identically distributed observations from a particular data generating distribution P0 in a semiparametric model, and that one is concerned with estimation of a particular pathwise differentiable parameter of the data generating distribution. A parameter should be viewed as a mapping from the semiparametric model to the parameter space (e.g., the real line). A parameter mapping is pathwise differentiable at P0 if it is differentiable along all smooth parametric sub-models through P0, and its derivative is uniformly bounded as a linear mapping on the Hilbert space of all scores of these parametric submodels. Intuitively, a pathwise differentiable parameter is a parameter which has a finite generalized Cramér-Rao information lower bound, so that in principle, under enough regularity conditions, it is possible to construct an estimator which behaves like a sample mean of i.i.d. random variables. Due to the curse of dimensionality implied by the infinite dimension of semi-parametric models, standard (nonparametric) maximum likelihood estimation is often ill-defined or breaks down due to overfitting, while, on the other hand, regularized sieve-based maximum likelihood estimation results in overly biased plug-in estimators of the target parameter of interest.

The latter is due to the fact that such likelihood based estimators seek and achieve a bias-variance trade-off that is optimal for the density of the distribution of the data itself. Since the variance of an optimally smoothed density estimator is typically much larger than the variance of a smooth (pathwise-differentiable) parameter of the density estimator, the substitution estimators are often too biased relative to their variance. That is, substitution estimators based on density estimators involving optimal (e.g., likelihood-based) bias-variance trade-off (for the whole density) are not targeted towards the parameter of interest.

Motivated by this problem with the bias-variance trade-off of maximum likelihood estimation in semiparametric models, while still wanting to preserve the log-likelihood as the principal criterion in estimation, in van der Laan and Rubin (2006) we introduced and developed a targeted maximum likelihood estimator of the parameter of interest.

The targeted maximum likelihood estimator of the distribution of the data is obtained by fluctuating an initial estimator of the relevant part of the data generating distribution with a parametric fluctuation model whose score at the initial estimator (i.e., at zero fluctuation) equals or includes the efficient influence curve of the parameter of interest, and estimating the fluctuation parameter (i.e., amount of fluctuation) with standard parametric maximum likelihood, treating the initial estimator as offset. Iteration of this targeted maximum likelihood modification step results in a so called k-th step targeted maximum likelihood estimator, and its limit in k solves the actual efficient influence curve equation defined by setting the empirical mean of the efficient influence curve equal to zero. The latter estimator we called the targeted maximum likelihood estimator, which also results in a corresponding plug-in targeted maximum likelihood estimator of the parameter of interest by applying the parameter mapping to the targeted maximum likelihood estimator.

This targeted maximum likelihood step using the fluctuation model removes bias of the initial estimator with respect to (w.r.t.) the target parameter, while increasing the variance of the estimator up to the level of the semi-parametric information bound, thereby resulting in a consistent, asymptotically linear, and semi-parametric (locally) efficient estimator.

Although in a variety of applications the fluctuation model is known (e.g., in randomized controlled trials with known treatment assignment and missingness mechanism), the fluctuation model typically depends on an unknown nuisance parameter, which then needs to be estimated as well. In censored data models satisfying the so called coarsening at random (CAR) assumption this nuisance parameter typically represents the censoring mechanism, and the density of the data factors into the relevant part of the density and the censoring mechanism density (e.g., Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)).

In this case, the bias reduction obtained at the targeted maximum likelihood step depends on how and how well we estimate the nuisance parameter. Specifically, the targeted maximum likelihood estimator is a so called double robust locally efficient estimator in censored data models (including causal inference models with the full data representing a collection of treatment regimen specific counterfactuals) in which the censoring mechanism satisfies the coarsening at random assumption. This means that, under regularity conditions, it is consistent and asymptotically linear if either the initial estimator or the nuisance parameter estimator is consistent, and it is efficient in the semiparametric model if both are consistent. Another approach for double robust locally efficient estimation is the estimating equation methodology (see van der Laan and Robins (2003), and the review below).

An outstanding open problem that obstructs the robust practical application of double robust estimators (in particular, in nonparametric censored data or causal inference models) is the selection of a sensible model or estimator of the nuisance parameter: this is particularly true when the efficient influence curve estimating equation involves inverse probability of censoring or treatment weighting, due to the enormous sensitivity of the estimator of the parameter of interest to the estimator of the nuisance parameter. A relevant recent discussion of these issues is found in Kang and Schafer (2007a), Ridgeway and McCaffrey (2007), Robins et al. (2007), Tan (2007), Tsiatis and Davidian (2007), Kang and Schafer (2007b).

Given an initial estimator, we are concerned with constructing an estimator of the nuisance parameter that results in a better bias-variance trade-off (i.e., better MSE) for the resulting targeted maximum likelihood estimator of the target parameter than current practice. In this article we introduce a new strategy for nuisance parameter estimator selection for targeted maximum likelihood estimators that addresses this challenge by using the log-likelihood of the targeted maximum likelihood estimator (of the relevant density) indexed by the nuisance parameter estimator as the principal selection criterion. The nuisance parameter estimators needed for the targeting step are selected based on the relevant log-likelihood loss function of the resulting targeted maximum likelihood estimator, not on a loss function for the nuisance parameter itself. This approach takes into account the established fit of the initial estimator, and it ensures that the resulting estimator of the target parameter is indeed based on the relevant part of the likelihood.

Recognizing that the selected estimator of the nuisance parameter is very much a function of the goodness of fit of the initial estimator led to the development of a new theory of collaborative double robust estimation. The asymptotic linearity theory presented below involves characterizing a true minimal nuisance parameter indexed by the initial estimator limit, g0(Q), that results in an efficient influence curve that is unbiased for the target parameter. This defines the collaborative double robustness of the efficient influence curve. Given a nested sequence of increasingly non-parametric estimators, gδ, there is a gδmin corresponding to g0(Q) which makes the efficient influence curve unbiased for the target parameter. In addition, all estimates of g in the sequence that are more nonparametric than the estimator indexed by δmin, i.e. δ > δmin, also make the efficient influence curve unbiased for the target parameter. These results allow us to establish asymptotic linearity of the collaborative double robust targeted maximum likelihood estimator, under appropriate regularity conditions.

The theory is fascinating and results in potentially super efficient estimators. The main intuition, in the context of CAR-censored data models, is that the covariates that enter the treatment mechanism and censoring mechanism estimator (i.e., the nuisance parameter estimator) used to define the fluctuation model in the targeted maximum likelihood step should explain the difference between the initial estimator, Qn, and the true relevant density, Q0.

Such collaborative double robust estimators involve a variety of choices, including the choice of initial estimator and the choice of collaborative nuisance parameter estimator, but all solve the efficient influence curve equation and all rely on the collaborative nuisance parameter estimator being correctly specified, so that the desired unbiasedness of the efficient influence curve is achieved in the limit. We propose using cross-validation w.r.t. a targeted loss function to select among these different collaborative targeted maximum likelihood estimators of the relevant density. In addition, we suggest the square of their influence curve or the square of the efficient influence curve as a particularly suitable loss function, corresponding to selection of the estimator with minimal asymptotic variance.

An overview of relevant literature

The construction of efficient estimators of pathwise differentiable parameters in semi-parametric models requires utilizing the so called efficient influence curve, defined as the canonical gradient of the pathwise derivative of the parameter. A fundamental result of the efficiency theory is that a regular estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve. We refer to Bickel et al. (1997), and Andersen et al. (1993). There are two distinct approaches for construction of efficient (or locally efficient) estimators: the estimating equation approach that uses the efficient influence curve as an estimating equation (e.g., one-step estimators based on the Newton-Raphson algorithm in Bickel et al. (1997)), and the targeted MLE that uses the efficient influence curve to define a targeted fluctuation function, and maximizes the likelihood in that targeted direction.

The construction of locally efficient estimators in censored data models in which the censoring mechanism satisfies the so called coarsening at random assumption (Heitjan and Rubin (1991), Jacobsen and Keiding (1995), Gill et al. (1997)) has been a particular focus area. This also includes the theory for locally efficient estimation of causal effects under the sequential randomization assumption (SRA), since the causal inference data structure can be viewed as a missing data structure on the intervention-specific counterfactuals, and SRA implies the coarsening at random assumption on the missingness mechanism, while not implying any restriction on the data generating distribution.

Gill and Robins (2001) present an implicit construction of counterfactuals as a mapping from the observed data distribution, such that the observed data structure augmented with the counterfactuals satisfies the consistency assumption and the SRA. Yu and van der Laan (2002) provide a particular explicit construction of counterfactuals from the observed data structure in terms of quantile-quantile functions, satisfying the consistency assumption and SRA. These results show that, without loss of generality, one can view causal inference as a missing data structure estimation problem. Causal graphs make explicit the real assumptions needed to claim that these counterfactuals are actually the counterfactuals of interest.

Inverse probability of censoring weighted (IPCW) estimators were originally developed to correct for confounding-induced bias in causal effect estimation. Theory for IPCW estimation and augmented locally efficient IPCW-estimator based on estimating functions defined in terms of the orthogonal complement of the nuisance tangent space in CAR-censored data models (including the optimal estimating function implied by efficient influence curve) was originally developed in Robins (1993), Robins and Rotnitzky (1992). Many papers build on this framework (see van der Laan and Robins (2003) for a unified treatment of this estimating equation methodology, and references). In particular, double robust locally efficient augmented IPCW-estimators have been developed (Robins and Rotnitzky (2001b), Robins and Rotnitzky (2001a), Robins et al. (2000b), Robins (2000a), van der Laan and Robins (2003), Neugebauer and van der Laan (2005), Yu and van der Laan (2003)).

Causal inference for multiple time-point interventions under sequential randomization was first addressed by Robins in the eighties: e.g. Robins (1986), Robins (1989).

The popular propensity score methods to assess causal effects of single time point interventions (e.g., Rosenbaum and Rubin (1983), Sekhon (2008), Rubin (2006)) have no natural generalization to multiple time-point interventions and may be inefficient (and less robust) estimators for single time point interventions, relative to the locally efficient double robust estimators such as the augmented IPCW and the targeted MLE. One crucial ingredient of these proposed methods is that the propensity score is estimated in the absence of any knowledge of the outcomes.

Structural nested models and marginal structural models for single and multiple time point static treatment regimens were proposed by Robins as well: Robins (1997b), Robins (1997a), Robins (2000b). Many application papers on marginal structural models exist, involving the application of estimating equation methodology (IPCW and DR-IPCW): e.g., Hernan et al. (2000), Robins et al. (2000a), Bryan et al. (2003), Yu and van der Laan (2003). In van der Laan et al. (2005) history adjusted marginal structural models were proposed as a natural extension of marginal structural models, and it was shown that the latter also imply an individualized treatment rule of interest (a so called history adjusted statically optimal treatment regimen): see Petersen et al. (2005) for an application to the “when to switch” question in HIV research.

Murphy et al. (2001) present a nonparametric estimator for a mean under a dynamic treatment in an observational study. Structural nested models for modeling and estimating an optimal dynamic treatment were proposed by Murphy (2003), Robins (2003), Robins (2005a), Robins (2005b). Marginal structural models for a user-supplied set of dynamic treatment regimens were developed and proposed in van der Laan (2006), van der Laan and Petersen (2007) and, simultaneously and independently, in a technical report authored by Rotnitzky and co-workers (2006), and Robins et al. (2008). van der Laan and Petersen (2007) also includes a data analysis application of these models to assess the mean outcome under a rule that switches treatment when CD4-count drops below a cut-off, and the optimal cut-off is estimated as well. Another practical illustration in sequentially randomized trials of these marginal structural models for realistic individualized treatment rules is presented in Bembom and van der Laan (2007).

Unified loss based learning based on cross-validation was developed in van der Laan and Dudoit (2003), including construction of adaptive minimax estimators for infinite dimensional parameters of the full data distribution in CAR-censored data and causal inference models: see also van der Laan et al. (2006), van der Vaart et al. (2006), van der Laan et al. (2004), Dudoit and van der Laan (2005), Keleş et al. (2002), Sinisi and van der Laan (2004).

The oracle results for the cross-validation selector inspired a unified super learning methodology mapping a library of candidate estimators into a weighted combination with optimal cross-validated risk, thereby resulting in an estimator which either achieves the best possible parametric model rate of convergence up to a log n factor, or is asymptotically equivalent with the oracle selected estimator that selects the best set of weights for the given data set. These results rely on the assumption that the loss function is uniformly bounded and that the number of candidates in the library is polynomial in sample size (van der Laan et al. (2007), Polley and van der Laan (2009)).

The super learning methodology applied to a loss function for the G-computation formula factor, Q0, in causal inference, or the full-data distribution factor, Q0, of the observed data distribution in CAR-censored data models, provides substitution estimators of the target parameter ψ0. However, although these super learners of Q0 are optimal w.r.t. the dissimilarity with Q0 implied by the loss function, the corresponding substitution estimators will be overly biased for a smooth parameter mapping Ψ. This is due to the fact that cross-validation makes optimal choices w.r.t. the (global) loss-function specific dissimilarity, but the variance of Ψ(Q) is of smaller order than the variance of Q itself.

van der Laan and Rubin (2006) integrates the loss-based learning of Q0 into the locally efficient estimation of pathwise differentiable parameters, by enforcing the restriction in the loss-based learning that each candidate estimator of Q0 needs to be a targeted maximum likelihood estimator (thereby, in particular, enforcing each candidate estimator of Q0 to solve the efficient influence curve estimating equation). Another way to think about this is that each loss function L(Q) for Q0 has a corresponding targeted loss function L(Q*), with Q* the targeted MLE algorithm applied to the initial Q, and we apply the loss-based learning to the latter targeted version of the loss function L(Q). Rubin and van der Laan (2008) propose the square of the efficient influence curve as a valid and sensible loss function L(Q) for selection and estimation of Q0 in models in which g0 can be estimated consistently, such as in randomized controlled trials.

The implications of this targeted loss based learning are that Q0 is estimated optimally (maximally adaptive to the true Q0) w.r.t. the targeted loss function L(Q*) using the super learning methodology, and, due to the targeted MLE step, the resulting substitution estimator of ψ0 is now asymptotically linear as well if the targeted fluctuation function is estimated at a good enough rate (and only requiring adjustment by confounders not yet accounted for by the initial estimator: see the collaborative targeted MLE). Either way, bias reduction will occur as long as the censoring/treatment mechanism is estimated consistently. Targeted MLEs have been applied in a variety of estimation problems: Bembom et al. (2008), Bembom et al. (2009) (physical activity), Tuglus and van der Laan (2008) (biomarker analysis), Rosenblum et al. (2009) (AIDS), van der Laan (2008a) (case control studies), Rose and van der Laan (2008) (case control studies), Rose and van der Laan (2009) (matched case control studies), Moore and van der Laan (2009) (causal effect on time till event, allowing for right-censoring), van der Laan (2008b) (adaptive designs, and multiple time point interventions), Moore and van der Laan (2007) (randomized trials with binary outcome). We refer to van der Laan et al. (September, 2009) for collective readings on targeted maximum likelihood estimation.

1.1. Advantages of TMLE relative to augmented IPCW estimating function methodology

Even though the augmented IPCW-estimator is also double robust, targeted maximum likelihood estimation has the following important advantages relative to estimating equation methods such as the augmented-IPCW estimator: 1) the TMLE is a substitution estimator and thereby respects global constraints of the model, such as the fact that one might be estimating a probability in [0, 1] or a (monotone) survival function at a finite set of points; 2) since, given an initial estimator, the targeted MLE step involves maximizing the likelihood along a smooth parametric targeted fluctuation model, it does not suffer from multiple solutions of a (possibly non-smooth in the parameter) estimating equation; 3) the TMLE does not require that the efficient influence curve can be represented as an estimating function in the target parameter, and thereby applies to all pathwise differentiable parameters; 4) it can use the cross-validated log-likelihood (of the targeted maximum likelihood estimator), or any other cross-validated risk of an appropriate loss function for the relevant factor Q0 of the density of the data, as the principal criterion to select among different targeted maximum likelihood estimators indexed by different initial estimators or targeted maximum likelihood steps.

The latter allows fine-tuning of the initial estimator of Q0 as well as fine-tuning of the estimation of the unknowns (e.g., the censoring/treatment mechanism g0) of the fluctuation function applied in the targeted maximum likelihood step, thereby utilizing the excellent theoretical and practical properties of the loss-function specific cross-validation selector. In particular, this property results in a collaborative double robust, and possibly super efficient, TMLE, as introduced and studied in this article, thereby adding theoretical and practical properties that go beyond double robustness and efficiency. In contrast, the augmented-IPCW estimator cannot be evaluated based on a loss function for Q0 alone: the augmented-IPCW estimator is not a substitution estimator Ψ(Qn) for some estimator Qn of Q0, as is the TMLE. Instead the augmented-IPCW estimator ψn is a certain function of an initial Qn and gn, where the performance of gn is scored based on the g0-factor of the log-likelihood, and a good fit for g0 can still result in a bad fit for ψ0. In trying to address these shortcomings of the augmented IPCW-estimators we converged to the targeted MLE and its subsequent refinement, the collaborative targeted MLE.

1.2. Organization of article

In Section 2 we present a description of the two stage collaborative targeted maximum likelihood methodology, the first stage representing the initial estimator, and the second stage representing the construction of a sequence of targeted maximum likelihood estimators indexed by increasingly nonparametric nuisance parameter estimators, together with log-likelihood based cross-validation to select among the TMLEs and thereby select the nuisance parameter estimator. The second stage of the C-DR-TMLE can be viewed as a mapping from an initial estimator of the relevant density into a particular estimator of the nuisance parameter needed in the fluctuation function, and a corresponding targeted maximum likelihood estimator using this nuisance parameter estimator in the targeted maximum likelihood step. We also provide the rationale for the consistency of this C-DR-TML estimator under the collaborative double robustness assumption, relying on the earlier established oracle property of the log-likelihood-based cross-validation selector, which itself relies on the assumption that the log-likelihood loss function is uniformly bounded.

In Section 3 we define and study collaborative double robustness of the efficient influence curve. In particular, we define true nuisance parameters, depending on a choice of relevant density (i.e., the limit of the initial estimator), which make the efficient influence curve an unbiased function for the target parameter. A collaborative targeted maximum likelihood estimator solves the efficient influence curve equation and relies on the nuisance parameter estimator to consistently estimate this true initial-estimator-specific nuisance parameter, or a more nonparametric nuisance parameter. We also discuss alternative collaborative nuisance parameter estimators that can be used in the targeted MLE or in estimating equation methodology.

In Section 4 we prove an asymptotic linearity theorem for such collaborative double robust estimators, such as the collaborative double robust targeted maximum likelihood estimator, and discuss the conditions and implications of this theorem. In particular, this theorem provides us with influence curve based confidence intervals and tests of null hypotheses. A study of the influence curve teaches us that the C-DR-TMLE can be super efficient.

In Section 5 we consider targeted loss functions that can be used to select among different C-DR-TMLEs indexed by different initial estimators and choices of nuisance parameter estimator. These targeted loss functions can also be used to build the candidate nuisance parameter estimators within a C-DR-TMLE, and thereby to construct the sequence of corresponding candidate targeted maximum likelihood estimators in the collaborative targeted maximum likelihood algorithm. Even though we enforce the use of a log-likelihood-based cross-validation selector to select among these candidate targeted maximum likelihood estimators in the C-DR-TMLE algorithm, we propose a penalized log-likelihood loss function that is more targeted towards the target parameter in case the target parameter is borderline identifiable. This penalty is particularly important to robustify the estimation procedure in situations in which the variance of the efficient influence curve easily blows up to infinity for certain realizations of the nuisance parameter estimator (e.g., inverse weighting by probabilities close to zero).

In Section 6 we consider estimation of a causal effect in a marginal structural model, and define the collaborative double robust targeted penalized maximum likelihood estimator of the unknown parameters of the marginal structural model. In the companion paper we present a simulation study and data analysis for the C-DR-TMLE of the causal effect E(Y(1) − Y(0)) of a binary treatment A, adjusting for baseline confounders W, based on observing n i.i.d. copies of a time-ordered data structure (W, A, Y = Y(A)).

We end this article with a discussion in Section 7, which provides a global overview.

1.3. An example to keep in mind

Although the methodology is completely general, throughout the paper we ground the discussion by referring to the following example, estimation of the additive causal effect of a binary treatment on an outcome. This example is rich enough to illustrate the ideas and methods, and has been used intensively in the causal inference literature. In this subsection we provide the notation, and objects required to define the C-DR-TMLE.

Let O = (W, A, Y = Y(A)) ~ P0 be an observed missing data structure on the full data structure X = (W, Y(0), Y(1)) with binary missingness variable A ∈ {0, 1}. For concreteness, we consider the case that Y is binary. Suppose the model for P0 is nonparametric, that the missingness mechanism g0(1 | X) = P0(A = 1 | X) = P0(A = 1 | W) satisfies the coarsening at random assumption, and that our target parameter is the causal additive risk

ψ0 = ΨF(Q0) = E0[E0(Y | A = 1, W) − E0(Y | A = 0, W)],

where Q0 = (Q01, Q02) denotes the marginal distribution of W and the conditional distribution of Y, given (A, W), respectively. For notational convenience, we will suppress the F from “Full Data Parameter” in ΨF. We note that dP0(O) = Q0(O)g0(A | X) = Q01(W)Q02(Y | A, W)g0(A | X).

The efficient influence curve of Ψ at dP0 = Q0g0 is given by

D*(Q0, g0)(O) = hg0(A, W)(Y − Q̄0(A, W)) + Q̄0(1, W) − Q̄0(0, W) − Ψ(Q0),

where Q̄0(A, W) = EQ0(Y | A, W) and hg0(A, W) = A/g0(1 | W) − (1 − A)/g0(0 | W). We note that hg0 also plays the role of the clever covariate in the targeted maximum likelihood fluctuation of the conditional distribution of Y, given (A, W): log(Q̄(ε)/(1 − Q̄(ε))) = log(Q̄/(1 − Q̄)) + εhg0.
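To see how this fluctuation, combined with a correctly specified g, removes bias for ψ0 even when the initial Q̄ is misspecified, it helps to record the following standard expansion, sketched here for this example under the simplifying assumption that the marginal distribution of W is known (in practice it is estimated by its empirical distribution):

\[
P_0 D^*(Q,g) \;=\; \psi_0 - \Psi(Q) \;+\; R(Q,g),
\]
\[
R(Q,g) \;=\; E_0\!\left[\left(\frac{g_0(1\mid W)}{g(1\mid W)}-1\right)\big(\bar Q_0(1,W)-\bar Q(1,W)\big)\right]
- E_0\!\left[\left(\frac{g_0(0\mid W)}{g(0\mid W)}-1\right)\big(\bar Q_0(0,W)-\bar Q(0,W)\big)\right].
\]

The remainder R(Q, g) vanishes if either Q̄ = Q̄0 or g = g0 (the double robustness of D*), and it is also small whenever g adjusts for the covariates that explain the residual bias Q̄0 − Q̄, which is the intuition behind the collaborative double robustness developed in this article.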

We also note that an alternative representation of the efficient influence curve is given by the augmented IPCW-representation:

D*(Q0, g0) = DIPCW(g0, ψ0) − DCAR(Q0, g0),

where DIPCW(g0, ψ0) = (A/g0(1 | W) − (1 − A)/g0(0 | W))Y − Ψ(Q0) is the IPCW-estimating function, and DCAR(Q0, g0) is its projection onto TCAR, defined as the sub-Hilbert space of L²0(P0) consisting of all functions of (A, W) with conditional mean zero, given W. Here L²0(P0) is the Hilbert space of mean zero functions of O endowed with inner product ⟨h1, h2⟩P0 = EP0 h1(O)h2(O).
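For this example both terms can be written out explicitly; the following identities are a short direct calculation under the definitions above:

\[
D_{IPCW}(g_0,\psi_0)(O) = \left(\frac{A}{g_0(1\mid W)}-\frac{1-A}{g_0(0\mid W)}\right)Y - \psi_0,
\]
\[
D_{CAR}(Q_0,g_0)(O) = \left(\frac{A}{g_0(1\mid W)}-\frac{1-A}{g_0(0\mid W)}\right)\bar Q_0(A,W) - \big(\bar Q_0(1,W)-\bar Q_0(0,W)\big).
\]

One checks that E0(DCAR | W) = 0, so that DCAR is indeed an element of TCAR, and that DIPCW − DCAR reproduces the efficient influence curve displayed earlier.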

2. Collaborative double robust targeted maximum likelihood estimators

We will describe the proposed collaborative double robust targeted maximum likelihood estimators in the context of censored data models, but the generalization to general semi-parametric models is immediate. We first review targeted maximum likelihood estimation and loss-based cross-validation in order to provide a foundation for the explanation of C-DR-TMLE.

2.1. Targeted MLE in CAR-censored data model

Let O = Φ(C, X) be a censored data structure on a full data random variable X, where C denotes the censoring variable. We assume coarsening at random, so that the observed data structure O ~ P0 has a probability distribution whose density w.r.t. an appropriate dominating measure factors as dP0(O) = Q0(O)g0(O | X), where Q0 is the part of the distribution of X that is identifiable, and g0 denotes the conditional probability distribution of O, given X, which we often refer to as the censoring mechanism. By CAR, we have g0(O | X) = h(O) for some measurable function h. If C is observed itself, then g0 denotes the conditional distribution of C, given X.

A semiparametric model ℳ for the probability distribution P0 of the observed data structure O is implied by a model 𝒬 for the full-data distribution factor Q0, and a model 𝒢 for the censoring mechanism g0. The conditional distribution of O, given X, is identified by the conditional distribution of C, given X. For notational convenience, we will denote both with g0. Let O1, . . . , On be n independent and identically distributed (i.i.d.) observations of the experimental unit O with probability distribution P0 ∈ ℳ. Let Pn be the empirical probability distribution of O1, . . . , On, which puts mass 1/n on each of the n observations.

Let Ψ : ℳ → IRd be a d-dimensional parameter that is pathwise differentiable at each P ∈ ℳ (w.r.t. a class of finite dimensional paths through P) with canonical gradient D*(P): i.e., for a rich class of parametric submodels {P(δ) : δ} ⊂ ℳ through P at δ = 0 with score S ∈ L²0(P), with L²0(P) being the Hilbert space of mean zero functions of O endowed with inner product ⟨h1, h2⟩P = EPh1(O)h2(O) (i.e., the covariance operator), we have

d/dδ Ψ(P(δ)) |δ=0 = EP D*(P)(O)S(O) = ⟨D*(P), S⟩P.

Because D*(P) is an element of the sub-Hilbert space of L²0(P) generated by all scores S of these parametric submodels (the so called tangent space), it is the canonical gradient, also called the efficient influence curve at P. Any D(P) such that EPD*(P)S = EPD(P)S for all scores S in the tangent space is called a gradient of the pathwise derivative. Thus the canonical gradient is the unique gradient that is an element of the tangent space. For the sake of illustration, it is assumed that Ψ(PQ,g) = ΨF(Q) for some ΨF: i.e., the parameter of interest is a parameter of the full data distribution of X. The efficient influence curve D*(P) at P with dP = Qg will also be denoted with D*(Q, g).

The Targeted Maximum Likelihood estimator indexed by initial (Q, g): Given any P ∈ ℳ with dP = Qg, let {P(ε) : ε} ⊂ ℳ be a submodel with finite dimensional parameter ε, dominated by P, through P at ε = 0, and whose scores at ε = 0 span a finite dimensional space within L²0(P) that includes the (components of the) efficient influence curve D*(P) = D*(Q, g). Because our parameter of interest is a parameter of Q0, and dP0 = Q0g0 factorizes, such a fluctuation model can be chosen to only fluctuate Q with a submodel {Qg(ε) : ε} ⊂ 𝒬, where this fluctuation model is indexed by g. Let dP(ε) = Qg(ε)g be such a fluctuation model with fluctuation parameter ε. In van der Laan and Rubin (2006) we also consider fluctuation models that vary both Q and g.

At a given (Q, g), one can now define a k-th step targeted maximum likelihood version Qg^k of Q0 as follows. Let L(Q) = − log Q be the log-likelihood loss. Firstly, let Qg^1 = Qg(εn^1), where

εn^1 = arg minε PnL(Qg(ε)) = arg maxε Pn log Qg(ε).

Here we use the notation Pf = ∫ f(o)dP(o). In general, Qg^k = Qg^(k−1)(εn^k), where

εn^k = arg maxε Pn log Qg^(k−1)(ε).

One iterates this updating till εn^k equals zero within a user-supplied precision. The final update is referred to as the (iterative) targeted maximum likelihood estimator Qg^*, indexed by the initial starting point (Q, g).

The Targeted Maximum Likelihood estimator indexed by initial estimator and estimator of nuisance parameter: The above procedure, applied to an initial estimator Qn of Q0 and an estimator gn of g0, defines the k-th step targeted maximum likelihood estimator and its limit in k, Q*n, as introduced and analyzed in van der Laan and Rubin (2006). By definition, the targeted maximum likelihood estimator (Q*n, gn) solves the efficient influence curve equation:

0 = PnD*(Q*n, gn).

Remark: Cross-validated initial estimator in the targeted MLE. If the initial estimator is an overfit, then the bias reduction of the targeted MLE algorithm is not as effective. To protect against such cases one can use a cross-validated initial estimator. Specifically, let Bn ∈ {0, 1}n be a random variable that splits the sample into a training sample {i : Bn(i) = 0} and a validation sample {i : Bn(i) = 1}, and let P0n,Bn and P1n,Bn denote the empirical distribution of the training and validation sample, respectively. The above targeted MLE iterative algorithm is now given by Qn^k = Qn^(k−1)(εn^k), where

εn^k = arg maxε EBn P1n,Bn log Qn^(k−1)(P0n,Bn)(ε).

2.2. Loss-based cross-validation to select among (collaborative) targeted maximum likelihood estimators

Consider a loss function L*(Q) for Q0 that satisfies

Q0 = arg minQ P0L*(Q).

Or, more precisely, we only require that Ψ(arg minQ P0L*(Q)) = Ψ(Q0). An example of such a loss function is the log-likelihood L*(Q)(O) = L(Q)(O) = − log Q(O). Each loss function has a corresponding dissimilarity d(Q, Q0) = P0{L*(Q) − L*(Q0)}.
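For instance, with the log-likelihood loss the dissimilarity is a Kullback-Leibler type divergence:

\[
d(Q,Q_0) \;=\; P_0\{\log Q_0 - \log Q\} \;=\; \int \log\frac{Q_0(o)}{Q(o)}\,dP_0(o),
\]

which is nonnegative and equals zero (essentially) only if Q = Q0 on the support of P0, so that minimizing this dissimilarity indeed identifies Q0, and thereby Ψ(Q0).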

Given different targeted maximum likelihood estimators Q̂*k, k = 1, . . . , K, of Q0, for example indexed by different initial estimators, we can use a preferred loss-function based cross-validation to select among them. Specifically, let Bn ∈ {0, 1}n be a random variable that splits the sample into a training sample {i : Bn(i) = 0} and a validation sample {i : Bn(i) = 1}, and let P0n,Bn and P1n,Bn denote the empirical distribution of the training and validation sample, respectively. The loss-function based cross-validation selector is now defined by

kn = arg mink EBn P1n,Bn L*(Q̂*k(P0n,Bn)).

The resulting targeted maximum likelihood estimator is then given by

Q̂*kn(Pn).
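Concretely, the selector above can be computed with ordinary V-fold cross-validation. The following minimal Python sketch illustrates this; the helper names fit_tmle_candidate (mapping a training sample and a candidate index into a targeted maximum likelihood fit) and loss (evaluating L* on a validation sample) are hypothetical placeholders, not functions defined in this paper.

import numpy as np

def cv_select_tmle(data, candidates, fit_tmle_candidate, loss, V=10, seed=1):
    """Select among candidate TMLEs by V-fold cross-validated risk.

    data must support boolean row indexing (e.g., a NumPy array).
    fit_tmle_candidate(train, k) returns the k-th candidate TMLE fit on the
    training sample P0n,Bn; loss(fit, valid) returns the vector of loss values
    L*(Q)(O_i) over the validation sample P1n,Bn.
    """
    n = len(data)
    folds = np.random.default_rng(seed).integers(0, V, size=n)  # split vector Bn
    cv_risk = np.zeros(len(candidates))
    for v in range(V):
        train, valid = data[folds != v], data[folds == v]
        for j, k in enumerate(candidates):
            fit = fit_tmle_candidate(train, k)       # train the candidate
            cv_risk[j] += loss(fit, valid).sum()     # accumulate validation loss
    cv_risk /= n
    return candidates[int(np.argmin(cv_risk))], cv_risk  # cross-validation selector kn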

Cross-validation selector: Consider a preferred loss function that satisfies

supQ P0{L*(Q) − L*(Q0)}² / P0{L*(Q) − L*(Q0)} ≤ M2 < ∞,
(1)

and that is uniformly bounded,

supQ supO | L*(Q)(O) | ≤ M1 < ∞,

where the supremum is over the support of P0, and over all possible candidate estimators of Q0 that will ever be considered. Property (1) holds for the log-likelihood loss function and for weighted squared-residual (L2) loss functions, among others, and it is in essence equivalent with stating that the loss-function based dissimilarity d(Q, Q0) = P0{L*(Q) − L*(Q0)} is quadratic in a distance between Q and Q0 (see van der Laan and Dudoit (2003)). If this property does not hold for the loss function, the rates 1/n for the second order terms in the oracle inequality stated below reduce to the rate 1/√n.

For such loss functions, the cross-validation selector satisfies the following (so called) oracle inequality: for any δ > 0,

EBn d(Q̂kn(P0n,Bn), Q0) ≤ (1 + 2δ) EBn mink d(Q̂k(P0n,Bn), Q0) + C(M1, M2, δ) (1 + log K(n))/(np),

where np denotes the size of the validation sample, K(n) the number of candidate estimators, and the constant C(M1, M2, δ) = 2(1 + δ)²(M1/3 + M2/δ) (see page 25 of van der Laan and Dudoit (2003)). This result proves (see van der Laan and Dudoit (2003) for the precise statement of these implications) that, if the number of candidates K(n) is polynomial in sample size, then the cross-validation selector is either asymptotically equivalent with the oracle selector (based on samples of the size of the training samples, as defined on the right-hand side of the above inequality), or it achieves the parametric rate log n/n for convergence w.r.t. d(Q, Q0) ≡ P0{L(Q) − L(Q0)}. So in most realistic scenarios, in which none of the candidate estimators achieves the rate of convergence one would have with an a priori correctly specified parametric model, the cross-validation selector performs asymptotically exactly as well (up to the constant) as the oracle selector. These oracle results have been generalized to estimated loss functions L*n(Q) that approximate a fixed loss function L*(Q); in that case the oracle inequality also contains second order terms due to the estimation of the loss function.

This preferred loss function based cross-validation can now be used to select among different candidate targeted maximum likelihood estimators indexed by different initial estimators, and possibly different censoring mechanism estimators. Specifically, we will use a preferred targeted loss function to select among different collaborative targeted maximum likelihood estimators, which are just special targeted maximum likelihood estimators in the sense that gn is estimated in collaboration with the initial Qn.

For a given loss function L(Q) and an estimator Q(Pn), we will refer to PnL(Q(Pn)) as the entropy of the fit Q(Pn). Similarly, for a loss function L1(g) of g0 and an estimator ĝ(Pn), we will refer to PnL1(ĝ(Pn)) as the entropy of ĝ(Pn). Both the preferred loss function for Q0 and this loss function L1 for g0 represent important choices. For example, one would like to select the loss function L1 so that the dissimilarity P0{L1(g) − L1(g0)} strongly reflects how well g approximates the optimal fluctuation function implied by g0. In other words, we need to keep in mind how g is used, namely that it is used to fit the desired fluctuation function implied by g0. For example, if the clever covariate defining the fluctuation function is given by A − E(A/σ2(A, W) | W)/E(1/σ2(A, W) | W), as in the semiparametric regression model E(Y | A, W) − E(Y | A = 0, W) = βA, one might want to define as loss function L(θ1(g), θ2(g)) = w1(A/σ2(A, W) − θ1(W))2 + w2(1/σ2(A, W) − θ2(W))2, for weight functions w1, w2 (functions of W), and θ1, θ2 representing the numerator and denominator of the conditional expectations in the clever covariate. Similarly, the preferred loss function for Q0 can be tuned to represent a dissimilarity d(Q, Q0) that strongly reflects how well Ψ(Q) approximates Ψ(Q0). We discuss such choices in more detail in a later section.

2.3. Building a collaborative estimator of censoring mechanism/nuisance parameter

A C-TMLE is constructed by building a family of candidate estimators and then choosing the best among them, using cross-validation to drive the choice towards Q0. However, we also rely upon a loss function when building each candidate nuisance parameter (e.g., censoring mechanism) estimator, and it is not necessary that these two loss functions be the same. In fact, as part of building a collaborative nuisance parameter estimator in the collaborative targeted MLE procedure, we couple an increase in the log-likelihood entropy of the targeted maximum likelihood estimator with an increase in the g0-loss-function specific entropy of the corresponding nuisance parameter estimator. In this manner, we arrange that, for increasing sample size, the cross-validation selector will be driven towards the selection of a targeted maximum likelihood estimator with an initial estimator closer to Q0 and, simultaneously, a more and more nonparametric estimator of g0 (thereby achieving the full desired bias reduction in the limit).

That is, given a collection of candidate estimators of g0, ordered by empirical fit w.r.t. a loss function for g0 such as the log-likelihood, we will build a sequence of targeted maximum likelihood estimators of Q0 ordered by log-likelihood entropy and indexed by increasingly nonparametric estimators of g0, where the extent of being nonparametric is measured by the L1-entropy. Subsequently, we use the cross-validated log-likelihood for Q0 to choose among these candidate targeted maximum likelihood estimators.

There are many possible approaches for constructing such an ordered sequence of targeted maximum likelihood estimators in which each next element in the sequence has both a higher Q0-loss entropy and a higher g0-loss entropy for its corresponding censoring mechanism estimator. Of course, the strict ordering is not what drives the properties of the resulting estimator, but the sequence should be approximately monotone in the log-likelihood entropy for Q0 and the L1-entropy for g0.

This procedure represents one particular approach for constructing a targeted maximum likelihood estimator that uses a collaboratively estimated nuisance parameter. We refer to any algorithm that maps into a targeted maximum likelihood estimator that uses a collaborative nuisance parameter estimator (relative to the Q-estimator), as a collaborative targeted maximum likelihood estimator.

2.4. A template for collaborative targeted MLEs

We present the following template providing a class of collaborative targeted maximum likelihood estimators.

Initial estimator of Q0: Build an estimator Qn of Q0, such as a super learner based on the log-likelihood loss function L(Q), or any other loss function.

Preferred loss function for Q0: Let L*(Q) be a (targeted) loss function for Q0. We note that the loss function can also be data dependent, and, in particular, the choice of loss function can depend on an initial estimator Qn of Q0, and corresponding collaborative estimator gn (see DR-IPCW loss functions in van der Laan and Dudoit (2003), and our section on targeted loss functions).

Loss function for g0: Let L1(g) be a loss function for g0.

Candidate estimators of censoring mechanism/nuisance parameter: For each δ in an index set, let gn,δ be a candidate estimator of g0. Let d(δ) = PnL1(gn,δ) denote the entropy of gn,δ, thereby measuring how data adaptive gn,δ is; for a maximal value of d(δ), or for d(δ) approximating a maximum value, gn,δ is actually a consistent estimator of g0.

Select ordered sequence of entropies for censoring mechanism (nuisance parameter) estimator

Select a sequence d0 > d1 > . . . > dK.

Select initial targeted maximum likelihood estimator: We start out with a censoring mechanism estimator gn,0 with entropy larger than d0 and a corresponding targeted maximum likelihood estimator Q*n,0 applied to the initial estimator Qn. We refer to the pair (Q*n,0, gn,0) as the initial targeted maximum likelihood estimator in the sequence of targeted maximum likelihood estimators that will be constructed below.

Construct next targeted maximum likelihood estimator in sequence

We are given a current initial estimator Qn,k, and a current targeted maximum likelihood estimator (Q*n,k, gn,k) in our sequence of targeted maximum likelihood estimators, with Q*n,k being the targeted maximum likelihood estimator applied to the current initial estimator Qn,k and nuisance parameter estimator gn,k. The current nuisance parameter estimator gn,k has entropy larger than dk. We are also given k, and thereby two corresponding entropy values dk > dk+1. (We note that the initial estimator does not get updated at each step k, but it corresponds with one of the elements in the current sequence of targeted maximum likelihood estimators.)

Consider an algorithm that searches among a specified set of candidate estimators gn,δ with {δ : dk > d(δ) > dk+1}, with the goal of minimizing the preferred-loss (L*) fit of the targeted maximum likelihood estimator Q*(Qn,k, gn,δ) applied to the initial estimator Qn,k:

δn = arg minδ: dk > d(δ) > dk+1 PnL*(Q*(Qn,k, gn,δ)).
(2)

Recall that Q*(Qn,k, gn,δ) denotes the targeted maximum likelihood estimator that uses the optimal fluctuation model identified by censoring mechanism estimator gn,δ, applied to the initial estimator Qn,k. Let gn,δn denote the selected estimator. If either the fit is improved relative to the current targeted MLE Q*n,k,

PnL*(Q*(Qn,k, gn,δn)) < PnL*(Q*n,k),

or the above holds for the log-likelihood loss function L(Q) = − log Q on which the targeted maximum likelihood algorithm operates, then we accept δn, and thereby the next targeted maximum likelihood estimator, Q*n,k+1 = Q*(Qn,k, gn,δn) with gn,k+1 = gn,δn, in the sequence we are constructing. The algorithm has now delivered its next, (k + 1)-th, targeted maximum likelihood estimator. We set k = k + 1, keep the initial estimator unchanged, and the current targeted maximum likelihood estimator (Q*n,k+1, gn,k+1) is now the updated one.

If this monotonicity condition fails to hold for both the log-likelihood fit and the preferred loss function fit, then we reject this δn, and update the initial estimator by setting it equal to the current targeted maximum likelihood estimator Q*n,k. We now rerun the above procedure with initial estimator Q*n,k, and the same dk > dk+1. This time the resulting δn will always be accepted, since the log-likelihood fit of a targeted maximum likelihood estimator (a maximum likelihood fluctuation of an initial estimator) is larger than the log-likelihood fit of its initial estimator. So the algorithm now delivers the next, (k + 1)-th, targeted maximum likelihood estimator (Q*n,k+1, gn,k+1) in its sequence. We set k = k + 1, the initial estimator is now set at the previous targeted MLE Q*n,k, and the current targeted maximum likelihood estimator (Q*n,k+1, gn,k+1) is now the updated one (the last one in the sequence so far).

k-th step collaborative targeted maximum likelihood estimator:

The above algorithm maps a running current initial estimator and a current targeted MLE (Q*n,k, gn,k) (the last one constructed in the current sequence) into a new targeted MLE (Q*n,k+1, gn,k+1), and a possibly updated current initial estimator. We start this algorithm with k = 0 and iterate it. This defines the k-th step collaborative targeted maximum likelihood estimator (Q*n,k, gn,k), k = 0, 1, 2, . . . , K.

We are guaranteed that the fit of Q*n,k+1 is either increasing w.r.t. the preferred loss function (most likely, since that is the loss we minimize at each step), or increasing w.r.t. the log-likelihood loss used to define the targeted maximum likelihood step, relative to the previous targeted maximum likelihood estimator Q*n,k. In addition, the corresponding gn,k+1 has an L1-fit that is larger than the L1-fit of gn,k. At every step in which the initial estimator is updated, we also know that the log-likelihood fit is increasing.

Cross-validation to select number of iterations k in k-th step C-TMLE:

Given this sequence of k-th step collaborative targeted maximum likelihood estimators Q*n,k, using censoring mechanism estimators gn,k, it remains to select k, k = 0, 1, . . . , K.

We select k based on the cross-validated log-likelihood:

kn = arg maxk EBn P1n,Bn log Q*n,k(P0n,Bn),

where the random vector Bn ∈ {0, 1}n denotes a cross-validation scheme such as V-fold cross-validation, P0n,Bn and P1n,Bn are the empirical probability distributions of the training sample {i : Bn(i) = 0} and validation sample {i : Bn(i) = 1}, respectively, as identified by the split vector Bn, and Q*n,k(P0n,Bn) denotes the k-th step collaborative targeted maximum likelihood estimator applied to the training sample.

This finalizes the mapping from the initial estimator Qn, and the data, into a collaborative estimator gn = gn,kn of the censoring mechanism. We refer to Q*n = Q*n,kn, paired with the collaborative estimator gn, as the collaborative targeted maximum likelihood estimator of Q0.

The Collaborative (Double Robust) Targeted Maximum Likelihood Estimator: The corresponding targeted maximum likelihood estimator of ψ0 = ΨF (Q0) is given by the substitution estimator

ψ*n = ΨF(Q*n).

We refer to this estimator as the collaborative (double robust) targeted maximum likelihood estimator (C-DR-TMLE or C-TMLE) of ψ0, and we recall that it is paired with a collaborative estimator gn.

C-TMLE solves an efficient influence curve equation: Since the C-TMLE Q*n is a targeted maximum likelihood estimator, obtained by applying the fluctuation function with censoring mechanism estimator gn to the corresponding initial estimator, it solves the efficient influence curve equation:

0 = PnD*(Q*n, gn).

This is a fundamental property of the collaborative targeted MLEs, driving the targeted bias reduction w.r.t. the target parameter of interest, ψ0.

Selection among candidate C-TMLEs: The collaborative targeted maximum likelihood estimator depends on a choice of initial estimator Qn, and on choices that concern the second stage. As a consequence, one might have a set of collaborative targeted maximum likelihood estimators (Q*nj, gnj) indexed by such choices, j = 1, . . . , J. We can now select among these estimators based on loss-based cross-validation using the preferred loss function L* for Q0.

Selection based on empirical efficiency maximization: Since, under the regularity conditions of our asymptotic linearity theorem, each j-specific C-TMLE is asymptotically linear with influence curve ICj (equal to the efficient influence curve D* evaluated at the limits of (Q*nj, gnj), plus a contribution from gnj), we can select j as the minimizer of a (cross-validated) estimate of the variance of ICj, or, if ψ0 has dimension larger than 1, we can minimize an estimate of the variance of a function of ψ0. One could here ignore the contribution from gnj and thus use the cross-validated or empirical variance of the efficient influence curve at the collaborative targeted maximum likelihood estimator:

jn = arg minj Pn{D*(Q*nj, gnj)}².

Generalization. The above C-TMLE can also be called the collaborative minimum loss estimator. The loss function L(Q) needs to satisfy that the derivative of ε ↦ L(Qg(ε)) at ε = 0, for a suitably constructed path {Qg(ε) : ε}, equals the efficient influence curve D*(Q, g), where the efficient influence curve at P only depends on Q(P) and g(P), while the target parameter Ψ(P) = ΨF(Q(P)) depends on P only through Q(P). No further structure (such as a factorization dP = Qg, or a CAR-censored data structure) is needed for the above template.
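To make the template concrete for the additive causal effect example with binary Y, the following Python sketch builds the sequence of targeted maximum likelihood estimators by greedy forward selection of covariates into a logistic propensity score model, using the number of selected covariates in place of the entropy thresholds d0 > d1 > . . . > dK and the Bernoulli log-likelihood of the updated Q as the preferred loss. All function and variable names are ours; this is one possible instantiation of the template, not the implementation used in the paper. The selection of k would then be carried out with a cross-validation selector such as the one sketched in Section 2.2.

import numpy as np
from sklearn.linear_model import LogisticRegression

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fluctuate(Q1, Q0, A, Y, g1):
    """Targeted MLE update for binary Y: maximize the Bernoulli log-likelihood
    over a one-dimensional fluctuation with clever covariate h_g and offset
    logit(Q(A,W)), via Newton-Raphson. Q1, Q0 must lie strictly in (0, 1)."""
    QA = np.where(A == 1, Q1, Q0)
    h = A / g1 - (1 - A) / (1 - g1)            # clever covariate at the observed A
    eps, offset = 0.0, logit(QA)
    for _ in range(50):                        # one-dimensional Newton-Raphson
        p = expit(offset + eps * h)
        score = np.sum(h * (Y - p))
        info = np.sum(h ** 2 * p * (1 - p))
        if info < 1e-12 or abs(score) < 1e-10:
            break
        eps += score / info
    return expit(logit(Q1) + eps / g1), expit(logit(Q0) - eps / (1 - g1))

def ctmle_sequence(W, A, Y, Q1, Q0):
    """Greedy construction of the C-TMLE sequence (Q*_{n,k}, g_{n,k})."""
    n, p = W.shape
    in_model, remaining, sequence = [], list(range(p)), []
    init_Q1, init_Q0 = Q1.copy(), Q0.copy()    # running initial estimator
    best_loss, reinitialized = np.inf, False
    while remaining:
        candidates = []
        for j in remaining:                    # more covariates = more nonparametric g
            cols = in_model + [j]
            ps = LogisticRegression(C=1e6, max_iter=1000).fit(W[:, cols], A)
            g1 = np.clip(ps.predict_proba(W[:, cols])[:, 1], 0.025, 0.975)
            Q1s, Q0s = fluctuate(init_Q1, init_Q0, A, Y, g1)
            QAs = np.where(A == 1, Q1s, Q0s)
            loss = -np.mean(Y * np.log(QAs) + (1 - Y) * np.log(1 - QAs))
            candidates.append((loss, j, Q1s, Q0s))
        loss, j, Q1s, Q0s = min(candidates, key=lambda c: c[0])
        if loss < best_loss or reinitialized or not sequence:
            in_model.append(j)
            remaining.remove(j)
            sequence.append((list(in_model), Q1s, Q0s))   # next TMLE in the sequence
            best_loss, reinitialized = loss, False
        else:
            # candidate rejected: the current TMLE becomes the new initial estimator
            _, init_Q1, init_Q0 = sequence[-1]
            reinitialized = True
    # k-th step estimate of psi_0 = E[Q(1,W) - Q(0,W)]; select k by cross-validation
    return [(cols, np.mean(Q1k - Q0k)) for cols, Q1k, Q0k in sequence]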

2.5. The rationale of the consistency of the collaborative-TMLE

The C-TMLE procedure starts with an initial estimator Qn of Q0. Suppose that the sequence constructed in the C-TMLE template consists of a finite number K of targeted maximum likelihood estimators Q*n,k, k = 1, . . . , K. By construction, the last targeted maximum likelihood estimator in this sequence uses a censoring mechanism estimator that is nonparametric (maximal g0-entropy): i.e., the nuisance parameter estimator gn,K as selected by the K-th step C-TMLE converges to the true g0. We also know that gn,k is increasingly nonparametric in k, k = 1, . . . , K.

For simplicity, we also assume that the k-th targeted maximum likelihood estimator in the sequence is obtained by applying the targeted maximum likelihood algorithm to the previous targeted maximum likelihood estimator in sequence. This is not necessary, since we can apply the argument to the subsequence for which that is true (the elements in the sequence at which the targeted maximum likelihood update is actually carried out), but it simplifies the presentation.

Consider the limits Q*k of the targeted maximum likelihood estimators Q*n,k in our sequence, where gk denotes the limit of gn,k, k = 1, . . . , K, and thus gK = g0. We also know that the log-likelihood fit Pn log Q*n,k is increasing in k, by the fact that each element in the sequence is a targeted maximum likelihood estimator applied to the previous element in the sequence (as initial estimator in the targeted MLE algorithm). Therefore, P0 log Q*k is non-decreasing in k. As discussed in the introduction, if log Q is uniformly bounded over all its candidates Q, then the cross-validation selector of k is asymptotically equivalent with the oracle selector that maximizes the true expected log-likelihood. For n large enough, this oracle selector behaves as k̃ = arg maxk P0 log Q*k, where this maximum might be non-unique. One maximum is obtained at k = K, giving Q*K, and we know that Ψ(Q*K) = ψ0, since Q*K is a targeted MLE that uses the true g0 = gK. So if k̃ = K, then the C-TMLE will be consistent for ψ0. Suppose that k̃ is actually smaller than K. Then we have, suppressing the g’s in the notation,

equation M121

We know that Q*k+1 is a T-MLE with Q*k as initial estimator. So the above equalities are only possible if Q*j+1 = Q*j for j = k, . . . , K – 1. Thus equation M122. Since equation M123 is a targeted MLE at nuisance parameter g0, it follows that equation M124 is maximized at ε = 0: compare, equation M125 is maximized at ε = 0 by definition of the T-MLE algorithm. Since we just showed that equation M126, it also follows now that

equation M127

is maximized at ε = 0. In particular, this means that the derivative at ε = 0 equals zero, giving us:

equation M128

However, the efficient influence curve typically satisfies that P0D*(Q, g0) = 0 implies Ψ(Q) = ψ0, which then implies equation M129. Thus equation M130 is consistent, and thereby equation M131 is consistent.

Figure 1 illustrates the collaborative nature of the construction of a sequence of increasingly data-adaptive nuisance parameter estimators, equation M132, and its relation to the performance of the initial estimator. We generated 5000 observations of O = (W, A, Y) from data generating distribution dP0 = Q0g0 defined as:

equation M133

where W1 through W5 are independent random variables ~ N(0, 1), Y = Q0(A, W) + ε, ε ~ N(0, 1), and g0 is the conditional density of A given confounding variables W = {W1, W2, W3, W4, W5}. We applied the C-TMLE to estimate the effect of binary treatment A on outcome Y, adjusting for W, defined as ψ0 = EW (E(Y | A = 1, W) – E(Y | A = 0, W)).
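The data generating distribution just described can be reproduced in outline as follows. Since the display specifying Q0 is not reproduced here, the particular forms of Q0 and g0 in this sketch are hypothetical stand-ins of the same general type (linear outcome regression, logistic treatment mechanism); only the sample size, the covariate distribution, and the error distribution follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 5))                      # W1, ..., W5 independent N(0, 1)

def g0(W):                                       # hypothetical P0(A = 1 | W)
    return 1 / (1 + np.exp(-(0.4 * W[:, 0] - 0.5 * W[:, 1] + 0.3 * W[:, 2])))

def Q0(A, W):                                    # hypothetical E0(Y | A, W)
    return A + 0.5 * W[:, 0] + 0.7 * W[:, 1] - 0.4 * W[:, 2]

A = rng.binomial(1, g0(W))
Y = Q0(A, W) + rng.normal(size=n)                # Y = Q0(A, W) + epsilon, epsilon ~ N(0, 1)

# Additive effect psi0 = E_W[Q0(1, W) - Q0(0, W)] (equal to 1 under this choice of Q0).
psi0 = np.mean(Q0(np.ones(n), W) - Q0(np.zeros(n), W))
```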

Figure 1:
Construction of a sequence of nuisance parameter estimators based on a poor initial fit of the density (top) and a good initial fit for the density (bottom). Kernel estimates of true densities Q0 and g0 are shown in gray.

A kernel density estimator was applied to Y and to the predicted values of two initial estimators of Q0 = E0(Y | A, W), which we denote with equation M134, and equation M135, respectively. These estimators were obtained with the D/S/A algorithm (Sinisi and van der Laan, 2004), a data-adaptive machine learning approach to model selection that was set to search over all second degree polynomials of size six. The kernel density estimates are displayed in plots on the left hand side of the figure.

In addition, we plotted the kernel density estimates of the predicted values of each set of the collaboratively-constructed candidate ĝ estimators, and we can compare them with the density of the true predictions g0(1 | W) = P0(A = 1 | W). These are plotted on the right hand side of the figure, overlaid with the density estimator applied to the true values g0(1 | W). When the initial fit of Q0 is poor, the nuisance parameter estimator equation M136 converges quickly to g0 in k, and the selected candidate estimator closely approximates g0. Plots in the bottom half of the figure show the behavior of the C-TMLE procedure when equation M137 is a good estimate of Q0. When the initial fit of Q0 is good, the nuisance parameter estimator grows slowly towards g0, and the selected candidate estimator corresponds to a treatment mechanism that adjusts for fewer covariates than the true treatment mechanism g0 used to generate the data.

2.6. Revisiting the additive causal effect example

Recall that the targeted maximum likelihood estimator applied to an estimator, Qn, of P(Y = 1 | A, W) is obtained by running a univariate logistic regression of Y, with the initial estimator as offset, on an estimate of the univariate clever covariate hg0(A, W) = A/g0(1 | W) – (1 – A)/g0(0 | W), implied by the treatment mechanism estimator, using an estimator gn of the treatment mechanism P(A = 1 | W).
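For concreteness, the following is a minimal sketch of this single fluctuation step for binary Y: a one-parameter logistic regression of Y on the clever covariate, with the logit of the initial fit as offset. The function and variable names are ours and the clipping constants are arbitrary; this is a sketch, not a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def tmle_update(y, a, q1, q0, g1):
    """One targeted MLE fluctuation: logistic regression of binary Y on the clever
    covariate h_g(A, W), with logit of the initial fit Qn(A, W) as offset.
    q1, q0: initial estimates of P(Y = 1 | A = 1, W) and P(Y = 1 | A = 0, W);
    g1: estimate of P(A = 1 | W)."""
    qa = np.clip(np.where(a == 1, q1, q0), 1e-6, 1 - 1e-6)
    h = a / g1 - (1 - a) / (1 - g1)              # clever covariate at the observed A
    offset = logit(qa)

    def negloglik(eps):
        p = np.clip(expit(offset + eps * h), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    eps = minimize_scalar(negloglik, bounds=(-10, 10), method="bounded").x

    # Updated fits under A = 1 and A = 0 (clever covariate evaluated in each arm).
    q1_star = expit(logit(np.clip(q1, 1e-6, 1 - 1e-6)) + eps / g1)
    q0_star = expit(logit(np.clip(q0, 1e-6, 1 - 1e-6)) - eps / (1 - g1))
    return q1_star, q0_star, eps
```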

The collaborative targeted maximum likelihood estimation procedure starts with computing equation M138, an initial estimator of P(Y = 1 | A, W) using super learning, and then collaboratively generating a sequence of targeted maximum likelihood estimators. These use increasingly nonparametric estimators of g0, applied to subsequent targeted maximum likelihood updates of the initial estimator (as needed to guarantee the monotonicity in fit). In this way the sequence of constructed targeted maximum likelihood estimators has increasing log-likelihood fit. The selection of the sequence of increasingly nonparametric treatment mechanism estimators was based on maximizing the fit of the corresponding targeted maximum likelihood estimators of P(Y = 1 | A, W), as outlined in our template, and is thus very much driven by the outcome data. Likelihood-based cross-validation selects the desired targeted maximum likelihood estimator, with its paired treatment mechanism estimator, from this sequence. It is assumed that the resulting selection of the estimator of g0 is nonparametric enough so that the collaborative double robustness of the efficient influence curve, as presented in the next section, is utilized, and, thereby, that our asymptotic linearity theorem in a later section can indeed be applied.

A collaborative targeted maximum likelihood estimator constructed in this manner has made every effort to make the estimator of the additive causal effect as unbiased as possible. If we now construct a set of such collaborative targeted maximum likelihood estimators, possibly indexed by different initial estimators and different ways of constructing the sequence of targeted maximum likelihood estimators, we can then select among these estimators the estimator with minimal estimated variance (based on the influence curve). To obtain an honest estimate of the variance of the resulting estimator, just as one obtains an honest cross-validated risk of an estimator that internally uses cross-validation, one uses the honest cross-validated variance of the influence curve of the complete estimator, including cross-validating this final selection step that involves minimizing the variance.

3. Collaborative double robustness of estimating functions in CAR censored data models

In this section we establish a new kind of collaborative robustness of the class of estimating functions in CAR-censored data models, where, as in van der Laan and Robins (2003), the class of estimating functions is implied by the orthogonal complement of the nuisance tangent space of the target parameter Ψ : equation M139 → IRd. This orthogonal complement of the nuisance tangent space equals the space spanned by the gradients of the pathwise derivative of Ψ, and thus includes the canonical gradient/efficient influence curve. The collaborative robustness result teaches us that the censoring mechanism required to obtain an unbiased estimating function at a mis-specified Q for the parameter of interest need not always condition on the whole full data structure. In fact, it teaches us that the better Q approximates Q0 the less of an adjustment by full data random variables is necessary for the censoring mechanism to still obtain an unbiased estimating function for the parameter of interest. The precise collaborative property of (Q, g0(Q)) such that P0D(ψ0, g0(Q), Q) = 0 will be explicitly specified, where D represents the estimating function, such as the one implied by the canonical gradient.

3.1. The formal collaborative robustness result

The new form of double robustness we wish to establish is understood as follows. Consider an estimating function D(Ψ(Q), G, Q) for the parameter of interest ψ0 that is indexed by nuisance parameters (G0, Q0), and which is already known to satisfy the classical double robustness property: for any G under which ψ0 is identifiable from PQ0,G, we have E0D(ψ0, G, Q) = 0 if either Q = Q0 or G = G0 (van der Laan and Robins (2003)). Given a Q, we are interested in the question under what conditional distribution G0δ of censoring variable C, given a reduction X(δ) of X, will we still have P0D(ψ0, G0δ, Q) = 0 and thereby that D is an unbiased estimating function for ψ at this mis-specified Q.

Firstly, we note that P0D(ψ0, G, Q) = P0{D(ψ0, G, Q) – D(ψ0, G, Q0)} + P0D(ψ0, G, Q0), and the latter term is zero under any G that allows identifiability of ψ0. Thus, it remains to determine for what G0δ we will have P0{D(ψ0, G0δ, Q) – D(ψ0, G0δ, Q0)} = 0. This choice of G0δ (e.g., it includes G0 itself) is not unique, but it will depend on the difference Q – Q0 in the sense that X(δ) has to be rich enough so that the difference Q – Q0 is a function of X(δ).

By the general representation theorem for estimating functions that are orthogonal to the nuisance tangent space of the target parameter (Theorem 1.6, van der Laan and Robins (2003)), one can typically represent an estimating function D(ψ0, G, Q) as an Inverse Probability of Censoring Weighted Estimating function DIPCW(G, ψ0) plus a function DCAR(Q, G) in the tangent space TCAR(G) of the censoring mechanism at G. The function DCAR(Q, G) is defined as the projection of – DIPCW(G, ψ0) on the tangent space TCAR(G) = {h(O) : EG(h(O) | X) = 0} of the censoring mechanism when only assuming coarsening at random, where this projection is carried out in the Hilbert space of all functions of O with mean zero and finite variance, endowed with the covariance inner product ⟨f1, f2⟩ = EQ,g f1(O)f2(O). In other instances, the DIPCW might depend on Q0 through another parameter beyond ψ0, in which case it will need to be assumed that this parameter is correctly specified.

This teaches us that P0{D(ψ0, G, Q) – D(ψ0, G, Q0)} = P0{DCAR(Q, G) – DCAR(Q0, G)}, since the IPCW-difference equals zero. This representation theorem also teaches us that for all Q we have that DCAR(Q, G) has conditional mean zero under G, given X. In addition, this same theorem also shows that Q ↦ DCAR(Q, G) is linear in Q. Therefore, it remains to show that P0DCAR(Q – Q0, G) = 0. Now, inspection of the proof that the conditional mean of DCAR(Q′, G) under G equals zero for a Q′ typically involves conditioning on a rich enough reduction of X so that a particular function indexed by Q′ is fixed under the conditioning. Thus, the censoring mechanism only needs to condition on a particular function of Q – Q0.

This is best illustrated with a concrete censored data structure. For example, consider the right censored data structure O = (C, X̄(C)), where X(t) is a time dependent process, X = (X(t) : t) represents the full data structure, and X̄(t) = {X(s) : s ≤ t} represents the sample path up till time t. For this censored data structure, one can represent the projection of DIPCW onto TCAR as DCAR(Q, G) = ∫ HQ,G(u, X̄(u–))dMG(u), where

equation M140

and ΛC|X is the cumulative hazard of C, given X. For details, we refer to chapter 3 in van der Laan and Robins (2003). Here dMG(u) is a martingale satisfying E(dMG(u) | X̄(u), C ≥ u) = 0. Due to the linearity of the conditional expectation operator, we have DCAR(Q – Q0, G) = ∫ HQ–Q0,G(u, X̄(u))dMG(u). By conditioning on HQ–Q0,G(u, X̄(u)) within the integral, and using E(dMG(u) | X̄(u), C ≥ u) = 0, it follows that DCAR(Q – Q0, G) also has mean zero under a censoring mechanism such that λC(u | X) only depends on X̄(u) (it only depends on X̄(u) by CAR) through HQ–Q0,G(u, X̄(u)). One can factorize HQ–Q0,G = H1(G)H2(Q – Q0), so that adjustment in λC(u | X) by the time-dependent covariate H2(Q – Q0)(u, X̄(u)) suffices. Alternatively, it also suffices if the censoring mechanism uses a self-iterated adjustment by HQ–Q0,G as described later in this section. If Q approximates Q0, this function HQ–Q0,G(u, X̄(u)) will be shrunk to zero, so that less conditioning becomes necessary.

The following much simpler (but in essence making the same point) example helps to further illustrate the general collaborative double robustness property of the efficient influence curve. Suppose the observed censored data structure is O = (W, Δ, ΔY) and X = (W, Y) is the full data random variable, where Δ is the censoring variable. Suppose one wishes to estimate ψ0 = E0Y. The efficient influence curve is given by

equation M141

where

equation M142

Π0(W) = P0(Δ = 1 | W) and Q0(W) = E0(Y | W, Δ = 1). Consider a Q. We are interested in the question under what conditional distribution Π0δ of Δ, given a reduction W(δ) of W, will we still have P0D(ψ0, Π0δ, Q) = 0 and thereby that D is an unbiased estimating function for ψ at this mis-specified Q. Firstly, we note that P0D(ψ0, Π, Q) = P0{D(ψ0, Π, Q) – D(ψ0, Π, Q0)} + P0D(ψ0, Π, Q0), and the latter term is zero under any Π for which P0(Π(W) > 0) = 1. Thus, it remains to determine for what Π0δ we have P0{D(ψ0, Π0δ, Q) – D(ψ0, Π0δ, Q0)} = 0.

This teaches us that P0{D(ψ0, Π, Q) – D(ψ0, Π, Q0)} = P0{DCAR(Q, Π) – DCAR(Q0, Π)}, since the IPCW-difference equals zero:

equation M143

Note that we used here that Q ↦ DCAR(Q, Π) is linear in Q. Therefore, it remains to show that P0DCAR(Q – Q0, Π) = 0. This can be represented as H(Q – Q0, Π0)(W)(Δ – Π0(W)) as above, with H(Q – Q0, Π0)(W) = (Q – Q0)(W)/Π0(W).

The proof that the conditional mean of DCAR(Q – Q0, Π) under Π equals zero involves conditioning on a rich enough reduction of W so that Q – Q0 is captured by the conditioning: if (Q – Q0)(W) only depends on W through W(δ), then

equation M144

In particular, we have that the conditional mean of DCAR(Q – Q0, Π0), given (Q – Q0)(W), equals zero if Π0(W) = P(Δ = 1 | (Q – Q0)(W)). This shows that if, for example, (Q – Q0)(W) only depends on one component W1, then P0D(ψ0, Π0, Q) = 0 for Π0(W1) = P0(Δ = 1 | W1), and, more generally, for Π0(W′) with W1 ⊂ W′. That is, the better job Q does in approximating Q0 the less inverse probability of missingness weighting is required to still obtain an unbiased estimating function for ψ0.
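The following small simulation illustrates this point numerically for the missing data example above. The data generating choices are hypothetical and chosen only so that the error Q – Q0 is a function of W1 alone; the reduced missingness mechanism P(Δ = 1 | W1) is fitted crudely by stratifying on quantiles of W1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
W1, W2 = rng.normal(size=n), rng.normal(size=n)

Q0 = lambda w1, w2: 1.0 + w1 + w2                          # hypothetical E(Y | W); psi0 = 1
Pi0 = lambda w1, w2: 1 / (1 + np.exp(-0.8 * (w1 + w2)))    # hypothetical P(Delta = 1 | W)
Y = Q0(W1, W2) + rng.normal(size=n)
Delta = rng.binomial(1, Pi0(W1, W2))

Q = W2                       # misspecified Q: its error Q - Q0 = -(1 + W1) depends on W1 only

def estimate(Pi):
    """Estimating-equation estimator solving P_n D(psi, Pi, Q) = 0 for psi."""
    return np.mean(Delta / Pi * (Y - Q) + Q)

# Reduced mechanism conditioning on W1 only, fitted by stratifying on 50 quantile bins of W1.
bins = np.quantile(W1, np.linspace(0, 1, 51))
strata = np.clip(np.digitize(W1, bins[1:-1]), 0, 49)
Pi_red = np.clip(np.array([Delta[strata == s].mean() for s in range(50)])[strata], 0.01, 0.99)

print(estimate(Pi_red))                    # approximately 1: unbiased, both Q and Pi "wrong"
print(estimate(np.full(n, Delta.mean())))  # marginal weights: biased, since Q - Q0 depends on W1
print(np.mean(Q))                          # plug-in of the misspecified Q alone: biased
```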

Summary: Consider the efficient influence curve D(Q, G). Suppose we already know that, for Q with Ψ(Q) = ψ0, P0D(Q0, G) = 0 for all G. Given a Q with Ψ(Q) = ψ0, characterize the set of G0(Q)'s for which P0{D(Q0, G0(Q)) – D(Q, G0(Q))} = 0. For such Q and corresponding G0(Q)'s we have P0D(Q, G0(Q)) = 0. Given the representation theorem for estimating functions derived from the orthogonal complement of the nuisance tangent space, it appears that we need to determine the conditional distributions G0δ of C, given a reduction X(δ) of X, for which E0DCAR(Q – Q0, G0δ(Q)) = 0. Thus we need to determine the conditional distributions G0(Q) of C that solve the score equation E0DCAR(Q – Q0, G0) = 0 for the score DCAR(Q – Q0, G0). In particular, if G0(Q) is an MLE of a finite dimensional parameter (e.g., of the same dimension as ψ0), whose score spans DCAR(Q – Q0, G0), then E0DCAR(Q – Q0, G0) = 0. More generally, if G0 is a limit of an efficient (e.g., NPMLE) estimator in a model for G0 that has a tangent space at G0 that contains DCAR(Q – Q0, G0), then this G0 also satisfies E0DCAR(Q – Q0, G0) = 0. In addition, a self-iterated MLE, starting with an arbitrary offset G, for a parameter with score DCAR(Q – Q0, G) at G, can be employed as well, as presented below, resulting in an updated G0(Q) of G so that E0DCAR(Q – Q0, G0(Q)) = 0.

We will now present the general result which can be applied to any CAR-censored data model as defined and studied in van der Laan and Robins (2003).

Theorem 1 (Collaborative Double Robustness of Efficient Influence Curve/Estimating Functions)

CAR-censored data model: Let O = Φ(C, X) ~ P0 be a censored data structure with full data random variable X ~ PX0, and censoring variable C with conditional probability distribution G0 of C, given X. Assume G0 satisfies the coarsening at random assumption. Let g0(C | X) = dG0(C | X) be a probability density of G0 w.r.t. an appropriate dominating measure, where g0 itself satisfies coarsening at random. Let equation M145 denote the observed data model for P0. Due to CAR, we have w.r.t. an appropriate dominating measure dP0(O) = Q0(O)g0(O | X), where g0(O | X) is only a function of O (by CAR), and Q0 denotes the identifiable part of the full data distribution PX0 (Gill et al. (1997)). (Here we abused notation to indicate that the conditional density of O, given X, is a deterministic function of the conditional density of C, given X, and, in fact, represents the identifiable part of the censoring mechanism G0.) Let equation M146 and equation M147 be models for Q0 and G0 which imply a model equation M148 = {dP = Qg : Q ∈ equation M149, G ∈ equation M150} for P0.

Parameter of interest: Let Ψ : equation M151 → IRd be a pathwise differentiable parameter of interest, where it is assumed that Ψ(P0) = ΨF (Q0) is a function of Q0 only. Let D*(Q, G) be the efficient influence curve/canonical gradient of Ψ at dP = Qg.

We make the following assumptions:

Augmented “PCW”-representation of efficient influence curve:

(PCW stands for Probability of Censoring Weighted) For each Q ∈ equation M152, G ∈ equation M153,

equation M154

for mappings (G, Q) → h(G, Q), (h, G, Q) → Dh,PCW(G, Γ(Q)), (h, G, Q′) → Dh,CAR(G, Q′(Q, G)), both defined on equation M155 × equation M156 × equation M157, a parameter mapping Γ on equation M158, and (G, Q) → Q′(G, Q).

(We refer to Theorem 1.3 in van der Laan and Robins (2003) for such a general representation of the efficient influence curve and, more generally, the orthogonal complement of the nuisance tangent space, where the CAR-components are elements of the tangent space TCAR of G consisting of all functions of O with conditional mean zero, given X, under G. Under that representation, we have that E0Dh,PCW(G0, γ0) = 0 and Dh,CAR(G0, Q′) has conditional mean zero, given X, for all Q′.)

Linearity of CAR-component: Q′ → Dh,CAR(G, Q′) is linear on a set equation M159 containing {Q′(G, Q) : G, Q} in the sense that for all h ∈ equation M160, and all Q1, equation M161

equation M162

Robustness for mis-specified censoring mechanism: For all Q0 ∈ equation M163 and G ∈ equation M164(Q0) ⊂ equation M165, where (e.g.,) equation M166(Q0) is defined as all censoring mechanisms G for which ψ0 can be identified from dP = dQ0g, we have

equation M167

Robustness of CAR-component: For a reduction X(δ) of X (i.e., X(δ) = f(X, δ) for some function f), let G0δ be the conditional distribution of C, given X(δ).

Let equation M168 be a set within equation M169 for which for each equation M170

equation M171

(Typically, one can select equation M172 as all functions in equation M173 that are only functions of X through X(δ).)

Let Γ(Q) = Γ(Q0) (typically implying Ψ(Q) = ψ0), G0δ ∈ equation M174(Q0), and assume equation M175, where Q′ = Q′(G0δ, Q) and Q′0 = Q′(G0δ, Q0). Then

equation M176

We also have for all G ∈ equation M177(Q0)

equation M178

Proof. Suppose Γ(Q) = Γ(Q0) and equation M179. Let equation M180 be the conditional distribution of C, given X(δ), and assume it is an element of equation M181(Q0).

By the “Augmented ‘PCW’-representation of efficient influence curve” assumption, we have

equation M182

for some h ∈ equation M183. Thus,

equation M184

By the assumption that equation M185, it follows that the last term equation M186.

By the “PCW-representation” assumption we have

equation M187

By the assumption that Γ(Q) = Γ(Q0), the first term equals zero. By the “linearity of CAR-component”-assumption we have that the last term equals:

equation M188

where equation M189 and equation M190.

We assumed that equation M191. Thus, by the “Robustness of CAR-component”-assumption we have that

equation M192

This proves equation M193. □

3.2. Examples illustrating the collaborative double robustness in censored data models

For the sake of illustration, we will now explicitly establish the collaborative double robustness of the efficient influence curve estimating function in two additional examples. These results are also corollaries of the above general Theorem 1.

3.2.1. Example I: Marginal additive causal effect in nonparametric model

We have the following double robustness result for our additive causal effect example.

Theorem 2 Let dP0 = Q0dG0 be the distribution of O = (W, A, Y) and let the model for P0 be nonparametric.

Let Ψ(Q0) = EQ01{EQ02(Y | A = 1, W) – EQ02(Y | A = 0, W)} be the parameter on this model, where it is assumed that it is identifiable from P0. Here Q01 denotes marginal distribution of W and Q02 the conditional distribution of Y, given A, W. The efficient influence curve of Ψ at P = (Q, G) is given by

equation M194

where Q2(A, W) = EQ(Y | A, W) denotes the conditional mean of Y, given A, W, under Q = (Q1, Q2).

Assume

equation M195

is only a function of A, W(Q) for a W(Q) = Φ(Q2, W) for some mapping Φ: i.e., W(Q) denotes a reduction or subset of the full vector random variable W indexed by Q.

Let dG0(Q) be the conditional distribution of A, given W(Q). If Ψ(Q) = Ψ(Q0), then

equation M196

Or, equivalently, if we represent D*(Q, G) as D*(Ψ(Q), Q, G), then

equation M197

We also have: If Pr(PG(A = 0 | W) * PG(A = 1 | W) > 0) = 1, then

equation M198

or equivalently,

equation M199

Proof. The last statement is easy and well known (e.g., van der Laan and Robins (2003)). The first statement needs to be proved, or can be derived as a corollary of Theorem 1. Note, if Ψ(Q) = ψ0, then

equation M200

If E0(Y – Q(A, W) | A, W) = f0(A, W(Q)) is only a function of A, W(Q), then it follows, by first taking the conditional mean, given A, W, and then taking the conditional mean over A, given W(Q),

equation M201

Now, note that f0(A, W(Q)) = Q0(A, W) – Q(A, W), which proves that the latter quantity equals zero.

The implication of this result is that, given an estimate Q of Q0, we only need to estimate G0(Q), conditioning on W(Q), or any conditional distribution that conditions on more than W(Q). Thus, if Q already succeeds in explaining most of the true regression E0(Y | A, W), then only little inverse weighting with G0(Q) = P(A = · | W(Q)) remains to be done. That is, the amount and manner of inverse weighting required to obtain a consistent estimator of the causal effect ψ0 can be adapted to the approximation error of Q relative to the true regression.
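A numerical illustration of Theorem 2, under hypothetical data generating choices selected so that E0(Y – Q | A, W) is a function of (A, W1) only: adjusting the treatment mechanism for W1 alone then already removes the bias of the misspecified Q.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
W1, W2 = rng.normal(size=n), rng.normal(size=n)

g0 = 1 / (1 + np.exp(-0.7 * (W1 + W2)))              # hypothetical P(A = 1 | W1, W2)
A = rng.binomial(1, g0)
Q0 = lambda a: a * (1 + W1 ** 2) + W2                # hypothetical E(Y | A, W); psi0 = 2
Y = Q0(A) + rng.normal(size=n)

Q = lambda a: a + W2   # misspecified Q: E0(Y - Q | A, W) = A * W1**2, so W(Q) = W1

# Reduced treatment mechanism G0(Q) = P(A = 1 | W1), fitted by stratifying on quantiles of W1.
bins = np.quantile(W1, np.linspace(0, 1, 51))
strata = np.clip(np.digitize(W1, bins[1:-1]), 0, 49)
g_red = np.clip(np.array([A[strata == s].mean() for s in range(50)])[strata], 0.01, 0.99)

h = A / g_red - (1 - A) / (1 - g_red)                # clever covariate at the reduced g
print(np.mean(h * (Y - Q(A)) + Q(1) - Q(0)))         # approximately 2: Theorem 2 in action
print(np.mean(Q(1) - Q(0)))                          # plug-in of the misspecified Q: 1, biased
```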

3.2.2. Example II: Semiparametric regression

Let O = (W, A, Y) ~ P0. Assume the model E0(Y | A, W) – E0(Y | A = 0, W) = Aβ0V for some V ⊂ W. If the variance of Y, given A, W, only depends on W, then the efficient score of β0 at P0 can be represented as

equation M202

where Π0(W) = E0(A | W), and θ0(W) = E0(Y | A = 0, W). For the sake of illustration we will use this simpler representation, but the same double robustness applies to the general efficient influence curve representation as (e.g.) presented in van der Laan and Robins (2003).

Theorem 3 Suppose E0(Y – Aβ0V – θ(W) | A, W) = f0(W(θ)) for some function f0 of W(θ), where W(θ) = Φ(W, θ) is a function of W and θ. Note that this states that θ0(W) – θ(W) = f0(W(θ)) is only a function of a reduction W(θ) of W. Let Π0(θ)(W) = E0(A | W(θ)). Then

equation M203

We also have

equation M204

Proof. Only the first robustness result needs to be proved. First take the conditional mean, given A, W, which results in the term E0{(A – Π0(θ)(W(θ)))f0(W(θ))}. Subsequently, we take the conditional mean, given W(θ), which proves it equals zero. □

3.3. Construction of collaborative double robust estimators

By using a collaborative estimator gn(Q) of a g0(Q) in the set of conditional distributions that condition on the required function of Q – Q0 (and g0 itself), one can construct collaborative double robust estimators. For example, one could use the targeted maximum likelihood estimator applied to initial estimator Qn and using the resulting collaborative estimator gn(Qn). One can also use estimating equation methodology, solving for ψn in 0 = PnD*(ψ, Qn, gn(Qn)). The formal asymptotic linearity (and thereby asymptotic normality) of such estimators is studied in the next section. Our proposed collaborative targeted maximum likelihood procedure is one particular collaborative double robust targeted maximum likelihood estimator, which also involves updating the initial estimator Qn beyond the construction of an appropriate gn(Qn). However, we could also simply have taken gn(Qn) from our proposed collaborative targeted maximum likelihood procedure, and still use the targeted maximum likelihood estimator with initial estimator Qn. In addition, we could also have used our proposed collaborative estimator gn(Qn) to solve the estimating equation 0 = PnD*(Qn, gn(Qn), ψ) in ψ.

Other methods for construction of collaborative estimators gn(Qn) are of interest as well. For example, one could consider a collection of one-dimensional fluctuations of Qn and use maximum likelihood to test these fluctuations. In this manner one can select a dimension reduction involving the X-components that still significantly increase the log-likelihood (or other loss function) beyond the initial fit Qn. One could then fit gn by running a machine learning algorithm that only conditions on the selected components. This procedure only uses the initial estimator to obtain a dimension reduction, but from then on it uses an external procedure based on the loss function for g0.

Given an initial estimator Qn, another idea of interest for construction of a collaborative estimator gn(Qn) is the following. One first constructs a sequence of increasingly nonparametric estimators ĝj of g0, j = 1, . . . , J. These estimators could already be based on a dimension reduction that uses the initial estimator as offset. Given an initial estimator Q, we select the following estimator of g0:

equation M205

where Bn denotes a random variable in {0, 1}n defining a random split into a training sample {i : Bn(i) = 0} and a validation sample {i : Bn(i) = 1}, and equation M206, equation M207 denote the empirical probability distributions of the training and validation sample, respectively. Thus, one selects the estimator that minimizes the Euclidean norm of the cross-validated mean of the efficient influence curve at the estimator Q. If j is too small, then P0D*(Q, gj) will be non-zero, so that jn will always select a large enough j for n tending to infinity. If, on the other hand, j is large enough so that P0D*(Q, gj) = 0, then the expectation of {PnD*(Q, gj)}2 will be equal to its variance, which will be increasing in j, so that smaller j's, but larger than the critical one, will be preferred. One now defines as collaborative estimator the estimator gn(Qn) ≡ ĝjn(Pn) indexed by this choice jn.
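A sketch of this selection rule, in the additive effect example and with V-fold cross-validation rather than a single split, is given below. The interface of the candidate fitters is a placeholder, and the efficient influence curve of the additive effect is recomputed inline.

```python
import numpy as np

def eif_ate(y, a, q1, q0, g1):
    psi = np.mean(q1 - q0)
    h = a / g1 - (1 - a) / (1 - g1)
    qa = np.where(a == 1, q1, q0)
    return h * (y - qa) + (q1 - q0) - psi

def cv_criterion(y, a, w, q1, q0, fit_g, n_folds=5, seed=0):
    """Cross-validated squared mean of the efficient influence curve for one candidate
    treatment mechanism fitter fit_g(w_train, a_train) -> (function w -> P(A = 1 | w)).
    q1, q0: fixed predictions of the given Q under A = 1 and A = 0."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))
    total = 0.0
    for v in range(n_folds):
        train, val = folds != v, folds == v
        g_hat = fit_g(w[train], a[train])            # trained on the training sample only
        g1_val = np.clip(g_hat(w[val]), 0.01, 0.99)
        total += np.mean(eif_ate(y[val], a[val], q1[val], q0[val], g1_val)) ** 2
    return total / n_folds

# j_n = argmin_j cv_criterion(y, a, w, q1, q0, fit_g=candidate_fitters[j]) over an
# increasingly nonparametric list candidate_fitters[0], ..., candidate_fitters[J-1].
```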

3.4. Estimating the sufficient minimal adjustment covariate from the data

Let H(g0, Q – Q0) be the component that needs to be adjusted for in g0 = g0(Q). One could estimate this component from the data using appropriate methodology. If, given an arbitrary initial fit g, one would add H(g, Q – Q0) as a main term in a fluctuation model of g, and the fluctuation function is chosen so that the score of the coefficient of this main term at zero equals DCAR(Q – Q0, g0), then the MLE-update of g will solve the desired score equation P0DCAR(Q – Q0, g0) = 0. We can refer to H(g0, Q – Q0) as the minimal adjustment covariate, needed to obtain the desired collaborative robustness.

The sufficient covariate H(g, Q – Q0) that is needed to update g depends on g itself as well, so that, even given an estimate of Q – Q0, the above maximum likelihood update of an initial g does not work directly. There are two approaches that can be used to deal with this self-dependence of the minimal adjustment covariate for g.

Firstly, one can extract the few components of Q – Q0 only, and enforce nonparametric adjustment by these covariates in a fit g0(Q) of the censoring mechanism. In this manner, the resulting censoring mechanism estimator will estimate a true conditional distribution that conditions on covariates that imply the value of H(g0(Q), Q – Q0). Secondly, one can also only adjust for H(g0, Q – Q0) as a main term, given an estimate of g0, and iterate this updating process of g till convergence. In the latter case, as mentioned above, it is assumed that the score of the fluctuation of g implied by this main term extension H(g, Q – Q0) at zero equals DCAR(g, Q – Q0). Let's illustrate these two approaches with an example.

For example, in the additive causal effect example with O = (W, A, Y), ψ0 = E(Y(1) – Y(0)) and Y continuous, we have equation M208, and the score of ε, at ε = 0, of the logistic regression model gε(1 | W) = 1/(1 + exp(–C0(W) – εH(g, Q – Q0))), using an offset C0(W), is given by DCAR(g, Q – Q0) = H(g, Q – Q0)(A – g(1 | W)).

One can estimate E(Y – Q | A = 1, W) and E(Y – Q | A = 0, W) with a machine learning algorithm, treating an initial estimate Q as offset (possibly cross-validated to make the offset independent of Yi). If Y is binary, one could estimate these (Q – Q0)-components by running a logistic regression with Q as offset. Given this estimate of the two (Q – Q0)-components that span H(g0, Q – Q0), one could now force nonparametric adjustment by these two estimated covariates in the estimate of g0.

Alternatively, given an initial estimator equation M209, one could obtain an estimate equation M210 by plugging in an estimate of Q0 – Q and this initial equation M211. One could now force in this equation M212 as a main term in equation M213, resulting in an updated equation M214. This process is iterated till convergence. In the limit we have that gn solves equation M215. Here equation M216 would already be a collaborative estimator of g0(Q), such as the one proposed in our collaborative targeted maximum likelihood estimator, so that the collaborative double robustness is preserved (we do not want to rely only on correct specification of equation M217, and thereby of this sufficient minimal covariate).
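The second, iterative, approach can be sketched as follows for a treatment mechanism g(1 | W): at each step the current estimate of the clever covariate is added as a main term in a logistic fluctuation of g (with the current logit g as offset), its coefficient is fitted by MLE, and the procedure stops once that coefficient is numerically zero, at which point the empirical score equation P_n H(g)(A – g(1 | W)) = 0 is solved. The exact form of H depends on the problem and on the supplied estimate of Q – Q0, so H is passed in as a user-supplied function; all names below are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expit(x):
    return 1 / (1 + np.exp(-x))

def iterate_g_update(a, g1_init, H, max_iter=50, tol=1e-6):
    """Iterative main-term update of g. H maps the current g1 = g(1 | W) to the estimated
    clever covariate H(g, Q - Q0)(W) (a vector of the same length as a)."""
    g1 = np.clip(g1_init, 0.01, 0.99)
    for _ in range(max_iter):
        h = H(g1)                                    # estimated H(g, Q - Q0)(W)
        offset = np.log(g1 / (1 - g1))

        def negloglik(eps):
            p = np.clip(expit(offset + eps * h), 1e-12, 1 - 1e-12)
            return -np.mean(a * np.log(p) + (1 - a) * np.log(1 - p))

        eps = minimize_scalar(negloglik, bounds=(-5, 5), method="bounded").x
        g1 = np.clip(expit(offset + eps * h), 0.01, 0.99)
        if abs(eps) < tol:                           # fixed point: P_n H(g)(A - g(1|W)) ~ 0
            break
    return g1
```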

We do not advise starting the above iterative algorithm at a purposely misspecified estimator equation M218. Instead we want to apply the above iterative algorithm at a collaborative estimator equation M219, such as the one presented in our template of the C-TMLE in Section 2. For example, after having run the C-TMLE in Section 2, we would carry out a subsequent update of the resulting collaborative estimator gn by applying the above iterative updating algorithm, starting at equation M220, and using an estimator of Q0 – Q. If we would only include this estimate of the sufficient covariate H(g0, Q – Q0) in gn, then the consistency of the estimator ψn would fully rely on correct estimation of Q – Q0, and thereby on correct estimation of Q0, and therefore would not utilize the collaborative double robustness of the efficient influence curve. Instead of carrying out a subsequent update of a collaborative estimator gn using the iterative algorithm, we could incorporate an estimate Hn (or its (Q – Q0)-components) in our proposed template for the collaborative targeted MLE by forcing it into our candidate censoring mechanism estimators.

In the above additive causal risk example, we estimate H(g, Q – Q0) by plugging in an initial estimate of g and an estimate of Q – Q0, and the iterative adjustment succeeds in its goal as long as the estimate of Q – Q0 is correct, even if (the initial) g is misspecified. One could also use a representation such as H(g0, Q – Q0) = Eg0,Q0(DIPCW(g0, Q) | A = 1, W) – E0(DIPCW(g0, Q) | A = 0, W), and estimate the two regressions by regressing an IPCW-function indexed by g0, Q on A, W, and evaluate it at A = 1 and A = 0, respectively. Here DIPCW = (Y – Q){A/g0(A | W) – (1 – A)/g0(A | W)}. One could now apply the above iterative updating algorithm to this (non-substitution based) manner of estimating H(g0, Q – Q0). As shown in (e.g.) van der Laan and Robins (2003) for monotone censored data structures and causal inference data structures, involving censoring and treatment actions over time, DCAR(Q – Q0, g0) does allow such a representation Σj Hj(g0, Q – Q0)(A(j) – g0j(1 | equation M221(j))), where equation M222(j) represents the history before censoring or treatment A(j), and Hj(g0, Q – Q0) = E0(DIPCW | A(j) = 1, equation M223(j)) – E0(DIPCW | A(j) = 0, equation M224(j)) for some IPCW-function. The disadvantage of this approach is that it relies on g0 representing a true conditional distribution, while in the iterative substitution based approach the main term adjustment at a possibly misspecified g still yields the desired collaborative double robustness.

4. Asymptotic linearity of collaborative double robust TMLE

The collaborative targeted maximum likelihood estimator equation M225 equals a kn-th step collaborative targeted maximum likelihood estimator, and thereby equals a targeted maximum likelihood estimator with a starting estimator equation M226 (e.g., the (kn – 1)-th collaborative targeted maximum likelihood estimator), and with the censoring mechanism estimator gn selected at the kn-th step, given the collection of candidate estimators g indexed by δ ranging over an index set.

Thus, just like the targeted maximum likelihood estimator, the collaborative targeted maximum likelihood estimator equation M227 of ψ0 solves the efficient influence curve estimating equation

equation M228

For simplicity, we will make the assumption that the efficient influence curve at a PQ,g can be represented as an estimating function in ψ: i.e., the efficient influence curve at P can be represented as D*(Q(P), g(P), ψ(Q(P))) for some mapping (Q, g, ψ) → D*(Q, g, ψ). However, the theorem in this section can be generalized to any efficient influence curve D*(Q, g) at a data generating distribution PQ,g.

It is a reasonable assumption that equation M229 converges to some element Q* in the model for Q0, where Q* is not necessarily equal to the true Q0. In addition, let’s assume that, for each δ, the δ-specific censoring mechanism estimator g converges to some g0δ. For example, if δ indicates an adjustment set, then it might be assumed that g converges to the true conditional distribution, given this δ-specific adjustment set.

For a given Q, we define δ(Q) as the index δ with entropy d(δ) minimal and so that

equation M230

In other words, given the family of adjustments indexed by δ, δ(Q) represents the minimal adjustment necessary in the censoring mechanism to obtain the collaborative double robustness/unbiased estimating function for ψ0. It is then a natural assumption that

equation M231

In other words, if one uses a more nonparametric estimator of the censoring mechanism than needed (i.e., more nonparametric than δ(Q)), then one certainly obtains the desired unbiasedness.

We will assume that, as n converges to infinity, the selected censoring mechanism estimator gn converges to a fixed g0δ0, the limit of the δ0-specific estimator, not necessarily equal to the conditional distribution, given the full X. For notational convenience, we will also denote this limit with g0.

It is assumed that d(δ0) ≥ d(δ(Q*)) so that

equation M232

which will be the fundamental assumption for asymptotic normality of the CTMLE. In other words, it is assumed that our collaborative C-TMLE procedure selects a nonparametric enough estimator gn for the censoring mechanism (in collaboration with equation M233) so that the required unbiasedness of the efficient influence curve estimating function is achieved.

To derive the influence curve of equation M234, the asymptotic linearity theorem below assumes also that the limit of the selected censoring mechanism estimator satisfies

equation M235
(3)

As a consequence of this assumption (3), the influence curve does not involve a contribution requiring the analysis of a function of equation M236. This important simplification of the influence curve allows straightforward calculation of standard errors for the C-TMLE. The assumption (3) requires the limit g0 to be nonparametric enough w.r.t. the actual estimator equation M237 so that enough orthogonality is achieved to make the contribution equation M238 second order.

Why assumption (3) holds for C-TMLE: We now explain why this assumption is reasonable for the C-TMLE.

Define equation M239 as equation M240 with equation M241. In other words, equation M242 corresponds with the limit of the least nonparametric estimator (among all estimators more nonparametric than the one identified by δ0) that still yields the desired unbiasedness of the estimating function at equation M243, and it is as close as possible to g0 = g0δ0.

We note that

equation M244

where Rn is a second order term (like Rn1 below) involving the difference equation M245 and equation M246. By definition of equation M247 and the fact that equation M248 converges to Q*, it is reasonable to assume equation M249 as n → ∞. So Rn is a second order term, so that it is reasonable to assume equation M250.

By definition of equation M251, we do not only have

equation M252

but also that equation M253 is equally or more nonparametric than g0(Q*) so that

equation M254

This implies now that indeed

equation M255

Finally, we note that the next theorem can be applied to any collaborative double robust estimator, as discussed in the previous section, not only the collaborative double robust targeted maximum likelihood estimator.

Theorem 4 Let (Q, g, ψ) → D*(Q, g, ψ) be a well defined function that maps any possible (Q, g, Ψ(Q)) into a function of O. Let O1, . . . , On ~ P0 be i.i.d., and let Pn be the empirical probability distribution. Let Q → Ψ(Q) be a d-dimensional parameter, where ψ0 = Ψ(Q0) is the parameter value of interest. In the following template for proving asymptotic linearity of equation M256 as an estimator of Ψ(Q0), equation M257 represents the collaborative targeted maximum likelihood estimator, but it can be any estimator.

Let Q* denote the limit of equation M258. Let gn be an estimator and g0 denote its limit.

Assume

Efficient Influence Curve Estimating Equation: equation M259, where equation M260.

Censoring Mechanism Estimator is Nonparametric Enough:

equation M261

equation M262

(Above we show why the latter is indeed a second order term for the C-TMLE.)

Consistency:

equation M263

as n → ∞. The same is assumed if one or two of the components of the triplet equation M264 are replaced by their limits (Q*, g0, ψ0).

Identifiability/Invertibility: c0 = –d/dψ0P0D*(Q*, g0, ψ0) exists and is invertible.

Donsker Class: {D*(Q, g, Ψ(Q)) : Q, g} is P0-Donsker, where (Q, g) vary over sets that contain ( equation M265, gn), (Q*, gn), ( equation M266, g) with probability tending to 1.

Contribution due to Censoring Mechanism Estimation: Define the mapping g → Φ(g) ≡ P0D*(Q*, g, ψ0). Assume equation M267 for some mean zero function equation M268.

Second order terms: Define second order term

equation M269

and assume equation M270. Note that Rn1 is a second order term involving the differences equation M271 and gn – g0.

Define second order term

equation M272

and assume equation M273. Note that Rn2 is a second order term involving the differences gn – g0 and ψn – ψ0.

Then ψn is an asymptotically linear estimator of ψ0 at P0 with influence curve

equation M274

That is,

equation M275

In particular, equation M276 converges in distribution to a multivariate normal distribution with mean zero and covariance matrix Σ0 = E0IC(P0)IC(P0)⊤.

Proof: The principal equations are equation M277 and P0D*(Q*, g0, ψ0) = 0. So, we have

equation M278

Let equation M279. Then,

equation M280

We denote the three terms on the right with I, II and III, and deal with them separately below.

I: By the Donsker condition, and consistency condition, we have

equation M281

Thus, we obtain equation M282 as first term approximation. We refer to van der Vaart and Wellner (1996) for this empirical process theorem.

II: We have

equation M283

The first term is equation M284 by our Donsker class condition, and consistency condition at equation M285. We also have

equation M286

where

equation M287

by assumption.

Rn1 is a second order term involving equation M288 and (gn, ψn) – (g0, ψ0). It remains to consider the term equation M289, which is equation M290 by the “Censoring Mechanism is Nonparametric Enough” assumption.

III: We have

equation M291

The first term is equation M292 by Donsker class condition, and consistency condition at equation M293, gn, ψn. We also have

equation M294

where

equation M295

by assumption. Thus the third term equals P0{D*(Q*, gn, ψ0) – D*(Q*, g0, ψ0)}, which, by definition, equals Φ(gn) – Φ(g0). We assumed that equation M296. Thus, the third term equals equation M297.

We can thus conclude that

equation M298

This implies equation M299, and thereby the stated asymptotic linearity. □

4.1. Statistical Inference

If Q* = Q0, then ICg0 = 0, so that the influence curve reduces to the efficient influence curve D*(Q0, g0, ψ0) at a possibly weakly adjusted g0. If gn converges to the fully adjusted conditional distribution, given X, then we know that ICg0 equals minus the projection of D*(Q*, g0, ψ0) onto the tangent space of the model used by gn (van der Laan and Robins (2003), Section 2.3.7). We suggest that, even if g0 is not the fully adjusted censoring mechanism, we will typically still have that D*(Q*, g0, ψ0) is a conservative influence curve. In other words, if Qn starts approximating the true Q0, then the ICg0 contribution gets smaller and smaller, while if Qn stays away from Q0, then gn starts approximating the fully adjusted g0, in which case, inference based on D* is conservative. This might explain why we see good coverage in our simulations based on “influence curve” equation M300. If gn corresponds with a parametric MLE estimator (for a data adaptively selected parametric model), then we propose to use the parametric delta-method to compute the analytic formula for the influence curve ICg0 in order to obtain an accurate influence curve.

One can estimate the covariance matrix Σ = E0IC IC⊤ of the influence curve with the empirical covariance matrix equation M301, and statistical inference can be based on the corresponding mean zero multivariate normal distribution, as usual.
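For a one-dimensional target parameter this amounts to the usual Wald-type construction; a minimal sketch, with the influence curve values (for example, the efficient influence curve evaluated at the C-TMLE fit) assumed to be already computed:

```python
import numpy as np
from scipy.stats import norm

def ic_confint(ic_values, psi_hat, level=0.95):
    """Wald-type confidence interval from estimated influence curve values (one per
    observation) for a one-dimensional target parameter."""
    n = len(ic_values)
    se = np.sqrt(np.var(ic_values, ddof=1) / n)      # SE from the empirical variance of the IC
    z = norm.ppf(0.5 + level / 2)
    return psi_hat - z * se, psi_hat + z * se, se
```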

4.2. Selection among different collaborative targeted maximum likelihood estimators

Suppose that we have a set of candidate collaborative targeted maximum likelihood estimators equation M302, k = 1, . . . , K. Suppose that each of these estimators satisfies the conditions of the theorem. For example, these might be collaborative targeted maximum likelihood estimators as defined in our template, using different initial estimators indexed by k, but the same collaborative estimator for the censoring mechanism as a function of the data and the initial estimator (thus still resulting in different realizations if the initial estimators are different). Then equation M303 is asymptotically linear with influence curve equation M304, k = 1, . . . , K. We can now select among these candidate C-DR-TMLEs by maximizing the estimated efficiency, as in Rubin and van der Laan (2008).

Specifically, let Ψ be a one-dimensional parameter. We now select the k that minimizes the cross-validated variance of the influence curve:

equation M305

Thus, we would use the estimator equation M306. If Ψ is multidimensional, then one needs to agree on a real valued criterion applied to the covariance matrix of the influence curve, such as the sum of the variances along the diagonal, and minimize over k this criterion applied to the cross-validated covariance matrix of the k-specific influence curve.

4.3. Irregular C-TMLE and super efficiency

If gn converges to the fully adjusted g0(· | X) (fully adjusting for X, under CAR) and equation M307 converges to Q0, then it follows that ψn is asymptotically linear with influence curve equal to the efficient influence curve D*(Q0, g0, ψ0). So in that case, ψn is an asymptotically efficient estimator and thereby also a regular estimator.

Due to the particular way gn is constructed in response to Qn, it is easily argued that the collaborative targeted MLE can be an irregular estimator and can be super efficient by achieving an asymptotic variance that is smaller than the variance of the efficient influence curve. In particular, our previous arguments showed that if the initial estimator is a maximum likelihood estimator according to a correctly specified parametric model, then gn will avoid nonparametric fits, thereby staying away from estimating the fully adjusted g0 that would result in an efficient estimator in first order. In this case, by the above theorem, the influence curve of ψn will be equal to D*(Q0, g0, ψ0), using a non-fully adjusted g0, so that the variance of the influence curve will be smaller than the variance of the efficient influence curve that involves a fully adjusted g0.

The super efficiency may have very attractive features in practice. For example, there might be a covariate that is very predictive of censoring/treatment, but has no relation to the outcome. The C-TMLE will now decide to not adjust for this covariate at all in the selected censoring mechanism, and as a consequence, it might achieve the efficiency bound for the data structure excluding this covariate, but still assuming CAR, so that the C-TMLE will have smaller asymptotic variance than the efficiency bound. The resulting super efficient estimator not only shows improved precision, but also yields more reliable confidence intervals, by avoiding heavily non-robust (and harmful) operations. In most practical scenarios, such a covariate will still have a weak link with the outcome. In this case, for very large sample sizes, the C-TMLE will adjust for this covariate and thereby only be asymptotically efficient, but it will still behave as a super efficient estimator for practical sample sizes, by not adjusting for this covariate. That is, it invests in effective bias reduction, focusing on covariates that are still predictive of the outcome, taking into account the already included initial estimator. This behavior is completely compatible with an estimator that aims to minimize the mean squared error of the estimator of the target parameter, and certainly avoids steps that increase both bias and variance.

Finally, we remark that in simulations in which Qn converges fast to the true Q0, gn appears to have a tendency to converge to a random choice g0 that is beyond the required minimal censoring mechanism with probability 1. That is, likelihood based cross-validation might over-select the adjustment in the censoring mechanism relative to the minimal adjustment, and the amount of over-selection remains random (but small) for large sample sizes (this is a known property of cross-validation). This naturally results in an irregularity of the estimator. Simulations have not shown practical problems for statistical inference, but this remains an area of study.

5. Targeted loss functions implied by efficient influence curve

The template of the collaborative targeted maximum likelihood estimators is based on 1) a log-likelihood loss function (i.e., the same loss function that is maximized in the targeted maximum likelihood step) to select among candidate targeted maximum likelihood estimators indexed by increasingly nonparametric estimators of the censoring mechanism, and 2) a preferred loss function to compare targeted maximum likelihood estimators using different censoring mechanism estimators, in order to build these candidate censoring mechanism estimators. One can also use a preferred loss function to select among different candidate collaborative targeted maximum likelihood estimators (e.g., indexed by different initial estimators).

In this section we propose targeted loss functions implied by the efficient influence curve of Ψ. Firstly, the log-likelihood can be replaced by a penalized log-likelihood that is sensitive to sparse data bias w.r.t. the target parameter, as defined in the next subsection. This penalized log-likelihood can also play the role of the preferred loss function. In the second subsection we propose as preferred loss function the cross-validated variance of the efficient influence curve, relying on an overall collaborative estimator of the censoring mechanism w.r.t. an initial estimator, or a candidate specific collaborative estimator. In the last subsection, we utilize the mean of the efficient influence curve as a criterion to generate a targeted loss function for Q that incorporates a sequence of increasingly nonparametric estimators of the censoring mechanism.

5.1. The MSE-penalized cross-validated log-likelihood

In the C-DR-TMLE we applied log-likelihood based cross-validation to select among different targeted maximum likelihood estimators, indexed by different censoring mechanism estimators. We propose here a penalized log-likelihood criterion that results in robust estimators in the context of sparse data w.r.t. the parameter of interest.

Consider candidate (e.g., collaborative) targeted maximum likelihood estimators equation M308 of the true Q0 ∈ equation M309, targeting a parameter ψ0 = Ψ(Q0), indexed by δ. Our proposed criterion for selecting δ is

equation M310

where the first term is the cross-validated log-likelihood for the candidate estimator equation M311, and MSE(Pn)(δ) is an estimator of the mean squared error (variance plus bias-squared) of the substitution estimator equation M312 as an estimator of its δ-specific limit (thus ignoring asymptotic bias). The MSE(Pn)(δ) is possibly appropriately scaled relative to the log-likelihood term. The sole motivation for the proposed additional penalty term is to make the criterion more targeted towards ψ0, while still preserving the log-likelihood as the dominant term in regular situations: i.e., asymptotically, the penalty is negligible (in regular situations the MSE behaves as 1/n).

5.1.1. Variance of targeted maximum likelihood estimator relative to its δ-limit

If the target parameter cannot be reasonably identified from the data, the log-likelihood of the targeted maximum likelihood estimator is not sensitive enough to such a singularity: in fact, on many occasions this just means that the targeted maximum likelihood algorithm will be ineffective (i.e., the maximum likelihood fluctuations get too noisy) so that in essence the log-likelihood of the initial estimator drives the selection.

Therefore it is crucial that the log-likelihood terms are penalized by a term that blows up (in the negative direction) for δ-values for which the variance (or bias, addressed in the next subsection) of the targeted maximum likelihood estimator equation M313 relative to its limit equation M314 gets large. Since we can derive the influence curve of the targeted maximum likelihood estimator equation M315 as an estimator of ψ0(δ), this variance can be estimated with the variance of this influence curve at this targeted maximum likelihood estimator equation M316. As follows from the study of TMLE in van der Laan and Rubin (2006), one can often use as approximate influence curve the efficient influence curve D*( equation M317, gδ) at the limit of the targeted maximum likelihood estimator ( equation M318, ĝδ), which simplifies the penalty while it remains equally effective.

We first define the cross-validated covariance matrix

equation M319

For example, if the target parameter is 1-dimensional (i.e., d = 1), then we have

equation M320

For example, one can define the variance term of the MSE as

equation M321

for a user supplied vector a, so that σ2(Pn)/n represents the variance estimate of the estimator of a⊤ψ0(δ).

Our proposal presented in the MSE-subsection below will actually have the form

equation M322
(4)

where aj are the row vectors of the square root of a user supplied matrix such as the inverse of the correlation matrix of Σ(Pn)(δ).

5.1.2. Bias of targeted maximum likelihood estimator relative to its δ-limit

We might also wish to estimate the bias of the targeted maximum likelihood estimator equation M323 relative to its limit ψ0(δ) (even though in most applications the variance appears to drive the MSE term). For example, this could be done with the bootstrap:

equation M324

where equation M325 represents the empirical distribution of a bootstrap sample equation M326 from the empirical distribution Pn. However, this would be much too computer intensive in many applications in which the targeted maximum likelihood estimator involves data adaptive model or algorithm selection. By noting that a bootstrap sample corresponds on average with 2/3 of the n observations, the following analogue bias estimate can be viewed as an approximation of this bootstrap bias that only requires 3 times applying the targeted maximum likelihood estimator to a sample of size n * 2/3:

equation M327

where Bn3 denotes the 3-fold cross-validation scheme.
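This 3-fold approximation can be written down directly; the sketch below assumes the targeted maximum likelihood estimator is wrapped as a function mapping a data array to a point estimate (a placeholder interface of our own).

```python
import numpy as np

def three_fold_bias_estimate(estimator, data, seed=0):
    """Approximate bootstrap-bias estimate: the average, over the 3 folds of a 3-fold
    cross-validation scheme, of the difference between the estimator applied to the
    2n/3 training observations and the estimator applied to the full sample."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    folds = rng.permutation(n) % 3
    psi_full = estimator(data)
    diffs = [estimator(data[folds != v]) - psi_full for v in range(3)]
    return np.mean(diffs, axis=0)
```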

If d = 1, then we will add to the variance term in the previous section the squared bias B(Pn)2 to create an MSE-term. If d > 1, then in our proposal below we will construct an appropriate function of B(Pn) representing the analogue of the variance term (4):

equation M328

Additional rationale behind bias term: To provide further understanding of this kind of bias estimate B(Pn), we note the following. Let Ψ̂(Pn) be an estimator of its target Ψ̂(P0), where it plays the role of the δ-specific targeted maximum likelihood estimator equation M329. The fundamental assumption allowing statistical inference for Ψ̂(P0) is the assumption of asymptotic linearity:

equation M330
(5)

where D(P0) is the influence curve of the estimator, and R(Pn) is the remainder. The asymptotic linearity assumption now assumes that equation M331.

The representation (5) of the mapping Pn ↦ Ψ̂(Pn) implies for any cross-validation scheme Bn

equation M332

where we use that equation M333. Thus, our proposed bias estimate B(Pn) equals, for any cross-validation scheme, an average difference of the remainder applied to a subsample of size n(1–p) and the full sample of size n. Therefore, one can conclude that this term will be very sensitive to a large remainder (e.g., second order terms) in the asymptotic linearity expansion (5).

5.1.3. MSE of targeted maximum likelihood estimator relative to its δ-limit

If d = 1, then we define the MSE term as

equation M334

If d > 1, then we assume that we are provided with a user-specified d × d symmetric positive definite matrix ρ, so that the square root of this matrix ρ1/2 exists. Our MSE term will represent the expectation of the squared Euclidean norm of ρ1/2(Ψ̂ – ψ), or equivalently, the expectation of (Ψ̂ – ψ)⊤ρ(Ψ̂ – ψ). One concrete proposal is to set ρ1/2 equal to the square root of the inverse of an estimate of the correlation matrix of the asymptotic covariance matrix of equation M335, so that the linearly transformed vector has uncorrelated components.

Let aj be the j-th row of the matrix ρ1/2, j = 1, . . . , d. The desired MSE term is now the sum of the MSEs of the linear combinations equation M336. Therefore, the MSE term is represented as

equation M337

This is equivalent to defining a variance term

equation M338

a bias term

equation M339

and defining

equation M340

5.2. Targeted loss functions relying on a collaborative estimate of censoring mechanism

Let D*(Q0, g0, ψ0) be the efficient influence curve at dP0 = Q0g0 for the parameter Ψ : equation M341 → IR. Consider a set of estimators Qk(Pn) that are all more nonparametric than an initial estimator Q0(Pn). Let ĝ0 be a collaborative estimator, relative to the initial estimator Q0, so that P0D*(Q0*, g0, ψ0) = 0. Since Qk is more nonparametric than Q0, it is reasonable to assume that ĝ0 is also a collaborative estimator for Qk, so that equation M342, where now equation M343 denotes the limit of the targeted maximum likelihood estimator of Qk using ĝ0.

We can use as criterion for selection among Qk, the cross-validated estimated variance of equation M344, where equation M345 denotes the limit of the targeted maximum likelihood estimator of Qk using ĝ0. Of course, this is also a selection among the collaborative targeted maximum likelihood estimators equation M346.

By our asymptotic linearity theorem, this selection among Qk corresponds with minimizing the variance of the influence curve of the (collaborative) targeted maximum likelihood estimators equation M347 based on initial estimator Qk using the collaborative estimator ĝ0. The crucial assumption is that ĝ0 is indeed a collaborative estimator estimating a true g0 that involves enough adjustment, so that equation M348. However, this assumption is needed for the construction of estimators of ψ0 in any case, and it is weaker than the assumption required for standard double robustness.

This criterion can now be used as the preferred loss function in our template for the collaborative targeted maximum likelihood estimator to build the candidate censoring mechanism estimators. It will require obtaining a single collaborative estimator ĝ0 after having obtained the initial estimator: for example, one could carry out a dimension reduction based on improved fits relative to the initial estimator, and estimate g0 with a super learner only adjusting for the selected variables.

We already discussed that the same criterion can be used to select among any set of candidate collaborative targeted maximum likelihood estimators ( equation M349, ĝk), relying on collaborative estimators ĝk, so that P0D*(Qk*, gk, ψ0) = 0:

equation M350

However, the above only relies on a single collaborative estimator ĝ0, and is therefore particularly suitable for building the second stage within the collaborative targeted MLE template.

5.3. Targeted loss functions incorporating a sequence of increasingly nonparametric estimators of censoring mechanism

Consider an initial estimator Q. We wish to develop a criterion to select among candidate estimators of Q0 that are more nonparametric than Q, such as estimators using Q as offset. A special case of the loss function presented in this subsection is obtained when Q is empty. Consider a sequence of increasingly nonparametric estimators ĝj, j = 1, . . . , J, of g0, which could be based on Q: for example, they might be based on a dimension reduction whose goal is to estimate Q0 – Q. Consider the following hypothetical criterion for a candidate Q:

equation M351

for a list of weights/scalars (w(j) : j = 1, . . . , J), where ‖ · ‖ denotes a possibly weighted Euclidean norm applied to the vector of the form P0D*(Q, g).

Firstly, if Q = Q0, then the criterion is minimized. In addition, let j0(Q) be such that P0D*(Q, ĝj(P0)) = 0 for all j ≥ j0. Since this mean-zero property holds if ĝj solves a score implied by Q – Q0, it follows that the closer Q is to Q0, the more j-specific terms in this sum are close to zero. Finally, by the fact that D* is a gradient of the pathwise derivative of the target parameter Ψ, for j ≥ j0, we have that P0D*(Q, ĝj(P0)) ≈ Ψ(Q) – ψ0 (either exactly, or to first order), which shows that this criterion also targets ψ0 (see van der Laan and Robins (2003), Section 1.4).

The cross-validated analogue of this criterion is given by

equation M352

If ψ0 is one dimensional, a sensible choice of weights is given by the inverses of the variances of the empirical means, so that noisy j-specific signals are downweighted:

equation M353

and makes the criterion Dn(Q) unit free. Analogues of such weighting can be obtained for multi-dimensional ψ0 as well, and are recommended.

One can add this criterion to a valid loss function such as the log-likelihood criterion L(Qk), giving a more targeted loss function

equation M354

The additional term Dn preserves the validity of the loss function L (i.e., its minimum still identifies Q0), while making the selection targeted towards ψ0. One can add both Dn and the MSE-penalty presented above, where the latter is asymptotically negligible but important in situations in which the data are sparse w.r.t. ψ0.
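As a concrete illustration (a sketch for a one-dimensional ψ0, in Python), the following computes the cross-validated criterion Dn(Q) with the inverse-variance weights suggested above and adds it to a valid cross-validated loss; the array fold_means is assumed to hold, for each cross-validation split and each ĝj, the validation-sample mean of D*(Q, ĝj) based on the training-sample fits:

    import numpy as np

    def targeted_criterion(fold_means, eps=1e-12):
        # fold_means: (n_folds, J) array; entry (b, j) is the validation-sample mean
        # of D*(Q, g_j) for cross-validation split b. Inverse-variance weights
        # down-weight noisy j-specific signals and make the criterion unit free.
        var_of_means = fold_means.var(axis=0, ddof=1) + eps
        weights = 1.0 / var_of_means
        return float(np.mean((fold_means ** 2) @ weights))

    def targeted_loss(cv_loglik_loss, fold_means):
        # A valid loss (e.g., cross-validated negative log-likelihood or RSS)
        # plus the targeted term D_n(Q).
        return cv_loglik_loss + targeted_criterion(fold_means)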

6. Example: Targeted maximum likelihood estimation of the marginal structural model

Suppose we observe O = (W, A, Y = Y (A)), where W are baseline covariates, A is a discrete treatment, and Y is a subsequently measured outcome. It is assumed that A is realized in response to the realization of W, and Y is realized in response to both W and A. The full data structure on the experimental unit is X = (W, (Y (a) : a)), so that A represents the missingness variable for the missing data structure O on X. We assume the randomization assumption: g0(a | X) = P0(A = a | X) = P0(A = a | W).

Consider a marginal structural model for the full data distribution

equation M355

that models the causal effect of a treatment intervention A = a on the outcome Y. For example, one might assume a simple linear model m(a, V | β0) = β0(a, V, aV).

Since it is often unreasonable to assume such a parametric form, yet such parametric forms can still provide very meaningful projections of the true causal curve, we consider the following nonparametric extension of the parameter β0:

equation M356

where Q0(a, w) = E0(Y | A = a, W = w). We have that Ψh(P0) represents a projection of E0(Y (a) | V) onto the working model m(· | β0). Specifically,

equation M357

In particular, if E0(Y (a) | V) = m(a, V | β0), then for each h we have Ψh(P0) = β0. Without the randomization assumption and consistency assumption, we can interpret Ψh(P0) as the same projection, but E(Y (a) | V) = E(E(Y | A = a, W) | V) now represents a (non-causal) dose response curve of the effect of A on Y that controls for the measured confounders W.

We note that this nonparametric extension only depends on P0 through the conditional mean of Y, given A, W, and the marginal distribution of W. For simplicity, we will also use the notation Ψh(Q0), where Q0 now denotes both the marginal distribution of W and the conditional distribution of Y , given A, W.

The efficient estimating function for this nonparametric extension Ψh of β0 is given by:

equation M358

where equation M359. We will assume that h1(A, V) = d/dψ0 m(A, V | ψ0)h(A, V) is chosen so that h1 does not depend on ψ0, which is easily arranged when m is linear or logistic-linear in ψ0. For example, if we use the linear form (1, a, V, aV), then, if m is linear, we can choose h1 = d/dψ0 m(A, V | ψ0)g(A | V) = (1, A, V, AV)g(A | V), and, if m is logistic-linear, we can choose h1 = d/dψ0 m(A, V | ψ0)/(m(1 – m)(A, V | ψ0)) g(A | V), which again equals (1, A, V, AV)g(A | V).

Let equation M360 be the corresponding efficient influence curve obtained by standardizing the efficient estimating function by the negative of the inverse of the derivative matrix c0 = d/dψ0E0Dh(P0) (noting that Dh(P0) can indeed be viewed as a function in ψ0).

If Y is continuous and we use a normal error regression model as a working model, then a targeted maximum likelihood estimator of ψh0 can be obtained by adding to an initial estimator Q0(A, W) of E0(Y | A, W) the d-dimensional ε-extension εCh(g)(A, W), where

equation M361

for some fit g of g0, and fitting ε with maximum likelihood (i.e., least squares estimation) using Q0 as offset. The resulting update Q1(A, W) is now a first step targeted maximum likelihood estimator. It is also the actual targeted maximum likelihood estimator, since iteration does not result in further updates (the clever covariate does not change at a targeted MLE update). One estimates the distribution of W with the empirical distribution. The estimate Q1 and the empirical distribution of W now yield a substitution estimate of the target parameter ψh0. If Y is binary, the same εCh(g)(A, W) is added on the logit scale, and ε is fitted with maximum likelihood estimation. Again, the targeted maximum likelihood estimator converges in one step.
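A minimal sketch of this one-step update for continuous Y (in Python; the clever covariate is assumed to be supplied as an n × d matrix whose form is given by the display above, e.g., proportional to h1(A, V)/g(A | W) in this example) is:

    import numpy as np

    def tmle_update_continuous(Y, Q0_AW, C_AW):
        # Fit the d-dimensional fluctuation coefficient epsilon by least squares,
        # regressing the residual Y - Q0(A, W) on the clever covariate C_h(g)(A, W)
        # (no intercept, Q0 as offset), and add the fitted fluctuation to Q0.
        eps, *_ = np.linalg.lstsq(C_AW, Y - Q0_AW, rcond=None)
        return Q0_AW + C_AW @ eps, eps

The substitution estimate of ψh0 is then obtained by combining the updated fit with the empirical distribution of W in the parameter mapping Ψh; for binary Y the same covariate is added on the logit scale and ε is fitted by logistic regression with offset logit(Q0).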

6.1. Penalized log-likelihood for candidate treatment mechanism fits

Let Q(Pn) be an initial regression estimator of Q0 = E0(Y | A, W). For a given estimator Pn ↦ ĝ(Pn), let equation M362 be the targeted maximum likelihood estimator corresponding with the covariate Ch(ĝ(Pn)). Let Bn be a cross-validation scheme, and let equation M363 and equation M364 be the empirical distributions of the validation and training sample, respectively, as identified by Bn ∈ {0, 1}n. Let

equation M365

be the cross-validated estimate of the covariance matrix of the efficient influence curve at the estimator Q and ĝ. We also consider the empirical estimate of this covariance matrix

equation M366

Let

equation M367

be the bias estimator for the targeted maximum likelihood estimator equation M368, obtained by plugging equation M369 into the parameter mapping Ψ. Here we can use three-fold cross-validation as the choice for Bn.
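A minimal sketch of this cross-validated bias estimate (assuming a hypothetical helper psi_tmle(sample) that returns the targeted substitution estimate Ψ(Q*ĝ) computed on a given sample) is:

    import numpy as np

    def cv_bias_estimate(psi_tmle, folds, full_sample):
        # B(P_n): the average, over (e.g., three-fold) cross-validation splits, of the
        # difference between the targeted substitution estimate computed on the
        # training sample alone and the estimate computed on the full sample.
        psi_full = psi_tmle(full_sample)
        diffs = [psi_tmle(train) - psi_full for train, _valid in folds]
        return np.mean(diffs, axis=0)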

We will penalize the log-likelihood with an estimate of the following average mean squared error

equation M370

This mean squared error can be decomposed as equation M371 and equation M372. The variance terms of this mean squared error can be estimated by

equation M373

where

equation M374

We keep open the option that one uses either the cross-validated covariance matrix Σ̂CV(Pn) or the empirical covariance matrix Σ̂(Pn).

The bias terms of this mean squared error can be estimated as

equation M375

If m is linear in β, then this reduces to

equation M376

Thus, we obtain the following mean squared error estimate for the targeted maximum likelihood estimator Ψ̂ĝ(Pn) for a given g-estimator:

equation M377

We also suggest that the log-likelihood could instead be penalized by only the empirical variance component of the MSE. Therefore, we also define

equation M378

Consider now the following two penalized log-likelihood criteria for ĝ, given the initial estimator Q0:

equation M379

or

equation M380

For binary Y, the RSS is replaced by the log-likelihood of Y, given A, W.

We will use the penalized log-likelihood as loss function to build the candidate treatment mechanism estimators, according to the template for the collaborative targeted MLE in Section 2.

6.2. Algorithm for estimating the treatment mechanism based on penalized log-likelihood

Given any candidate adjustment set W* ⊂ W, let an estimator ĝ(Pn)(W*) of g0(A | W*) be specified. This allows us to define a criterion over candidate adjustment sets W*, given the current estimator Q:

equation M381

Thus, one can evaluate/score any given adjustment set W* with L(W* | Q).

Given Q, one can now use this empirical criterion over adjustment sets to construct an estimator of g0(Q) with a greedy-type algorithm maximizing over a set of candidate adjustment sets. One starts with the empty adjustment set and selects the best addition move among a set of candidate addition moves based on the criterion. One iterates this process until no addition move improves the criterion. More aggressive greedy algorithms can be implemented as well, as with any machine learning algorithm based on iterative local maximization of an empirical criterion. One could apply this algorithm to candidate adjustment sets that have a certain size or entropy for the corresponding ĝ(Pn)(W*).
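A minimal sketch of this greedy search (in Python; score(W_star) is an assumed helper returning the penalized log-likelihood criterion L(W* | Q) for the current Q) is:

    def greedy_adjustment_set(candidate_covariates, score, initial=()):
        # Greedy forward search over adjustment sets: starting from `initial`
        # (the empty set by default), repeatedly add the single covariate that
        # most improves the criterion, and stop when no addition improves it.
        selected = list(initial)
        current = score(selected)
        while True:
            best_cov, best_score = None, current
            for cov in candidate_covariates:
                if cov in selected:
                    continue
                candidate_score = score(selected + [cov])
                if candidate_score > best_score:
                    best_cov, best_score = cov, candidate_score
            if best_cov is None:
                return selected
            selected.append(best_cov)
            current = best_score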

Alternatively, one creates a sequence of nested (increasing in size) adjustment sets equation M382, j = 1, . . . , J; for each equation M383 one obtains a particular estimator ĝj(Pn) of equation M384 (e.g., using super learning), and maximizes the penalized log-likelihood criterion over all these J adjustment sets.

In our algorithm defining the sequence of C-TMLEs in the next subsection, we apply this greedy algorithm to candidate estimators that are more nonparametric than the estimator of g0 selected at the previous step.

6.3. Iteration to obtain sequence of targeted maximum likelihood estimators indexed by increasingly non-parametric estimators of treatment mechanism

Given an initial estimator Q of E(Y | A, W) and a corresponding estimator ĝ(Q) defined above, sometimes denoted with ĝ, we define a resulting targeted maximum likelihood estimator

equation M385

where εn is the least squares estimator of the regression coefficient ε, treating Q(Pn) as offset and h(ĝ(Q)(Pn)) as covariate. We can define this as a first step targeted maximum likelihood estimator based on an initial Q(Pn) and corresponding censoring mechanism estimator ĝ(Q). Let’s denote this operation as:

equation M386

This process can now be iterated by replacing Q(Pn) by this update Q1(Pn):

equation M387

where we require that the next censoring mechanism estimator ĝ(Q1)(Pn) is obtained with the same algorithm as presented in the subsection above, but now maximizing over candidate estimators that are more nonparametric than the previously obtained ĝ(Q)(Pn).

In general, we define the k-th step of this targeted maximum likelihood estimator as

equation M388

where ĝ(Qk−1)(Pn) involves maximizing over more nonparametric candidate estimators than ĝ(Qk–2)(Pn).

This algorithm results in a sequence of k-th step collaborative targeted maximum likelihood estimators Ψ(Qk(Pn)) of ψ0, and corresponding increasingly nonparametric censoring mechanism estimators ĝk(Pn) (i.e., ĝ(Qk–1)(Pn) in the notation above), k = 1, . . . , K.
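Putting the pieces together, a sketch of this iteration (reusing the hypothetical helpers greedy_adjustment_set and tmle_update_continuous from the earlier sketches, together with assumed helpers score_given_Q, fit_g, and clever_covariate) might look like:

    def ctmle_sequence(Y, Q0_AW, covariates, score_given_Q, fit_g, clever_covariate, K):
        # Builds the sequence of k-th step collaborative TMLEs: at each step the greedy
        # search restarts from the previously selected adjustment set (so each censoring
        # mechanism estimator is more nonparametric than the last), and the current Q^k
        # is used as offset in the next targeted fluctuation.
        Q_k, selected, sequence = Q0_AW, [], []
        for k in range(1, K + 1):
            selected = greedy_adjustment_set(
                covariates,
                score=lambda W_star: score_given_Q(W_star, Q_k),
                initial=selected)
            g_k = fit_g(selected)              # e.g., a super learner fit of g0(A | W*)
            C_k = clever_covariate(g_k)        # clever covariate C_h(g_k)(A, W)
            Q_k, _ = tmle_update_continuous(Y, Q_k, C_k)
            sequence.append((Q_k, g_k))
        return sequence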

We could also have defined candidate targeted maximum likelihood estimators using a forward selection algorithm that at each step finds the best next term to add to the treatment mechanism, so that k denotes the number of terms included in ĝk. In that case, k corresponds to the size of the model for ĝk, and the targeted MLE step would be carried out when needed in order to guarantee an increase in either the penalized or non-penalized log-likelihood fit of Qk, as described in our general template for collaborative targeted maximum likelihood estimation in Section 2.

6.4. Collaborative TMLEs

If the initial estimator Q is indexed by a choice δ1 and the choice of algorithm ĝ(Q) is indexed by a δ2, then, for each δ1, δ2, this results in candidate k-th step collaborative targeted maximum likelihood estimators equation M389, corresponding treatment mechanism estimators equation M390, and corresponding equation M391 targeted maximum likelihood estimators of ψ0, indexed by k.

For each δ1, δ2, in order to select among these candidate targeted maximum likelihood estimators indexed by k, we use the cross-validated penalized log-likelihood defined as:

equation M392

This now results in candidate (δ1, δ2)-specific collaborative targeted maximum likelihood estimators equation M393, with corresponding initial estimator Qδ1 and collaborative treatment mechanism estimator ĝδ1,δ2 (note that the choice of k is now a function of (δ1, δ2), so that the collaborative estimator ĝ is also indexed by these choices).
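A sketch of this cross-validated selection of k for fixed (δ1, δ2) (assuming hypothetical helpers fit_sequence(train), which rebuilds the K candidate targeted fits on a training sample, and penalized_loglik(Q_k, valid), which evaluates the penalized log-likelihood on a validation sample) is:

    import numpy as np

    def select_k_by_cv(folds, fit_sequence, penalized_loglik, K):
        # For each cross-validation split, rebuild the whole sequence Q^1, ..., Q^K on
        # the training sample and score each Q^k with the penalized log-likelihood on
        # the validation sample; return the k with the best average score (larger is
        # better for a penalized log-likelihood; flip the sign if working with a loss).
        scores = np.zeros(K)
        for train, valid in folds:
            for k, (Q_k, _g_k) in enumerate(fit_sequence(train)):
                scores[k] += penalized_loglik(Q_k, valid)
        return int(np.argmax(scores)) + 1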

6.5. Selection among candidate collaborative targeted maximum likelihood estimators

We could select δ1, δ2 by minimizing the same cross-validated penalized log-likelihood, e.g., by simply minimizing the above criterion simultaneously over the triplets (k, δ1, δ2). Alternatively, we could employ empirical efficiency maximization for all these candidate collaborative targeted maximum likelihood estimators, which are assumed to be asymptotically linear with influence curve equal to the efficient influence curve plus a contribution from the collaborative estimator of g0, as stated in our asymptotic linearity theorem. Thus, by ignoring this latter contribution to the influence curve, we could also select δ1, δ2 as the minimizer of the sum of the variances of the components of the efficient influence curve of ψ0 (with δ = (δ1, δ2)):

equation M394

Other criteria based on the vector-efficient influence curve could be considered as well.

6.6. Statistical inference based on CLT

The resulting collaborative targeted maximum likelihood estimator Qn = Q*(Pn) and corresponding gn = ĝ(Pn) solve the efficient influence curve equation 0 = PnD*(Qn, gn, Ψ(Qn)), so that ψn = Ψ(Qn) can be analyzed with our asymptotic linearity theorem, and inference can be based on the influence curve. So we could estimate the covariance matrix as

equation M395

where one should include the gn-component to obtain more accurate inference. Statistical inference would be based on the normal working model ψn ~ N(ψ0, Σn) to construct confidence intervals, confidence bands, p-values, and possibly multiple-testing-adjusted p-values.
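A minimal sketch of this influence-curve based inference (in Python; ic_values is assumed to be an n × d matrix holding the estimated influence curve, ideally including the gn-contribution, evaluated at the n observations) is:

    import numpy as np
    from statistics import NormalDist

    def wald_inference(psi_n, ic_values, level=0.95):
        # Estimate Sigma_n as the empirical covariance matrix of the influence curve
        # divided by n, and form component-wise Wald confidence intervals for psi_n.
        n = ic_values.shape[0]
        Sigma_n = np.atleast_2d(np.cov(ic_values, rowvar=False)) / n
        se = np.sqrt(np.diag(Sigma_n))
        z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
        psi = np.atleast_1d(np.asarray(psi_n, dtype=float))
        ci = np.column_stack([psi - z * se, psi + z * se])
        return se, ci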

7. Discussion

For most data sets little to no knowledge is available about the data generating distribution. Consequently, the true model is a large infinite dimensional semi-parametric model. In such models many data adaptive approaches can be considered for fitting the true distribution of the data, based on different approximation function spaces, different searching strategies for maximizing an empirical criterion (such as the empirical log-likelihood) over these spaces, and different methods for selecting the fine tuning parameters indexing the function spaces and search strategies. Depending on the true data generating distribution, these algorithms will have very different levels of performance in approximating the true data generating distribution. As a consequence, cross-validation based super learning should be employed to find the best weighted combination among a large user supplied set of candidate estimators of the true data generating distribution. The user has the option to select the desired loss function. The oracle property of the cross-validation selector (van der Vaart et al. (2006), van der Laan et al. (2006)) teaches us that the super learner will asymptotically perform exactly as well, w.r.t. the loss-based (e.g., Kullback-Leibler) dissimilarity measure, as the best weighted combination of the candidate algorithms optimized for each data set. A crucial assumption is that the loss function used in this super learning methodology is uniformly bounded in all the candidate estimators. This could be viewed as a semi-parametric model assumption, and it is essential for robustness of the resulting estimator against outliers. It can be arranged by bounding the candidate estimators so that the loss function remains bounded.

The super learner fit represents a best fit for the purpose of estimation of the whole distribution of the data w.r.t. the loss-function specific dissimilarity, so that the bias-variance trade-off is not targeted (enough) w.r.t. the parameter of interest. As a result, the estimate is overly biased for the smooth target parameter.

Therefore, our methodology involves a second, targeted modification of the first stage super learner fit that aims to reduce the bias w.r.t. the target parameter, while simultaneously increasing the likelihood fit. A single fluctuation function that would yield asymptotically optimal bias reduction, as defined by the efficient influence curve of the target parameter, is determined. This fluctuation function needs to have a score vector at zero fluctuation whose linear span includes the efficient influence curve of the target parameter. This fluctuation function depends on an unknown nuisance parameter of the data generating distribution, such as a censoring mechanism.

In one particular embodiment of our C-TMLE, we define an iterative sequence of subsequent fluctuations, starting with the initial super learner fit, where the subsequent fluctuation functions are estimated with increasingly nonparametric estimates of the nuisance parameter, including a final, fully nonparametrically estimated fluctuation function. In addition, by construction, we make sure that for each fluctuation function the nuisance parameter estimator that results in the maximal increase in likelihood (or preferred loss function) fit is selected, among the candidate nuisance parameter estimators that are more nonparametric than the one selected for the previous fluctuation function. In this way, we arrange that most of the targeted bias reduction occurs in the first few fluctuations. The actual number of times we carry out the subsequent update is selected with likelihood-based cross-validation.

Essentially, we try to move towards the asymptotically optimal bias reduction (identified by the true nuisance parameter/censoring mechanism used in the data generating distribution) along a sequence of targeted bias reduction steps, but we stop moving towards this asymptotically optimal bias reduction when it results in a loss of likelihood fit as measured by the cross-validated log-likelihood. In general, our template for C-TMLE allows a fine sequence of nested targeted bias reduction steps (i.e., a fine sequence of candidate second stage estimators indexed by increasingly nonparametric nuisance parameter estimators) whose fits contain this set of candidate fits as a subsequence, thereby potentially providing an additional improvement in the practical performance of the resulting C-TMLE.

Theoretical results teach us that this push towards the asymptotically optimal bias reduction takes into account how well the initial estimator approximates the true distribution, by giving preference to targeted bias reduction steps that improve the log-likelihood fit using the initial estimator as offset. As a consequence, the C-TMLE is able to avoid selecting irrelevant or harmful (w.r.t. the relevant factor of the density) fits of the nuisance parameter, even though such fits might improve the overall fit of the nuisance parameter. That is, the fit of the nuisance parameter is targeted towards our primary goal, the parameter of interest.

Specifically, we establish a general asymptotic linearity theorem for collaborative targeted maximum likelihood estimators, which provides us with the influence curve of our estimator, and thereby statistical inference. Fortunately, the influence curve happens to be equal to the efficient influence curve at the collaborative nuisance parameter estimator (its limit) plus a contribution of the nuisance parameter estimator as an estimator of its limit, providing immediate variance estimates. Gains in efficiency, resulting in possible super efficiency, are obtained by estimating the nuisance parameter collaboratively, in relation to the initial estimator. An inspection of the efficient influence curve allows us to precisely define the sufficient and minimal term H(g0, Q – Q0) one needs to adjust for in g0 to achieve the desired full bias reduction. We propose to estimate this term and force adjustment by this term in the candidate censoring mechanism estimators within the C-TMLE procedure, without relying on its correct specification, thereby preserving and only enhancing the double robustness of the C-TMLE.

The targeted maximum likelihood estimator itself relies on maximizing an empirical mean of a loss function over a fluctuation function, and the derivative of this empirical criterion at zero fluctuation needs to be the empirical mean of the efficient influence curve at the estimator to be fluctuated. We refer to this loss as the log-likelihood loss, but it needs to be understood that this choice can already be targeted (e.g., van der Laan (2008b)) in the sense that it is typically implied by the efficient influence curve of the target parameter (e.g., one would not use a likelihood based on factors that are not needed to evaluate the target parameter). In addition, we propose to replace this log-likelihood by a penalized log-likelihood, where the penalty is scaled appropriately, has negligible contribution for nicely identifiable target parameters, but blows up for fits that result in extremely variable or biased estimators of the parameter of interest. Even though the penalty’s effect on the Kullback-Leibler dissimilarity is asymptotically negligible for identifiable parameters, for parameters that are borderline identifiable this penalty can yield dramatic additional finite sample improvements to the C-TMLE. In essence, it builds in a sensible robustness of the resulting C-TMLE as an estimator of the target parameter.

Given the initial estimator, the candidate censoring mechanism estimators, and the corresponding sequence of targeted MLEs indexed by these increasingly nonparametric estimators of the censoring mechanism, the log-likelihood or penalized log-likelihood is used for cross-validated selection of the depth of the bias reduction (i.e., for selection among these candidate targeted MLEs) in the C-TMLE algorithm. However, a preferred loss function can be used to build the initial estimator, to build the candidate censoring mechanism estimators, and to select among different collaborative C-TMLEs. In particular, one could use the penalized log-likelihood as the preferred loss function.

We propose, as a possibly more targeted selection criterion, the cross-validated variance of the efficient influence curve (e.g., if the target parameter is one-dimensional), where a collaborative estimator is used for the nuisance parameter (censoring mechanism) in the efficient influence curve: i.e., we propose as loss function for a candidate Q the square of the efficient influence curve at its targeted version, using a collaborative estimator of g0. Indeed, equation M396 is the variance of a collaborative double robust estimator of ψ0, indexed by the initial estimator Q and using the known g0, and is thereby a valid and sensible loss function. In order to also take into account that g0 is estimated, one could instead minimize the variance of the actual influence curve of the collaborative double robust targeted MLE using Q as initial estimator and gn as collaborative estimator. This would imply the same square of the influence curve, but now the influence curve equals D* plus a contribution from estimation of g0.

Finally, we are also able to select a targeted loss function for g0 in the C-TMLE template, by making its loss-based dissimilarity a measure of goodness of fit of the g0-specific fluctuation function in which the estimator of g0 is used.

The collaborative double robust targeted maximum likelihood estimator utilizes 1) loss-based super learning to obtain the best initial estimator of Q0 (grounded by theory for the cross-validation selector), 2) loss-based super learning to generate the best estimators of candidate censoring mechanisms adjusting for increasingly large adjustment sets (grounded by theory for the cross-validation selector), 3) specification of loss functions that target the Q0, g0 needed for identification of the efficient influence curve, 4) targeted maximum likelihood estimation along a fluctuation function using such a censoring mechanism estimator to remove bias for the target parameter (grounded by theory for targeted maximum likelihood estimation), 5) Q0-(penalized)-likelihood based cross-validation to select among such candidate censoring mechanism estimators, and thereby obtain the collaborative estimator of the censoring mechanism for the fluctuation function that removes the bias while avoiding unnecessary loss in variance (grounded by oracle results for cross-validation, collaborative double robustness, and our asymptotic linearity theorem), 6) efficient influence curve based dimension reductions (the minimal sufficient covariate) to be included in the censoring mechanism estimators, which allow for effective bias reduction in the TMLE if correctly specified (and will not harm the collaborative double robustness if not) (grounded by theory on the efficient influence curve representation and our collaborative double robustness theorem), 7) efficient influence curve based loss functions for Q0 to build more targeted candidate censoring mechanism estimators, and 8) efficient influence curve based loss functions for Q0 to make the initial estimator more targeted and to make the selection among candidate C-TMLEs more targeted, resulting in smaller asymptotic variance of the resulting C-TML estimator of the target parameter (grounded by empirical efficiency maximization theory, and our asymptotic linearity theorem for C-TMLE).

To summarize, this article provides a template for loss-based adaptive (super) efficient estimators in semiparametric models that are targeted towards a particular target feature of the distribution of the data, and for which statistical inference is available.

Footnotes

*We would like to thank the referees for their helpful comments. This research was supported by NIH grant R01 A1074345-03.

References

  • Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. Springer-Verlag; New York: 1993.
  • Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, van der Laan MJ. Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. Statistics in Medicine, http://www3.interscience.wiley.com/journal/121422393/abstract, 2008. [PubMed]
  • Bembom O, van der Laan MJ, Haight T, Tager IB. Lifetime and current leisure time physical activity and all-cause mortality in an elderly cohort. Epidemiology. 2009 doi: 10.1097/EDE.0b013e31819e3f28. [PubMed] [Cross Ref]
  • Bembom Oliver, van der Laan Mark. Statistical methods for analyzing sequentially randomized trials, commentary on JNCI article Adaptive therapy for androgen independent prostate cancer: A randomized selection trial including four regimens, by Peter F. Thall, C. Logothetis, C. Pagliaro, S. Wen, M.A. Brown, D. Williams, R. Millikan. Journal of the National Cancer Institute. 2007;99(21):1577–1582. doi: 10.1093/jnci/djm185. [PMC free article] [PubMed] [Cross Ref]
  • Bickel PJ, Klaassen CAJ, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag; 1997.
  • Bryan J, Yu Z, van der Laan MJ. Analysis of longitudinal marginal structural models. Biostatistics. 2003;5(3):361–380. doi: 10.1093/biostatistics/kxg041. [PubMed] [Cross Ref]
  • Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology. 2005;2(2):131–154. doi: 10.1016/j.stamet.2005.02.003. [Cross Ref]
  • Gill R, Robins JM. Causal inference in complex longitudinal studies: continuous case. Ann Stat. 2001;29(6)
  • Gill RD, van der Laan MJ, Robins JM. Coarsening at random: characterizations, conjectures and counter-examples. In: Lin DY, Fleming TR, editors. Proceedings of the First Seattle Symposium in Biostatistics. New York: Springer Verlag; 1997. pp. 255–94.
  • Heitjan DF, Rubin DB. Ignorability and coarse data. Annals of statistics. 1991 Dec;19(4):2244–2253. doi: 10.1214/aos/1176348396. [Cross Ref]
  • Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [PubMed] [Cross Ref]
  • Jacobsen M, Keiding N. Coarsening at random in general sample spaces and random censoring in continuous time. Annals of Statistics. 1995;23:774–86. doi: 10.1214/aos/1176324622. [Cross Ref]
  • Kang J, Schafer J. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007a;22:523–39. doi: 10.1214/07-STS227. [PMC free article] [PubMed] [Cross Ref]
  • Kang J, Schafer J. Rejoinder: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007b;22:574–80. doi: 10.1214/07-STS227REJ. [PMC free article] [PubMed] [Cross Ref]
  • Keleş S, van der Laan M, Dudoit S. Asymptotically optimal model selection method for regression on censored outcomes. Technical Report, Division of Biostatistics, UC Berkeley. 2002
  • Moore KL, van der Laan MJ. Covariate adjustment in randomized trials with binary outcomes Technical report 215, Division of Biostatistics, University of California; Berkeley: April2007.
  • Moore KL, van der Laan MJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009.
  • Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B. 2003;65(2)
  • Murphy SA, van der Laan MJ, Robins JM. Marginal mean models for dynamic treatment regimens. Journal of the American Statistical Association. 2001;96:1410–1424. doi: 10.1198/016214501753382327. [PMC free article] [PubMed] [Cross Ref]
  • Neugebauer R, van der Laan MJ. Why prefer double robust estimates. Journal of Statistical Planning and Inference. 2005;129(1–2):405–426. doi: 10.1016/j.jspi.2004.06.060. [Cross Ref]
  • Petersen Maya L, Deeks Steven G, Martin Jeffrey N, van der Laan Mark J. History-adjusted marginal structural models: Time-varying effect modification and dynamic treatment regimens Technical report 199, Division of Biostatistics, University of California; Berkeley: December2005.
  • Polley EC, van der Laan MJ. Predicting optimal treatment assignment based on prognostic factors in cancer patients. In: Peace Karl E., editor. Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall; 2009.
  • Ridgeway G, McCaffrey D. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:540–43. doi: 10.1214/07-STS227C. [PMC free article] [PubMed] [Cross Ref]
  • Robins J, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Statistics in Medicine. 2008;27(23):4678–4721. doi: 10.1002/sim.3301. [PubMed] [Cross Ref]
  • Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer” Statistica Sinica. 2001a;11(4):920–936.
  • Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000a;11(5):550–560. doi: 10.1097/00001648-200009000-00011. [PubMed] [Cross Ref]
  • Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000b;450:431–435. doi: 10.2307/2669381. [Cross Ref]
  • Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science. 2007;22:544–559. doi: 10.1214/07-STS227D. [Cross Ref]
  • Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association; 2000a.
  • Robins JM. Discussion of “optimal dynamic treatment regimes” by Susan A. Murphy. Journal of the Royal Statistical Society: Series B. 2003;65(2):355–366.
  • Robins JM. Optimal structural nested models for optimal sequential decisions. Heagerty PJ, Lin DY, editors. Proceedings of the 2nd Seattle symposium in biostatistics. 2005a;179:189–326.
  • Robins JM. Optimal structural nested models for optimal sequential decisions Technical report, Department of Biostatistics, Harvard University; 2005b.
  • Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [Cross Ref]
  • Robins JM. The analysis of randomized and non-randomized aids treatment trials using a new approach in causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health Service Methodology: A Focus on AIDS. U.S. Public Health Service, National Center for Health Services Research; Washington D.C.: 1989. pp. 113–159.
  • Robins JM. Proceeding of the Biopharmaceutical section. American Statistical Association; 1993. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers; pp. 24–33.
  • Robins JM. Causal inference from complex longitudinal data. In: Berkane Editor M., editor. Latent Variable Modeling and Applications to Causality. Springer Verlag; New York: 1997a. pp. 69–117.
  • Robins JM. Structural nested failure time models. In: Armitage P, Colton T, Andersen PK, Keiding N, editors. The Encyclopedia of Biostatistics. John Wiley and Sons; Chichester, UK: 1997b.
  • Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer; New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133.
  • Robins JM, Rotnitzky A. Comment on Inference for semiparametric models: some questions and an answer, by Bickel, P.J. and Kwon, J. Statistica Sinica. 2001b;11:920–935.
  • Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: AIDS Epidemiology, Methodological Issues. Birkhäuser; 1992.
  • Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies The International Journal of Biostatistics, page http://www.bepress.com/ijb/vol4/iss1/19/, 2008. doi: 10.2202/1557-4679.1115. [PMC free article] [PubMed] [Cross Ref]
  • Rose S, van der Laan MJ. Why match? investigating matched case-control study designs with causal effect estimation The International Journal of Biostatistics, page http://www.bepress.com/ijb/vol5/iss1/1/, 2009. doi: 10.2202/1557-4679.1127. [PMC free article] [PubMed] [Cross Ref]
  • Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi: 10.1093/biomet/70.1.41. [Cross Ref]
  • Rosenblum M, Deeks SG, van der Laan MJ, Bangsberg DR. The risk of virologic failure decreases with duration of HIV suppression, at greater than 50% adherence to antiretroviral therapy. PLoS ONE. 2009;4(9):e7196. doi: 10.1371/journal.pone.0007196. [PMC free article] [PubMed] [Cross Ref]
  • Rubin DB. Matched Sampling for Causal Effects. Cambridge University Press; Cambridge, MA: 2006.
  • Rubin DB, van der Laan MJ. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics. 2008;4(1): Article 5. doi: 10.2202/1557-4679.1084. [PMC free article] [PubMed] [Cross Ref]
  • Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The matching package for R. Journal of Statistical Software, Forthcoming. 2008
  • Sinisi S, van der Laan MJ. The deletion/substitution/addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1)
  • Tan Z. Comment: Understanding or, ps and dr. Statistical Science. 2007;22:560–568. doi: 10.1214/07-STS227A. [Cross Ref]
  • Tsiatis A, Davidian M. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:569–73. doi: 10.1214/07-STS227B. [PMC free article] [PubMed] [Cross Ref]
  • Tuglus C, van der Laan MJ. Targeted methods for biomarker discovery, the search for a standard UC Berkeley Working Paper Series, page http://www.bepress.com/ucbbiostat/paper233/, 2008.
  • van der Laan MJ. Causal effect models for intention to treat and realistic individualized treatment rules Technical report 203, Division of Biostatistics, University of California; Berkeley: 2006.
  • van der Laan MJ. Estimation based on case-control designs with known prevalence probability The International Journal of Biostatistics, page http://www.bepress.com/ijb/vol4/iss1/17/, 2008a. doi: 10.2202/1557-4679.1114. [PubMed] [Cross Ref]
  • van der Laan MJ. The construction and analysis of adaptive group sequential designs Technical report 232, Division of Biostatistics, University of California; Berkeley: March2008b.
  • van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples Technical report, Division of Biostatistics, University of California; Berkeley: November2003.
  • van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics. 2007;3(1) [PMC free article] [PubMed]
  • van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003.
  • van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [Cross Ref]
  • van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1036. [PubMed] [Cross Ref]
  • van der Laan MJ, Petersen ML, Joffe MM. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. The International Journal of Biostatistics. 2005;1(1):10–20. doi: 10.2202/1557-4679.1003. [Cross Ref]
  • van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Statistics and Decisions. 2006;24(3):373–395. doi: 10.1524/stnd.2006.24.3.373. [Cross Ref]
  • van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1): Article 25. doi: 10.2202/1544-6115.1309. [PubMed] [Cross Ref]
  • van der Laan MJ, Rose S, Gruber S. Readings on targeted maximum likelihood estimation Technical report, working paper series http://www.bepress.com/ucbbiostat/paper254/, September2009.
  • van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996.
  • van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Statistics and Decisions. 2006;24(3):351–371. doi: 10.1524/stnd.2006.24.3.351. [Cross Ref]
  • Yu Z, van der Laan MJ. Construction of counterfactuals and the G-computation formula Technical report, Division of Biostatistics, University of California; Berkeley: 2002.
  • Yu Z, van der Laan MJ. Double robust estimation in longitudinal marginal structural models Technical report, Division of Biostatistics, University of California; Berkeley: 2003.
