Recall that we consider two-stage sampling designs in which one takes a random sample from a target population, measures $V$ on each subject in this first stage, and draws a subsample in which one collects additional data. Inclusion in the subsample can be influenced by $V$. This data structure is a missing-data structure on the full-data structure $X$ collected in the second stage. The observed data structure is $O = (V, \Delta, \Delta X)$, where $V$ is included in $X$, and $\Delta$ denotes the indicator of inclusion in the second-stage sample. The sample can then be represented as $n$ i.i.d. copies $O_1, \ldots, O_n$ of $O$.

Let $P_{X,0}$ be the true probability distribution of $X$, and let $\mathcal{M}^F$ be a statistical model for $P_{X,0}$. Let $\Psi^F : \mathcal{M}^F \to \mathbb{R}^d$ be the target parameter of the full-data distribution, so that $\psi_0 = \Psi^F(P_{X,0})$ is the parameter of the true probability distribution of $X$. We will denote the efficient influence curve of $\Psi^F$ at a full-data distribution $P_X$ by $D^F(P_X)$.

Let $g_{\Delta,0}(\delta \mid X) = P_{X,0}(\Delta = \delta \mid X)$ be the conditional probability distribution of $\Delta$, given $X$. We make the missing at random (MAR) assumption, which states that $g_{\Delta,0}(\delta \mid X) = g_{\Delta,0}(\delta \mid V)$, i.e., $\Delta$ is independent of $X$, given $V$. For notational convenience, let $\Pi_0(V) \equiv g_{\Delta,0}(1 \mid V)$. This missingness mechanism might be known, a model might be available, or no further assumptions are made beyond MAR. Either way, the missingness mechanism can be estimated from the data $(\Delta_i, V_i)$, $i = 1, \ldots, n$, extracted from the observations $O_i$, $i = 1, \ldots, n$.

The statistical model $\mathcal{M}$ for the probability distribution $P_0$ of $O$ is now defined in terms of the full-data statistical model and the model on the missingness mechanism. The efficient influence curve of $\Psi^F(P_{X,0})$ as an identifiable parameter of $P_0$ will be denoted by $D^*(P_0) = D^*(P_{X,0}, \Pi_0)$. We wish to estimate $\psi_0 = \Psi^F(P_{X,0})$ based on a sample of $n$ i.i.d. observations $O_1, \ldots, O_n$ from $P_0$.

3.1. TMLE

The TMLE is a general procedure for estimation of a target parameter of the data-generating distribution in semiparametric models (van der Laan and Rubin, 2006). It marries the locally efficient, double robust properties of estimating-function-based methodology with the properties of maximum likelihood estimation. TMLEs are well-defined, loss-based, efficient, unbiased substitution estimators of the target parameter of the data-generating distribution.

Suppose that, given $n$ i.i.d. observations $X_1, \ldots, X_n$, $P^*_{X,n}$ is a TMLE of $P_{X,0}$, and $\Psi^F(P^*_{X,n})$ is the corresponding TMLE of $\psi_0$. Specifically, let $L^F(P_X)(X)$ be a full-data loss function (e.g., the log-likelihood loss function) so that

$$P_{X,0}\, L^F(P_{X,0}) = \min_{P_X \in \mathcal{M}^F} P_{X,0}\, L^F(P_X).$$

Let $P^0_{X,n}$ be an initial estimator of $P_{X,0}$, possibly an $L^F$-loss-based super learner (van der Laan et al., 2007). In addition, let $\{P_X(\epsilon) : \epsilon\}$ be a parametric working submodel of $\mathcal{M}^F$ through $P_X$ at $\epsilon = 0$ so that its score at $\epsilon = 0$ equals, or spans, the full-data efficient influence curve:

$$D^F(P_X) \in \left\langle \left.\frac{d}{d\epsilon} L^F(P_X(\epsilon))\right|_{\epsilon = 0} \right\rangle,$$

where $\langle \cdot \rangle$ denotes the linear span of the components of the score. Such a TMLE $P^*_{X,n}$ is then defined as follows. For $k = 1, \ldots, K$, one computes the amount of fluctuation

$$\epsilon^k_n = \arg\min_\epsilon P^F_n\, L^F\!\left(P^{k-1}_{X,n}(\epsilon)\right)$$

for $P^{k-1}_{X,n}$, and one sets $P^k_{X,n} = P^{k-1}_{X,n}(\epsilon^k_n)$. Here $P^F_n$ is defined as the empirical distribution of the full data $X_1, \ldots, X_n$, and, for a function $f$ of $X$ and a probability distribution $P$, we use the notation $Pf \equiv \int f(x)\, dP(x)$. This updating process is iterated until convergence is achieved, i.e., $K$ is chosen so that $\epsilon^K_n \approx 0$. The final update $P^K_{X,n}$ is denoted by $P^*_{X,n}$, and is called the TMLE of $P_{X,0}$. By the score condition on the working fluctuation model, it follows that

$$P^F_n\, D^F(P^*_{X,n}) = 0.$$
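To make the fluctuation-and-update loop concrete, here is a minimal sketch of a full-data TMLE of the additive treatment effect $E_{X}[\bar Q(1,W) - \bar Q(0,W)]$ for binary $Y$, using the standard logistic fluctuation submodel whose score is the $Y$-component of the efficient influence curve. The function names (`tmle_risk_difference`, `Qbar`, `g1`) and the damped Newton solver for $\epsilon_n$ are illustrative choices, not part of the formal definition above; the initial estimator and propensity score are taken as given rather than fit by super learning:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_risk_difference(W, A, Y, Qbar, g1, tol=1e-12):
    """TMLE of E[Qbar0(1,W) - Qbar0(0,W)] for binary Y (full data).

    Qbar: function (a, w) -> initial estimate of E(Y | A=a, W=w)
    g1:   function w -> P(A=1 | W=w) (known or estimated)
    """
    g = g1(W)
    H = A / g - (1 - A) / (1 - g)          # clever covariate H*(A, W)
    off = logit(np.clip(Qbar(A, W), 1e-9, 1 - 1e-9))
    eps = 0.0
    for _ in range(100):                   # damped Newton solve of the score eq.
        p = expit(off + eps * H)
        score = np.sum(H * (Y - p))
        info = np.sum(H**2 * p * (1 - p))
        step = np.clip(score / info, -1.0, 1.0)
        eps += step
        if abs(step) < tol:
            break
    # substitution estimator evaluated at the updated (targeted) fit
    Q1 = expit(logit(np.clip(Qbar(1, W), 1e-9, 1 - 1e-9)) + eps / g)
    Q0 = expit(logit(np.clip(Qbar(0, W), 1e-9, 1 - 1e-9)) - eps / (1 - g))
    return np.mean(Q1 - Q0)
```

By construction, the fitted $\epsilon_n$ solves the empirical score equation, so the returned substitution estimator solves the $Y$-component of the efficient influence curve equation.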

3.2. IPCW-TMLE

Given the TMLE developed for the full-data structure, we propose estimating $\psi_0$ based on $O_1, \ldots, O_n$ with an IPCW-TMLE. This IPCW-TMLE is simply defined by the above procedure with the addition of weights $\Delta_i / \Pi_n(V_i)$ for observations $i = 1, \ldots, n$, where $\Pi_n(V)$ is an estimator of $\Pi_0(V) \equiv g_{\Delta,0}(1 \mid V)$. Thus, this IPCW-TMLE involves the following steps:

**IPCW initial estimator.** Compute an initial IPCW-loss-based estimator $P^0_{X,n}$ (e.g., using super learning; van der Laan et al., 2007) based on, for example, the IPCW-loss function

$$L(P_X, \Pi_n)(O) = \frac{\Delta}{\Pi_n(V)}\, L^F(P_X)(X).$$

Typically, this initial estimator is obtained by providing the initial estimator of $P_{X,0}$ in the full-data TMLE with the IPCW weights.

**IPCW-TMLE.** For $k = 1, \ldots, K$, one computes the amount of fluctuation

$$\epsilon^k_n = \arg\min_\epsilon \sum_{i=1}^n \frac{\Delta_i}{\Pi_n(V_i)}\, L^F\!\left(P^{k-1}_{X,n}(\epsilon)\right)(X_i)$$

for $P^{k-1}_{X,n}$, and one sets $P^k_{X,n} = P^{k-1}_{X,n}(\epsilon^k_n)$. This updating process is iterated until convergence is achieved, i.e., $K$ is chosen so that $\epsilon^K_n \approx 0$. The final update is denoted by $P^*_{X,n}$, and is called the IPCW-TMLE of $P_{X,0}$.

**Estimator of the target parameter.** Finally, one evaluates the target parameter $\psi^*_n = \Psi^F(P^*_{X,n})$. This is the TMLE of $\psi_0$.
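For discrete $V$, the weights $\Delta_i/\Pi_n(V_i)$ that turn the full-data TMLE into the IPCW-TMLE can be computed with the nonparametric (saturated) estimator of $\Pi_0$: the empirical second-stage inclusion fraction within each level of $V$. A small sketch (the function name is illustrative; each level of $V$ is assumed to contain at least one subsampled observation):

```python
import numpy as np

def ipcw_weights(delta, v):
    """Nonparametric IPCW weights Delta_i / Pi_n(V_i) for discrete V.

    Pi_n(level) is the empirical fraction of second-stage inclusion
    (delta == 1) among first-stage subjects with V == level.
    """
    delta = np.asarray(delta, dtype=float)
    v = np.asarray(v)
    pi = np.empty_like(delta)
    for level in np.unique(v):
        mask = v == level
        pi[mask] = delta[mask].mean()      # Pi_n(level)
    return delta / pi                      # weight is zero when Delta_i = 0
```

Within each level of $V$, the weights of the subsampled observations sum to the number of first-stage subjects at that level, which is exactly the property used in the efficiency result of Sect. 3.2.1.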

As is apparent from the above definition, the IPCW-TMLE is a targeted minimum loss based estimator (also TMLE), the generalization of TMLE (van der Laan, 2008a; van der Laan et al., 2009; van der Laan and Rose, 2011), but with the loss function defined as the IPCW full-data loss function, and a parametric submodel $P_X(\epsilon)$ with score $(\Delta/\Pi_0(V))\, D^F(P_X)$ at $\epsilon = 0$.

Since it solves the IPCW full-data efficient influence curve equation, the IPCW-TMLE has an influence curve equal to

$$\frac{\Delta}{\Pi_0(V)}\, D^F(P_X)(X)$$

if $\Pi_0$ is known, where $P_X$ denotes the limit of $P^*_{X,n}$ (see next section). Double robustness properties of the full-data efficient influence curve are immediately inherited by the IPCW-TMLE. If $\Pi_0(V)$ is consistently estimated with a maximum likelihood estimator, the influence curve of the IPCW-TMLE equals $(\Delta/\Pi_0(V))\, D^F(P_X)$ minus its projection on the tangent space of the model used for $\Pi_0$. As shown below, if we use a nonparametric maximum likelihood estimator for $\Pi_0$ and the full-data model is nonparametric, then the IPCW-TMLE solves the actual efficient influence curve equation, so that the IPCW-TMLE is efficient if $P_X = P_{X,0}$. As with any asymptotically linear estimator, an estimate of the asymptotic variance of the standardized estimator $\sqrt{n}(\psi^*_n - \psi_0)$ is given by the empirical variance of the estimated influence curve.
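Given estimated influence curve values $IC_n(O_i)$, the variance estimate and a Wald-type confidence interval follow directly; a sketch (function name illustrative):

```python
import numpy as np

def ic_confidence_interval(ic, psi, alpha_z=1.959963984540054):
    """95% Wald CI from an estimated influence curve.

    Var(psi_n) is approximated by the empirical variance of the
    estimated influence curve divided by n.
    """
    ic = np.asarray(ic, dtype=float)
    se = ic.std(ddof=1) / np.sqrt(ic.size)   # standard error of psi_n
    return psi - alpha_z * se, psi + alpha_z * se
```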

3.2.1. IPCW Full-Data Efficient Influence Curve Equation

By the score condition on the working fluctuation model and $\epsilon^K_n \approx 0$, it follows that this IPCW-TMLE solves the IPCW full-data efficient influence curve equation:

$$0 = \sum_{i=1}^n \frac{\Delta_i}{\Pi_n(V_i)}\, D^F(P^*_{X,n})(X_i).$$

If the full-data TMLE is double robust or has other robustness properties, then these properties will be inherited by this IPCW-TMLE under the assumption that $\Pi_n$ is a consistent estimator of $\Pi_0$. If $V$ is discrete (with finite support), then we propose using a nonparametric estimator $\Pi_n$ of $\Pi_0$.

In this case, we have the following important result. If the full-data model is nonparametric, $V$ is discrete, and the missingness mechanism is estimated nonparametrically, then it follows that the IPCW-TMLE actually solves the true efficient influence curve equation. The latter implies that, under appropriate regularity conditions, and if $P^*_{X,n}$ is consistent for $P_{X,0}$, the IPCW-TMLE will be an asymptotically efficient estimator of $\psi_0$.

**Proof of Result.** Consider the statistical model $\mathcal{M}$ for the observed missing-data structure $O$ implied by a nonparametric full-data model $\mathcal{M}^F$, the MAR assumption, possibly a model for the missingness mechanism $\Pi_0$, and $V$ being discrete. Let $\Psi : \mathcal{M} \to \mathbb{R}$ be the statistical target parameter of interest defined by $\Psi(P_{P_X,\Pi}) = \Psi^F(P_X)$. The efficient influence curve of $\Psi$ at $P_0 = P_{P_{X,0},\Pi_0}$ can be represented as

$$D^*(P_0)(O) = \frac{\Delta}{\Pi_0(V)}\, D^F(P_{X,0})(X) - \left(\frac{\Delta}{\Pi_0(V)} - 1\right) E_0\!\left[D^F(P_{X,0})(X) \mid \Delta = 1, V\right],$$

where $D^F(P_{X,0})$ is the efficient influence curve of the full-data parameter $\Psi^F : \mathcal{M}^F \to \mathbb{R}$.

The IPCW-TMLE $P^*_{X,n}$ solves

$$0 = \sum_{i=1}^n \frac{\Delta_i}{\Pi_n(V_i)}\, D^F(P^*_{X,n})(X_i)$$

for any choice of estimator $\Pi_n$ of $\Pi_0$. If $\Pi_n$ is a nonparametric estimator of $\Pi_0$, then it follows that we also have

$$0 = \sum_{i=1}^n \left(\frac{\Delta_i}{\Pi_n(V_i)} - 1\right) E_n\!\left[D^F(P^*_{X,n})(X) \mid \Delta = 1, V_i\right]$$

for any choice of estimator $E_n$ of the regression $E_0[D^F(P^*_{X,n})(X) \mid \Delta = 1, V]$. As a consequence, it follows that for nonparametric estimators $\Pi_n$ of $\Pi_0$, and IPCW-TMLE $P^*_{X,n}$, the IPCW-TMLE solves the efficient influence curve equation:

$$0 = \sum_{i=1}^n D^*(P^*_{X,n}, \Pi_n)(O_i).$$

We also note that, if we fit $\Pi_0$ with a logistic regression, use it as an offset, add the covariate $C_n(V) = E_n[D^F(P^*_{X,n})(X) \mid \Delta = 1, V]/\Pi(V)$ to update this logistic regression fit of $\Pi_0$, and iterate this updating process of the missingness mechanism until convergence, then the resulting fit $\Pi^*_n$ will also solve

$$0 = \sum_{i=1}^n \left(\frac{\Delta_i}{\Pi^*_n(V_i)} - 1\right) E_n\!\left[D^F(P^*_{X,n})(X) \mid \Delta = 1, V_i\right].$$

This follows from the well-known fact that the score of a univariate linear logistic regression working model $\operatorname{logit} \Pi(\delta) = \operatorname{logit} \Pi + \delta C$, for the coefficient $\delta$ in front of the univariate covariate $C(V)$, equals $C(V)(\Delta - \Pi(\delta)(V))$. For such clever fits of the missingness mechanism we also have that $(P^*_{X,n}, \Pi^*_n)$ solves the efficient influence curve estimating equation

$$0 = \sum_{i=1}^n D^*(P^*_{X,n}, \Pi^*_n)(O_i),$$

so that double robustness and asymptotic efficiency can still be derived.

The latter type of IPCW-TMLE is slightly more complex than the regular IPCW-TMLE since it now also requires fitting the regression $E_0[D^F(P^*_{X,n})(X) \mid \Delta = 1, V]$. However, this represents a minor increase in complexity since it only involves running a mean regression of the outcome $D^F(P^*_{X,n})(X_i)$ on $V_i$ among the observations with $\Delta_i = 1$.
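The iterative updating of the missingness-mechanism fit described above can be sketched as follows, assuming the fitted regression values $E_n[D^F \mid \Delta = 1, V_i]$ have already been computed. The offset logistic fluctuation with clever covariate $C(V) = E_n[D^F \mid \Delta = 1, V]/\Pi(V)$ is iterated until the fluctuation coefficient is numerically zero, at which point the displayed score equation holds; the function name and the damping of the Newton step are illustrative choices:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def target_missingness(delta, dbar, pi0, tol=1e-10, max_rounds=500):
    """Iteratively update a fit pi0 of Pi_0 so that the returned fit pi
    solves  sum_i (delta_i / pi_i - 1) * dbar_i = 0,
    where dbar_i = E_n[D^F | Delta = 1, V_i]."""
    delta = np.asarray(delta, dtype=float)
    pi = np.asarray(pi0, dtype=float).copy()
    for _ in range(max_rounds):
        C = dbar / pi                      # clever covariate C(V_i)
        off = logit(pi)                    # current fit used as offset
        eps = 0.0
        for _ in range(100):               # damped Newton for the coefficient
            p = expit(off + eps * C)
            score = np.sum(C * (delta - p))
            info = np.sum(C**2 * p * (1 - p))
            step = np.clip(score / info, -1.0, 1.0)
            eps += step
            if abs(step) < tol:
                break
        pi = expit(off + eps * C)
        if abs(eps) < tol:                 # converged: fluctuation is zero
            break
    return pi
```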

3.2.2. Risk Difference Example

In this section we demonstrate the IPCW-TMLE for the simple full-data structure $X = (W, A, Y)$, with covariate vector $W$, binary exposure (or treatment) $A$, and binary outcome $Y$. The observed data structure for a randomly sampled subject is $O = (V, \Delta, \Delta X)$, where $V = Y$. The target parameter of the full-data distribution of $X$ is given by

$$\Psi^F(P_{X,0}) = E_{X,0}\!\left[E_{X,0}(Y \mid A = 1, W) - E_{X,0}(Y \mid A = 0, W)\right],$$

and the full-data statistical model $\mathcal{M}^F$ is nonparametric. The full-data efficient influence curve $D^F(Q_0, g_0)$ at $P_{X,0}$ is given by

$$D^F(Q_0, g_0)(X) = \left(\frac{I(A=1)}{g_0(1 \mid W)} - \frac{I(A=0)}{g_0(0 \mid W)}\right)\left(Y - \bar Q_0(A,W)\right) + \bar Q_0(1,W) - \bar Q_0(0,W) - \Psi^F(P_{X,0}),$$

where $Q_0 = (\bar Q_0, Q_{W,0})$, $Q_{W,0}$ is the true full-data marginal distribution of $W$, $\bar Q_0(A,W) = E_{X,0}(Y \mid A, W)$, and $g_0(a \mid W) = P_{X,0}(A = a \mid W)$. The first term will be denoted by $D^F_Y$ and the second term by $D^F_W$, since these two terms represent components of the full-data efficient influence curve that are elements of the tangent space of the conditional distribution of $Y$, given $(A, W)$, and the marginal distribution of $W$, respectively. That is, $D^F_Y$ is the component of the efficient influence curve that equals a score of a parametric fluctuation model of a conditional distribution of $Y$, given $(A, W)$, and $D^F_W$ is a score of a parametric fluctuation model of the marginal distribution of $W$. Note that $D^F_Y$ equals a function $H^*(A,W)$ times the residual $(Y - \bar Q(A,W))$, where

$$H^*(A,W) = \frac{I(A=1)}{g(1 \mid W)} - \frac{I(A=0)}{g(0 \mid W)}.$$

**IPCW initial estimator.** We can estimate the marginal distribution $Q_{W,0}$ with the IPCW-MLE

$$Q_{W,n} = \arg\min_{Q_W} \sum_{i=1}^n \frac{\Delta_i}{\Pi_n(Y_i)}\, L^F(Q_W)(W_i),$$

where $L^F(Q_W) = -\log Q_W$ is the log-likelihood loss function for the marginal distribution of $W$. Note that $Q_{W,n}$ is a discrete distribution that puts mass $1/\{n \Pi_n(Y_i)\}$ on each observation $W_i$ in the sample for which $W_i$ is observed (i.e., $\Delta_i = 1$). Suppose that, based on a sample of $n$ i.i.d. observations $X_i$, we estimated $\bar Q_0$ with loss-based learning using the log-likelihood loss function

$$L^F(\bar Q)(X) = -\log\left\{\bar Q(A,W)^Y \left(1 - \bar Q(A,W)\right)^{1-Y}\right\}.$$

Given the actual observed data, we can estimate $\bar Q_0$ with super learning and weights $\Delta_i/\Pi_n(Y_i)$ for observations $i = 1, \ldots, n$, which corresponds to the same super learner but now based on the IPCW-loss function

$$L(\bar Q, \Pi_n)(O) = \frac{\Delta}{\Pi_n(Y)}\, L^F(\bar Q)(X).$$

Let $L^F(Q) = L^F(Q_W) + L^F(\bar Q)$ be the full-data loss function for $Q = (\bar Q, Q_W)$, and let $L(Q, \Pi) = L^F(Q)\Delta/\Pi$ be the corresponding IPCW-loss function.

Similarly, we can estimate $g_0$ with loss-based super learning based on the IPCW-log-likelihood loss function

$$L(g, \Pi_n)(O) = -\frac{\Delta}{\Pi_n(Y)} \log g(A \mid W).$$

This now provides an initial estimator $Q^0_n = (\bar Q^0_n, Q_{W,n})$ and $g_n$. This estimator was obtained using the same algorithm for computing the initial estimator for the full-data TMLE, but now assigning weights $\Delta_i/\Pi_n(Y_i)$ to each observation. In essence, a full-data loss function $L^F(Q)$ for $Q_0$ used to obtain an initial estimator for the full-data TMLE has been replaced by the IPCW-loss function $L(Q, \Pi_n) = L^F(Q)\Delta/\Pi_n$, and, similarly, a full-data loss function $L^F(g) = -\log g$ has been replaced by $L(g, \Pi_n) = L^F(g)\Delta/\Pi_n$.
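As a concrete illustration, the IPCW-log-likelihood loss for $\bar Q$ is just the full-data binomial log-loss multiplied by the weight $\Delta/\Pi_n(Y)$, so observations outside the second-stage sample contribute zero; a sketch (function name illustrative):

```python
import numpy as np

def ipcw_logloss(y, p, delta, pi):
    """Empirical mean of the IPCW-loss (Delta/Pi) * L^F(Qbar)(X),
    where L^F is the binomial log-likelihood loss and p holds the
    predictions Qbar(A_i, W_i)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    lf = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # full-data log-loss
    return np.mean((delta / pi) * lf)                 # IPCW weighting
```

A cross-validated version of this weighted empirical risk is exactly what the IPCW super learner minimizes over its candidate estimators.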

**Parametric submodel for full-data TMLE.** Let

$$Q_{W,n}(\epsilon_1) = \left(1 + \epsilon_1 D^F_W(Q^0_n)\right) Q_{W,n}$$

be a parametric submodel through $Q_{W,n}$, and let

$$\operatorname{logit} \bar Q^0_n(\epsilon_2) = \operatorname{logit} \bar Q^0_n + \epsilon_2 H^*_n$$

be a parametric submodel through the conditional distribution of $Y$, given $(A, W)$, implied by $\bar Q^0_n$, where $H^*_n$ denotes $H^*$ evaluated at $g_n$. This describes a submodel $\{Q^0_n(\epsilon) : \epsilon\}$ through $Q^0_n$ with a two-dimensional fluctuation parameter $\epsilon = (\epsilon_1, \epsilon_2)$. We have that $\frac{d}{d\epsilon} L^F(Q^0_n(\epsilon))$ at $\epsilon = 0$ yields the two scores $D^F_W(Q^0_n)$ and $D^F_Y(Q^0_n, g_n)$, and therefore spans the full-data efficient influence curve $D^F(Q^0_n, g_n)$, a requirement for the parametric submodel for the full-data TMLE. This parametric submodel and the loss function $L^F(Q)$ now define the full-data TMLE, and this same parametric submodel with the IPCW-loss function $L(Q, \Pi) = L^F(Q)\Delta/\Pi$ defines the IPCW-TMLE.

**The IPCW-TMLE.** Define

$$\epsilon_n = \arg\min_\epsilon \sum_{i=1}^n \frac{\Delta_i}{\Pi_n(Y_i)}\, L^F\!\left(Q^0_n(\epsilon)\right)(X_i)$$

and let $Q^1_n = Q^0_n(\epsilon_n)$. Note that $\epsilon_{1,n} = 0$, which shows that the IPCW empirical distribution of $W$ is not updated. Note also that $\epsilon_{2,n}$ is obtained by performing an IPCW logistic regression of $Y$ on $H^*_n$, where $\bar Q^0_n$ is used as an offset, and extracting the coefficient for $H^*_n$. We then update $\bar Q^0_n$ with $\operatorname{logit} \bar Q^1_n = \operatorname{logit} \bar Q^0_n + \epsilon_{2,n} H^*_n$. The updating process converges in one step in this example, so that the IPCW-TMLE is given by $Q^*_n = Q^1_n$.

**Estimator of the target parameter.** Lastly, one evaluates the target parameter $\psi^*_n = \Psi^F(Q^*_n)$, where $Q^*_n = (\bar Q^1_n, Q_{W,n})$, by plugging $\bar Q^1_n$ and $Q_{W,n}$ into our substitution estimator:

$$\psi^*_n = \frac{1}{n} \sum_{i=1}^n \frac{\Delta_i}{\Pi_n(Y_i)}\left\{\bar Q^1_n(1, W_i) - \bar Q^1_n(0, W_i)\right\}.$$

This is the IPCW-TMLE of $\psi_0$.
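Putting the three steps together for this example: with a nonparametric $\Pi_n(Y)$ (the subsampling fractions among cases and controls), an initial fit $\bar Q^0_n$, and a propensity score $g_n$, the whole IPCW-TMLE of the risk difference fits in a few lines. A sketch under the simplifying assumption that the initial estimator and $g$ are supplied as functions (the names `ipcw_tmle_rd`, `Qbar`, `g1` are illustrative), rather than fit by super learning:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def ipcw_tmle_rd(Y, delta, A, W, Qbar, g1, tol=1e-12):
    """IPCW-TMLE of the risk difference under two-stage sampling with V = Y.

    Y, delta: full first-stage arrays (length n).
    A, W:     second-stage values for the subsample (length delta.sum(),
              in first-stage order).
    Qbar(a, w): initial estimate of E(Y | A=a, W=w); g1(w): P(A=1 | W=w).
    """
    n = len(Y)
    # nonparametric Pi_n(y): inclusion fraction among cases and controls
    pi_case, pi_ctrl = delta[Y == 1].mean(), delta[Y == 0].mean()
    Ysub = Y[delta == 1]
    wts = np.where(Ysub == 1, 1.0 / pi_case, 1.0 / pi_ctrl)
    g = g1(W)
    H = A / g - (1 - A) / (1 - g)          # clever covariate H*(A, W)
    off = logit(np.clip(Qbar(A, W), 1e-9, 1 - 1e-9))
    eps = 0.0
    for _ in range(100):                   # IPCW-weighted logistic fluctuation
        p = expit(off + eps * H)
        score = np.sum(wts * H * (Ysub - p))
        info = np.sum(wts * H**2 * p * (1 - p))
        step = np.clip(score / info, -1.0, 1.0)
        eps += step
        if abs(step) < tol:
            break
    Q1 = expit(logit(np.clip(Qbar(1, W), 1e-9, 1 - 1e-9)) + eps / g)
    Q0 = expit(logit(np.clip(Qbar(0, W), 1e-9, 1 - 1e-9)) - eps / (1 - g))
    # substitution estimator over the IPCW empirical distribution of W
    return np.sum(wts * (Q1 - Q0)) / n
```

Because $\Pi_n$ is nonparametric and the full-data model is nonparametric, this estimator solves the efficient influence curve equation of Sect. 3.2.1, and it inherits the double robustness of the full-data TMLE in $(\bar Q, g)$.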

3.2.3. Right Censoring

Suppose our full-data structure is a right-censored data structure and we conduct a nested case-control study. For example, $X$ might be defined as $X = (W, A, \tilde T, \Xi, Y^*)$, where $W$ are covariates, $A$ is an exposure of interest, $\tilde T = \min(T, C)$, $T$ is the time to the event, $C$ denotes a censoring variable, $\Xi = I(\tilde T = T)$ is a failure indicator, and $Y^* = I(\tilde T \le t, \Xi = 1)$ is an indicator of having an observed failure by endpoint $t$. Our missing-data structure is given by $O = (\Delta, \Delta X, \tilde T, \Xi, Y^*)$, where $\Delta = 1$ denotes membership in the nested case-control sample.

A special feature of this right-censored data structure is that one will define a case based on a binary random variable $Y^*$ that is not the outcome of interest. For example, $Y^*$ could represent observed death by year 5, which would be denoted $Y^* = I(\tilde T \le 5 \text{ years}, \Xi = 1)$. It is important to stress that the definition of a case ($Y^* = 1$) in a nested case-control study within a right-censored data structure is therefore different than without right censoring. Let's say our parameter of interest $\Psi^F(P_{X,0})$ is the causal risk difference under causal assumptions:

$$E_{X,0}\!\left[P_{X,0}(T > 5 \mid A = 1, W) - P_{X,0}(T > 5 \mid A = 0, W)\right].$$

We define the TMLE for the full-data structure, and we then use the IPCW-TMLE for the actual missing-data structure. In other words, we need a TMLE of $\Psi^F(P_{X,0})$ based on $X$, and then the IPCW-TMLE is defined as well. The TMLE of the additive causal effect of treatment on survival, and other parameters, based on the right-censored data structure is presented elsewhere (Moore and van der Laan, 2009a,b; Stitelman and van der Laan, 2010; van der Laan and Rose, 2011).

3.2.4. Effect Modification

The general approach involves defining our full-data structure, for example, $X = (W, A^*, A, Y)$, and our observed data $O = (V, \Delta, \Delta X)$, where again $V$ is in $X$. We are interested in studying the effect modification of a variable denoted $A^*$. Our full-data parameter of interest might be

$$\Psi^F_{a^*,a}(P_{X,0}) = E_{X,0}\,\bar Q_0(a^*, a),$$

where $\bar Q_0(a^*, a) = E_0(Y \mid A^* = a^*, A = a, W)$ and the outer expectation is over the marginal distribution of $W$. The full-data TMLE involves first running an initial regression of $Y$ on $A^*$, $A$, and $W$. We note that $A$ and $A^*$ are implicitly assumed to have finite support. The targeting step requires a parametric working submodel to fluctuate the initial estimator and a choice of loss function. We use a clever covariate that will define this parametric working submodel. The clever covariate for $\Psi^F_{a^*,a}$ is given by

$$H^*(A^*, A, W) = \frac{I(A^* = a^*, A = a)}{g_0(a^*, a \mid W)},$$

where $g_0(a^*, a \mid W) = P_{X,0}(A = a \mid W)\, P_{X,0}(A^* = a^* \mid A = a, W)$, and $P_{X,0}(A = a \mid W)$ may be known, as in a clinical trial, but $P_{X,0}(A^* = a^* \mid A = a, W)$ must be fitted. The clever covariate for the difference of two such parameters is the corresponding difference of clever covariates. As loss function one can use the least squares loss function, in which case the working submodel is a linear regression of $Y$ on $H^*$ using the initial estimator as offset. If $Y$ is binary, or continuous in $(0, 1)$ (e.g., after a linear transformation), then one can use the more robust quasi-log-likelihood loss function (Gruber and van der Laan, 2010). In the latter case, the working submodel is a logistic linear regression of $Y$ on $H^*$, using the initial estimator as offset. Therefore, one can target a single parameter with a single clever covariate, or one can target all four parameters (one for each combination $(a^*, a)$ of binary $A^*$ and $A$) with a four-dimensional clever covariate, and look at multiple differences. This now defines the full-data TMLE for the desired target parameter. The desired IPCW-TMLE for the observed data is obtained by assigning weights $\Delta_i/\Pi_n(Y_i)$ to each observation, or equivalently, by replacing the full-data loss function in the full-data TMLE by the IPCW-loss function.
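For binary $A^*$ and $A$, the four-dimensional clever covariate can be assembled directly from the two fitted conditional probabilities; a sketch (function name and argument convention are illustrative):

```python
import numpy as np

def clever_covariates(a_star, a, g_a, g_astar_given_a):
    """Four-dimensional clever covariate H*(A*, A, W), one column per
    target parameter E_W[Qbar(a*, a)], (a*, a) in {0,1}^2.

    g_a:              array of P(A = 1 | W_i)
    g_astar_given_a:  array of P(A* = 1 | A = A_i, W_i)
    """
    cols = []
    for s in (0, 1):          # value a* of the effect modifier
        for t in (0, 1):      # value a of the exposure
            p_t = np.where(t == 1, g_a, 1 - g_a)
            p_s = np.where(s == 1, g_astar_given_a, 1 - g_astar_given_a)
            ind = (a_star == s) & (a == t)   # I(A* = a*, A = a)
            cols.append(ind / (p_t * p_s))   # indicator over g(a*, a | W)
    # columns ordered (a*, a) = (0,0), (0,1), (1,0), (1,1)
    return np.column_stack(cols)
```

Each column is nonzero only where the observed $(A^*_i, A_i)$ matches the targeted $(a^*, a)$, so evaluating `g_astar_given_a` at the observed $A_i$ suffices; the difference of any two columns is the clever covariate for the corresponding difference parameter.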