The targeted MLE is a semi-parametric efficient substitution estimator of a target parameter $\Psi(P_0)$ of a true distribution $P_0$, based on sampling $n$ i.i.d. observations $O_1, \ldots, O_n$ from $P_0$. Here $P_0$ is known to be an element of a semi-parametric statistical model $\mathcal{M}$. We start with a succinct summary of how it works. For more details we refer to our articles on this topic (van der Laan et al., 2009).
Firstly, one notes that $\Psi(P_0) = \Psi(Q_0)$ depends on $P_0$ only through a relevant part $Q_0 = Q(P_0)$ of $P_0$. Secondly, one proposes a loss function $L(Q)(O)$ so that $Q_0 = \arg\min_{Q \in \mathcal{Q}} E_0 L(Q)(O)$, where $\mathcal{Q} = \{Q(P) : P \in \mathcal{M}\}$. Thirdly, one uses minimum loss-based learning, such as super learning (van der Laan et al., 2007), fully utilizing the power and optimality results for loss-based cross-validation to select among candidate estimators, to obtain an initial estimator $Q_n^0$ of $Q_0$. Fourthly, one proposes a parametric fluctuation $Q_n^0(\varepsilon)$, possibly indexed by a nuisance parameter $g_0 = g(P_0)$, so that (up to sign, when $L$ is a negative log-likelihood)
$$\frac{d}{d\varepsilon} L(Q_n^0(\varepsilon))(O)\Big|_{\varepsilon = 0} = D^*(Q_n^0, g_0)(O), \tag{1}$$
where $D^*(Q_0, g_0)$ is the canonical gradient/efficient influence curve of $\Psi : \mathcal{M} \to \mathbb{R}$ at $P_0$. Fifthly, one computes the amount of fluctuation
$$\varepsilon_n = \arg\min_{\varepsilon} \sum_{i=1}^n L(Q_n^0(\varepsilon))(O_i),$$
where the fluctuation is computed with $g_n$, an estimator of the unknown nuisance parameter $g_0$, in place of $g_0$. This yields an update $Q_n^1 = Q_n^0(\varepsilon_n)$. This updating of an initial estimator $Q_n^0$ into a next $Q_n^1$ is iterated until convergence, resulting in a final estimator $Q_n^*$. Since at the last step the amount of fluctuation $\varepsilon_n \approx 0$, this final $Q_n^*$ solves the efficient influence curve estimating equation
$$0 = \sum_{i=1}^n D^*(Q_n^*, g_n)(O_i),$$
a fundamental ingredient for establishing asymptotic efficiency of $\Psi(Q_n^*)$: recall that an estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve $D^*(Q_0, g_0)$. Finally, the targeted MLE of $\psi_0 = \Psi(P_0)$ is the substitution estimator $\Psi(Q_n^*)$.
Thus we see that the targeted MLE involves constructing a parametric model $\{Q_n^0(\varepsilon) : \varepsilon\}$ through the initial estimator $Q_n^0$ with parameter $\varepsilon$ representing an amount of fluctuation of the initial estimator, where the score of this fluctuation model at $\varepsilon = 0$ equals the efficient influence curve. The latter constraint can be satisfied by many parametric models, since it is only a local constraint on the model's behavior at zero fluctuation. However, it is very important that the fluctuations stay within the model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, a violation of this property can heavily affect the performance of the estimator.
One important strength of the semi-parametric efficient targeted MLE relative to the alternative semi-parametric efficient estimating equation methodology (van der Laan and Robins, 2003) is that it respects the global constraints of the observed data model, since it is a substitution estimator $\Psi(Q_n^*)$ with $Q_n^*$ an estimator of a relevant part $Q_0$ of the true distribution of the data in the observed data model. The estimating equation methodology does not result in substitution estimators and thereby often ignores important global constraints of the observed data model, though Tan (2008) introduces a non-parametric likelihood-based approach to constructing a double robust estimator that is not a substitution estimator, and offers a comparison with other estimators, including a TMLE that is not constrained to remain within the bounds of the observed data model. Ignoring constraints comes at a price in the context of sparsity. Indeed, simulations have confirmed this gain of the targeted MLE relative to the efficient estimating equation method in the context of sparsity (Stitelman and van der Laan, 2010), and it is again demonstrated in this article. However, if the targeted MLE starts violating this principle of being a substitution estimator by allowing $Q_n^*$ to fall outside the assumed observed data model, this advantage is compromised. Therefore, it is crucial to use a fluctuation model that is guaranteed to stay within the desired observed data model.
To demonstrate this important consideration of selecting a valid fluctuation model in the construction of the targeted MLE, we consider the problem of estimating a causal effect of a binary treatment $A$ on a continuous outcome $Y$, based on observing $n$ i.i.d. copies of $O = (W, A, Y) \sim P_0$, where $W$ is the set of confounders. Under the non-parametric structural equation model (NPSEM) $W = f_W(U_W)$, $A = f_A(W, U_A)$, $Y = f_Y(W, A, U_Y)$, with a structure on the exogenous variables $U = (U_W, U_A, U_Y)$ satisfying the no unmeasured confounders assumption ($A \perp Y(a) \mid W$ for the counterfactuals $Y(a)$ defined by this NPSEM), the additive causal effect $E(Y(1) - Y(0))$ can be identified from the observed data distribution through the following statistical parameter of $P_0$:
$$\Psi(P_0) = E_0\left[E_0(Y \mid A = 1, W) - E_0(Y \mid A = 0, W)\right].$$
Suppose that it is known that $Y \in [a, b]$ for some $a < b$. Alternatively, one might have truncated the original data to fall in such an interval and focus on the causal effect of treatment on this truncated outcome, motivated by the fact that estimating conditional means of unbounded, or very heavy-tailed, outcomes requires very large data sets.
Let $Y^* = (Y - a)/(b - a)$ be the linearly transformed outcome within $[0, 1]$, and define
$$\Psi^*(P_0) = E_0\left[E_0(Y^* \mid A = 1, W) - E_0(Y^* \mid A = 0, W)\right].$$
We note that
$$\Psi(P_0) = (b - a)\,\Psi^*(P_0).$$
An estimate, limit distribution, and confidence interval for $\Psi^*(P_0)$ are now immediately mapped into an estimate, limit distribution, and confidence interval for $\Psi(P_0)$ by simple multiplication by $(b - a)$. As a consequence, without loss of generality, we can assume $a = 0$ and $b = 1$ so that $Y \in [0, 1]$.
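The rescaling step can be sketched in a few lines of Python. This is only an illustration: the helper names `to_unit_interval` and `to_original_scale`, the bounds, and the numbers are hypothetical and not taken from the article.

```python
def to_unit_interval(y, a, b):
    """Map outcomes known to lie in [a, b] linearly onto [0, 1]."""
    if not a < b:
        raise ValueError("bounds must satisfy a < b")
    return [(yi - a) / (b - a) for yi in y]


def to_original_scale(psi_star, ci_star, a, b):
    """Map an estimate and confidence interval for Psi* back to Psi."""
    width = b - a  # Psi = (b - a) * Psi*
    return psi_star * width, (ci_star[0] * width, ci_star[1] * width)


# Hypothetical outcomes known to lie in [10, 20].
y_star = to_unit_interval([12.0, 20.0, 16.0], a=10.0, b=20.0)

# Hypothetical estimate and confidence interval on the [0, 1] scale.
psi, ci = to_original_scale(0.25, (0.10, 0.40), a=10.0, b=20.0)
```

Because the additive effect is a difference of means, the shift by $a$ cancels and only the factor $(b - a)$ is needed when mapping back.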
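The efficient influence curve of the statistical parameter $\Psi : \mathcal{M} \to \mathbb{R}$, defined on a non-parametric statistical model $\mathcal{M}$ for $P_0$, at the true distribution $P_0$, is given by
$$D^*(P_0)(O) = \left(\frac{I(A = 1)}{g_0(1 \mid W)} - \frac{I(A = 0)}{g_0(0 \mid W)}\right)\left(Y - \bar{Q}_0(W, A)\right) + \bar{Q}_0(W, 1) - \bar{Q}_0(W, 0) - \Psi(Q_0),$$
where $\bar{Q}_0(W, A) = E_0(Y \mid A, W)$, and $Q_0 = (Q_W, \bar{Q}_0)$ denotes both this conditional mean $\bar{Q}_0$ and the marginal distribution $Q_W$ of $W$. Note that indeed $\Psi(P_0)$ depends on $P_0$ only through $\bar{Q}_0$ and the marginal distribution of $W$. We will use the notation $\Psi(P_0)$ and $\Psi(Q_0)$ interchangeably.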
We now define a targeted MLE of $\Psi(Q_0)$ as follows. Let $\bar{Q}_n^0$ be an initial estimator of $\bar{Q}_0(W, A) = E_0(Y \mid A, W)$ with predicted values in $(0, 1)$. In addition, we estimate the marginal distribution $Q_W$ of $W$ with the empirical distribution $Q_{Wn}$ of $W$. Let $Q_n^0 = (Q_{Wn}, \bar{Q}_n^0)$ denote the resulting initial estimator of $Q_0$. The targeted MLE step also requires an estimator $g_n$ of $g_0 = P_{A \mid W}$. Only the conditional mean $\bar{Q}_n^0$ will be modified by the targeted MLE procedure defined below: this makes sense since the empirical distribution of $W$ is already a non-parametric maximum likelihood estimator, so that no bias reduction with respect to the target parameter would be obtained by modifying it.
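We can represent the estimator $\bar{Q}_n^0$ as $\bar{Q}_n^0 = 1/(1 + \exp(-f_n^0))$ with $f_n^0 = \log\{\bar{Q}_n^0/(1 - \bar{Q}_n^0)\}$. Consider now the fluctuation model
$$\bar{Q}_n^0(\varepsilon) = \frac{1}{1 + \exp(-\{f_n^0 + \varepsilon h\})}$$
with parameter $\varepsilon$, indexed by a function $h = h(g)$ given by
$$h(g)(A, W) = \frac{I(A = 1)}{g(1 \mid W)} - \frac{I(A = 0)}{g(0 \mid W)}.$$
Equivalently, we can write this as $\operatorname{logit} \bar{Q}_n^0(\varepsilon) = \operatorname{logit} \bar{Q}_n^0 + \varepsilon h$.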
Consider now the following loss function for $\bar{Q}_0$:
$$L(\bar{Q})(O) = -\left\{Y \log \bar{Q}(W, A) + (1 - Y) \log\left(1 - \bar{Q}(W, A)\right)\right\}.$$
Note that this is minus the log-likelihood of the conditional distribution of a binary outcome $Y$, but now extended to continuous outcomes in $[0, 1]$. (See also Wedderburn (1974) and McCullagh (1983) for earlier use of logistic regression for continuous outcomes.) It is thus known that this loss function is a valid loss function for the conditional distribution of a binary $Y$, but we need it to be a valid loss function for the conditional mean of a continuous $Y \in [0, 1]$. The following lemma establishes this result.
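Lemma 1 We have that
$$\bar{Q}_0 = \arg\min_{\bar{Q}} E_0 L(\bar{Q})(O),$$
where the minimum is taken over all functions of $(W, A)$ which map into $(0, 1)$.
In addition, define the fluctuation function
$$\bar{Q}(\varepsilon) = \frac{1}{1 + \exp(-\{\operatorname{logit} \bar{Q} + \varepsilon h\})}.$$
For any function $h$ we have
$$\frac{d}{d\varepsilon} L(\bar{Q}(\varepsilon))(O)\Big|_{\varepsilon = 0} = -h(W, A)\left(Y - \bar{Q}(W, A)\right).$$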
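Proof: Let $\bar{Q}_1$ be a local minimum and consider the fluctuation $\bar{Q}_1(\varepsilon)$ defined above. Then the derivative of $E_0 L(\bar{Q}_1(\varepsilon))$ at $\varepsilon = 0$ equals zero. However,
$$-\frac{d}{d\varepsilon} E_0 L(\bar{Q}_1(\varepsilon))\Big|_{\varepsilon = 0} = E_0\, h(W, A)\left(Y - \bar{Q}_1(W, A)\right) = E_0\, h(W, A)\left(\bar{Q}_0 - \bar{Q}_1\right)(W, A).$$
Thus, it follows that
$$E_0\, h(W, A)\left(\bar{Q}_0 - \bar{Q}_1\right)(W, A) = 0.$$
But this needs to hold for any function $h(W, A)$, which proves that $\bar{Q}_1 = \bar{Q}_0$ a.e.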
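As a quick numerical illustration of the lemma in the simplest setting (no covariates, so the candidate function is a single constant $q$), the following Python sketch checks that the empirical risk under this loss is minimized at the sample mean of a continuous outcome in $[0, 1]$. The outcome values, the grid, and the function names are hypothetical.

```python
import math


def quasi_log_loss(y, q):
    """Binary log-likelihood loss, applied to a continuous y in [0, 1]."""
    return -(y * math.log(q) + (1.0 - y) * math.log(1.0 - q))


def empirical_risk(q, ys):
    """Average loss of the constant prediction q over the sample ys."""
    return sum(quasi_log_loss(y, q) for y in ys) / len(ys)


# Hypothetical continuous outcomes in [0, 1]; none of them is binary.
ys = [0.05, 0.20, 0.35, 0.90]
mean_y = sum(ys) / len(ys)

# Grid search over constant predictions in (0, 1).
grid = [k / 1000.0 for k in range(1, 1000)]
best_q = min(grid, key=lambda q: empirical_risk(q, ys))
```

With these values the grid minimizer coincides with the sample mean up to the grid resolution, even though the loss was originally motivated by binary data.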
This proves that $L(\bar{Q})$ is a valid loss function for the conditional mean $\bar{Q}_0$. Indeed, we can use $L(\bar{Q})$ as loss function to construct an initial estimator of $\bar{Q}_0$, and/or use cross-validation to select among candidate targeted maximum likelihood estimators, such as in the collaborative targeted MLE procedure. For the purpose of constructing an initial estimator one could also use a minimum loss-based super learner based on the squared error loss function $L_2(\bar{Q}) = (Y - \bar{Q}(W, A))^2$, possibly with weights.
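Given an initial estimator $\bar{Q}_n^0$, and our proposed fluctuation function $\bar{Q}_n^0(\varepsilon)$, we have
$$\frac{d}{d\varepsilon} L(\bar{Q}_n^0(\varepsilon))(O)\Big|_{\varepsilon = 0} = -h(A, W)\left(Y - \bar{Q}_n^0(W, A)\right),$$
giving us, up to sign, the desired first component $D_1^*(Q_n^0, g_n)$ of the efficient influence curve $D^*(Q_n^0, g_n)$.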
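We use the log-likelihood loss function, $-\log Q_W$, as loss function for the marginal distribution of $W$, so that our combined loss function is given by $L(Q) = -\log Q_W + L(\bar{Q})$. In addition, we use as fluctuation of the empirical distribution $Q_{Wn}$ the model $Q_{Wn}(\varepsilon_1) = (1 + \varepsilon_1 D_2^*(Q_n^0))\,Q_{Wn}$, where $D_2^*(Q_n^0)(W) = \bar{Q}_n^0(W, 1) - \bar{Q}_n^0(W, 0) - \Psi(Q_n^0)$ is the remaining component of the efficient influence curve. With these choices we indeed now have that
$$\frac{d}{d\varepsilon_1} L(Q_n^0(\varepsilon_1, \varepsilon))\Big|_{(\varepsilon_1, \varepsilon) = 0} = -D_2^*(Q_n^0), \qquad \frac{d}{d\varepsilon} L(Q_n^0(\varepsilon_1, \varepsilon))\Big|_{(\varepsilon_1, \varepsilon) = 0} = -D_1^*(Q_n^0, g_n),$$
with $D_1^*(Q_n^0, g_n)$ the first component obtained from the fluctuation of $\bar{Q}_n^0$, and $D_1^* + D_2^* = D^*(Q_n^0, g_n)$.
This shows that we succeeded in defining a loss function for $Q_0 = (Q_W, \bar{Q}_0)$ and a fluctuation function so that the desired derivative condition (1) indeed yields, up to sign, the efficient influence curve.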
The MLE of $\varepsilon_1$ equals zero, so that the update of $Q_{Wn}$ equals $Q_{Wn}$ itself. The empirical mean of the component $D_2^*(Q_n^0)$ of the efficient influence curve is always equal to zero, because we estimate the marginal distribution of $W$ with the empirical distribution of $W$.
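The amount of fluctuation $\varepsilon$ for fluctuating $\bar{Q}_n^0$ is given by
$$\varepsilon_n = \arg\min_{\varepsilon} \sum_{i=1}^n L(\bar{Q}_n^0(\varepsilon))(O_i).$$
This “maximum likelihood” estimator of $\varepsilon$ can be computed with generalized linear regression using the binomial family with logit link, i.e., the logistic regression MLE procedure, simply ignoring that the outcome is not binary, which also corresponds with iteratively re-weighted least squares estimation using weights $1/\{\bar{Q}(1 - \bar{Q})\}$, evaluated at the current fit $\bar{Q} = \bar{Q}_n^0(\varepsilon)$.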
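A minimal sketch of this computation, assuming the fluctuation covariate $h$ is $1/g_n(1 \mid W)$ for treated and $-1/g_n(0 \mid W)$ for untreated observations, and using plain Newton-Raphson on the one-dimensional score instead of a GLM routine. The function `fit_epsilon` and all data values are hypothetical.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def fit_epsilon(y, q_init, h, tol=1e-12, max_iter=100):
    """Newton-Raphson for epsilon in logit Q(eps) = logit Q_init + eps * h.

    Solves the logistic regression score equation with offset logit(q_init)
    and a single covariate h; y may be any value in [0, 1].
    """
    offsets = [logit(q) for q in q_init]
    eps = 0.0
    for _ in range(max_iter):
        q_eps = [expit(f + eps * hi) for f, hi in zip(offsets, h)]
        score = sum(hi * (yi - qi) for hi, yi, qi in zip(h, y, q_eps))
        info = sum(hi * hi * qi * (1.0 - qi) for hi, qi in zip(h, q_eps))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps


# Hypothetical data: treatment, estimated P(A=1|W), outcome in [0, 1],
# and initial predictions at the observed (W, A).
a = [0, 1, 1, 0, 0, 1]
g1 = [0.5, 0.4, 0.7, 0.3, 0.6, 0.5]
y = [0.1, 0.4, 0.8, 0.6, 0.3, 0.9]
q_init = [0.3, 0.5, 0.6, 0.5, 0.4, 0.7]
h = [1.0 / g if ai == 1 else -1.0 / (1.0 - g) for ai, g in zip(a, g1)]

eps_n = fit_epsilon(y, q_init, h)
q_updated = [expit(logit(q) + eps_n * hi) for q, hi in zip(q_init, h)]
score_at_fit = sum(hi * (yi - qi) for hi, yi, qi in zip(h, y, q_updated))
```

In practice the same fit is usually obtained from a standard logistic regression routine with an offset term; the hand-written Newton iteration is used here only to keep the sketch dependency-free.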
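This provides us with the targeted MLE update $Q_n^1 = (Q_{Wn}, \bar{Q}_n^1)$, where the empirical distribution of $W$ did not get updated, and $\bar{Q}_n^0$ did get updated as $\bar{Q}_n^1 = \bar{Q}_n^0(\varepsilon_n)$. Iterating this procedure now defines the targeted MLE $Q_n^*$, but as in the binary outcome case, we have that $\bar{Q}_n^2 = \bar{Q}_n^1$ since the next MLE of $\varepsilon$ equals zero. Thus convergence occurs in one step, so that $Q_n^* = Q_n^1$. The targeted MLE of $\psi_0$ is thus given by $\Psi(Q_n^*) = \frac{1}{n} \sum_{i=1}^n \{\bar{Q}_n^1(W_i, 1) - \bar{Q}_n^1(W_i, 0)\}$. As predicted, the targeted MLE $Q_n^*$ solves the efficient influence curve estimating equation $0 = \sum_{i=1}^n D^*(Q_n^*, g_n)(O_i)$.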
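The complete one-step procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: `tmle_bounded`, `q_init`, `g1`, and the toy data are hypothetical, the initial estimator and treatment mechanism are supplied as plain functions, and the fluctuation parameter is fitted by Newton-Raphson.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def tmle_bounded(w, a, y, q_init, g1):
    """One-step TMLE of the additive effect for an outcome Y in [0, 1].

    q_init(w, a): initial prediction of E(Y | A=a, W=w), strictly in (0, 1).
    g1(w): estimated P(A = 1 | W = w).
    Returns the estimate, the fitted epsilon, and the influence curve values.
    """
    n = len(y)

    def h(ai, wi):
        return 1.0 / g1(wi) if ai == 1 else -1.0 / (1.0 - g1(wi))

    h_obs = [h(ai, wi) for ai, wi in zip(a, w)]
    offsets = [logit(q_init(wi, ai)) for wi, ai in zip(w, a)]

    # Fit epsilon by maximizing the (quasi) binomial log-likelihood.
    eps = 0.0
    for _ in range(100):
        q_eps = [expit(f + eps * hi) for f, hi in zip(offsets, h_obs)]
        score = sum(hi * (yi - qi) for hi, yi, qi in zip(h_obs, y, q_eps))
        info = sum(hi * hi * qi * (1.0 - qi) for hi, qi in zip(h_obs, q_eps))
        step = score / info
        eps += step
        if abs(step) < 1e-12:
            break

    def q_star(wi, ai):
        # Updated conditional mean; stays in (0, 1) by construction.
        return expit(logit(q_init(wi, ai)) + eps * h(ai, wi))

    # Substitution estimator: average over the empirical distribution of W.
    psi = sum(q_star(wi, 1) - q_star(wi, 0) for wi in w) / n
    eic = [
        h(ai, wi) * (yi - q_star(wi, ai)) + q_star(wi, 1) - q_star(wi, 0) - psi
        for wi, ai, yi in zip(w, a, y)
    ]
    return psi, eps, eic


# Hypothetical toy data with a binary confounder W.
w = [0, 0, 0, 0, 1, 1, 1, 1]
a = [0, 1, 0, 1, 0, 1, 1, 1]
y = [0.2, 0.5, 0.3, 0.6, 0.4, 0.9, 0.7, 0.8]

psi_n, eps_n, eic = tmle_bounded(
    w, a, y,
    q_init=lambda wi, ai: 0.3 + 0.1 * wi + 0.2 * ai,
    g1=lambda wi: 0.5 if wi == 0 else 0.75,
)
```

The returned influence curve values sum to (numerically) zero, which is the estimating equation property described above, and their sample variance divided by $n$ is the usual basis for a variance estimate.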
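An inspection of this efficient influence curve,
$$D^*(Q_0, g_0)(O) = \left(\frac{I(A = 1)}{g_0(1 \mid W)} - \frac{I(A = 0)}{g_0(0 \mid W)}\right)\left(Y - \bar{Q}_0(W, A)\right) + \bar{Q}_0(W, 1) - \bar{Q}_0(W, 0) - \Psi(Q_0),$$
reveals that there are two potential sources of sparsity. Small values of $g_0(A \mid W)$ and large outlying values of $Y$ inflate the variance. Enforcing (e.g., known) bounds on $Y$ and $g_0$ in the estimation procedure provides a means of controlling these sources of variance. We note that, even if there is strong confounding causing some large values of $1/g_n(A \mid W)$, the resulting targeted MLE $\bar{Q}_n^*$ remains bounded in $(0, 1)$, so that the targeted MLE $\Psi(Q_n^*)$ fully respects the global constraints of the observed data model. On the other hand, the augmented IPTW estimator obtained by solving $\sum_{i=1}^n D^*(Q_n^0, g_n, \psi)(O_i) = 0$ in $\psi$, where $\psi$ takes the place of $\Psi(Q)$ in $D^*$, yields the estimator
$$\psi_n = \frac{1}{n} \sum_{i=1}^n \left\{\left(\frac{I(A_i = 1)}{g_n(1 \mid W_i)} - \frac{I(A_i = 0)}{g_n(0 \mid W_i)}\right)\left(Y_i - \bar{Q}_n^0(W_i, A_i)\right) + \bar{Q}_n^0(W_i, 1) - \bar{Q}_n^0(W_i, 0)\right\},$$
which can easily fall outside $[0, 1]$ if for some observations $W_i$, $g_n(1 \mid W_i)$ is close to 1 or 0. This represents the price of not being a substitution estimator.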
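The following Python sketch illustrates this behavior on a small hypothetical data set in which one stratum has an estimated treatment probability of 0.02. The function `aiptw`, the constant initial fit, and the data are assumptions made for illustration only.

```python
def aiptw(w, a, y, q_bar, g1):
    """Augmented IPTW estimate of the additive effect.

    q_bar(w, a): initial estimate of E(Y | A=a, W=w).
    g1(w): estimated P(A = 1 | W = w).
    """
    total = 0.0
    for wi, ai, yi in zip(w, a, y):
        h = 1.0 / g1(wi) if ai == 1 else -1.0 / (1.0 - g1(wi))
        total += h * (yi - q_bar(wi, ai)) + q_bar(wi, 1) - q_bar(wi, 0)
    return total / len(y)


def g1(wi):
    # Hypothetical treatment mechanism: treatment is rare when W = 1.
    return 0.5 if wi == 0 else 0.02


def q_bar(wi, ai):
    # Hypothetical (poor) initial fit: a constant prediction.
    return 0.5


w = [0, 0, 1, 1]
a = [1, 0, 1, 0]
y = [0.9, 0.1, 0.95, 0.2]

psi_aiptw = aiptw(w, a, y, q_bar, g1)
```

Here the single treated observation with $g_n(1 \mid W) = 0.02$ receives weight 50 and pushes the estimate far above 1, whereas any difference of two conditional means in $(0, 1)$, such as a substitution estimator, necessarily lies in $(-1, 1)$.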
Contrasting with the targeted MLE using a linear fluctuation function. Alternatively, we could employ the targeted MLE using the squared error loss function $L_2(\bar{Q}) = (Y - \bar{Q}(W, A))^2$ and the fluctuation function $\bar{Q}_n^0(\varepsilon) = \bar{Q}_n^0 + \varepsilon h(g)$, so that (1) is still satisfied (up to a constant factor). In this case, large values of $h(g)$ can result in predicted values of $\bar{Q}_n^0(\varepsilon_n)$ that are out of the bounds $[a, b]$. Therefore, this version of the targeted MLE does not respect the global constraints of the model, i.e., the knowledge that $Y \in [a, b]$. A comparison based on simulated data of the targeted MLE using the logistic fluctuation function and the targeted MLE using this linear fluctuation function is provided in the next section.
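A small numerical contrast of the two fluctuation functions, as a sketch on hypothetical values (the lists `h`, `q0`, and `y` are made up, and bounds $a = 0$, $b = 1$ are assumed): the linear fluctuation is fitted by least squares, the logistic fluctuation by Newton-Raphson.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def fit_linear_eps(y, q, h):
    """Least-squares epsilon for the linear fluctuation q + eps * h."""
    num = sum(hi * (yi - qi) for hi, yi, qi in zip(h, y, q))
    return num / sum(hi * hi for hi in h)


def fit_logistic_eps(y, q, h, tol=1e-12, max_iter=100):
    """Newton-Raphson epsilon for logit q(eps) = logit q + eps * h."""
    offsets = [logit(qi) for qi in q]
    eps = 0.0
    for _ in range(max_iter):
        q_eps = [expit(f + eps * hi) for f, hi in zip(offsets, h)]
        score = sum(hi * (yi - qi) for hi, yi, qi in zip(h, y, q_eps))
        info = sum(hi * hi * qi * (1.0 - qi) for hi, qi in zip(h, q_eps))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps


# Hypothetical values; the third observation has g_n(A|W) = 0.1, so h = 10.
h = [2.0, -2.0, 10.0, -1.1]
q0 = [0.5, 0.5, 0.9, 0.5]
y = [0.9, 0.1, 0.95, 0.2]

eps_lin = fit_linear_eps(y, q0, h)
q_lin = [qi + eps_lin * hi for qi, hi in zip(q0, h)]

eps_log = fit_logistic_eps(y, q0, h)
q_log = [expit(logit(qi) + eps_log * hi) for qi, hi in zip(q0, h)]
```

With these inputs the linear update produces a prediction above 1 for the observation with the large value of $h$, while every logistic update remains strictly inside $(0, 1)$.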