Int J Biostat. 2010 January 1; 6(1): Article 26.

Published online 2010 August 1. doi: 10.2202/1557-4679.1260

PMCID: PMC3126669

Susan Gruber and Mark J. van der Laan, University of California, Berkeley

Copyright © 2010 The Berkeley Electronic Press. All rights reserved


Targeted maximum likelihood estimation of a parameter of a data generating distribution, known to be an element of a semi-parametric model, involves constructing a parametric model through an initial density estimator with parameter *ε* representing an amount of fluctuation of the initial density estimator, where the score of this fluctuation model at *ε* = 0 equals the efficient influence curve/canonical gradient. The latter constraint can be satisfied by many parametric fluctuation models since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the semi-parametric model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, by which we mean situations where the Fisher information is low, a violation of this property can heavily affect the performance of the estimator. This paper presents a fluctuation approach that guarantees the fluctuated density estimator remains inside the bounds of the data model. We demonstrate this in the context of estimation of a causal effect of a binary treatment on a continuous outcome that is bounded. It results in a targeted maximum likelihood estimator that inherently respects known bounds, and consequently is more robust in sparse data situations than the targeted MLE using a naive fluctuation model.

When an estimation procedure incorporates weights, observations having large weights relative to the rest heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it can also introduce bias into the estimate. We present an alternative targeted maximum likelihood estimation (TMLE) approach that dampens the effect of these heavily weighted observations. As a substitution estimator, TMLE respects the global constraints of the observed data model. For example, when outcomes are binary, a fluctuation of an initial density estimate on the logit scale constrains predicted probabilities to be between 0 and 1. This inherent enforcement of bounds has been extended to continuous outcomes. Simulation study results indicate that this approach is on a par with, and many times superior to, fluctuating on the linear scale, and in particular is more robust when there is sparsity in the data.

On September 20, 2010 the following acknowledgment was added for the article:

“This work was supported by NIH grant R01AI074345-04.”

Targeted maximum likelihood estimation (TMLE) yields semi-parametric efficient substitution estimators of parameters in semi-parametric models (van der Laan and Rubin, 2006). In particular, it can be applied to estimating the statistical counterpart of a causal parameter. In this article a new targeted maximum likelihood estimator for estimating a causal effect of a binary treatment on a continuous outcome is introduced. This estimator is more robust than a previously presented TMLE procedure when there is sparsity in the data that decreases the identifiability of the parameter of interest.

Sparsity is defined as low information in the dataset for the purpose of learning the target parameter. Formally, the Fisher information, *I*, is defined as sample size *n* divided by the variance of the efficient influence curve: *I* = *n/var*(*D** (*O*)), where *D** (*O*) is the efficient influence curve of the target parameter at the true data generating distribution. The reciprocal of the variance of the efficient influence curve can be viewed as the information one observation contains for the purpose of learning the target parameter. Since the variance of the efficient influence curve divided by *n* times the variance of an asymptotically efficient estimator converges to 1 when the sample size converges to infinity, one can also think of the information *I* as the reciprocal of the variance of an efficient estimator of the target parameter. Thus, sparsity with respect to a particular target parameter corresponds with small sample size relative to the variance of the efficient influence curve for that target parameter.
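As a concrete illustration of this sparsity measure, the sketch below computes *I* = *n*/var(*D** (*O*)) and the implied standard error of an efficient estimator. The influence curve values are a made-up stand-in (in practice they come out of the fitted estimator), so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for estimated efficient influence curve values D*(O_i);
# in practice these are produced by the estimation procedure itself.
n = 1000
D_star = rng.normal(loc=0.0, scale=2.0, size=n)

var_eic = D_star.var(ddof=1)      # var(D*(O))
info = n / var_eic                # Fisher information I = n / var(D*(O))
se_eff = np.sqrt(var_eic / n)     # implied SE of an efficient estimator

print(info, se_eff)
```

A small `info` relative to what is needed for a usefully narrow confidence interval is exactly the sparsity the text describes: many observations, little information about this particular parameter.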

Section 2 of the paper provides background on the application of TMLE methodology in the context of sparsity, and its power relative to other semi-parametric efficient estimators by being a substitution estimator respecting global constraints of the semi-parametric model. Even though an estimator can be asymptotically efficient without utilizing global constraints, the global constraints are instrumental in the context of sparsity with respect to the target parameter, motivating the need for semi-parametric efficient *substitution* estimators, and for a careful choice of fluctuation function for the targeted MLE step that fully respects these global constraints. A rigorous demonstration of the proposed targeted MLE of the causal effect of a binary treatment on a bounded continuous outcome follows, and it is contrasted to a targeted MLE that makes use of a fluctuation function that does not respect the bounds.

Simulation studies described in Section 3 compare the new TMLE estimator of the causal effect, which relies on a logistic fluctuation of an initial density estimate, with the traditional TMLE estimator, with and without sparsity in the data. Results for other commonly applied estimators, the inverse-probability-of-treatment weighted estimator (IPTW) (Hernan et al., 2000; Robins, 2000b), a double robust augmented IPTW estimator (aug-IPTW) (Robins and Rotnitzky, 2001; Robins et al., 2000; Robins, 2000a) that is efficient but not a substitution estimator, and the maximum likelihood substitution estimator according to a parametric model (MLE) (Robins, 1986) are also presented.

The targeted MLE is a semi-parametric efficient substitution estimator of a target parameter Ψ(*P*_{0}) of a true distribution *P*_{0} ∈ ℳ, known to be an element of a semi-parametric model ℳ, based on sampling *n* i.i.d. observations *O*_{1}, …, *O _{n}* from *P*_{0}.

Firstly, one notes that Ψ(*P*_{0}) = Ψ(*Q*_{0}) only depends on *P*_{0} through a relevant part *Q*_{0} = *Q*(*P*_{0}) of *P*_{0}. Secondly, one proposes a loss function *L*(*Q*)(*O*) so that *Q*_{0} = arg min_{*Q* ∈ 𝒬} *E*_{0}*L*(*Q*)(*O*), where 𝒬 = {*Q*(*P*) : *P* ∈ ℳ}. Thirdly, one uses minimum loss-based learning, such as super learning (van der Laan et al., 2007), fully utilizing the power and optimality results for loss-based cross-validation to select among candidate estimators, to obtain an initial estimator *Q _{n}*^{0} of *Q*_{0}. Fourthly, one proposes a parametric fluctuation {*Q _{n}*^{0}(*ε*) : *ε*}, possibly indexed by nuisance parameter *g*_{0} = *g*(*P*_{0}), so that

  d/d*ε* *L*(*Q _{n}*^{0}(*ε*)) |_{*ε* = 0} = *D** (*Q _{n}*^{0}, *g*_{0}),    (1)

where *D** (*Q*_{0}, *g*_{0}) is the canonical gradient/efficient influence curve of Ψ : ℳ → ℝ at *P*_{0}. Fifthly, one computes the amount of fluctuation

  *ε _{n}* = arg min_{*ε*} Σ_{*i*=1}^{*n*} *L*(*Q _{n}*^{0}(*ε*))(*O _{i}*),

where *g _{n}* is an estimator of the unknown nuisance parameter *g*_{0}, and one sets *Q _{n}*^{1} = *Q _{n}*^{0}(*ε _{n}*). This updating step is iterated until the fitted fluctuation equals zero, yielding a final estimate *Q _{n}** that solves the efficient influence curve estimating equation

  0 = Σ_{*i*=1}^{*n*} *D** (*Q _{n}**, *g _{n}*)(*O _{i}*),

representing a fundamental ingredient for establishing asymptotic efficiency of Ψ(*Q _{n}**): recall that an estimator is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve *D** (*Q*_{0}, *g*_{0}). Finally, the targeted MLE of *ψ*_{0} is the substitution estimator *ψ _{n}* = Ψ(*Q _{n}**).

Thus we see that the targeted MLE involves constructing a parametric model {*Q _{n}*^{0}(*ε*) : *ε*} through the initial estimator *Q _{n}*^{0}, with parameter *ε* representing an amount of fluctuation of the initial estimator, where the score of this fluctuation model at *ε* = 0 equals the efficient influence curve. The latter constraint can be satisfied by many parametric models, since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, a violation of this property can heavily affect the performance of the estimator.

One important strength of the semi-parametric efficient targeted MLE relative to the alternative semi-parametric efficient estimating equation methodology (van der Laan and Robins, 2003) is that it respects the global constraints of the observed data model, since it is a substitution estimator *ψ _{n}* = Ψ(*Q _{n}**) with *Q _{n}** an estimator of a relevant part *Q*_{0} of the true distribution of the data in the observed data model. The estimating equation methodology does not result in substitution estimators and thereby often ignores important global constraints of the observed data model; Tan (2008) introduces a non-parametric likelihood based approach to constructing a double robust estimator that is not a substitution estimator, and offers a comparison with other estimators, including a TMLE that is not constrained to remain within the bounds of the observed data model. Ignoring constraints comes at a price in the context of sparsity. Indeed, simulations have confirmed this gain of targeted MLE relative to the efficient estimating equation method in the context of sparsity (Stitelman and van der Laan, 2010), and it is again demonstrated in this article. However, if the targeted MLE starts violating this principle of being a substitution estimator by allowing *Q _{n}** to fall outside the assumed observed data model, this advantage is compromised. Therefore, it is crucial to use a fluctuation model that is guaranteed to stay within the wished observed data model.

To demonstrate this important consideration of selecting a valid fluctuation model in the construction of targeted MLE, we consider the problem of estimating a causal effect of a binary treatment *A* on a continuous outcome *Y*, based on observing *n* i.i.d. copies of *O* = (*W, A, Y*) ~ *P*_{0}, where *W* is the set of confounders. Under the non-parametric structural equation model (NPSEM) *W* = *f _{W}* (*U _{W}*), *A* = *f _{A}*(*W*, *U _{A}*), *Y* = *f _{Y}*(*W*, *A*, *U _{Y}*), with the randomization assumption that *A* is independent of the counterfactual outcomes given *W*, the additive causal effect is identified by the statistical parameter Ψ(*P*_{0}) = *E*_{0}[*E*_{0}(*Y* | *A* = 1, *W*) – *E*_{0}(*Y* | *A* = 0, *W*)].

Suppose that it is known that *Y* ∈ [*a*, *b*] for some *a* < *b*. Alternatively, one might have truncated the original data to fall in such an interval and focus on the causal effect of treatment on this truncated outcome, motivated by the fact that estimating conditional means of unbounded, or very heavy tailed, outcomes requires very large data sets.

Let *Y** = (*Y* – *a*)/(*b* – *a*) be the linearly transformed outcome within [0, 1], and define Ψ* (*P*_{0}) = *E*_{0}[*E*_{0}(*Y** | *A* = 1, *W*) – *E*_{0}(*Y** | *A* = 0, *W*)]. We note that Ψ(*P*_{0}) = (*b* – *a*)Ψ* (*P*_{0}).

An estimate, limit distribution, and confidence interval for Ψ* (*P*_{0}) is now immediately mapped into an estimate, limit distribution, and confidence interval for Ψ(*P*_{0}), by simple multiplication by (*b* – *a*). As a consequence, without loss of generality, we can assume *a* = 0 and *b* = 1 so that *Y* ∈ [0, 1].
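The rescaling argument above is purely mechanical; a minimal sketch, with hypothetical bounds, outcomes, estimate, and confidence interval chosen only for illustration:

```python
import numpy as np

# Hypothetical known bounds a < b and outcomes in [a, b].
a, b = -2.0, 8.0
Y = np.array([-1.0, 0.5, 3.0, 7.5])

Y_star = (Y - a) / (b - a)              # transformed outcome, lies in [0, 1]

# A hypothetical point estimate and 95% CI on the Y* scale...
psi_star, ci_star = 0.12, (0.08, 0.16)

# ...map back to the original scale by multiplying by (b - a).
psi = (b - a) * psi_star
ci = ((b - a) * ci_star[0], (b - a) * ci_star[1])
print(psi, ci)
```

Because the map is linear, standard errors and interval endpoints transform by the same factor (*b* – *a*), which is why all subsequent work can be done on the [0, 1] scale.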

The efficient influence curve of the statistical parameter Ψ : ℳ → ℝ, defined on a non-parametric statistical model ℳ for *P*_{0}, at the true distribution *P*_{0}, is given by

  *D** (*Q*_{0}, *g*_{0})(*O*) = (2*A* – 1)/*g*_{0}(*A* | *W*) (*Y* – *Q̄*_{0}(*A*, *W*)) + *Q̄*_{0}(1, *W*) – *Q̄*_{0}(0, *W*) – Ψ(*Q*_{0}),

where *Q̄*_{0}(*W*, *A*) = *E*_{0}(*Y* | *A*, *W*), and *Q*_{0} = (*Q _{W}*_{,0}, *Q̄*_{0}) denotes both the marginal distribution of *W* and this conditional mean.
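Given working estimates of *Q̄* and *g*, the efficient influence curve for this additive-effect parameter can be evaluated observation by observation. In the sketch below the working models `Qbar` and `g1` and the simulated data are invented purely for illustration:

```python
import numpy as np

def eic(W, A, Y, Qbar, g1, psi):
    """Efficient influence curve for psi = E[Qbar(1,W) - Qbar(0,W)].

    Qbar(a, W): estimate of E(Y | A=a, W), with values in (0, 1)
    g1(W):      estimate of P(A=1 | W)
    psi:        current value of the target parameter
    """
    gA = np.where(A == 1, g1(W), 1 - g1(W))   # g(A | W)
    h = (2 * A - 1) / gA                      # clever covariate
    return h * (Y - Qbar(A, W)) + Qbar(1, W) - Qbar(0, W) - psi

# Hypothetical working models and data, for illustration only.
rng = np.random.default_rng(1)
W = rng.uniform(size=500)
g1 = lambda w: 0.2 + 0.6 * w
A = rng.binomial(1, g1(W))
Qbar = lambda a, w: 1 / (1 + np.exp(-(0.5 * a + w - 0.7)))
Y = np.clip(Qbar(A, W) + rng.normal(0, 0.05, size=W.size), 0.001, 0.999)

psi = np.mean(Qbar(1, W) - Qbar(0, W))
D = eic(W, A, Y, Qbar, g1, psi)
print(D.mean(), D.var())
```

The sample variance of these values is exactly the quantity entering the sparsity measure *I* = *n*/var(*D** (*O*)) discussed in the introduction.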

We will now define a targeted MLE of Ψ(*Q*_{0}) as follows. Let *Q̄ _{n}*^{0} be an initial estimator of *Q̄*_{0}(*W*, *A*) = *E*(*Y* | *A*, *W*) with predicted values in (0, 1). In addition, we estimate the marginal distribution of *W* with the empirical distribution of *W*_{1}, …, *W _{n}*.

We can represent the estimator as *Q̄ _{n}*^{0} = 1/(1 + exp(–*f _{n}*^{0})) with *f _{n}*^{0} = logit(*Q̄ _{n}*^{0}). Consider now the fluctuation model

  logit *Q̄ _{n}*^{0}(*ε*) = logit *Q̄ _{n}*^{0} + *εh*(*g _{n}*),

with parameter *ε*, indexed by the function

  *h*(*g*)(*A*, *W*) = (2*A* – 1)/*g*(*A* | *W*).

Equivalently, we can write this as *Q̄ _{n}*^{0}(*ε*) = 1/(1 + exp(–*f _{n}*^{0} – *εh*(*g _{n}*))).

Consider now the following loss function for *Q̄*_{0}:

  *L*(*Q̄*)(*O*) = –[*Y* log *Q̄*(*A*, *W*) + (1 – *Y*) log(1 – *Q̄*(*A*, *W*))].

Note that this is the log-likelihood of the conditional distribution of a binary outcome *Y*, but now extended to continuous outcomes in [0*,* 1]. (See also Wedderburn (1974), McCullagh (1983) for earlier use of logistic regression for continuous outcomes.) It is thus known that this loss function is a valid loss function for the conditional distribution of a binary *Y*, but we need it to also be a valid loss function for the conditional mean of a continuous *Y* ∈ [0*,* 1]. The following lemma establishes this result.
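The validity claim can also be checked numerically: because the loss is linear in *Y*, its expectation depends on the outcome only through its conditional mean *p*_{0}, and a grid search recovers *p*_{0} as the minimizer. A small sketch (the value of *p*_{0} is an arbitrary choice):

```python
import numpy as np

# Numerical check: for a continuous Y in [0,1] with conditional mean p0, the
# expected quasi-binomial loss -[Y log p + (1-Y) log(1-p)] is minimized at p0.
p0 = 0.37                              # hypothetical true conditional mean
grid = np.linspace(0.01, 0.99, 981)    # candidate values of p

# E[L(p)] is linear in Y, so replace Y by its mean p0 directly.
expected_loss = -(p0 * np.log(grid) + (1 - p0) * np.log(1 - grid))
p_min = grid[np.argmin(expected_loss)]
print(p_min)
```

This mirrors the lemma's argument: the derivative of the expected loss vanishes exactly at the true conditional mean, regardless of whether *Y* is binary or continuous in [0, 1].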

**Lemma 1** *We have that*

  *Q̄*_{0} = arg min_{*Q̄*} *E*_{0}*L*(*Q̄*)(*O*),

*where the minimum is taken over all functions of (W, A) which map into* (0, 1). *In addition, define the fluctuation function*

  logit *Q̄*(*ε*) = logit *Q̄* + *εh*.

*For any function h we have*

  d/d*ε* *L*(*Q̄*(*ε*)) |_{*ε* = 0} = –*h*(*W*, *A*)(*Y* – *Q̄*(*W*, *A*)).

**Proof:** Let *Q̄*_{1} be a local minimum and consider the fluctuation *Q̄*_{1}(*ε*) defined above. Then the derivative of *E*_{0}*L*(*Q̄*_{1}(*ε*)) at *ε* = 0 equals zero. However, by the displayed identity this derivative equals –*E*_{0}*h*(*W*, *A*)(*Q̄*_{0} – *Q̄*_{1})(*W*, *A*). Thus, it follows that

  *E*_{0}*h*(*W*, *A*)(*Q̄*_{0} – *Q̄*_{1})(*W*, *A*) = 0.

But this needs to hold for any function *h*(*W*, *A*), which proves that *Q̄*_{1} = *Q̄*_{0} a.e.

This proves that *L*(*Q̄*) is a valid loss function for the conditional mean *Q̄*_{0}. Indeed, we can use *L*(*Q̄*) as loss function to construct an initial estimator of *Q̄*_{0}, and/or use cross-validation to select among candidate targeted maximum likelihood estimators, such as in the collaborative targeted MLE procedure. For the purpose of constructing an initial estimator one could also use a minimum loss-based super learner based on the squared error loss function *L*_{2}(*Q̄*) = (*Y* – *Q̄*(*W*, *A*))^{2}, possibly with weights.

Given an initial estimator *Q̄ _{n}*^{0} and our proposed fluctuation function logit *Q̄ _{n}*^{0}(*ε*) = logit *Q̄ _{n}*^{0} + *εh*(*g _{n}*), we have

  d/d*ε* *L*(*Q̄ _{n}*^{0}(*ε*)) |_{*ε* = 0} = –*h*(*g _{n}*)(*A*, *W*)(*Y* – *Q̄ _{n}*^{0}(*A*, *W*)),

giving us (up to sign) the wished first component (2*A* – 1)/*g _{n}*(*A* | *W*) (*Y* – *Q̄ _{n}*^{0}(*A*, *W*)) of the efficient influence curve.

Let’s use the log-likelihood loss function, −*logQ _{W}*, as loss function for the marginal distribution of *W*, together with a fluctuation *Q _{Wn}*(*ε*_{1}) of the empirical distribution *Q _{Wn}* whose score at *ε*_{1} = 0 equals the remaining component *Q̄*(1, *W*) – *Q̄*(0, *W*) – Ψ(*Q*) of the efficient influence curve.

This shows that we succeeded in defining a loss function for *Q*_{0} = (*Q _{W}*_{,0}, *Q̄*_{0}), and corresponding fluctuation functions, whose scores at zero fluctuation span the efficient influence curve.

The MLE of *ε*_{1} equals zero, so that the update of *Q _{Wn}* equals *Q _{Wn}* itself: the empirical distribution of *W* is not updated.

The amount of fluctuation *ε* for fluctuating *Q̄ _{n}*^{0} is given by

  *ε _{n}* = arg min_{*ε*} Σ_{*i*=1}^{*n*} *L*(*Q̄ _{n}*^{0}(*ε*))(*O _{i}*).

This “maximum likelihood” estimator of *ε* can be computed with generalized linear regression using the binomial link, i.e. the logistic regression MLE procedure, simply ignoring that the outcome is not binary, which also corresponds with iterative re-weighted least squares estimation using weights 1/(*Q̄*(1 – *Q̄*)).
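A minimal sketch of this targeting step, fitting *ε* by direct minimization of the empirical quasi-binomial loss over a grid (a dependency-free stand-in for the logistic regression MLE; all inputs below are simulated placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical inputs: initial predictions Qbar0 in (0,1), clever covariate h,
# and continuous outcomes Y already scaled into [0,1].
Qbar0 = np.clip(rng.uniform(0.2, 0.8, n), 0.005, 0.995)
h = rng.choice([-2.5, 4.0], size=n)          # stand-in for (2A-1)/g(A|W)
Y = np.clip(Qbar0 + 0.1 * rng.normal(size=n), 0.01, 0.99)

logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))

def loss(eps):
    # empirical quasi-binomial loss at fluctuation eps
    q = np.clip(expit(logit(Qbar0) + eps * h), 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(q) + (1 - Y) * np.log(1 - q))

# Grid search for the minimizing eps (the grid includes eps = 0).
eps_n = min(np.linspace(-1, 1, 4001), key=loss)
Qbar1 = expit(logit(Qbar0) + eps_n * h)      # updated predictions stay in (0,1)
print(eps_n)
```

In practice one would instead fit *ε* with standard GLM software using a binomial family and logit(*Q̄ _{n}*^{0}) as an offset; because the update acts on the logit scale, the fluctuated predictions cannot leave (0, 1) no matter how large *h* is.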

This provides us with the targeted MLE update *Q _{n}*^{1} = (*Q _{Wn}*, *Q̄ _{n}*^{1}), where the empirical distribution of *W* did not get updated, and *Q̄ _{n}*^{0} did get updated as *Q̄ _{n}*^{1} = *Q̄ _{n}*^{0}(*ε _{n}*). Iterating this procedure now defines the targeted MLE *Q _{n}**, but as in the binary outcome case, we have *Q _{n}** = *Q _{n}*^{1}, since the next MLE of *ε* equals zero. Thus convergence occurs in one step. The targeted MLE of *ψ*_{0} is thus given by the substitution estimator

  *ψ _{n}* = Ψ(*Q _{n}**) = 1/*n* Σ_{*i*=1}^{*n*} {*Q̄ _{n}**(1, *W _{i}*) – *Q̄ _{n}**(0, *W _{i}*)}.

As predicted, we have that the targeted MLE solves the efficient influence curve estimating equation 0 = Σ_{*i*=1}^{*n*} *D** (*Q _{n}**, *g _{n}*)(*O _{i}*).

An inspection of this efficient influence curve reveals that there are two potential sources of sparsity. Small values for *g*_{0}(*A* | *W*) and large outlying values of *Y* inflate the variance. Enforcing (e.g., known) bounds on *Y* and *g*_{0} in the estimation procedure provides a means for controlling these sources of variance. We note that, even if there is strong confounding causing some large values of the clever covariate *h*(*g _{n}*)(*A _{i}*, *W _{i}*), the resulting targeted MLE *Q̄ _{n}** remains bounded in (0, 1), so that the targeted MLE *ψ _{n}* fully respects the global constraints of the observed data model. On the other hand, the augmented IPTW estimator obtained by solving 0 = Σ_{*i*} *D** (*Q _{n}*, *g _{n}*, *ψ*)(*O _{i}*) in *ψ* yields the estimator

  *ψ _{n}*^{aug-IPTW} = 1/*n* Σ_{*i*=1}^{*n*} {(2*A _{i}* – 1)/*g _{n}*(*A _{i}* | *W _{i}*) (*Y _{i}* – *Q̄ _{n}*(*A _{i}*, *W _{i}*)) + *Q̄ _{n}*(1, *W _{i}*) – *Q̄ _{n}*(0, *W _{i}*)},

which can easily fall outside [0, 1] if for some observations *W _{i}* the estimated treatment probability *g _{n}*(*A _{i}* | *W _{i}*) is close to zero.
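The contrast can be made concrete with a deliberately sparse toy dataset in which one treated subject carries an extreme inverse weight. Every number below is fabricated solely to exhibit the mechanism: the augmented IPTW estimate escapes [0, 1], while the logistic-fluctuation TMLE, being a substitution estimator built from predictions in (0, 1), cannot.

```python
import numpy as np

logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))

# Fabricated sparse toy data: the first subject is treated with
# g(1|W) = 0.001, i.e. an inverse weight of 1000.
A  = np.array([1, 0, 0, 1, 0])
Y  = np.array([1.0, 0.3, 0.5, 0.6, 0.4])       # outcomes already in [0, 1]
g1 = np.array([0.001, 0.5, 0.5, 0.5, 0.5])     # P(A=1 | W_i)
gA = np.where(A == 1, g1, 1 - g1)              # g(A_i | W_i)
h  = (2 * A - 1) / gA                          # clever covariate

Qbar  = np.full(5, 0.5)                        # initial Qbar(A_i, W_i)
Qbar1 = np.full(5, 0.5)                        # initial Qbar(1, W_i)
Qbar0 = np.full(5, 0.4)                        # initial Qbar(0, W_i)

# Augmented IPTW: inverse-weighted residual term plus plug-in term.
psi_aug = np.mean(h * (Y - Qbar) + Qbar1 - Qbar0)

# Logistic-fluctuation TMLE: fit eps by grid search on the quasi-binomial loss.
def loss(eps):
    q = np.clip(expit(logit(Qbar) + eps * h), 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(q) + (1 - Y) * np.log(1 - q))

eps_n = min(np.linspace(-0.5, 0.5, 10001), key=loss)

# Updated predictions at A=1 and A=0 remain inside (0, 1) by construction.
h1, h0 = 1 / g1, -1 / (1 - g1)
psi_tmle = np.mean(expit(logit(Qbar1) + eps_n * h1)
                   - expit(logit(Qbar0) + eps_n * h0))

print(psi_aug, psi_tmle)   # aug-IPTW escapes [0, 1]; the TMLE cannot
```

The single inverse weight of 1000 drives the aug-IPTW estimate to roughly 100, while the TMLE absorbs the same observation through a bounded logit-scale update.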

**Contrasting with targeted MLE using linear fluctuation function.** Alternatively, one could employ the targeted MLE using the *L*_{2}(*Q̄*) = (*Y* – *Q̄*(*W*, *A*))^{2} loss function, and fluctuation function *Q̄*^{0}(*ε*) = *Q̄*^{0} + *εh*(*g*), so that (1) is still satisfied. In this case, large values of *h*(*g*) will result in predicted values of *Q̄*^{0}(*ε _{n}*) that fall outside the bounds [0, 1].

Two simulation studies illustrate the effects of employing a logistic vs. linear fluctuation on TMLE estimator performance with and without sparsity in the data, where a high degree of sparsity corresponds to a target parameter that is borderline-identifiable. As above, the parameter of interest is defined as the marginal effect of a binary point treatment on the outcome, *ψ*_{0} = *E _{W}* [*E*(*Y* | *A* = 1, *W*) – *E*(*Y* | *A* = 0, *W*)].

The “traditional” targeted maximum likelihood approach to estimating an additive treatment effect when the outcome is continuous is to fluctuate the initial density estimate on a linear scale. Given *Q̄ _{n}*^{0}, an initial estimate of the conditional mean of *Y* given (*W*, *A*), the fluctuation function is defined as *Q̄ _{n}*^{0}(*ε*) = *Q̄ _{n}*^{0} + *εh*(*g _{n}*), and the loss function *L*(*Q̄*) is chosen to be the squared error loss function, so that we still have the required constraint (1). The estimate *ε _{n}* can be obtained as the coefficient from a linear regression of *Y* on *h*(*g _{n}*)(*A*, *W*), using *Q̄ _{n}*^{0}(*A*, *W*) as an offset.

A second TMLE estimate using the logistic fluctuation method described in Section 2 is also obtained. *Y* is transformed into *Y** ∈ [0, 1] by shifting and scaling the values. In the simulation setting, *Y* is not bounded, so that we do not have a priori bounds *a* and *b* on *Y*. Instead of truncating *Y* and redefining the target parameter as the causal effect on the truncated *Y*, we still aim to estimate the causal effect on the original *Y*. Therefore, we set *a* = min(*Y*), *b* = max(*Y*), and *Y** = (*Y* – *a*)/(*b* – *a*).

An initial estimate, *Q̄ _{n}*^{0}, is obtained, and then represented as a logistic function of its logit-transformation. Note that logit(*x*) is not defined when *x* = 0 or 1, therefore in practice *Q̄ _{n}*^{0} is bounded away from 0 and 1 by truncating it at (*α,* (1 – *α*)). We used *α* = 0.005 in these simulation studies, which did not yield appreciably different results than setting *α* = 0.001 or *α* = 0.01. The function *Q̄ _{n}*^{0} is fluctuated on the logit scale, logit *Q̄ _{n}*^{0}(*ε*) = logit *Q̄ _{n}*^{0} + *εh _{gn}*, using the same clever covariate, *h _{gn}* (*A*, *W*), as the linear fluctuation. The resulting substitution estimator on the *Y** scale, multiplied by (*b* – *a*), is the targeted MLE of the wished additive causal effect *ψ*_{0}.
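Putting the steps of this procedure together, a compact end-to-end sketch (scale by the observed range, truncate at *α* = 0.005, fluctuate on the logit scale, map back) might look as follows. The data generating process and the simple arm-means initial fit are invented for illustration, and the true treatment probabilities stand in for an estimated *g _{n}*:

```python
import numpy as np

logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))

rng = np.random.default_rng(3)
n = 500

# Simulated data (invented): unbounded continuous Y, binary A, one
# confounder W; the true additive effect is 1.
W = rng.normal(size=n)
g1 = expit(0.4 * W)                      # true P(A=1 | W), used as g_n here
A = rng.binomial(1, g1)
Y = 1.0 * A + W + rng.normal(size=n)

# Scale Y into [0, 1] using the observed range: a = min(Y), b = max(Y).
a, b = Y.min(), Y.max()
Ys = (Y - a) / (b - a)

# Misspecified initial estimate of E(Y* | A, W): arm-specific means only,
# truncated away from 0 and 1 at alpha = 0.005.
alpha = 0.005
m1, m0 = Ys[A == 1].mean(), Ys[A == 0].mean()
Q  = np.clip(np.where(A == 1, m1, m0), alpha, 1 - alpha)
Q1 = np.clip(np.full(n, m1), alpha, 1 - alpha)
Q0 = np.clip(np.full(n, m0), alpha, 1 - alpha)

# Logistic fluctuation with clever covariate h = (2A - 1) / g(A | W).
gA = np.where(A == 1, g1, 1 - g1)
h, h1, h0 = (2 * A - 1) / gA, 1 / g1, -1 / (1 - g1)

def loss(eps):
    q = np.clip(expit(logit(Q) + eps * h), 1e-12, 1 - 1e-12)
    return -np.mean(Ys * np.log(q) + (1 - Ys) * np.log(1 - q))

eps_n = min(np.linspace(-1, 1, 4001), key=loss)

# Targeted substitution estimate on the Y* scale, mapped back via (b - a).
psi_star = np.mean(expit(logit(Q1) + eps_n * h1) - expit(logit(Q0) + eps_n * h0))
psi_tmle = (b - a) * psi_star
print(psi_tmle)
```

Even though the initial fit ignores *W* entirely, the targeting step with a consistent treatment mechanism pulls the estimate toward the true effect, illustrating the double robustness discussed next.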

Parameter estimates were also obtained using the augmented inverse probability of treatment weighted estimator (aug-IPTW) given above. Both the targeted MLE and the augmented IPTW estimator are double robust, so that these estimators will be consistent for *ψ*_{0} if either *g _{n}* is consistent for *g*_{0} or *Q̄ _{n}* is consistent for *Q̄*_{0}.

In this simulation study we will use simple parametric MLE’s as initial estimators *Q̄ _{n}* and *g _{n}*, even though we recommend the utilization of super learning in practice. The purpose of this simulation is to investigate the performance of the updating step under misspecified and correctly specified *Q̄ _{n}*, and for that purpose we can work with parametric MLE fits.

Results from two estimation methods that are neither double robust nor semi-parametric efficient are included as well. The maximum likelihood estimator according to a parametric model for *Q̄*_{0} (MLE), used as initial estimator in the targeted MLE and augmented IPTW, is included for the sake of evaluating the bias reduction step carried out by these two semi-parametric efficient procedures. Inverse probability of treatment weighted (IPTW) estimators are consistent when *g _{n}* is a consistent estimator of *g*_{0}, but they are not asymptotically efficient.

Covariates *W*_{1}, *W*_{2}, *W*_{3} were generated as independent binary random variables.

Two treatment mechanisms were defined that differ only in the values of the coefficients for each covariate; we refer to these two treatment mechanisms as *g*_{0,1} and *g*_{0,2}, respectively. The observed outcome *Y* was generated from a linear model in *A* and the covariates.

For both simulations the true additive causal effect equals one: *ψ*_{0} = 1. In both simulations predicted values for *g _{n}*(*A* | *W*) were bounded at (0.01, 0.99).

Estimates were obtained for 1000 samples of size *n* = 1000 from each data generating distribution. Treatment assignment probabilities, *g*_{0}(*A* | *W*), were estimated using a correctly specified logistic regression model. A correctly specified main terms regression model was used to obtain *Q̄ _{n}*. In addition, a misspecified initial estimate was obtained by regressing *Y* on *A* only.

We expect MLE estimates based on the correctly specified *Q̄ _{n}* to be unbiased and efficient, while those based on the misspecified fit will be biased. IPTW estimates only depend on consistent estimation of *g*_{0}, thus are identical regardless of how *Q̄*_{0} is estimated. For both simulations *g _{n}* is a consistent estimator, thus it is reasonable to expect unbiased IPTW estimates, with more variation in simulation 2 estimates. The targeted MLE and the augmented IPTW are known to be consistent if either *Q̄ _{n}* or *g _{n}* is consistent, so both are expected to be unbiased under either specification of *Q̄ _{n}*.

Table 1 reports the average estimate, bias, empirical variance, and mean squared error (MSE) for each estimator, under different specifications of the initial estimator *Q̄ _{n}*^{0}. In all cases *g _{n}* is consistent, and bounded at (0.01, 0.99).

Estimator performance for simulations 1 and 2 when the initial estimator of _{0} is correct and misspecified. Results are based on 1000 samples of size *n* = 1000.

In simulation 2 all estimators except IPTW are unbiased when *Q̄*_{0} is correctly estimated. In this case, both TMLE estimators have higher variance than aug-IPTW, and all three are more efficient than IPTW, but less efficient than the parametric MLE estimator. Though asymptotically the IPTW estimator is expected to be unbiased in this simulation, since *g _{n}* is a consistent estimator of *g*_{0}, the small treatment assignment probabilities under *g*_{0,2} lead to noticeable finite-sample bias and variance.

When the model for *Q̄*_{0} is misspecified in simulation 2, the MLE estimator is even more biased than it was in simulation 1. The efficiency of all three double-robust efficient estimators suffers in comparison with simulation 1 as well. Nevertheless, TMLE* _{log}*, using the logistic fluctuation, has the lowest MSE of all estimators. Its superiority over the TMLE using the linear fluctuation is most pronounced in this sparse-data setting.

When an estimation procedure incorporates weights, observations with large weights can heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it generally introduces bias. The presented TMLE of an additive causal effect of a point treatment intervention, incorporating a logistic fluctuation of the initial conditional mean estimate, dampens the effect of these heavily weighted observations, thereby heavily reducing the reliance on truncation. As a substitution estimator, the proposed TMLE of the additive causal effect respects the global constraints of the observed data model. Simulation study results indicate that this approach is on a par with, and in the context of sparsity often superior to, fluctuating on the linear scale. In particular it is more robust when there is sparsity in the data, outperforming MLE, IPTW, and aug-IPTW.

For the sake of demonstration we considered estimation of the additive causal effect. However, the same targeted MLE, using the logistic fluctuation, can be used to estimate other point-treatment causal effects, including parameters of a marginal structural model. The newly proposed loss function also has applications in prediction of a bounded outcome, and for targeted MLE of the causal effect of a multiple time point intervention in which the final outcome is bounded and continuous. We also pointed out that the proposed fluctuation function and loss function, and corresponding targeted MLE, should also be used for continuous outcomes for which no a priori bounds are known, by simply using the minimal and maximal observed outcome values. In this way, these choices naturally robustify the targeted MLE by enforcing that the updated initial estimator will not predict outcomes outside the observed range.

The TMLE approach presented here using a logistic fluctuation of an initial estimate of the conditional mean of the continuous outcome retains all properties of targeted maximum likelihood estimators, including influence curve-based inference. The method presented here extends to collaborative targeted maximum likelihood estimation without modification.

Susan Gruber, University of California, Berkeley.

Mark J. van der Laan, University of California, Berkeley.

- Dehejia RH, Wahba S. Propensity score matching methods for nonexperimental causal studies. The Review of Economics and Statistics. 2002;84:151–61. doi: 10.1162/003465302317331982. [Cross Ref]
- Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561–570. doi: 10.1097/00001648-200009000-00012. [PubMed] [Cross Ref]
- McCullagh P. Quasi-likelihood functions. Annals of Statistics. 1983;11:59–67. doi: 10.1214/aos/1176346056. [Cross Ref]
- Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer” Statistica Sinica. 2001;11(4):920–936.
- Robins JM, Rotnitzky A, van der Laan MJ. Comment on “On Profile Likelihood” by S.A. Murphy and A.W. van der Vaart. Journal of the American Statistical Association – Theory and Methods. 2000;450:431–435. doi: 10.2307/2669381. [Cross Ref]
- Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [Cross Ref]
- Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association. 2000a
- Robins JM. Statistical models in epidemiology, the environment, and clinical trials (Minneapolis, MN, 1997) Springer; New York: 2000b. Marginal structural models versus structural nested models as tools for causal inference; pp. 95–133.
- Stitelman OM, van der Laan MJ. Collaborative targeted maximum likelihood for time to event data. Technical Report 260, Division of Biostatistics, University of California, Berkeley; 2010. [PubMed]
- Stukel TA, Fisher ES, Wennberg DE, Alter DA, Gottlieb DJ, Vermeulen MJ. Analysis of observational studies in the presence of treatment selection bias: Effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA. 2007;297:278–85. doi: 10.1001/jama.297.3.278. [PMC free article] [PubMed] [Cross Ref]
- Tan Z. Bounded, efficient, and doubly robust estimation with inverse weighting. Biometrika. 2008;94:1–22.
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003.
- van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [Cross Ref]
- van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25) doi: 10.2202/1544-6115.1309. [PubMed] [Cross Ref]
- van der Laan MJ, Rose S, Gruber S. Readings in targeted maximum likelihood estimation. Technical Report 254, Division of Biostatistics, University of California, Berkeley; Sep 2009.
- Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika. 1974;61(3):439–447.
