Int J Biostat. 2011 January 1; 7(1): Article 31.

Published online 2011 August 17. doi: 10.2202/1557-4679.1308

PMCID: PMC3173607

Kristin E. Porter, University of California, Berkeley;

Copyright © 2011 The Berkeley Electronic Press. All rights reserved

There is an active debate in the literature on censored data about the relative performance of model based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins, et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs that guarantee that the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcomes, are especially suitable for dealing with positivity violations because in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.

The translation of a scientific question into a statistical estimation problem often involves the formulation of a full-data structure, a target parameter of the full-data probability distribution representing the scientific question of interest, and an observed data structure which can be viewed as a mapping on the full data structure and a censoring variable. One must identify the target parameter of the full-data distribution from the probability distribution of the observed data structure, which often requires particular modeling assumptions such as the coarsening at random assumption on the censoring mechanism (i.e., the conditional distribution of censoring, given the full-data structure). The statistical problem is then reduced to a pure estimation problem defined by the challenge of constructing an estimator of the estimand, defined by the identifiability result for the target parameter of the full-data distribution. The estimator should respect the statistical model implied by the posed assumptions on the censoring mechanism and the full-data distribution.

For semiparametric (e.g., nonparametric) statistical models, many estimators rely in one way or another on the inverse probability of censoring weights (IPCW). Such estimators can be biased and highly variable under practical or theoretical violations of the positivity assumption, which is a support condition on the censoring mechanism that is necessary to establish the identifiability of the target parameter (e.g., Robins (1986, 1987, 1999); Neugebauer and van der Laan (2005); Petersen et al. (2010)). A particular class of estimators are the so-called double robust estimators (e.g., van der Laan and Robins (2003)). Double robust (DR) estimators, which rely on both IPCW and a model of the full-data distribution, are not necessarily protected from the bias or inflated variance that can result from positivity violations, and in recent literature, there is much debate on the relative performance of DR estimators when the positivity assumption is violated. In particular, Kang and Schafer (2007) (KS) demonstrate the fragility of DR estimators in a simulation study with near, or practical, positivity violations. They focus on a simple missing data problem in which one wishes to estimate the mean of an outcome that is subject to missingness and all possible covariates for predicting missingness are measured. Responses by Robins et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by DR estimators and offer suggestions for improving their stability.

Under regularity conditions, DR estimators are asymptotically unbiased if either the estimator of the conditional expectation of the outcome given the covariates or the estimator of the conditional probability of missingness given the covariates is consistent. DR estimators are semiparametric efficient (for the nonparametric model for the full-data distribution) if both of these estimators are consistent. In their article, KS introduce a variety of DR estimators and compare them to non-DR IPCW estimators as well as a simple parametric model based ordinary least squares (OLS) estimator. As the KS simulation has practical positivity violations, some values of both the true and estimated missingness mechanism are very close to zero. In this situation, the IPCW will be extremely large for some observations of the sample. Therefore, DR and non-DR estimators that rely on IPCW may be unreliable. As a result, KS warn against the routine use of estimators that rely on IPCW, including DR estimators. This is in agreement with other literature analyzing the issue. For an overview of the issue, see, for example, Robins (1986, 1987, 1999); Robins and Wang (2000); van der Laan and Robins (2003). For simulations demonstrating the extreme sparsity bias of IPCW-estimators, see, for example, Neugebauer and van der Laan (2005). Also, Petersen et al. (2010); Wang et al. (2006a); Moore et al. (2009); Cole and Hernan (2008); Kish (1992); and Bembom and van der Laan (2008) have focused on diagnosing violations of the positivity assumption in response to this concern. Bembom and van der Laan (2008) presented data adaptive selection of the truncation constant to control the influence of weighting. In addition, van der Laan and Petersen (2007) and Petersen et al. (2010) discussed selecting parameters that rely on realistic assumptions.

The particular simulation in KS also gives rise to a situation in which under dual misspecification, the OLS estimator outperforms all of the presented DR estimators. While this is an interesting issue, it is not the main focus of this article. In our view, dual misspecification brings up the need for other strategies for improving the robustness of estimators in general, such as incorporating data adaptive estimation instead of relying on parametric regression models for the missingness mechanism and the conditional distribution of responses, an idea echoed in the responses by Tsiatis and Davidian (2007) and Ridgeway and McCaffrey (2007), and standardly incorporated in the UC Berkeley literature on targeted maximum likelihood estimation (e.g., van der Laan and Rubin (2006); van der Laan et al. (2009)). In particular, we note that a statistical estimation problem is also defined by the statistical model, which, in this case, is defined by a nonparametric model: such models require data adaptive estimators in order to claim that the estimator is consistent. Nonetheless, we explicitly demonstrate the impact of the utilization of machine learning on the simulation results in a final section of this article.

In their response to the KS paper, Robins et al. (2007) point out that a desirable property of DR estimators is “boundedness,” in that for a finite sample, estimators of the mean response fall in the parameter space with probability 1. Estimators that impose such a restriction can introduce new bias but avoid the challenges of highly variable weights. Robins et al. (2007) discuss ways in which to guarantee that “boundedness” holds and present two classes of bounded estimators: regression DR estimators and bounded Horvitz-Thompson DR estimators. We define examples of these estimators below, and we evaluate their relative performance. The response by Tsiatis and Davidian (2007) offers strategies for constructing estimators that are more robust under the circumstances in the KS simulations. In particular, to address positivity violations, they suggest an estimator that uses IPCW only for observations with missingness mechanism values that are not close to zero, while using regression predictions for the observations with very small missingness mechanism values. One might consider either a hard cutoff for dividing observations or weighting each part of the influence curve by the estimated missingness mechanism. Tan (2007) also points to an improved locally efficient double robust estimator (Tan (2006)) that maintains double robustness and provides guaranteed improvement relative to an initial estimator, improving on estimators of this type that had an algebraically similar form but failed to guarantee both properties (Robins et al. (1994); see also van der Laan and Robins (2003)). Many responders also make valuable suggestions regarding the dual misspecification challenge.

In the current paper, adapted in part from Sekhon et al. (2011), we add targeted maximum likelihood estimators (TMLEs), or more generally, targeted minimum loss based estimators (van der Laan and Rubin (2006)) to the debate on the relative performance of DR estimators under practical violations of the positivity assumption in the particular simple missing data problem set forth by KS. TMLEs involve a two-step procedure in which one first estimates the conditional expectation of the outcome, given the covariates, and then updates this initial estimator, targeting the parameter of interest, rather than the overall conditional mean of the outcome given the covariates. The second step requires specification of a loss-function (e.g., log-likelihood loss function) and a parametric submodel through the initial regression, so that one can fit the parametric sub-model by minimizing the empirical risk (e.g., maximizing the log-likelihood). The estimator of the target parameter is then defined as the corresponding substitution estimator. Because TMLEs are substitution estimators, they not only respect the global bounds of the parameter and data (and thus satisfy the “boundedness” property defined by Robins et al. (2007)), but, even more importantly, they respect the fact that the true parameter value is a particular function of the data generating probability distribution.

TMLEs are double robust and asymptotically efficient. Moreover, TMLEs can incorporate data-adaptive likelihood or loss based estimation procedures to estimate both the conditional expectation of the outcome and the missingness mechanism. The TMLE also allows the incorporation of targeted estimation of the censoring/treatment mechanism, as embodied by the collaborative TMLE (C-TMLE), thereby fully confronting a long standing problem of how to select covariates in the propensity score/missingness mechanism of DR-estimators. In this article, we compare the performance of TMLEs to other DR estimators in the literature using the exact simulation study presented in the KS paper. We also make slight modifications to the KS simulation, in order to make the estimation even more challenging.

The remainder of this article is organized as follows. Section 2 presents notation, which deviates from that presented in KS, for the data structure and parameter of interest. Section 3 formally defines the positivity assumption and gives an overview of causes, diagnostics and responses to violations. Section 4 defines the estimators on which we focus in this paper, including a sample of estimators in the literature and TMLEs. Section 5 compares estimator performance in the original and modified KS simulations. Section 6 then looks at coupling TMLEs with machine learning. Section 7 concludes with a discussion of the findings.

Consider an observed data set consisting of $n$ independent and identically distributed (i.i.d.) observations of $O = (W, \Delta, \Delta Y) \sim P_0$. $W$ is a vector of covariates, and $\Delta$ indicates whether $Y$, a continuous outcome, is observed ($\Delta = 1$) or missing ($\Delta = 0$). $P_0$ denotes the true distribution of $O$, from which all observations are sampled. We view $O$ as a missing data structure on a hypothetical full data structure $X = (W, Y)$, which contains the true, or potential, value of $Y$ for all observations, as if no values are missing. We assume $Y$ is missing at random (MAR) such that $P_0(\Delta = 1 \mid X) = g_0(1 \mid W)$. In other words, we assume there are no unobserved confounders of the relationship between missingness $\Delta$ and the outcome $Y$.

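We define $Q_0 = \{Q_{0,W}, \bar{Q}_0\}$, where $Q_{0,W}$ denotes the marginal distribution of $W$ and $\bar{Q}_0(W) = E_0(Y \mid \Delta = 1, W)$. The parameter of interest is the mean outcome, $\mu(P_0) = E_0(Y) = E_0(\bar{Q}_0(W))$.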

The identifiability of the parameter of interest *μ*(*P*_{0}) requires MAR and adequate support in the data. Regarding the latter, it requires that within each stratum of *W*, there is positive probability that *Y* is not missing. This requirement is often referred to as the positivity assumption. Formally, for our target parameter, the positivity assumption requires that:

$$g_0(\Delta = 1 \mid W) > 0 \quad P_0\text{-almost everywhere.} \tag{1}$$

The positivity assumption is specific to the target parameter. For example, the positivity assumption of the target parameter *E*_{0}{*E*_{0}(*Y* | *A* = 1, *W*) – *E*_{0}(*Y* | *A* = 0, *W*)} of the probability distribution of *O* = (*W, A, Y*), representing the additive causal effect under causal assumptions, requires that within each stratum there is a positive probability for all possible treatment assignments. For example, if *A* is a binary treatment, then positivity requires that 0 < *g*_{0}(*A* = 1 | *W*) < 1. (The assumption is often referred to as the experimental treatment assignment (ETA) assumption for causal parameters.) In addition to being parameter-specific, the positivity assumption is also model-specific. Parametric model assumptions, which extrapolate to regions of the joint distribution of (*A*, *W*) that may not be supported in the data, allow for weakening the positivity assumption (Petersen et al. (2010)). However, analysts need to be sure that their parametric assumptions actually hold true, which may be difficult if not impossible.

Violations and near violations of the positivity assumption can arise for two reasons. First, it may be theoretically impossible or highly unlikely for the outcome *Y* to be observed for certain covariate values in the population of interest. The threat to identifiability due to such structural violations of positivity exists regardless of the sample size. Second, given a finite sample, the probability of the outcome being observed for some covariate values might be so small that the observed sample cannot be distinguished from a sample drawn under a theoretical violation of the positivity assumption. The effects of such practical violations of the positivity assumption are sample size specific, and the resulting sparse data bias and inflated variance are often as dramatic as under structural violations.

Several approaches for diagnosing bias due to positivity violations have been suggested (see Petersen et al. (2010) for an overview). Analysts may assess the distribution of Δ within covariate strata (or in the case of causal parameters, the distribution of treatment assignment), but this method is not practical with high dimensional covariate sets or with continuous or multi-level covariates, and also provides no quantitative measure of the resulting sparse-data bias. Analysts may also assess the distribution of the estimated missingness mechanism scores, $g_n(\Delta = 1 \mid W)$

When censoring probabilities are close to 0 (or 1 in the case of an effect parameter), a common practice is to truncate the probabilities or the resulting inverse probability weights, either at fixed levels or at percentiles (Petersen et al. (2010); Wang et al. (2006a); Moore et al. (2009); Cole and Hernan (2008); Kish (1992); Bembom and van der Laan (2008)). The practice limits the influence of observations with large unbounded weights, which may reduce positivity bias and rein in inflated variance. However, this practice may also introduce bias, due to misspecification of the missingness mechanism $g_n$. The extent to which truncating

As a benchmark, KS compare all estimators in their paper to the ordinary least squares (OLS) estimator. For the target parameter, the OLS estimator is equivalent to the G-computation estimator based on a linear regression model. It is defined as:

$$\mu_{n,OLS} = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}_n^0(W_i),$$

where $\bar{Q}_n^0(W) = m_{\beta_n}(W)$ is a linear regression initial fit of $\bar{Q}_0$, and $\beta_n$ is given by:

$$\beta_n = \arg\min_{\beta} \sum_{i=1}^{n} \Delta_i \left(Y_i - m_{\beta}(W_i)\right)^2.$$

(Note that in our notation, the subscript $n$ refers to an estimate, and the superscript indicates whether the estimate is from an initial fit, $\bar{Q}_n^0$, or, as we introduce below, a refit or a fluctuated fit.) Under violation of the positivity assumption, the OLS estimator, when defined, extrapolates from strata of

KS present comparisons of several DR (and non-DR) estimators. We focus on just a couple of them here. Using our terminology, with the terminology and abbreviations from KS in parentheses, the estimators we compare are: the weighted least squares (WLS) estimator (regression estimation with inverse-propensity weighted coefficients, $\mu_{n,WLS}$) and the augmented IPCW (A-IPCW) estimator (regression estimation with residual bias correction,

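The WLS estimator is defined as:

$$\mu_{n,WLS} = \frac{1}{n} \sum_{i=1}^{n} m_{\beta_n}(W_i),$$

where $m_{\beta}$ is the linear regression model for $\bar{Q}_0$ and

$$\beta_n = \arg\min_{\beta} \sum_{i=1}^{n} \frac{\Delta_i}{g_n(1 \mid W_i)} \left(Y_i - m_{\beta}(W_i)\right)^2.$$

The A-IPCW estimator, introduced by Robins et al. (1994), is then defined as:

$$\mu_{n,A\text{-}IPCW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\Delta_i}{g_n(1 \mid W_i)} \left(Y_i - \bar{Q}_n^0(W_i)\right) + \bar{Q}_n^0(W_i) \right].$$

Both of these estimators rely on estimators of $\bar{Q}_0$ and $g_0$. They are consistent if the initial fit $\bar{Q}_n^0$ or $g_n$ is consistent, and efficient if both are consistent. Under positivity violations, however, these estimators rely on the consistency of $\bar{Q}_n^0$, and require that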

Additionally, in comments on KS, Robins et al. (2007) introduce bounded Horvitz-Thompson (BHT) estimators, which, as the name suggests, are bounded, in that for finite sample sizes the estimates are guaranteed to fall in the parameter space. A BHT estimator is defined as:

This is equivalent to the A-IPCW estimator, but estimating *g*_{0}(1 | *W*) by fitting the following logistic regression model:

and .

We also include another important class of doubly robust, locally efficient, regression-based estimators introduced by Scharfstein et al. (1999), further discussed in Robins (1999) and compared to the TMLE in Rosenblum and van der Laan (2010). This estimator is based on a parametric regression model, which includes a “clever covariate” that incorporates inverse probability weights. The estimator behaves similarly to the TMLE using a linear fluctuation (and is identical if the TMLE using a linear fluctuation uses this clever parametric regression as initial estimator). We use the abbreviation PRC. The estimator is defined as:

where
and $m_{\beta,\varepsilon}($

Cao et al. (2009) present a DR estimator that achieves minimum variance among a class of DR estimators indexed by all possible linear regressions for the initial estimator, when the estimator of the missingness mechanism is correctly specified (see also Rubin and van der Laan (2008) for empirical efficiency maximization), while preserving double robustness. They also address the effect of large IPCW by enhancing the missingness mechanism estimator in order to constrain the predicted values. Their estimator is defined as:

Cao’s enhanced missingness mechanism estimator is given by:

Here = [1, *W*], and the parameters *γ* and *δ* are estimated subject to the constraints 0 < *π*(*W*, *δ*, *γ*) < 1 and
. A quasi-Newton method implemented in the *constrOptim.nl* function in the R package *alabama* was used to estimate ($\delta_n$,

Tan (2010) presents an augmented likelihood estimator, a more robust version of estimators originally introduced in Tan (2006), that respects boundedness and is semiparametric efficient. This estimator is defined as:

where *ω*(*W; *_{step}_{2}) is an enhanced estimate of the missingness mechanism based on an initial estimate, *π** _{ML}* (

and $\gamma_{n,ML}$ is a maximum likelihood estimator for the propensity score model parameter. An estimate

The targeted maximum likelihood procedure was first introduced in van der Laan and Rubin (2006). For a compilation of current and past work on targeted maximum likelihood estimation, see van der Laan et al. (2009).

In contrast to the estimating equation-based DR estimators defined above (WLS, A-IPCW, BHT, Cao, and Tan), the PRC estimator and TMLEs are DR *substitution* estimators. TMLEs are based on an update of an initial estimator of *P*_{0} that fluctuates the fit with a fit of a clever parametric submodel. Assuming a valid parametric submodel is selected, TMLEs not only respect the bounds on the outcome implied by the statistical model or data, but also respect that the true target parameter value is a specified function of the data generating distribution. Because it respects this information, the TMLE not only respects the local bounds of the statistical model by being asymptotically (locally) efficient (as do the other DR estimators), but also respects the global constraints of the statistical model. Being a substitution estimator is particularly important under sparsity, as implied by violations of the positivity assumption.

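Although our target parameter involves a continuous $Y$, to introduce the TMLE for the mean outcome, we begin by defining the TMLE for a binary $Y$. In this case, the TMLE is defined as:

$$\mu_{n,TMLE} = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}_n^*(W_i), \tag{2}$$

where we use the logistic regression submodel:

$$\operatorname{logit} \bar{Q}_n^0(\varepsilon)(W) = \operatorname{logit} \bar{Q}_n^0(W) + \varepsilon H_{g_n}^*(W),$$

the clever covariate is defined as $H_{g_n}^*(W) = 1/g_n(1 \mid W)$, and $\varepsilon$, the fluctuation parameter, is estimated by maximum likelihood in which the loss function is thus the log-likelihood loss function:

$$-L(\bar{Q})(O) = \Delta \left\{ Y \log \bar{Q}(W) + (1 - Y) \log\left(1 - \bar{Q}(W)\right) \right\}. \tag{3}$$

Thus $\varepsilon_n$ is fitted with univariate logistic regression, using the initial regression estimator $\bar{Q}_n^0$ as an offset:

$$\varepsilon_n = \arg\min_{\varepsilon} \sum_{i=1}^{n} L\left(\bar{Q}_n^0(\varepsilon)\right)(O_i).$$

The TMLE of $\bar{Q}_0$ is defined as $\bar{Q}_n^* = \bar{Q}_n^0(\varepsilon_n)$, and $\mu_{n,TMLE} = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}_n^*(W_i)$ is the corresponding TMLE of $\mu_0$.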
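The fluctuation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the clever covariate to be $1/g_n(1 \mid W)$, fits $\varepsilon$ by Newton's method on the log-likelihood among the observed outcomes, and uses hypothetical names (`tmle_mean`, `q_init`, `g`) that do not appear in the paper.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    return math.log(p / (1.0 - p))


def tmle_mean(y, delta, q_init, g, tol=1e-10, max_iter=100):
    """Sketch of a TMLE of the mean outcome for Y in [0, 1].

    y      : outcomes (values where delta == 0 are ignored)
    delta  : 1 if the outcome is observed, 0 if it is missing
    q_init : initial estimates of E(Y | Delta = 1, W), strictly inside (0, 1)
    g      : estimated probabilities P(Delta = 1 | W), strictly positive
    """
    offset = [logit(q) for q in q_init]   # initial fit enters as an offset
    h = [1.0 / gi for gi in g]            # assumed clever covariate 1 / g_n(1 | W)
    eps = 0.0
    for _ in range(max_iter):
        score = 0.0
        info = 0.0
        for yi, di, oi, hi in zip(y, delta, offset, h):
            if di == 1:                   # only observed outcomes contribute
                p = expit(oi + eps * hi)
                score += hi * (yi - p)
                info += hi * hi * p * (1.0 - p)
        if info == 0.0:
            break
        step = score / info               # Newton update for the single parameter
        eps += step
        if abs(step) < tol:
            break
    # Updated fit for every unit, then plug in: a substitution estimator.
    q_star = [expit(oi + eps * hi) for oi, hi in zip(offset, h)]
    return sum(q_star) / len(q_star), eps, q_star
```

Because the updated predictions are produced by `expit`, the resulting estimate is an average of values in (0, 1) and so cannot leave the parameter space, which is the boundedness property discussed in the text.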

For estimators $\bar{Q}_n^0$ and $g_n$, one may specify a parametric model or use machine learning or even super learner, which uses loss-based cross-validation to select a weighted combination of candidate estimators (van der Laan et al. (2007)).

Next, consider that $Y$ is continuous, but bounded by 0 and 1. In this case, we can implement the same TMLE as we would for binary $Y$ in (2). That is, we use the same logistic regression submodel, the same loss function (3), and the same standard software for logistic regression to fit $\varepsilon$, simply ignoring that $Y$ is not binary. The same loss function is still valid for the conditional mean $\bar{Q}_0$ (Wedderburn (1974); Gruber and van der Laan (2010)):

$$\bar{Q}_0 = \arg\min_{\bar{Q}} E_0 L(\bar{Q})(O),$$

where $L$ is the loss function in (3).

Finally, given a continuous $Y \in [a, b]$, we can define $Y^* = (Y - a)/(b - a)$ so that $Y^* \in [0, 1]$. Then, let $\mu^*(P_0) = E_0(E_0(Y^* \mid \Delta = 1, W))$. This approach requires setting a range $[a, b]$ for the outcomes $Y$. If such knowledge is available, one simply uses the known values. If $Y$ were not subject to missingness, one would use the minimum and maximum of the empirical sample, which represent a very accurate estimator of the range. In these simulations, $Y$ is subject to informative missingness, so that the minimum or maximum of the biased sample represents a biased estimate of the range, resulting in a small unnecessary bias in the TMLE (asymptotically negligible relative to MSE). We enlarged the range of the complete observations on $Y$ by setting $a$ to 0.9 times the minimum of the observed values, and $b$ to 1.1 times the maximum of the observed values, which seemed to remove most of the unnecessary bias. We expect that some improvements can be obtained by incorporating a valid estimator of the range that takes into account the informative missingness, but such second order improvements are outside the scope of this article. We now compute the above TMLE of $\mu^*(P_0)$, denoted as TMLE$_{Y^*}$, and we use the relation $\mu(P_0) = (b - a)\mu^*(P_0) + a$.
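The rescaling and back-transformation can be written out as follows. This is only a sketch of the arithmetic in the paragraph above; the function names are hypothetical. Note that multiplying the minimum by 0.9 widens the range only when the minimum is positive, which holds for the outcomes simulated in this article but would need rethinking for outcomes that can be negative.

```python
def outcome_bounds(y_observed, shrink=0.9, stretch=1.1):
    """Enlarged range [a, b] based on the observed (non-missing) outcomes."""
    a = shrink * min(y_observed)    # widens the range only if min(y_observed) > 0
    b = stretch * max(y_observed)   # widens the range only if max(y_observed) > 0
    return a, b


def to_unit_interval(y, a, b):
    """Map Y in [a, b] to Y* in [0, 1]."""
    return (y - a) / (b - a)


def from_unit_interval(mu_star, a, b):
    """Map an estimate of mu*(P_0) back to the original scale."""
    return (b - a) * mu_star + a
```

A TMLE computed on the transformed outcomes is then passed through `from_unit_interval` to obtain the estimate of $\mu(P_0)$.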

We note that the estimator proposed by Scharfstein et al. (1999) and discussed in the KS debate is a particular special case of a TMLE (Rosenblum and van der Laan (2010)). It defines a clever parametric initial regression for which the update step of the general TMLE-algorithm introduced in van der Laan and Rubin (2006) results in a zero-update, and is thus not needed. Such a TMLE falls in the class of TMLEs defined by an initial regression estimator, a squared error loss function and a univariate linear regression submodel (coding the fluctuations of the initial regression estimator for the TMLE-update step). Such TMLEs for continuous outcomes (contrary to the excellent robustness of the TMLE for binary outcomes based on the log-likelihood loss function and logistic regression submodel) suffer from great sensitivity to violations of the positivity assumption, as was also observed in the simulations presented in the Kang and Schafer debate. As explained in Gruber and van der Laan (2010), the problem with this TMLE defined by the squared error loss function and univariate linear regression submodel is that its updates are not subject to any bounds implied by the statistical model or data: that is, it is not using a parametric *sub*-model, an important principle of the general TMLE algorithm. The valid TMLE for continuous outcomes above, defined by the quasi-binary-log-likelihood loss and a univariate logistic regression parametric submodel, was recently presented in Gruber and van der Laan (2010), and in the latter article it was demonstrated that the previously observed sensitivity of these two estimators to the positivity assumption was due to those specific choices.

Finally, a natural extension of all of the above TMLEs is to make a more sophisticated estimate of $g_0$. Therefore, estimator $\mu_{n,\text{C-TMLE}_{Y^*}}$ is defined by (2) as well, but the algorithm for computing the updated fit $\bar{Q}_n^*$ differs. For the C-TMLE, we generate a sequence of nested logistic regression model fits of $g_0$, $g_{n,1}, \ldots, g_{n,K}$, and we create a corresponding sequence of candidate TMLEs, using

In this section, we compare the performance of TMLEs to the estimating equation-based DR estimators (WLS, A-IPCW, BHT, Cao, TanWLS, TanRV) as well as PRC and OLS, in the context of positivity violations. The goal of the original simulation designed by KS was to highlight the stability problems of DR estimators. We explore the relative performance of the estimators under the original KS simulation and a number of alternative data generating distributions that involve stronger and different types of violations of the positivity assumption. These new simulation settings were designed to provide more diverse and even more challenging test cases for evaluating robustness and thereby finite sample performance of the different estimators.

For the four simulations described below, all estimators were used to estimate $\mu(P_0)$ from 250 samples of size 1000. We include TMLE$_{Y^*}$ and C-TMLE$_{Y^*}$ estimators based on the quasi-log-likelihood loss function and the logistic regression submodel. We evaluated the performance of the estimators by their bias, variance and mean squared error (MSE).
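The three performance measures can be computed from the Monte Carlo replicates as sketched below. The function name is hypothetical, and the variance uses the divisor $n$ (an assumption, since the article does not state which divisor was used) so that the decomposition MSE = bias² + variance holds exactly.

```python
def performance(estimates, truth):
    """Bias, variance and MSE of an estimator over simulation replicates."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - truth
    # Divisor n (not n - 1), so that mse == bias**2 + variance up to rounding.
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    return bias, variance, mse
```

In the setting described above, `estimates` would hold the 250 estimates produced by one estimator and `truth` would be the true population mean.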

We compared the estimators of $\mu(P_0)$ using different specifications of the estimators of $\bar{Q}_0$ and $g_0$. In each of the tables presented below, “Qcgc” indicates that the estimators of both were specified correctly; “Qcgm” indicates that the estimator of $\bar{Q}_0$ was correctly specified, but the estimator of $g_0$ was misspecified; “Qmgc” indicates that the estimator of $\bar{Q}_0$ was misspecified, but the estimator of $g_0$ was correctly specified; and “Qmgm” indicates that both estimators were misspecified. For the modified simulations we present results for the “Qmgc” specification only, in order to focus on the performance of each estimator when reliance on $g_n$ is essential. Additional results for the other model specifications are available as supplemental materials.

Also, for all estimators, we compared results with no lower bound on $g_n(1 \mid W)$

Kang and Schafer (2007) consider $n$ i.i.d. units of $O = (W, \Delta, \Delta Y) \sim P_0$, where $W$ is a vector of 4 baseline covariates, and $\Delta$ is an indicator of whether the continuous outcome, $Y$, is observed. Kang and Schafer are interested in estimating the following parameter:

$$\mu(P_0) = E_0(Y) = E_0\left(E_0(Y \mid \Delta = 1, W)\right).$$

Let $(Z_1, \ldots, Z_4)$ be independent normally distributed random variables with mean zero and variance 1. The covariates $W$ we actually observe are generated as follows:

$$W_1 = \exp(Z_1/2), \quad W_2 = \frac{Z_2}{1 + \exp(Z_1)} + 10, \quad W_3 = \left(\frac{Z_1 Z_3}{25} + 0.6\right)^3, \quad W_4 = (Z_2 + Z_4 + 20)^2.$$

The outcome $Y$ is generated as:

$$Y = 210 + 27.4 Z_1 + 13.7 Z_2 + 13.7 Z_3 + 13.7 Z_4 + N(0, 1).$$

From this, one can determine the conditional mean $\bar{Q}_0(W)$ of $Y$, given $W$, which equals the same linear regression in $Z_1(W), \ldots, Z_4(W)$, where $Z_j(W)$

With this data generating mechanism, the average response rate is 0.50. Also, the true population mean is 210, while the mean among respondents is 200. These values indicate a small selection bias.
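The summary values quoted above can be checked with a short simulation. The sketch below uses the covariate transformations, outcome regression and missingness model published by Kang and Schafer (2007); the function name and the returned tuple layout are hypothetical, and the full outcome is returned for every unit purely so that the population mean can be compared with the mean among respondents.

```python
import math
import random


def simulate_ks(n, seed=0):
    """Draw n units from the Kang and Schafer (2007) data generating mechanism."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z1, z2, z3, z4 = (rng.gauss(0.0, 1.0) for _ in range(4))
        # Observed covariates are nonlinear transformations of the latent Z.
        w = (
            math.exp(z1 / 2.0),
            z2 / (1.0 + math.exp(z1)) + 10.0,
            (z1 * z3 / 25.0 + 0.6) ** 3,
            (z2 + z4 + 20.0) ** 2,
        )
        y = 210.0 + 27.4 * z1 + 13.7 * (z2 + z3 + z4) + rng.gauss(0.0, 1.0)
        # True probability that the outcome is observed.
        g0 = 1.0 / (1.0 + math.exp(-(-z1 + 0.5 * z2 - 0.25 * z3 - 0.1 * z4)))
        delta = 1 if rng.random() < g0 else 0
        data.append((w, delta, y, g0))
    return data


sample = simulate_ks(100000, seed=1)
response_rate = sum(d for _, d, _, _ in sample) / len(sample)
population_mean = sum(y for _, _, y, _ in sample) / len(sample)
respondents = [y for _, d, y, _ in sample if d == 1]
respondent_mean = sum(respondents) / len(respondents)
print(response_rate, population_mean, respondent_mean)
```

In an analysis, only $(W, \Delta, \Delta Y)$ would be made available to the estimators; the latent $Z$ values and the true probabilities are kept here for diagnostic purposes only.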

In these simulations, a linear main term model in the main terms $(W_1, \ldots, W_4)$ for either the outcome-regression or missingness mechanism is misspecified, while a linear main term model in the main terms $(Z_1(W), \ldots, Z_4(W))$ would be correctly specified. Note that in the KS simulation, there are finite sample violations of the positivity assumption. Specifically, we find $g_0(\Delta = 1 \mid W) \in [0.01, 0.98]$ and the estimated missingness probabilities $g_n(\Delta = 1 \mid W)$

Figure 1 and Table 1 present the simulation results without any bounding of $g_n$. Tan’s estimator imposes internal bounds on the estimated missingness mechanism; however, we report performance of TanWLS and TanRV estimators when given an initial estimate

Figure 2 and Table 2 compare the results for each estimator when $g_n$ is bounded from below at 0.025. Bounding

In the KS simulation, when $\bar{Q}_0$ or $g_0$ are misspecified, the misspecifications are small, and the selection bias is small. Therefore, we modified the KS simulation in order to increase the degree of misspecification and selection bias. This creates a greater challenge for estimators, and better highlights their relative performance.

As before, let $Z_j$ be i.i.d.

From this one can determine the true regression function $\bar{Q}_0(W) = E_0(E_0(Y \mid Z) \mid W)$. The missingness indicator is generated as follows:

A misspecified fit is now obtained by fitting a linear or logistic main term regression in $W_1, \ldots, W_4$, while a correct fit is obtained by providing the user with the terms $Z_1, \ldots, Z_4$, and fitting a linear or logistic main term regression in $Z_1, \ldots, Z_4$. With these modifications, the population mean is again 210, but the mean among respondents is 184.4. We also have a higher degree of practical violation of the positivity assumption: $g_0(\Delta = 1 \mid W) \in [1.1 \times 10^{-5}, 0.99]$ while the estimated probabilities, $g_n(\Delta = 1 \mid W)$

Figure 3 and Table 3 present results for the misspecified estimator of $\bar{Q}_0$ without bounding $g_n$ and with

Sampling distribution of (*μ*_{n} – *μ*_{0}) with *g*_{n} bounded at 0.025, Modification 1 of Kang and Schafer simulation.

For this simulation, we made one additional change to Modification 1: we set the coefficient in front of $Z_4$ in the true regression of $Y$ on $Z$ equal to zero. Therefore, while $Z_4$ is still associated with missingness, it is not associated with the outcome, and is thus not a confounder. Given $(W_1, \ldots, W_3)$, $W_4$ is not associated with the outcome either, and therefore as the misspecified regression model of $\bar{Q}_0(W)$ we use a main term regression in $(W_1, W_2, W_3)$.

This modification to the KS simulation enables us to take the debate on the relative performance of DR estimators one step further, by addressing a second key challenge of the estimators: that they often include non-confounders in the censoring mechanism estimator. Though such an estimator remains asymptotically unbiased, this unnecessary inclusion can increase asymptotic variance, and may unnecessarily introduce positivity violations leading to finite sample bias and inflated variance (Neugebauer and van der Laan, 2005; Petersen et al., 2010).

Figure 4 and Table 4 reveal that C-TMLE_{Y}_{*} has superior performance relative to estimating equation-based DR estimators when not all covariates are associated with *Y*. As discussed earlier, the C-TMLE algorithm provides an innovative black-box approach for estimating the censoring mechanism, preferring covariates that are associated with the outcome and censoring, without “data-snooping.”

Sampling distribution of (*μ*_{n} – *μ*_{0}) with *g*_{n} bounded at 0.025, Modification 2 of Kang and Schafer simulation.

In some rare cases, a C-TMLE can be a super efficient estimator because it uses a collaborative estimator *g*_{n} that takes into account the fit of the initial estimator
(we refer to Rotnitzky et al. (2010) and van der Laan and Gruber (2009) for a detailed discussion). As a consequence, it is of particular interest to investigate the behavior of C-TMLE

The KS simulation with dual misspecification (*Qmgm*) illustrates the benefits of coupling data-adaptive (super) learning with targeted maximum likelihood estimation. C-TMLE_{Y}_{*} constrained to use a main terms regression model with misspecified covariates (*W*_{1}, *W*_{2}, *W*_{3}, *W*_{4}) has smaller variance than *μ*_{n,OLS}, but is more biased. The MSE of the TMLE

We coupled super learning with TMLE_{Y}_{*} and C-TMLE_{Y}_{*} to estimate both *Q̄*_{0} and *g*_{0}. For C-TMLE_{Y}_{*}, four missingness-mechanism score-based covariates were created based on different truncation levels of the propensity score estimate *g*_{n}(1 |

An important aspect of super learning is to ensure that the library of prediction algorithms includes a variety of approaches for fitting the true functions *Q̄*_{0} and *g*_{0}. For example, it is sensible to include a main terms regression algorithm in the super learner library. Should that algorithm happen to be correct, the super learner will behave as the main terms regression algorithm. It is also recommended to include algorithms that search over a space of higher order polynomials, non-linear models, and, for example, cubic splines. For binary outcome regression, as required for fitting *g*_{0}, classification algorithms such as classification and regression trees (Breiman et al., 1984), support vector machines (Cortes and Vapnik, 1995), and *k*-nearest-neighbor algorithms (Friedman, 1994) could be added to the library. The point of super learning is that we cannot know in advance which procedure will be most successful for a given prediction problem. Super learning relies on the oracle property of V-fold cross-validation to asymptotically select the optimal convex combination of estimates obtained from these disparate procedures (van der Laan and Dudoit, 2003; van der Laan et al., 2004; van der Laan et al., 2007).
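The idea of selecting a convex combination of candidate fits by V-fold cross-validation can be sketched in a few lines. This is an illustrative toy, not the super learner implementation used in the paper: it assumes a two-algorithm library (an intercept-only fit and a one-covariate least squares line), squared error loss, simulated data, and a plain grid search over the convex weight. All function names are hypothetical.

```python
import random


def fit_mean(xs, ys):
    """Candidate 1: intercept-only predictor."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: least squares line in a single covariate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def cv_predictions(learners, xs, ys, V=5):
    """Out-of-fold predictions of every learner, using V folds by index."""
    n = len(xs)
    preds = [[0.0] * n for _ in learners]
    for v in range(V):
        train = [i for i in range(n) if i % V != v]
        valid = [i for i in range(n) if i % V == v]
        for k, fit in enumerate(learners):
            f = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in valid:
                preds[k][i] = f(xs[i])
    return preds


def convex_weight(preds, ys, grid=101):
    """Grid search for the weight alpha on learner 0 (1 - alpha on learner 1)
    that minimizes the cross-validated squared error risk."""
    best_alpha, best_risk = 0.0, float("inf")
    for j in range(grid):
        a = j / (grid - 1)
        risk = sum((y - (a * p0 + (1 - a) * p1)) ** 2
                   for y, p0, p1 in zip(ys, preds[0], preds[1])) / len(ys)
        if risk < best_risk:
            best_alpha, best_risk = a, risk
    return best_alpha, best_risk


# Simulated data in which the linear candidate happens to be correct.
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(200)]
ys = [1.0 + 3.0 * x + random.gauss(0, 1) for x in xs]

preds = cv_predictions([fit_mean, fit_line], xs, ys)
alpha, risk = convex_weight(preds, ys)
print(alpha, risk)
```

Because the grid contains the weights 0 and 1, the selected combination can never have a larger cross-validated risk than either candidate on its own; when one candidate is correct, nearly all of the weight goes to it.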

Consider the misspecified scenario proposed by KS. The true full-data distribution and the missingness mechanism are captured by main terms linear regression of the outcome on *Z*_{1}, *Z*_{2}, *Z*_{3}, *Z*_{4}. This simple model is virtually impossible to discover through the usual model selection approaches when the observed data consists of misspecified covariates *O* = (*W*_{1}, *W*_{2}, *W*_{3}, *W*_{4}, Δ, Δ*Y*), given

This complexity illustrates the importance of including prediction algorithms that attack the estimation problem from a variety of directions. The super learner library we employed contained the algorithms listed below. The analysis was carried out in the R statistical programming environment v2.10.1 (R Development Core Team, 2010), using algorithms included in the base installation or in the indicated package.

- **glm** (base): main terms linear regression.
- **step** (base): stepwise forward and backward selection using the AIC criterion (Hastie and Pregibon, 1992).
- **ipredbagg** (ipred): bagging for classification, regression and survival trees (Peters and Hothorn, 2009; Breiman, 1996).
- **DSA** (DSA): Deletion/Substitution/Addition algorithm for searching over a space of polynomial models of order *k* (*k* set to 2) (Neugebauer and Bullard, 2010; Sinisi and van der Laan, 2004).
- **earth** (earth): regression model built using multivariate adaptive regression splines (MARS) (Milborrow, 2009; Friedman, 1991, 1993).
- **loess** (stats): local polynomial regression fitting (Cleveland, Grosse, and Shyu, 1992).
- **nnet** (nnet): single-hidden-layer neural network for classification (Venables and Ripley, 2002; Ripley, 1996).
- **svm** (e1071): support vector machine for regression and classification (Dimitriadou et al., 2010; Chang and Lin, 2001).
- ***k*-nearest-neighbors**^{*} (class): classification using the most common outcome among the *k* nearest nodes (*k* set to 10) (Venables and Ripley, 2002; Friedman, 1994).

Table 6 reports the results when super learning is incorporated into TMLE_{Y}_{*} and C-TMLE_{Y}_{*} estimation procedures, based on 250 samples of size 1000, with predicted values for *g*_{n}(1 |

By mapping continuous outcomes into [0,1] and using a logistic fluctuation, TMLE_{Y}_{*} and C-TMLE_{Y}_{*} are more robust to violations of the positivity assumption than the TMLEs using the linear fluctuation function. Because TMLE_{Y}_{*} is a substitution estimator, the impact of a single observation on it is bounded by 1/*n*, while many of the other estimators do not have such a robustness property. We show that C-TMLE_{Y}_{*} has superior performance relative to estimating equation-based DR estimators when there are covariates that are strongly associated with the missingness indicator, while weakly or not at all associated with the outcome *Y*. The C-TMLE algorithm provides an innovative approach for estimating the censoring mechanism, preferring covariates that are associated with both the outcome *Y* and missingness, Δ. C-TMLEs avoid data snooping concerns because the estimation procedure is fully specified before the analyst observes any data (or at least, any data beyond some ancillary statistics). Even in cases in which *all* observed covariates are associated with *Y*, C-TMLE still performs well.
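The combination of an outcome mapped into [0,1], a logistic fluctuation, and a substitution step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the bounds `a` and `b` are assumed known, the initial fit `q_init` and the response probabilities `g` are supplied by the user, the clever covariate is taken to be 1/*g*_{n}(1 | *W*) with *g*_{n} bounded below at 0.025, and the fluctuation parameter is found by bisection on the score equation. The function name `tmle_mean` is hypothetical.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    return math.log(p / (1.0 - p))


def tmle_mean(y, delta, q_init, g, a, b, g_lower=0.025, clip=1e-6):
    """Sketch of a TMLE of the mean outcome with a logistic fluctuation.

    y       : outcomes, None where missing
    delta   : 1 if observed, 0 if missing
    q_init  : initial predictions of E(Y | Delta = 1, W) on the original scale
    g       : estimated response probabilities g_n(1 | W)
    a, b    : assumed known lower and upper bounds on Y
    """
    n = len(y)
    # Map the outcome and the initial fit into [0, 1].
    y_star = [(yi - a) / (b - a) if d == 1 else 0.0 for yi, d in zip(y, delta)]
    q_star = [min(max((q - a) / (b - a), clip), 1.0 - clip) for q in q_init]
    offset = [logit(q) for q in q_star]
    # Clever covariate based on the bounded response probabilities.
    h = [1.0 / max(gi, g_lower) for gi in g]

    def score(eps):
        # Score of the logistic fluctuation; non-increasing in eps.
        return sum(d * hi * (ys - expit(o + eps * hi))
                   for d, hi, ys, o in zip(delta, h, y_star, offset))

    # Bracket the root, then bisect.
    lo, hi_ = -1.0, 1.0
    for _ in range(60):
        if score(lo) >= 0:
            break
        lo *= 2
    for _ in range(60):
        if score(hi_) <= 0:
            break
        hi_ *= 2
    for _ in range(200):
        mid = (lo + hi_) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi_ = mid
    eps = (lo + hi_) / 2

    # Substitution step: average the updated fit, then map back to [a, b].
    q_updated = [expit(o + eps * hi) for o, hi in zip(offset, h)]
    return a + (b - a) * sum(q_updated) / n


# Hypothetical data with constant response probability: the update
# then reproduces the respondent mean (2.5).
y = [1.0, 3.0, None, 2.0, 4.0]
delta = [1, 1, 0, 1, 1]
mu_n = tmle_mean(y, delta, q_init=[2.0] * 5, g=[0.8] * 5, a=0.0, b=5.0)
print(mu_n)
```

Because the final estimate is an average of predictions that each lie in [0,1] before being mapped back, it cannot leave the interval [*a*, *b*], however poor the initial fit or however small the estimated response probabilities.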

Related work is also being done with respect to other parameters of interest. For example, both Cao et al. (2009) and Tan (2006) include discussions on applying their estimators to causal effect parameters. In addition, Freedman and Berk (2008) focus on a causal effect parameter and demonstrate that DR estimators (and the WLS estimator in particular) can increase variance and bias when the inverse probability of censoring weights are large.

Overall, comparisons of estimators, beyond theoretical studies of asymptotics as well as robustness, will need to be based on large scale simulation studies including all available estimators, and cannot be tailored towards one particular simulation setting. Future research should be concerned with setting up such a large scale objective comparison based on publicly available software, and we are looking forward to contributing to such an effort.

The research underlying TMLEs was motivated, in part, by the goal of increasing the stability of DR estimators, and the KS simulations provide a demonstration of the merits of TMLEs under violations of the positivity assumption. TMLEs are estimators defined by the choice of loss function and parametric submodel, both chosen so that the linear span of the scores at zero fluctuation with respect to the loss function includes the efficient influence curve/efficient score. All such TMLEs are double robust, asymptotically efficient under correct specification, and substitution estimators, but the choice of submodel can affect finite sample robustness if the submodel does not respect known bounds on the outcome, as is the case for the linear regression submodel. In addition, TMLEs can be combined with super learning and empirical efficiency maximization (Rubin and van der Laan, 2008; van der Laan and Gruber, 2009) to further enhance their performance in practice. We hope that by showing that these estimators perform well in simulations and settings created by other researchers for the purposes of showing the weaknesses of DR estimators, as well as in modified simulations that make estimation even more challenging, we provide probative evidence in support of TMLEs. Of course, much can happen in finite samples, and we look forward to further exploring how these estimators perform in other settings.

^{*}binary outcomes only, added to library for estimating *g*

**Author Notes:** Kristin E. Porter and Susan Gruber contributed equally to this work. We would like to thank Zhiqiang Tan and Weihua Cao for providing us with software to implement their estimators. This work was supported by NIH grant R01AI074345-04 and by a National Institutes of Health NRSA Trainee appointment on grant number T32 HG 00047.

Kristin E. Porter, University of California, Berkeley.

Susan Gruber, University of California, Berkeley.

Mark J. van der Laan, University of California, Berkeley.

Jasjeet S. Sekhon, University of California, Berkeley.

- Bembom O, van der Laan MJ. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators. Technical Report 230, Division of Biostatistics, University of California, Berkeley, 2008. URL www.bepress.com/ucbbiostat/paper230/
- Breiman L. Bagging predictors. Machine Learning. 1996;24:123–140. doi: 10.1007/BF00058655. [Cross Ref]
- Breiman L, Friedman JH, Olshen R, Stone CJ. Classification and regression trees. 1984. The Wadsworth statistics/probability series. Wadsworth International Group.
- Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96(3):723–734. doi: 10.1093/biomet/asp033. [PMC free article] [PubMed] [Cross Ref]
- Chang C-C, Lin C-J. LIBSVM: a library for support vector machines (version 2.31) Technical report, 2001. http://www.csie.ntu.edu.tw/cjlin/papers/libsvm2.ps.gz.
- Cole SR, Hernan MA. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology. 2008;168:656–664. doi: 10.1093/aje/kwn164. [PMC free article] [PubMed] [Cross Ref]
- Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20:273–297.
- Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. 2010. URL http://CRAN.R-project.org/package=e1071. R package version 1.5–24.
- Freedman DA, Berk RA. Weighting regressions by propensity scores. Evaluation Review. 2008;32(4):392–409. doi: 10.1177/0193841X08317586. [PubMed] [Cross Ref]
- Friedman JH. Multivariate adaptive regression splines. The Annals of Statistics. 1991;19(1):1–67. doi: 10.1214/aos/1176347963. ISSN 00905364. URL http://www.jstor.org/stable/2241837. [Cross Ref]
- Friedman JH. Fast MARS Technical report, Department of Statistics. Stanford University; 1993.
- Friedman JH. 1994. Flexible metric nearest neighbor classification Technical report, Department of Statistics, Stanford University.
- Geyer CJ. trust: Trust Region Optimization. 2009. URL http://CRAN.R-project.org/package=trust. R package version 0.1-2.
- Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Technical Report 265, UC Berkeley; 2010a. [PMC free article] [PubMed]
- Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. The International Journal of Biostatistics. 2010b;6(1) doi: 10.2202/1557-4679.1182. (18) [PMC free article] [PubMed] [Cross Ref]
- Hastie TJ, Pregibon D. Generalized linear models. In: Chambers JM, Hastie TJ, editors. Statistical Models in S. Wadsworth & Brooks/Cole; 1992. chapter 6.
- Kang J, Schafer J. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:523–39. doi: 10.1214/07-STS227. [PMC free article] [PubMed] [Cross Ref]
- Kish L. Weighting for unequal *p*_{i}. Journal of Official Statistics. 1992;8:183–200.
- Milborrow S. earth: Multivariate Adaptive Regression Spline Models. 2009. URL http://CRAN.R-project.org/package=earth. R package version 2.4-0.
- Moore KL, Neugebauer RS, van der Laan MJ, Tager IB. Causal inference in epidemiological studies with strong confounding Technical Report 255, Division of Biostatistics. University of California; Berkeley: 2009. URL www.bepress.com/ucbbiostat/paper255/
- Neugebauer R, Bullard J. DSA: Deletion/Substitution/Addition algorithm. 2010. URL http://www.stat.berkeley.edu/~laan/Software/. R package version 3.1.4.
- Neugebauer R, van der Laan MJ. Why prefer double robust estimators in causal inference? Journal of Statistical Planning and Inference. 2005;129(1–2):405–426. doi: 10.1016/j.jspi.2004.06.060. [Cross Ref]
- Peters A, Hothorn T. ipred: Improved Predictors. 2009. URL http://CRAN.R-project.org/package=ipred. R package version 0.8-8.
- Petersen ML, Porter K, Gruber S, Wang Y, van der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research. 2010 doi: 10.1177/0962280210386207. [PubMed] [Cross Ref]
- Ridgeway G, McCaffrey D. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:540–43. doi: 10.1214/07-STS227C. [PMC free article] [PubMed] [Cross Ref]
- Ripley BD. Pattern recognition and neural networks. Cambridge University Press; Cambridge, New York: 1996.
- Robins JM, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science. 2007;22:544–559. doi: 10.1214/07-STS227D. [Cross Ref]
- Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. doi: 10.1016/0270-0255(86)90088-6. [Cross Ref]
- Robins JM. Addendum to: “A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect” [Math. Modelling **7** (1986), no. 9–12; MR 87m:92078] Comput Math Appl. 1987;14(9–12):923–945. doi: 10.1016/0898-1221(87)90238-0. ISSN 0097-4943. [Cross Ref]
- Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association: Section on Bayesian Statistical Science. 1999; pp. 6–10.
- Robins JM. Commentary on using inverse weighting and predictive inference to estimate the effects of time-varying treatments on the discrete-time hazard. Statistics in Medicine. 1999;21:1663–1680. doi: 10.1002/sim.1110. [PubMed] [Cross Ref]
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994 Sep;89(427):846–66. doi: 10.2307/2290910. [Cross Ref]
- Rosenblum M, van der Laan MJ. Targeted maximum likelihood estimation of the parameter of a marginal structural model. The International Journal of Biostatistics. 2010;6(19) doi: 10.2202/1557-4679.1238. [PMC free article] [PubMed] [Cross Ref]
- Rotnitzky A, Li L, Li X. A note on overadjustment in inverse probability weighted estimation. Biometrika. 2010;97(4):997–1001. doi: 10.1093/biomet/asq049. [PMC free article] [PubMed] [Cross Ref]
- Rubin DB, van der Laan MJ. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics. 2008;4(1) doi: 10.2202/1557-4679.1084. Article 5. [PMC free article] [PubMed] [Cross Ref]
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder) Journal of the American Statistical Association. 1999;(94):1096–1120. 1121–1146. doi: 10.2307/2669923. [Cross Ref]
- Sekhon JS, Gruber S, Porter K, van der Laan MJ. Propensity-score-based estimators and C-TMLE. In: van der Laan MJ, Rose S, editors. Targeted Learning: Prediction and Causal Inference for Observational and Experimental Data. Springer; New York: 2011. chapter 21.
- Sinisi S, van der Laan MJ. The Deletion/Substitution/Addition algorithm in loss function based estimation: Applications in genomics. Journal of Statistical Methods in Molecular Biology. 2004;3(1)
- Tan Z. A distributional approach for causal inference using propensity scores. J Am Statist Assoc. 2006;101:1619–37. doi: 10.1198/016214506000000023. [Cross Ref]
- Tan Z. Comment: Understanding OR, PS and DR. Statistical Science. 2007;22:560–568. doi: 10.1214/07-STS227A. [Cross Ref]
- Tan Zhiqiang. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika. 2010;97(3):661–682. doi: 10.1093/biomet/asq035. [Cross Ref]
- R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2010.
- Tsiatis A, Davidian M. Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion) Statistical Science. 2007;22:569–73. doi: 10.1214/07-STS227B. [PMC free article] [PubMed] [Cross Ref]
- van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples Technical report, Division of Biostatistics. University of California; Berkeley: Nov, 2003.
- van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. The International Journal of Biostatistics. 2009 [PMC free article] [PubMed]
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003.
- van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. [Cross Ref]
- van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Technical Report 142, Division of Biostatistics, University of California, Berkeley; Feb, 2004.
- van der Laan MJ, Polley E, Hubbard A. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(25) doi: 10.2202/1544-6115.1309. ISSN 1. [PubMed] [Cross Ref]
- van der Laan MJ, Rose S, Gruber S. Readings on targeted maximum likelihood estimation. 2009. Technical report, working paper series http://www.bepress.com/ucbbiostat/paper254.
- Varadhan R. alabama: Constrained nonlinear optimization. 2010. URL http://CRAN.R-project.org/package=alabama. R package version 2010.10-1.
- Venables WN, Ripley BD. Modern applied statistics with S. 4th edition. Springer; New York: 2002.
- Grosse E, Cleveland WS, Shyu WM. Local regression models. In: Chambers JM, Hastie TJ, editors. Statistical Models in S. Wadsworth & Brooks/Cole; 1992. chapter 6.
- Wang Y, Petersen M, Bangsberg D, van der Laan MJ. Diagnosing bias in the inverse probability of treatment weighted estimator resulting from violation of experimental treatment assignment. Technical Report 211, Division of Biostatistics, University of California, Berkeley; 2006a.
- Wang Y, Petersen M, van der Laan MJ. A statistical method for diagnosing ETA bias in IPTW estimators. Technical report, Division of Biostatistics, University of California, Berkeley; 2006b.
- Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61

