In the absence of missing data, local polynomial kernel estimating equations have been proposed by Carroll, Ruppert, and Welsh (1998) as an extension of local likelihood estimation. When the data are not fully observed, one naive estimation approach is to simply solve the local polynomial kernel estimating equations using only completely observed units. However, as we show in Theorem 1 in Section 4, the resulting estimator *θ̂*_{naive}(*z*) is generally inconsistent under MAR, except when (a) the conditional mean *E*(*Y*|*Z*, **U**) depends at most on *Z*, or (b) the selection probability Pr(*R* = 1|*Z*, **U**) depends at most on *Z*. This result is not surprising once we connect our inferential problem to causal inference objectives and relate it to well-known facts in causality. The MAR assumption (2) is equivalent to the assumption of no unmeasured confounding (Robins et al. 1999) or ignorability (Rubin 1976) for the potential outcome under treatment *R* = 1 in the subpopulation with *Z* = *z*. This assumption stipulates that, conditional on *Z* = *z*, **U** are the only variables that can simultaneously be (i) correlates of the outcome within treatment level and (ii) predictors of treatment *R* = 1. When (a) or (b) holds, either (i) or (ii) is violated. In that case, the effect of *R* = 1 on *Y* is unconfounded and, consequently, naive conventional (that is, unadjusted) estimators of the association of *Y* with *R* = 1 conditional on *Z* = *z* are consistent estimators of the causal estimand of interest. In fact, when (b) holds but (a) is false, the naive estimator will be consistent but inefficient because it fails to exploit the information about *E*(*Y*|*Z* = *z*) in the auxiliary variables **U**. Thus, even in such a setting it is desirable to develop alternative, more efficient estimation procedures. The augmented inverse probability weighted (AIPW) kernel estimators developed in this paper address this issue.
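For intuition, the naive complete-case estimator can be sketched numerically. The sketch below is our own minimal illustration (not the paper's code): it takes the identity link and a Gaussian kernel, so the local linear kernel estimating equations reduce to kernel-weighted least squares on the completely observed units.

```python
import numpy as np

def naive_local_linear(z, Z, Y, R, h):
    """Complete-case ("naive") local linear estimate of theta(z):
    kernel-weighted least squares using only units with R_i = 1."""
    m = R == 1
    Zc, Yc = Z[m], Y[m]
    # Gaussian kernel weights K_h(Z_i - z)
    w = np.exp(-0.5 * ((Zc - z) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    # Local linear design G(Z_i - z) = (1, Z_i - z)^T
    G = np.column_stack([np.ones(Zc.size), Zc - z])
    alpha = np.linalg.solve(G.T @ (G * w[:, None]), G.T @ (w * Yc))
    return alpha[0]  # theta_hat_naive(z) = alpha_0
```

Under MAR with selection depending on **U**, this estimator is generally inconsistent unless condition (a) or (b) above holds.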

When the outcomes are missing at random, Robins, Rotnitzky, and Zhao (1995) and Rotnitzky and Robins (1995) proposed an inverse probability weighted (IPW) estimating equation for parametric regression, that is, when *θ*(·) is parametrically modeled as *θ*(·; *ν*) indexed by a finite dimensional parameter vector *ν*, where *ν* ∈ **R**^{k}. Robins and Rotnitzky (1995) showed that one can improve the efficiency of the IPW estimator by adding to the IPW estimating function a parametric augmentation term. We extend their idea and propose a class of AIPW kernel estimating equations for estimating the nonparametric function *θ*(·). We weight the units with complete data by either the inverse of the true selection probability *π*_{i0} = Pr(*R*_{i} = 1|*Z*_{i}, **U**_{i}) (if known, for instance, as in two-stage sampling designs) or the inverse of an estimator of it, and add an adequately chosen augmentation term. We show that, just as for estimation of a parametric model for *θ*(·), inclusion of the augmentation term can lead to efficiency improvement for estimation of the nonparametric regression function *θ*(·). Unlike in parametric regression, the augmentation term depends on a kernel function.
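Concretely, the weighting step can be sketched as follows. We assume, purely for illustration, a logistic selection model in (*Z*, **U**) fit by Newton-Raphson; the function name and this particular parametric form are our own choices, not the paper's.

```python
import numpy as np

def fit_selection_model(Z, U, R, iters=25):
    """Maximum likelihood fit of a logistic selection model
    Pr(R = 1 | Z, U) = expit(tau^T W), with W = (1, Z, U)^T,
    via Newton-Raphson; returns the fitted pi_hat_i.  The logistic
    form is one convenient (assumed) parametric specification."""
    W = np.column_stack([np.ones(Z.size), Z, U])
    tau = np.zeros(W.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-W @ tau))
        score = W.T @ (R - p)                       # score vector
        info = (W * (p * (1 - p))[:, None]).T @ W   # observed information
        tau = tau + np.linalg.solve(info, score)
    return 1.0 / (1.0 + np.exp(-W @ tau))
```

Complete-case contributions to the estimating equations are then weighted by 1/*π̂*_{i}.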

Specifically, let *K*_{h}(*s*) = *h*^{−1}*K*(*s*/*h*), where *K*(·) is a mean-zero density function. Without loss of generality, we here focus on local linear kernel estimators. For any scalar *x*, define **G**(*x*) = (1, *x*)^{T} and *α* = (*α*_{0}, *α*_{1})^{T}. For any target point *z*, the local linear kernel estimator approximates *θ*(*Z*_{i}) in the neighborhood of *z* by the linear function **G**(*Z*_{i} − *z*)^{T}*α*. Let *μ*(·) = *g*^{−1}(·). Suppose we postulate a working variance model var(*Y*_{i}|*Z*_{i}) = *V*[*μ*{*θ*(*Z*_{i})}; *ζ*], where *ζ* ∈ **R**^{r} is an unknown finite dimensional parameter and *V*(·, ·) is a known working variance function. To estimate *π*_{i0} we postulate a parametric model

*π*_{i0} = *π*(*Z*_{i}, **U**_{i}; *τ*),  (3)

where *π*(*Z*, **U**; *τ*) is a known smooth function of an unknown finite dimensional parameter vector *τ* ∈ **R**^{k}. For example, we can assume a logistic model logit{*π*(*Z*, **U**; *τ*)} = *τ*^{T}**W**, where **W** = (1, *Z*, **U**^{T})^{T}. We compute *τ̂*, the maximum likelihood estimator of *τ* under model (3), and then we estimate *π*_{i0} with *π̂*_{i} = *π*(*Z*_{i}, **U**_{i}; *τ̂*). Then we define the augmented inverse probability weighted (AIPW) kernel estimating equations as

Σ_{i=1}^{n} {*U*_{IPW,i}(*α*) + *A*_{i}(*α*)} = **0**,  (4)

with

*U*_{IPW,i}(*α*) = (*R*_{i}/*π̂*_{i}) *K*_{h}(*Z*_{i} − *z*) **G**(*Z*_{i} − *z*) *μ̇*_{i} *V*_{i}^{−1} [*Y*_{i} − *μ*{**G**(*Z*_{i} − *z*)^{T}*α*}],

*A*_{i}(*α*) = (1 − *R*_{i}/*π̂*_{i}) *K*_{h}(*Z*_{i} − *z*) **G**(*Z*_{i} − *z*) *μ̇*_{i} *V*_{i}^{−1} [*δ*(*Z*_{i}, **U**_{i}) − *μ*{**G**(*Z*_{i} − *z*)^{T}*α*}],

where *μ̇*_{i} is the first derivative of *μ*(·) evaluated at **G**(*Z*_{i} − *z*)^{T}*α*, *δ*(*Z*_{i}, **U**_{i}) is any arbitrary, user-specified, possibly data-dependent function of *Z*_{i} and **U**_{i}, and *V*_{i} = *V*[*μ*{**G**(*Z*_{i} − *z*)^{T}*α*}; *ζ*]. As *ζ* is unknown in practice, we estimate it by solving inverse probability weighted moment equations in which *α̂*_{j}(*ζ*) = {*α̂*_{0,j}(*ζ*), *α̂*_{1,j}(*ζ*)}^{T} solves (4) with *z* = *Z*_{j}, *j* = 1, …, *n*. Denote the resulting estimator by *ζ̂*. The AIPW estimator of *θ*(*z*) is *θ̂*_{AIPW}(*z*) = *α̂*_{0,AIPW}(*ζ̂*), where *α̂*_{AIPW} = {*α̂*_{0,AIPW}(*ζ̂*), *α̂*_{1,AIPW}(*ζ̂*)}^{T} solves (4) with *V*_{i} replaced by *V*[*μ*{**G**(*Z*_{i} − *z*)^{T}*α*}; *ζ̂*].

In the AIPW kernel estimating equations (4), the term *U*_{IPW,i}(*α*) is zero for subjects with missing outcomes, and for those with observed outcomes it is simply equal to their usual contribution to the local kernel regression estimating equations, weighted by the inverse of their probability of observing the outcome given their auxiliaries and covariates. The term *A*_{i}(*α*), which is often referred to as an augmentation term, differs from that used in parametric regression [equations (38) and (39), Robins, Rotnitzky, and Zhao 1994] in that it additionally includes the kernel function *K*_{h}(·), and in that it approximates *μ*{*θ*(*Z*_{i})} = *g*^{−1}{*θ*(*Z*_{i})} by the local polynomial *μ*{**G**(*Z*_{i} − *z*)^{T}*α*}.
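The interplay of the two terms can be made concrete in the simplest special case: with an identity link (*μ*(x) = *x*, so *μ̇*_{i} = 1) and a constant working variance, *V*_{i} cancels and summing *U*_{IPW,i}(*α*) and *A*_{i}(*α*) shows that the AIPW equations are solved by kernel-weighted least squares on an augmented pseudo-outcome. The sketch below is our own minimal illustration under those simplifying assumptions.

```python
import numpy as np

def aipw_local_linear(z, Z, U, Y, R, pi_hat, delta, h):
    """AIPW local linear estimate of theta(z), sketched for the
    identity link and constant working variance, in which case the
    AIPW estimating equations reduce to kernel-weighted least squares
    on the pseudo-outcome (R/pi)Y + (1 - R/pi)delta(Z, U)."""
    d = delta(Z, U)
    # Outcomes of units with R_i = 0 are never used.
    Ytil = np.where(R == 1, Y / pi_hat, 0.0) + (1.0 - R / pi_hat) * d
    w = np.exp(-0.5 * ((Z - z) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    G = np.column_stack([np.ones(Z.size), Z - z])
    alpha = np.linalg.solve(G.T @ (G * w[:, None]), G.T @ (w * Ytil))
    return alpha[0]  # theta_hat_AIPW(z) = alpha_0
```

Here `delta` is the user-specified function *δ*(*Z*_{i}, **U**_{i}); taking `delta` identically zero recovers the purely inverse-weighted estimator.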

Two key properties, formally proved in Section 4, make the AIPW kernel estimating equation methodology appealing, namely: (1) exploitation of the information in the auxiliary variables of subjects with missing outcomes and (2) double robustness.

Informally, property (1) is seen because both the subjects with complete data and those with missing outcomes in a local neighborhood of *Z* = *z* have a nonnegligible contribution to the AIPW kernel estimating equations. Consider the alternative IPW kernel estimator *θ̂*_{IPW}(*z*), which is obtained by simply solving the IPW kernel estimating equations Σ_{i} *U*_{IPW,i}(*α*) = **0**, that is, by ignoring the augmentation term in the estimating equations (4). Although *θ̂*_{IPW}(*z*) depends on the auxiliary variables **U** of the units with missing outcomes through the estimator *τ̂* that defines the *π̂*_{i}'s, this information is asymptotically negligible. Specifically, in Theorem 2 we show that when the support of *Z* is compact, under regularity conditions, the asymptotic distribution of *θ̂*_{IPW}(*z*) as *h* → 0, *n* → ∞, and *nh* → ∞ is the same regardless of whether one uses the true *π*_{i0} (and hence does not use the auxiliary data of incomplete units) or the fitted value *π̂*_{i} computed under a correctly specified parametric model (3). This is different from inference under a parametric regression model for *E*(*Y*|*Z*), where, as noted by Robins, Rotnitzky, and Zhao (1994, 1995), estimation of the missingness probabilities helps improve the efficiency in estimation of the regression coefficients. The reason is that the ML estimator of *π*_{i0} under a parametric model converges at the √*n*-rate, while nonparametric estimation of *θ*(*z*) proceeds at a slower rate. To see this, note that only the *O*(*nh*) units that have values of *Z* in a neighborhood of *z* of width *O*(*h*) contribute to the IPW kernel estimating equations for *E*(*Y*|*Z* = *z*), so only the auxiliary variables of these units are relevant. However, as *n* → ∞, the data of these units cannot enter the IPW kernel estimating equations via the estimation of *π*_{i0} through the estimation of the finite dimensional parameter *τ*. This is so because *τ̂* is computed parametrically using all *n* units, and the contribution of the *O*(*nh*) relevant units is asymptotically negligible. The above discussion suggests that, compared to the IPW kernel estimator, the AIPW kernel estimator of *θ*(*z*) can better exploit the information in the auxiliary variables of subjects with missing outcomes.

To construct AIPW estimators with property (2), double robustness, we specify a parametric model

*E*(*Y*_{i}|*Z*_{i}, **U**_{i}) = *δ*(*Z*_{i}, **U**_{i}; *η*),  (6)

where *η* is an unknown finite dimensional parameter vector, and we estimate *η* using the method of moments estimator *η̂* based on data from completely observed units. Under the MAR assumption (2), *η̂* is √*n*-consistent for *η*, provided model (6) is correctly specified (Little and Rubin 2002). We then compute *θ̂*_{AIPW}(*z*) using *δ*(*Z*_{i}, **U**_{i}) = *δ*(*Z*_{i}, **U**_{i}; *η̂*). In Theorem 3 in Section 4, we show that such an estimator *θ̂*_{AIPW}(*z*) is doubly robust; that is, it is consistent when either model (3) for *π*_{i0} is correct or model (6) for *E*(*Y*_{i}|*Z*_{i}, **U**_{i}) is correct, but not necessarily both. The practical consequence of double robustness is that it gives data analysts two opportunities to carry out valid inference about *θ*(*z*), one for each of the possibly correctly specified models (6) or (3). In contrast, as shown in Theorem 1 in Section 4, consistency of the IPW kernel estimator *θ̂*_{IPW}(*z*) requires that the selection probability model (3) for *π*_{i0} be correctly specified. One may question the possibility that the fully parametric model (6) for *E*(*Y*_{i}|*Z*_{i}, **U**_{i}) is correct when in fact the model of scientific interest for *E*(*Y*_{i}|*Z*_{i}) is left fully nonparametric precisely because of the lack of knowledge about the dependence of the mean of *Y* on *Z*. This valid concern is dissipated once it is understood that model (6) is only a working model that simply serves to enhance the chances of getting nearly correct (and indeed, nearly efficient) inference. Aside from this, it should also be noted that data analysts may have refined knowledge of the conditional dependence of *Y* on *Z* within levels of **U**, but not marginally over **U**.
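As an illustration of this construction, a working specification of model (6) can be fit on the complete cases and the fitted *δ*(·, ·; *η̂*) plugged into the AIPW kernel estimating equations. The linear form below is our own assumption, chosen only for the sketch; any parametric specification could be used.

```python
import numpy as np

def fit_delta(Z, U, Y, R):
    """Working outcome model in the spirit of model (6): here a
    linear specification delta(Z, U; eta) = eta0 + eta1*Z + eta2*U,
    assumed purely for illustration, fit by least squares on the
    completely observed units (consistent under MAR when the
    working model holds)."""
    m = R == 1
    X = np.column_stack([np.ones(m.sum()), Z[m], U[m]])
    eta, *_ = np.linalg.lstsq(X, Y[m], rcond=None)
    # Return delta(., .; eta_hat) for plugging into the AIPW equations.
    return lambda Zn, Un: eta[0] + eta[1] * Zn + eta[2] * Un
```

Under MAR, *E*(*Y*|*Z*, **U**, *R* = 1) = *E*(*Y*|*Z*, **U**), so regression on the *R* = 1 units identifies *η* when the working model is correct.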

In addition, in Section 4 we show that the preceding double-robust estimator *θ̂*_{AIPW}(*z*) has an additional desirable property. Specifically, if model (6) is correctly specified, then the double-robust estimator *θ̂*_{AIPW}(*z*) has the smallest asymptotic variance among all estimators solving AIPW kernel estimating equations with *π*_{i0} either known or estimated from a correctly specified parametric model (3). That is, the asymptotic variance of the resulting double-robust estimator *θ̂*_{AIPW}(*z*) that uses *δ*(*Z*_{i}, **U**_{i}) = *δ*(*Z*_{i}, **U**_{i}; *η̂*), with *η̂* a √*n*-consistent estimator of *η* under a correct model (6), is less than or equal to that of an AIPW kernel estimator using any other arbitrary function *δ*(*Z*_{i}, **U**_{i}) when the selection probability model (3) is correct.

Remark 1. Our estimators *θ̂*_{AIPW}(*z*) use the IPW method of moments estimator of the variance parameter *ζ*. Although one could construct an AIPW method of moments estimator of *ζ*, this is unnecessary because improving the efficiency in estimation of the parameter *ζ* does not help improve the efficiency in estimation of the nonparametric function *θ*(*z*). This is in accordance with estimation of parametric regression models for *E*(*Y*|*Z*), where it is well known that the efficiency of two-stage weighted least squares is unaffected by the choice of √*n*-consistent estimator of var(*Y*|*Z*) at the first stage. In fact, Theorem 3 in Section 4 asserts that the efficiency with which *θ*(*z*) is estimated is unaltered even if the working model for var(*Y*|*Z*) is incorrectly specified. This is in contrast to parametric regression models, where incorrect modeling of var(*Y*|*Z*) results in inefficient estimators of the regression parameters. The reason is that nonparametric regression is local, and variability in a diminishing neighborhood of *z* is asymptotically constant.