The apparent contradiction arises because the statement about the efficiency gains induced by including *L* in the propensity score estimators is vague: it does not explicitly mention the assumptions required for its validity. To explain the contradiction, we distinguish three models: the model defined by Assumptions 1–3, the model defined by Assumptions 1–4, and the model defined by Assumptions 1–3 and 5.

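Both the estimator that includes *L* in the propensity score and the estimator that omits it are consistent for *E*(*Y*_{1}) under the model defined by Assumptions 1–4 or the model defined by Assumptions 1–3 and 5, but only the estimator that includes *L* is consistent for *E*(*Y*_{1}) under the model defined by Assumptions 1–3.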

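The estimator that includes *L* is asymptotically efficient under the model defined by Assumptions 1–3 and under the model defined by Assumptions 1–3 and 5, but the estimator that omits *L* is asymptotically efficient under the model defined by Assumptions 1–4. These efficiency results are best understood by examining the factorization of the likelihood in (5).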

The model defined by Assumptions 1–3 imposes restrictions on the law of (*Y*_{1}, *L*, *A*) but not on the distribution *f*_{A,Y,L} of the observed data (*Y*, *L*, *A*) (Gill et al., 1997), and hence is a nonparametric model for the observables. Because the estimator that includes *L* is the plug-in estimator of *μ* = *E*{*E*(*Y* | *A* = 1, *L*)}, it is the maximum likelihood estimator of *μ* under this nonparametric model.

The model defined by Assumptions 1–3 and 5 restricts the law *f*_{A|L} entering the second term on the right-hand side of (5), since Assumption 5 postulates that *f*_{A|L} = *f*_{A}. Because, by (2), *μ* depends only on the components of the law entering the ℒ_{1,n}-part of the likelihood (5), the maximum likelihood estimators of *μ* under the model defined by Assumptions 1–3 and under the model defined by Assumptions 1–3 and 5 must agree. Thus, the estimator that includes *L* is the maximum likelihood estimator of *μ* under the model defined by Assumptions 1–3 and 5 and is consequently asymptotically efficient, i.e. its avar is equal to the semiparametric variance bound for *μ* under the model. Hereafter, avar(·) denotes the variance of the limiting distribution.
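The argument above relies on the likelihood (5) splitting into a part that carries *μ* and a second term that carries the propensity score. A sketch of such a factorization, consistent with the description in the text, is given below; the symbols ℒ_{n}, ℒ_{1,n} and ℒ_{2,n} and the exact grouping of the factors are assumptions made for illustration.

```latex
\mathcal{L}_{n} = \mathcal{L}_{1,n} \times \mathcal{L}_{2,n},
\qquad
\mathcal{L}_{1,n} = \prod_{i=1}^{n} f_{Y \mid A, L}(Y_i \mid A_i, L_i)\, f_{L}(L_i),
\qquad
\mathcal{L}_{2,n} = \prod_{i=1}^{n} f_{A \mid L}(A_i \mid L_i).
```

Under a factorization of this kind, *μ* = *E*{*E*(*Y* | *A* = 1, *L*)} is a functional of *f*_{Y|A,L} and *f*_{L} only, so a restriction placed on *f*_{A|L} alone leaves the maximizer of the first part unchanged.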

The model defined by Assumptions 1–4 imposes the restriction *f*_{Y|A=1,L} = *f*_{Y|A=1} and hence restricts the law *f*_{Y|A,L} in ℒ_{1,n}. The estimator that includes *L* is not the maximum likelihood estimator under this model because it does not exploit this restriction. In fact, under the model defined by Assumptions 1–4, the estimator that omits *L* is asymptotically efficient. Furthermore, it is asymptotically strictly more efficient than the estimator that includes *L* unless Assumption 5 also holds. Proofs of these results can be found in the online Supplementary Material. We are now ready to explain the contradiction.

Given an arbitrary function *d*(*l*) and any *π*(*l*), let μ̂_{d}(*π*) denote the solution, in *μ*, of the corresponding inverse probability weighted estimating equation. The following lemma, a corollary of the theory laid out in Robins et al. (1994), states the precise result of the theory of inverse probability weighted estimation that the gain in efficiency of the estimator that omits *L* over the estimator that includes *L* appears to contradict.

Lemma 1. *Given one of the models defined by Assumptions 1–3, Assumptions 1–4, or Assumptions 1–3 and 5 for the observables, let* π̂(*l*) *and* *π̃*(*l*) *be the maximum likelihood estimators of* *f*_{A|L}(1 | *l*) *under two nested models for* *f*_{A|L} *that are correctly specified under the assumptions of the given model. Then* √*n*{μ̂_{d}(π̂) − *μ*} *and* √*n*{μ̂_{d}(*π̃*) − *μ*} *converge to mean zero normal distributions. If* π̂(*l*) *is the estimator of* *f*_{A|L}(1 | *l*) *under the larger model, then* avar{μ̂_{d}(π̂)} ≤ avar{μ̂_{d}(*π̃*)}.

Observe that the estimator that includes *L*, which solves (3), and the estimator that omits *L*, which solves (4), can be written as μ̂_{d1}(π̂) and μ̂_{d1}(*π̃*), respectively, with *d*_{1}(*l*) = 1, π̂(*l*) = *E*_{n}(*A* | *L* = *l*) and *π̃*(*l*) = *E*_{n}(*A*). The improved efficiency of μ̂_{d1}(*π̃*) over μ̂_{d1}(π̂) under the model defined by Assumptions 1–4, i.e. the fact that generally avar{μ̂_{d1}(*π̃*)} is strictly smaller than avar{μ̂_{d1}(π̂)}, does not contradict Lemma 1 because *π̃*(*l*) does not meet its premise. Specifically, Lemma 1 requires that *π̃*(*l*) be computed under a model for *f*_{A|L} that is correctly specified under the given model, in our case the model defined by Assumptions 1–4. However, *π̃*(*l*) = *E*_{n}(*A*) is the fitted value under a model for *f*_{A|L} that assumes that *A* and *L* are independent, an assumption not made by the model defined by Assumptions 1–4.
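The representation of the two estimators as inverse probability weighted estimators with *d*_{1}(*l*) = 1 can be checked numerically. The sketch below is illustrative only: it assumes that the estimating equation has the form Σ_i *d*(*L*_i) *A*_i (*Y*_i − *μ*) / *π*(*L*_i) = 0, which is consistent with the identities stated in the text but is not taken from it, and it uses hypothetical simulated data with a binary *L*. The names `ipw_estimate`, `mu_incl` and `mu_omit` are likewise hypothetical.

```python
import random


def ipw_estimate(y, a, l, d, pi):
    """Solve sum_i d(L_i) * A_i * (Y_i - mu) / pi(L_i) = 0 for mu.

    The form of this estimating equation is an assumption made for
    illustration; only treated units (A_i = 1) contribute to it.
    """
    treated = [(yi, li) for yi, ai, li in zip(y, a, l) if ai == 1]
    num = sum(d(li) * yi / pi(li) for yi, li in treated)
    den = sum(d(li) / pi(li) for _, li in treated)
    return num / den


# Hypothetical simulated data with a binary covariate L.
random.seed(0)
n = 2000
l = [random.randint(0, 1) for _ in range(n)]
a = [1 if random.random() < (0.3 if li == 0 else 0.7) else 0 for li in l]
y = [li + random.gauss(0.0, 1.0) for li in l]

levels = sorted(set(l))
n_l = {v: sum(1 for li in l if li == v) for v in levels}
n_1l = {v: sum(ai for ai, li in zip(a, l) if li == v) for v in levels}

# pi_hat(l) = E_n(A | L = l): propensity score that includes L.
pi_hat = {v: n_1l[v] / n_l[v] for v in levels}
# pi_tilde = E_n(A): propensity score that omits L.
pi_tilde = sum(a) / n

mu_incl = ipw_estimate(y, a, l, lambda v: 1.0, lambda v: pi_hat[v])
mu_omit = ipw_estimate(y, a, l, lambda v: 1.0, lambda v: pi_tilde)

# Closed forms that the two estimators reduce to when d(l) = 1.
mean_treated = {
    v: sum(yi for yi, ai, li in zip(y, a, l) if ai == 1 and li == v) / n_1l[v]
    for v in levels
}
plug_in = sum(n_l[v] / n * mean_treated[v] for v in levels)
treated_mean = sum(yi for yi, ai in zip(y, a) if ai == 1) / sum(a)

print(abs(mu_incl - plug_in) < 1e-9, abs(mu_omit - treated_mean) < 1e-9)
```

Under the assumed estimating equation, the estimator that uses *E*_{n}(*A* | *L* = *l*) coincides with the plug-in estimator of *E*{*E*(*Y* | *A* = 1, *L*)}, and the estimator that uses *E*_{n}(*A*) coincides with the sample mean of *Y* among units with *A* = 1.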

The efficiency gain of the estimator that omits *L* over the estimator that includes *L* under the model defined by Assumptions 1–4 can be deduced from the general theory of efficient inverse probability weighted estimation in semiparametric models for missing data (Robins et al., 1994, proposition 8.1). In the Supplementary Material we apply this theory to show that: (a) the estimator that omits *L* is asymptotically equivalent to μ̂_{d2}(π̂) with *d*_{2}(*l*) = *E*(*A* | *L* = *l*) and (b) μ̂_{d2}(π̂), and therefore the estimator that omits *L*, is semiparametric efficient under the model defined by Assumptions 1–4.
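A small Monte Carlo experiment can illustrate these claims. The sketch below is not the authors' proof or simulation: it assumes the same form of estimating equation, Σ_i *d*(*L*_i) *A*_i (*Y*_i − *μ*) / *π*(*L*_i) = 0, and a hypothetical data-generating process in which *A* depends on a binary *L* while the outcome does not, so that *f*_{Y|A=1,L} = *f*_{Y|A=1}. The sample size, the number of replications and the function names are arbitrary choices.

```python
import random
import statistics


def ipw_estimate(y, a, l, d, pi):
    """Solve sum_i d(L_i) * A_i * (Y_i - mu) / pi(L_i) = 0 for mu (assumed form)."""
    treated = [(yi, li) for yi, ai, li in zip(y, a, l) if ai == 1]
    num = sum(d(li) * yi / pi(li) for yi, li in treated)
    den = sum(d(li) / pi(li) for _, li in treated)
    return num / den


def true_propensity(v):
    """Hypothetical E(A | L = v); A and L are strongly dependent."""
    return 0.1 if v == 0 else 0.9


def one_replication(rng, n=400):
    l = [rng.randint(0, 1) for _ in range(n)]
    a = [1 if rng.random() < true_propensity(li) else 0 for li in l]
    # The outcome ignores L, so f(Y | A = 1, L) = f(Y | A = 1).
    y = [rng.gauss(1.0, 1.0) for _ in range(n)]
    pi_hat = {
        v: sum(ai for ai, li in zip(a, l) if li == v) / l.count(v)
        for v in (0, 1)
    }
    pi_tilde = sum(a) / n
    incl = ipw_estimate(y, a, l, lambda v: 1.0, lambda v: pi_hat[v])
    omit = ipw_estimate(y, a, l, lambda v: 1.0, lambda v: pi_tilde)
    # d_2(l) = E(A | L = l), combined with the propensity score that includes L.
    d2 = ipw_estimate(y, a, l, true_propensity, lambda v: pi_hat[v])
    return incl, omit, d2


rng = random.Random(1)
draws = [one_replication(rng) for _ in range(300)]
# Monte Carlo variances of the three estimators across replications.
var_incl, var_omit, var_d2 = (statistics.variance(col) for col in zip(*draws))
print(var_incl, var_omit, var_d2)
```

In this setting the estimator that omits *L* should show a clearly smaller Monte Carlo variance than the estimator that includes *L*, and the variance obtained with *d*_{2} and the propensity score that includes *L* should be close to that of the estimator that omits *L*, in line with claim (a).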

In conclusion, the fallacy arises because the claim about efficiency gains presupposes an explicit model for the law of (*A*, *L*, *Y*) and requires that both propensity score models be correct under that model. However, *E*_{n}(*A*) is the efficient propensity score estimator under a model not implied by the model defined by Assumptions 1–4, so the efficiency claim does not apply.