The apparent contradiction arises because the statement about the efficiency gains induced by including $L$ in the propensity score estimators is vague: it does not explicitly state the assumptions required for its validity. To explain the contradiction, let $\mathcal{A}$ denote the model defined by Assumptions 1–3, let $\mathcal{B}$ denote the model defined by Assumptions 1–4 and let $\mathcal{C}$ denote the model defined by Assumptions 1–3 and 5.
Both $\hat{\mu}$ and $\tilde{\mu}$ are consistent for $E(Y_1)$ under model $\mathcal{B}$ or $\mathcal{C}$, but only $\hat{\mu}$ is consistent for $E(Y_1)$ under model $\mathcal{A}$.
The estimator $\hat{\mu}$ is asymptotically efficient under model $\mathcal{A}$ and under model $\mathcal{C}$, but $\tilde{\mu}$ is asymptotically efficient under model $\mathcal{B}$. These efficiency results are best understood by examining the likelihood

$$\mathcal{L}_n = \mathcal{L}_{1,n} \times \mathcal{L}_{2,n}, \qquad (5)$$

where $\mathcal{L}_{1,n} = \prod_{i=1}^{n} f_{Y|A,L}(Y_i \mid A_i, L_i)\, f_L(L_i)$ and $\mathcal{L}_{2,n} = \prod_{i=1}^{n} f_{A|L}(A_i \mid L_i)$.
Model $\mathcal{A}$ imposes restrictions on the law of $(Y_1, L, A)$ but not on the distribution $f_{A,Y,L}$ of the observed data $(Y, L, A)$ (Gill et al., 1997), and hence is a nonparametric model for the observables. Because the estimator $\hat{\mu}$ is the plug-in estimator of $\mu = E\{E(Y \mid A = 1, L)\}$, it is the maximum likelihood estimator of $\mu$ under the nonparametric model $\mathcal{A}$.
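The plug-in form of $\hat{\mu}$ can be compared numerically with an inverse probability weighted form. The Python sketch below is illustrative only: it assumes a binary covariate $L$, the saturated propensity estimate $\hat{\pi}(l) = E_n(A \mid L = l)$, and a hypothetical data-generating process. The function names and numerical settings are not taken from the source.

```python
import random


def simulate(n, seed):
    """Hypothetical design: binary L, A depends on L, Y1 depends on L."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        l = 1 if rng.random() < 0.4 else 0
        a = 1 if rng.random() < 0.3 + 0.4 * l else 0
        # y is only used when a == 1, so it plays the role of Y1 there.
        y = 1.0 + 2.0 * l + rng.gauss(0.0, 1.0)
        sample.append((y, l, a))
    return sample


def plug_in_and_ipw(sample):
    """Return (plug-in estimate, weighted estimate) of E{E(Y | A=1, L)}."""
    n = len(sample)
    levels = sorted({l for _, l, _ in sample})
    pi_hat = {}
    plug_in = 0.0
    for lev in levels:
        stratum = [(y, a) for y, l, a in sample if l == lev]
        treated = [y for y, a in stratum if a == 1]
        # Saturated propensity estimate E_n(A | L = lev).
        pi_hat[lev] = len(treated) / len(stratum)
        # Empirical P(L = lev) times the treated mean in the stratum.
        plug_in += (len(stratum) / n) * (sum(treated) / len(treated))
    # Weighted form: E_n{ A Y / pi_hat(L) }.
    ipw = sum(a * y / pi_hat[l] for y, l, a in sample) / n
    return plug_in, ipw


plug_in, ipw = plug_in_and_ipw(simulate(2000, seed=0))
print(plug_in, ipw)
```

With a saturated propensity estimate the two forms agree up to floating point rounding, so the nonparametric plug-in estimator can also be read as a weighted estimator.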
Model $\mathcal{C}$ restricts the law $f_{A|L}$ entering the second term on the right-hand side of (5), since Assumption 5 postulates that $f_{A|L} = f_A$. Because, by (2), $\mu$ depends only on the components of the law entering the $\mathcal{L}_{1,n}$ part of the likelihood (5), the maximum likelihood estimators of $\mu$ under models $\mathcal{A}$ and $\mathcal{C}$ must agree. Thus, $\hat{\mu}$ is the maximum likelihood estimator of $\mu$ under model $\mathcal{C}$ and is consequently asymptotically efficient, i.e. $\operatorname{avar}(\hat{\mu})$ is equal to the semiparametric variance bound for $\mu$ under the model. Hereafter, $\operatorname{avar}(\cdot)$ denotes the variance of the limiting distribution.
Model $\mathcal{B}$ imposes the restriction $f_{Y|A=1,L} = f_{Y|A=1}$, and hence it restricts the law $f_{Y|A,L}$ in $\mathcal{L}_{1,n}$. The estimator $\hat{\mu}$ is not the maximum likelihood estimator under model $\mathcal{B}$ because it does not exploit this restriction. In fact, under model $\mathcal{B}$, $\tilde{\mu}$ is asymptotically efficient. Furthermore, $\tilde{\mu}$ is asymptotically strictly more efficient than $\hat{\mu}$ unless Assumption 5 also holds. Proofs of these results can be found in the online Supplementary Material. We are now ready to explain the contradiction.
Given an arbitrary function $d(l)$ and any $\pi(l)$, let $\hat{\mu}_d(\pi)$ denote the solution to

$$E_n\left\{ \frac{d(L)\, A\, (Y - \mu)}{\pi(L)} \right\} = 0.$$

The following lemma, a corollary of the theory laid out in Robins et al. (1994), states the precise result of the theory of inverse probability weighted estimation that the gain in efficiency of $\tilde{\mu}$ over $\hat{\mu}$ appears to contradict.
Lemma 1. Given one of the models $\mathcal{A}$, $\mathcal{B}$ or $\mathcal{C}$ for the observables, let $\hat{\pi}(l)$ and $\tilde{\pi}(l)$ be the maximum likelihood estimators of $f_{A|L}(1 \mid l)$ under two nested models for $f_{A|L}$ that are correctly specified under the assumptions of the given model. Then $\sqrt{n}\{\hat{\mu}_d(\hat{\pi}) - \mu\}$ and $\sqrt{n}\{\hat{\mu}_d(\tilde{\pi}) - \mu\}$ converge to mean zero normal distributions. If $\hat{\pi}(l)$ is the estimator of $f_{A|L}(1 \mid l)$ under the larger model, then

$$\operatorname{avar}\{\hat{\mu}_d(\hat{\pi})\} \le \operatorname{avar}\{\hat{\mu}_d(\tilde{\pi})\}.$$
Observe that because $\hat{\mu}$ solves (3) and $\tilde{\mu}$ solves (4) we can write $\hat{\mu} = \hat{\mu}_{d_1}(\hat{\pi})$ and $\tilde{\mu} = \hat{\mu}_{d_1}(\tilde{\pi})$ with $d_1(l) = 1$, $\hat{\pi}(l) = E_n(A \mid L = l)$ and $\tilde{\pi}(l) = E_n(A)$. The improved efficiency of $\tilde{\mu}$ over $\hat{\mu}$ under model $\mathcal{B}$, i.e. the fact that generally $\operatorname{avar}(\tilde{\mu})$ is strictly smaller than $\operatorname{avar}(\hat{\mu})$, does not contradict Lemma 1 because $\tilde{\pi}(l)$ does not meet its premise. Specifically, Lemma 1 requires that $\tilde{\pi}(l)$ be computed under a model for $f_{A|L}$ that is correctly specified under the given model, here model $\mathcal{B}$. However, $\tilde{\pi}(l) = E_n(A)$ is the fitted value under a model for $f_{A|L}$ that assumes that $A$ and $L$ are independent, an assumption not made by model $\mathcal{B}$.
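A small Monte Carlo experiment can make the two regimes concrete. The Python sketch below is illustrative only and is not the authors' simulation. It assumes a binary $L$ and two hypothetical designs: design "C", in which $A$ is independent of $L$ so that both propensity models are correct, and design "B", in which $A$ depends on $L$ while the outcome among the treated does not. All numerical settings and function names are assumptions.

```python
import random
import statistics


def draw(n, rng, design):
    """Simulate n triples (y, l, a); y is only used when a == 1."""
    data = []
    for _ in range(n):
        l = 1 if rng.random() < 0.5 else 0
        if design == "B":
            # A depends on L, the outcome does not: E_n(A) is misspecified.
            p_a, mean_y = 0.2 + 0.6 * l, 1.0
        else:
            # Design "C": A independent of L, the outcome depends on L.
            p_a, mean_y = 0.5, 1.0 + 2.0 * l
        a = 1 if rng.random() < p_a else 0
        y = mean_y + rng.gauss(0.0, 1.0)
        data.append((y, l, a))
    return data


def mu_hat(data):
    """Estimator using pi_hat(l) = E_n(A | L = l), written in plug-in form."""
    n = len(data)
    est = 0.0
    for lev in (0, 1):
        stratum = [(y, a) for y, l, a in data if l == lev]
        treated = [y for y, a in stratum if a == 1]
        est += (len(stratum) / n) * statistics.fmean(treated)
    return est


def mu_tilde(data):
    """Estimator using pi_tilde(l) = E_n(A): the mean of Y among the treated."""
    return statistics.fmean([y for y, _, a in data if a == 1])


def mc_variances(design, n=300, reps=1500, seed=1):
    """Monte Carlo variances of (mu_hat, mu_tilde) over repeated samples."""
    rng = random.Random(seed)
    hats, tildes = [], []
    for _ in range(reps):
        data = draw(n, rng, design)
        hats.append(mu_hat(data))
        tildes.append(mu_tilde(data))
    return statistics.variance(hats), statistics.variance(tildes)


var_hat_C, var_tilde_C = mc_variances("C")
var_hat_B, var_tilde_B = mc_variances("B")
print("design C:", var_hat_C, var_tilde_C)
print("design B:", var_hat_B, var_tilde_B)
```

Under design "C" the estimator based on the larger propensity model should show the smaller variance, in line with Lemma 1. Under design "B" the ordering should reverse, which is compatible with the lemma because $E_n(A)$ is then fitted under a misspecified propensity model.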
The efficiency gains conferred by $\tilde{\mu}$ over $\hat{\mu}$ under model $\mathcal{B}$ can be deduced from the general theory of efficient inverse probability weighted estimation in semiparametric models for missing data (Robins et al., 1994, Proposition 8.1). In the Supplementary Material we apply this theory to show that: (a) $\tilde{\mu}$ is asymptotically equivalent to $\hat{\mu}_{d_2}(\hat{\pi})$ with $d_2(l) = E(A \mid L = l)$ and (b) $\hat{\mu}_{d_2}(\hat{\pi})$, and therefore $\tilde{\mu}$, is semiparametric efficient under $\mathcal{B}$.
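Claim (a) can be illustrated numerically. The Python sketch below is a sketch under stated assumptions: it takes the estimating equation to have the form $E_n\{d(L)\,A\,(Y - \mu)/\pi(L)\} = 0$, uses a binary $L$, and plugs in the true propensity for $d_2$, which would be unknown in practice. The design values and function names are hypothetical.

```python
import random


def solve_weighted(data, d, pi):
    """Solve sum_i d(L_i) A_i (Y_i - mu) / pi(L_i) = 0 for mu."""
    num = sum(d[l] * a * y / pi[l] for y, l, a in data)
    den = sum(d[l] * a / pi[l] for y, l, a in data)
    return num / den


rng = random.Random(2)
true_pi = {0: 0.2, 1: 0.8}  # assumed true E(A | L = l), used as d2
data = []
for _ in range(20000):
    l = 1 if rng.random() < 0.5 else 0
    a = 1 if rng.random() < true_pi[l] else 0
    # The outcome does not depend on L, mimicking a design of type B.
    y = 1.0 + rng.gauss(0.0, 1.0)
    data.append((y, l, a))

# Saturated propensity estimate pi_hat(l) = E_n(A | L = l).
pi_hat = {}
for lev in (0, 1):
    stratum = [a for _, l, a in data if l == lev]
    pi_hat[lev] = sum(stratum) / len(stratum)

mu_d2 = solve_weighted(data, d=true_pi, pi=pi_hat)
mu_tilde = sum(a * y for y, _, a in data) / sum(a for _, _, a in data)
print(mu_d2, mu_tilde)
```

Because the ratio $d_2(l)/\hat{\pi}(l)$ is close to one in every stratum, the weighted solution nearly coincides with the treated-sample mean in a large sample.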
In conclusion, the fallacy arises because the claim about efficiency gains assumes an explicit model for the law of $(A, L, Y)$ and requires that both propensity score models be correct under the given model. However, $E_n(A)$ is the efficient propensity score estimator under a model not implied by model $\mathcal{B}$, so the efficiency claim does not apply.