Biometrika. 2010 December; 97(4): 997–1001.
Published online 2010 July 31. doi:  10.1093/biomet/asq049
PMCID: PMC3371719

A note on overadjustment in inverse probability weighted estimation

Andrea Rotnitzky
Di Tella University, Sáenz Valiente 1010, Buenos Aires, Argentina; andrea@hsph.harvard.edu
Lingling Li
Department of Population Medicine, Harvard Medical School, Harvard Pilgrim Health Care Institute, Boston, Massachusetts 02115, U.S.A.; lingling_li@post.harvard.edu


Standardized means, commonly used in observational studies in epidemiology to adjust for potential confounders, are equal to inverse probability weighted means with inverse weights equal to the empirical propensity scores. More refined standardization corresponds with empirical propensity scores computed under more flexible models. Unnecessary standardization induces efficiency loss. However, according to the theory of inverse probability weighted estimation, propensity scores estimated under more flexible models induce improvement in the precision of inverse probability weighted means. This apparent contradiction is clarified by explicitly stating the assumptions under which the improvement in precision is attained.

Some key words: Causal inference, Propensity score, Standardized mean

1. Introduction

Often, epidemiological studies aim to evaluate the causal effect of a discrete exposure on an outcome. In observational studies systematic bias due to confounding is a serious concern. For this reason, investigators routinely collect and adjust for a large number of confounding factors in data analyses. A common analytic strategy is to categorize the confounders and then to compare the exposure group-specific standardized means. These are exposure group-specific weighted means of the outcome across levels of the categorized confounders with weights equal to the empirical probabilities of the categorized confounders in the entire sample. It is well known that overcategorization, i.e. unnecessary categorization, may induce efficiency losses. This issue is essentially the same as the well-understood increase in variance induced by adding to a linear regression model covariates that have no partial correlation with the outcome (Cochran, 1968). It has been studied in a number of nonlinear regression settings, e.g. Mantel & Haenszel (1959), Breslow (1982), Gail (1988), Robinson & Jewell (1991), Neuhäuser & Becher (1997) and De Stavola & Cox (2008), and has been empirically analyzed for standardized means in Brookhart et al. (2006).

The issue, however, appears to contradict well-known facts in the theory of inverse probability weighted estimation. Specifically, a standardized mean is equal to a so-called inverse probability of treatment weighted mean. More precisely, it is equal to a group-specific mean of the outcome weighted by the inverse of the empirical propensity score. An empirical propensity score is the maximum likelihood estimate of the true propensity score, i.e. of the probability of being in the exposure group given the confounders, under a saturated model for the probability of exposure given the categorized confounder. The apparent contradiction is that more refined categorization corresponds to more flexible models for the propensity score, and according to the theory of inverse probability estimation, the use of more flexible propensity score models induces an improvement in the precision of inverse probability means, and not a decrease in precision as regression theory indicates.

The purpose of this note is to clarify this apparent contradiction by showing that efficiency losses induced by unnecessarily refined categorizations do not contradict, and indeed are a consequence of, the theory of inverse probability estimation.

2. The apparent contradiction

Consider a cohort study in which a discrete exposure variable A, an outcome Y and a vector of pre-exposure covariates X are measured for each of n subjects drawn at random from a study population. Although the typical goal of such a study is the evaluation of the exposure effect on the outcome, i.e. a comparison across exposure levels, the issues in this note are best understood by considering inference about the outcome mean at one specific exposure level. Thus, we will assume that A is binary and that the goal is to estimate the outcome mean at exposure level A = 1. Consider a categorization of X into J strata and let L denote the polytomous variable that records the stratum to which a subject with covariates X belongs. The standardized mean at exposure level A = 1 and with categorized variable L is

$$\hat{\mu} \equiv E_n\{E_n(Y \mid A = 1, L)\}, \tag{1}$$

where throughout, for any U and V,

$$E_n(U) \equiv n^{-1} \sum_{i=1}^{n} U_i, \qquad E_n(U \mid A = 1, V = v) \equiv \frac{\sum_{i: A_i = 1, V_i = v} U_i}{\sum_{i: A_i = 1, V_i = v} 1}.$$
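
The definition above can be sketched in a few lines of code. This is a minimal illustration, not the authors' software: the function name `standardized_mean` and the six-subject data set are hypothetical, and the empirical positivity check is an assumption added so that every stratum mean is defined.

```python
from collections import defaultdict


def standardized_mean(y, a, l):
    """Return E_n{E_n(Y | A = 1, L)}: stratum means among the exposed,
    averaged over the empirical distribution of L in the whole sample."""
    n = len(y)
    n_stratum = defaultdict(int)      # subjects per stratum
    n_exposed = defaultdict(int)      # exposed subjects per stratum
    sum_exposed = defaultdict(float)  # sum of Y over exposed subjects per stratum
    for yi, ai, li in zip(y, a, l):
        n_stratum[li] += 1
        if ai == 1:
            n_exposed[li] += 1
            sum_exposed[li] += yi
    # Empirical positivity: every observed stratum needs an exposed subject.
    if any(n_exposed[s] == 0 for s in n_stratum):
        raise ValueError("a stratum has no exposed subjects")
    return sum(
        (n_stratum[s] / n) * (sum_exposed[s] / n_exposed[s]) for s in n_stratum
    )


# Hypothetical toy data: two strata of L, three subjects in each.
y = [1.0, 3.0, 5.0, 8.0, 4.0, 6.0]
a = [1, 1, 0, 1, 0, 0]
l = [0, 0, 0, 1, 1, 1]
mu_hat = standardized_mean(y, a, l)
```

In this toy example the exposed means are 2 in stratum 0 and 8 in stratum 1, each stratum holds half the sample, and the standardized mean is therefore 5, whereas the unadjusted mean among the exposed is 4.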

For standardized means to be informative about causal effects, certain assumptions need to hold. The issue is best articulated within the potential outcomes framework. Let $Y_a$ be the subject's potential outcome if, perhaps contrary to fact, he is exposed to $A = a$. Contrasts comparing $E(Y_1)$ and $E(Y_0)$ quantify the causal effect of exposure. The standardized mean $\hat{\mu}$ is consistent for $\mu \equiv E(Y_1)$ under the following assumptions.

  • Assumption 1. Consistency: $Y = Y_A$.
  • Assumption 2. Positivity: $\mathrm{pr}(A = 1 \mid L) > 0$.
  • Assumption 3. No unmeasured confounders: $Y_1$ and $A$ are conditionally independent given $L$.

This is because, under Assumptions 1–3,

$$\mu = E\{E(Y \mid A = 1, L)\}. \tag{2}$$
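
The identity expressing $\mu = E(Y_1)$ through the observed data can be traced to the three assumptions one step at a time. The chain below is a standard argument written out for convenience; it is a sketch, and the annotations indicate which assumption is used at each equality.

```latex
\begin{align*}
E(Y_1) &= E\{E(Y_1 \mid L)\}
  && \text{(iterated expectations)} \\
       &= E\{E(Y_1 \mid A = 1, L)\}
  && \text{(Assumption 3; the conditioning event has positive probability by Assumption 2)} \\
       &= E\{E(Y \mid A = 1, L)\}
  && \text{(Assumption 1: } Y = Y_1 \text{ when } A = 1\text{)}.
\end{align*}
```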

The apparent contradiction discussed in this note refers to the asymptotic behaviour of $\hat{\mu}$ under two categorizations, one more refined than the other. The essence of the matter is best understood by considering the extreme case contrasting the asymptotic behaviour of the adjusted average $\hat{\mu}$ with that of the crude unadjusted average,

$$\tilde{\mu} \equiv E_n(Y \mid A = 1).$$

Our discussion focusses on this comparison. The well-known risk of bias induced by underadjustment, i.e. by failure to adjust for an important confounder, is vividly unmasked in this extreme case: the crude average $\tilde{\mu}$ does not generally converge in probability to $E(Y_1)$. Formally, $\tilde{\mu}$ converges to $E(Y_1 \mid A = 1)$, which is not generally equal to $E(Y_1)$ because $Y_1$ and $A$ may share the common determinant $L$. Consistency of $\tilde{\mu}$ requires that, in addition to Assumptions 1–3, at least one of the following two independencies hold.

  • Assumption 4. The variables Y and L are conditionally independent given A = 1.
  • Assumption 5. The variables A and L are independent.
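
The bias of the crude average when both Assumption 4 and Assumption 5 fail can be seen in a small simulation. The sketch below uses a hypothetical data-generating design chosen only for illustration: L is binary, pr(A = 1 | L) is 0.2 or 0.8, and Y_1 has mean 2L, so that E(Y_1) = 1 while E(Y_1 | A = 1) = 1.6. The function names and the seed are assumptions, not part of the original note.

```python
import random


def simulate(n, seed):
    """Hypothetical design in which L is a common determinant of A and Y_1."""
    rng = random.Random(seed)
    y, a, l = [], [], []
    for _ in range(n):
        li = 1 if rng.random() < 0.5 else 0
        ai = 1 if rng.random() < (0.8 if li else 0.2) else 0  # A depends on L
        y1 = 2.0 * li + rng.gauss(0.0, 1.0)                   # Y_1 depends on L
        y0 = rng.gauss(0.0, 1.0)
        y.append(y1 if ai else y0)                            # consistency: Y = Y_A
        a.append(ai)
        l.append(li)
    return y, a, l


def crude_mean(y, a):
    """Unadjusted average E_n(Y | A = 1)."""
    return sum(yi for yi, ai in zip(y, a) if ai) / sum(a)


def standardized_mean(y, a, l):
    """Adjusted average E_n{E_n(Y | A = 1, L)}."""
    n = len(y)
    total = 0.0
    for s in set(l):
        n_s = sum(1 for li in l if li == s)
        exposed = [yi for yi, ai, li in zip(y, a, l) if ai and li == s]
        total += (n_s / n) * (sum(exposed) / len(exposed))
    return total


y, a, l = simulate(n=20000, seed=1)
mu_tilde = crude_mean(y, a)          # targets E(Y_1 | A = 1) = 1.6 in this design
mu_hat = standardized_mean(y, a, l)  # targets E(Y_1) = 1.0
```

With a sample this large the crude average should sit near 1.6 and the standardized mean near 1.0, which is the underadjustment bias described above.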

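In the Appendix we show that the standardized mean $\hat{\mu}$ solves the inverse probability weighted estimating equation

$$E_n\left[\frac{A}{E_n(A \mid L)}(Y - \mu)\right] = 0, \tag{3}$$

whereas the crude average $\tilde{\mu}$ solves the inverse probability weighted estimating equation

$$E_n\left[\frac{A}{E_n(A)}(Y - \mu)\right] = 0, \tag{4}$$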

whence the apparent contradiction emerges. Specifically, both $E_n(A \mid L)$ and $E_n(A)$ can be regarded as efficient estimators of the propensity score $\pi(l) \equiv E(A \mid L = l)$, the former under a saturated model on $L$ and the latter under the smaller model that assumes independence of $A$ and $L$. According to the theory of inverse probability estimation, inclusion of covariate $L$ in an efficiently estimated model for the propensity score should not be detrimental to the efficiency with which $E(Y_1)$ is estimated even if the covariate is not needed for bias correction. This appears to contradict the fact that under Assumptions 1–3, the crude average $\tilde{\mu}$ is more efficient than the standardized mean $\hat{\mu}$ when Assumption 4 holds and Assumption 5 fails.
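
The two weighted representations can be checked numerically on a toy data set. The sketch below assumes the estimating equations take the usual form E_n[{A/π(L)}(Y − μ)] = 0, whose solution is E_n{AY/π(L)} / E_n{A/π(L)}; the helper names `ipw_mean` and `empirical_propensity` and the data are hypothetical.

```python
from collections import defaultdict


def ipw_mean(y, a, pi):
    """Solve E_n[{A / pi}(Y - mu)] = 0 for mu, given per-subject values of pi."""
    num = sum(ai * yi / p for yi, ai, p in zip(y, a, pi))
    den = sum(ai / p for ai, p in zip(a, pi))
    return num / den


def empirical_propensity(a, l):
    """Saturated-model fit E_n(A | L): exposed fraction within each stratum."""
    n_stratum, n_exposed = defaultdict(int), defaultdict(int)
    for ai, li in zip(a, l):
        n_stratum[li] += 1
        n_exposed[li] += ai
    return [n_exposed[li] / n_stratum[li] for li in l]


# Hypothetical toy data: two strata of L, three subjects in each.
y = [1.0, 3.0, 5.0, 8.0, 4.0, 6.0]
a = [1, 1, 0, 1, 0, 0]
l = [0, 0, 0, 1, 1, 1]
n = len(y)

pi_hat = empirical_propensity(a, l)  # E_n(A | L), here 2/3 and 1/3
pi_tilde = [sum(a) / n] * n          # E_n(A), here 1/2 for everyone

mu_hat = ipw_mean(y, a, pi_hat)      # weights from the saturated model
mu_tilde = ipw_mean(y, a, pi_tilde)  # weights from the independence model

# Direct computations for comparison.
crude = sum(yi for yi, ai in zip(y, a) if ai) / sum(a)
standardized = 0.5 * ((1.0 + 3.0) / 2) + 0.5 * 8.0
```

Weighting by the stratum-specific exposed fractions reproduces the standardized mean, and weighting by the overall exposed fraction reproduces the crude mean among the exposed, as the Appendix argument implies.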

3. Explaining the apparent contradiction

The apparent contradiction arises because of the vagueness of the statement about the efficiency gains induced by including L in the propensity score estimators, which does not explicitly mention the assumptions required for its validity. To explain the contradiction, let $\mathcal{A}$ denote the model defined by Assumptions 1–3, let B denote the model defined by Assumptions 1–4 and let C denote the model defined by Assumptions 1–3 and 5.

Both the standardized mean $\hat{\mu}$ and the crude average $\tilde{\mu}$ are consistent for $E(Y_1)$ under model B or C, but only $\hat{\mu}$ is consistent for $E(Y_1)$ under model $\mathcal{A}$.

The standardized mean $\hat{\mu}$ is asymptotically efficient under model $\mathcal{A}$ and under model C, but the crude average $\tilde{\mu}$ is asymptotically efficient under model B. These efficiency results are best understood by examining the likelihood

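$$L_n(f_{A,Y,L}) = L_{1,n}(f_L, f_{Y \mid A,L})\, L_{2,n}(f_{A \mid L}), \tag{5}$$

where $L_{1,n}(f_L, f_{Y \mid A,L}) = \prod_{i=1}^{n} f_L(L_i) f_{Y \mid A,L}(Y_i \mid A_i, L_i)$ and $L_{2,n}(f_{A \mid L}) = \prod_{i=1}^{n} f_{A \mid L}(A_i \mid L_i)$.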

Model $\mathcal{A}$ imposes restrictions on the law of $(Y_1, L, A)$ but not on the distribution $f_{A,Y,L}$ of the observed data $(Y, L, A)$ (Gill et al., 1997) and hence is a nonparametric model for the observables. Because the standardized mean $\hat{\mu}$ is the plug-in estimator of $\mu = E\{E(Y \mid A = 1, L)\}$, it is the maximum likelihood estimator of $\mu$ under the nonparametric model $\mathcal{A}$.

Model C restricts the law $f_{A \mid L}$ entering the second term on the right-hand side of (5) since Assumption 5 postulates that $f_{A \mid L} = f_A$. Because by (2), $\mu$ depends only on the components of the law entering the $L_{1,n}$-part of the likelihood (5), the maximum likelihood estimators of $\mu$ under models $\mathcal{A}$ and C must agree. Thus, the standardized mean $\hat{\mu}$ is the maximum likelihood estimator of $\mu$ under model C and consequently asymptotically efficient, i.e. $\mathrm{avar}(\hat{\mu})$ is equal to the semiparametric variance bound for $\mu$ under the model. Here and hereafter, $\mathrm{avar}(\cdot)$ denotes the variance of the limiting distribution.

Model B imposes the restriction $f_{Y \mid A=1,L} = f_{Y \mid A=1}$ and hence it restricts the law $f_{Y \mid A,L}$ in $L_{1,n}$. The standardized mean $\hat{\mu}$ is not the maximum likelihood estimator under model B because it does not exploit this restriction. In fact, under model B, the crude average $\tilde{\mu}$ is asymptotically efficient. Furthermore, $\tilde{\mu}$ is asymptotically strictly more efficient than $\hat{\mu}$ unless Assumption 5 also holds. Proofs of these results can be found in the online Supplementary Material. We are now ready to explain the contradiction.
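
The ranking under model B can be made concrete with a short variance calculation, writing $\hat{\mu}$ for the standardized mean and $\tilde{\mu}$ for the crude average. The sketch below is not the proof given in the Supplementary Material. It assumes the standard large-sample variance formulas for a sample mean among the exposed and for a nonparametrically standardized mean with discrete $L$, and it uses $\sigma^2$ for $\mathrm{var}(Y \mid A = 1)$ and $\pi(L)$ for $\mathrm{pr}(A = 1 \mid L)$, with $\sigma^2$ a symbol introduced here only for illustration. Under model B, $\mathrm{var}(Y \mid A = 1, L) = \sigma^2$ and $E(Y \mid A = 1, L)$ does not depend on $L$.

```latex
\begin{align*}
\mathrm{avar}(\tilde{\mu}) &= \frac{\mathrm{var}(Y \mid A = 1)}{\mathrm{pr}(A = 1)}
   = \frac{\sigma^2}{E\{\pi(L)\}}, \\
\mathrm{avar}(\hat{\mu}) &= E\left\{\frac{\mathrm{var}(Y \mid A = 1, L)}{\pi(L)}\right\}
   + \mathrm{var}\{E(Y \mid A = 1, L)\}
   = \sigma^2\, E\left\{\frac{1}{\pi(L)}\right\}, \\
E\left\{\frac{1}{\pi(L)}\right\} &\geq \frac{1}{E\{\pi(L)\}}
   \quad \text{(Jensen's inequality)}.
\end{align*}
```

Equality in the last line holds only when $\pi(L)$ is constant with probability one, which for binary $A$ is Assumption 5. This agrees with the statement that the crude average is strictly more efficient under model B unless Assumption 5 also holds.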

Given an arbitrary function $d(l)$ and any $\pi(l)$, let $\hat{\mu}_d(\pi)$ denote the solution to

$$E_n\left[d(L)\frac{A}{\pi(L)}(Y - \mu)\right] = 0. \tag{6}$$

The following lemma, a corollary of the theory laid out in Robins et al. (1994), states the precise result of the theory of inverse probability weighted estimation that the gain in efficiency of the crude average $\tilde{\mu}$ over the standardized mean $\hat{\mu}$ appears to contradict.

Lemma 1. Given one of the models $\mathcal{A}$, B or C for the observables, let $\hat{\pi}(l)$ and $\tilde{\pi}(l)$ be the maximum likelihood estimators of $f_{A \mid L}(1 \mid l)$ under two nested models for $f_{A \mid L}$ that are correctly specified under the assumptions of the given model. Then $\sqrt{n}\{\hat{\mu}_d(\hat{\pi}) - \mu\}$ and $\sqrt{n}\{\hat{\mu}_d(\tilde{\pi}) - \mu\}$ converge to mean zero normal distributions. If $\hat{\pi}(l)$ is the estimator of $f_{A \mid L}(1 \mid l)$ under the larger model, then

$$\mathrm{avar}\{\hat{\mu}_d(\hat{\pi})\} \leq \mathrm{avar}\{\hat{\mu}_d(\tilde{\pi})\}.$$

Observe that because $\hat{\mu}$ solves (3) and $\tilde{\mu}$ solves (4), we can write $\hat{\mu} = \hat{\mu}_{d_1}(\hat{\pi})$ and $\tilde{\mu} = \hat{\mu}_{d_1}(\tilde{\pi})$ with $d_1(l) = 1$, $\hat{\pi}(l) = E_n(A \mid L = l)$ and $\tilde{\pi}(l) = E_n(A)$. The improved efficiency of the crude average $\tilde{\mu}$ over the standardized mean $\hat{\mu}$ under model B, i.e. the fact that generally $\mathrm{avar}(\tilde{\mu})$ is strictly smaller than $\mathrm{avar}(\hat{\mu})$, does not contradict Lemma 1 because $\tilde{\pi}(l)$ does not meet its premise. Specifically, Lemma 1 requires that $\tilde{\pi}(l)$ be computed under a model for $f_{A \mid L}$ that is correctly specified under the given model, here model B. However, $\tilde{\pi}(l) = E_n(A)$ is the fitted value under a model for $f_{A \mid L}$ that assumes that $A$ and $L$ are independent, an assumption not made by model B.
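
A small Monte Carlo experiment can illustrate both sides of this argument. The sketch below compares the sampling variances of the standardized mean and the crude average under two hypothetical designs with binary L: one satisfying model B (A depends on L, Y does not) and one satisfying model C (A is independent of L, Y depends on L). The designs, sample size, number of replicates and seed are all assumptions chosen for illustration.

```python
import random
from statistics import variance


def draw(rng, n, model):
    """Draw (y, a, l) under a hypothetical design.
    "B": pr(A=1 | L) is 0.2 or 0.8 and Y ~ N(1, 1) regardless of L.
    "C": pr(A=1 | L) = 0.5 and Y ~ N(4L, 1)."""
    y, a, l = [], [], []
    for _ in range(n):
        li = 1 if rng.random() < 0.5 else 0
        if model == "B":
            p, mean = (0.8 if li else 0.2), 1.0
        else:
            p, mean = 0.5, 4.0 * li
        a.append(1 if rng.random() < p else 0)
        l.append(li)
        y.append(mean + rng.gauss(0.0, 1.0))
    return y, a, l


def estimators(y, a, l):
    """Return (standardized mean, crude average) at exposure level A = 1.
    Assumes each stratum contains at least one exposed subject."""
    n = len(y)
    crude = sum(yi for yi, ai in zip(y, a) if ai) / sum(a)
    std = 0.0
    for s in (0, 1):
        n_s = sum(1 for li in l if li == s)
        exposed = [yi for yi, ai, li in zip(y, a, l) if ai and li == s]
        std += (n_s / n) * (sum(exposed) / len(exposed))
    return std, crude


def mc_variances(model, n=200, reps=1000, seed=2010):
    """Monte Carlo variances of the two estimators over repeated samples."""
    rng = random.Random(seed)
    pairs = [estimators(*draw(rng, n, model)) for _ in range(reps)]
    return variance([p[0] for p in pairs]), variance([p[1] for p in pairs])


var_std_B, var_crude_B = mc_variances("B")
var_std_C, var_crude_C = mc_variances("C")
```

Under the model C design both propensity score models are correct, and the standardized mean should show the smaller variance, in line with Lemma 1. Under the model B design the independence model for the propensity score is wrong, the premise of Lemma 1 fails, and the crude average should show the smaller variance.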

The efficiency gains conferred by the crude average $\tilde{\mu}$ over the standardized mean $\hat{\mu}$ under model B can be deduced from the general theory of efficient inverse probability estimation in semiparametric models for missing data (Robins et al., 1994, Proposition 8.1). In the Supplementary Material we apply this theory to show that: (a) $\tilde{\mu}$ is asymptotically equivalent to $\hat{\mu}_{d_2}(\hat{\pi})$ with $d_2(l) = E(A \mid L = l)$ and (b) $\hat{\mu}_{d_2}(\hat{\pi})$, and therefore $\tilde{\mu}$, is semiparametric efficient under B.
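
Claim (a) can be motivated by a short calculation, writing $\tilde{\mu}$ for the crude average. The sketch below is heuristic and is not the argument of the Supplementary Material. It assumes that the estimating equation defining $\hat{\mu}_d(\pi)$ has the form $E_n[d(L)\{A/\pi(L)\}(Y - \mu)] = 0$ and, as a first step, evaluates it at the true propensity score $\pi(l) = E(A \mid L = l)$ with $d_2(l) = \pi(l)$.

```latex
\begin{align*}
0 &= E_n\left[d_2(L)\,\frac{A}{\pi(L)}\,(Y - \mu)\right]
   = E_n\left[\pi(L)\,\frac{A}{\pi(L)}\,(Y - \mu)\right]
   = E_n\{A(Y - \mu)\}, \\
\mu &= \frac{E_n(AY)}{E_n(A)} = E_n(Y \mid A = 1) = \tilde{\mu}.
\end{align*}
```

When $\pi(L)$ in the denominator is replaced by the estimator $\hat{\pi}(L)$, the factor $\pi(L)/\hat{\pi}(L)$ converges to one in each stratum, which suggests the asymptotic equivalence stated in (a).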

In conclusion, the fallacy arises because the claim about efficiency gains assumes an explicit model for the law of $(A, L, Y)$ and it requires that both propensity score models be correct under the given model. However, $E_n(A)$ is the efficient propensity score estimator under a model not implied by model B, so the efficiency claim does not apply.

4. Concluding remarks

Our analysis extends to inference in marginal structural mean models for the effect of a possibly polytomous exposure $A$ given a, possibly strict, subset $Z$ of the confounders $L$. These models assume that $E(Y_a \mid Z) = m(a, Z; \beta)$, where $m(\cdot)$ is known and $\beta$ is unknown. Estimators of $\beta$ are obtained by solving (6) with $A/\pi(l)$ replaced by an estimator of $1/f_{A \mid L}(A \mid L)$, $\mu$ replaced by $m(A, Z; \beta)$ and with $d(l)$ of the dimension of $\beta$. When Assumption 5 holds, using $1/\hat{f}_A(A)$, where $\hat{f}_A(a) = E_n\{I_a(A)\}$ and $I_a(A)$ is the indicator that $A = a$, yields consistent and asymptotically normal estimators of $\beta$ that are generally more efficient than those obtained using $1/\hat{f}_{A \mid L}(A \mid L)$, where $\hat{f}_{A \mid L}(a \mid l) = E_n\{I_a(A) \mid L = l\}$. Once again, this raises an apparent contradiction with the theory of inverse probability weighted estimation, which can be explained as in § 3.


Acknowledgement

Andrea Rotnitzky was funded by a grant from the National Institutes of Health, U.S.A. The authors wish to thank two referees and the associate editors for helpful comments.


Appendix

For any given law $f(l, a, y)$, define the new law $f^*(l, a, y) = f(l) I_1(a) f(y \mid a, l)$. Then $E\{E(Y \mid A = 1, L)\} = E^*(Y)$, where $E(\cdot)$ and $E^*(\cdot)$ denote expectations under $f$ and $f^*$ respectively. But $f^*(l, a, y)/f(l, a, y) = I_1(a)/f(a \mid l)$, so $E^*(Y) = E\{I_1(A) Y / f(A \mid L)\}$, thus proving that $E\{E(Y \mid A = 1, L)\} = E\{AY/f(1 \mid L)\}$ for any $f$ and binary $A$. That $\hat{\mu}$ solving (3) also admits the representation (1) follows by applying this result when $f$ is the empirical law.

Supplementary material

Supplementary Material is available at Biometrika online.


References

  • Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163:1149–56.
  • Breslow N. Design and analysis of case control studies. Annu Rev Public Health. 1982;3:29–54.
  • Cochran WG. The effectiveness of adjusting by subclassification in removing bias in observational studies. Biometrics. 1968;24:295–313.
  • De Stavola BL, Cox DR. On the consequences of overstratification. Biometrika. 2008;95:992–6.
  • Gail M. The effect of pooling across strata in perfectly balanced studies. Biometrics. 1988;44:151–62.
  • Gill R, van der Laan M, Robins JM. Coarsening at random: characterizations, conjectures and counterexamples. In: Lin D, Fleming T, editors. Proc 1st Seattle Symp Biostatist. New York: Springer; 1997. pp. 255–94.
  • Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Nat Cancer Inst. 1959;22:719–48.
  • Neuhäuser M, Becher H. Improved odds ratio estimation by post-hoc stratification of case-control data. Statist Med. 1997;16:993–1004.
  • Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.
  • Robinson L, Jewell NP. Some surprising results about covariate adjustment in logistic regression models. Int Statist Rev. 1991;59:227–40.

Articles from Biometrika are provided here courtesy of Oxford University Press