Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Med. Author manuscript; available in PMC 2012 February 20.
Published in final edited form as:
PMCID: PMC3059519

Double-Robust Estimation of an Exposure-Outcome Odds Ratio Adjusting for Confounding in Cohort and Case-control Studies


Modern epidemiologic studies often aim to evaluate the causal effect of a point exposure on the risk of a disease from cohort or case-control observational data. Because confounding bias is of serious concern in such non-experimental studies, investigators routinely adjust for a large number of potential confounders in a logistic regression analysis of the effect of exposure on disease outcome. Unfortunately, when confounders are not correctly modeled, standard logistic regression is likely biased in its estimate of the effect of exposure, potentially leading to erroneous conclusions. We partially resolve this serious limitation of standard logistic regression analysis with a new iterative approach that we call ProRetroSpective estimation, which carefully combines standard logistic regression with a logistic regression analysis in which exposure is the dependent variable and the outcome and confounders are the independent variables. As a result, we obtain a correct estimate of the exposure-outcome odds ratio, if either the standard logistic regression of the outcome given exposure and confounding factors is correct, or the regression model of exposure given the outcome and confounding factors is correct but not necessarily both, that is, it is double-robust. In fact, it also has certain advantadgeous efficiency properties. The approach is general in that it applies to both cohort and case-control studies whether the design of the study is matched or unmatched on a subset of covariates. Finally, an application illustrates the methods using data from the National Cancer Institute's Black/White Cancer Survival Study.

1 Introduction

A common aim of epidemiologic studies is to evaluate the causal effect of a dichotomous point exposure A on the risk of a disease outcome Y using observational data. In non-experimental studies, systematic bias due to confounding is a serious concern. For this reason, investigators routinely collect and adjust for a large number of confounding factors in data analyses. The current paper concerns a setting in which Y is dichotomous, and the observed data is either from a cohort study or from a case-control study, either one possibly matched on covariates L1, …, Ls, and in which additional confounders Ls+1, …, Lp are possibly observed.

Logistic regression has a long tradition in epidemiologic research as the standard analytic tool for modeling the effect of A on Y within levels of L = (L1, …, Lp), whether data is collected in a cohort study or in a case control study [1, 2]. In an unmatched cohort study, where the variables (Y, A, L) are recorded on each of n subjects drawn at random from a study population of interest, the analyst often fits a simple logistic regression model

logit{Pr(Y=1[mid ]A,L;ψ,η)}=η1+η2TL+ψA

for the conditional probability Pr(Y = 1[mid ]A, L) of the occurrence of the outcome given exposure level A and confounders L in the study population, where logit(x) = log{x/(l − x)} is the logit link function, and η = (η1, η2). Model (1) assumes homogeneity of the log-odds ratio function

or(L)=log{Pr(Y=1[mid ]A=1,L)×Pr(Y=0[mid ]A=0,L)Pr(Y=1[mid ]A=0,L)×Pr(Y=0[mid ]A=1,L)}

i.e. (1) assumes that for any individual with covariates L, the log-odds ratio or (L) is equal to a constant ψ0, the same for all values of L, so that ψ0 is the true value of ψ in model (1). In this latter model, the primary scientific interest is usually in ψ0 since this quantifies, on the log-odds ratio scale, the dependence of the outcome Y on the exposure A within levels of the confounders L in the study population.

Two assumptions are typically required for the log-odds ratio or (L) to have a causal interpretation. Such assumptions are best articulated within a potential outcomes framework. Specifically, let Y1 be the subject's potential outcome if, perhaps contrary to fact, he is exposed and Y0 be the subject's potential outcome if he is unexposed. The first assumption, known as the consistency assumption, states that a person's observed outcome Y is equal to his potential outcome Ya for the level of exposure a = A he actually experienced. The second is the assumption of no unmeasured confounding which states that within levels of the confounders, the observed exposure A is randomly drawn, and thus, is necessarily independent of the potential outcomes Y0 and Y1 given the confounders. Under these two assumptions [3] or (L) has a causal interpretation because it is equal to

or*(L)=log{Pr(Y1=1[mid ]L)×Pr(Y0=0[mid ]L)Pr(Y1=0[mid ]L)×Pr(Y0=1[mid ]L)}

The quantity inside the logarithm, known as the causal odds ratio, quantifies on the odds scale, the causal effect of exposure vs no exposure in subjects with value of the confounder equal to L. Specifically, it is equal to the ratio of the odds of a positive value of the outcome for subjects with value of the confounder equal to L in two hypothetical worlds, the first in which they are all exposed, and the second in which they are all unexposed.

In the event that unmeasured confounding is present, or (L) is not equal to or* (L), and generally does not have a causal interpretation. However, it remains a well defined summary measure of the partial association between exposure and outcome after adjustment for the measured covariates L and it often constitutes the primary target of inference in epidemiological analyses. Our discussion in this paper concerns inference about the parameter or (L) regardless of whether or not it has a meaningful causal interpretation.

Often the analysis is based on more flexible logistic regression models than the linear logistic regression model (1), that relax the assumption of homogeneity of the odds-ratio across levels of the confounder. In such models the constant ψ in equation (1) is replaced by some non-constant function or (L; ψ), e. g. the model or (L; ψ) = ψ1 + ψ2L1, where ψ = (ψ1, ψ2) and L1 is the first component of L, allows for effect modification by L1 of the effect of A on Y on the log-odds ratio scale, thus yielding the equation

logit{Pr(Y=1[mid ]A,L;ψ,η)}=η1+η2TL+or(L;ψ)A=η1+η2TL+ψ1A+ψ2L1A

In models (1) and (3), the regression function η1+η2TL is of interest only to the extent that it allows to estimate the true value ψ0 of the parameter ψ in the model for the target odds ratio function or (L) by standard logistic regression as implemented in virtually all existing statistical software packages. This regression function constitutes a model for the baseline log-odds function

oddsY(L)=log{Pr(Y=1[mid ]A=0,L)Pr(Y=0[mid ]A=0,L)}

in that its presence in equations (1) and (3) implies the assumption that oddsY (L) is equal to η1+η2TL for some unknown value (η1, η2). Such model rules out interactions between the components of L. However, just like for or (L), more flexible models for oddsY (L) can be contemplated by replacing in equations (1) and (3) the function η1+η2TL with a more elaborate function oddsY (L; η), for example one that depends on cross-products and powers of the components of the vector L.

To allow for generality in our exposition, henceforth, we focus our discussion on models that admit the general form

logit{Pr(Y=1[mid ]A,L;ψ,η)}=oddsY(L;η)+or  (L;ψ)A

Note that the simple model (4) is recovered by setting or (L; ψ) = ψ and oddsY η1+η2TL.

When the target is estimation of ψ0, an analysis based on model (1) or, more generally, model (4), has an important limitation. Specifically, if the assumption that oddsY (L) is linear in L is false, the logistic regression model (1) is mis-specified and the resulting maximum likelihood estimator of ψ0 is likely biased. Likewise, the maximum likelihood estimator of ψ0 is generally biased if the model oddsY (L; η) for oddsY (L) is mis-specified. This lack of robustness was recently investigated and confirmed in simulation studies [4] for the logistic linear model (1).

Recently, Tchetgen Tchetgen et al [5] showed that for estimating the parameters ψ of a model or (L; ψ) for the log-odds ratio function or (L) with data from an unmatched prospective cohort design there exist options, other than maximum likelihood under model (4), that confer partial protection against mis-specification of the model oddsY (L; η). Under their methodology, the analyst not only places a model oddsY (L; η) for the baseline log-odds function oddsY (L) for the outcome but also a model oddsA (L; α) for the population baseline log-odds function for exposure

oddsA(L)=logPr(A=1[mid ]Y=0,L)Pr(A=0[mid ]Y=0,L)

In this formula, Pr(A = 1[mid ]Y, L) denotes the proportion of subjects in the study population with exposure among those with outcome Y and confounders L. Thus, under the Tchetgen Tchetgen et al [5] methodology the analyst needs to model also how the proportion of exposed subjects in the study population changes with the values of the confounders L among subjects that did not experience the outcome Y. However, under this methodology, the analyst need not be right about both models. To obtain an unbiased (more precisely, consistent) estimator of ψ0, he just needs to be right about one of the models oddsY (L; η) or oddsA (L; α), but need not know which of the two is right. Because one rarely knows which if any of the two models holds, these methods are preferred over maximum likelihood fitting the logistic regression model (4), as they provide the analyst with two separate chances to obtain a consistent estimator of ψ0.

Stated more precisely, the methods of Tchetgen Tchetgen et al [5] yield a specific class of estimators of ψ0 that are consistent if one, but not necessarily both of the following is true:

  1. the model oddsA (L; α) is correctly specified even if the model oddsY (L; η) is incorrectly specified,
  2. the model oddsY (L; η) is correctly specified even if the model oddsA (L; α) is incorrectly specified

Estimators that, like those proposed by Tchetgen Tchetgen et al [5], are consistent provided one correctly specifies one of two nuisance models (in our problem, the nuisance models are oddsY (L; η) or oddsA (L; α)), without knowing which of the two is right, are called double-robust in the union model defined by the union of both nuisance models. The nuisance models are usually referred to as “working models”.

Tchetgen Tchetgen et. al [5] further characterized an optimal estimator in their specific class of double-robust estimators, which when both nuisance models oddsA (L; α) and oddsY (L; η) happen to be right, is highly efficient in that no other double-robust estimator of ψ0 in the same union model, can have smaller large sample variance. Unfortunately, the computation of their estimators, and most notably that of the efficient one, requires solving a non-standard estimating equation, a task which cannot be easily performed without special software, thus seriously impeding its routine use.

The goal of this paper is three-fold. The first is to resolve the aforementioned computational obstacles of the methods of Tchetgen Tchetgen et. al. [5] by giving a novel and simple implementation of their optimal double-robust estimator via an iterative procedure which solely involves the use of standard logistic regression software. We provide a new SAS [6] macro called to carry out this algorithm.

The second goal relates to the analysis of cohort studies matched on baseline covariates and of case-control studies, matched or unmatched on baseline covariates. Under any of these designs or (L) remains an estimable target, i.e. it is identified from the sampling distribution of the data. Our second goal is to show that under any of these designs there exist estimators of the parameters ψ of models or (L; ψ) for or (L) that are double-robust in union models that involve two specific nuisance components of the distribution of the sampled data. Furthermore, we show that these estimators can be implemented with the macro

The third goal of the paper is to illustrate the methods with a data example. For this purpose, we use the ProRetroSpective algorithm to obtain double-robust estimates of racial differences in tumor grade among women with endometrial cancer in a re-analysis of data from the National Cancer Institute's Black/White Cancer Survival Study.

In section 2 we describe the algorithm that implements the double-robust estimator of ψ0 under unmatched cohort designs. In section 3 we show that the double-robust methodology extends to matched cohort studies and to case-control studies, matched or unmatched on covariates. In section 4, we give a data example. We finish in section 5 with some concluding remarks.

2 The ProRetroSpective algorithm

The algorithm to implement the double-robust estimator of Tchetgen Tchetgen et. al. [5], which for ease of reference we call ProRetroSpective, consists of three stages. Stage 1 involves the fitting of a logistic regression (4) where or (L; ψ) is the model for the log-odds ratio of substantive interest and oddsY (L; η) is the working model for the baseline log-odds for the outcome oddsY (L). Stage 2 involves fitting another logistic regression, this being a model for the probability of exposure given the outcome and confounders,

logit{Pr(A=1[mid ]Y,L;ψ,α)}=oddsA(L;α)+or(L;ψ)Y

in which oddsA (L; α) is the working model for the baseline log-odds for exposure oddsA (L); for example, in the case where or (L; ψ) = ψ and oddsA(L;α)=α1+α2L, model (5) reduces to the model

logit{Pr(A=1[mid ]Y,L;ψ,α)}=α1+α2L+ψY.

where α=(α1,α2). Stage 3 involves a recursion, each step of the recursion involves the fit of a new logistic regression with outcome Y, a vector of specially constructed covariates and an offset. In this model both the covariate vector and the offset incorporate the estimates obtained in Stages 1 and 2 and at the previous step of the recursion, and are specially constructed to yield upon convergence of the recursion, the double robust estimator of Tchetgen Tchetgen et al [5] throughout denoted by [psi]dr. We have implemented this procedure for the simple case where or (L; ψ) = ψ, in the SAS macro given in the appendix.

Interestingly, stage 2 of the algorithm involves the fit of model (5) for the probability of exposure given the outcome and covariates. When oddsA (L; α) is a correct model for the baseline odds function for exposure oddsA (L), model (5) is a correctly specified model for this conditional probability. To see this note that, by the well known invariance property of the odds ratio function, or (L) defined in equation (2) can also be written as

or(L)=log{Pr(A=1[mid ]Y=1,L)×Pr(A=0[mid ]Y=0,L)Pr(A=1[mid ]Y=0,L)×Pr(A=0[mid ]Y=1,L)},

Then, we can simply revert the roles of A and Y and argue as in the preceding section to conclude that model (5) is correctly specified if oddsA (L; α) and or (L; ψ) are correctly specified. Notably, a consequence of this fact is that the maximum likelihood estimate of ψ0 in the retrospective model (5) is, under standard regularity conditions, consistent and approximately normal, provided that the working model oddsA (L; α) is correct. However, just as was the case for the fit of model (4), bias in the estimation of ψ0 is bound to result from a mis-specified working model for oddsA (L; α).

Remarkably, the logistic regressions (4) and (5) offer two genuine opportunities to estimate ψ0 correctly because oddsA (L; α) and oddsY (L; η) can always be specified coherently, i.e. the specification of a working model oddsA (L) does not rule out the validity of any working model for oddsY (L); and vice-versa. The reason for the coherence of any models for oddsA (L) and oddsY (L) is that the baseline probability functions Pr(Y = 1[mid ]A = 0, L) = (1 + exp[−oddsY (L)}])−1 and Pr(A = 1[mid ]Y = 0, L) = (1 + exp[−oddsA (L)}])−1 are variation independent [7]. Unfortunately, an analyst who separately fits the logistic regressions (4) and (5) for a given data set, will not know which if any of the two fits to report, since he will not know which, if any, of models oddsA (L; α) and oddsY (L; η) is correct, particularly when the corresponding estimates of ψ0 differ beyond sampling variability. The double-robust estimator [psi]dr implemented in the algorithm described below, resolves this difficulty in that it yields an estimator that in large samples is unbiased (more precisely, consistent) provided one of models (4) or (5) is correctly specified but not necessarily both hold.

The algorithm

  1. Obtain the maximum likelihood estimator, say ([psi]p, [eta w/ hat]p), of (ψ, η) by regressing Y on A and L using the regression model (4), then proceed to the next step.
  2. Obtain the maximum likelihood estimator, say ([psi]r, [alpha]r), of (ψ, α) by regressing A on Y and L using the regression model (5), then proceed to the next step.
  3. Set ψ^dr(0)=0, and j = 0 and carry out the following recursive algorithm:
    1. Set j = j + 1
    2. Compute, for each subject i in the sample
      off seti(j)=oddsY(Li;η^p)+or(Li;ψ^dr(j1))×Ai
      Xi(j)=[partial differential]or(Li;ψ)[partial differential]ψ[mid ]ψ=ψ^dr(j1)×(Aidi(j))
      where for every value of (ψ, η, α),
    3. Obtain the maximum likelihood estimator, say [nu with circumflex](j), of the regression coefficients ν in a logistic regression model with outcome Yi, covariate vectors Xi(j), constructed offsets off seti(j), i = 1, …, n, and no intercept.
    4. If the absolute value of each of the components of [nu with circumflex](j) falls below a user-specified threshold, say 0.0001 or smaller, then set ψ^dr=ψ^dr(j) and exit the algorithm. Otherwise, set ψ^dr(j)=ψ^dr(j1)+ν^(j) and go to step (a).

An informal argument as to why the algorithm returns a double-robust estimator of ψ is as follows. Consider fitting the logistic regression model (4) in two stages. In stage 1, one computes the estimators ([eta w/ hat]p, [psi]p) from the usual fit of the logistic regression model, one keeps [eta w/ hat]p but discards [psi]p. In stage 2, one pretends that in model (4) η = [eta w/ hat]p is fixed and known and one estimates ψ with, say [psi], from a weighted logistic regression fit with certain weights ωi (ψ) that depend on the covariates (Li, Ai) but not on the outcome Yi. That is, in stage 2, [psi] solves the weighted logistic regression equation

i=1nωi(ψ)[partial differential]or(Li;ψ)[partial differential]ψ[Yiexpit{oddsY(Li;η^p)+or(Li;ψ)Ai}]=0

where the function expit(x)={1 + exp − (x)}−1. The resulting estimator of ([eta w/ hat]p, [psi]) is consistent for (η, ψ) if model (4) is correctly specified. Specifically, [eta w/ hat]p is consistent for η because it is the usual logistic regression estimator η. On the other hand, if ωi (ψ) were exactly Ai, then [psi] would indeed coincide with the usual logistic regression [psi]p and hence would be consistent for ψ. The fact that ωi (ψ) is not Ai but depends on the covariates cannot invalidate the consistency of [psi]; just as in any regression problem, weighting by functions of the covariates affects the efficiency but not the consistency of the regression estimators.

The preceding argument is the essence for the consistency of the estimator [psi]dr returned by the ProRetroSpective algorithm. Specifically, the estimator [nu with circumflex](j) of step 3.c is the value of ν that solves the logistic regression score equation

i=1nXi(j){Yiexpit[off seti(j)+νXi(j)]}0

The algorithm stops precisely at an iteration, say the Jth one, at which [nu with circumflex](J) ≈ 0. Then when j = J, replacing off seti(J) and Xi(J) by their values and ν by 0 in equation (8), we observe that ψ^dr(J), and hence the value [psi]dr returned by the algorithm, solves precisely equation (7) with ωi(ψ) = Aid(Li; ψ, [eta w/ hat]p, [alpha]r). That is, [psi]dr is just like [psi] in the preceding two-stage procedure for a specific choice of weights ωi(ψ) and consequently, it is consistent for ψ when model oddsY (Li; η) is correct (since correct specification of this model ensures correct specification of model (4) given that model or (L; ψ) is assumed to be correct).

Next, we argue that [psi]dr converges in probability to the true value of ψ even if model oddsY (Li; η) is incorrect, provided that the model oddsA (Li; α) used in the construction of d(Li; ψ, η, α) is correctly specified. To do so, consider the following thought experiment. Suppose that in implementing the ProRetroSpective algorithm we reversed the roles played by A and Y, and therefore we also reversed the roles played by their associated log-odds models oddsA (Li; α) and oddsY (Li; η) in stages 1,2 and 3 of the procedure. Then, clearly, by the invariance property of the odds ratio function, and by arguing as above, at convergence of the algorithm, we would obtain a “reversed” estimator ψ^drrev which solves the following reversed estimating equation:

i=1nωirev(ψ)[partial differential]or(Li;ψ)[partial differential]ψ[Aiexpit{oddsA(Li;α^r)+or(Li;ψ)Ai}]=0

where ωirev(ψ)=Yidrev(Ai;ψ,α^r,η^p) and drev(Li; ψ, α, η) is defined as d(ψ, η, α) with the roles of oddsY (Li; η) and oddsA (Li; α) reversed. Then, our discussion above leads to the conclusion that ψ^drrev converges to the true ψ provided that oddsA (Li; α) is correctly specified, irrespective of the validity of model oddsY (Li; η).

Our proof of double robustness now hinges on the remarkable fact that mainly due to the special form of d(Li; ψ, η, α), and therefore of drev(Li; ψ, α, η), the estimators [psi]dr and ψ^drrev are exactly equal. The double robustness property of [psi]dr immediately follows when ψ^dr=ψ^drrev since [psi]dr converges in probability to the true value of ψ when model oddsY (L; η) is correct regardless of the validity of model oddsA (L; α) and ψ^drrev converges in probability to the true value of ψ when model oddsA (L; α) is correct regardless of the validity of model oddsY (L; η).

That [psi]dr and ψ^drrev are equal is a consequence of the following two facts. First, a close examination of equations (7) and (9) reveals that ultimately [psi]dr solves i=1nUi(ψ;η^p,α^r)=0 and ψ^drrev solves i=1nUirev(ψ;η^p,α^r)=0 where

Ui(ψ;η,α)={Aid(Li;ψ,η,α)}[partial differential]or(Li;ψ)[partial differential]ψ[Yiexpit{oddsY(Li;η)+or(Li;ψ)Ai}] and Uirev(ψ;η,α) is its reverse analog, i.e.

Uirev(ψ;η,α)={Yidrev(Li;ψ,α,η)}[partial differential]or(Li;ψ)[partial differential]ψ[Aiexpit{oddsA(Li;α)+or(Li;ψ)Yi}]

Second, as shown in the appendix, our special choice of d(Li; ψ, η, α) ensures that Ui(ψ; η, α) and Uirev(ψ;η,α) are one and the same; i.e. Ui(ψ;η,α)=Uirev(ψ;η,α). In fact, Ui(ψ; η, α), and consequently Uirev(ψ;η,α), is the efficient score for ψ in the model that only specifies the parametric form for or (L) and, in light of the results of Robins and Rotnitzky [8], also in the union model.

In light of this discussion, we note that interestingly, the analyst will obtain exactly the same answer for her estimate of ψ (up to numerical precision), whether she implements the ProRetroSpective algorithm as given in its original form, or if she chooses the reversed version of the algorithm, where the roles of A and Y are reversed.

The estimator [psi]dr returned by the preceding algorithm is approximately normal [5], therefore inference on ψ0 can be based on Wald-type confidence intervals and statistical tests constructed with the non-parametric bootstrap estimator of variance. The bootstrap is needed to account for the additional uncertainty due to estimation of the parameters α and η.

3 Double-robust estimation for other epidemiological study designs

In this section we argue that the double-robust methodology implemented with the ProRetroSpective algorithm is useful for analyzing data from three other common epidemiological designs, namely matched cohort study designs, and case-control studies, matched or unmatched on a subset of the confounders. As summarized by Rothman and Greenland [1], matching generally refers to the selection of a reference series-unexposed subjects in a cohort study or controls in a case-control study- that is identical, or nearly so, to the index series with respect to the distribution of one or more potentially confounding factors. In the following, we make the crucial assumption that the matching covariates do not grow in dimension with the sample size.

Under each of these designs, the sampling distribution, say ps(A, L, Y), of the exposure, confounders and disease is distorted. That is, the sampled data does not follow the distribution, say p(A, L, Y), in the study population. However, it is well known that under any of these three designs, the log-odds ratio function calculated under the sampling distribution, i.e.

ors(L)=log{Prs(Y=1[mid ]A=1,L)×Prs(Y=0[mid ]A=0,L)Prs(Y=1[mid ]A=0,L)×Prs(Y=0[mid ]A=1,L)}=log{Prs(A=1[mid ]Y=1,L)×Prs(A=0[mid ]Y=0,L)Prs(A=1[mid ]Y=0,L)×Prs(A=0[mid ]Y=1,L)}

remains equal to the log-odds ratio function or (L) in the population. This implies that, in principle, any of these designs entails the possibility to unbiasedly, more precisely consistently, estimate the parameter ψ0 of a model or (L; ψ) for the population log-odds ratio function or (L).

Because under a matched cohort design the conditional probability of disease given exposure status and confounders in the sample, Prs (Y = 1[mid ]A, L), remains equal to its counterpart Pr (Y = 1[mid ]A, L) in the population, then just as in the unmatched cohort case, one possible strategy to estimate ψ0 is to postulate a model (4) for Pr (Y = 1[mid ]A, L) and to estimate the model parameters by maximum likelihood. Then, arguing exactly as in the unmatched cohort situation, the maximum likelihood estimator of ψ is consistent if model oddsY (L; η) is a correctly specified model for the population baseline log-odds function oddsY (L) but can be severely biased if this model is incorrect.

The opposite occurs for analyzing a case-control design, whether matched on a subset of confounders or not. Specifically, when data is obtained from these designs, the sampling conditional probability of exposure given disease status and confounders, Prs (A = 1[mid ]Y, L), remains equal to its counterpart Pr (A = 1[mid ]Y, L) in the population. Then, one estimation strategy is to postulate a model (5) for Pr (A = 1[mid ]Y, L) and to estimate the model parameters by maximum likelihood. The maximum likelihood estimator of ψ is consistent if model oddsA (L; α) is a correctly specified model for the population baseline log-odds oddsA (L) but can be severely biased if this model is incorrect.

Just as for the analysis of an unmatched cohort study, the lack of robustness of maximum likelihood estimation can be partially remedied by moving away from this methodology and adopting instead a double-robust approach. In fact, the methods developed by Tchetgen Tchetgen et al. [5] and, in particular, the ProRetroSpective algorithm, can be applied to derive a double-robust estimator of ψ0. Specifically, when data arises from any of the three aforementioned designs, the preceding algorithm can be applied step by step, to yield an estimator [psi]dr that is consistent for ψ0 so long as, of the two working models oddsA (L; α) and oddsY (L; η) that appear in the logistic regression models of stages 1 and 2 of the algorithm, at least one is a correctly specified model for the corresponding sampling baseline log-odds functions.

oddsAs(L)=logPrs(A=1[mid ]Y=0,L)Prs(A=0[mid ]Y=0,L)andoddsYs(L)=logPrs(Y=1[mid ]A=0,L)Prs(Y=0[mid ]A=0,L)

When data come from a matched cohort design oddsYs (L) = oddsY (L), then the preceding statement amounts to the assertion that the ProRetroSpective algorithm returns an estimator [psi]dr that is consistent so long as the analyst is able to correctly specify either a model oddsY (L; η) for the population baseline log-odds of disease oddsY (L) or a model oddsA (L; α) for the sampling baseline log-odds of exposure, but not necessarily both.

Likewise, because under a case-control design, whether matched or unmatched, oddsAs (L) = oddsA (L); then the ProRetroSpective algorithm returns an estimator [psi]dr that is consistent so long as the analyst is able to correctly specify either a model oddsY (L; η) for the population baseline log-odds of exposure oddsA (L) or a model oddsY (L; α) for the sampling baseline log-odds of disease, but not necessarily both.

In essence, the double-robustness of [psi]dr is justified because regardless of the sampling design, the theory developed in Tchetgen Tchetgen et al. [5] establishes that estimator [psi]dr returned by the ProRetroSpective algorithm, is a double-robust estimator of the parameter ψ0 of a model or (L; ψ) for the sampling odds ratio function ors (L), in that it is a consistent estimator of this parameter so long as either oddsA (L; α) is a correctly specified model for the baseline sampling log-odds of exposure oddsAs (L) or oddsY (L; η) is a correctly specified model for the baseline sampling log-odds of disease oddsYs (L), but not necessarily both. The reason then why the algorithm can be used to estimate the parameters of a model for the population odds ratio function or (L) under the three designs considered in this section is because, as indicated earlier, under these designs the population and the sampling odds ratio functions are the same.

In general, we also recommend using the nonparametric bootstrap to estimate the covariance of [psi]dr under any of the three aforementioned sampling designs. It is interesting to note that the large sample variance of [psi]dr often simplifies when both oddsA (L; α) and oddsY (L; η) are correct. We illustrate this point with the model from the introduction where or (L; ψ) = ψ and the study design is an unmatched cohort; then the variance of [psi]dr is approximately given by

n1{E{1Pr(A=Y=1[mid ]L)+1Pr(A=0,Y=1[mid ]L)+1Pr(Y=1,A=0[mid ]L)+1Pr(A=Y=0[mid ]L)}1}1

Interestingly, the expression in the preceding display is similar in functional form to the variance of the log-odds ratio in a simple 2 × 2 table, and in fact, when L is absent, reduces exactly to the latter. An explicit expression for Pr(Y = y, A = α[mid ]L) as a function of oddsA (L), oddsY (L) and or (L) is given in Appendix B, which can be used to form an empirical version of (10). However, this empirical variance estimator should not be used to estimate the variance of [psi]dr because it is only consistent if both working models are correct.

4 A data example

In this section, we illustrate the double-robust methods with a re-analysis of data on an unmatched cohort study of 207 white and 81 black randomly sampled postmenopausal women diagnosed with primary cancer of the uterine corpus between 1985 and 1987, from the National Cancer Institute's Black/White Cancer Survival Study (BWCSS). In the original analysis of these data, the primary aim of Hill et al. [9] was to investigate whether black women with endometrial cancer have more advanced disease and less favorable tumor grade than do white women. In that vein, the authors report a moderate but non-statistically significant racial difference in tumor grade after control of important explanatory variables, which they postulate may reflect true biologic variation between blacks and whites and may explain, in part, the observation that blacks with endometrial cancer have a worse prognosis. An extensive description of the design, methodology and analysis of the original study is given in the original manuscript [9], and therefore, is not reproduced here. For our purposes, we consider the estimation of the hypothesized association on the odds ratio scale between race (A = 1 for black vs A = 0 for white) and the presence of poorly differentiated tumors (Y = 1 for poorly differentiated tumors vs Y = 0 for moderately or well differentiated tumors), adjusting for the following explanatory factors L for differences in grade between blacks and whites: age, histologic subtype, use of estrogens and smoking status.

For comparison, we consider the three analytical methods discussed in section 2, which we use to estimate the odds ratio association between race and the risk of poorly differentiated tumors, within levels of L. Briefly, recall that the first approach, which also corresponds to the approach taken by Hill et al. [9], consists of fitting the logistic regression of the indicator of poorly differentiated tumors on race and potential confounders using model (1). Whereas, the second approach fits the logistic regression of the race indicator on the indicator of poorly differentiated tumors and potential confounders using the model given by equation (5), with the choice oddsA(L;α)=α1+α2L, where α=(α1,α2). Finally, we obtain a doubly robust estimator, which combines these two regression models, by implementing the ProRetroSpective algorithm.

To illustrate the double-robustness property, we consider three working models. In the first model M1, we set α2 = 0 (thus assuming L is independent of A given Y = 0) and estimate η by maximum likelihood in the first stage of the algorithm. In the second model M2, we set η2 = 0 (thus assuming L is independent of Y given A = 0) and obtain α by maximum likelihood in the first stage of the algorithm. In the third model M3, both α and η are estimated by maximum likelihood in the first stage of the algorithm. Bootstrap standard errors of all estimators were also obtained (based on 500 data sets sampled with replacement).

In model M1, as we would expect, the crude reverse analysis which does not account for measured confounders, differs from the prospective and double-robust estimators: exp([psi]reverse,M1) = 2.2 (95%CI= 1.13-3.93) for the prospective logistic regression approach, exp([psi]prosp,M1) =1.88 (95%CI=0.87-4.05) for the reverse regression approach and exp([psi]dr,M1) = 1.88 (95%CI=0.87-4.04) for the double-robust approach. The opposite holds in model M2 which gives exp([psi]reverse,M2) = 1.89 (95%CI= 0.87-4.09) for the prospective approach, exp([psi]prosp,M2) =2.2 (95%CI=1.13-3.93) for the reverse approach and exp([psi]dr,M2) = 1.88 (95%CI=0.87-4.04) for the double-robust approach In model M3, the three methods appear to agree on their estimated association between race and tumor grade, yielding estimates [psi]prosp,M3= [psi]prosp,M1 and [psi]reverse,M3 = [psi]reverse,M2 for the prospective and reverse regression respectively and the following double-robust point estimate and Wald -type confidence interval for the odds ratio of interest: exp([psi]dr,M3) = 1.88 (95%CI=0.87-4.04). These three highly concordant estimators generally agree with previously published results of an elevated (though marginally statistically significant) risk of low grade tumors experienced by black women when compared to white women in this study [9]. Nonetheless, these results should be interpreted with caution and should only be taken as an illustrative example of the methods, because the reported (borderline) association cannot necessarily be interpreted as a causal effect without making strong assumptions; for instance, it may not be realistic to assume that L contains all common correlates of A and Y in this data set.

5 Extensions and concluding remarks

The present paper has focused on the implementation of a new estimator of the parameters of a model for the log odds ratio function or (L), in the simple case of a binary exposure. An extension of the proposed ProRetroSpective iterative procedure to allow for categorical and continuous exposure, as well as for polytomous outcome is also of interest and will be addressed elsewhere.

Missing covariate data is a feature common to most epidemiological studies, and it is now widely recognized that the routine use of standard complete-case analysis -which excludes subjects with any missing variable- can yield highly inefficient estimators of ψ0 in logistic regression analysis, and be severely biased when missingness depends on the outcome or correlates of it. In a companion manuscript, we consider the estimation of ψ0 when Y, and a subset Lobs of the confounders L are always observed but A and a subset Lmis of L are missing, not necessarily simultaneously, in some study participants. We show that a straightforward implementation of the approach described here enjoys certain advantageous robustness and efficiency properties in both cohort and case-control studies under the assumption that the data are missing at random. Specifically, we show that in such a setting, there exist estimators of ψ0 that are double-robust in certain union models and indeed they confer more protection against model mis-specification than other recently proposed double-robust estimators. These estimators can be implemented with the macro using only data from subjects with complete observations, i.e. by performing a complete-case ProRetroSpective analysis.

Interestingly, a referee asked whether our algorithm could be framed as a specific application of targeted maximum likelihood (TML) estimation for semi-parametric logistic regression as described in van der Laan and colleagues in an unpublished paper (Readings in Targeted Maximum Likelihood Estimation, 2009, page 621 available at It turns out that, similar to our method, in the absence of model mis-specification, the TML algorithm of Van der Laan also solves the efficient score equations for semi-parametric logistic regression which was first derived by Bickel et al. [10]. However, unlike Bickel et al [10] and van der Laan et al, in this paper, we give a parametrization of the law of (Y, A) given L in terms of models or (L; [psi]), oddsA (L; α) and oddsY (L; η), which when used to evaluate the efficient score, delivers an estimator with a double robust property. We should emphasize that under the standard parametrization which replaces the reverse baseline regression Pr (A = 1[mid ]Y = 0, L; α) (i:e: oddsA (L; α)) with a variation independent model for the propensity score Pr (A = 1[mid ]L), Tchetgen Tchetgen et al [5] formally established that no useful double robust estimator exist except possibly under the null hypothesis of no treatment effect, which is not surprising since under the null, Pr (A = 1[mid ]Y = 0, L) = Pr (A = 1[mid ], L). Thus, unless targeted maximum likelihood is implemented using our proposed parametrization, it will generally not share the double robust property of our estimator. This point which was not recognized by van der Laan et al also highlights the fact that targeted maximum likelihood estimation cannot in itself reveal which parametrization of the observed data likelihood yields a double robust estimator when used to solve the efficient score.


The authors would like to thank Jamie Robins for helpful comments, and Roger Logan for programming assistance.

Sources of financial support: Andrea Rotnitzky was partially funded by NIH grant GM48704

Appendix A: ProRetroSpec SAS Macro

ProRetroSpec Macro
/* ****************************ProRetroSpec Macro*****************
Author :
 Tchetgen Tchetgen Eric
 Depts Epidemiology and Biostatistics
 Harvard University
 email for inquiries: ude.dravrah.hpsh@egtehcte
Input to the macro: data=Name of data set;
 y=name of binary outcome variable;
 a=name of binary exposure variable;
 oddsY = list of variables to include in the outcome regression model;
 oddsA=list of variables to include in the exposure regression model;
 epsilon= stopping rule of the algorithm, default value is 0.0001
Output of the macro:
 prospective=log-odds ratio estimated by a regression of Y on A and L;
 retrospective=log-odds ratio estimated by a regression of A on Y and L;
 ProRetroSpective=doubly robust estimator of the log-odds ratio,
 eps2=stopping value should be less than epsilon*/
%macro ProRetroSpec (data=,y=, a=, oddsY=, oddsA=,epsilon=0.0001, bsamp=200);
ods listing close;
data boot;
 do sample=1 to &bsamp;
  do i=1 to nobs;
   pt=round (ranuni(12)*nobs);
   set &data nobs=nobs point=pt;
%do j=0 %to &bsamp;
 %if &j=0 %then %do;
  data _cero_b; set &data;
%else %do;
 data _cero_b; set boot;
  if sample=&j then output;
proc logistic descending outest=psi_y(keep= &a) covout data =_cero_b;
 model &y= &a &oddsY;
 output out =predy xbeta=ybeta;
proc logistic descending outest=psi_a (keep=&y) data =_cero_b;
 model &a = &y &oddsA;
 output out =preda xbeta=abeta;
data psi_epsilon;
data a_coeff (rename=(&a=a_coeff));
 set psi_y;
 if _n_=1 then output;
data y_coeff (rename=(&y=y_coeff));
 set psi_a;
 if _n_=1 then output;
data data_j; merge y_coeff a_coeff preda predy psi_epsilon;
 retain a_c y_c eps psi 0;
 if_n_=1 then a_c=a_coeff;
 if_n_=1 then y_c=y_coeff;
 if _n_=1 then eps=epsilon;
 if _n_=1 then psi=psi_j;
 drop a_coeff y_coeff epsilon psi_j;
 if &a=1 then ybeta= ybeta-a_c;
 if &y=1 then abeta=abeta-y_c;
 a tilda=&a-num/denum;
%let err=1;
%let finish=10;
%do %while(%sysevalf(&err ge &epsilon));
 proc logistic descending outest=psi_pr(keep= a_tilda) data =data_j;
  model &y= a_tilda/offset=y_offset noint;
 data psi_epsilon(keep=epsilon);set psi_pr;
  if _n_=1 then do;
   epsilon = a_tilda;
 data data_j;merge data_j(drop=eps) psi_epsilon;
  retain eps 0;
  if _n_=1 then eps=epsilon;
  eps2 =abs(eps);
  call symput(‘err’,eps2);
data results; set data_j;
 if_n_=1 then output;
%if &j=0 %then %do;
 data results0; set results;
%if &j = 1 %then %do;
 data trial; set results; run;
%if %eval(&j)> 1 %then %do;
 data trial;set results trial;
  sample = %eval(&j);
data all_results; merge %if &j>1 %then %do; trial %end; results0;retain mean1 mean2 mean3;
if _n_=1 then do;
ods listing;
Proc print data=results0;
 var prospective retrospective ProRetroSpective eps2;
%if %eval(&bsamp)>0 %then %do;
 proc means data=all_results mean; var prospective_variance retrospective_variance ProRetroSpective_variance; run;
Macro call for the Data Example of Section 4:
%ProRetroSpec(data=cero, y=grade2, a=race, oddsY=btype1 btype2 estrogen age smoking, oddsA=btype1 btype2 estrogen age smoking
 ,epsilon=0.0001, bsamp=500);

Appendix B

Explicit expression for Pr(Y = y, A = a[mid ]L) as a function of oddsA (L), oddsY (L) and or (L). The desired expression follows upon replacing Pr(A = a[mid ]Y = 0, L) with [expit {oddsA (L)}]a × [1−expit {oddsA (L)}]1 − aand Pr(Y = a[mid ]A = 0, L) with [expit {oddsY (L)}]y [1 − expit {oddsY (L)}]1−yin the formula

Pr(Y=y,A=a[mid ]L)=Pr(A=a[mid ]Y=0,L)Pr(Y=y[mid ]A=0,L)eayor(L)a=01y=01Pr(A=a[mid ]Y=0,L)Pr(Y=y[mid ]A=0,L)eayor(L)

where expit(x) = {1 + exp (−x)}−1.

Appendix C

Proof that Ui(ψ;η,α)=Uirev(ψ;η,α)

Some algebra gives

d(L;ψ,η,α)=E{Avar(Y[mid ]A,L;ψ,η)[mid ]L;ψ,η,α}E{var(Y[mid ]A,L;ψ,η)[mid ]L;ψ,η,α}

with expectations and variances evaluated under the joint density of (Y, A) given L; say f (Y, A[mid ]L;ψ, η, α), induced by the models oddsY (L; η), oddsA (L; η) and or (L; ψ). Note that

d(Li;ψ,η,α)=E*(A[mid ]L;ψ,η,α)

is a conditional expectation with respect to another conditional law of A given L, namely the law

f*(A[mid ]L;ψ,η,α)=f(A[mid ]L;ψ,η,α)var(Y[mid ]A,L;ψ,η)ca(L;ψ,η,α)

where ca (L; ψ,η, α) = E {var (Y[mid ]A, L;ψ,η) [mid ]L; ψ,η, α}.

Next, observe that for any binary random variable Z and arbitrary vector V, it holds that

ZE(Z[mid ]V)=(1)1Zvar(Z[mid ]L)f(Z[mid ]L)

Furthermore, observe that

Ui(ψ;η,α)={Aid(Li;ψ;η,α)}[partial differential]or(Li;)[partial differential]ψ[Yiexpit{oddsY(Li;η)+or(Li;ψ)Ai}]={AiE*(A[mid ]L;ψ,η,α)}[partial differential]or(Li;ψ)[partial differential]ψ{YiE(Yi[mid ]Ai,Li;η,ψ)}

Then, applying identity (11) to the residuals {AiE* (A[mid ]L; ψ, η, α)} and {YiE (Yi[mid ]Ai; Li; η, ψ)} we obtain that

Ui(ψ;η,α)=[partial differential]or(Li;ψ)[partial differential]ψ(1)Ai+Yivar*(Ai[mid ]Li;ψ,η,α)f*(Ai[mid ]Li;ψ,η,α)var(Yi[mid ]Ai,Li;ψ,η,α)f(Yi[mid ]Ai,Li;ψ,α)={[partial differential]or(Li;ψ)[partial differential]ψ(1)Ai+Yi1f(Yi,Ai[mid ]Li;ψ,η,α)}var*(Ai[mid ]Li;ψ,η,α)ca(L;ψ,η,α)

Likewise, arguing by symmetry, reversing the roles of Yiand Aigives

Uirev(ψ;η,α)={[partial differential]or(Li;ψ)[partial differential]ψ(1)Ai+Yi1f(Yi,Ai[mid ]Li;ψ,η,α)}var*(Yi[mid ]Li;ψ,η,α)cy(L;ψ,η,α)

where var* (Yi[mid ]Li; ψ,η, α) is the conditional variance of Yigiven Liunder the law

f*(Y[mid ]L;ψ,η,α)=f(Y[mid ]L;ψ,η,α)var(A[mid ]Y,L;ψ,η)cy(L;ψ,η,α)

with cy (L; ψ, η, α) = E {var (A[mid ]Y, L; ψ, η) [mid ]L; ψ, η,α}.

Then, our proof that Ui(ψ;η,α)=Uirev(ψ;η,α) is concluded if we show that

var*(Ai[mid ]Li;ψ,η,α)ca(L;ψ,η,α)=var*(Yi[mid ]Li;ψ,η,α)cy(L;ψ,η,α)

or equivalently, that

var*(AiLi;ψ,η,α)E[var(Y[mid ]A,L;ψ,η)[mid ]L;ψ,η,α]=var*(Yi[mid ]Liψ,η,α)E[var(A[mid ]Y,L;α,η)[mid ]L;ψ,α]

Because the identity must hold for each Li, and because the law f (Y;A[mid ]L; ψ, η, α) is arbitrary, the last identity is proved if we show that for any pair of binary random variables (A, Y)

var*(A)var*(Y)=E{var(A[mid ]Y)}E{var(Y[mid ]A)}

where var* (A) stands for the variance of A under the law f* (A) = f (A) var (Y[mid ]A) =/E [var (Y[mid ]A)] and var* (Y) stands for the variance of Y under the law f* (Y) = f (Y) var (A[mid ]Y) =/E [var (A[mid ]Y)].

To prove identity (12) we first notice that when A and Y are binary it holds that

E{var(A[mid ]Y)}E{var(Y[mid ]A)}=var(A)var(Y)

(This can be easily checked by noticing that E (A[mid ]Y) is linear in Y so, E {var (A[mid ]Y)} = var (A) − cov (Y, A)2 /var (Y) and likewise E {var (Y [mid ] A)} = var (Y) − cov (Y, A)2 /var (A)). Next, by definition of f* (A),

var*(A)=var(A)var(Y[mid ]A=1)var(Y[mid ]A=0)/E[var(Y[mid ]A)]2


var*(A)var*(Y)=var(A)var(Y[mid ]A=1)var(Y[mid ]A=0)E[var(A[mid ]Y)]2var(Y)var(A[mid ]Y=1)var(A[mid ]Y=0)E[var(Y[mid ]A)]2=var(A)3var(Y[mid ]A=1)var(Y[mid ]A=0)var(Y)3var(A[mid ]Y=1)var(A[mid ]Y=0)

and the proof is concluded by noticing that

var(Y[mid ]A=1)var(Y[mid ]A=0)var(A[mid ]Y=1)var(A[mid ]Y=0)=var(Y)2var(A)2


1. Rothman K, Greenland S. Modern Epidemiology. Third Edition. Lippincott Williams and Wilkins; Philadelphia: 2008.
2. Breslow NE, Day NE. Statistical Methods in cancer research Vol I:The analysis of case-control data. Lyon: IARC; 1980.
3. Robins J, Hernan M. In: Estimation of the Causal Effects of Time-Varying Exposure.s Longitudinal Data Analysis. Longitudinal Data Analysis. Fitzmaurice G, Davidian M, Verbeke G, Molenberghs, editors. Chapman and Hall/CRC; 2008.
4. Tchetgen Tchetgen E, Robins J. The semi-parametric case-only estimator. Biometrics. 2010 In press. [PMC free article] [PubMed]
5. Tchetgen Tchetgen E, Robins J, Rotnitzy A. On doubly robust estimation in a semi-parametric odds ratio model. Biometrika. 2010;97(1):171–180. [PMC free article] [PubMed]
6. SAS Institute, Inc. SAS/STAT User's Guide, Version 8. Cary NC: 1999.
7. Chen YH. A Semi-parametric Odds Ratio Model for Measuring Association. Biometrics. 2007;63(2):413–421(9). [PubMed]
8. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “Inference for semiparametric models: Some questions and an answer” Statistica Sinica. 2001;11(4):920–936.
9. Hill HA, Coates RJ, Austin H, et al. Racial differences in tumor grade among women with endometrial cancer. Gynecol Oncol. 1995 Feb;56(2):154–63. [PubMed]
10. Bickel P, Klassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semi-parametric Models. Springer; New York: 1993.