
Stat Med. Author manuscript; available in PMC 2012 February 20.

PMCID: PMC3059519

NIHMSID: NIHMS237813

Corresponding author: Eric J. Tchetgen Tchetgen, Department of Epidemiology, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115.

Modern epidemiologic studies often aim to evaluate the causal effect of a point exposure on the risk of a disease from cohort or case-control observational data. Because confounding bias is of serious concern in such non-experimental studies, investigators routinely adjust for a large number of potential confounders in a logistic regression analysis of the effect of exposure on disease outcome. Unfortunately, when confounders are not correctly modeled, standard logistic regression is likely biased in its estimate of the effect of exposure, potentially leading to erroneous conclusions. We partially resolve this serious limitation of standard logistic regression analysis with a new iterative approach that we call ProRetroSpective estimation, which carefully combines standard logistic regression with a logistic regression analysis in which exposure is the dependent variable and the outcome and confounders are the independent variables. As a result, we obtain a correct estimate of the exposure-outcome odds ratio if either the standard logistic regression of the outcome given exposure and confounding factors is correct, or the regression model of exposure given the outcome and confounding factors is correct, but not necessarily both; that is, the estimator is double-robust. In fact, it also has certain advantageous efficiency properties. The approach is general in that it applies to both cohort and case-control studies, whether the design of the study is matched or unmatched on a subset of covariates. Finally, an application illustrates the methods using data from the National Cancer Institute's Black/White Cancer Survival Study.

A common aim of epidemiologic studies is to evaluate the causal effect of a dichotomous point exposure *A* on the risk of a disease outcome *Y* using observational data. In non-experimental studies, systematic bias due to confounding is a serious concern. For this reason, investigators routinely collect and adjust for a large number of confounding factors in data analyses. The current paper concerns a setting in which *Y* is dichotomous, and the observed data is either from a cohort study or from a case-control study, either one possibly matched on covariates *L*_{1}, …, *L _{s}*, and in which additional confounders *L*_{*s*+1}, …, *L _{p}* are also measured.

Logistic regression has a long tradition in epidemiologic research as the standard analytic tool for modeling the effect of *A* on *Y* within levels of *L* = (*L*_{1}, …, *L _{p}*), whether data is collected in a cohort study or in a case control study [1, 2]. In an unmatched cohort study, in which the variables (*Y*, *A*, *L*) are recorded on a random sample of subjects from the study population, the standard analysis postulates the linear logistic model

$$\text{logit}\{Pr(Y=1\mid A,L;\psi ,\eta )\}={\eta}_{1}+{\eta}_{2}^{T}L+\psi A$$

(1)

for the conditional probability Pr(*Y* = 1 ∣ *A*, *L*) of the occurrence of the outcome given exposure level *A* and confounders *L* in the study population, where logit(*x*) = log{*x*/(1 − *x*)} is the logit link function, and *η* = (*η*_{1}, *η*_{2}). Model (1) assumes homogeneity of the log-odds ratio function

$$\mathbf{\text{or}}\phantom{\rule{0.2em}{0ex}}(L)=log\left\{\frac{Pr(Y=1\mid A=1,L)\times Pr(Y=0\mid A=0,L)}{Pr(Y=1\mid A=0,L)\times Pr(Y=0\mid A=1,L)}\right\}$$

(2)

i.e. (1) assumes that for any individual with covariates *L*, the log-odds ratio **or** (*L*) is equal to a constant *ψ*_{0}, the same for all values of *L*, so that *ψ*_{0} is the true value of *ψ* in model (1). In this latter model, the primary scientific interest is usually in *ψ*_{0} since this quantifies, on the log-odds ratio scale, the dependence of the outcome *Y* on the exposure *A* within levels of the confounders *L* in the study population.
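The homogeneity claim is easy to verify numerically: under model (1), the log-odds ratio of equation (2) reduces algebraically to *ψ* at every covariate value. A minimal sketch, with hypothetical coefficient values (not taken from the paper):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# hypothetical coefficients for model (1); a single scalar confounder L
eta1, eta2, psi = -1.5, 0.7, 0.9

def log_or(l):
    # or(L) of equation (2), computed from the model's conditional probabilities
    p1 = expit(eta1 + eta2 * l + psi)   # Pr(Y=1 | A=1, L=l)
    p0 = expit(eta1 + eta2 * l)         # Pr(Y=1 | A=0, L=l)
    return math.log(p1 * (1 - p0) / (p0 * (1 - p1)))

print([round(log_or(l), 6) for l in (-2.0, 0.0, 3.0)])  # → [0.9, 0.9, 0.9]
```

The cancellation is exact: log_or(*l*) = logit(*p*1) − logit(*p*0) = *ψ*, whatever the value of *l*.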

Two assumptions are typically required for the log-odds ratio **or** (*L*) to have a causal interpretation. Such assumptions are best articulated within a potential outcomes framework. Specifically, let *Y*_{1} be the subject's potential outcome if, perhaps contrary to fact, he is exposed and *Y*_{0} be the subject's potential outcome if he is unexposed. The first assumption, known as the consistency assumption, states that a person's observed outcome *Y* is equal to his potential outcome *Y _{a}* for the level of exposure *a* he actually experienced. The second assumption, that of no unmeasured confounding, states that within levels of the measured confounders *L*, exposure *A* is independent of the potential outcomes (*Y*_{0}, *Y*_{1}). Under these two assumptions, **or** (*L*) equals the causal log-odds ratio

$${\mathbf{\text{or}}}^{\ast}(L)=log\left\{\frac{Pr({Y}_{1}=1\mid L)\times Pr({Y}_{0}=0\mid L)}{Pr({Y}_{1}=0\mid L)\times Pr({Y}_{0}=1\mid L)}\right\}$$

The quantity inside the logarithm, known as the causal odds ratio, quantifies, on the odds scale, the causal effect of exposure versus no exposure in subjects with value of the confounder equal to *L*. Specifically, it is equal to the ratio of the odds of a positive value of the outcome for subjects with value of the confounder equal to *L* in two hypothetical worlds, the first in which they are all exposed, and the second in which they are all unexposed.
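As a small worked example, with hypothetical potential-outcome risks (not taken from any study), the causal odds ratio compares the outcome odds in the all-exposed and all-unexposed worlds at a fixed confounder value:

```python
# hypothetical potential-outcome risks at a fixed confounder value L
p1 = 0.30  # Pr(Y_1 = 1 | L): risk if everyone with this L were exposed
p0 = 0.15  # Pr(Y_0 = 1 | L): risk if everyone with this L were unexposed

# ratio of the odds of the outcome in the two hypothetical worlds
causal_or = (p1 / (1 - p1)) / (p0 / (1 - p0))
print(round(causal_or, 4))  # → 2.4286
```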

In the event that unmeasured confounding is present, **or** (*L*) is not equal to **or*** (*L*), and generally does not have a causal interpretation. However, it remains a well defined summary measure of the partial association between exposure and outcome after adjustment for the measured covariates *L* and it often constitutes the primary target of inference in epidemiological analyses. Our discussion in this paper concerns inference about the parameter **or** (*L*) regardless of whether or not it has a meaningful causal interpretation.

Often the analysis is based on more flexible logistic regression models than the linear logistic regression model (1), which relax the assumption of homogeneity of the odds ratio across levels of the confounders. In such models the constant *ψ* in equation (1) is replaced by some non-constant function **or** (*L; ψ*); e.g. the model **or** (*L; ψ*) = *ψ*_{1} + *ψ*_{2}*L*_{1}, where *ψ* = (*ψ*_{1}, *ψ*_{2}) and *L*_{1} is the first component of *L*, allows for effect modification by *L*_{1} of the effect of *A* on *Y* on the log-odds ratio scale, thus yielding the equation

$$\begin{array}{ll}\text{logit}\{Pr(Y=1\mid A,L;\psi ,\eta )\}& ={\eta}_{1}+{\eta}_{2}^{T}L+\mathbf{\text{or}}(L;\psi )A\\ & ={\eta}_{1}+{\eta}_{2}^{T}L+{\psi}_{1}A+{\psi}_{2}{L}_{1}A\end{array}$$

(3)

In models (1) and (3), the regression function ${\eta}_{1}+{\eta}_{2}^{T}L$ is of interest only to the extent that it allows one to estimate the true value *ψ*_{0} of the parameter *ψ* in the model for the target odds ratio function **or** (*L*) by standard logistic regression, as implemented in virtually all existing statistical software packages. This regression function constitutes a model for the baseline log-odds function

$$\mathbf{\text{oddsY}}\phantom{\rule{0.2em}{0ex}}(L)=log\left\{\frac{Pr(Y=1\mid A=0,L)}{Pr(Y=0\mid A=0,L)}\right\}$$

in that its presence in equations (1) and (3) implies the assumption that **oddsY** (*L*) is equal to ${\eta}_{1}+{\eta}_{2}^{T}L$ for some unknown value (*η*_{1}, *η*_{2}). Such a model rules out interactions between the components of *L*. However, just as for **or** (*L*), more flexible models for **oddsY** (*L*) can be contemplated by replacing in equations (1) and (3) the function ${\eta}_{1}+{\eta}_{2}^{T}L$ with a more elaborate function **oddsY** (*L; η*), for example one that depends on cross-products and powers of the components of the vector *L*.

To allow for generality in our exposition, henceforth, we focus our discussion on models that admit the general form

$$\text{logit}\{Pr(Y=1A,L;\psi ,\eta )\}=\mathbf{\text{oddsY}}\phantom{\rule{0.2em}{0ex}}(L;\eta )+\mathbf{\text{or}}(L;\psi )\phantom{\rule{0.2em}{0ex}}A$$

(4)

Note that the simple model (1) is recovered from (4) by setting **or** (*L; ψ*) = *ψ* and **oddsY** (*L; η*) = ${\eta}_{1}+{\eta}_{2}^{T}L$.

When the target is estimation of *ψ*_{0}, an analysis based on model (1) or, more generally, model (4), has an important limitation. Specifically, if the assumption that **oddsY** (*L*) is linear in *L* is false, the logistic regression model (1) is mis-specified and the resulting maximum likelihood estimator of *ψ*_{0} is likely biased. Likewise, the maximum likelihood estimator of *ψ*_{0} is generally biased if the model **oddsY** (*L; η*) for **oddsY** (*L*) is mis-specified. This lack of robustness was recently investigated and confirmed in simulation studies [4] for the logistic linear model (1).

Recently, Tchetgen Tchetgen et al [5] showed that for estimating the parameters *ψ* of a model **or** (*L; ψ*) for the log-odds ratio function **or** (*L*) with data from an unmatched prospective cohort design there exist options, other than maximum likelihood under model (4), that confer partial protection against mis-specification of the model **oddsY** (*L; η*). Under their methodology, the analyst not only posits a model **oddsY** (*L; η*) for the baseline log-odds function **oddsY** (*L*) for the outcome, but also a model **oddsA** (*L; α*) for the population baseline log-odds function for exposure

$$\mathbf{\text{oddsA}}\phantom{\rule{0.2em}{0ex}}(L)=log\left\{\frac{Pr(A=1\mid Y=0,L)}{Pr(A=0\mid Y=0,L)}\right\}$$

In this formula, Pr(*A* = 1 ∣ *Y*, *L*) denotes the proportion of exposed subjects in the study population among those with outcome *Y* and confounders *L*. Thus, under the Tchetgen Tchetgen et al [5] methodology the analyst also needs to model how the proportion of exposed subjects in the study population changes with the values of the confounders *L* among subjects that did not experience the outcome. However, under this methodology, the analyst need not be right about both models. To obtain an unbiased (more precisely, consistent) estimator of *ψ*_{0}, he just needs to be right about one of the models **oddsY** (*L; η*) or **oddsA** (*L; α*), but need not know which of the two is right. Because one rarely knows which, if any, of the two models holds, these methods are preferred over maximum likelihood fitting of the logistic regression model (4), as they provide the analyst with two separate chances to obtain a consistent estimator of *ψ*_{0}.

Stated more precisely, the methods of Tchetgen Tchetgen et al [5] yield a specific class of estimators of *ψ*_{0} that are consistent if one, but not necessarily both of the following is true:

- the model **oddsA** (*L; α*) is correctly specified, even if the model **oddsY** (*L; η*) is incorrectly specified;
- the model **oddsY** (*L; η*) is correctly specified, even if the model **oddsA** (*L; α*) is incorrectly specified.

Estimators that, like those proposed by Tchetgen Tchetgen et al [5], are consistent provided one correctly specifies one of two nuisance models (in our problem, the nuisance models are **oddsY** (*L; η*) or **oddsA** (*L; α*)), without knowing which of the two is right, are called double-robust in the union model defined by the union of both nuisance models. The nuisance models are usually referred to as “working models”.

Tchetgen Tchetgen et al. [5] further characterized an optimal estimator in their specific class of double-robust estimators which, when both nuisance models **oddsA** (*L; α*) and **oddsY** (*L; η*) happen to be right, is highly efficient in that no other double-robust estimator of *ψ*_{0} in the same union model can have smaller large sample variance. Unfortunately, the computation of their estimators, and most notably that of the efficient one, requires solving a non-standard estimating equation, a task which cannot be easily performed without special software, thus seriously impeding its routine use.

The goal of this paper is three-fold. The first is to resolve the aforementioned computational obstacles of the methods of Tchetgen Tchetgen et al. [5] by giving a novel and simple implementation of their optimal double-robust estimator via an iterative procedure which solely involves the use of standard logistic regression software. We provide a new SAS [6] macro called *ProRetroSpec.sas* to carry out this algorithm.

The second goal relates to the analysis of cohort studies matched on baseline covariates and of case-control studies, matched or unmatched on baseline covariates. Under any of these designs **or** (*L*) remains an estimable target, i.e. it is identified from the sampling distribution of the data. Our second goal is to show that under any of these designs there exist estimators of the parameters *ψ* of models **or** (*L; ψ*) for **or** (*L*) that are double-robust in union models that involve two specific nuisance components of the distribution of the sampled data. Furthermore, we show that these estimators can be implemented with the macro *ProRetroSpec.sas.*

The third goal of the paper is to illustrate the methods with a data example. For this purpose, we use the ProRetroSpective algorithm to obtain double-robust estimates of racial differences in tumor grade among women with endometrial cancer in a re-analysis of data from the National Cancer Institute's Black/White Cancer Survival Study.

In section 2 we describe the algorithm that implements the double-robust estimator of *ψ*_{0} under unmatched cohort designs. In section 3 we show that the double-robust methodology extends to matched cohort studies and to case-control studies, matched or unmatched on covariates. In section 4, we give a data example. We finish in section 5 with some concluding remarks.

The algorithm to implement the double-robust estimator of Tchetgen Tchetgen et. al. [5], which for ease of reference we call ProRetroSpective, consists of three stages. Stage 1 involves the fitting of a logistic regression (4) where **or** (*L; ψ*) is the model for the log-odds ratio of substantive interest and **oddsY** (*L; η*) is the working model for the baseline log-odds for the outcome **oddsY** (*L*). Stage 2 involves fitting another logistic regression, this being a model for the probability of exposure given the outcome and confounders,

$$\text{logit}\{Pr(A=1\mid Y,L;\psi ,\alpha )\}=\mathbf{\text{oddsA}}(L;\alpha )+\mathbf{\text{or}}(L;\psi )\phantom{\rule{0.2em}{0ex}}Y$$

(5)

in which **oddsA** (*L; α*) is the working model for the baseline log-odds for exposure **oddsA** (*L*); for example, in the case where **or** (*L; ψ*) = *ψ* and
$\mathbf{\text{oddsA}}\phantom{\rule{0.2em}{0ex}}(L;\alpha )={\alpha}_{1}+{\alpha}_{2}^{\prime}L$, model (5) reduces to the model

$$\mathit{\text{logit}}\{Pr(A=1\mid Y,L;\psi ,\alpha )\}={\alpha}_{1}+{\alpha}_{2}^{\prime}L+\psi Y.$$

where $\alpha =({\alpha}_{1},{\alpha}_{2}^{\prime})$. Stage 3 involves a recursion; each step of the recursion involves the fit of a new logistic regression with outcome *Y*, a vector of specially constructed covariates and an offset. In this model, both the covariate vector and the offset incorporate the estimates obtained in Stages 1 and 2 and at the previous step of the recursion, and are specially constructed to yield, upon convergence of the recursion, the double-robust estimator of Tchetgen Tchetgen et al [5], throughout denoted ${\widehat{\psi}}_{\mathit{\text{dr}}}$. We have implemented this procedure in the macro *ProRetroSpec.sas* for the simple case where **or** (*L; ψ*) = *ψ*.

Interestingly, stage 2 of the algorithm involves the fit of model (5) for the probability of exposure given the outcome and covariates. When **oddsA** (*L; α*) is a correct model for the baseline odds function for exposure **oddsA** (*L*), model (5) is a correctly specified model for this conditional probability. To see this note that, by the well known invariance property of the odds ratio function, **or** (*L*) defined in equation (2) can also be written as

$$\mathbf{\text{or}}\phantom{\rule{0.2em}{0ex}}(L)=log\left\{\frac{Pr(A=1\mid Y=1,L)\times Pr(A=0\mid Y=0,L)}{Pr(A=1\mid Y=0,L)\times Pr(A=0\mid Y=1,L)}\right\}.$$

(6)

Then, we can simply reverse the roles of *A* and *Y* and argue as before to conclude that model (5) is correctly specified if **oddsA** (*L; α*) and **or** (*L; ψ*) are correctly specified. Notably, a consequence of this fact is that the maximum likelihood estimate of *ψ*_{0} in the retrospective model (5) is, under standard regularity conditions, consistent and approximately normal, provided that the working model **oddsA** (*L; α*) is correct. However, just as was the case for the fit of model (4), bias in the estimation of *ψ*_{0} is bound to result from a mis-specified working model **oddsA** (*L; α*).
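The invariance property invoked here is easy to check numerically: starting from any joint distribution of (*A*, *Y*) at a fixed *L* (the cell probabilities below are hypothetical), the prospective log-odds ratio of equation (2) and the retrospective log-odds ratio of equation (6) coincide:

```python
import math

# hypothetical joint distribution of (A, Y) given a fixed L
p = {(1, 1): 0.20, (1, 0): 0.15, (0, 1): 0.10, (0, 0): 0.55}

pA = {a: p[a, 1] + p[a, 0] for a in (0, 1)}   # Pr(A = a | L)
pY = {y: p[1, y] + p[0, y] for y in (0, 1)}   # Pr(Y = y | L)

# prospective log-odds ratio, equation (2): built from Pr(Y | A, L)
pro = math.log((p[1, 1] / pA[1]) * (p[0, 0] / pA[0])
               / ((p[0, 1] / pA[0]) * (p[1, 0] / pA[1])))
# retrospective log-odds ratio, equation (6): built from Pr(A | Y, L)
retro = math.log((p[1, 1] / pY[1]) * (p[0, 0] / pY[0])
                 / ((p[1, 0] / pY[0]) * (p[0, 1] / pY[1])))

print(round(pro, 6), round(retro, 6))  # equal: both reduce to log(p11*p00 / (p10*p01))
```

The marginal normalizations cancel, so both expressions reduce to the cross-product ratio of the joint cell probabilities.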

Remarkably, the logistic regressions (4) and (5) offer two genuine opportunities to estimate *ψ*_{0} correctly because **oddsA** (*L; α*) and **oddsY** (*L; η*) can always be specified coherently, i.e. the specification of a working model for **oddsA** (*L*) does not rule out the validity of any working model for **oddsY** (*L*), and vice-versa. The reason for the coherence of any models for **oddsA** (*L*) and **oddsY** (*L*) is that the baseline probability functions Pr(*Y* = 1 ∣ *A* = 0, *L*) = (1 + exp[−**oddsY** (*L*)])^{−1} and Pr(*A* = 1 ∣ *Y* = 0, *L*) = (1 + exp[−**oddsA** (*L*)])^{−1} are variation independent [7]. Unfortunately, an analyst who separately fits the logistic regressions (4) and (5) to a given data set will not know which, if any, of the two fits to report, since he will not know which, if any, of the models **oddsA** (*L; α*) and **oddsY** (*L; η*) is correct, particularly when the corresponding estimates of *ψ*_{0} differ beyond sampling variability. The double-robust estimator ${\widehat{\psi}}_{\mathit{\text{dr}}}$ implemented in the algorithm described below resolves this difficulty in that it yields an estimator that in large samples is unbiased (more precisely, consistent) provided one of models (4) or (5) is correctly specified, but not necessarily both.

- Obtain the maximum likelihood estimator, say (${\widehat{\psi}}_{p}$, ${\widehat{\eta}}_{p}$), of (*ψ*, *η*) by regressing *Y* on *A* and *L* using the regression model (4), then proceed to the next step.
- Obtain the maximum likelihood estimator, say (${\widehat{\psi}}_{r}$, ${\widehat{\alpha}}_{r}$), of (*ψ*, *α*) by regressing *A* on *Y* and *L* using the regression model (5), then proceed to the next step.
- Set ${\widehat{\psi}}_{\mathit{\text{dr}}}^{(0)}=0$ and *j* = 0, and carry out the following recursive algorithm:
  - (a) Set *j* = *j* + 1.
  - (b) Compute, for each subject *i* in the sample,
    $${\mathit{\text{offset}}}_{i}^{(j)}=\mathbf{\text{oddsY}}\phantom{\rule{0.2em}{0ex}}({L}_{i};{\widehat{\eta}}_{p})+\mathbf{\text{or}}\phantom{\rule{0.2em}{0ex}}\left({L}_{i};{\widehat{\psi}}_{\mathit{\text{dr}}}^{(j-1)}\right)\times {A}_{i},$$
    $${d}_{i}^{(j)}=d\left({L}_{i};{\widehat{\psi}}_{\mathit{\text{dr}}}^{(j-1)},{\widehat{\eta}}_{p},{\widehat{\alpha}}_{r}\right),$$
    and
    $${X}_{i}^{(j)}={\left.\frac{\partial \mathbf{\text{or}}\phantom{\rule{0.2em}{0ex}}({L}_{i};\psi )}{\partial \psi}\right|}_{\psi ={\widehat{\psi}}_{\mathit{\text{dr}}}^{(j-1)}}\times \left({A}_{i}-{d}_{i}^{(j)}\right),$$
    where for every value of (*ψ*, *η*, *α*),
    $$d({L}_{i};\psi ,\eta ,\alpha )={\left[1+\frac{exp\{\mathbf{\text{or}}({L}_{i};\psi )\}\times [1+exp\{\mathbf{\text{or}}({L}_{i};\psi )+\mathbf{\text{oddsY}}({L}_{i};\eta )\}]}{exp\{\mathbf{\text{oddsA}}({L}_{i};\alpha )\}\times [1+exp\{\mathbf{\text{oddsY}}({L}_{i};\eta )\}]}\right]}^{-1}$$
  - (c) Obtain the maximum likelihood estimator, say ${\widehat{\nu}}^{(j)}$, of the regression coefficients *ν* in a logistic regression model with outcome *Y*, covariate vectors ${X}_{i}^{(j)}$, offsets ${\mathit{\text{offset}}}_{i}^{(j)}$, *i* = 1, …, *n*, and no intercept.
  - (d) Set ${\widehat{\psi}}_{\mathit{\text{dr}}}^{(j)}={\widehat{\psi}}_{\mathit{\text{dr}}}^{(j-1)}+{\widehat{\nu}}^{(j)}$. If the absolute value of each of the components of ${\widehat{\nu}}^{(j)}$ falls below a user-specified threshold, say 0.0001 or smaller, set ${\widehat{\psi}}_{\mathit{\text{dr}}}={\widehat{\psi}}_{\mathit{\text{dr}}}^{(j)}$ and exit the algorithm; otherwise, go to step (a).
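The three stages can be sketched in a short self-contained program. The sketch below assumes the simple setting **or** (*L; ψ*) = *ψ* with a single confounder and linear working models; it fits the Stage 1 and Stage 2 regressions by Newton-Raphson and then runs the Stage 3 recursion. It is an illustration of the algorithm as described above, not the authors' SAS macro, and the simulated data, coefficient values and tolerances are our own choices:

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve(M, b):
    # Gaussian elimination with partial pivoting for a small linear system M x = b
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][k] - f * A[c][k] for k in range(n + 1)]
    return [A[i][n] / A[i][i] for i in range(n)]

def logistic_fit(X, y, offset=None, iters=25):
    # Newton-Raphson maximum likelihood for logistic regression, optional offset
    n, p = len(X), len(X[0])
    off = offset or [0.0] * n
    beta = [0.0] * p
    for _ in range(iters):
        mu = [expit(off[i] + sum(X[i][k] * beta[k] for k in range(p))) for i in range(n)]
        grad = [sum(X[i][k] * (y[i] - mu[i]) for i in range(n)) for k in range(p)]
        H = [[sum(X[i][k] * X[i][m] * mu[i] * (1.0 - mu[i]) for i in range(n))
              for m in range(p)] for k in range(p)]
        beta = [beta[k] + s for k, s in enumerate(solve(H, grad))]
    return beta

# simulated unmatched cohort in which both working models happen to be correct
random.seed(1)
n, psi_true = 2000, 1.0
L = [random.gauss(0.0, 1.0) for _ in range(n)]
A = [1 if random.random() < expit(-0.3 + 0.8 * l) else 0 for l in L]
Y = [1 if random.random() < expit(-1.0 + 0.5 * l + psi_true * a) else 0
     for l, a in zip(L, A)]

# Stage 1: prospective fit of model (4): logit Pr(Y=1|A,L) = eta1 + eta2*L + psi*A
eta1, eta2, psi_p = logistic_fit([[1.0, l, float(a)] for l, a in zip(L, A)], Y)
# Stage 2: retrospective fit of model (5): logit Pr(A=1|Y,L) = alpha1 + alpha2*L + psi*Y
alpha1, alpha2, psi_r = logistic_fit([[1.0, l, float(y)] for l, y in zip(L, Y)], A)

# Stage 3: the recursion; with or(L; psi) = psi the derivative factor in X_i is 1
psi_dr, nu = 0.0, 1.0
for _ in range(100):
    offset = [eta1 + eta2 * L[i] + psi_dr * A[i] for i in range(n)]
    X = []
    for i in range(n):
        oy, oa = eta1 + eta2 * L[i], alpha1 + alpha2 * L[i]
        d = 1.0 / (1.0 + math.exp(psi_dr) * (1.0 + math.exp(psi_dr + oy))
                   / (math.exp(oa) * (1.0 + math.exp(oy))))
        X.append([A[i] - d])
    nu = logistic_fit(X, Y, offset=offset)[0]   # no-intercept fit with offset
    psi_dr += nu
    if abs(nu) < 1e-8:
        break

print(round(psi_p, 3), round(psi_r, 3), round(psi_dr, 3))
```

When both working models are correct, as in this simulation, the three estimates of *ψ* should be close to one another and to the truth; mis-specifying one of the two working models (e.g. generating *Y* from a model non-linear in *L*) would leave the double-robust estimate consistent as long as the other model remains correct.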

An informal argument as to why the algorithm returns a double-robust estimator of *ψ* is as follows. Consider fitting the logistic regression model (4) in two stages. In stage 1, one computes the maximum likelihood estimators (${\widehat{\psi}}_{p}$, ${\widehat{\eta}}_{p}$). In stage 2, holding ${\widehat{\eta}}_{p}$ fixed, one re-estimates *ψ* by solving a weighted estimating equation of the form

$$\sum _{i=1}^{n}{\omega}_{i}(\psi )\frac{\partial \mathbf{\text{or}}({L}_{i};\psi )}{\partial \psi}[{Y}_{i}-\text{expit}\{\mathbf{\text{oddsY}}({L}_{i};{\widehat{\eta}}_{p})+\mathbf{\text{or}}({L}_{i};\psi ){A}_{i}\}]=0$$

(7)

where the function expit(*x*) = {1 + exp(−*x*)}^{−1} and ${\omega}_{i}(\psi )$ is a weight that may depend on the data and on preliminary estimates. The resulting two-stage estimator of *ψ* is consistent for *ψ*_{0}, for essentially any choice of weight, provided the working model **oddsY** (*L; η*) for the baseline log-odds of the outcome is correctly specified.

The preceding argument is the essence of the consistency of the estimator ${\widehat{\psi}}_{\mathit{\text{dr}}}$ returned by the ProRetroSpective algorithm. Specifically, at the *j*^{th} iteration of the recursion, the fitted coefficient ${\widehat{\nu}}^{(j)}$ approximately satisfies

$$\sum _{i=1}^{n}{X}_{i}^{(j)}\left\{{Y}_{i}-\text{expit}\left[{\mathit{\text{off set}}}_{i}^{(j)}+\nu \prime {X}_{i}^{(j)}\right]\right\}\approx 0$$

(8)

The algorithm stops precisely at an iteration, say the *J*^{th} one, at which ${\widehat{\nu}}^{(J)}\approx 0$. At that iteration, equation (8) reduces to the weighted estimating equation (7) with weights ${\omega}_{i}(\psi )={A}_{i}-{d}_{i}^{(J)}$, so that ${\widehat{\psi}}_{\mathit{\text{dr}}}={\widehat{\psi}}_{\mathit{\text{dr}}}^{(J)}$ approximately solves (7) and is therefore consistent for *ψ*_{0} whenever the working model **oddsY** (*L; η*) is correctly specified.

Next, we argue that ${\widehat{\psi}}_{\mathit{\text{dr}}}$ also converges in probability to the true value *ψ*_{0} when the working model **oddsA** (*L; α*) is correctly specified, even if **oddsY** (*L; η*) is not. To do so, consider the reverse estimator ${\widehat{\psi}}_{\mathit{\text{dr}}}^{\mathit{\text{rev}}}$ defined as the solution to the estimating equation

$$\sum _{i=1}^{n}{\omega}_{i}^{\mathit{\text{rev}}}\left(\psi \right)\frac{\partial \mathbf{\text{or}}\left({L}_{i};\psi \right)}{\partial \psi}\left[{A}_{i}-\text{expit}\left\{\mathbf{\text{oddsA}}\left({L}_{i};{\widehat{\alpha}}_{r}\right)+\mathbf{\text{or}}\left({L}_{i};\psi \right){Y}_{i}\right\}\right]=0$$

(9)

where ${\omega}_{i}^{\mathit{\text{rev}}}(\psi )={Y}_{i}-{d}^{\mathit{\text{rev}}}({L}_{i};\psi ,{\widehat{\alpha}}_{r},{\widehat{\eta}}_{p})$ and ${d}^{\mathit{\text{rev}}}({L}_{i};\psi ,\alpha ,\eta )$ is defined as $d({L}_{i};\psi ,\eta ,\alpha )$ with the roles of *A* and *Y*, and hence of **oddsA** (*L; α*) and **oddsY** (*L; η*), interchanged. By the argument of the preceding paragraphs, ${\widehat{\psi}}_{\mathit{\text{dr}}}^{\mathit{\text{rev}}}$ is consistent for *ψ*_{0} whenever the working model **oddsA** (*L; α*) is correctly specified.

Our proof of double robustness now hinges on the remarkable fact that, mainly due to the special form of *d*(*L _{i}*; *ψ*, *η*, *α*), the two estimators coincide: ${\widehat{\psi}}_{\mathit{\text{dr}}}={\widehat{\psi}}_{\mathit{\text{dr}}}^{\mathit{\text{rev}}}$. Since ${\widehat{\psi}}_{\mathit{\text{dr}}}$ is consistent when **oddsY** (*L; η*) is correct and ${\widehat{\psi}}_{\mathit{\text{dr}}}^{\mathit{\text{rev}}}$ is consistent when **oddsA** (*L; α*) is correct, their common value is consistent when either working model is correct.

That ${\widehat{\psi}}_{\mathit{\text{dr}}}$ and ${\widehat{\psi}}_{\mathit{\text{dr}}}^{\mathit{\text{rev}}}$ are equal is a consequence of the following two facts. First, a close examination of equations (7) and (9) reveals that ultimately ${\widehat{\psi}}_{\mathit{\text{dr}}}$ solves $\sum _{i=1}^{n}{U}_{i}(\psi ;{\widehat{\eta}}_{p},{\widehat{\alpha}}_{r})=0$ while ${\widehat{\psi}}_{\mathit{\text{dr}}}^{\mathit{\text{rev}}}$ solves $\sum _{i=1}^{n}{U}_{i}^{\mathit{\text{rev}}}(\psi ;{\widehat{\eta}}_{p},{\widehat{\alpha}}_{r})=0$, where

$${U}_{i}\left(\psi ;\eta ,\alpha \right)=\left\{{A}_{i}-d\left({L}_{i};\psi ,\eta ,\alpha \right)\right\}\frac{\partial \mathbf{\text{or}}\left({L}_{i};\psi \right)}{\partial \psi}\left[{Y}_{i}-\text{expit}\left\{\mathbf{\text{oddsY}}\left({L}_{i};\eta \right)+\mathbf{\text{or}}\left({L}_{i};\psi \right){A}_{i}\right\}\right]$$

and ${U}_{i}^{\mathit{\text{rev}}}\left(\psi ;\eta ,\alpha \right)$ is its reverse analog, i.e.

$${U}_{i}^{\mathit{\text{rev}}}\left(\psi ;\eta ,\alpha \right)=\left\{{Y}_{i}-{d}^{\mathit{\text{rev}}}\left({L}_{i};\psi ,\alpha ,\eta \right)\right\}\frac{\partial \mathbf{\text{or}}\left({L}_{i};\psi \right)}{\partial \psi}\left[{A}_{i}-\text{expit}\left\{\mathbf{\text{oddsA}}\left({L}_{i};\alpha \right)+\mathbf{\text{or}}\left({L}_{i};\psi \right){Y}_{i}\right\}\right]$$

Second, as shown in the appendix, our special choice of *d*(*L _{i}*; *ψ*, *η*, *α*) and its reverse analog ensures that the two estimating equations coincide, and hence admit the same solution.

In light of this discussion, we note that, interestingly, the analyst will obtain exactly the same estimate of *ψ* (up to numerical precision) whether she implements the ProRetroSpective algorithm in its original form or in the reversed version of the algorithm, in which the roles of *A* and *Y* are interchanged.

The estimator ${\widehat{\psi}}_{\mathit{\text{dr}}}$ returned by the preceding algorithm is approximately normal in large samples [5]; therefore, Wald-type inference on *ψ*_{0} is straightforward once its standard error is estimated, for instance with the nonparametric bootstrap.
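The paper recommends the nonparametric bootstrap for standard errors (section 3). A sketch, with a crude log odds ratio standing in for ${\widehat{\psi}}_{\mathit{\text{dr}}}$ as a computationally cheap placeholder (in practice one would rerun the full ProRetroSpective algorithm on each resample); the simulated data and replicate count are hypothetical:

```python
import math
import random

def log_odds_ratio(rows):
    # crude log odds ratio from (A, Y) pairs; a stand-in for the double-robust estimator
    n = {(a, y): 0 for a in (0, 1) for y in (0, 1)}
    for a, y in rows:
        n[(a, y)] += 1
    return math.log(n[(1, 1)] * n[(0, 0)] / (n[(1, 0)] * n[(0, 1)]))

random.seed(7)
data = []
for _ in range(1000):
    a = random.randint(0, 1)
    y = 1 if random.random() < (0.5 if a else 0.3) else 0
    data.append((a, y))

est = log_odds_ratio(data)

# nonparametric bootstrap: resample subjects with replacement, re-estimate each time
boot = []
for _ in range(200):
    resample = [random.choice(data) for _ in range(len(data))]
    boot.append(log_odds_ratio(resample))
mean_b = sum(boot) / len(boot)
se = math.sqrt(sum((b - mean_b) ** 2 for b in boot) / (len(boot) - 1))

wald_ci = (est - 1.96 * se, est + 1.96 * se)
print(round(est, 3), round(se, 3), tuple(round(c, 3) for c in wald_ci))
```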

In this section we argue that the double-robust methodology implemented with the ProRetroSpective algorithm is useful for analyzing data from three other common epidemiological designs, namely matched cohort study designs, and case-control studies, matched or unmatched on a subset of the confounders. As summarized by Rothman and Greenland [1], *matching generally refers to the selection of a reference series (unexposed subjects in a cohort study or controls in a case-control study) that is identical, or nearly so, to the index series with respect to the distribution of one or more potentially confounding factors*. In the following, we make the crucial assumption that the matching covariates do not grow in dimension with the sample size.

Under each of these designs, the sampling distribution, say *p _{s}*(*Y*, *A*, *L*), of the data generally differs from the distribution of (*Y*, *A*, *L*) in the study population. Nevertheless, under each design the sampling log-odds ratio function

$$\begin{array}{ll}{\mathbf{\text{or}}}_{s}\phantom{\rule{0.2em}{0ex}}(L)& =log\left\{\frac{{Pr}_{s}(Y=1\mid A=1,L)\times {Pr}_{s}(Y=0\mid A=0,L)}{{Pr}_{s}(Y=1\mid A=0,L)\times {Pr}_{s}(Y=0\mid A=1,L)}\right\}\\ & =log\left\{\frac{{Pr}_{s}(A=1\mid Y=1,L)\times {Pr}_{s}(A=0\mid Y=0,L)}{{Pr}_{s}(A=1\mid Y=0,L)\times {Pr}_{s}(A=0\mid Y=1,L)}\right\}\end{array}$$

remains equal to the log-odds ratio function **or** (*L*) in the population. This implies that, in principle, any of these designs makes it possible to unbiasedly (more precisely, consistently) estimate the parameter *ψ*_{0} of a model **or** (*L; ψ*) for the population log-odds ratio function **or** (*L*).
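A minimal numerical illustration of why the target survives the design, using a hypothetical joint distribution: multiplying the inclusion probability by a factor that depends on *Y* only, as case-control sampling does, leaves the cross-product ratio, and hence **or_{s}** (*L*), unchanged:

```python
# hypothetical joint distribution of (A, Y) at a fixed L in the study population
p = {(1, 1): 0.10, (1, 0): 0.20, (0, 1): 0.05, (0, 0): 0.65}

def cross_ratio(q):
    # the odds ratio as a cross-product ratio of the joint cell probabilities
    return (q[1, 1] * q[0, 0]) / (q[1, 0] * q[0, 1])

# case-control sampling: include every case, a 10% sample of non-cases
w = {1: 1.0, 0: 0.1}
ps = {(a, y): w[y] * p[a, y] for (a, y) in p}
total = sum(ps.values())
ps = {k: v / total for k, v in ps.items()}

# equal up to rounding: the y-dependent weights cancel in the cross-product ratio
print(cross_ratio(p), cross_ratio(ps))
```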

Because under a matched cohort design selection into the study depends only on exposure and the matching covariates, the conditional probability of disease given exposure status and confounders in the sample, Pr* _{s}* (*Y* = 1 ∣ *A*, *L*), coincides with its population counterpart Pr(*Y* = 1 ∣ *A*, *L*). Consequently, the prospective logistic model (4) may be fit to the sampled data, and the resulting maximum likelihood estimator of *ψ*_{0} is consistent provided the working model **oddsY** (*L; η*) is correctly specified, but is generally biased otherwise.

The opposite occurs when analyzing a case-control design, whether matched on a subset of confounders or not. Specifically, when data is obtained from these designs, selection depends only on the outcome and the matching covariates, so the sampling conditional probability of exposure given disease status and confounders, Pr* _{s}* (*A* = 1 ∣ *Y*, *L*), coincides with its population counterpart Pr(*A* = 1 ∣ *Y*, *L*). The retrospective logistic model (5) may then be fit to the sampled data, and the resulting maximum likelihood estimator of *ψ*_{0} is consistent provided the working model **oddsA** (*L; α*) is correctly specified, but is generally biased otherwise.

Just as for the analysis of an unmatched cohort study, the lack of robustness of maximum likelihood estimation can be partially remedied by moving away from this methodology and adopting instead a double-robust approach. In fact, the methods developed by Tchetgen Tchetgen et al. [5] and, in particular, the *ProRetroSpective* algorithm, can be applied to derive a double-robust estimator of *ψ*_{0}. Specifically, when data arises from any of the three aforementioned designs, the preceding algorithm can be applied step by step to yield an estimator ${\widehat{\psi}}_{\mathit{\text{dr}}}$ that is consistent for *ψ*_{0} provided at least one of two working models is correctly specified, where the working models now pertain to the sampling baseline log-odds functions

$${\mathbf{\text{oddsA}}}_{s}(L)=log\left\{\frac{{Pr}_{s}(A=1\mid Y=0,L)}{{Pr}_{s}(A=0\mid Y=0,L)}\right\}\phantom{\rule{0.5em}{0ex}}\text{and}\phantom{\rule{0.5em}{0ex}}{\mathbf{\text{oddsY}}}_{s}(L)=log\left\{\frac{{Pr}_{s}(Y=1\mid A=0,L)}{{Pr}_{s}(Y=0\mid A=0,L)}\right\}$$

When data come from a matched cohort design, **oddsY_{s}** (*L*) coincides with the population baseline log-odds function **oddsY** (*L*), so that a working model for the latter is simultaneously a working model for the former; in contrast, **oddsA_{s}** (*L*) generally differs from its population counterpart **oddsA** (*L*), and the corresponding working model must be understood as a model for the baseline log-odds of exposure in the sample.

Likewise, because under a case-control design, whether matched or unmatched, **oddsA_{s}** (*L*) coincides with the population baseline log-odds function **oddsA** (*L*), a working model for the latter is simultaneously a working model for the former; in contrast, **oddsY_{s}** (*L*) generally differs from its population counterpart **oddsY** (*L*), and the corresponding working model must be understood as a model for the baseline log-odds of disease in the sample.

In essence, the double-robustness of ${\widehat{\psi}}_{\mathit{\text{dr}}}$ is justified because, regardless of the sampling design, the theory developed in Tchetgen Tchetgen et al. [5] establishes that this estimator is consistent for *ψ*_{0} provided at least one of the working models for **oddsY_{s}** (*L*) and **oddsA_{s}** (*L*) is correctly specified.

In general, we also recommend using the nonparametric bootstrap to estimate the covariance of ${\widehat{\psi}}_{\mathit{\text{dr}}}$ under any of the three aforementioned sampling designs. It is interesting to note that, when both working models are correct and **or** (*L; ψ*) = *ψ*, the large sample variance of ${\widehat{\psi}}_{\mathit{\text{dr}}}$ takes the form

$${n}^{-1}{\left\{E\left[{\left\{\sum _{a,y\in \{0,1\}}\frac{1}{Pr(A=a,Y=y\mid L)}\right\}}^{-1}\right]\right\}}^{-1}$$

(10)

Interestingly, the expression in the preceding display is similar in functional form to the variance of the log-odds ratio in a simple 2 × 2 table, and in fact, when *L* is absent, reduces exactly to the latter. An explicit expression for Pr(*Y* = *y*, *A* = *a* ∣ *L*) as a function of **oddsA** (*L*), **oddsY** (*L*) and **or** (*L*) is given in Appendix B, which can be used to form an empirical version of (10). However, this empirical variance estimator should not be used to estimate the variance of ${\widehat{\psi}}_{\mathit{\text{dr}}}$ because it is only consistent if both working models are correct.
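For reference, the 2 × 2 special case alluded to above is Woolf's classical variance formula for a sample log odds ratio, sketched here with hypothetical counts:

```python
import math

# hypothetical 2x2 table of counts n_ay, exposure a in rows and outcome y in columns
n11, n10, n01, n00 = 40, 60, 25, 75

log_or = math.log((n11 * n00) / (n10 * n01))   # sample log odds ratio
var = 1 / n11 + 1 / n10 + 1 / n01 + 1 / n00    # Woolf's variance estimate
se = math.sqrt(var)

print(round(log_or, 4), round(se, 4))  # → 0.6931 0.3082
```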

In this section, we illustrate the double-robust methods with a re-analysis of data on an unmatched cohort study of 207 white and 81 black randomly sampled postmenopausal women diagnosed with primary cancer of the uterine corpus between 1985 and 1987, from the National Cancer Institute's Black/White Cancer Survival Study (BWCSS). In the original analysis of these data, the primary aim of Hill et al. [9] was to investigate whether black women with endometrial cancer have more advanced disease and less favorable tumor grade than do white women. In that vein, the authors report a moderate but non-statistically significant racial difference in tumor grade after control of important explanatory variables, which they postulate may reflect true biologic variation between blacks and whites and may explain, in part, the observation that blacks with endometrial cancer have a worse prognosis. An extensive description of the design, methodology and analysis of the original study is given in the original manuscript [9], and therefore, is not reproduced here. For our purposes, we consider the estimation of the hypothesized association on the odds ratio scale between race (*A* = 1 for black vs *A* = 0 for white) and the presence of poorly differentiated tumors (*Y* = 1 for poorly differentiated tumors vs *Y* = 0 for moderately or well differentiated tumors), adjusting for the following explanatory factors *L* for differences in grade between blacks and whites: age, histologic subtype, use of estrogens and smoking status.

For comparison, we consider the three analytical methods discussed in section 2, which we use to estimate the odds ratio association between race and the risk of poorly differentiated tumors within levels of *L*. Briefly, recall that the first approach, which also corresponds to the approach taken by Hill et al. [9], consists of fitting the logistic regression of the indicator of poorly differentiated tumors on race and potential confounders using model (1). The second approach fits the logistic regression of the race indicator on the indicator of poorly differentiated tumors and potential confounders using the model given by equation (5), with the choice
$\mathbf{\text{oddsA}}(L;\alpha )={\alpha}_{1}+{\alpha}_{2}^{\prime}L$, where
$\alpha =({\alpha}_{1},{\alpha}_{2}^{\prime})$. Finally, we obtain a doubly robust estimator, which combines these two regression models, by implementing the ProRetroSpective algorithm.

To illustrate the double-robustness property, we consider three working models. In the first model **M**_{1}, we set *α*_{2} = 0 (thus assuming *L* is independent of *A* given *Y* = 0) and estimate *η* by maximum likelihood in the first stage of the algorithm. In the second model **M**_{2}, we set *η*_{2} = 0 (thus assuming *L* is independent of *Y* given *A* = 0) and estimate *α* by maximum likelihood in the second stage of the algorithm. In the third model **M**_{3}, both *η* and *α* are estimated by maximum likelihood in the first two stages of the algorithm. Bootstrap standard errors of all estimators were also obtained (based on 500 data sets sampled with replacement).

In model **M**_{1}, as we would expect, the crude reverse analysis, which does not account for measured confounders, differs from the prospective and double-robust estimators: exp(${\widehat{\psi}}_{\text{reverse},{\mathbf{\text{M}}}_{1}}$) = 2.2 (95% CI = 1.13-3.93) for the reverse regression approach, exp(${\widehat{\psi}}_{\text{prosp},{\mathbf{\text{M}}}_{1}}$) = 1.88 (95% CI = 0.87-4.05) for the prospective logistic regression approach, and exp(${\widehat{\psi}}_{\text{dr},{\mathbf{\text{M}}}_{1}}$) = 1.88 (95% CI = 0.87-4.04) for the double-robust approach. The opposite holds in model **M**_{2}, which gives exp(${\widehat{\psi}}_{\text{reverse},{\mathbf{\text{M}}}_{2}}$) = 1.89 (95% CI = 0.87-4.09) for the reverse approach, exp(${\widehat{\psi}}_{\text{prosp},{\mathbf{\text{M}}}_{2}}$) = 2.2 (95% CI = 1.13-3.93) for the now-crude prospective approach, and exp(${\widehat{\psi}}_{\text{dr},{\mathbf{\text{M}}}_{2}}$) = 1.88 (95% CI = 0.87-4.04) for the double-robust approach. In model **M**_{3}, the three methods appear to agree on the estimated association between race and tumor grade, yielding estimates ${\widehat{\psi}}_{\text{prosp},{\mathbf{\text{M}}}_{3}}={\widehat{\psi}}_{\text{prosp},{\mathbf{\text{M}}}_{1}}$ and ${\widehat{\psi}}_{\text{reverse},{\mathbf{\text{M}}}_{3}}={\widehat{\psi}}_{\text{reverse},{\mathbf{\text{M}}}_{2}}$ for the prospective and reverse regressions, respectively, and the following double-robust point estimate and Wald-type confidence interval for the odds ratio of interest: exp(${\widehat{\psi}}_{\text{dr},{\mathbf{\text{M}}}_{3}}$) = 1.88 (95% CI = 0.87-4.04). These highly concordant estimates generally agree with previously published results of an elevated (though only marginally statistically significant) risk of poorly differentiated tumors experienced by black women when compared to white women in this study [9]. Nonetheless, these results should be interpreted with caution and should only be taken as an illustrative example of the methods, because the reported (borderline) association cannot necessarily be interpreted as a causal effect without making strong assumptions; for instance, it may not be realistic to assume that *L* contains all common correlates of *A* and *Y* in this data set.

The present paper has focused on the implementation of a new estimator of the parameters of a model for the log odds ratio function **or** (*L*), in the simple case of a binary exposure. An extension of the proposed ProRetroSpective iterative procedure to allow for categorical and continuous exposure, as well as for polytomous outcome is also of interest and will be addressed elsewhere.

Missing covariate data are a feature common to most epidemiological studies, and it is now widely recognized that the routine use of standard complete-case analysis (which excludes subjects with any missing variable) can yield highly inefficient estimators of *ψ*_{0} in logistic regression analysis, and can be severely biased when missingness depends on the outcome or on correlates of it. In a companion manuscript, we consider the estimation of *ψ*_{0} when *Y* and a subset *L*_{obs} of the confounders are observed on all subjects.

Interestingly, a referee asked whether our algorithm could be framed as a specific application of targeted maximum likelihood (TML) estimation for semi-parametric logistic regression, as described by van der Laan and colleagues in an unpublished manuscript (Readings in Targeted Maximum Likelihood Estimation, 2009, page 621, available at www.bepress.com). It turns out that, similar to our method, in the absence of model mis-specification, the TML algorithm of van der Laan also solves the efficient score equations for semi-parametric logistic regression, which were first derived by Bickel et al. [10]. However, unlike Bickel et al. [10] and van der Laan et al., in this paper we give a parametrization of the law of (*Y, A*) given *L* in terms of the models *or*(*L*; *ψ*), *oddsA*(*L*; *α*) and *oddsY*(*L*; *η*), which, when used to evaluate the efficient score, delivers an estimator with the double-robust property. We should emphasize that under the standard parametrization, which replaces the reverse baseline regression Pr(*A* = 1 | *Y* = 0, *L*; *α*) (i.e., *oddsA*(*L*; *α*)) with a variation-independent model for the propensity score Pr(*A* = 1 | *L*), Tchetgen Tchetgen et al. [5] formally established that no useful double-robust estimator exists, except possibly under the null hypothesis of no treatment effect; this is not surprising since, under the null, Pr(*A* = 1 | *Y* = 0, *L*) = Pr(*A* = 1 | *L*). Thus, unless targeted maximum likelihood is implemented using our proposed parametrization, it will generally not share the double-robust property of our estimator. This point, which was not recognized by van der Laan et al., also highlights the fact that targeted maximum likelihood estimation cannot in itself reveal which parametrization of the observed data likelihood yields a double-robust estimator when used to solve the efficient score.

The authors would like to thank Jamie Robins for helpful comments, and Roger Logan for programming assistance.

Sources of financial support: Andrea Rotnitzky was partially funded by NIH grant GM48704

ProRetroSpec Macro

/* ****************************ProRetroSpec Macro*****************

Author:
Tchetgen Tchetgen Eric
Depts Epidemiology and Biostatistics
Harvard University
email for inquiries: ude.dravrah.hpsh@egtehcte
*/
/*
Input to the macro:
  data    = name of data set;
  y       = name of binary outcome variable;
  a       = name of binary exposure variable;
  oddsY   = list of variables to include in the outcome regression model;
  oddsA   = list of variables to include in the exposure regression model;
  epsilon = stopping rule of the algorithm, default value is 0.0001

Output of the macro:
  prospective      = log-odds ratio estimated by a regression of Y on A and L;
  retrospective    = log-odds ratio estimated by a regression of A on Y and L;
  ProRetroSpective = doubly robust estimator of the log-odds ratio;
  eps2             = stopping value, should be less than epsilon
*/
%macro ProRetroSpec(data=, y=, a=, oddsY=, oddsA=, epsilon=0.0001, bsamp=200);
ods listing close;

/* Draw &bsamp bootstrap samples with replacement */
data boot;
  do sample=1 to &bsamp;
    do i=1 to nobs;
      pt=round(ranuni(12)*nobs);
      set &data nobs=nobs point=pt;
      output;
    end;
  end;
  stop;
run;

%do j=0 %to &bsamp;
  %if &j=0 %then %do;
    data _cero_b; set &data; run;
  %end;
  %else %do;
    data _cero_b; set boot;
      if sample=&j then output;
    run;
  %end;

  /* Prospective logistic regression of Y on A and L */
  proc logistic descending outest=psi_y(keep=&a) covout data=_cero_b;
    model &y=&a &oddsY;
    output out=predy xbeta=ybeta;
  run;

  /* Reverse (retrospective) logistic regression of A on Y and L */
  proc logistic descending outest=psi_a(keep=&y) data=_cero_b;
    model &a=&y &oddsA;
    output out=preda xbeta=abeta;
  run;

  data psi_epsilon;
    epsilon=0;
    psi_j=0;
  run;

  data a_coeff(rename=(&a=a_coeff));
    set psi_y;
    if _n_=1 then output;
  run;

  data y_coeff(rename=(&y=y_coeff));
    set psi_a;
    if _n_=1 then output;
  run;

  data data_j; merge y_coeff a_coeff preda predy psi_epsilon;
    retain a_c y_c eps psi 0;
    if _n_=1 then a_c=a_coeff;
    if _n_=1 then y_c=y_coeff;
    if _n_=1 then eps=epsilon;
    if _n_=1 then psi=psi_j;
    drop a_coeff y_coeff epsilon psi_j;
    if &a=1 then ybeta=ybeta-a_c;
    if &y=1 then abeta=abeta-y_c;
    psi=psi+eps;
    num=exp(abeta)+exp(abeta+ybeta);
    denum=exp(-(psi))+exp(ybeta)+exp(abeta)+exp(ybeta+abeta);
    a_tilda=&a-num/denum;
    y_offset=ybeta+&a*psi;
  run;

  %let err=1;
  %let finish=10;
  %do %while(%sysevalf(&err ge &epsilon));
    proc logistic descending outest=psi_pr(keep=a_tilda) data=data_j;
      model &y=a_tilda / offset=y_offset noint;
    run;

    data psi_epsilon(keep=epsilon); set psi_pr;
      if _n_=1 then do;
        epsilon=a_tilda;
        output;
      end;
    run;

    data data_j; merge data_j(drop=eps) psi_epsilon;
      retain eps 0;
      if _n_=1 then eps=epsilon;
      psi=psi+eps;
      num=exp(abeta)+exp(abeta+ybeta);
      denum=exp(-(psi))+exp(ybeta)+exp(abeta)+exp(ybeta+abeta);
      a_tilda=&a-num/denum;
      y_offset=ybeta+&a*psi;
      eps2=abs(eps);
      call symput('err',eps2);
    run;
  %end;

  data results; set data_j;
    if _n_=1 then output;
  run;

  %if &j=0 %then %do;
    data results0; set results;
      prospective=a_c;
      retrospective=y_c;
      ProRetroSpective=psi;
    run;
  %end;
  %if &j=1 %then %do;
    data trial; set results; run;
  %end;
  %if %eval(&j)>1 %then %do;
    data trial; set results trial;
      sample=%eval(&j);
    run;
  %end;
%end;

data all_results; merge %if &j>1 %then %do; trial %end; results0;
  retain mean1 mean2 mean3;
  if _n_=1 then do;
    mean1=prospective;
    mean2=retrospective;
    mean3=proretrospective;
  end;
  prospective_variance=(a_c-mean1)**2;
  retrospective_variance=(y_c-mean2)**2;
  proretrospective_variance=(psi-mean3)**2;
run;

ods listing;

proc print data=results0;
  var prospective retrospective ProRetroSpective eps2;
run;

%if %eval(&bsamp)>0 %then %do;
  proc means data=all_results mean;
    var prospective_variance retrospective_variance ProRetroSpective_variance;
  run;
%end;
%mend;

/*****************************************
Macro call for the Data Example of Section 4:

%ProRetroSpec(data=cero, y=grade2, a=race,
  oddsY=btype1 btype2 estrogen age smoking,
  oddsA=btype1 btype2 estrogen age smoking,
  epsilon=0.0001, bsamp=500);
*****************************************/
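To make the iteration concrete, the following Python sketch (ours, not part of the original SAS macro; all function names are hypothetical) solves the population version of the doubly robust estimating equation that the macro's loop targets. The centering term `d` reproduces the macro's `num/denum` step exactly, while ψ is found here by bisection rather than by repeated offset-logistic fits.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def joint(odds_y, odds_a, psi):
    """Cell probabilities p[(y, a)] of (Y, A) given L when the baseline
    log-odds are oddsY(L)=odds_y, oddsA(L)=odds_a and the log odds ratio is
    psi, i.e. p[(y, a)] proportional to exp(y*odds_y + a*odds_a + y*a*psi)."""
    w = {(y, a): math.exp(y * odds_y + a * odds_a + y * a * psi)
         for y in (0, 1) for a in (0, 1)}
    s = sum(w.values())
    return {cell: v / s for cell, v in w.items()}

def d(odds_y, odds_a, psi):
    """Centering term E*(A|L): identical to the macro's num/denum step."""
    num = math.exp(odds_a) + math.exp(odds_a + odds_y)
    den = (math.exp(-psi) + math.exp(odds_y)
           + math.exp(odds_a) + math.exp(odds_y + odds_a))
    return num / den

def score(psi, strata, psi_true):
    """Population doubly robust estimating function
    G(psi) = E[{A - d(L; psi)} {Y - expit(oddsY(L) + psi*A)}],
    with the outer expectation taken under the true law (log OR = psi_true).
    Each stratum of L is a triple (odds_y, odds_a, weight)."""
    g = 0.0
    for odds_y, odds_a, w in strata:
        cells = joint(odds_y, odds_a, psi_true)
        for (y, a), p in cells.items():
            g += w * p * (a - d(odds_y, odds_a, psi)) * (y - expit(odds_y + psi * a))
    return g

def solve_psi(strata, psi_true, lo=-5.0, hi=5.0, tol=1e-8):
    """Bisection solve of G(psi) = 0; G is decreasing in psi here."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(lo, strata, psi_true) * score(mid, strata, psi_true) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

With the nuisance linear predictors held at their true values, the solution of G(ψ) = 0 recovers the true log odds ratio; in the macro these linear predictors are instead estimated by the two preliminary logistic regressions, and ψ is updated through the one-covariate offset-logistic fit.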

*Explicit expression for* Pr(*Y* = *y, A* = *a* | *L*) *as a function of* *oddsA*(*L*), *oddsY*(*L*) *and* *or*(*L*). *The desired expression follows upon replacing* Pr(*A* = *a* | *Y* = 0, *L*) *with* [expit{*oddsA*(*L*)}]^{a} × [1 − expit{*oddsA*(*L*)}]^{1−a} *and* Pr(*Y* = *y* | *A* = 0, *L*) *with* [expit{*oddsY*(*L*)}]^{y} × [1 − expit{*oddsY*(*L*)}]^{1−y} *in*

$$Pr(Y=y,A=a\mid L)=\frac{Pr(A=a\mid Y=0,L)\,Pr(Y=y\mid A=0,L)\,{e}^{ay\,\mathit{or}(L)}}{{\sum}_{a^{\prime}=0}^{1}{\sum}_{y^{\prime}=0}^{1}Pr(A=a^{\prime}\mid Y=0,L)\,Pr(Y=y^{\prime}\mid A=0,L)\,{e}^{a^{\prime}y^{\prime}\,\mathit{or}(L)}}$$

*where* expit(*x*) = {1 + exp(−*x*)}^{−1}.
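The expression above can be coded directly; `joint_prob` below is an illustrative sketch of ours (not part of the paper's software) that returns Pr(*Y* = *y*, *A* = *a* | *L*) from the two baseline linear predictors and the log odds ratio, and whose implied odds ratio is exactly e^{or(L)}.

```python
import math

def expit(x):
    """expit(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def joint_prob(y, a, odds_y, odds_a, log_or):
    """Pr(Y=y, A=a | L): the numerator Pr(A=a|Y=0,L) Pr(Y=y|A=0,L) e^{ay or(L)}
    normalized by the sum of the same quantity over all four (y', a') cells."""
    def pr_a_base(a_):  # Pr(A=a|Y=0,L) = expit(oddsA)^a {1-expit(oddsA)}^(1-a)
        p = expit(odds_a)
        return p if a_ == 1 else 1.0 - p
    def pr_y_base(y_):  # Pr(Y=y|A=0,L) = expit(oddsY)^y {1-expit(oddsY)}^(1-y)
        p = expit(odds_y)
        return p if y_ == 1 else 1.0 - p
    num = pr_a_base(a) * pr_y_base(y) * math.exp(a * y * log_or)
    den = sum(pr_a_base(ap) * pr_y_base(yp) * math.exp(ap * yp * log_or)
              for ap in (0, 1) for yp in (0, 1))
    return num / den
```

The four cell probabilities sum to one, the conditional law of *A* given *Y* = 0 reduces to expit{*oddsA*(*L*)}, and the log of the cross-product ratio of the cells returns `log_or`, which is what makes the parametrization variation independent.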

**Proof that**
${U}_{i}(\psi ;\eta ,\alpha )={U}_{i}^{\mathit{\text{rev}}}(\psi ;\eta ,\alpha )$

Some algebra gives

$$d(L;\psi,\eta,\alpha)=\frac{E\{A\,\mathit{var}(Y\mid A,L;\psi,\eta)\mid L;\psi,\eta,\alpha\}}{E\{\mathit{var}(Y\mid A,L;\psi,\eta)\mid L;\psi,\eta,\alpha\}}$$

*with expectations and variances evaluated under the joint density of* (*Y, A*) *given L, say* *f*(*Y, A* | *L*; *ψ, η, α*), *induced by the models* *oddsY*(*L*; *η*), *oddsA*(*L*; *α*) *and* *or*(*L*; *ψ*). *Note that*

$$d({L}_{i};\psi,\eta,\alpha)={E}^{*}(A\mid{L}_{i};\psi,\eta,\alpha)$$

is a conditional expectation with respect to another conditional law of A given L, namely the law

$${f}^{*}(A\mid L;\psi,\eta,\alpha)=f(A\mid L;\psi,\eta,\alpha)\,\frac{\mathit{var}(Y\mid A,L;\psi,\eta)}{{c}_{a}(L;\psi,\eta,\alpha)}$$

*where c _{a}*(*L*; *ψ, η, α*) = *E*{*var*(*Y* | *A, L*; *ψ, η*) | *L*; *ψ, η, α*} *is a normalizing constant.*

Next, observe that for any binary random variable Z and arbitrary vector V, it holds that

$$Z-E(Z\mid V)={(-1)}^{1-Z}\,\frac{\mathit{var}(Z\mid V)}{f(Z\mid V)}$$

(11)

Furthermore, observe that

$$\begin{aligned}{U}_{i}(\psi;\eta,\alpha)&=\{{A}_{i}-d({L}_{i};\psi,\eta,\alpha)\}\,\frac{\partial\mathbf{\text{or}}({L}_{i};\psi)}{\partial\psi}\,[{Y}_{i}-\text{expit}\{\mathbf{\text{oddsY}}({L}_{i};\eta)+\mathbf{\text{or}}({L}_{i};\psi){A}_{i}\}]\\&=\{{A}_{i}-{E}^{*}(A\mid{L}_{i};\psi,\eta,\alpha)\}\,\frac{\partial\mathbf{\text{or}}({L}_{i};\psi)}{\partial\psi}\,\{{Y}_{i}-E(Y\mid{A}_{i},{L}_{i};\psi,\eta)\}.\end{aligned}$$

*Then, applying identity* (11) *to the residuals* {*A _{i}* − *E**(*A* | *L _{i}*; *ψ, η, α*)} *and* {*Y _{i}* − *E*(*Y* | *A _{i}*, *L _{i}*; *ψ, η*)} *gives*

$${U}_{i}(\psi;\eta,\alpha)=\frac{\partial\mathbf{\text{or}}({L}_{i};\psi)}{\partial\psi}\,{(-1)}^{{A}_{i}+{Y}_{i}}\,\frac{{\mathit{var}}^{*}(A\mid{L}_{i};\psi,\eta,\alpha)\,{c}_{a}({L}_{i};\psi,\eta,\alpha)}{f({Y}_{i},{A}_{i}\mid{L}_{i};\psi,\eta,\alpha)}$$
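As a quick numerical sanity check of identity (11) (a sketch of ours, not from the paper): for binary *Z* with success probability *p*, both sides of the identity coincide at *Z* = 0 and *Z* = 1.

```python
def residual_identity(p, z):
    """Both sides of identity (11) for binary Z with Pr(Z=1)=p:
    left:  Z - E(Z)
    right: (-1)^(1-Z) var(Z) / f(Z), where f is the probability mass of Z."""
    var_z = p * (1.0 - p)
    f_z = p if z == 1 else 1.0 - p
    left = z - p
    right = (-1.0) ** (1 - z) * var_z / f_z
    return left, right
```

At *Z* = 1 the right-hand side is p(1 − p)/p = 1 − p, and at *Z* = 0 it is −p(1 − p)/(1 − p) = −p, matching the residual in both cases.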

Likewise, arguing by symmetry, reversing the roles of *Y*_{i} and *A*_{i} gives

$${U}_{i}^{\mathit{\text{rev}}}(\psi;\eta,\alpha)=\frac{\partial\mathbf{\text{or}}({L}_{i};\psi)}{\partial\psi}\,{(-1)}^{{A}_{i}+{Y}_{i}}\,\frac{{\mathit{var}}^{*}(Y\mid{L}_{i};\psi,\eta,\alpha)\,{c}_{y}({L}_{i};\psi,\eta,\alpha)}{f({Y}_{i},{A}_{i}\mid{L}_{i};\psi,\eta,\alpha)}$$

*where var**(*Y* | *L _{i}*; *ψ, η, α*) *is the variance of Y under the law*

$${f}^{*}(Y\mid L;\psi,\eta,\alpha)=f(Y\mid L;\psi,\eta,\alpha)\,\frac{\mathit{var}(A\mid Y,L;\psi,\eta,\alpha)}{{c}_{y}(L;\psi,\eta,\alpha)}$$

*with c _{y}*(*L*; *ψ, η, α*) = *E*{*var*(*A* | *Y, L*; *ψ, η, α*) | *L*; *ψ, η, α*} *a normalizing constant.*

*Then, our proof that*
${U}_{i}(\psi ;\eta ,\alpha )={U}_{i}^{\mathit{\text{rev}}}(\psi ;\eta ,\alpha )$ *is concluded if we show that*

$${\mathit{var}}^{*}(A\mid{L}_{i};\psi,\eta,\alpha)\,{c}_{a}({L}_{i};\psi,\eta,\alpha)={\mathit{var}}^{*}(Y\mid{L}_{i};\psi,\eta,\alpha)\,{c}_{y}({L}_{i};\psi,\eta,\alpha)$$

or equivalently, that

$$\frac{{\mathit{var}}^{*}(A\mid{L}_{i};\psi,\eta,\alpha)}{{\mathit{var}}^{*}(Y\mid{L}_{i};\psi,\eta,\alpha)}=\frac{E\{\mathit{var}(A\mid Y,{L}_{i})\mid{L}_{i}\}}{E\{\mathit{var}(Y\mid A,{L}_{i})\mid{L}_{i}\}}$$

*Because the identity must hold for each L _{i}, and because the law f*(*Y, A* | *L _{i}*; *ψ, η, α*) *is an arbitrary bivariate binary law, we may suppress the conditioning on L _{i} and show that, for any joint law of two binary variables* (*Y, A*),

$$\frac{{\mathit{var}}^{*}(A)}{{\mathit{var}}^{*}(Y)}=\frac{E\{\mathit{var}(A\mid Y)\}}{E\{\mathit{var}(Y\mid A)\}}$$

(12)

*where var**(*A*) *stands for the variance of A under the law f**(*A*) = *f*(*A*) *var*(*Y* | *A*)/*E*[*var*(*Y* | *A*)] *and var**(*Y*) *stands for the variance of Y under the law f**(*Y*) = *f*(*Y*) *var*(*A* | *Y*)/*E*[*var*(*A* | *Y*)].

*To prove identity* (12) *we first notice that when A and Y are binary it holds that*

$$\frac{E\{\mathit{var}(A\mid Y)\}}{E\{\mathit{var}(Y\mid A)\}}=\frac{\mathit{var}(A)}{\mathit{var}(Y)}$$

(*This can easily be checked by noticing that E*(*A* | *Y*) *is linear in Y, so E*{*var*(*A* | *Y*)} = *var*(*A*) − *cov*(*Y, A*)^{2}/*var*(*Y*), *and likewise E*{*var*(*Y* | *A*)} = *var*(*Y*) − *cov*(*Y, A*)^{2}/*var*(*A*).) *Next, by definition of f**(*A*),

$${\mathit{var}}^{*}(A)=\frac{\mathit{var}(A)\,\mathit{var}(Y\mid A=1)\,\mathit{var}(Y\mid A=0)}{E{[\mathit{var}(Y\mid A)]}^{2}}$$

*So*,

$$\frac{{\mathit{var}}^{*}(A)}{{\mathit{var}}^{*}(Y)}=\frac{\mathit{var}(A)\,\mathit{var}(Y\mid A=1)\,\mathit{var}(Y\mid A=0)\,E{[\mathit{var}(A\mid Y)]}^{2}}{\mathit{var}(Y)\,\mathit{var}(A\mid Y=1)\,\mathit{var}(A\mid Y=0)\,E{[\mathit{var}(Y\mid A)]}^{2}}=\frac{\mathit{var}{(A)}^{3}\,\mathit{var}(Y\mid A=1)\,\mathit{var}(Y\mid A=0)}{\mathit{var}{(Y)}^{3}\,\mathit{var}(A\mid Y=1)\,\mathit{var}(A\mid Y=0)}$$

and the proof is concluded by noticing that

$$\frac{\mathit{var}(Y\mid A=1)\,\mathit{var}(Y\mid A=0)}{\mathit{var}(A\mid Y=1)\,\mathit{var}(A\mid Y=0)}=\frac{\mathit{var}{(Y)}^{2}}{\mathit{var}{(A)}^{2}}$$
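Identity (12) can also be checked numerically for an arbitrary bivariate binary law; the helper below (ours, for illustration only) computes the tilted variances var*(*A*), var*(*Y*) and the expected conditional variances from the four cell probabilities.

```python
def tilted_moments(p):
    """From cell probabilities p[(y, a)] of a bivariate binary law, return
    (var*(A), var*(Y), E[var(A|Y)], E[var(Y|A)]) as defined in the proof."""
    def var_y_given_a(a):  # var(Y | A=a)
        pa = p[(0, a)] + p[(1, a)]
        q = p[(1, a)] / pa
        return q * (1.0 - q)
    def var_a_given_y(y):  # var(A | Y=y)
        py = p[(y, 0)] + p[(y, 1)]
        q = p[(y, 1)] / py
        return q * (1.0 - q)
    e_var_y_a = sum((p[(0, a)] + p[(1, a)]) * var_y_given_a(a) for a in (0, 1))
    e_var_a_y = sum((p[(y, 0)] + p[(y, 1)]) * var_a_given_y(y) for y in (0, 1))
    # tilted law f*(A) = f(A) var(Y|A) / E[var(Y|A)], a valid binary law,
    # so var*(A) is the Bernoulli variance of its success probability
    fa1 = (p[(0, 1)] + p[(1, 1)]) * var_y_given_a(1) / e_var_y_a
    var_star_a = fa1 * (1.0 - fa1)
    fy1 = (p[(1, 0)] + p[(1, 1)]) * var_a_given_y(1) / e_var_a_y
    var_star_y = fy1 * (1.0 - fy1)
    return var_star_a, var_star_y, e_var_a_y, e_var_y_a
```

For any valid cell probabilities, the ratio var*(*A*)/var*(*Y*) reproduces *E*{var(*A* | *Y*)}/*E*{var(*Y* | *A*)}, which in turn equals var(*A*)/var(*Y*), exactly as the proof asserts.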

1. Rothman K, Greenland S. Modern Epidemiology. Third Edition. Lippincott Williams and Wilkins; Philadelphia: 2008.

2. Breslow NE, Day NE. Statistical Methods in Cancer Research, Vol. I: The Analysis of Case-Control Data. Lyon: IARC; 1980.

3. Robins J, Hernan M. Estimation of the causal effects of time-varying exposures. In: Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G, editors. Longitudinal Data Analysis. Chapman and Hall/CRC; 2008.

4. Tchetgen Tchetgen E, Robins J. The semi-parametric case-only estimator. Biometrics. 2010. In press.

5. Tchetgen Tchetgen E, Robins J, Rotnitzky A. On doubly robust estimation in a semi-parametric odds ratio model. Biometrika. 2010;97(1):171–180.

6. SAS Institute, Inc. SAS/STAT User's Guide, Version 8. Cary, NC: 1999.

7. Chen YH. A semi-parametric odds ratio model for measuring association. Biometrics. 2007;63(2):413–421.

8. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, "Inference for semiparametric models: Some questions and an answer". Statistica Sinica. 2001;11(4):920–936.

9. Hill HA, Coates RJ, Austin H, et al. Racial differences in tumor grade among women with endometrial cancer. Gynecol Oncol. 1995;56(2):154–163.

10. Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and Adaptive Estimation for Semi-parametric Models. Springer; New York: 1993.
