Home | About | Journals | Submit | Contact Us | Français |

**|**Int J Biostat**|**PMC3083136

Formats

Article sections

Authors

Related links

Int J Biostat. 2011 January 1; 7(1): Article 17.

Published online 2011 March 11. doi: 10.2202/1557-4679.1217

PMCID: PMC3083136

Sherri Rose, University of California, Berkeley;

Copyright © 2011 The Berkeley Electronic Press. All rights reserved

This article has been cited by other articles in PMC.

We consider two-stage sampling designs, including so-called nested case control studies, where one takes a random sample from a target population and completes measurements on each subject in the first stage. The second stage involves drawing a subsample from the original sample, collecting additional data on the subsample. This data structure can be viewed as a missing data structure on the full-data structure collected in the second-stage of the study. Methods for analyzing two-stage designs include parametric maximum likelihood estimation and estimating equation methodology. We propose an inverse probability of censoring weighted targeted maximum likelihood estimator (IPCW-TMLE) in two-stage sampling designs and present simulation studies featuring this estimator.

We consider two-stage sampling designs where one takes a random sample from a target population and measures *V* on each subject in the first stage. The second stage involves drawing a subsample from the original sample, collecting additional data on the subsample. The decision regarding selection into the subsample can be influenced by *V*. This data structure can be viewed as a missing data structure on the full-data structure *X* collected in the second-stage of the study.

Specifically, the observed data structure on a randomly sampled subject can be represented as *O* = (*V*, Δ, Δ*X*), where *V* is included in *X*, and Δ denotes the indicator of inclusion in the second-stage sample. The sample is then represented as *n* i.i.d. copies *O*_{1}, . . ., *O*_{n} of *O*. One particular type of two-stage sample is a so-called “nested case-control” sample where the outcome *Y* is included in *V* and subjects are sampled conditional on *Y*. We propose an inverse probability of censoring weighted targeted maximum likelihood estimator (IPCW-TMLE) for the estimation of target estimands, such as causal effects, in two-stage sampling designs.

A TMLE is a general procedure for estimation of a target parameter of the data-generating distribution in semiparametric models, and, in particular, can be used for any censored data structure. It is a two-step method where one first obtains an estimate of the data-generating distribution, and then in the second step updates the initial fit in a bias-reduction step targeted toward the parameter of interest, instead of the overall density. The TMLE unifies the locally efficient double robust properties of estimating function based methodology with the properties of maximum likelihood estimation. TMLEs are loss-based well-defined, efficient, unbiased substitution estimators of the target parameter of the data-generating distribution. In this paper, we present general IPCW-TMLEs, and then apply it to nested case-control samples in simulations.

Two-stage designs, including nested case-control studies, have been discussed and developed in previous literature, including Neyman (1938), Cochran (1963), Mantel (1973), Kupper et al. (1975), Liddell et al. (1977), Thomas (1977), and Breslow et al. (1983). Advantages can include reduction in costs associated with collecting data on the entire cohort and minimal losses in efficiency (Ernster, 1994; Rothman and Greenland, 1998; Essebag et al., 2003; Hak et al., 2004; Vittinghoff and Bauer, 2006). Much of the literature focuses on logistic regression for effect estimation (Breslow and Cain, 1988; Flanders and Greenland, 1991; Ernster, 1994; Barlow et al., 1999; Szklo and Nieto, 1999). Robins et al. (1994) presented the missingness framework for two stage designs, and (double robust augmented) inverse probability of treatment-weighted estimators. We also refer to van der Laan and Robins (2003) which provides an in-depth study and overview of double robust estimation for missing data and causal inference data structures.

A recent paper by Wang et al. (2009) presents causal effect estimators using estimating equation methodology where the outcome *Y*, exposure *A*, and a subset *S* of covariates *W* are measured in the first stage (*V* includes *Y*, *A*, and *S*). They consider the same two-stage design, where one measures *V* = (*S*, *Y*, *A*) on everyone in the sample, and *X* = (*S*, *Y*, *A*, *W*) on the subjects in the validation sample defined by Δ = 1, where the missingness mechanism is known. The Wang et al. article focuses on estimation of *EY* (*a*) under the consistency assumption *Y* = *Y* (*A*), the randomization assumption, *A* is independent of *Y* (*a*), given (*W*, *S*), so that *EY* (*a*) = *E*_{S,}_{W}*E*_{X,0}(*Y| A* = *a*, *S*, *W*), and a parametric model for the treatment mechanism Π(*S*, *W*) = *P*(*A* = 1 *|* *S*, *W*). Please see the Appendix for a discussion of the relationship between the estimators presented in Wang et al. (2009) and IPCW-TMLE.

TMLE was first presented in van der Laan and Rubin (2006) and covered in detail in a forthcoming text (van der Laan and Rose, 2011). Methodology for other types of case-control studies, including independent case-control designs and individually matched case-control studies, were first presented in van der Laan (2008b) and Rose and van der Laan (2008, 2009). We make the following remark regarding the problem of estimation of an additive causal effect of a treatment *A* on outcome *Y*, controlling for confounders *W*. When *Y* is continuous, the TMLE based on a quasi-log-likelihood loss function with a logistic regression submodel is recommended (Gruber and van der Laan, 2010). This choice of submodel is more robust than a TMLE based on the squared error loss function with a linear regression, due to the linear regression fluctuations not respecting global constraints. The double robust parametric regression estimators presented by Scharfstein et al. are special cases of TMLEs, as discussed in Rosenblum and van der Laan (2010) (Scharfstein et al., 1999, p. 1141). The class of estimators given in Rosenblum and van der Laan (2010) are not identical to, but are asymptotically equivalent to a class of estimators (Tsiatis, 2006, Section 5.4, p. 132).

Recall that we consider two-stage sampling designs where one takes a random sample from a target population, measures *V* on each subject in this first stage, and draws a subsample where one collects additional data. Inclusion in the subsample can be influenced by *V*. This data structure is a missing-data structure on the full-data structure *X* collected in the second-stage. The observed data structure is *O* = (*V*, Δ, Δ*X*), where *V* is included in *X*, and Δ denotes the indicator of inclusion in the second-stage sample. The sample can then be represented as *n* i.i.d. copies *O*_{1}, . . ., *O*_{n} of *O*.

Let *P*_{X,0} be the true probability distribution of *X*, and let * ^{F}* be a statistical model for

Let *g*_{Δ,0}(*δ* *|* *X*) = *P*_{X,0}(Δ = *δ* *|* *X*) be the conditional probability distribution of Δ, given *X*. We assume the missing at random (MAR) assumption which states that *g*_{Δ,0}(*δ* *|* *X*) = *g*_{Δ,0}(*δ* *|* *V*), i.e., Δ is independent of *X*, given *V*. For notational convenience, let Π_{0}(*V*) *g*_{Δ,0}(1 *|* *V*). This missingness mechanism might be known, a model might be available, or no further assumptions are made beyond MAR. Either way, the missingness mechanism can be estimated from the data (Δ_{i}, *V*_{i}), *i* = 1, . . ., *n*, extracted from the observations *O*_{i}, *i* = 1, . . ., *n*.

The statistical model for the probability distribution *P*_{0} of *O* is now defined in terms of the full-data statistical model and the model on the missingness mechanism. The efficient influence curve of Ψ* ^{F}* (

The TMLE is a general procedure for estimation of a target parameter of the data-generating distribution in semiparametric models (van der Laan and Rubin, 2006). It marries the locally efficient double robust properties of estimating function based methodology and the properties of maximum likelihood estimation. TMLEs are loss-based well-defined, efficient, unbiased substitution estimators of the target parameter of the data-generating distribution.

Suppose that, given *n* i.i.d. observations *X*_{1}, . . ., *X _{n}*,
is a TMLE of

Let
be an initial estimator of *P*_{X,0}, possibly a *L** ^{F}*-loss based super learner (van der Laan et al., 2007). In addition, let {

Such a TMLE
is then defined as follows. For *k* = 1, . . ., *K*, one computes the amount of fluctuation:

for
, and one sets
. Here
is defined as the empirical distribution of the full-data *X*_{1}, . . ., *X*_{n}, and, for a function *f* of *X* and probability distribution *P*, we used the notation *Pf* *∫* *f*(*x*)*dP*(*x*) This updating process is iterated until convergence is achieved, i.e., *K* is chosen so that
. The final update
is denoted with
, and is called the TMLE of *P*_{X,0}. By the score condition on the working fluctuation model, it follows that

Given the TMLE developed for the full-data structure, we propose estimating *ψ*_{0} based on *O*_{1}, . . ., *O*_{n} with an IPCW-TMLE. This IPCW-TMLE is simply defined by the above procedure with the addition of weights Δ_{i}*/*Π_{n}(*V*_{i}) for observations *i* = 1, . . ., *n*, where Π_{n}(*V*) is an estimator of Π_{0}(*V*) *g*_{Δ,0}(1 *|* *V*). Thus, this IPCW-TMLE involves the following steps:

Computing an initial IPCW-loss based estimator (e.g., using super learning, van der Laan et al., 2007) based on, for example, the IPCW-loss function

Typically, this initial estimator is obtained by providing the initial estimator of *P*_{X,0} in the full-data TMLE the IPCW weights.

For *k* = 1, . . ., *K*, one computes the amount of fluctuation:

for
, and one sets
. This updating process is iterated until convergence is achieved, i.e., *K* is chosen so that
. The final update is denoted with
, and is called the IPCW-TMLE of *P*_{X,0}.

Finally, one evaluates the target parameter . This is the TMLE of .

As is apparent from the above definition of IPCW-TMLE, IPCW-TMLE is a targeted minimum loss based estimator (also TMLE), the generalization of TMLE (van der Laan, 2008a; van der Laan et al., 2009; van der Laan and Rose, 2011), but with a loss function defined as IPCW full-data loss function, and a parametric submodel *P _{X}*() with score (Δ/Π

Since it solves the IPCW full-data efficient influence curve equation, the IPCW-TMLE has an influence curve equal to
if Π_{0} is known, and
denotes the limit of
(see next section). Double robustness properties of the full-data efficient influence curve are immediately inherited by the IPCW-TMLE. If Π_{0}(*V*) is consistently estimated with a maximum likelihood estimator, the influence curve of the IPCW-TMLE equals
minus its projection on the tangent space of the model used for Π_{0}. As shown below, if we use a nonparametric maximum likelihood estimator for Π_{0} and the full-data model is nonparametric, then the IPCW-TMLE solves the actual efficient influence curve equation, so that the IPCW-TMLE is efficient if
. As with any asymptotically linear estimator, an estimate of the asymptotic variance
is given by the empirical variance of the estimated influence curve.

By the score condition on the working fluctuation model and , it follows that this IPCW-TMLE solves the ICPW full-data efficient influence curve equation:

If the full-data TMLE is double robust or has other robustness properties, then these properties will be inherited by this IPCW-TMLE under the assumption that Π_{n} is a consistent estimator of Π_{0}. If *V* is discrete (with finite support), then we propose using a nonparametric estimator Π_{n} of Π_{0}.

In this case, we have the following important result. If the full-data model is nonparametric, *V* is discrete, and the missingness mechanism is estimated nonparametrically, then it follows that the IPCW-TMLE actually solves the true efficient influence curve equation. The latter implies that, under appropriate regularity conditions, and if
is consistent for *P*_{X,0}, the IPCW-TMLE will be an asymptotically efficient estimator of *ψ*_{0}.

Consider the statistical model for the observed missing data structure *O* implied by a nonparametric full-data model * ^{F}*, the MAR assumption, possibly a model for the missingness mechanism Π

where *D ^{F}* (

The IPCW-TMLE
solves
for any choice of estimator Π_{n} of Π_{0}. If Π_{n} is a nonparametric estimator of Π_{0}, then it follows that we also have

for any choice of estimator of the regression
. As a consequence, it follows that for nonparametric estimators Π_{n} of Π_{0}, and IPCW-TMLE
, the IPCW-TMLE solves the efficient influence curve equation:

We also note that, if we fit Π_{0} with a logistic regression, use it as an offset, and add a covariate
to update this logistic regression fit of Π_{0}, iterate this updating process of the missingness mechanism until convergence, then the resulting fit
will also solve:

This follows from the well known fact that the score of a univariate linear logistic regression working model logit Π(*δ*) = logit Π + *δC* for the coefficient *δ* in front of the univariate covariate *C*(*V*), equals *C*(*V*)(Δ – Π(*δ*)(*V*)). For such clever fits of the missingness mechanism we also have that (
,
) solves the efficient influence curve estimating equation:

so that double robustness and asymptotic efficiency can still be derived.

The latter type of IPCW-TMLE is slightly more complex than the regular IPCW-TMLE since it now also requires fitting the regression
. However, this represents a minor increase in complexity since it only involves running a mean regression of the outcome
on *V*_{i} among the observations with Δ_{i} = 1.

In this section we demonstrate the IPCW-TMLE for the simple full-data structure *X* = (*W*, *A*, *Y*), with covariate vector *W*, binary exposure (or treatment) *A*, and binary outcome *Y*. The observed data structure for a randomly sampled subject is *O* = (*V*, Δ, Δ*X*), where *V* = *Y*. The target parameter of the full-data distribution of *X* is given by Ψ* ^{F}* (

where *Q*_{0} = (_{0}, *Q*_{W,0}), *Q*_{W,0} is the true full-data marginal distribution of *W*, _{0}(*A*, *W*) = *E*_{X,0}(*Y* *|* *A*, *W*), and *g*_{0}(*a* *|* *W*) = *P*_{X,0}(*A* = *a* *|* *W*). The first term will be denoted by
and the second term by
, since these two terms represent components of the full-data efficient influence curve that are elements of the tangent space of the conditional distribution of *Y*, given *A*, *W*, and the marginal distribution of *W*, respectively. That is,
is the component of the efficient influence curve that equals a score of a parametric fluctuation model of a conditional distribution of *Y*, given (*A*, *W*), and
is a score of a parametric fluctuation model of the marginal distribution of *W*. Note that
equals a function *H** (*A*, *W*) times the residual (*Y* − (*A*, *W*)), where

We can estimate the marginal distribution of *Q*_{W,0} with IPCW-MLE

where *L ^{F}* (

Let *L** ^{F}* (

Similarly, we can estimate *g*_{0} with loss-based super learning based on the IPCW-log-likelihood loss function

This now provides an initial estimator
and
. This estimator was obtained using the same algorithm for computing the initial estimator for the full-data TMLE, but now assigning weights Δ_{i}/Π_{n}(*Y*_{i}) to each observation. In essence, a full-data loss function *L** ^{F}* (

Let

be a parametric submodel through , and let

be a parametric submodel through the conditional distribution of *Y*, given *A*, *W*, implied by
. This describes a submodel
through
with a two-dimensional fluctuation parameter = (_{1}, _{2}). We have that
at = 0 yields the two scores
and
, and therefore spans the full-data efficient influence curve
, a requirement for the parametric submodel for the full-data TMLE. This parametric submodel and the loss function *L** ^{F}* (

Define

and let
. Note _{1,}_{n} = 0 which shows that the IPCW empirical distribution of *W* is not updated. Note also that _{2,}_{n} is obtained by performing an IPCW logistic regression of *Y* on
where
is used as an offset, and extracting the coefficient for
. We then update
with logit
. The updating process converges in one step in this example, so that the IPCW-TMLE is given by
.

Lastly, one evaluates the target parameter , where , by plugging and into our substitution estimator

This is the IPCW-TMLE of .

Suppose our full-data structure is a right-censored data structure and we conduct a nested case-control study. For example, we have that *X* might be defined as *X* = (*W*, *A*, , Ξ, *Y**), where *W* are covariates, *A* is an exposure of interest, = min(*T*, *C*), *T* is the time to the event, *C* denotes a censoring variable, Ξ = *I*( = *T*) is a failure indicator, and *Y** = ( ≤ *t*, Ξ = 1) is an indicator of having an observed failure by endpoint *t*. Our missing data structure is given by *O* = (Δ, Δ*X*, , Ξ, *Y**), where Δ = 1 denotes membership in the nested case-control sample.

A special feature of this right censored data structure is that one will define a case based on a binary random variable *Y** that is not the outcome of interest. For example, *Y** could represent observed death by year 5, which would be denoted *Y** = ( ≤ 5 years, Ξ = 1). It is important to stress that the definition of a case (*Y** = 1) in a nested case-control study within a right censored data structure is therefore different than without right censoring. Let’s say our parameter of interest Ψ* ^{F}* (

We define the TMLE for the full-data structure and we then use the IPCW-TMLE for actual missing data structure. In other words, we need a TMLE of
based on *X*, and then IPCW-TMLE is defined as well. The TMLE of the additive causal effect of treatment on survival, and other parameters, based on the right-censored data structure is presented elsewhere (Moore and van der Laan, 2009a,b; Stitelman and van der Laan, 2010; van der Laan and Rose, 2011).

Nested case-control studies within clinical trials and observational studies are increasingly popular when researchers are interested in effect modification (Rothman and Greenland, 1998; Essebag et al., 2003, 2005; Prentice and Qi, 2006; Vittinghoff and Bauer, 2006; Polley and van der Laan, 2009). This is of particular importance when the candidate patient characteristic effect modifier of the treatment effect is difficult or expensive to measure (Vittinghoff and Bauer, 2006).

The general approach involves defining our full-data structure, for example, *X* = (*W*, *A**, *A*, *Y*), and our observed data *O* = (*V*, Δ, Δ*X*), where again *V* is in *X*. We are interested in studying the effect modification of a variable denoted *A**. Our full-data parameter of interest might be

where _{0}(*a**, *a*) = *E*_{0}(*Y* *|* *A**= *a**, *A* = *a*, *W*). The full-data TMLE involves first running an initial regression of *Y* on *A**, *A*, and *W*. We note that *A* and *A** are implicitly assumed to have finite support. The targeting step requires a parametric working submodel to fluctuate the initial estimator and a choice of loss function. We use a clever covariate that will define this parametric working submodel. The clever covariate for
is given by

where *g*_{0}(*a**, *a* | *W*) = *P*_{X,0}(*A* = *a* *|* *W*)*P*_{X,0}(*A** = *a** | *A* = *a*, *W*), and *P*_{X,0}(*A* = *a* | *W*) may be known, as in a clinical trial, but *P*_{X,0}(*A** = *a** | *A* = *a*, *W*) must be fitted. The clever covariate for the difference parameter
is the corresponding difference of clever covariates. As loss function one can use the least squares loss function, in which case the working submodel is a linear regression of *Y* on *H** using the initial estimator as offset. If *Y* is binary, or continuous in (0, 1) (e.g., after a linear transformation), then one can use the more robust quasi-log-likelihood loss function (Gruber and van der Laan, 2010). In the latter case, the working submodel is a logistic linear regression of *Y* on *H**, using the initial estimator as offset. Therefore, one can target the parameter with a single clever covariate, or one can target all four parameters with a four dimensional clever covariate, and look at multiple differences. This now defines the full-data TMLE for the desired target parameter
. The desired IPCW-TMLE for the observed data is obtained by assigning weights Δ_{i}/Π_{n}(*Y*_{i}) to each observation, or equivalently, by replacing the full-data loss function in the full-data TMLE by the IPCW-loss function.

We present several simulation studies to examine the performance of the IPCW-TMLE. First, we generate simulated nested case-control samples within real cohort data. We then study the IPCW-TMLE in simulated cohorts.

The National Institute of Aging funded Study of Physical Performance and Age-Related Changes in Sonomans (SPPARCS) is a population-based, census-sampled, study of the epidemiology of aging and health. Participants of this longitudinal cohort were recruited if they were aged 54 years and over and were residents of Sonoma, CA or surrounding areas. Study recruitment of 2092 persons occurred between May 1993 and December 1994 and follow-up continued for approximately 10 years. One area of particular research interest for this data has been the effect of vigorous leisure-time physical activity (LTPA) on mortality in the elderly, which has been studied in a previous collaboration (Bembom and van der Laan, 2008) using marginal structural models. LTPA was calculated from answers to a detailed questionnaire where performed vigorous physical activities are assigned standardized intensity values in metabolic equivalents (METs). The recommended level of energy expenditure for the elderly is 22.5 METs.

The full-data structure is *X* = (*W*, *A*, *Y*), where *Y* = *I*(*T* ≤ 5 years), *T* is time to the event death, *A* is a binary categorization of LTPA, and *W* are potential confounders. These variables are further defined in Table 1. The observed data structure on a randomly sampled subject can be represented as *O* = (*V*, Δ, Δ*X*), where *V* is in *X*. Of note is the lack of any right censoring in this longitudinal cohort. The outcome (death within or at five years after baseline interview) and date of death was recorded for each subject. This information was available from a variety of sources, including death certificates. Our parameter of interest is the risk difference
, the average treatment effect of LTPA on mortality five years after baseline interview.

The cohort was reduced to a size of *n* = 2066, as 26 subjects were missing LTPA values and/or self-rated health score (1.2% missing data). The prevalence of death was 13%, and the number of cases in the cohort sample was *nC* = 269. The TMLE was estimated on the full cohort sample, and the results are displayed in Table 2. Within TMLE, the machine learning Deletion/Substitution/Addition (DSA) algorithm was used to obtain an estimate of the functions _{0} = *P*_{X,0}(*Y* = 1 *|* *A*, *W*) and *g*_{0} = *P*_{X,0}(*A* *|* *W*) since the functional form of the data was unknown. One could also use an ensemble approach, such as super learning (van der Laan et al., 2007). The estimated parameter of interest was highly significant, and indicates that physical activity at or above recommended levels decreases five-year mortality risk in this population by 5.4%.

We used this cohort study to simulate nested case-control study designs where an estimate of the missingness weights were obtained from the full cohort. Members of the nested case-control sample are denoted with Δ = 1. Our observed data structure was defined as *O* = (*V*, Δ, Δ*X*) and we had *V* = *Y*. Therefore, the missing data structure ignored those individuals with Δ = 0, except for the purpose of estimating Π_{0}(*V*).

Control individuals were randomly sampled from among those still alive five years from baseline interview, and assigned the value Δ = 1. This was a simplified approach compared to an incidence-density design where individuals are sampled from those still at risk of death at the time a case becomes a case. Sampling was performed with various numbers of controls relative to the number of cases (2*nC*, 3*nC*, and 4*nC*). The empirical values for *P*_{X,}_{n}(Δ = 1 | *Y* = 0), were 0.299, 0.446, and 0.608 for the three sample sizes. All cases (Y=1) were sampled with probability 1.

The cohort was resampled 1000 times. In each of the 1000 cohort resamples, one nested case-control study was extracted; those individuals with (Δ = 1), allowing for ties (Bureau et al., 2008). The estimated values Π_{n}(*V*) used in the weight vector were taken from their respective cohort resample. The IPCW-TMLE was estimated in each of the 1000 nested case-control samples, and the TMLE was estimated in the cohort samples. The DSA algorithm was used to obtain estimates of the functions _{0} and *g*_{0}. The relative efficiency of the nested case-control parameters are compared to the cohort parameter in Table 3, as well as average values for the parameter of interest. Relative efficiency of the nested case-control design improved as the number of controls increases. With an average of 4 controls per case (approximately 1076 of the 1797 available noncase subjects), the relative efficiency of the nested case-control design reached 78.9%.

In the SPPARCS data simulations, we did not know the true value of the parameter of interest. It was important to have a completely objective way of defining the truth, and to then assess the performance of our estimator with respect to the truth. Therefore, we repeat the same simulation study, but now from a population we fully understand, as we know the value of the true
. The cohort was sampled from the target population of 1,000,000 individuals. We simulated a five-dimensional covariate *W* = (*W*_{j} *: j* = 1, . . ., 5), a binary exposure *A*, and indicator *Y*, where 1 indicated disease (or in the case of the SPPARCS data, death by 5 years from baseline interview). These variables were generated according to the following rules:

The true value for the risk difference was *RD* = −0.061 and the prevalence

The true value for the risk difference was *RD* = −0.061 and the prevalence of death was 13.3%. One cohort sample was taken with 2,066 individuals, and the estimated value of death prevalence was 14.3%. The number of cases in the cohort sample was *nC* = 296. Controls were randomly sampled from among the noncases in the original cohort at various sample sizes relative to the number of cases (2*nC*, 3*nC*, and 4*nC*), and assigned the value Δ = 1. Noncases that were not sampled were assigned the value Δ = 0. The values for *P*_{X,}_{n}(Δ = 1 | *Y* = 0) were 0.330, 0.506, and 0.674 for the three sample sizes. All cases were assigned Δ = 1.

Logistic regression was used to estimate the functions _{0} and *g*_{0} since the functional form was known. The relative efficiency of the nested case-control parameters are compared to the cohort in Table 4, as well as average values for the parameter of interest. As before, relative efficiency of the nested case-control design improves as the number of controls increases. With an average of 4 controls per case, the nested design reaches a relative efficiency of 78.4%.

For a simulated clinical trial, 10,000 subjects were sampled and assigned a treatment *A*. The outcome of disease was assigned with *P*_{X,0}(*Y* = 1 | *W*, *A*) = expit(3*A* − 4*W*_{1} + *W*_{3} − 12*W*_{4} − 2*W*_{5} + 2*A*sin(*W*_{3})). Of the 10,000 subjects, 647 individuals developed disease (6.47%). The value of the effect modification parameter of interest in the full trial was
. The full-data in the randomized controlled trial cohort was analyzed with a TMLE.

We proposed that the effect modifier of interest, *W*_{3} *A** was only measured in a nested case-control sample. Controls were randomly sampled from among the noncases in the original cohort at various sample sizes relative to the number of cases (2*nC*, 3*nC*, 4*nC*, and 5*nC*), and assigned Δ = 1. Noncases that were not sampled were assigned Δ = 0. The values for *P*_{X,}_{n}(Δ = 1 | *Y* = 0) were 0.141, 0.210, 0.280, and 0.350 for the four sample sizes. All subjects with *Y* = 1 were assigned Δ = 1.

An IPCW-TMLE was used to analyze the nested case-control samples. Multinomial regression was used with main terms to estimate the function _{0}, representing a misspecified model. Due to the double robustness of the TMLE and IPCW-TMLE procedures, the estimates of the parameter of interest are consistent even when _{0} is misspecified. The values for *g*_{0}(*A** | *W*) were known since it was a randomized controlled trial. Results are displayed in Table 5. The relative efficiency of the nested case-control design improves as the number of controls increases, and with 38.8% of the total trial participants we reach an efficiency of 86.4%.

Two-stage sampling designs, including nested case-control sampling, are popular in many fields, including epidemiology. They have the potential to reduce the costs associated with collecting data on the full cohort with minimal losses in efficiency (Ernster, 1994; Rothman and Greenland, 1998; Hak et al., 2004; Vittinghoff and Bauer, 2006). We introduced the IPCW-TMLE for estimation of causal effects in two-stage sampling designs, with a focus on nested case-control sampling designs. In general, TMLE methodology can be used in conjunction with procedures that handle censoring, missingness, measurement error, and other persistent issues found in public health and medicine, in addition to adjusting for the missingness due to the two-stage sampling design.

Our simulated nested case-control studies within the SPPARCS data demonstrated 78.9% efficiency with an average of 4 controls per case. We had 78.4% efficiency in our simulated nested case-control studies within a simulated cohort, again with an average of 4 controls per case. These results coincided with the conclusions of Ury (1975), which noted that as a general rule, 4 controls per case yields a relative efficiency of 80.0%. We also demonstrated the use of IPCW-TMLEs for nested case-control study designs within randomized controlled trials when interested in an effect modification research question. With less than 40% of the trial subjects, we reached an efficiency of 86.4% compared to the full trial.

Maintainers of large comprehensive databases that include adverse events often require researchers to pay for access, and cost almost always increases as the sample size requested increases. Thus, nested case-control studies are also a natural design for studies of safety with pharmaceutical drugs. The IPCW-TMLE is maximally efficient in these scenarios as no covariate information on the noncase-control observations is discarded. With the increase in popularity of nested case-control study designs in longitudinal cohorts and randomized controlled trials, the IPCW-TMLE procedure provides an additional tool to yield unique biological and public health discovery.

Recall the paper by Wang et al. (2009) discussed in Section 2 where they consider the same two-stage design as in this paper. Let’s consider the model for the observed data *O* = (*V*, Δ, Δ*X*) implied by a nonparametric full-data model for the distribution of *X*, and known *P*_{X,0}(Δ = 1 | *V*). In that case, the IPCW-TMLE we propose is locally efficient if *P*_{X,0}(Δ = 1 | *V*) is nonparametrically estimated or is estimated in a targeted way as specified in our article, and will be inefficient otherwise. If the full-data model is not nonparametric, then our proposed IPCW-TMLE will not be locally efficient, even if *P*_{X,0}(Δ = 1 | *V*) is estimated nonparametrically.

If *X* = (*S*, *Y*, *A*, *W*), and one only assumes the consistency and randomzation assumption, then the statistical model for the distribution of *X* is indeed nonparametric. Thus, in that statistical model, the proposed IPCW-TMLE of *EY* (*a*) will be efficient if *S*, *Y*, *A* are discrete and *P*_{X,0}(Δ = 1 *|* *S*, *Y*, *A*) is estimated nonparametrically or in targeted manner. However, as in Wang et al., if one also assumes a parametric model for the treatment mechanism, then the statistical model for the full-data is *not* nonparametric. As a consequence of this choice of full-data model, the efficient influence curve does not exist in closed form, and has smaller variance than the efficient influence curve for the nonparametric full-data model, (and there exists a whole class of double robust influence curves/estimating functions), so that the Cramer-Rao lower bound in their more restricted model is smaller than the Cramer-Rao lower bound for the nonparametric full-data model our IPCW-TMLE aims to achieve. For such a nonparametric full-data model, their locally efficient estimator solves the actual efficient influence curve estimating equation while the IPCW-TMLE solves the inefficient IPCW-full-data efficient equation.

Wang et al. also consider the subclass of influence functions/estimating functions generated by the nonparametric full-data model corresponding with a saturated parametric model for the treatment mechanism, and they refer to the optimal influence function in this subclass as the efficient double robust estimating function. Their efficient double robust estimating function equals the efficient influence curve for the observed data model implied by nonparametric full-data model, i.e., the efficient influence curve of our model. As a consequence, their efficient double robust estimator (based on solving the efficient double robust estimating equation) and our double robust TMLE are both locally efficient for the observed data model corresponding with the nonparametric full-data model. If the full-data model is nonparametric, *V* is continuous, and we do not use the targeted estimator of the missingness mechanism then our proposed IPCW-TMLE is not locally efficient, while their efficient double robust estimator will be locally efficient.

**Author Notes:** The authors would like to thank the editor and referees for their helpful contributions. This research was funded by NIH grant R01 A1074345-01.

Sherri Rose, University of California, Berkeley.

Mark J. van der Laan, University of California, Berkeley.

- Barlow WE, Ichikawa L, Rosner D, Izumi S. Analysis of case-cohort designs. J Clin Epidemiol. 1999;52(12):1165–1172. doi: 10.1016/S0895-4356(99)00102-X. [PubMed] [Cross Ref]
- Bembom O, van der Laan MJ. Data-adaptive selection of the truncation level for inverse-probability-of-treatment-weighted estimators. 2008
*Technical Report 230, Division of Biostatistics, University of California, Berkeley*. - Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75(1):11–20. doi: 10.1093/biomet/75.1.11. [Cross Ref]
- Breslow NE, Lubin JH, Marek P. Multiplicative models and cohort analysis. J Am Stat Assoc. 1983;78:1–12. doi: 10.2307/2287093. [Cross Ref]
- Bureau A, Diallo MS, Ordovas JM, Cupples LA. Estimating interaction between genetic and environmental risk factors: Efficiency of sampling designs within a cohort. Epidemiology. 2008;19(1):83–93. doi: 10.1097/EDE.0b013e31815c4d0e. [PubMed] [Cross Ref]
- Cochran WG. Sampling Techniques. Wiley; New York, NY: 1963.
- Ernster VL. Nested case-control studies. Prev Med. 1994;23(5):587–590. doi: 10.1006/pmed.1994.1093. [PubMed] [Cross Ref]
- Essebag V, Genest J, Jr, Suissa S, Pilote L. The nested case-control study in cardiology. American Heart Journal. 2003;146(4):581–590. doi: 10.1016/S0002-8703(03)00512-X. [PubMed] [Cross Ref]
- Essebag V, Platt RW, Abrahamowicz M, Pilote L. Comparison of nested case-control and survival analysis methodologies for analysis of time-dependent exposure. BMC Medical Research Methodology. 2005;5(5) doi: 10.1186/1471-2288-5-5. [PMC free article] [PubMed] [Cross Ref]
- Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine. 1991;10(5) doi: 10.1002/sim.4780100509. [PubMed] [Cross Ref]
- Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics. 2010;6(1) doi: 10.2202/1557-4679.1260. Article 26. [PMC free article] [PubMed] [Cross Ref]
- Hak E, Wei F, Grobbee DE, Nichol KL. A nested case-control study of influenza vaccination was a cost-effective alternative to a full cohort analysis. J Clin Epidemiol. 2004;57(9):875–880. doi: 10.1016/j.jclinepi.2004.01.019. [PubMed] [Cross Ref]
- Kupper LL, McMichael AJ, Spirtas R. A hybrid epidemiologic study design useful in estimating relative risk. J Am Stat Assoc. 1975;(70):524–528. doi: 10.2307/2285927. [Cross Ref]
- Liddell FDK, McDonald JC, Thomas DC. Methods of cohort analysis: appraisal by application to asbestos mining. J R Stat Soc Ser A. 1977;(140):469–491.
- Mantel N. Synthetic retrospective studies and related topics. Biometrics. 1973;29(3):479–486. doi: 10.2307/2529171. [PubMed] [Cross Ref]
- Moore K, van der Laan MJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: Peace KE, editor. Design and Analysis of Clinical Trials with Time-to-Event Endpoints. Chapman & Hall/CRC; 2009a. Biostatistics Series.
- Moore KL, van der Laan MJ. Increasing power in randomized trials with right censored outcomes through covariate adjustment. Journal of Biopharmaceutical Statistics. 2009b;19(6):1099–1131. doi: 10.1080/10543400903243017. [PMC free article] [PubMed] [Cross Ref]
- Neyman J. Contribution to the theory of sampling human populations. J Am Statist Ass. 1938;33:101–116. doi: 10.2307/2279117. [Cross Ref]
- Polley EC, van der Laan MJ. Selecting optimal treatments based on predictive factors. In: Peace KE, editor. Design and Analysis of Clinical Trials with Time-to-Event Endpoints. Chapman & Hall/CRC; 2009. Biostatistics Series.
- Prentice RL, Qi L. Aspects of the design and analysis of high-dimensional snp studies for disease risk estimation. Biostatistics. 2006;7(3):339–354. doi: 10.1093/biostatistics/kxj020. [PubMed] [Cross Ref]
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89(427):846–866. doi: 10.2307/2290910. [Cross Ref]
- Rose S, van der Laan MJ. Simple optimal weighting of cases and controls in case-control studies. The International Journal of Biostatistics. 2008;4(1) doi: 10.2202/1557-4679.1115. Article 19. [PMC free article] [PubMed] [Cross Ref]
- Rose S, van der Laan MJ. Why match? Investigating matched casecontrol study designs with causal effect estimation. The International Journal of Biostatistics. 2009;5(1) doi: 10.2202/1557-4679.1127. Article 1. [PMC free article] [PubMed] [Cross Ref]
- Rosenblum M, van der Laan MJ. Targeted maximum likelihood estimation of the parameter of a marginal structural model. The International Journal of Biostatistics. 2010;6(2) doi: 10.2202/1557-4679.1238. http://www.bepress.com/ijb/vol6/iss2/19. [PMC free article] [PubMed] [Cross Ref]
- Rothman K, Greenland S. Modern Epidemiology. 2nd edition. Lippincott, Williams and Wilkins; Philadelphia, PA: 1998.
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder) Journal of the American Statistical Association. 1999;94:1096–1120. 1121–1146. doi: 10.2307/2669923. [Cross Ref]
- Stitelman OM, van der Laan MJ. Collaborative targeted maximum likelihood for time-to-event data. The International Journal of Biostatistics. 2010;6(1) doi: 10.2202/1557-4679.1249. Article 21. [PubMed] [Cross Ref]
- Szklo M, Nieto FJ. Epidemiology: Beyond the Basics. 2nd edition. Jones & Bartlett Publishers; Boston, MA: 1999.
- Thomas DC. Addendum to: “Methods of cohort analysis: appraisal by application to asbestos mining” by F.D.K. Liddell and J.C. McDonald and D.C. Thomas. J R Stat Soc Ser A. 1977;(140):469–491.
- Tsiatis AA. Semiparametric Theory and Missing Data. Springer; New York: 2006.
- Ury HK. Efficiency of case-control studies with multiple controls per case: continuous or dichotomous data. Biometrics. 1975;31(3):643–649. doi: 10.2307/2529548. [PubMed] [Cross Ref]
- van der Laan MJ. The construction and analysis of adaptive group sequential designs. 2008a. Technical report 232, Division of Biostatistics, University of California, Berkeley.
- van der Laan MJ. Estimation based on case-control designs with known prevalence probability. The International Journal of Biostatistics. 2008b;4(1) doi: 10.2202/1557-4679.1114. Article 17. [PubMed] [Cross Ref]
- van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. Springer; New York: 2003.
- van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer; New York: 2011. www.targetedlearningbook.com.
- van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics. 2006;2(1) doi: 10.2202/1557-4679.1043. Article 11. [Cross Ref]
- van der Laan MJ, Polley EC, Hubbard AE. Super learner. 2007.
*Technical Report 222, Division of Biostatistics, University of California, Berkeley*. [PubMed] - van der Laan MJ, Rose S, Gruber S. Readings on targeted maximum likelihood estimation. 2009. Technical report 254, Division of Biostatistics, University of California, Berkeley.
- Vittinghoff E, Bauer DC. Case-only analysis of treatment-covariate interactions in clinical trials. Biometrics. 2006;62(3):769–776. doi: 10.1111/j.1541-0420.2006.00511.x. [PubMed] [Cross Ref]
- Wang W, Scharfstein D, Tan Z, MacKenzie EJ. Causal inference in outcome-dependent two-phase sampling designs. J.R. Statist. Soc. B. 2009;71(5):947–969. doi: 10.1111/j.1467-9868.2009.00712.x. [Cross Ref]

Articles from The International Journal of Biostatistics are provided here courtesy of **Berkeley Electronic Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |