Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Sin. Author manuscript; available in PMC 2017 April 1.
Published in final edited form as:
Stat Sin. 2016 April; 26(2): 509–526.
doi:  10.5705/ss.2014.015
PMCID: PMC4820319

Marginal Structural Cox Models with Case-Cohort Sampling


A common objective of biomedical cohort studies is assessing the effect of a time-varying treatment or exposure on a survival time. In the presence of time-varying confounders, marginal structural models fit using inverse probability weighting can be employed to obtain a consistent and asymptotically normal estimator of the causal effect of a time-varying treatment. This article considers estimation of parameters in the semiparametric marginal structural Cox model (MSCM) from a case-cohort study. Case-cohort sampling entails assembling covariate histories only for cases and a random subcohort, which can be cost effective, particularly in large cohort studies with low outcome rates. Following Cole et al. (2012), we consider estimating the causal hazard ratio from a MSCM by maximizing a weighted-pseudo-partial-likelihood. The estimator is shown to be consistent and asymptotically normal under certain regularity conditions. Finite sample performance of the proposed estimator is evaluated in a simulation study. In the corresponding supplementary document, computation of the estimator using standard survival analysis software is presented.

Key words and phrases: Case-cohort Study, Causal Inference, Cox Model, Marginal Structural Model, Survival Analysis

1. Introduction

Biomedical cohort studies are often conducted with the goal of assessing the effect of a time-varying treatment (or exposure) on a survival time. In such studies there may exist time-dependent covariates which are simultaneously (i) confounders and (ii) affected by prior treatment. In the presence of time-varying confounders affected by prior treatment, standard methods such as Cox regression modeling with time-varying covariates do not in general yield consistent estimators of the causal effect of treatment (Robins (1986, 1998); Robins and Rotnitzky (1992); Hernán, Brumback, and Robins (2001)). On the other hand, marginal structural models (MSM) fit using inverse probability weighting can be employed to obtain consistent estimators of the causal effect of a time-varying treatment on an outcome of interest, even if there are time-varying confounders affected by prior treatment (Robins (1999); Hernán, Brumback, and Robins (2001)).

Recently, Cole et al. (2012) considered fitting MSCMs via inverse probability weighting in the presence of case-cohort sampling. The case-cohort study design is a cost-efficient approach to estimate treatment effects in large cohorts with low event rates when treatment or covariate information is expensive. The design entails randomly selecting a subcohort from the entire cohort. Covariate information is then collected only from the random subcohort and from individuals that are observed to experience an event (i.e., cases), saving cost and effort relative to obtaining covariate information from the full cohort. In addition to being cost efficient, the case-cohort design enjoys other benefits. For instance, the subcohort can serve as a basis for real-time covariate monitoring during the course of the study. Also, because the subcohort is chosen randomly, survival times to different diseases can be analyzed using the same subcohort (Self and Prentice (1988)).

In the presence of case-cohort sampling, Cole et al. (2012) considered estimating the causal hazard ratio of a MSCM via inverse probability weighting. Simulation studies indicated that their estimator could perform well empirically, but no formal justification for their estimator has been developed to date. Following Cole et al. (2012), we consider estimating the causal hazard ratio of a MSCM via inverse probability weighting in case-cohort studies and establish consistency and asymptotic normality for the estimator that maximizes a weighted-pseudopartial- likelihood (WPPL) under certain regularity conditions.

The approach utilized in this paper entails standard counting process and martingale theory. This formulation readily enables practical implementation of the methods using existing survival analysis software. Framing the problem using counting processes may also be helpful in future work, e.g., in fitting MSCMs to data from nested case-control studies or in the presence of competing risks. In the special situation that the subcohort is the full cohort, the proposed inverse probability weighted estimator is asymptotically equivalent to the estimator in Robins (1999). In this case our proof gives an alternative consistency and normality proof to the one in Robins (1999) that does not utilize the usual counting process framework. We also derive a new variance estimator that arises from the counting process formulation under both full and case-cohort settings. Empirical results presented here indicate that, in certain scenarios, the proposed variance estimator may be preferred to the so-called “robust” variance estimator (Lin and Ying (1993)) employed in Cole et al. (2012).

The outline of the remainder of this paper is as follows. In Section 2, estimators of the hazard ratio of a MSCM in the presence of case-cohort sampling are introduced, including the estimator proposed by Cole et al. (2012). Consistency and asymptotic normality results are presented in Section 3 and a simulation study is in Section 4. Some additional considerations are in Section 5. Regularity conditions and discussion of the conditions are deferred to the Appendix. The supplementary document accompanying this paper includes detailed proofs for Theorems 3.1 – 3.6, a description of how a MCSM can easily be fit via inverse probability weighting for either the full cohort or case-cohort setting using standard survival analysis software, such as R or SAS, additional simulation study results including performance of the baseline cumulative hazard estimator, and a summary of notation.

2. Marginal Structural Cox Model Estimators

2.1. Notation, Assumptions, and Model

Consider an observational cohort study where the outcome of interest is a survival time T, based on the time from study entry until some particular outcome occurs. We assume T is continuous so that there are no tied failure times. During the study, individuals may dropout or discontinue participation so that T is right censored. Individuals may or may not elect to receive treatment at various points of time during the study; let Ai(t) indicate whether subject i is on treatment at time t. If more than one treatment is available, Ai(t) is a vector of treatment indicator variables corresponding to the treatment levels. We assume Ai(t) is a p×1 vector and treatment variation is irrelevant (VanderWeele (2009)): for fixed values of Ai(t), additional variation in the treatment (e.g., dose, timing) does not affect the potential outcome. The subscript i is often suppressed when there is no ambiguity, as we assume random vectors are drawn independently from a distribution common to all subjects. Let L(t) denote a vector of covariates at time t and L(0) represent baseline covariates. Overbars are used to represent history up to and including time t, with Ā(t) = {A(u) : 0 ≤ ut} and L(t) defined analogously. Assume decisions related to treatment at t are made after obtaining the covariate information at t, so L(t) is temporally prior to A(t). For a case-cohort study, the time varying covariates L(t) and treatment A(t) are by design observed only for the cases and individuals in the random subcohort (while under study); L(t) and A(t) are missing for all other individuals. Corresponding to the subcohort, let C denote the set of indices of size ñn that are randomly selected without replacement from the set {1, …, n} corresponding to the entire cohort.

Let ā denote a possible (static) treatment plan, ā= {a(t) : 0 ≤ tτ}, where τ is the study duration. Assume τ = 1 without loss of generality. Each possible value of ā can be interpreted as a prespecified treatment plan. Given a single treatment, examples of ā are never treat (a(t) = 0 for all t [set membership] [0, 1]), treat starting at a prespecified time t1 < 1 (a(t) = I{tt1} where I{·} is the usual indicator function), treat from baseline (a(t) = 1 for all t [set membership] [0, 1]). Let Tā be a subject’s potential failure time had (possibly contrary to what was observed in the actual study) the subject been treated according to ā. Let [dbl vert, bar (under)] denote statistical independence. Assume


These are referred to as the causal consistency, conditional exchangeability, and positivity assumptions, respectively. Assumption (2.1) states that, in the absence of censoring, the observed failure time T equals the potential failure time Tā for all treatment plans ā compatible with the observed treatment up to time T. Assumption (2.2) ensures no unmeasured confounding. Assumption (2.3) states that the conditional probability of receiving any particular treatment is greater than zero. Of the three, only (2.3) can be tested empirically. Sensitivity analysis may be useful in assessing the robustness of inference drawn to violations of (2.2) (Robins, Rotnitzky, and Scharfstein (1999)).

Consider the MSCM


where λTā (t) is the hazard of failure at time t if all individuals in the population had followed treatment plan ā through time t, λ0(t) is an unspecified baseline hazard function corresponding to the hazard if all individuals had been untreated through time t, f(ā(t)) is a specified function of treatment history up to time t, and β0 is an unknown parameter vector. Hereafter, we consider the MSCM


where, for notational convenience, we let r{·} = exp{·}. For example, if we are interested in the causal effect of current AZT treatment on mortality of HIV-positive homosexual men, then r(β0) is the ratio of the hazard of death at time t had all subjects in the population alive at time t been exposed to AZT compared to being unexposed at time t. Here (2.4) focuses on the effect of current treatment status only; however, the results presented below are valid for any specified f(ā(t)).

We employ the counting process framework to study the large sample behavior of estimators of β0. All processes discussed hereafter refer to observed processes. Let (Ω, F, p) be a complete probability space and let {Ft : t [set membership] [0, 1]} be an increasing right-continuous family of sub σ-algebras of F consisting of failure times, covariates, and treatment histories up to time t, and censoring histories up to time t+ for all subjects in a cohort of size n. Thus the filtration with respect to the probability space is the same as usual, except that treatment histories are now listed separately from covariate histories. Let Ni(·) be a counting process adapted to Ft representing the number of failures of subject i by time t so that dNi(t) indicates the number of events of subject i that occurred in [t, t + dt) for sufficiently small dt. Because failures are assumed to occur in continuous time, we only allow jumps of size 1 and no simultaneous jumps can occur in [t, t+dt). Let Ci(t) = 0 indicate that subject i remained uncensored prior to time t and Ci(t) = 1 otherwise. The treatment process Ai(·) and the censoring process Ci(·) are assumed to be piece-wise constant point processes with cadlag (right-continuous with left-hand limits) step-function sample paths. The processes A(·) and C(·) are assumed to have jumps that can occur at no more than a finite number of time points. Informally, all participants follow (approximately) the same visit schedule. This assumption should be reasonable in studies with regularly scheduled follow-up visits (e.g., every six months) and good study compliance. We refer to censoring as ignorable (or noninformative) if the cause-specific hazard of being censored at t among subjects alive and uncensored does not depend on the failure times Tā given prior treatment/covariate history Ā(t) and L(t) (Hernán, Brumback, and Robins (2001)). Let Yi(t) = I{Ni(t) = Ci(t) = 0} denote whether an individual is at-risk of being observed to fail at time t, having left-continuous sample paths, and assume pr[Y (1) > 0] > 0.

2.2 Inverse Probability Weights

If we can correctly model the probability of receiving treatment at time t given the past treatment history and covariate history, then we can consistently estimate the weights


These are referred to as stabilized inverse-probability-of-treatment-weights (IPTWs). If the numerator probabilities in (2.5) were replaced with 1, then the weights are referred to as unstabilized IPTWs. We can consistently estimate the numerator probabilities in (2.5) based on sample proportions because A(·) is assumed to have at most a finite number of jumps over the study period. Under (2.2) to (2.3), in the absence of censoring, Robins (1999) showed that a consistent estimator of the unknown parameter β0 in (2.4) can be obtained by fitting an ordinary time-dependent Cox model with the contribution of subject i to the risk set at time t weighted by estimates of (2.5). Informally we can think of the analysis via IPTWs as reweighting the observed data set such that it has the same properties as a random sample, with respect to the measured confounders L, from a population where L(t) [dbl vert, bar (under)] A(t)|Ā(t) holds at time t. The weighted study population is sometimes called a pseudo-population.

Dropout may introduce selection bias if it is associated with exposure and the outcome. In the presence of such censoring, we can still obtain a consistent estimator of β0 by fitting the ordinary Cox model, but weighting a subject alive and uncensored at time t by estimates of WT(t) × WC(t), where


This is under the assumption of no unmeasured confounders for censoring, an analogous assumption to (2.3) for censoring, and assuming that we can correctly model the denominator probabilities in (2.6) (Robins (1999)). Here the weighted study population can be thought of as a pseudo-population in which there is no confounding due to measured covariates or selection bias due to censoring. In Section 2.3, we make use of the stabilized weights defined by W(t) [equivalent] WT (t) × WC(t) after modifying (2.5) by adding C(k) = 0 to the conditioning events in both the numerator and the denominator (Hernán et al. (2000)). Hereafter W(t) is referred to as inverse-probability-weights (IPWs). Here (2.5) and (2.6) are finite products and (2.3) ensures non-zero probabilities in their denominators; hence the IPWs are bounded at all t. Our results are not limited to a specific form of the weights W(t). The proposed methods are applicable to different inverse probability weighting analysis provided that the IPWs (or IPTWs in the absence of censoring) are bounded, such as when truncated (Cole and Hernán (2008)) and normalized (Xiao, Abrahamowicz, and Moodie (2010)) weights are employed. Under the assumption of finite support of the treatment and censoring processes, unstabilized weights (where the numerators probabilities of both (2.5) and (2.6) are replaced with 1) are also bounded, but are highly variable and monotone increasing functions of t. Other weights such as stabilized, truncated, and normalized weights are generally recommended in practice as they lead to more efficient estimators of the causal treatment effect.

We now briefly describe estimation of the random weights W(t), denoted by Ŵ(t). One can specify a pooled logistic model (treating each person-visit as an observation) to estimate the probability in the denominators of (2.5) and (2.6) at each time (for example, at each visit), then plug in the estimated probabilities (Hernán et al. (2000, 2001)). In the presence of case-cohort sampling, the same model can be used to obtain Ŵ(t) after weighting subcohort controls by the inverse probability of subcohort selection. Example SAS code to obtain Ŵ(t) using case-cohort data is provided in Section 2 of the supplement. We assume throughout that the models to estimate denominator probabilities in the IPWs are correctly specified. In practice, investigators may wish to explore the sensitivity of treatment effect estimates to different model specifications for estimating the weights.

2.3 Weighted-Pseudo-Partial-Likelihood

We consider two weighted-pseudo-partial-likelihoods (WPPL) that form the basis for obtaining consistent estimators of β0 in the presence of case-cohort sampling. They are formed by weighting individual contributions to the usual partial likelihoods by Wi(t) assuming that Wi(t) is known.

The log-WPPL created by individual-time-specific weights at time t under the full cohort setting is given by


it is motivated by the weighted estimating equations proposed by Robins (1993). The log-WPPL in the case-cohort setting is


This is slightly different from the log-WPPL proposed by Cole et al. (2012),


Here (2.8) and (2.9) differ only in whether a case outside the subcohort C contributes to the risk set. If Wi(u) = 1 for all i and u, (2.8) reduces to the log-likelihood considered by Self and Prentice (1988) and (2.9) reduces to the log-likelihood considered by Prentice (1986).

Let [beta], beta, and β* be solutions to [partial differential]l(β, 1; Ŵ)/[partial differential]β= 0, [partial differential]l(β, 1; Ŵ)/[partial differential]β = 0, and [partial differential]l*(β, 1; Ŵ)/[partial differential]β = 0, respectively. Based on arguments as in Andersen and Gill (1982), in Theorems 3.1 - 3.2 we show [beta] and beta are consistent estimators of β0. That β*[mapsto]p β0 can be shown analogously. Asymptotic normality of [beta] and beta will be shown via asymptotic normality of score statistics corresponding to (2.7) and (2.8).

To make use of counting process and martingale theory, under (2.1) each (observed) counting process Ni(·)(i = 1, …, n) can be uniquely decomposed into the sum of its intensity process λi and a local square integrable martingale Mi,


where the intensity process is given by


which embodies the same parameters as in (2.4).

3. Asymptotic Properties

In this section, we present the main results: consistency and asymptotic normality of the estimators [beta], beta, and β*. Sufficient conditions for these results are stated in the Appendix, followed by discussion of the conditions. Proofs of the theorems are given in S1 of the supplementary document.

Theorem 3.1

(Consistency of [beta] under full cohort) Under conditions A–F, [beta] [mapsto]p β0.

Theorem 3.2

(Consistency of beta under the case-cohort) Under conditions A–G, beta [mapsto]p β0.

It is straightforward to show that the estimator based on (2.9) converges in probability to the same limit as beta. An individual case’s contribution to C at its failure time (which is weighted by its IPWs) is asymptotically negligible in the sense that IPWs are bounded at all times and weighted subcohort averages are asymptotically stable (see conditions B and G-3 in the Appendix).

Theorem 3.3

Under conditions A–G, betaβ* [mapsto]p 0.

We need some additional notation. Let c[multiply sign in circle]0 = 1, c[multiply sign in circle]1 = c, c[multiply sign in circle]2 = cc′, and ci denote the i-th element of c for a p × 1 column vector c. Let r(j){βA(t)} = A(t)[multiply sign in circle]jr{βA(t)}, j = 0, 1, 2. Full cohort averages are defined by S(j)(β,t)=n-1i=1nYi(t)r(j){βAi(t)} for j = 0, 1, 2, as given in Andersen and Gill (1982) with covariates Zi(t) being replaced by the treatment process Ai(t). Similarly, subcohort averages are defined by S(j)(β,t) = n−1Σi[set membership]CYi(t)r(j){βAi(t)}. Limits of S(j)(β, t) and S(j)(β, t) are given by s(j)(β, t), formally defined in the Appendix. Analogously, let weighted full cohort averages be SW(k)(j)(β,t)=n-1i=1nWi(t)kYi(t)r(j){βAi(t)} and weighted subcohort averages be SW(k)(j)(β,t)=n-1iCWi(t)kYi(t)r(j){βAi(t)} for j = 0, 1, 2 and k = 1, 2, with limits sW(k)(j)(β,t). Let E = S(1)/S(0) and = S(1)/S(0) with limit e = s(1)/s(0), and let EW(k)=SW(k)(1)/SW(k)(0) and EW(k)=SW(k)(1)/SW(k)(0) with limits eW((k)). Lastly, let v = s(2)/s(0)e[multiply sign in circle]2 and vW(k)=sW(k)(2)/sW(k)(0)-eW(k)2, and let =01v(β0,t)s(0)(β0,t)λ0(t)dt and W(k)=01vW(k)(β0,t)sW(k)(0)(β0,t)λ0(t)dt.

Theorem 3.4

(Asymptotic normality of the full cohort MSCM score statistic)

Let U(β0, t) = [partial differential]l(β, t)/[partial differential]β|β=β0 be the full cohort MSCM score process at time t. Under conditions A–F,


where ΣU = ΣW(2) + ΔW(1),W(2) and


Theorem 3.4 can be used to show asymptotic normality of the MSCM case-cohort score statistic:

Theorem 3.5

(Asymptotic normality of the case-cohort MSCM score statistic)

Let Ũ(β0, t) = [partial differential]l(β, t)/[partial differential]β|β=β0 be the case-cohort MSCM Score process at time t. Under conditions A–G,


where ΣŨ = ΣU + Δα,


with q(j)(β, ·, ·) defined in condition G-2 in the Appendix.

Theorem 3.6

(Asymptotic normality of beta) Under conditions A-G,


where ΣŨ is given in Theorem 3.5.

Based on Theorem 3.6, we propose a new variance estimator




Here {W(2)=Ŵ2} and {W(1)=Ŵ} denote that the IPWs in W(2) and W(1) are replaced by Ŵ2 or Ŵ, NŴ(t) is defined by Σi Ŵi(t)Ni(t), and Ĝ(·, ·, ·) is G(·, ·, ·) in Theorem 3.5 with q(j)(β0, ·, ·) and sW(1)(j)(β0,·) in h(j)(β0, ·, ·) replaced by Q(j)(beta, ·, ·) (defined in condition G-2 in the Appendix) and SW(1)(j)(β,·) along with eW(1) (β0, ·) replaced by W(1) (beta, ·). Estimators (3.2), (3.3), and (3.4) are consistent estimators of ΣW(1), ΣW(2), and ΔW(1),W(2) in view of condition A and the fact that supβ,t |n−1{x2110(β, t) − I(β, t)}| →p 0 where x2110(β, t) = −[partial differential]2l(β, t)/[partial differential]β2 (see S1–2 of the supplementary document). Estimator (3.5) is a consistent estimator of Δα in view of conditions A, G-1(ii), that n−1NŴ(t) uniformly converges to 0tsW(1)(0)(β0,u)λ0(u)du, and that n−1NŴ (1) is bounded in probability.

The proposed variance estimator (3.1) is different from the robust estimator proposed by Lin and Ying (LY, Lin and Ying (1993)) that is used in most MSM analyses. Both (3.1) and the LY estimator are sandwich-type estimators where the “bread” of sandwich ( ^W(1)-1) is the same. However, the “meat” is different. In particular, the “meat” of the LY estimator is based on (weighted) score residuals whereas the “meat” of (3.1) is given by an estimator of ΣŨ + Δα. Simulation results reported in §4 (and S2-2 in the supplement document) indicate that (3.1) can be more accurate when the size of subcohort is small.

4. Simulation

A simulation study was conducted to examine the finite sample bias of beta and β*, and the performance of the proposed variance estimator (3.1) as well as the LY variance estimator. Simulations were conducted as in Cole et al. (2012). Briey, potential survival times were generated according to the MSCM given in (2.4), and observed survival times were generated by stochastically generating time varying exposures and confounders for cohorts of size n = 1, 000 (see Cole et al. (2012) for details). While they considered only one scenario in which 25% of individuals were cases and a 20% subcohort fraction, ñn−1 × 100 = 20, we considered 36 scenarios by varying both the subcohort fraction and the event rate from 5 to 30%. Censoring times were generated from uniform distributions with support chosen to achieve the desired event rate. We did not incorporate (2.6) when calculating IPWs because the censoring times were generated independently of the exposure and potential survival times. Following Cole et al. (2012), stabilized weights were used to calculate IPWs. For each scenario 5,000 data sets were generated under the null β0 = 0 and the alternative β0 = log(1/2).

Results from the simulation study are summarized in Table 4.1. Only results obtained from six scenarios under the null are presented; results from other scenarios and under the alternative were similar (see S2 of the supplementary document for results under the alternative). For all scenarios, under both the null and alternative, beta and β* were nearly unbiased; that the two estimators performed similarly is not surprising in light of Theorem 3.3. Under the null, the proposed variance estimator was always less biased than the LY variance estimator when the subcohort fraction was only 5%, regardless of the event rate. Similarly, (3.1) was less biased regardless of the subcohort fraction when the event rate was 5%. Both the proposed and the LY variance estimators were approximately unbiased when the subcohort fraction and event rate were both at least 20%. Wald confidence intervals (CIs) using the LY variance estimator tended to undercover when the subcohort fraction was 5%, whereas Wald CIs using (3.1) exhibited coverage close to the nominal level for all scenarios considered. In summary, both beta and β*, along with the proposed variance estimator and CIs, exhibited good finite sample properties for the scenarios considered, while performance of the LY variance estimator depended on subcohort size and event rate.

Table 4.1
Summary of simulation study. Bias denotes the empirical bias of the different estimators of β0. ESE denotes the empirical standard errors. ASE denotes the average estimated standard errors and Coverage denotes the empirical coverage of 95% Wald-type ...

5. Additional Considerations

5.1. Baseline Cumulative Hazard Estimation

In addition to the treatment effect β, it may be of interest to estimate the cumulative baseline hazard function. Similar to the cumulative baseline hazard estimator proposed in Self and Prentice (1988), a consistent estimator of Λ0(t)=0tλ0(u)du with case-cohort sampling is


This estimator is equivalent to Self and Prentice’s estimator when Ŵi(t) = 1 for all i and t. A consistent estimator of the cumulative baseline hazard can also obtained by using β* instead of beta in (5.1). In the supplement (5.1) is shown to be consistent under conditions A–G and its performance is examined in a simulation study; see S1 and S2 of the supplement.

5.2. Limitations and Future Directions

While Cole et al. (2012) extensively discussed limitations of the MSCM with case-cohort sampling, we reiterate some of the issues that are important from a theoretical point of view. First, the asymptotic results presented here correct model specification for both the treatment assignment model (2.5) and the censoring model (2.6). If (2.2) holds, flexible parametric regression models fit adjusting for all measured confounders and their histories should provide a good approximation to denominator probabilities in (2.5) and (2.6). However, (2.2) is an untestable assumption and thus in practice analysts may want to explore sensitivity of treatment effect estimates to departures from this assumption.

Second, the proposed MSCM estimators for the case-cohort study, which are based on Prentice (1986) and Self and Prentice (1988) type estimators, are not expected to be fully efficient. After Prentice (1986) and Self and Prentice (1988), various methods have been proposed as means of improving the efficiency of the hazard ratio estimator in the standard case-cohort Cox regression analysis. Those methods seek to improve efficiency mostly by using weighted partial-likelihood estimation. For example, Barlow (1994), Barlow et al. (1999), Kong, Cai, and Sen (2006), and Kong and Cai (2009) utilize fixed and time-varying inverse-probability-sampling weights to account for subcohort sampling to improve efficiency. Borgan et al. (2000), Kulich and Lin (2004), and Breslow et al. (2009a, b) consider leveraging phase 1 covariates (available from the entire cohort) to improve efficiency. These methods could potentially be extended to develop more efficient MSCM estimators in the presence of case-cohort sampling.

Supplementary Material



We thank Professor Danyu Lin for expert advice and the Editor and three anonymous reviewers for helpful comments that improved the paper. The authors were partially supported by the Lifespan/Tufts/Brown Center for AIDS Research (CFAR), University of North Carolina CFAR, and National Institutes of Health grants P30 AI042853, P30 AI50410, R01 AI085073, UL1RR025747, R01 ES021900, P01 CA142538, and R01 AI100654.

Appendix: Regularity conditions

In what follows, norms are defined by ||c[multiply sign in circle]2|| = supi,j |cicj|, ||c|| = supi |ci|, and c=(ci2)1/2=(cc)1/2.

  • A
    (Uniform consistency of estimated weights)
  • B
    (Stability of weights) Individual time-specific weights Wi(t) and the corresponding estimators Ŵi(t) are strictly positive and bounded, with positive real numbers M1 and M2 such that
  • C
    (Finite interval) 01λ0(t)dt<.
  • D
    (Asymptotic stability)
    1. There exists a neighborhood B0 of β0 and functions s(0), s(1), and s(2) defined on B0 × [0, 1] such that
    2. There exists a neighborhood B of β0, B [subset, dbl equals] B0, and functions sW(k)(j) defined on B × [0, 1] such that
    3. n1/2{SW(1)(0)(β0,t)-sW(1)0(β0,t)}dN(0,σ2(t)) uniformly in t [set membership] [0, 1], for some σ2(t).
  • E
    (Lindeberg condition) For any ε > 0, j = 1, …, p
  • F
    (Asymptotic regularity conditions) s(j)(β, t) and sW(k)(j)(β,t) are continuous functions of β [set membership] B uniformly in t [set membership] [0, 1] that are bounded on B × [0, 1] for j = 0, 1, 2 and k = 1, 2. For all (β, t) [set membership] B × [0, 1] and m = 0, 1,
    Here s(0) and sW(k)(0) are bounded away from zero and the matrices Σ and ΣW(k) are positive definite.
  • G-1
    (Stability of subcohort average)
    1. (Nontrivial subcohort) ñn−1p α for some α [set membership] (0, 1].
    2. (Asymptotic normality of subcohort averages at β0) For any ε > 0
      and the sequences of distributions of n1/2{(β0, t) − E(β0, t)} are tight on the product space of cadlag functions equipped with the product Skorohod topology, and so are n1/2{W(1) (β0, t) − EW(1) (β0, t)}.
  • G-2
    (Asymptotic stability and regularity of covariance function) There exists a neighborhood B of β0 and functions q(j)(β, t, u) for j = 0, 1, 2, defined on B × [0, 1]2 such that q(j)(β, t, u) are continuous functions of β [set membership] B uniformly in (t, u) [set membership] [0, 1]2, the q(j) are bounded on B × [0, 1]2 and
    Moreover, supn≥1 x2130[Q(j)(β, t, u)] for j = 0, 1, 2 are bounded sequences where x2130 denote expectation.
  • G-3
    (Asymptotic stability of subcohort averages) If Q(j)(β, t, u) are covariance functions based on subcohort members i = 1, …, ñ, then
    the subcohort average converges to the limit of the full cohort in probability uniformly in β [set membership] B and t [set membership] [0, 1], and
    the subcohort covariance functions converge in probability uniformly in β [set membership] B and t [set membership] [0, 1] to the full cohort covariance functions. In addition,
    for some [sigma with tilde]2(t).

All conditions except A and B are extensions of the regularity conditions from Self and Prentice (1988) by incorporating IPWs. Conditions A and B are required for IPWs to ensure asymptotic properties of the proposed estimators.

Condition A

Ŵ(·) and W(·) are assumed to be predictable with respect to the filtration Ft because weights are determined by predictable processes: A(·), L(·), and their histories. Then along with the assumption of no misspecification of the model used to estimate denominator probabilities in W(·), the finite number of jumps assumption on the treatment and censoring processes are sufficient for this condition to hold. From a practical point of view, having a finite number of time points when treatment status can change or when censoring might occur may be reasonable to assume. For instance, studies often have planned visits at finite discrete intervals when a patient can have treatment altered. Similarly, the censoring time for a subject is often assumed to be the last observed visit time before the subject became lost-to-follow-up.

Condition B

All weights discussed in Section 2.2 satisfy conditions A and B in general, except the unstabilized weights. Unstabilized weights satisfy conditions A and B only when the assumption of finite support of A(·) and C(·) is met. The IPWs are strictly positive by (2.3).

Conditions C and D

Condition C is the same as in Self and Prentice (1988). When the IPWs are equal to 1, SW(k)(j) in D is S(j) in Self and Prentice (1988) for all j = 0, 1, 2 and k = 1, 2. Then ΣW(k) defined in Section 3 and Δα defined in Theorem 3.5 are Σ and Δ defined in Self and Prentice (1988), respectively. Hence, ΣŨ in Theorem 3.5 is Σ + Δ, the asymptotic covariance matrix of the case-cohort score statistic in Self and Prentice (1988), in the absence of IPWs.

Condition E

If the treatment process A(·) is bounded (as assumed here) and B and F are satisfied, then E holds trivially.

Condition F

eW(k) can be interpreted as the weighted average of a treatment function with the weights taking an inverse-probability weighted exponential form. The positive definite condition on Σ in Andersen and Gill (1982) can easily be extended to the ΣW(k) assuming W(t) are bounded away from zero on t [set membership] [0, 1].

Condition G

This is the same as condition G in Self and Prentice (1988), incorporating individual-specific time-varying weights Wi(t)(i = 1, …, n). Conditions A–F are sufficient to prove consistency and asymptotic normality of [beta]. To show asymptotic properties of beta and β*, this additional condition is required to ensure asymptotic stability of certain quantities estimated using subcohort data.


Supplementary Materials

The online supplementary material contains proofs corresponding to consistency and asymptotic normality of MSCM parameter estimators along with consistency of the cumulative baseline hazard estimator proposed in Section 5.1, implementation of the method using standard survival analysis software such as R or SAS, and additional simulation study results including performance of the proposed baseline cumulative hazard estimator. A summary of notation introduced in the main text and the supplement is presented.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


  • Andersen PK, Gill RD. Cox’s regression model for counting processes: A large sample study. The Annals of Statistics. 1982;10:1100–1120.
  • Barlow WE. Robust variance estimation for the case-cohort design. Biometrics. 1994;50:1064–1072. [PubMed]
  • Barlow WE, Ichikawa L, Rosner D, Izumi S. Analysis of case-cohort designs - A prospective cohort study. Journal of Clinical Epidemiology. 1999;52:1165–1172. [PubMed]
  • Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J. Exposure stratified case-cohort designs. Lifetime Data Analysis. 2000;6:39–58. [PubMed]
  • Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Stat Biosc. 2009a;1:32–49. [PMC free article] [PubMed]
  • Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Using the whole cohort in the analysis of case-cohort data. American Journal of Epidemiology. 2009b;169:1398–1405. [PMC free article] [PubMed]
  • Cole SR, Hernán MA, Robins JM, Anastos K, Chmiel J, Detels R, Ervin C, Feldman J, Greenblatt R, Kingsley L, Lai S, Young M, Cohen M, Munoz A. Effect of highly active antiretroviral therapy on time to acquired immunodeficiency syndrome or death using marginal structural models. American Journal of Epidemiology. 2003;158:687–694. [PubMed]
  • Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology. 2008;168:656–664. [PMC free article] [PubMed]
  • Cole SR, Hudgens MG, Tien PC, Anastos K, Kingsley L, Chmiel JS, Jacobson LP. Marginal structural models for case-cohort study designs to estimate the association of antiretroviral therapy initiation with incident AIDS or death. American Journal of Epidemiology. 2012;175:381–390. erratum: 175 (7), 732. [PMC free article] [PubMed]
  • Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11:561–570. [PubMed]
  • Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. Journal of the American Statistical Association. 2001;96:440–448.
  • Kulich M, Lin D. Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association. 2004;99:832–844.
  • Kong L, Cai J, Sen PK. Asymptotic results for fitting semiparametric transformation models to failure time data from case-cohort studies. Statistica Sinica. 2006;16:135–151.
  • Kong L, Cai J. Case-cohort analysis with accelerated failure time model. Biometrics. 2009;65:135–142. [PMC free article] [PubMed]
  • Lin DY, Ying Z. Cox regression with incomplete covariate measurements. Journal of American Statistical Association. 1993;88:1341–1349.
  • Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
  • Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period - Application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512.
  • Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell NP, Dietz K, Farewell VT, editors. AIDS Epidemiology - Methodological Issues. Birkhäuser; Boston: 1992. pp. 297–331.
  • Robins JM. Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers. Proceedings of the Biopharmaceutical Section, American Statistical Associtaion. 1993:24–33.
  • Robins JM. Correction for non-compliance in equivalence trials. Statistics in Medicine. 1998;17:269–302. [PubMed]
  • Robins JM. Marginal structural models versus structural nested models as tools for causal inference. In: Halloran ME, Berry DA, editors. Statistical Models Epidemiology: The Environment and Clinical Trials. Springer-Verlag; NY: 1999. pp. 95–134.
  • Robins JM, Rotnitzky A, Scharfstein D. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Halloran M, Berry D, editors. Statistical Models in Epidemiology: The Environment and Clinical Trials. Springer-Verlag; NY: 1999. pp. 1–92.
  • Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. The Annals of Statistics. 1988;16:64–81.
  • Therneau TM, Li H. Computing the Cox model for case cohort designs. Lifetime Data Analysis. 1999;5:99–112. [PubMed]
  • Therneau T. A Package for Survival Analysis in S. R package version 2.37-4. 2013
  • VanderWeele TJ. Concerning the consistency assumption in causal inference. Epidemiology. 2009;20:880–883. [PubMed]
  • Xiao Y, Abrahamowicz M, Moodie EEM. Accuracy of conventional and marginal structural Cox model estimators: A simulation study. The International Journal of Biostatistics. 2010;6:Article 13. [PubMed]