Biometrics. Author manuscript; available in PMC 2010 June 29.
Published in final edited form as:
PMCID: PMC2889133
NIHMSID: NIHMS117764

Cox Regression in Nested Case-Control Studies with Auxiliary Covariates

SUMMARY

Nested case-control (NCC) design is a popular sampling method in large epidemiologic studies for its cost effectiveness to investigate the temporal relationship of diseases with environmental exposures or biological precursors. Thomas’ maximum partial likelihood estimator is commonly used to estimate the regression parameters in Cox’s model for NCC data. In this paper, we consider a situation that failure/censoring information and some crude covariates are available for the entire cohort in addition to NCC data and propose an improved estimator that is asymptotically more efficient than Thomas’ estimator. We adopt a projection approach that, heretofore, has only been employed in situations of random validation sampling and show that it can be well adapted to NCC designs where the sampling scheme is a dynamic process and is not independent for controls. Under certain conditions, consistency and asymptotic normality of the proposed estimator are established and a consistent variance estimator is also developed. Furthermore, a simplified approximate estimator is proposed when the disease is rare. Extensive simulations are conducted to evaluate the finite sample performance of our proposed estimators and to compare the efficiency with Thomas’ estimator and other competing estimators. Moreover, sensitivity analyses are conducted to demonstrate the behavior of the proposed estimator when model assumptions are violated, and we find that the biases are reasonably small in realistic situations. We further demonstrate the proposed method with data from studies on Wilms’ tumor.

Keywords: Counting process, Cox proportional hazards model, Martingale, Risk set sampling, Survival analysis

1. Introduction

Because it is cost-effective for studying the temporal relationship between disease and exposures, nested case-control (NCC) sampling (Thomas, 1977; Oakes, 1981) has been considered a useful alternative to the cohort design and the case-control design. The most commonly used analytical approach for NCC data is Thomas’ maximum partial likelihood estimation (Thomas, 1977; Oakes, 1981) under the Cox proportional hazards model (Cox, 1972). The consistency and asymptotic normality of Thomas’ estimator have been formally established using counting process and martingale theory (Goldstein and Langholz, 1992). Recently, Chen (2004) proposed a partial likelihood based local-averaging estimator that is more efficient than Thomas’ estimator away from the null.

Furthermore, in the presence of extended NCC data (Chen, 2004), which consist of failure/censoring times and indices for the full cohort and entire covariate histories for the cases and selected controls, a number of methods have been proposed to improve the estimation efficiency: e.g. the inverse probability weighted (IPW) method (Robins et al., 1994; Samuelsen, 1997); the local-average estimation approach (Chen, 2004); and the likelihood-based approaches (Chen and Little, 1999; Scheike and Juul, 2004; Zeng et al., 2006). Since parent cohorts of NCC studies are usually well-characterized, carefully followed epidemiological cohorts, the failure/censoring information on the entire cohort is often available. In many studies, however, the true exposure covariates may be difficult or expensive to assemble for the full cohort, or their entire history may be difficult or expensive to measure for the cases and selected controls. Instead, some auxiliary covariates, such as crude measurements of the exposure or covariates inferred from questionnaires, can be easily or cheaply assembled for the entire cohort. The aims of this paper are to incorporate the failure/censoring information and auxiliary covariates from the entire cohort into the analysis of NCC data and to propose an easily computed estimator that is asymptotically more efficient than Thomas’ estimator.

Towards this goal, we propose to adopt a projection technique which has been used to improve the efficiency of various models in cohort studies with random validation sampling, such as general linear regression models (Chen and Chen, 2000), Cox’s model (Chen, 2002), and the additive hazards model (Jiang and Zhou, 2007). To the best of our knowledge, the projection method has heretofore been studied only for random validation sampling, and its adaptation to NCC sampling entails new challenges, primarily due to the non-independent sampling scheme of the NCC design. Statistical inference thus cannot rely on the conventional central limit theory for independent observations. In this paper, we show that the projection method can be well adapted to the NCC design under certain conditions and leads to an improved estimator that is guaranteed to achieve an asymptotic variance no bigger than that of Thomas’ estimator.

The rest of this article is organized as follows. In Section 2, we derive the proposed estimator and its asymptotic properties and present a practical computation procedure. A rare-disease approximate estimator is also provided and some inference remarks are discussed. In Section 3, extensive simulation studies are conducted to evaluate the performance of our proposed estimators under various practical settings. An illustration with a real dataset from Wilms’ tumor studies is also provided. We conclude with some discussions in Section 4 and provide all the technical details in Supplementary Material.

2. Projection Estimator and Statistical Inference

Consider a full cohort of size n. Let $\{T_i^*, C_i, Z_i(\cdot),\ i=1,\ldots,n\}$ denote n i.i.d. triplets of failure times, censoring times, and p-dimensional covariate processes of interest. Define $T_i=\min(T_i^*,C_i)$, $\delta_i=I(T_i^*\le C_i)$, $N_i(t)=\delta_i I(T_i\le t)$ and $Y_i(t)=I(T_i\ge t)$, where I(·) denotes the indicator function throughout. An NCC study identifies cases as subjects with $\delta_i=1$ and randomly samples (m−1) controls without replacement from the risk set at each failure time, excluding the failed subject itself. For a given case i, let $R_i^*$ denote the indices of the (m − 1) selected controls and define $\tilde R_i=R_i^*\cup\{i\}$. The true covariates are then ascertained for all the cases and selected controls. Therefore, for a standard NCC design, the observed data consist of $\{T_i, Z_i(T_i), Z_j(T_i): \delta_i=1,\ j\in R_i^*,\ i=1,\ldots,n\}$. As discussed in the Introduction, in addition to the data collected by the NCC sampling, we consider the situation in which the failure/censoring information and some auxiliary covariates, i.e. $\{T_i,\delta_i,X_i(t): 0\le t\le T_i,\ i=1,\ldots,n\}$, are also collected for the entire cohort, where $X_i(t)$ denotes the q-dimensional auxiliary covariate process of subject i.
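The risk-set sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sample_ncc`, its arguments, and the toy data are assumptions introduced here, and ties in failure times are not handled specially.

```python
import random

def sample_ncc(times, events, m, rng=None):
    """Return {case index: list of sampled control indices}.

    times[i] is the observed time T_i and events[i] is delta_i (1 = failure).
    For each case, m - 1 controls are drawn without replacement from the
    subjects still at risk at the case's failure time (T_j >= T_i),
    excluding the case itself.
    """
    rng = rng or random.Random()
    sampled = {}
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # only failures trigger sampling
        at_risk = [j for j, t_j in enumerate(times) if t_j >= t_i and j != i]
        k = min(m - 1, len(at_risk))  # take fewer controls if the risk set is small
        sampled[i] = rng.sample(at_risk, k)
    return sampled

# Hypothetical toy cohort: six subjects, two failures (subjects 0 and 2).
times = [2.0, 3.5, 1.2, 4.0, 5.0, 0.7]
events = [1, 0, 1, 0, 0, 0]
controls = sample_ncc(times, events, m=3, rng=random.Random(1))
```

A subject may be selected as a control for more than one case, and a future case may serve as a control for an earlier case, as in the standard NCC design.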

Assume that, given the true covariate Z(·), T* follows a Cox proportional hazards model

$$\lambda\{t \mid \bar Z(t)\}=\lambda_0(t)\exp\{\beta_0' Z(t)\},$$
(1)

where $\bar Z(t)=\{Z(s): 0\le s\le t\}$, $\lambda_0(t)$ is an unspecified baseline hazard function and $\beta_0$ is a p-dimensional vector of parameters of interest. Furthermore, we assume that the censoring time C is independent of the failure time T* given Z.

2.1 Thomas’ estimator under the true model

Thomas’ estimator, denoted by $\hat\beta$, is the solution to

$$U_Z(\beta)=\sum_{i=1}^{n}\int_0^{\tau}\{Z_i(t)-E_{Z,\tilde R_i}(t;\beta)\}\,dN_i(t)=0,$$
(2)

where τ = inf{t : pr(T > t) = 0} and $E_{Z,w}(t;\beta)=\sum_{j\in w}e^{\beta' Z_j(t)}Z_j(t)\big/\sum_{j\in w}e^{\beta' Z_j(t)}$ for a set w. Oakes (1981) showed that Thomas’ estimator maximizes the partial likelihood, and Goldstein and Langholz (1992) proved that, under certain regularity conditions,

$$n^{1/2}(\hat\beta-\beta_0)\to N(0,\Gamma^{-1}),$$
(3)

as n → ∞, where $\Gamma=-\lim_{n\to\infty}n^{-1}\,\partial U_Z(\beta)/\partial\beta'\,|_{\beta=\beta_0}$.
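To make the score equation (2) concrete, the following sketch solves it by Newton-Raphson for the simplest case of a single time-fixed covariate (p = 1). The function names, the data layout (a list of case/control index pairs) and the toy values are assumptions for illustration only; in practice a conditional logistic regression routine would normally be used.

```python
import math

def score_and_info(beta, ncc_sets, z):
    """Score U_Z(beta) and minus its derivative for a scalar covariate."""
    score, info = 0.0, 0.0
    for case, ctrls in ncc_sets:
        members = [case] + list(ctrls)          # the sampled risk set R~_i
        w = [math.exp(beta * z[j]) for j in members]
        s0 = sum(w)
        mean = sum(wj * z[j] for wj, j in zip(w, members)) / s0
        second = sum(wj * z[j] ** 2 for wj, j in zip(w, members)) / s0
        score += z[case] - mean                 # Z_i - E_{Z,R~_i}(beta)
        info += second - mean ** 2              # weighted variance within the set
    return score, info

def thomas_estimate(ncc_sets, z, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson solution of U_Z(beta) = 0; returns (beta_hat, information)."""
    for _ in range(max_iter):
        u, info = score_and_info(beta, ncc_sets, z)
        if info <= 0:
            break  # no variation within the sampled sets; cannot update
        step = u / info
        beta += step
        if abs(step) < tol:
            break
    return beta, score_and_info(beta, ncc_sets, z)[1]

# Hypothetical data: four cases, each with two sampled controls.
z = [1.0, 0.0, 0.5, 0.2, 0.9, 0.4, 1.5, 1.0, 0.3, 0.8, 0.1, 0.6]
ncc_sets = [(0, [1, 2]), (3, [4, 5]), (6, [7, 8]), (9, [10, 11])]
beta_hat, information = thomas_estimate(ncc_sets, z)
```

Dividing the returned information by n gives the scalar analogue of the estimate of Γ used later in the variance formulas.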

2.2 Estimators under a working model

To utilize the auxiliary information available on the full cohort, we assume a working Cox’s model specified by α0(t) exp{γ′X(t)}. We first introduce extra notation:

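$$S^{(k)}(t;\gamma)=n^{-1}\sum_{i=1}^{n}Y_i(t)e^{\gamma' X_i(t)}X_i^{\otimes k}(t)\quad\text{and}\quad S^{(k)}(t)=n^{-1}\sum_{i=1}^{n}Y_i(t)\lambda_i(t)X_i^{\otimes k}(t),$$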

where k = 0, 1, 2, and for a vector a, $a^{\otimes 0}=1$, $a^{\otimes 1}=a$ and $a^{\otimes 2}=aa'$; $\lambda_i(t)$ generically denotes the true hazard function of subject i. Let $s^{(k)}(t;\gamma)=E\{S^{(k)}(t;\gamma)\}$ and $s^{(k)}(t)=E\{S^{(k)}(t)\}$, where the expectation is taken with respect to the joint distribution of (T, δ, X).

Let $\tilde\gamma$ denote the full-cohort maximum partial likelihood estimator under the working model, defined as the solution to

$$\tilde U(\gamma)\equiv\sum_{i=1}^{n}\int_0^{\tau}\{X_i(t)-\bar X(t,\gamma)\}\,dN_i(t)=0,$$
(4)

where $\bar X(t,\gamma)=S^{(1)}(t;\gamma)/S^{(0)}(t;\gamma)$. Lin and Wei (1989) showed that $\tilde\gamma$ converges in probability to a constant vector γ*, which is the unique solution to

$$\int_0^{\tau}\left\{s^{(1)}(t)-\frac{s^{(1)}(t;\gamma)}{s^{(0)}(t;\gamma)}\,s^{(0)}(t)\right\}dt=0,$$

provided that the matrix

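$$A=-\left\{\lim_{n\to\infty}n^{-1}\frac{\partial\tilde U(\gamma)}{\partial\gamma'}\Big|_{\gamma=\gamma^*}\right\}=\int_0^{\tau}\left\{\frac{s^{(2)}(t;\gamma^*)}{s^{(0)}(t;\gamma^*)}-\left(\frac{s^{(1)}(t;\gamma^*)}{s^{(0)}(t;\gamma^*)}\right)^{\otimes 2}\right\}s^{(0)}(t)\,dt$$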

is positive definite. Furthermore, Lin and Wei (1989) showed that, as n → ∞,

$$n^{1/2}(\tilde\gamma-\gamma^*)\to N(0,A^{-1}BA^{-1}),$$
(5)

under certain regularity conditions, where $B=\lim_{n\to\infty}\mathrm{Var}\{n^{-1/2}\tilde U(\gamma^*)\}$.

Next, we derive another consistent estimator for γ* based on the auxiliary covariates of those subjects selected by the NCC sampling. To achieve this, we impose the following conditions on the auxiliary covariates X:

  • C1. given the true covariates Z(·), X(·) is independent of T* and C
  • C2. there exist $\breve\alpha_0(\cdot)$ and $\breve\gamma$ such that the induced hazard function of T* given X(·) has a proportional form, i.e.
    $$\lambda\{t\mid\bar X(t)\}=\breve\alpha_0(t)\exp\{\breve\gamma' X(t)\}.$$
    (6)

Condition C1 indicates that X is a true surrogate of Z, which is commonly assumed in many studies of surrogacy. Condition C2 ensures that Thomas’ estimator based on the auxiliary covariates estimates the same quantity as the full-cohort estimator $\tilde\gamma$ under the working model (Xiang and Langholz, 1999). Therefore, let $\hat\gamma$ be the solution to

$$U_X(\gamma)=\sum_{i=1}^{n}\int_0^{\tau}\{X_i(t)-E_{X,\tilde R_i}(t;\gamma)\}\,dN_i(t)=0,$$
(7)

where $E_{X,w}(t;\gamma)=\sum_{j\in w}e^{\gamma' X_j(t)}X_j(t)\big/\sum_{j\in w}e^{\gamma' X_j(t)}$ for a set w. Xiang and Langholz (2003) showed that

$$n^{1/2}(\hat\gamma-\gamma^*)\to N(0,I^{-1}VI^{-1}),$$
(8)

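in distribution as n → ∞, where $I=-\lim_{n\to\infty}n^{-1}\,\partial U_X(\gamma)/\partial\gamma'\,|_{\gamma=\gamma^*}$ and $V=\lim_{n\to\infty}\mathrm{Var}\{n^{-1/2}U_X(\gamma^*)\}$.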

Assumption C2 is required for rigorous theoretical justification, but in general it may not hold exactly (Prentice, 1982). Note that the primary interest here is how well the NCC estimator $\hat\gamma$ approximates the full-cohort estimator $\tilde\gamma$ under the working model, rather than how the working model deviates from the true model. Although the limiting difference $\hat\gamma-\tilde\gamma$ may not be exactly zero, a non-negligible difference generally does not occur unless the misspecification is unreasonably large, as noted in Xiang and Langholz (1999). In addition, under the rare-disease assumption that often holds in NCC studies, the induced hazard function can be adequately approximated by $\lambda_0(t)E\{\exp(\beta' Z(t))\mid X(t)\}$, which further relaxes the assumption. We investigate the impact of condition C2 on the parameter estimation in our simulation studies.

2.3 Projection estimator and its asymptotic properties

Following the projection idea used in Chen and Chen (2000), Chen (2002) and Jiang and Zhou (2007), we incorporate the information available on the entire cohort, i.e. $\{(T_i,\delta_i,X_i),\ i=1,\ldots,n\}$, into the estimation of β by considering the joint limiting distribution of $\{n^{1/2}(\hat\beta-\beta_0)',\ n^{1/2}(\hat\gamma-\tilde\gamma)'\}'$. We introduce some notation. Let $r=\{1,\ldots,m\}$, $Y_r(t)=\prod_{i\in r}Y_i(t)$, $P_Y(t)=\mathrm{pr}\{Y_1(t)=1\}$, and $\bar x(t;\gamma)=s^{(1)}(t;\gamma)/s^{(0)}(t;\gamma)$. Moreover, define

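$$
\begin{aligned}
K_1&=\int_0^{\tau}P_Y(t)\,E\Big[m^{-1}\sum_{i\in r}\{Z_i(t)-E_{Z,r}(t;\beta_0)\}\{X_i(t)-E_{X,r}(t;\gamma^*)\}'\lambda_i(t)\,\Big|\,Y_r(t)=1\Big]dt,\\
K_2&=\int_0^{\tau}P_Y(t)\,E\Big[m^{-1}\sum_{i\in r}\{Z_i(t)-E_{Z,r}(t;\beta_0)\}\{X_i(t)-\bar x(t;\gamma^*)\}'\lambda_i(t)\,\Big|\,Y_r(t)=1\Big]dt,\\
\Sigma_1&=\int_0^{\tau}P_Y(t)\,E\Big[m^{-1}\sum_{i\in r}\{X_i(t)-E_{X,r}(t;\gamma^*)\}\{X_i(t)-\bar x(t;\gamma^*)\}'\lambda_i(t)\,\Big|\,Y_r(t)=1\Big]dt,\\
\Sigma_2&=\int_0^{\tau}\!\!\int_0^{\tau}P_Y(t)\,\mathrm{cov}\Big[\sum_{i\in r}\{X_i(t)-E_{X,r}(t;\gamma^*)\}\lambda_i(t),\ Y_1(s)\{X_1(s)-\bar x(s;\gamma^*)\}\Big\{\lambda_1(s)-e^{\gamma^{*\prime}X_1(s)}\frac{s^{(0)}(s)}{s^{(0)}(s;\gamma^*)}\Big\}\,\Big|\,Y_r(t)=1\Big]dt\,ds.
\end{aligned}
$$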

PROPOSITION 1

Under conditions C1 and C2, and the regularity conditions given in the Web Appendix A,

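$$n^{1/2}\begin{pmatrix}\hat\beta-\beta_0\\ \hat\gamma-\tilde\gamma\end{pmatrix}\to N\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}\Gamma^{-1}&\Delta\\ \Delta'&\Omega\end{pmatrix}\right),$$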

in distribution as n → ∞, where

$$\Delta=\Gamma^{-1}K_1I^{-1}-\Gamma^{-1}K_2A^{-1},$$
(9)

$$\Omega=I^{-1}VI^{-1}+A^{-1}BA^{-1}-2I^{-1}(\Sigma_1+\Sigma_2)A^{-1}.$$
(10)

The proof of Proposition 1 is given in the Web Appendix A. By Proposition 1 and the multivariate normal distribution theory, we have,

$$E\{n^{1/2}(\hat\beta-\beta_0)\mid(\hat\gamma-\tilde\gamma)\}=n^{1/2}\Delta\Omega^{-1}(\hat\gamma-\tilde\gamma).$$

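It is easy to see that Γ, I and A can be consistently estimated by $\hat\Gamma=-n^{-1}\partial U_Z(\hat\beta)/\partial\beta'$, $\hat I=-n^{-1}\partial U_X(\hat\gamma)/\partial\gamma'$ and $\hat A=-n^{-1}\partial\tilde U(\tilde\gamma)/\partial\gamma'$, respectively. Furthermore, let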

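$$
\begin{aligned}
\hat K_1&=n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{Z_i(t)-E_{Z,\tilde R_i}(t;\hat\beta)\}\{X_i(t)-E_{X,\tilde R_i}(t;\hat\gamma)\}'\,dN_i(t),\\
\hat K_2&=n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{Z_i(t)-E_{Z,\tilde R_i}(t;\hat\beta)\}\{X_i(t)-\bar X(t,\hat\gamma)\}'\,dN_i(t).
\end{aligned}
$$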

Under the regularity conditions, the consistency of $\hat K_1$ and $\hat K_2$ follows easily from Lemma 1 in the supplementary material of Xiang and Langholz (2003). Therefore, the covariance component Δ can be consistently estimated by $\hat\Delta=\hat\Gamma^{-1}\hat K_1\hat I^{-1}-\hat\Gamma^{-1}\hat K_2\hat A^{-1}$.

Next, examining the components of Ω in (10), we note that $\Sigma_2$ has a very complicated expression and it is not straightforward to construct a consistent estimator in general. Thus, we propose to use the bootstrap method (Efron, 1979) to estimate Ω. The bootstrap approach is feasible here because the auxiliary covariates are available on the entire cohort. More specifically, in the jth bootstrap run, j = 1, ..., J, where J is a large number, we first randomly sample n subjects from the full cohort with replacement. Then, for each case in this bootstrapped sample, we randomly select m−1 controls from the risk set at that case's failure time, excluding the case itself, and thus obtain a new NCC dataset. Next, we obtain $\tilde\gamma^{(j)}$ and $\hat\gamma^{(j)}$ by fitting the working model to the jth bootstrapped full-cohort data and NCC data, respectively. The empirical variance-covariance matrix of $[n^{1/2}\{\hat\gamma^{(j)}-\tilde\gamma^{(j)}\},\ j=1,\ldots,J]$ yields a consistent estimator for Ω, denoted by $\hat\Omega$. The algorithm does not require any complex variance formula or much programming effort and can be easily implemented in many existing statistical software packages.
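The bootstrap procedure just described might be organized as follows. This is a sketch for a scalar auxiliary covariate (q = 1), not the authors' implementation: the function names are hypothetical, and the two model fits are passed in as callables because any working-model fitting routine could be used. The toy callables in the usage lines are crude stand-ins, not Cox fits, and exist only so that the sketch runs end to end.

```python
import random
import statistics

def draw_ncc(times, events, m, rng):
    """One (case, controls) pair per failure; controls drawn from the risk set."""
    sets = []
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            at_risk = [j for j, t_j in enumerate(times) if t_j >= t_i and j != i]
            sets.append((i, rng.sample(at_risk, min(m - 1, len(at_risk)))))
    return sets

def bootstrap_omega(times, events, x, m, fit_full, fit_ncc, n_boot=200, seed=0):
    """Bootstrap estimate of Omega for a scalar auxiliary covariate.

    fit_full(times, events, x) -> gamma_tilde  (full-cohort working-model fit)
    fit_ncc(ncc_sets, x)       -> gamma_hat    (NCC working-model fit)
    """
    rng = random.Random(seed)
    n = len(times)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample the cohort
        t_b = [times[i] for i in idx]
        d_b = [events[i] for i in idx]
        x_b = [x[i] for i in idx]
        ncc_sets = draw_ncc(t_b, d_b, m, rng)        # redraw the NCC sample
        diffs.append(n ** 0.5 * (fit_ncc(ncc_sets, x_b) - fit_full(t_b, d_b, x_b)))
    return statistics.variance(diffs)                # empirical variance over runs

# Toy stand-ins for the two fits (placeholders only, NOT Cox estimators).
def toy_full(times, events, x):
    cases = [xi for xi, di in zip(x, events) if di == 1]
    return (statistics.fmean(cases) if cases else 0.0) - statistics.fmean(x)

def toy_ncc(ncc_sets, x):
    vals = [x[c] - statistics.fmean([x[j] for j in ctrls])
            for c, ctrls in ncc_sets if ctrls]
    return statistics.fmean(vals) if vals else 0.0

gen = random.Random(42)
x = [gen.gauss(0.0, 1.0) for _ in range(40)]
times = [gen.random() for _ in range(40)]
events = [1 if k % 3 == 0 else 0 for k in range(40)]
omega_hat = bootstrap_omega(times, events, x, 3, toy_full, toy_ncc, n_boot=50, seed=1)
```

Resampling with replacement creates duplicated subjects and hence tied times; this sketch simply keeps duplicates in the risk set, which is one of several possible conventions.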

After obtaining the estimates $\hat\Delta$ and $\hat\Omega$, an improved estimator for β can be constructed as $\tilde\beta=\hat\beta-\hat\Delta\hat\Omega^{-1}(\hat\gamma-\tilde\gamma)$. Based on Proposition 1, it is easy to show that $n^{1/2}(\tilde\beta-\beta_0)$ is asymptotically normal with mean zero and variance-covariance matrix $\Gamma^{-1}-\Delta\Omega^{-1}\Delta'$. Therefore, the asymptotic variance of $\tilde\beta$ is guaranteed to be no bigger than that of Thomas’ estimator and can be consistently estimated by $\hat\Gamma^{-1}-\hat\Delta\hat\Omega^{-1}\hat\Delta'$.
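The variance expression can be checked directly from the joint limit in Proposition 1. As a sketch, write $a=n^{1/2}(\hat\beta-\beta_0)$ and $b=n^{1/2}(\hat\gamma-\tilde\gamma)$, and treat the estimates of Δ and Ω as equal to their limits (an assumption that is justified only asymptotically):

```latex
\begin{aligned}
\operatorname{Var}(a-\Delta\Omega^{-1}b)
 &= \operatorname{Var}(a)
    -\operatorname{Cov}(a,b)\,\Omega^{-1}\Delta'
    -\Delta\Omega^{-1}\operatorname{Cov}(b,a)
    +\Delta\Omega^{-1}\operatorname{Var}(b)\,\Omega^{-1}\Delta' \\
 &= \Gamma^{-1}-\Delta\Omega^{-1}\Delta'-\Delta\Omega^{-1}\Delta'
    +\Delta\Omega^{-1}\Omega\,\Omega^{-1}\Delta' \\
 &= \Gamma^{-1}-\Delta\Omega^{-1}\Delta'.
\end{aligned}
```

Because Ω is a covariance matrix, $\Delta\Omega^{-1}\Delta'$ is positive semidefinite, which is why the asymptotic variance cannot exceed $\Gamma^{-1}$.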

2.4 Inference remarks and rare-disease approximation

Two observations are worth making when comparing the projection approach under random validation sampling and under NCC sampling. First, in the methods proposed for random validation sampling, all estimating equations can be rewritten asymptotically as sums of independent mean-zero terms. In our procedure, however, the estimating functions $U_Z(\beta)$ and $U_X(\gamma)$ based on NCC data do not have such independent representations, and thus entail new technical challenges in establishing the asymptotic properties of the proposed estimator. Second, although $\hat\gamma$ and $\tilde\gamma$ converge in probability to the same limit γ* under certain conditions, they do not have the same limiting distribution unless m → ∞; see (5) and (8). In random validation sampling, by contrast, the validation-set estimator and the full-cohort estimator always converge to the same limiting distribution and have the same asymptotic covariance with the validation-set estimator based on the true model. In the context of NCC sampling, the asymptotic variance of $n^{1/2}(\hat\gamma-\tilde\gamma)$ has a much more complicated form, as shown in Proposition 1, and we thus propose to estimate it using the bootstrap method. In summary, these complications are rooted in the non-independent sampling scheme of the NCC design.

When the disease is rare, as in many NCC studies, the proposed projection estimator can be well approximated by a plug-in type estimator because the estimation of the variance component V can be greatly simplified (Xiang and Langholz, 2003) and $\Sigma_2$ is approximately negligible. More specifically, we first propose a rare-disease estimator for Ω given by $\hat\Omega_r=\hat I^{-1}\hat V_r\hat I^{-1}+\hat A^{-1}\hat B\hat A^{-1}-2\hat I^{-1}\hat\Sigma_1\hat A^{-1}$, where

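$$
\begin{aligned}
\hat V_r&=n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{X_i(t)-E_{X,\tilde R_i}(t;\hat\gamma)\}^{\otimes 2}\,dN_i(t),\\
\hat B&=n^{-1}\sum_{i=1}^{n}\left[\int_0^{\tau}\{X_i(t)-\bar X(t,\tilde\gamma)\}\left\{dN_i(t)-\frac{Y_i(t)e^{\tilde\gamma' X_i(t)}\sum_j dN_j(t)}{nS^{(0)}(t;\tilde\gamma)}\right\}\right]^{\otimes 2},\\
\hat\Sigma_1&=n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{X_i(t)-E_{X,\tilde R_i}(t;\hat\gamma)\}\{X_i(t)-\bar X(t,\tilde\gamma)\}'\,dN_i(t).
\end{aligned}
$$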

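Therefore, the rare-disease approximate estimator is defined as $\tilde\beta_r=\hat\beta-\hat\Delta\hat\Omega_r^{-1}(\hat\gamma-\tilde\gamma)$ and its variance estimator is given by $\hat\Gamma^{-1}-\hat\Delta\hat\Omega_r^{-1}\hat\Delta'$.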
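Once the components are available, the final projection step is a one-line correction. The sketch below covers only the scalar case p = q = 1, so that matrix inverses reduce to division; the function name and argument names are assumptions for illustration, and the same code applies whether the estimate of Ω comes from the bootstrap or from the rare-disease formula.

```python
import math

def projection_update(beta_hat, gamma_hat, gamma_tilde, info_z, delta, omega, n):
    """Scalar (p = q = 1) projection step.

    info_z, delta and omega are the estimates of Gamma, Delta and Omega,
    all on the n^{1/2} scale used in Proposition 1.
    Returns (projected estimate, its standard error).
    """
    beta_proj = beta_hat - delta / omega * (gamma_hat - gamma_tilde)
    avar = 1.0 / info_z - delta ** 2 / omega   # Gamma^{-1} - Delta Omega^{-1} Delta'
    return beta_proj, math.sqrt(avar / n)      # divide by n to undo the n^{1/2} scaling

# Hypothetical inputs chosen only to show the arithmetic.
beta_proj, se = projection_update(
    beta_hat=-1.0, gamma_hat=-0.8, gamma_tilde=-0.9,
    info_z=0.5, delta=1.0, omega=1.0, n=2000,
)
```

With these made-up inputs the correction moves the estimate from −1.0 to about −1.1 and the asymptotic variance drops from 2 to 1 on the n^{1/2} scale.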

3. Numerical Studies

3.1 Simulations under correct model conditions

We first investigate the finite-sample performance of the proposed estimator and the rare-disease estimator by extensive simulations. We compare the efficiency of the proposed estimator with Thomas’ estimator, a local-averaging estimator (Chen, 2004), and an IPW estimator (Samuelsen, 1997). We consider the following scenarios:

  • S1. independent auxiliary covariate: Z and X are independently and identically distributed
  • S2. normal auxiliary covariate measured with normal error: $X=Z+\varepsilon$, $\varepsilon\sim N(0,\sigma_\varepsilon^2)$.

The true covariate is $Z\sim N(2,0.5^2)$ and $\sigma_\varepsilon$ = 0.5 or 0.2. We generate the failure time T* from a Cox model $\lambda(t\mid Z)=\lambda_0e^{\beta Z}$, where three values of β are considered: 0, −0.5 and −1. We examine two censoring scenarios: random censoring, where C ~ U(0, 5), and covariate-dependent censoring, where the censoring time C is generated uniformly on (0, min(3|Z|, 5)). The value of $\lambda_0$ is chosen to control the disease incidence rate at 6% to 7%. Under S1, we examine the robustness of the proposed estimator with a completely independent (wrong) surrogate covariate. Scenario S2 is a classical measurement error model and it is easy to see that conditions C1 and C2 are satisfied (Xiang and Langholz, 1999). We consider a cohort size of 2000 and NCC studies with 2 or 4 controls. For Chen’s estimator, we set the local-average bandwidth to $2n^{-1/3}$. For the IPW estimator, the weight function is defined as $\pi_i=\delta_i+(1-\delta_i)V_{0i}/p_{0i}$, where $V_{0i}$ is the indicator of subject i ever being selected as a control and $p_{0i}=1-\prod_{T_j\le T_i}\left\{1-\frac{m-1}{\sum_kY_k(T_j)-1}\,\delta_j\right\}$. We run 500 simulations for each setting and the number of bootstrap samples is set to 500.
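For readers who want to reproduce the flavor of scenario S2 with random censoring, a data-generating sketch follows. The function name and the default `lambda0=0.07` are assumptions made here: the text states only that the baseline hazard was chosen to give an incidence of 6% to 7%, and 0.07 is a value that gives roughly that incidence when β = −0.5. It is not a value reported by the authors.

```python
import math
import random

def simulate_cohort(n=2000, beta=-0.5, sigma_eps=0.5, lambda0=0.07, seed=0):
    """Generate one cohort under scenario S2 with C ~ U(0, 5).

    Each record holds the observed time, the event indicator, the true
    covariate z and the error-prone auxiliary covariate x = z + eps.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        z = rng.gauss(2.0, 0.5)                  # true covariate, N(2, 0.5^2)
        x = z + rng.gauss(0.0, sigma_eps)        # auxiliary covariate with normal error
        rate = lambda0 * math.exp(beta * z)      # constant hazard lambda0 * exp(beta * z)
        t_star = rng.expovariate(rate)           # failure time
        c = rng.uniform(0.0, 5.0)                # random censoring time
        cohort.append({"time": min(t_star, c), "event": int(t_star <= c),
                       "z": z, "x": x})
    return cohort

cohort = simulate_cohort()
incidence = sum(rec["event"] for rec in cohort) / len(cohort)
```

For other values of β the baseline hazard would need to be retuned to keep the incidence in the same range.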

Simulation results under the random censoring are summarized in Table 1 and those under the covariate-dependent censoring are presented in Table 2. In all the scenarios, the proposed estimator shows negligible biases. The estimated standard errors (SE), using the proposed bootstrap method, are close to the sample standard deviation (SD) of the estimates. Thus, the 95% Wald-type confidence intervals all have reasonable coverage probabilities (CP). Moreover, under this rare-disease situation, the rare-disease approximate estimator performs well as it yields reasonable coverage probabilities (CP*).

Table 1
Simulation results with independent censoring
Table 2
Simulation results with covariate-dependent censoring

To compare the efficiency of the various approaches, we calculate the empirical relative efficiency for each estimator, defined as the ratio of the sample variances of the estimator and of the full-cohort maximum partial likelihood estimator under the true model, with the latter as the reference. The efficiency results are summarized in Table 1 and Table 2 (see the last four columns). In scenario S1, where the surrogate covariate is completely independent of the true covariate, the proposed estimator shows efficiency comparable to that of Thomas’ estimator because independent surrogate covariates can hardly provide any information to improve Thomas’ estimator $\hat\beta$. Under this scenario, the IPW estimator outperforms the others as the selection probability is accurately estimated and used to recover the original full cohort.

In scenario S2, where a true surrogate covariate is available, the proposed general estimator shows an efficiency gain over Thomas’ estimator, and the gain is larger when the number of controls is small and the measurement error is small. For example, when β = −1, σε = 0.2 and m−1 = 2, the gain of the proposed method over Thomas’ estimator, calculated as (RE/RE1 − 1)%, reaches 61% with covariate-independent censoring and 49% with covariate-dependent censoring. On the other hand, the efficiency of the proposed method approaches the full-cohort efficiency as the measurement error decreases or the number of controls increases. For example, when β = 0, σε = 0.2 and m−1 = 4, the relative efficiency of our estimator reaches 99.1% with covariate-independent censoring and 98.1% with covariate-dependent censoring. Moreover, in all simulations under scenario S2, the proposed estimator always outperforms the other competing estimators.
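The two summary quantities used in this comparison are simple to compute from replicated estimates. The sketch below assumes that relative efficiency is the full-cohort sample variance divided by the estimator's sample variance (an orientation inferred from the reported values being below 100%) and that RE1 denotes the relative efficiency of Thomas’ estimator; both readings, and the function names, are assumptions.

```python
import statistics

def relative_efficiency(estimates, full_cohort_estimates):
    """Sample variance of the full-cohort estimator over that of the estimator."""
    return statistics.variance(full_cohort_estimates) / statistics.variance(estimates)

def gain_over_thomas(re_proposed, re_thomas):
    """Proportional efficiency gain, RE / RE1 - 1."""
    return re_proposed / re_thomas - 1.0

# Hypothetical replicated estimates, for illustration only.
re_demo = relative_efficiency([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0])
gain_demo = gain_over_thomas(0.8, 0.5)
```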

Additional simulations when the disease is common with 15% and 25% incidence rates are presented in the Web Appendix B. We observe that the proposed estimator still performs well but the rare-disease approximation may fail with unsatisfactory coverage probabilities.

3.2 Simulations of sensitivity analysis

In this subsection, we further investigate the properties of the proposed estimator $\tilde\beta$ when conditions C1 and C2 are violated. Primarily, we focus on the violation of condition C2, which has some practical implications, and consider the following scenarios:

  • S3. non-normal covariate and error: X = Z + u where Z ~ U(1, 3) and u ~ U(−1, 1)
  • S4. multiplicative error, i.e. X = εZ where ε ~ exp(1) and $Z\sim N(2,0.5^2)$
  • S5. working model with a missing covariate, i.e. Z = (Z1,Z2) but X = Z1 only
  • S6. informative auxiliary covariate, i.e. X = Z + α log T* + ε.

Under scenarios S3 and S4, the induced hazard function λ(t|X) defined in (6) does not have a proportional exponential form unless β0 = 0. Under S5 with a missing covariate, λ(t|X) generally will not be proportional. We generate dichotomous variables (Z1, Z2) from a multinomial distribution with $\pi_{lk}=\mathrm{pr}(Z_1=l,Z_2=k)$, where k, l = 0, 1, as in Xiang and Langholz (1999). We consider an extreme situation with an odds ratio of 5 by setting $(\pi_{00},\pi_{01},\pi_{10},\pi_{11})$ = (0.2, 0.2, 0.1, 0.5). Note that, under scenarios S3–S5, the surrogate covariate X violates condition C2 only. Finally, under scenario S6, $Z\sim N(2,0.5^2)$, $\varepsilon\sim N(0,0.2^2)$ and we explore the informative auxiliary covariate as in Jiang and Zhou (2007). When α ≠ 0, the auxiliary covariate X in S6 clearly violates both C1 and C2.
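As a quick check of the S5 setup, the stated cell probabilities can be verified to sum to one and to give an odds ratio of 5, and pairs (Z1, Z2) can be drawn from them. The names below are hypothetical; the keys are (l, k) pairs matching the subscripts of the cell probabilities.

```python
import random

# Cell probabilities pr(Z1 = l, Z2 = k), keyed by (l, k), as given in the text.
CELL_PROBS = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.5}

def odds_ratio(p):
    """Cross-product ratio pi00 * pi11 / (pi01 * pi10)."""
    return (p[(0, 0)] * p[(1, 1)]) / (p[(0, 1)] * p[(1, 0)])

def draw_z(n, probs=CELL_PROBS, seed=0):
    """Draw n pairs (z1, z2) from the four-cell multinomial distribution."""
    rng = random.Random(seed)
    cells = list(probs)
    return rng.choices(cells, weights=[probs[c] for c in cells], k=n)

z_pairs = draw_z(2000)
```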

In the sensitivity analyses, we consider random censoring with C ~ U(0, 5) and use 2 controls only. Table 3 gives some representative results for different values of β0. Under scenarios S3, S4 and S5, where only condition C2 is violated, we observe that the proposed estimator is reasonably robust, with small biases and satisfactory coverage probabilities. All mean squared errors are also reasonably small. The results agree with the observations by Xiang and Langholz (1999) that the difference between NCC estimates and the full-cohort estimates is often small for moderate violation of condition C2 due to measurement error or covariate omission. Moreover, for the missing covariate situation (S5), the efficiency gain of the proposed estimator for the parameter corresponding to the observed covariate (Z1) is clearly larger than that for the missing covariate (Z2). For scenario S6, we observe that biases become more pronounced as the magnitude of β0 increases, and thus the coverage probabilities deteriorate, indicating that condition C1 is an important assumption for the validity of the projection method.

Table 3
Simulation results of sensitivity analysis

3.3 Wilms’ tumor studies

We demonstrate the proposed approach using full-cohort data from studies conducted by the National Wilms Tumor Study Group (D’Angio et al., 1989; Green et al., 1998). Wilms’ tumor is a malignant tumor of the kidney that typically occurs in children. The dataset contains full information on 3,915 subjects participating in the third and fourth Wilms’ tumor studies; the 669 (17.09%) patients who had disease relapse are considered as cases. We also compare the proposed estimator with Thomas’ estimator and the IPW estimator under NCC sampling with various numbers of controls.

To estimate the effects of unfavorable histology status and other covariates on patients’ relapse-free survival, we follow Kulich and Lin (2004) and assume that the relapse time follows model (1) with eight covariates: Age1 (age at diagnosis if less than 1 year old); Age2 (age at diagnosis if 1 year and older); UH (unfavorable central histology); Age1×UH; Age2×UH; Stage (3–4 vs 1–2); Diameter; and Stage×Diameter. We simulate NCC studies from this full cohort with the number of controls ranging from 1 to 3. The evaluation of tumor histology by central pathologists is considered the true histology assessment and is treated as available only for the cases and selected controls; the reading by pathologists in the local institutions is considered a surrogate measurement and is available for the entire cohort.

The results are summarized in Table 4. In all situations, we observe that the estimates from the proposed method, the IPW method and Thomas’ approach are all similar to the full-cohort estimates. The standard error estimates from the proposed method are uniformly smaller than those from Thomas’ estimator, indicating that the proposed estimator is more efficient because it incorporates auxiliary information from the full cohort. In addition, as observed in our simulations, the empirical efficiency gain of the proposed method over Thomas’ estimator is evident when the number of controls in the NCC study is small. When the number of controls increases, the efficiencies of all estimators approach that of the full-cohort estimator. Moreover, for those covariates whose true values are available for the entire cohort, the proposed estimator approaches the full-cohort efficiency very quickly and achieves higher relative efficiency than Thomas’ estimator even with a small number of controls. For example, for the covariate of tumor stage (Stage), the relative efficiency of our proposed estimator with respect to the full-cohort estimator already reaches 78% with just one control, while it is only 33% for Thomas’ estimator.

Table 4
Wilms’ tumor study: parameter estimates and standard errors

4. Concluding remarks

We show that the projection idea can be employed in NCC studies with auxiliary covariates and can lead to an improved estimator for the regression parameters in Cox’s model. The efficiency gains of our proposed estimator over Thomas’ estimator are large when the number of controls in the NCC study is small and the correlation between the true covariates and the auxiliary covariates is strong. When condition C2 (6) is violated, the proposed projection estimator is theoretically biased but the bias is usually small in realistic situations. In addition, our simulation studies showed that the bias was often negligible compared to the variance. The proposed approach is computationally convenient and can be implemented using common statistical software with little programming effort. The R code for implementing the proposed approach can be obtained from the authors.

In this paper, the proposed estimator builds on Thomas’ estimator, which is the most commonly used method in practice for analyzing NCC data and only requires the true covariates to be measured for the cases and selected controls at case failure times rather than the entire history of the true covariate process. But when extended NCC data are available, Thomas’ estimator is not semiparametrically efficient (Robins et al., 1994). In fact, as we observed in the simulation studies, the IPW estimator (Samuelsen, 1997) often performed well. Therefore, statistical methods that can make use of the auxiliary covariate information to further improve the efficiency of the IPW estimator are also of great interest to us. More specifically, replace the estimating equations used in (2) and (7) by

$$U_A^*(\theta)=\sum_{i=1}^{n}\int_0^{\tau}\pi_i\{A_i(t)-\bar A_w(t;\theta)\}\,dN_i(t)=0,$$

where $\bar A_w(t;\theta)=\sum_j\pi_jY_j(t)e^{\theta' A_j(t)}A_j(t)\big/\sum_j\pi_jY_j(t)e^{\theta' A_j(t)}$ with (A, θ) = (Z, β) for the true model and (A, θ) = (X, γ) for the working model. It is easy to show that the solution to the above estimating equation and the corresponding full-cohort estimator always converge to the same limit under either the true model or the working model. Thus, the IPW-based projection method may relax the proportionality assumption on the induced hazard function in condition C2. This extension will be investigated elsewhere.

5. Supplementary Materials

The Web Appendices referenced in Section 2.3 and Section 3.1 are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

Acknowledgements

The authors thank the associate editor and two referees for their comments that substantially improved the presentation of the paper. This work was partially supported by NIEHS Pilot Project (M. Liu) and National Science Foundation Grant DMS-0504269 (W. Lu).

References

  • Chen HY, Little RJA. Proportional hazards regression with missing covariates. Journal of The American Statistical Association. 1999;94:896–908.
  • Chen KN. Statistical estimation in the proportional hazards model with risk set sampling. Annals of Statistics. 2004;32:1513–1532.
  • Chen YH. Cox regression in cohort studies with validation sampling. Journal of the Royal Statistical Society, Series B. 2002;64:51–62.
  • Chen YH, Chen H. A unified approach to regression analysis under double-sampling designs. Journal of the Royal Statistical Society, Series B. 2000;62:449–460.
  • Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  • D’Angio GJ, Breslow N, Beckwith JB, Evans A, Baum H, de Lorimier A, Ferbach D, Hrabovsky E, Jones G, Kelalis P. Results of the Third National Wilms Tumor Study. Treatment of Wilms’ tumor. Cancer. 1989;64:349–360.
  • Efron B. Bootstrap methods - another look at the jackknife. Annals of Statistics. 1979;7:1–26.
  • Green DM, Breslow NE, Beckwith JB, Finklestein JZ, Grundy PG, Thomas PRM, Kim T, Shochat S, Haase GM, Ritchey ML, Kelalis PP, D’Angio GJ. Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms’ tumor: A report from the National Wilms’ Tumor Study Group. Journal of Clinical Oncology. 1998;16:237–245.
  • Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the cox regression-model. Annals of Statistics. 1992;20:1903–1928.
  • Jiang JC, Zhou HB. Additive hazard regression with auxiliary covariates. Biometrika. 2007;94:359–369.
  • Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of The American Statistical Association. 2004;99:832–844.
  • Lin DY, Wei LJ. The robust inference for the cox proportional hazards model. Journal of The American Statistical Association. 1989;84:1074–1078.
  • Oakes D. Survival times - aspects of partial likelihood. International Statistical Review. 1981;49:235–252.
  • Prentice R. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
  • Robins JM, Rotnitzky A, Zhao LP. Estimation of regression-coefficients when some regressors are not always observed. Journal of The American Statistical Association. 1994;89:846–866.
  • Samuelsen SO. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika. 1997;84:379–394.
  • Scheike TH, Juul A. Maximum Likelihood Estimation for Cox’s Regression Model under Nested Case-Control Sampling. Biostatistics. 2004;5:193–206.
  • Thomas DC. Addendum to “Methods of Cohort Analysis - Appraisal by Application to Asbestos Mining” In: Liddell FDK, McDonald JC, Thomas DC, editors. Journal of the Royal Statistical Society, Series A. Vol. 140. 1977. pp. 469–491.
  • Xiang AH, Langholz B. Comparison of case-control to full cohort analyses under model misspecification. Biometrika. 1999;86:221–226.
  • Xiang AH, Langholz B. Robust variance estimation for rate ratio parameter estimates from individually matched case-control data. Biometrika. 2003;90:741–746.
  • Zeng D, Lin DY, Avery CL, North KE, Bray MS. Efficient Semi-parametric Estimation of Haplotype-Disease Associations in Case-Cohort and Nested Case-Control Studies. Biostatistics. 2006;7:486–502.