Article sections

- SUMMARY
- 1. Introduction
- 2. Projection Estimator and Statistical Inference
- 3. Numerical Studies
- 4. Concluding remarks
- 5. Supplementary Materials
- References


Biometrics. Author manuscript; available in PMC 2010 June 29.

Published in final edited form as:

Published online 2009 June 8. doi: 10.1111/j.1541-0420.2009.01277.x

PMCID: PMC2889133

NIHMSID: NIHMS117764

The publisher's final edited version of this article is available at Biometrics

The nested case-control (NCC) design is a popular sampling method in large epidemiologic studies because it is cost-effective for investigating the temporal relationship of diseases with environmental exposures or biological precursors. Thomas’ maximum partial likelihood estimator is commonly used to estimate the regression parameters in Cox’s model for NCC data. In this paper, we consider a situation in which failure/censoring information and some crude covariates are available for the entire cohort in addition to NCC data, and we propose an improved estimator that is asymptotically more efficient than Thomas’ estimator. We adopt a projection approach that, heretofore, has only been employed in situations of random validation sampling and show that it can be well adapted to NCC designs, where the sampling scheme is a dynamic process and is not independent for controls. Under certain conditions, consistency and asymptotic normality of the proposed estimator are established and a consistent variance estimator is also developed. Furthermore, a simplified approximate estimator is proposed for the case in which the disease is rare. Extensive simulations are conducted to evaluate the finite-sample performance of our proposed estimators and to compare their efficiency with that of Thomas’ estimator and other competing estimators. Moreover, sensitivity analyses are conducted to demonstrate the behavior of the proposed estimator when model assumptions are violated, and we find that the biases are reasonably small in realistic situations. We further demonstrate the proposed method with data from studies on Wilms’ tumor.

Because it is cost-effective for studying the temporal relationship between disease and exposures, nested case-control (NCC) sampling (Thomas, 1977; Oakes, 1981) has been considered a useful alternative to the cohort design and the case-control design. The most commonly used analytical approach for NCC data is Thomas’ maximum partial likelihood estimation approach (Thomas, 1977; Oakes, 1981) under the Cox proportional hazards model (Cox, 1972). The consistency and asymptotic normality of Thomas’ estimator have been formally established using counting process and martingale theory (Goldstein and Langholz, 1992). Recently, Chen (2004) proposed a partial-likelihood-based local-averaging estimator that is more efficient than Thomas’ estimator away from the null.

Furthermore, in the presence of *extended* NCC data (Chen, 2004), which consist of failure/censoring times and indices for the full cohort and entire covariate histories for the cases and selected controls, a number of methods have been proposed to improve the estimation efficiency: e.g. the inverse probability weighted (IPW) method (Robins et al., 1994; Samuelsen, 1997); the local-average estimation approach (Chen, 2004); and the likelihood-based approaches (Chen and Little, 1999; Scheike and Juul, 2004; Zeng et al., 2006). Since parent cohorts of NCC studies are usually well-characterized, carefully followed epidemiological cohorts, the failure/censoring information on the entire cohort is often available. In many studies, however, the true exposure covariates may be difficult or expensive to assemble for the full cohort, or to measure over their entire history for the cases and selected controls. Instead, some auxiliary covariates, such as crude measurements of the exposure or covariates inferred from questionnaires, can be easily or cheaply assembled for the entire cohort. The aims of this paper are to incorporate the failure/censoring information and auxiliary covariates from the entire cohort into the analysis of NCC data and to propose an easily computed estimator that is asymptotically more efficient than Thomas’ estimator.

Towards this goal, we propose to adopt a projection technique that has been used to improve the efficiency of various models in cohort studies with random validation sampling, such as general linear regression models (Chen and Chen, 2000), Cox’s model (Chen, 2002), and the additive hazards model (Jiang and Zhou, 2007). To the best of our knowledge, the projection method has heretofore only been studied for random validation sampling, and its adaptation to NCC sampling entails new challenges, primarily due to the non-independent sampling scheme of the NCC design. Statistical inference thus cannot rely on the conventional central limit theory for independent observations. In this paper, we show that the projection method can be well adapted to the NCC design under certain conditions and leads to an improved estimator that is guaranteed to achieve an asymptotic variance no bigger than that of Thomas’ estimator.

The rest of this article is organized as follows. In Section 2, we derive the proposed estimator and its asymptotic properties and present a practical computation procedure. A rare-disease approximate estimator is also provided and some inference remarks are discussed. In Section 3, extensive simulation studies are conducted to evaluate the performance of our proposed estimators under various practical settings. An illustration with a real dataset from Wilms’ tumor studies is also provided. We conclude with some discussions in Section 4 and provide all the technical details in Supplementary Material.

Consider a full cohort of size *n*. Let $\{T_i^{*}, C_i, Z_i(\cdot),\ i = 1, \cdots, n\}$ denote *n* i.i.d. triplets of failure times, censoring times, and *p*-dimensional covariate processes of interest. Define $T_i = \min(T_i^{*}, C_i)$, $\delta_i = I(T_i^{*} \le C_i)$
, *N _{i}*(

Assume that, given the true covariate *Z*(·), *T** follows a Cox proportional hazards model

$$\lambda\{t \mid \overline{Z}(t)\} = \lambda_0(t)\,\exp\{\beta_0' Z(t)\},$$

(1)

where $\overline{Z}(t)$ = {*Z*(*s*) : 0 ≤ *s* ≤ *t*}, λ_{0}(*t*) is an unspecified baseline hazard function and β_{0} is a *p*-dimensional vector of parameters of interest. Furthermore, we assume that the censoring time *C* is independent of the failure time *T** given *Z*.

Thomas’ estimator, denoted by $\widehat{\beta}$, is the solution to

$${U}_{Z}(\beta )={\displaystyle \sum _{i=1}^{n}{\displaystyle {\int}_{0}^{\tau}\{{Z}_{i}(t)-{E}_{Z,{\tilde{R}}_{i}}(t;\beta )\}{\mathit{\text{dN}}}_{i}(t)=0,}}$$

(2)

where τ = inf{*t* : pr(*T* > *t*) = 0} and
${E}_{Z,w}(t;\beta )=\frac{{\displaystyle {\sum}_{j\in w}{e}^{\beta \prime {Z}_{j}(t)}{Z}_{j}(t)}}{{\displaystyle {\sum}_{j\in w}{e}^{\beta \prime {Z}_{j}(t)}}}$
for a set *w*. Oakes (1981) showed that Thomas’ estimator maximizes the partial likelihood, and Goldstein and Langholz (1992) proved that, under certain regularity conditions,

$${n}^{1/2}(\widehat{\beta}-{\beta}_{0})\to N(0,{\mathrm{\Gamma}}^{-1}),$$

(3)

as *n* → ∞, where $\Gamma = -\lim_{n\to\infty} n^{-1}\,\partial U_Z(\beta)/\partial\beta\,|_{\beta=\beta_0}$.
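Equation (2) can be solved numerically by Newton-Raphson. The following is a minimal sketch rather than the authors' implementation. It assumes time-fixed covariates and takes the sampled risk set of each case (the case together with its sampled controls) as given; the function names `thomas_score_and_info` and `thomas_estimator` are hypothetical.

```python
import numpy as np


def thomas_score_and_info(beta, Z, sampled_sets):
    """Return the score U_Z(beta) of equation (2) and the matrix -dU_Z/dbeta.

    Z            : (n, p) array of time-fixed covariates (an assumption of this
                   sketch); rows of subjects outside the sampled sets are never read.
    sampled_sets : list of (case_index, member_indices); member_indices holds
                   the case together with its sampled controls.
    """
    p = Z.shape[1]
    score = np.zeros(p)
    info = np.zeros((p, p))
    for case, members in sampled_sets:
        Zr = Z[members]
        eta = Zr @ beta
        w = np.exp(eta - eta.max())      # subtract the max for numerical stability
        w /= w.sum()                     # relative-risk weights within the sampled set
        mean = w @ Zr                    # weighted covariate mean over the sampled set
        score += Z[case] - mean
        centered = Zr - mean
        info += (centered * w[:, None]).T @ centered   # weighted covariance
    return score, info


def thomas_estimator(Z, sampled_sets, max_iter=25, tol=1e-8):
    """Solve U_Z(beta) = 0 by Newton-Raphson, starting from beta = 0."""
    beta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        score, info = thomas_score_and_info(beta, Z, sampled_sets)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Re-evaluate the information at the final iterate; dividing by the cohort
    # size n = len(Z) gives a plug-in estimate of Gamma.
    _, info = thomas_score_and_info(beta, Z, sampled_sets)
    return beta, info / len(Z)
```

The inverse of the second return value, divided by *n*, then gives an approximate variance for the estimate, in line with (3).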

To utilize the auxiliary information available on the full cohort, we assume a working Cox’s model specified by α_{0}(*t*) exp{γ′*X*(*t*)}. We first introduce extra notation:

$$\begin{array}{lll}{S}^{(k)}(t;\mathrm{\gamma})={n}^{-1}{\displaystyle \sum _{i=1}^{n}{Y}_{i}(t){e}^{\mathrm{\gamma}\prime {X}_{i}(t)}{X}_{i}^{\otimes k}(t)}\hfill & \text{and}\hfill & {S}^{(k)}(t)={n}^{-1}{\displaystyle \sum _{i=1}^{n}{Y}_{i}(t){\lambda}_{i}(t){X}_{i}^{\otimes k}(t),}\hfill \end{array}$$

where *k* = 0, 1, 2, and for a vector *a*, *a*^{⊗0} = 1, *a*^{⊗1} = *a* and *a*^{⊗2} = *aa*′; λ* _{i}*(

Let $\tilde{\gamma}$ denote the full-cohort maximum partial likelihood estimator under the working model, defined as the solution to

$$\tilde{U}(\gamma) = \sum_{i=1}^{n} \int_0^{\tau} \{X_i(t) - \overline{X}(t,\gamma)\}\,dN_i(t) = 0,$$

(4)

where $\overline{X}(t,\gamma) = S^{(1)}(t;\gamma)/S^{(0)}(t;\gamma)$. Lin and Wei (1989) showed that $\tilde{\gamma}$ converges in probability to a constant vector γ_{*}, which is the unique solution to

$${\int}_{0}^{\tau}\{{s}^{(1)}(t)-\frac{{s}^{(1)}(t;\mathrm{\gamma})}{{s}^{(0)}(t;\mathrm{\gamma})}{s}^{(0)}(t)}\}\mathit{\text{dt}}=0,$$

provided that the matrix

$$A=-\left\{\underset{n\to \infty}{\mathrm{lim}}{n}^{-1}\frac{\partial \tilde{U}(\mathrm{\gamma})}{\partial \mathrm{\gamma}}{|}_{\mathrm{\gamma}={\mathrm{\gamma}}_{*}}\right\}={\displaystyle {\int}_{0}^{\tau}\left\{\frac{{s}^{(2)}(t;\mathrm{\gamma})}{{s}^{(0)}(t;\mathrm{\gamma})}-{\left(\frac{{s}^{(1)}(t;\mathrm{\gamma})}{{s}^{(0)}(t;\mathrm{\gamma})}\right)}^{\otimes 2}\right\}{s}^{(0)}(t)\mathit{\text{dt}}}$$

is positive definite. Furthermore, Lin and Wei (1989) showed that, as *n* → ∞,

$${n}^{1/2}(\tilde{\mathrm{\gamma}}-{\mathrm{\gamma}}_{*})\to N(0,{A}^{-1}{\mathit{\text{BA}}}^{-1}),$$

(5)

under certain regularity conditions, where *B* = lim_{n→∞} Var {*n*^{−1/2}*Ũ* (γ_{*})}.

Next, we derive another consistent estimator for γ_{*} based on the auxiliary covariates of those subjects selected by the NCC sampling. To achieve this, we impose the following conditions on the auxiliary covariates *X*:

- C1. given the true covariates *Z*(·), *X*(·) is independent of *T** and *C*;
- C2. there exist $\check{\alpha}_0(\cdot)$ and $\check{\gamma}$ such that the induced hazard function of *T** given *X*(·) has a proportional form, i.e.

$$\lambda\{t \mid \overline{X}(t)\} = \check{\alpha}_0(t)\,\exp\{\check{\gamma}' X(t)\}.$$

(6)

Condition C1 indicates that *X* is a true surrogate of *Z*, which is commonly assumed in many studies of surrogacy. Condition C2 ensures that Thomas’ estimator based on the auxiliary covariates estimates the same quantity as the full-cohort estimator under the working model (Xiang and Langholz, 1999). Therefore, let $\widehat{\gamma}$ be the solution to

$${U}_{X}(\mathrm{\gamma})={\displaystyle \sum _{i=1}^{n}{\displaystyle {\int}_{0}^{\tau}\{{X}_{i}(t)-{E}_{X,{\tilde{R}}_{i}}(t;\mathrm{\gamma})\}{\mathit{\text{dN}}}_{i}(t)=0,}}$$

(7)

where
${E}_{X,w}(t;\mathrm{\gamma})=\frac{{\displaystyle {\sum}_{j\in w}{e}^{\mathrm{\gamma}\prime {X}_{j}(t)}{X}_{j}(t)}}{{\displaystyle {\sum}_{j\in w}{e}^{\mathrm{\gamma}\prime {X}_{j}(t)}}}$
for a set *w*. Xiang and Langholz (2003) showed that

$${n}^{1/2}(\widehat{\mathrm{\gamma}}-{\mathrm{\gamma}}_{*}){\to}_{n\to \infty}N(0,{I}^{-1}{\mathit{\text{VI}}}^{-1}),$$

(8)

in distribution, where $I = -\lim_{n\to\infty} n^{-1}\,\partial U_X(\gamma)/\partial\gamma\,|_{\gamma=\gamma_*}$
and *V* = lim_{n→∞} Var {*n*^{−1/2}*U _{X}*(γ

Assumption C2 is required for rigorous theoretical justification, but in general it may not hold exactly (Prentice, 1982). Note that the primary interest here is how well the NCC estimator approximates the full-cohort estimator under the working model rather than how the working model deviates from the true model. Although the limiting difference $\widehat{\gamma} - \tilde{\gamma}$ may not be exactly zero, a substantial difference generally does not occur unless the magnitude of the misspecification is unreasonably large, as noted in Xiang and Langholz (1999). In addition, under the rare-disease assumption that often holds in NCC studies, the induced hazard function can be adequately approximated by λ_{0}(*t*)E{exp(β′*Z*(*t*)) | $\overline{X}(t)$}, which can further relax the assumption. We will further investigate the impact of condition C2 on the parameter estimation in our simulation studies.

Following the similar projection idea used in Chen and Chen (2000), Chen (2002) and Jiang and Zhou (2007), we incorporate the information available on the entire cohort, i.e. {(*T _{i}*, δ

$$\begin{aligned}
K_1 &= \int_0^{\tau} P_Y(t)\,\mathrm{E}\Big[m^{-1}\sum_{i\in r}\{Z_i(t)-E_{Z,r}(t;\beta_0)\}\{X_i(t)-E_{X,r}(t;\gamma_*)\}'\lambda_i(t)\,\Big|\,Y_r(t)=1\Big]\,dt,\\
K_2 &= \int_0^{\tau} P_Y(t)\,\mathrm{E}\Big[m^{-1}\sum_{i\in r}\{Z_i(t)-E_{Z,r}(t;\beta_0)\}\{X_i(t)-\overline{x}(t;\gamma_*)\}'\lambda_i(t)\,\Big|\,Y_r(t)=1\Big]\,dt,\\
\Sigma_1 &= \int_0^{\tau} P_Y(t)\,\mathrm{E}\Big[m^{-1}\sum_{i\in r}\{X_i(t)-E_{X,r}(t;\gamma_*)\}\{X_i(t)-\overline{x}(t;\gamma_*)\}'\lambda_i(t)\,\Big|\,Y_r(t)=1\Big]\,dt,\\
\Sigma_2 &= \int_0^{\tau}\!\!\int_0^{\tau} P_Y(t)\,\mathrm{cov}\Big[\sum_{i\in r}\{X_i(t)-E_{X,r}(t;\gamma_*)\}\lambda_i(t),\ Y_1(s)\{X_1(s)-\overline{x}(s;\gamma_*)\}'\Big\{\lambda_1(s)-\frac{e^{\gamma_*' X_1(s)}\,s^{(0)}(s)}{s^{(0)}(s;\gamma_*)}\Big\}\,\Big|\,Y_r(t)=1\Big]\,dt\,ds.
\end{aligned}$$

Proposition 1. Under conditions C1 and C2 and the regularity conditions given in Web Appendix A,

$$n^{1/2}\begin{pmatrix}\widehat{\beta}-\beta_0\\ \widehat{\gamma}-\tilde{\gamma}\end{pmatrix} \to N\left(\begin{pmatrix}0\\ 0\end{pmatrix},\ \begin{pmatrix}\Gamma^{-1} & \Delta\\ \Delta' & \Omega\end{pmatrix}\right),$$

in distribution as *n* → ∞, where

$$\mathrm{\Delta}={\mathrm{\Gamma}}^{-1}{K}_{1}{I}^{-1}-{\mathrm{\Gamma}}^{-1}{K}_{2}{A}^{-1},$$

(9)

$$\mathrm{\Omega}={I}^{-1}V{I}^{-1}+{A}^{-1}B{A}^{-1}-2{I}^{-1}({\mathrm{\Sigma}}_{1}+{\mathrm{\Sigma}}_{2}){A}^{-1}.$$

(10)

The proof of Proposition 1 is given in the Web Appendix A. By Proposition 1 and the multivariate normal distribution theory, we have,

$$\mathrm{E}\{{n}^{1/2}(\widehat{\beta}-{\beta}_{0})|(\widehat{\mathrm{\gamma}}-\tilde{\mathrm{\gamma}})\}={n}^{1/2}\mathrm{\Delta}{\mathrm{\Omega}}^{-1}(\widehat{\mathrm{\gamma}}-\tilde{\mathrm{\gamma}}).$$
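The conditional-mean identity is the standard result for jointly normal vectors, and the same algebra indicates why subtracting the projection cannot increase the variance. As a sketch, write $U$ and $W$ (shorthand introduced here, not part of the original notation) for the limits in distribution of $n^{1/2}(\widehat{\beta}-\beta_0)$ and $n^{1/2}(\widehat{\gamma}-\tilde{\gamma})$, and assume that Ω is invertible:

$$\begin{aligned}
\mathrm{E}(U \mid W) &= \mathrm{cov}(U, W)\,\mathrm{var}(W)^{-1}\,W = \Delta\Omega^{-1}W,\\
\mathrm{var}(U - \Delta\Omega^{-1}W) &= \Gamma^{-1} - 2\,\Delta\Omega^{-1}\Delta' + \Delta\Omega^{-1}\,\Omega\,\Omega^{-1}\Delta' = \Gamma^{-1} - \Delta\Omega^{-1}\Delta'.
\end{aligned}$$

Because $\Delta\Omega^{-1}\Delta'$ is positive semidefinite, the residual $U - \Delta\Omega^{-1}W$ has a variance no larger than Γ^{−1} in the matrix ordering.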

It is easy to see that Γ, *I* and *A* can be consistently estimated by $\widehat{\Gamma} = -n^{-1}\,\partial U_Z(\widehat{\beta})/\partial\beta$, $\widehat{I} = -n^{-1}\,\partial U_X(\widehat{\gamma})/\partial\gamma$ and $\widehat{A} = -n^{-1}\,\partial\tilde{U}(\tilde{\gamma})/\partial\gamma$, respectively. Furthermore, let

$$\begin{array}{l}{\widehat{K}}_{1}={n}^{-1}{\displaystyle \sum _{i=1}^{n}{\displaystyle {\int}_{0}^{\tau}\{{Z}_{i}(t)-{E}_{Z,{\tilde{R}}_{i}}(t;\widehat{\beta})}\}{\{{X}_{i}(t)-{E}_{X,{\tilde{R}}_{i}}(t,\widehat{\mathrm{\gamma}})\}}^{\prime}{\mathit{\text{dN}}}_{i}(t),}\\ {\widehat{K}}_{2}={n}^{-1}{\displaystyle \sum _{i=1}^{n}{\displaystyle {\int}_{0}^{\tau}\{{Z}_{i}(t)-{E}_{Z,{\tilde{R}}_{i}}(t;\widehat{\beta})}\}{\{{X}_{i}(t)-\overline{X}(t,\widehat{\mathrm{\gamma}})\}}^{\prime}{\mathit{\text{dN}}}_{i}(t).}\end{array}$$

Under the regularity conditions, the consistency of $\widehat{K}_1$ and $\widehat{K}_2$ follows easily from Lemma 1 in the supplementary material of Xiang and Langholz (2003). Therefore, the covariance component Δ can be consistently estimated by $\widehat{\Delta} = \widehat{\Gamma}^{-1}\widehat{K}_1\widehat{I}^{-1} - \widehat{\Gamma}^{-1}\widehat{K}_2\widehat{A}^{-1}$.

Next, examining the components of Ω in (10), we note that Σ_{2} has a very complicated expression and it is not straightforward to construct a consistent estimator in general. Thus, we propose to use the bootstrap method (Efron, 1979) to estimate Ω. The bootstrap approach is feasible here because the auxiliary covariates are available on the entire cohort. More specifically, in the *j*th bootstrap run, $j = 1, \ldots, J$, where *J* is a large number, we first randomly sample *n* subjects from the full cohort with replacement. Then, for each case in this bootstrapped sample, we randomly select *m*−1 controls from the risk set at this case's failure time, excluding the case itself, and thus obtain a new NCC dataset. Next, we estimate $\tilde{\gamma}^{(j)}$ and $\widehat{\gamma}^{(j)}$ by fitting the working model to the *j*th bootstrapped full-cohort data and NCC data, respectively. The empirical variance-covariance matrix of [$n^{1/2}\{\widehat{\gamma}^{(j)} - \tilde{\gamma}^{(j)}\}$, $j = 1, \ldots, J$] yields a consistent estimator for Ω, denoted by $\widehat{\Omega}$. The algorithm does not require any complex variance formula or much programming effort and can be easily implemented in many existing statistical software packages.
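The bootstrap steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: covariates are time-fixed, `fit_full` and `fit_ncc` are hypothetical user-supplied callables that fit the working model to the full cohort and to the NCC data, and tied times created by resampling with replacement are handled naively (a duplicated case may be drawn as a control for its own copy).

```python
import numpy as np


def sample_ncc(time, event, n_controls, rng):
    """For each case, draw n_controls subjects still at risk at the case's
    failure time, excluding the case itself."""
    positions = np.arange(len(time))
    sets = []
    for i in np.flatnonzero(event):
        at_risk = np.flatnonzero((time >= time[i]) & (positions != i))
        k = min(n_controls, len(at_risk))        # fewer controls if the risk set is small
        controls = rng.choice(at_risk, size=k, replace=False) if k > 0 else at_risk[:0]
        sets.append((i, np.concatenate(([i], controls))))
    return sets


def bootstrap_omega(time, event, X, n_controls, fit_full, fit_ncc,
                    n_boot=200, seed=0):
    """Bootstrap estimate of Omega, the limiting variance of
    n^{1/2} (gamma_hat - gamma_tilde).

    fit_full(time, event, X) -> gamma_tilde  (hypothetical full-cohort fit)
    fit_ncc(X, sampled_sets) -> gamma_hat    (hypothetical NCC fit)
    """
    rng = np.random.default_rng(seed)
    n = len(time)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample the cohort with replacement
        t, d, x = time[idx], event[idx], X[idx]
        sets = sample_ncc(t, d, n_controls, rng)         # fresh NCC sample in each replicate
        gamma_tilde = np.atleast_1d(fit_full(t, d, x))
        gamma_hat = np.atleast_1d(fit_ncc(x, sets))
        diffs.append(np.sqrt(n) * (gamma_hat - gamma_tilde))
    return np.atleast_2d(np.cov(np.array(diffs), rowvar=False))
```

Any Cox partial-likelihood solver can be plugged in for the two callables, provided that both fit the same working model.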

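After obtaining the estimates $\widehat{\Delta}$ and $\widehat{\Omega}$, an improved estimator for β can be constructed as $\tilde{\beta} = \widehat{\beta} - \widehat{\Delta}\widehat{\Omega}^{-1}(\widehat{\gamma} - \tilde{\gamma})$. Based on Proposition 1, it is easy to show that $n^{1/2}(\tilde{\beta} - \beta_0)$ is asymptotically normal with mean zero and variance-covariance matrix Γ^{−1} − ΔΩ^{−1}Δ′. Therefore, the asymptotic variance of $\tilde{\beta}$ is guaranteed to be no bigger than that of Thomas’ estimator and can be consistently estimated by $\widehat{\Gamma}^{-1} - \widehat{\Delta}\widehat{\Omega}^{-1}\widehat{\Delta}'$.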
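Once the component estimates are in hand, the projection step reduces to a few matrix operations. The sketch below assumes that all inputs have already been computed as NumPy arrays; the function names `delta_hat` and `projection_estimator` are hypothetical.

```python
import numpy as np


def delta_hat(Gamma, K1, K2, I, A):
    """Plug-in estimate of Delta in (9): Gamma^{-1} K1 I^{-1} - Gamma^{-1} K2 A^{-1}."""
    Gamma_inv = np.linalg.inv(Gamma)
    return Gamma_inv @ K1 @ np.linalg.inv(I) - Gamma_inv @ K2 @ np.linalg.inv(A)


def projection_estimator(beta_hat, gamma_hat, gamma_tilde, Gamma, Delta, Omega, n):
    """Projection estimator beta_tilde and its estimated standard errors.

    The asymptotic variance of n^{1/2} (beta_tilde - beta_0) is
    Gamma^{-1} - Delta Omega^{-1} Delta', so standard errors divide by n.
    """
    adjustment = Delta @ np.linalg.solve(Omega, gamma_hat - gamma_tilde)
    beta_tilde = beta_hat - adjustment
    avar = np.linalg.inv(Gamma) - Delta @ np.linalg.solve(Omega, Delta.T)
    se = np.sqrt(np.diag(avar) / n)
    return beta_tilde, se
```

Replacing `Omega` with the rare-disease estimate gives the rare-disease approximate estimator in the same way.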

Two observations are worth making when comparing the projection approach under random validation sampling with that under NCC sampling. First, in the methods proposed for random validation sampling, all estimating equations can be rewritten asymptotically as sums of independent mean-zero terms. But in our procedure, estimating functions *U _{Z}*(β) and

When the disease is rare, as in many NCC studies, the proposed projection estimator can be well approximated by a plug-in type estimator because the estimation of the variance component *V* can be greatly simplified (Xiang and Langholz, 2003) and Σ_{2} is approximately negligible. More specifically, we first propose a rare-disease estimator for Ω given by $\widehat{\Omega}_r$ =

$$\begin{array}{l}{\widehat{V}}_{r}={n}^{-1}{\displaystyle \sum _{i=1}^{n}}{\displaystyle {\int}_{0}^{\tau}}{\{{X}_{i}(t)-{E}_{X,{\tilde{R}}_{i}}(t,\widehat{\mathrm{\gamma}})\}}^{\otimes 2}{\mathit{\text{dN}}}_{i}(t),\\ \widehat{B}={n}^{-1}{\displaystyle \sum _{i=1}^{n}{\left[{\displaystyle {\int}_{0}^{\tau}\{{X}_{i}(t)-\overline{X}}(t,\tilde{\mathrm{\gamma}})\}\left\{{\mathit{\text{dN}}}_{i}(t)-{Y}_{i}(t){e}^{{\tilde{\mathrm{\gamma}}}^{\prime}{X}_{i}(t)}\frac{{\displaystyle {\sum}_{j}{\mathit{\text{dN}}}_{j}(t)}}{n{S}^{(0)}(t;\tilde{\mathrm{\gamma}})}\right\}\right]}^{\otimes 2},}\\ {\widehat{\Sigma}}_{1}={n}^{-1}{\displaystyle \sum _{i=1}^{n}{\displaystyle {\int}_{0}^{\tau}\{{X}_{i}(t)-{E}_{X,{\tilde{R}}_{i}}}(t;\widehat{\mathrm{\gamma}})\}{\{{\text{X}}_{i}(t)-{\overline{X}}_{i}(t,\tilde{\mathrm{\gamma}})\}}^{\prime}}{\mathit{\text{dN}}}_{i}(t).\end{array}$$

Therefore, the rare-disease approximate estimator is defined as $\tilde{\beta}_r = \widehat{\beta} - \widehat{\Delta}\widehat{\Omega}_r^{-1}(\widehat{\gamma} - \tilde{\gamma})$, and its variance estimator is given by $\widehat{\Gamma}^{-1} - \widehat{\Delta}\widehat{\Omega}_r^{-1}\widehat{\Delta}'$.

We first investigate the finite-sample performance of the proposed estimator and the rare-disease estimator by extensive simulations. We compare the efficiency of the proposed estimator with Thomas’ estimator, a local-averaging estimator (Chen, 2004), and an IPW estimator (Samuelsen, 1997). We consider the following scenarios:

- S1. independent auxiliary covariate: *Z* and *X* are independently and identically distributed;
- S2. normal auxiliary covariate measured with normal error: $X = Z + \epsilon$, $\epsilon \sim N(0, \sigma_{\epsilon}^{2})$.

The true covariate *Z* ~ *N*(2, 0.5^{2}) and σ_{ε} = 0.5 or 0.2. We generate the failure time *T** from a Cox model λ(*t*|*Z*) = λ_{0}*e*^{β*Z*}, where three values of β, namely 0, −0.5 and −1, are considered. We examine two censoring scenarios: random censoring, where *C* ~ *U*(0, 5), and covariate-dependent censoring, where the censoring time *C* is generated uniformly on (0, min(3|*Z*|, 5)). The value of λ_{0} is chosen to control the disease incidence rate at 6% to 7%. Under S1, we examine the robustness of the proposed estimator with a completely independent (wrong) surrogate covariate. Scenario S2 is a classical measurement error model and it is easy to see that conditions C1 and C2 are satisfied (Xiang and Langholz, 1999). We consider a cohort size of 2000 and NCC studies with 2 or 4 controls. For Chen’s estimator, we set the local-average bandwidth to be 2*n*^{−1/3}. For the IPW estimator, the weight function is defined as $\pi_i = \delta_i + (1 - \delta_i)V_{0i}/p_{0i}$, where *V*_{0i} is the indicator of subject *i* ever being selected as a control and $p_{0i} = 1 - \prod_{T_j \le T_i}\left\{1 - \frac{m-1}{\sum_k Y_k(T_j) - 1}\,\delta_j\right\}$. We run 500 simulations for each setting and the number of bootstrap samples is set to be 500.
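The data-generating mechanism of scenario S2 can be sketched as below. The constant baseline hazard follows the stated model λ(*t*|*Z*) = λ_{0}*e*^{β*Z*}; the function name `simulate_s2` is hypothetical, and the value of `lambda0` used in the usage comment is an assumption, since the values actually used are not reported here.

```python
import numpy as np


def simulate_s2(n, beta, sigma_eps, lambda0, rng, covariate_dependent=False):
    """Generate one cohort under scenario S2.

    Z ~ N(2, 0.5^2); X = Z + eps with eps ~ N(0, sigma_eps^2);
    T* is exponential with rate lambda0 * exp(beta * Z);
    C ~ U(0, 5), or U(0, min(3|Z|, 5)) under covariate-dependent censoring.
    """
    Z = rng.normal(2.0, 0.5, size=n)
    X = Z + rng.normal(0.0, sigma_eps, size=n)
    rate = lambda0 * np.exp(beta * Z)
    T_star = rng.exponential(1.0 / rate)          # NumPy parameterises by scale = 1 / rate
    upper = np.minimum(3.0 * np.abs(Z), 5.0) if covariate_dependent else np.full(n, 5.0)
    C = rng.uniform(0.0, upper)
    time = np.minimum(T_star, C)                  # observed time T = min(T*, C)
    event = (T_star <= C).astype(int)             # delta = I(T* <= C)
    return Z, X, time, event


# Example call. lambda0 = 0.027 is an assumed value: for beta = 0 and C ~ U(0, 5)
# the incidence is 1 - (1 - exp(-5 * lambda0)) / (5 * lambda0), roughly 6.5%.
# Z, X, time, event = simulate_s2(2000, 0.0, 0.2, 0.027, np.random.default_rng(0))
```

For β ≠ 0, `lambda0` would need to be retuned to keep the incidence in the 6% to 7% range.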

Simulation results under the random censoring are summarized in Table 1 and those under the covariate-dependent censoring are presented in Table 2. In all the scenarios, the proposed estimator shows negligible biases. The estimated standard errors (SE), using the proposed bootstrap method, are close to the sample standard deviation (SD) of the estimates. Thus, the 95% Wald-type confidence intervals all have reasonable coverage probabilities (CP). Moreover, under this rare-disease situation, the rare-disease approximate estimator performs well as it yields reasonable coverage probabilities (CP*).

To compare the efficiency of the various approaches, we calculate the empirical relative efficiency for each estimator, defined as the ratio of sample variances of the estimator and the full-cohort maximum partial likelihood estimator under the true model, with the latter as the reference. The efficiency results are summarized in Table 1 and Table 2 (see the last four columns). In scenario S1, where the surrogate covariate is completely independent of the true covariate, the efficiency of the proposed estimator is comparable to that of Thomas’ estimator because independent surrogate covariates can hardly provide any information to improve Thomas’ estimator. Under this scenario, the IPW estimator outperforms the others, as the selection probability is accurately estimated and used to recover the original full cohort.

In scenario S2, where a true surrogate covariate is available, the proposed general estimator shows an efficiency gain over Thomas’ estimator, and the gain is larger when the number of controls is small and the measurement error is small. For example, when β = −1, σ_{ε} = 0.2 and *m*−1 = 2, the gain of the proposed method over Thomas’ estimator, calculated as (RE/RE_{1} − 1)%, reaches 61% with covariate-independent censoring and 49% with covariate-dependent censoring. On the other hand, the efficiency of the proposed method approaches the full-cohort efficiency as the measurement error decreases or the number of controls increases. For example, when β = 0, σ_{ε} = 0.2 and *m*−1 = 4, the relative efficiency of our estimator reaches 99.1% with covariate-independent censoring and 98.1% with covariate-dependent censoring. Moreover, in all simulations under scenario S2, the proposed estimator outperforms the other competing estimators.

Additional simulations when the disease is common with 15% and 25% incidence rates are presented in the Web Appendix B. We observe that the proposed estimator still performs well but the rare-disease approximation may fail with unsatisfactory coverage probabilities.

In this subsection, we further investigate the properties of the proposed estimator when conditions C1 and C2 are violated. Primarily, we focus on the violation of condition C2 that has some practical implications and consider the following scenarios:

- S3. non-normal covariate and error: *X* = *Z* + *u* where *Z* ~ *U*(1, 3) and *u* ~ *U*(−1, 1);
- S5. working model with a missing covariate, i.e. *Z* = (*Z*_{1}, *Z*_{2}) but *X* = *Z*_{1} only;
- S6. informative auxiliary covariate, i.e. *X* = *Z* + α log *T** + ε.

Under scenarios S3 and S4, the induced hazard function λ(*t*|*X*) defined in (6) does not have a proportional exponential form unless β_{0} = 0. Under S5 of missing covariate, λ(*t*|*X*) generally will not be proportional. We generate dichotomous variables (*Z*_{1},*Z*_{2}) from a multinomial distribution with π* _{lk}* = pr(

In the sensitivity analyses, we consider random censoring with *C* ~ *U*(0, 5) and use 2 controls only. Table 3 gives some representative results for different values of β_{0}. Under scenarios S3, S4 and S5, where only condition C2 is violated, we observe that the proposed estimator is reasonably robust, with small biases and satisfactory coverage probabilities. All mean squared errors are also reasonably small. The results agree with the observations by Xiang and Langholz (1999) that the difference between NCC estimates and the full-cohort estimates is often small for moderate violation of condition C2 due to measurement error or covariate omission. Moreover, for the missing covariate situation (S5), the efficiency gain of the proposed estimator for the parameter of the covariate retained in the working model (*Z*_{1}) is clearly larger than that for the parameter of the omitted covariate (*Z*_{2}). For scenario S6, we observe that biases become more pronounced as the magnitude of β_{0} increases, and thus the coverage probabilities deteriorate, indicating that condition C1 is an important assumption for the validity of the projection method.

We demonstrate the proposed approach using full-cohort data from studies conducted by the National Wilms Tumor Study Group (D’Angio et al., 1989; Green et al., 1998). Wilms’ tumor is a malignant tumor of the kidney and typically occurs in children. This dataset contains full information on 3,915 subjects participating in the third and fourth Wilms’ tumor studies, and the 669 (17.09%) patients who had disease relapse are considered cases. We also compare the proposed estimator with Thomas’ estimator and the IPW estimator under NCC sampling with various numbers of controls.

To estimate the effects of unfavorable histology status and other covariates on patients’ relapse-free survival, we follow Kulich and Lin (2004) and assume that the relapse time follows model (1) with eight covariates: Age1 (age at diagnosis if less than 1 year old); Age2 (age at diagnosis if 1 year or older); UH (unfavorable central histology); Age1×UH; Age2×UH; Stage (3–4 vs 1–2); Diameter; and Stage×Diameter. We simulate NCC studies from this full cohort with the number of controls ranging from 1 to 3. The evaluation of tumor histology by central pathologists is considered the true histology assessment and is treated as if it were available only for the cases and selected controls; the reading by pathologists in the local institutions is considered a surrogate measurement and is available for the entire cohort.

The results are summarized in Table 4. In all situations, we observe that the estimates from the proposed method, the IPW method and Thomas’ approach are all similar to the full-cohort estimates. The standard error estimates from the proposed method are uniformly smaller than those from Thomas’ estimator, indicating that the proposed estimator gains efficiency by incorporating auxiliary information from the full cohort. In addition, as observed in our simulations, the empirical efficiency gain of the proposed method over Thomas’ estimator is evident when the number of controls in the NCC study is small. As the number of controls increases, the efficiencies of all estimators approach that of the full-cohort estimator. Moreover, for those covariates whose true values are available for the entire cohort, the proposed estimator approaches the full-cohort efficiency very quickly and achieves higher relative efficiency than Thomas’ estimator even with a small number of controls. For example, for the covariate of tumor stage (Stage), the relative efficiency of our proposed estimator with respect to the full-cohort estimator already reaches 78% with just one control, while it is only 33% for Thomas’ estimator.

We show that the projection idea can be employed in NCC studies with auxiliary covariates and can lead to an improved estimator for the regression parameters in Cox’s model. The efficiency gains of our proposed estimator over Thomas’ estimator are large when the number of controls in the NCC study is small and the correlation between true covariates and auxiliary covariates is strong. When condition C2 (6) is violated, the proposed projection estimator is theoretically biased, but the bias is usually small in realistic situations. In addition, our simulation studies showed that the bias was often negligible compared to the variance. The proposed approach is computationally convenient and can be implemented using common statistical software with little programming effort. The R code for implementing the proposed approach can be obtained from the authors.

In this paper, the proposed estimator builds on Thomas’ estimator, which is the most commonly used method in practice for analyzing NCC data and only requires the true covariates to be measured for the cases and selected controls at case failure times rather than the entire history of the true covariate process. But when extended NCC data are available, Thomas’ estimator is not semiparametrically efficient (Robins et al., 1994). In fact, as we observed in the simulation studies, the IPW estimator (Samuelsen, 1997) often performed well. Therefore, statistical methods that can make use of the auxiliary covariate information to further improve the efficiency of the IPW estimator are also of great interest to us. More specifically, replace the estimating equations used in (2) and (7) by

$${U}_{A}^{*}(\theta )={\displaystyle \sum _{i=1}^{n}{\displaystyle {\int}_{0}^{\tau}{\pi}_{i}\{{A}_{i}(t)-{\overline{A}}_{w}(t;\theta )\}{\mathit{\text{dN}}}_{i}(t)=0,}}$$

where
$\overline{A}_w(t;\theta) = \frac{\sum_j \pi_j Y_j(t)\,e^{\theta' A_j(t)} A_j(t)}{\sum_j \pi_j Y_j(t)\,e^{\theta' A_j(t)}}$
with (*A*, θ) = (*Z*, β) for the true model and (*A*, θ) = (*X*, γ) for the working model. It is easy to show that the solution to the above estimating equation and the corresponding full cohort estimator always converge to the same limit under either the true model or the working model. Thus, the IPW-based projection method may relax the proportional assumption on the induced hazard function in condition C2. This research will be investigated elsewhere.
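The weights π_{i} entering the weighted estimating equation can be computed from the full-cohort failure/censoring data and the control-selection indicators. The sketch below follows the inclusion-probability formula given in the simulation section (with the product taken over *T_j* ≤ *T_i*, as written there); the function name `samuelsen_weights` is hypothetical and the quadratic-time loops are chosen for clarity rather than speed.

```python
import numpy as np


def samuelsen_weights(time, event, selected, n_controls):
    """Inclusion probabilities p_0i and weights pi_i = delta_i + (1 - delta_i) V_0i / p_0i.

    time       : observed times T_i
    event      : failure indicators delta_i
    selected   : V_0i, True if subject i was ever sampled as a control
    n_controls : m - 1, the number of controls drawn per case
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    selected = np.asarray(selected, dtype=bool)
    # Number at risk at each subject's observed time: sum_k Y_k(T_j).
    n_at_risk = np.array([(time >= t).sum() for t in time])
    # Sampling fraction (m - 1) / (n_at_risk - 1), capped at 1 and guarded
    # against an empty set of eligible controls.
    others = np.maximum(n_at_risk - 1, 1)
    frac = np.minimum(n_controls / others, 1.0)
    # Only cases (delta_j = 1) contribute a factor different from 1.
    factor = np.where(event == 1, 1.0 - frac, 1.0)
    p0 = np.array([1.0 - np.prod(factor[time <= t]) for t in time])
    weights = np.where(event == 1, 1.0, 0.0)       # cases get weight 1
    controls = (event == 0) & selected
    weights[controls] = 1.0 / p0[controls]         # selected non-cases get 1 / p_0i
    return p0, weights
```

Non-cases that were never selected receive weight zero, so only the cases and the sampled controls contribute to the weighted sums.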

The Web Appendices referenced in Section 2.3 and Section 3.1 are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

The authors thank the associate editor and two referees for their comments that substantially improved the presentation of the paper. This work was partially supported by NIEHS Pilot Project (M. Liu) and National Science Foundation Grant DMS-0504269 (W. Lu).

- Chen HY, Little RJA. Proportional hazards regression with missing covariates. Journal of the American Statistical Association. 1999;94:896–908.
- Chen KN. Statistical estimation in the proportional hazards model with risk set sampling. Annals of Statistics. 2004;32:1513–1532.
- Chen YH. Cox regression in cohort studies with validation sampling. Journal of the Royal Statistical Society, Series B. 2002;64:51–62.
- Chen YH, Chen H. A unified approach to regression analysis under double-sampling designs. Journal of the Royal Statistical Society, Series B. 2000;62:449–460.
- Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- D’Angio GJ, Breslow N, Beckwith JB, Evans A, Baum H, de Lorimier A, Ferbach D, Hrabovsky E, Jones G, Kelalis P. Results of the Third National Wilms Tumor Study. Treatment of Wilms’ tumor. Cancer. 1989;64:349–360.
- Efron B. Bootstrap methods - another look at the jackknife. Annals of Statistics. 1979;7:1–26.
- Green DM, Breslow NE, Beckwith JB, Finklestein JZ, Grundy PG, Thomas PRM, Kim T, Shochat S, Haase GM, Ritchey ML, Kelalis PP, D’Angio GJ. Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms’ tumor: A report from the National Wilms’ Tumor Study Group. Journal of Clinical Oncology. 1998;16:237–245.
- Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. Annals of Statistics. 1992;20:1903–1928.
- Jiang JC, Zhou HB. Additive hazard regression with auxiliary covariates. Biometrika. 2007;94:359–369.
- Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association. 2004;99:832–844.
- Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association. 1989;84:1074–1078.
- Oakes D. Survival times - aspects of partial likelihood. International Statistical Review. 1981;49:235–252.
- Prentice R. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
- Samuelsen SO. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika. 1997;84:379–394.
- Scheike TH, Juul A. Maximum likelihood estimation for Cox’s regression model under nested case-control sampling. Biostatistics. 2004;5:193–206.
- Thomas DC. Addendum to “Methods of cohort analysis: appraisal by application to asbestos mining” by Liddell FDK, McDonald JC, Thomas DC. Journal of the Royal Statistical Society, Series A. 1977;140:469–491.
- Xiang AH, Langholz B. Comparison of case-control to full cohort analyses under model misspecification. Biometrika. 1999;86:221–226.
- Xiang AH, Langholz B. Robust variance estimation for rate ratio parameter estimates from individually matched case-control data. Biometrika. 2003;90:741–746.
- Zeng D, Lin DY, Avery CL, North KE, Bray MS. Efficient semiparametric estimation of haplotype-disease associations in case-cohort and nested case-control studies. Biostatistics. 2006;7:486–502.
