Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2891224

Formats

Article sections

- Summary
- 1 Introduction
- 2 Sampling Scheme and Data Structure
- 3 The Estimation Likelihood Inference
- 4 Simulation Study
- 5 Data Example
- 6 Discussion
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2010 June 29.

Published in final edited form as:

Published online 2009 June 9. doi: 10.1111/j.1541-0420.2009.01280.x

PMCID: PMC2891224

NIHMSID: NIHMS90132

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

In cancer research, it is important to evaluate the performance of a biomarker (e.g. molecular, genetic, or imaging) that correlates patients’ prognosis or predicts patients’ response to a treatment in large prospective study. Due to overall budget constraint and high cost associated with bioassays, investigators often have to select a subset from all registered patients for biomarker assessment. To detect a potentially moderate association between the biomarker and the outcome, investigators need to decide how to select the subset of a fixed size such that the study efficiency can be enhanced. We show that, instead of drawing a simple random sample from the study cohort, greater efficiency can be achieved by allowing the selection probability to depend on the outcome and an auxiliary variable; we refer to such a sampling scheme as *outcome and auxiliary-dependent subsampling* (OADS). This paper is motivated by the need to analyze data from a lung cancer biomarker study that adopts the OADS design to assess EGFR mutations as a predictive biomarker for whether a subject responds to a greater extent to EGFR inhibitor drugs. We propose an estimated maximum likelihood method that accommodates the OADS design and utilizes all observed information, especially those contained in the likelihood score of EGFR mutations (an auxiliary variable of EGFR mutations) that is available to all patients. We derive the asymptotic properties of the proposed estimator and evaluate its finite sample properties via simulation. We illustrate the proposed method with a data example.

In cancer research, there is a growing need for assessing the utility of a biomarker (e.g. genetic, molecular, or imaging) in predicting disease prognosis and treatment efficacy. Such task involves assessment of the association between patient’s clinical outcome and biomarker measures while adjusting for confounding variables. In many cases, due to the low prevalence rates of subjects with positive outcome (e.g. tumor response) and positive biomarker (e.g. genetic mutations), rigorous evaluation of cancer biomarkers requires a prospective cohort study of a large size. If the biomarker assays are expensive, the cost of assessing all subjects in the entire cohort is prohibitive. In such situations, the subsampling scheme of selecting a subset of subjects for biomarker assays is often used. As subjects with positive tumor response or positive biomarker levels may be more informative about the relationship between the outcome and the biomarker, it makes sense to oversample these subjects. Compared to a simple random subsample, utilization of such a subsampling scheme is expected to improve the efficiency in estimating the outcome-biomarker correlation with a given size of the subsample (Zhou *et al*., 2002).

We further illustrate this idea with a lung cancer biomarker study. Recent cancer studies found that EGFR inhibition drugs, such as Erlotinib and Gefitinib, moderately extended survival for patients with advanced non-small cell lung cancer. Intriguingly, based on retrospectively available samples, researchers (Paez *et al*., 2004; Lynch *et al*., 2004) found that patients with EGFR mutations responded in a greater extent to the EGFR inhibitor drugs than those without mutations. EGFR mutations are summarized by the type and the extent of EGFR gene irregularity. A national consortium for lung cancer patients treated by EGFR inhibition drugs was recently established (Eberhard *et al*., 2007) to prospectively evaluate EGFR mutations as a predictive biomarker for receiving EGFR inhibitor drugs, that is, to test whether EGFR mutants respond in a greater extent to EGFR inhibition drugs than EGFR wild-types. Hundreds patients treated with EGFR inhibition drugs will be registered into the consortium. All of them are required to submit tissue samples for assays on EGFR mutations. The consortium is expected to predominantly consist of non-responders to the treatment (~ 70%) and EGFR wild-types (~ 85%). Due to the high cost of genotyping EGFR genes, it is not cost-effective to genotype all banked samples. How to efficiently select a subset of patients for EGFR mutations assays becomes an important issue. Paez *et al*. (2004) reported that women, Asian patients, non-smokers, and patients with adenocarcinoma have much higher probability of being EGFR mutants. Taking advantage of this finding, CALGB investigators (Jänne *et al*., 2008) suggested a subsampling scheme which includes a simple random subsample of 250 patients as well as two supplementary samples. Of the two supplementary subsamples, one includes all responders and the other includes non-responders with a > 0.70 likelihood score of EGFR mutations. The likelihood score is the predicted probability of a patient having EGFR mutations from a logistic regression model using baseline patient characteristics as predictors. Subsampling according to the likelihood score of EGFR mutations is expected to increase the chance of EGFR mutants being represented in the subset of patients who will have EGFR mutations measured. Also, since the likelihood score of EGFR mutations is observed for all patients, it can be used as an auxiliary variable for EGFR mutations to better quantify the effect of EGFR mutations.

We refer to the subsampling scheme illustrated above as the *outcome and auxiliary-dependent sub-sampling* (OADS). The OADS can be considered as an extension of the outcome-dependent subsampling (ODS). In the ODS, the subsampling depends on the subjects’ outcomes in order to enrich the selected sample with those who have a rare outcome. Study designs using the ODS subsampling have been discussed by Zhou and his colleagues (Zhou *et al*., 2001; Zhou *et al*., 2002; Weaver & Zhou, 2005; Zhou *et al*., 2007). To reap the benefits of the ODS subsampling, one generally needs an analysis that accounts for the outcome-dependent nature of the sampling scheme. In the OADS, the subsampling depends on both the subjects’ outcome (e.g. tumor response) and an auxiliary variable (e.g. the likelihood score of EGFR mutations). The key idea is to achieve higher efficiency by concentrating more information in the OADS subsample as compared to the simple random subsample (SRS) and the ODS subsample. Wang & Zhou (2006) considered a design with two sampling components - a random sample (SRS) and an outcome and auxiliary-dependent sample (OADS), in which all patients in the study cohort have all variables observed, including the extent of EGFR mutations. On the other hand, the motivating example in this paper is proposed as a large prospective study. Due to the large size of patients and the associated high cost, we proposed that only a subset of patients in the entire study cohort have their EGFR mutations observed through genotyping. The two-stage sampling scheme in the current motivating example leads to a completely different data structure from that of Wang & Zhou (2006). The data structure that we consider in this paper consists of three sampling components: SRS, OADS and . The denotes those patients who have all variables but the extent of EGFR mutations observed. How to efficiently use all information in such a data structure is the focus of the current paper.

The origin of the ODS sampling can be found in case-control study (e.g. Breslow & Day, 1980) and its extensions such as nested case-control study (Breslow & Cain, 1988), case-cohort study (Prentice, 1986), and two-stage study (e.g. White, 1982; Wacholder & Weinberg, 1994; Breslow & Chatterjee, 1999). Related ideas of choice based sampling have also been developed in economics (e.g. Cosslett, 1981). These study designs may be considered as examples that utilize the idea of the ODS sampling/subsampling. Of these designs, the OADS design that we consider in this paper is especially related to the two-stage study, in which the outcomes and some stratification variables of all subjects are observed at the first stage, but other variables are only observed in a subsample of all subjects at the second stage. In a general framework of a two-stage sampling, Weaver & Zhou (2005) developed an estimated likelihood method to allow both continuous outcome and the ODS subsampling. For the two-stage study with binary outcome, Flanders & Greenland (1991) and Zhao & Lipsitz (1992) proposed a Horvitz-Thompson type method (Horvitz & Thompson, 1952) that weights the complete data observed inversely with the selection probability; Breslow & Cain (1988) developed a conditional likelihood estimator, 1985); Wild (1991), Cosslett (1981) and Breslow & Holubkov (1997) studied nonparametric maximum likelihood estimation. These statistical methods, especially those for two-stage case-control studies, provide tools to correct estimation bias due to the ODS subsampling, but they are not ideal methods for analyzing data arising from the motivating lung cancer study. In the EGFR lung cancer study, for each patient in the study cohort a likelihood score of EGFR mutations can be computed according to a prediction model with smoking history, sex, race and histology as predictors. The likelihood score contains valuable auxiliary information about the true extent of EGFR mutations; we call it the *auxiliary variable* for the biomarker of interest. Auxiliary variable can be any variable that is correlated to the biomarker. Auxiliary variable is not necessarily a surrogate variable, which has a strict statistical definition (e.g. Prentice, 1989). The existence of such an auxiliary variable not only enables investigators to identify possible EGFR mutants for efficient subsampling, but also functions as an intermediate variable to better quantify the correlation of EGFR mutations and tumor response to treatment. In principle, the profile likelihood-based method in Wang & Zhou (2006) can be extended to the OADS design that we consider in this paper. When the auxiliary variable is continuous, however, one is unable to apply the profile likelihood-based method without categorizing the auxiliary variable and accordingly losing valuable information.

In this paper, we propose an outcome and auxiliary-dependent subsampling (OADS) design to improve study efficiency and a statistical method that accommodates the OADS design and the auxiliary variable for the biomarker. The rest of the paper is organized as follows. In Section 2, we present the outcome and auxiliary-dependent subsampling (OADS) scheme and its data structure. We propose in Section 3 an estimated likelihood method and give its asymptotic properties. In Section 4, we demonstrate via simulation the benefits of the OADS design and the performance of the proposed method in finite samples by comparing it with the competing methods. Section 5 illustrates the proposed method using a data example. Final remarks are presented in Section 6. A sketched proof of the asymptotic properties of the proposed estimator is given in the Appendix, which is available on line as supplementary material.

To fix notation, let *Y* be a categorical outcome with possible values 1, … ,*K*, e.g. tumor response (complete, partial, stable, or progressed), let *X* be the measure of the biomarker that is observed only for those subjects in the OADS subsample, e.g. the extent of EGFR mutations, let *Z* be the vector of all covariates that are observed for all subjects in the study cohort, and let *W* be the auxiliary variable for *X*, e.g. the likelihood score of EGFR mutations. We assume that the conditional density of *Y* given (*X, Z, W*) belongs to a canonical exponential family and is parameterized as $P(Y|X,Z,W)=h({\beta}_{0}+{\beta}_{1}X+{\beta}_{2}^{\prime}Z)$, where *h*^{−1}(.) is a known link function and (1,*X,Z*) are covariates. This formulation implies that as an auxiliary variable for *X, W* provides no additional information about *Y* when *X* is included. The formulation is always true if *W* is allowed to be an element of the vector *Z*.

Let {*c _{r}*}

$$\begin{array}{cc}\text{Subjects}\phantom{\rule{thinmathspace}{0ex}}\text{in}\phantom{\rule{thinmathspace}{0ex}}V+\overline{V}:\hfill & \{{Y}_{i},{Z}_{i},{W}_{i}\}\phantom{\rule{thinmathspace}{0ex}}\text{for}\phantom{\rule{thinmathspace}{0ex}}i\in {V}_{\mathit{rk}}+{\overline{V}}_{\mathit{rk}}.\hfill \\ \text{Subjects}\phantom{\rule{thinmathspace}{0ex}}\text{in}\phantom{\rule{thinmathspace}{0ex}}V:\hfill & \{{Y}_{i},{Z}_{i},{W}_{i}|Y=k,C=r\}\phantom{\rule{thinmathspace}{0ex}}\text{for}\phantom{\rule{thinmathspace}{0ex}}i\in {V}_{\mathit{rk}}.\hfill \end{array}$$

The above data structure is slightly different from that of the motivating example, in which the subset of patients who have *X* observed consist of two components: an SRS and an OADS. Since the likelihood and the inference method for the SRS/OADS design is not different from that of the OADS only design, we consider the OADS only design thereafter.

In this section, we describe an estimated likelihood method for data arising from the OADS design. Following the argument in Schill *et al*. (1993), one can show that the joint likelihood of the OADS design is

$$L(\beta )=\left\{{\displaystyle \prod _{r=1}^{R}{\displaystyle \prod _{k=1}^{K}{\displaystyle \prod _{i\in {V}_{rk}}{P}_{\beta}({Y}_{i}|{X}_{i},{Z}_{i})\phantom{\rule{thinmathspace}{0ex}}g({X}_{i}|{Z}_{i},{W}_{i})}}}\right\}\phantom{\rule{thinmathspace}{0ex}}\left\{{\displaystyle \prod _{r=1}^{R}{\displaystyle \prod _{k=1}^{K}{\displaystyle \prod _{i\in {\overline{V}}_{\mathit{rk}}}{P}_{\beta}({Y}_{i}{|Z}_{i},{W}_{i})}}}\right\},$$

(1)

where *P _{β}*(

Notice that *P _{β}*(

$$\stackrel{\u2322}{l}(\beta )\phantom{\rule{thinmathspace}{0ex}}\propto \phantom{\rule{thinmathspace}{0ex}}{\displaystyle \sum _{r=1}^{R}{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{i\in {V}_{\mathit{rk}}}\text{log}{P}_{\beta}}}}({Y}_{i}|{X}_{i},{Z}_{i})+{\displaystyle \sum _{r=1}^{R}{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{j\in {\overline{V}}_{\mathit{rk}}}\text{log}{\stackrel{\u2322}{P}}_{\beta}}}}({Y}_{j}|{Z}_{j},{W}_{j}).$$

(2)

Statistical inference on *β* can then be carried out by maximizing the estimated likelihood function (2). When the subsample consists of a simple random sample, Pepe & Fleming (1991) and Carroll & Wand (1991) studied estimated likelihood methods for validation studies. Zhou & Pepe (1995) and Zhou & Wang (2000) studied the estimated likelihood method for time to event data. Zhou, Chen & Cai (2002) extended the approach to random effect models for clustering data.

For data arising from the OADS design, a nonparametric estimator for *G*(*X*|*Z, W*) without taking into account the subsampling dependency on *Y* and *C* will render bias on _{β}(*Y _{j}* |

$$G(X|Z,W)={\displaystyle \sum _{s}{\displaystyle \sum _{l}P(Y=l,C=s|{W}^{*})}}\phantom{\rule{thinmathspace}{0ex}}G(X|{W}^{*},Y=l,C=s),$$

(3)

where *P*(*Y* = *l,C* = *s* |*W**) is the joint probability {*Y,C*} conditional on *W** for the OADS subsample. In other words, if one is able to estimate consistently the conditional distribution *G*(*X*|*W**) within a stratum {*Y* = *l,C* = *s*}, a consistent estimator of *G*(*X*|*Z,W*) can then be obtained by summing *Ĝ*_{sl}(*X*|*W**) over all *s,l* with appropriate weights. Let π_{sl}(*W**) = *P*(*Y* = *l,C* = *s* |*W**) and *G _{sl}*(

$${P}_{\beta}({Y}_{j}|{Z}_{j},{W}_{j})={\displaystyle \int {P}_{\beta}({Y}_{j}|x,{Z}_{j})dG(x|{Z}_{j},{W}_{j})}={\displaystyle \int {P}_{\beta}({Y}_{j}|x,{Z}_{j})}\phantom{\rule{thinmathspace}{0ex}}d\{{\displaystyle \sum _{s}{\displaystyle \sum _{l}{\pi}_{sl}({W}_{j}^{*}){G}_{sl}(x|{W}_{j}^{*})\}}}.$$

(4)

According to (4), if consistent nonparametric estimators π_{sl}(*W**) and *G _{sl}*(

Notice that *P _{β}*(

$${\widehat{G}}_{\mathit{sl}}(x|{w}^{*})=\frac{{\displaystyle \sum _{i=\in {V}_{\mathit{sl}}}I({X}_{i}\le x){K}_{h}({W}_{i}^{*}-{w}^{*})}}{{\displaystyle \sum _{i\in {V}_{\mathit{sl}}}{K}_{h}({W}_{i}^{*}-{w}^{*})}}$$

(5)

where *K _{h}*(.)=

Using (4), a weighted kernel estimator for *P _{β}*(

$${\widehat{P}}_{\beta}({Y}_{j}|{Z}_{j},{W}_{j})={\displaystyle \sum _{s=1}^{R}{\displaystyle \sum _{l=1}^{K}{\widehat{\pi}}_{sl}({W}_{j}^{*})\frac{{\displaystyle \sum _{i\in {V}_{sl}}{P}_{\beta}({Y}_{j}|{X}_{i},{Z}_{j}){K}_{h}}({W}_{i}^{*}-{W}_{j}^{*})}{{\displaystyle \sum _{i\in {V}_{\mathit{sl}}}{K}_{h}}({W}_{i}^{*}-{W}_{j}^{*})},}}$$

(6)

where ${\widehat{\pi}}_{\mathit{sl}}({W}_{j}^{*})$ is another kernel smoother

$${\widehat{\pi}}_{\mathit{sl}}({W}_{j}^{*})=\frac{{\displaystyle \sum _{i=1}^{N}I({Y}_{i}=l,{C}_{i}=s){K}_{h}({W}_{i}^{*}-{W}_{j}^{*})}}{{\displaystyle \sum _{i=1}^{N}{K}_{h}({W}_{i}^{*}-{W}_{j}^{*})}}.$$

The proposed estimator is therefore the solution to the score equation (*β*)/*β* = 0. Estimates can be obtained through the Newton-Raphson iterative procedure. A consistent estimator $\widehat{\mathit{var}}(\widehat{\beta})$ for the variance matrix of can be obtained using large sample properties. A simple *ad hoc* bandwidth selection $h=2\widehat{\sigma}{W}^{*}{n}_{V}^{-1/3}$ can be used, where _{W*} is the estimated standard deviation of *W**.

In the case of discrete *W**, an analogous estimator can be obtained by replacing the kernel function ${K}_{h}({W}_{j}^{*}-{W}_{i}^{*})$ with an indicator function ${I}_{[{W}_{j}^{*}-{W}_{i}^{*}]}.$ Specifically, when ${W}_{j}^{*}$ for *j* * _{rk}* contains only discrete components with possible

$$\begin{array}{c}\hfill {\widehat{P}}_{\beta}({Y}_{j}|{Z}_{j},{W}_{j})={\displaystyle \int {P}_{\beta}({Y}_{j}|x,{Z}_{j})\phantom{\rule{thinmathspace}{0ex}}d\{{\displaystyle \sum _{s}{\displaystyle \sum _{l}{\widehat{\pi}}_{\mathit{sl}}({w}^{*}){\widehat{G}}_{\mathit{sl}}(x|{w}^{*})\}}}}\\ ={\displaystyle \sum _{s=1}^{R}{\displaystyle \sum _{l=1}^{K}{\widehat{\pi}}_{\mathit{sl}}({W}_{j}^{*})\frac{{\displaystyle \sum _{i\in {V}_{\mathit{sl}}}{P}_{\beta}({Y}_{j}|{X}_{i}},{Z}_{j}){I}_{[{W}_{i}^{*}={W}_{j}^{*}]}}{{\displaystyle \sum _{i\in {V}_{sl}}{I}_{[{W}_{i}^{*}={W}_{j}^{*}]}}},}}\hfill \end{array}$$

(7)

where

$${\widehat{\pi}}_{\mathit{sl}}({W}_{j}^{*})\frac{{\displaystyle \sum _{i\in V}{I}_{[{Y}_{i}=l,{C}_{i}=s,{W}_{i}^{*}={W}_{j}^{*}]}}}{{\displaystyle \sum _{i\in V}{I}_{[{W}_{i}^{*}={W}_{j}^{*}]}}}.$$

The inference on can be carried out the same way as outlined in the continuous case.

The asymptotic properties of the proposed estimator for continuous *W** are summarized in the following theorems. We assume that for *r* = 1, … ,*R*, *k* = 1, … ,*K*, and as *N* → ∞, *n _{rk}*/

**Theorem 1**. *(consistency) Under regularity conditions let denote an estimate of β that solves the score equation *(*β*)/*β* = 0, *where a kernel estimate of P _{β}*(

**Theorem 2**. *(asymptotic normality) Under regularity conditions, the proposed estimator has an asymptotic normal distribution*

$$\sqrt{N}(\widehat{\beta}-\beta )\underrightarrow{d}N(0,\sum )$$

(8)

*with the variance matrix*

$$\sum ={I}^{-1}(\beta )+{\displaystyle \sum _{s=1}^{R}{\displaystyle \sum _{l=1}^{K}\frac{{\pi}_{\mathit{sl}}^{2}}{{\varphi}_{\mathit{sl}}}}}{I}^{-1}(\beta ){\displaystyle {\sum}_{\mathit{sl}}(\beta ){I}^{-1}(\beta )},$$

(9)

*where*

$$\begin{array}{c}\hfill I(\beta )=-{\displaystyle \sum _{r=1}^{R}{\displaystyle \sum _{k=1}^{K}\left\{{\varphi}_{\mathit{rk}}{E}_{\mathit{rk}}\phantom{\rule{thinmathspace}{0ex}}\left[\frac{{\partial}^{2}\mathrm{log}{P}_{\beta}(Y|X,Z)}{\partial \beta \partial {\beta}^{\prime}}\right]+({\pi}_{\mathit{rk}}-{\varphi}_{\mathit{rk}}){E}_{\mathit{rk}}\phantom{\rule{thinmathspace}{0ex}}\left[\frac{{\partial}^{2}\mathrm{log}{P}_{\beta}(Y|Z,W)}{\partial \beta \partial {\beta}^{\prime}}\right]\right\}}},\hfill \\ \hfill {\displaystyle {\sum}_{\mathit{sl}}\mathit{var}}\phantom{\rule{thinmathspace}{0ex}}\left\{{\displaystyle \sum _{r=1}^{R}{\displaystyle \sum _{k=1}^{K}(1-\frac{{\phi}_{\mathit{rk}}}{{\pi}_{\mathit{rk}}})\phantom{\rule{thinmathspace}{0ex}}{E}_{\mathit{rk}}\phantom{\rule{thinmathspace}{0ex}}[{\pi}_{\mathit{rk}}({W}_{i}^{*}){S}_{{X}_{i}{W}_{i}}|{W}_{i}^{*}]}}\right\},\hfill \\ \hfill {S}_{{X}_{i},{W}_{i}}=\frac{\partial {P}_{\beta}(Y|X,Z)/\partial \beta}{{P}_{\beta}(Y|Z,W)}-\frac{\partial {P}_{\beta}(Y|X,Z)/\partial \beta}{{[{P}_{\beta}(Y|Z,W)]}^{2}}{P}_{\beta}(Y|X,Z),\hfill \end{array}$$

*and E _{rk}*(.)

**Theorem 3**. *(consistent variance estimator) A consistent estimator for the asymptotic covariance-variance matrix in (9) is*

$$\widehat{\sum}(\widehat{\beta})={\widehat{I}}^{-1}(\widehat{\beta})+\frac{1}{N}{\displaystyle \sum _{s=1}^{R}{\displaystyle \sum _{l=1}^{K}\frac{{N}_{\mathit{sl}}^{2}}{{n}_{{V}_{\mathit{sl}}}}{\widehat{I}}^{-1}(\widehat{\beta})}}{\widehat{\sum}}_{\mathit{sl}}(\widehat{\beta}){\widehat{I}}^{-1}(\widehat{\beta}),$$

(10)

*where*

$$\widehat{I}(\beta )=-\frac{1}{N}\frac{{\partial}^{2}\widehat{l}(\beta )}{\partial \beta \partial {\beta}^{\prime}},{\widehat{\sum}}_{\mathit{sl}}(\beta )=\widehat{\mathit{var}}{X}_{i}\phantom{\rule{thinmathspace}{0ex}}\{{\widehat{S}}_{{X}_{i},{W}_{i}^{*},}\phantom{\rule{thinmathspace}{0ex}}i\in {V}_{\mathit{sl}}\},$$

*where*
* _{sl}*(

$$\begin{array}{cc}{\widehat{S}}_{{X}_{i},{W}_{i}^{*}}& =\{{\displaystyle \sum _{r=1}^{R}{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{j\in {\overline{V}}_{rk}}\frac{{\overline{n}}_{rk}}{{N}_{rk}}{\widehat{\pi}}_{rk}({W}_{i}^{*})[\frac{\partial {P}_{\beta}({Y}_{j}|{X}_{i},{Z}_{j})/\partial \beta}{{\widehat{P}}_{\beta}({Y}_{j}|{Z}_{j},{W}_{j})}-\frac{\partial {\widehat{P}}_{\beta}({Y}_{j}|{Z}_{j},{W}_{j})/\partial \beta}{{[{\widehat{P}}_{\beta}({Y}_{J}|{Z}_{j},{W}_{J})]}^{2}}}}}\\ & \times {P}_{\beta}({Y}_{j}|{X}_{i},{Z}_{j})]{K}_{h}({W}_{j}^{*}-{W}_{i}^{*})\}/\left\{{\displaystyle \sum _{r=1}^{R}{\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{j\in {\overline{V}}_{rk}}{K}_{h}({W}_{j}^{*}-{W}_{i}^{*}}}}\right\}.\end{array}$$

A sketch of the proof is given in the Appendix, available as supplementary material. The asymptotic properties of the proposed estimator for discrete *W** can be similarly stated.

Simulation is conducted to evaluate the performance of the proposed estimators. The outcome *Y* is generated according to a logistic model

$${P}_{\beta}(Y=1|X,Z)=\frac{\text{exp}({\beta}_{0}+{\beta}_{1}X+{\beta}_{2}Z)}{1+\text{exp}({\beta}_{0}+{\beta}_{1}X+{\beta}_{2}Z)}.$$

(11)

*X* and *Z* are independently generated from a standard normal distribution. Let *W* = *X* + where ~ *N* (0, 4) which corresponds to *corr*(*X, W*) = 0.45. Let *C* = 1 if *W* (−∞,1] and *C* = 2 if *W* (1, ∞). Notice that *W** = *W* under this setup.

At first, we examine the statistical efficiency of the proposed estimators under three subsampling designs: SRS, ODS and OADS. In the SRS design, the *n _{V}* subjects consist of a simple random subsample from the study cohort of size

Four estimators are compared: *P*_{1} denotes the proposed estimator for continuous *W** assuming the OADS design; *P _{2}* is the counterpart of

Table 1 shows the mean of _{1} (Mean) and the standard deviation of _{1} (SE) from 1000 independent runs. In (a), where *N* = 3000, *n _{V}* = 240,

Next, we evaluate the performance of the proposed estimators *P*_{1} and P_{2} under the OADS design relative to competing methods. Four competing methods are considered: (i) *CC* is the naïve method that only analyzes data from the OADS subsample; (ii) *WL* denotes the weighted likelihood method that employs the Horvitz-Thompson weighting (e.g., Flanders & Greenland, 1991; Zhao & Lipsitz, 1992); (iii) *CL* denotes the conditional likelihood method (Breslow & Cain, 1988) that models the conditional probability of selecting a subject into the OADS subsample; (iv) *BH* denotes the semiparametric method studied by Breslow & Holubkov (1997). The last three methods were proposed for a two-stage study. To apply them to the OADS design that we consider, one needs ignore the auxiliary variable *W* and the covariates *Z* from those subjects who are not in the OADS subsample.

Table 2 lists the results for the cases with 36 and 60 subjects from each of the four strata defined by *Y* × *C* in a study cohort of size *N* = 900 and *N* = 1500, respectively. This corresponds to 16% of the size of the OADS subsample relative to the size of the study cohort. Data are generated using the same model and parameters as Table 1(b). All estimators yield consistent estimates for the regression parameters *β*_{0}, *β*_{1} and *β*_{2}. As shown by the smaller standard errors of _{1},*P*_{1},*P*_{2} and *BH* are constantly more efficient than *CC, WL* and *CL*, while *P*_{1} tends to be the most efficient. It is worth pointing out that *CC* yields consistent estimates for *β*_{1} and *β*_{2} only because a logit link function is used in simulation. Prentice & Pyke (1979) showed that valid estimates of model parameters in a logistic regression can be obtained from case-control data by fitting the model as if the data were obtained for a prospective study. As a reference, *AL* gives the results from the hypothetical situation in which all subjects in the study cohort have *X* observed. In addition, we notice that *P*_{2} and *P*_{1} yield more precise estimates for _{2} than do the competing methods. Increasing the sample size, especially the size of the OADS subsample, will improve estimation precision. The variance estimator $\widehat{\mathit{var}}(\widehat{\beta})$ for the proposed method perform well with the relatively small size of the OADS subsample. The coverage of the 95% confidence interval based on the proposed variance estimator is close to its nominal level.

Similar patterns are observed for different combinations of (*N, n _{V}*) in the OADS design. Table 3 shows the efficiency gain on

We conclude that the proposed estimator *P*_{1} performs well in analyzing data from the OADS design in finite samples; the estimates from the proposed estimator are unbiased and the proposed variance estimator yields good nominal coverage from a 95% confidence interval. Compared with competing methods, it yields more efficient estimation on the effect of the biomarker *X* as well as those for other covariates *Z*.

When the auxiliary variable *W* is continuous, we further explore the sensitivity of choosing different cutoff points on *W* via a simulation study. To simplify the problem, we assume *C* = 1 if *W* (−∞, *c*_{1}] and *C* = 2 otherwise and a balanced allocation of subjects among the four strata defined by *Y* × *C*.

Again, we generate data under the similar setting as Table 1(b). *X* and *Z* are independent standard normal variables and *W* = *X* + where is a standard normal error. The cutoff point *c*_{1} on *W* varies from 0.0 to 1.75 by a step of 0.25 such that subjects with extremely high *X* have an increasing chance of being selected into the second-stage subsample. As shown in Figure 1, the efficiency gain by over sampling extremely high *X* is not monotonically increasing, as evidenced by the standard errors of all estimators decreasing as *c*_{1} moves from 0.0 to 0.75, reaching a minimal value when *c*_{1} is around 0.75 ~ 1.0, and increasing again as *c*_{1} moves away from 1.0. The patterns are consistent across different strengths of correlation of *X* and *W*, as seen in (a): ~ *N*(0, 9) and *corr*(*X,W*) = 0.32 and (b): ~ *N*(0, 4) and *corr*(*X, W*) = 0.45. This suggests that to achieve a bigger efficiency gain it is not necessary to set the cutoff point at highly extreme values; the 60% – 70% percentiles of the auxiliary distribution *W* may be a good choice. This finding is important in guiding investigators not to choose extreme cutoff points on *W*; extreme segmentation will not only introduce difficult in ascertaining subjects with extreme values of *W* but also may lower estimation efficiency. At a 65% percentile cutoff point on *W*, we also investigate via simulation the effect of unbalanced allocation of subjects among the strata defined by outcome and auxiliary. It is found that equal allocation tends to give the best efficiency for . The phenomenon that a better efficiency is achieved at equal allocation was reported by other authors such as Breslow & Chatterjee (1999). For both simplicity and efficiency consideration, we recommend balanced allocation among the OADS strata.

In the Introduction section, the EGFR mutations example is used to illustrate the outcome and auxiliary-dependent subsampling (OADS) design. This proposal is under review by the lung cancer EGFR consortium such that no dataset is available for analysis. Instead, we illustrate the proposed method using a dataset from another study, CALGB 9761 (Maddaus *et al*., 2006). The researchers investigated whether the presence of occult micrometastases (OM) in histologically negative lymph nodes is associated with faster cancer recurrence among stage I non-small cell lung cancer (NSCLC) patients following surgical resection. The standard treatment for stage I NSCLC is surgical resection, but a considerable proportion of these patients are at high risk of cancer recurrence and a subsequent death within 2 years. It is hypothesized that the poor survival of these patients is due to the presence of occult micrometastatses (OM). Reverse transcriptase-PCR (RT-PCR) is a molecular technique which is considered a sensitive measure on the level of tumor specific MUC1 mRNA in negative lymph nodes. A total of 207 eligible patients were registered and had their OM measured by RT-PCR. All 207 patients were followed for cancer recurrence for at least 5 years. Baseline patient characteristics were captured, including age, performance status, race, sex, histology, pathological stage and smoking status. The extent of OM measured by RT-PCR (PCR) is of primary interest, which is measured on a 1–7 intensity scale with 1 “absolutely no signal” to 7 “very strong signal”.

Since CALGB 9761 doesn’t have an OADS design, to illustrate the proposed method, we first augment the original dataset by resampling it into a large cohort of 890 patients, which predominantly consists of patients who were cancer recurrence free within 2 years of follow-up (70%) and patients who absolutely have no signal of OM detected by RT-PCR (PCR > 1) (83%). We also create an auxiliary variable *W*, which is the predicted likelihood score of observing PCR > 1 and is computed according to a logistic regression model for all 890 patients. The predictors included in the logistic model are the OM measured by immunohistochemistry (IHC), female, age, adenocarcinoma and non- or former smoker. The IHC method is considered a less sensitive but much cheaper measure for OM. As shown in Table 4, we generate an illustrative data structure with three sampling components: SRS, OADS and . The SRS subsample consists of a random sample of 60 patients. The OADS subsample includes all 256 patients who had cancer recurrence within 2 years and all 60 patients who remain cancer recurrence free for more than 2 years but have a high likelihood score (> 0.70) for observing PCR > 1. The consists of the rest of 524 patients.

Table 5 lists the results of analysis of five estimators. Besides the extent of OM measured by RT-PCR (PCR), other covariates in the logistic model include age, race (non-white vs. white), performance status (2 vs. 0,1). Other baseline variables are not significant and are excluded from the final model. The continuous variables are age and PCR. Age is centered at its mean. The weighted likelihood method (*W L*) is the least efficient in estimating the PCR effect, which is contrary to the simulation result and may be caused by some peculiarity of the illustrative dataset. As evidenced by their narrower 95% CIs on the odds ratios for the PCR, the methods *CL* and *BH* yield more efficient estimates than the method *CC*. Notice that the efficiency gain is only observed for the PCR effect while the standard errors of other covariates are largely unchanged. This makes sense because *W* is strongly correlated only with the PCR. The proposed method *P*_{1} is the most efficient among all methods both for the PCR effect of (OR=0.940, 95%CI: 0.877–1.007) and all other covariates. The analysis suggests that the extent of OM measured by RT-PCR (PCR) is not significantly related to cancer recurrence within 2 years of surgical resection. Those patients who are non-white (OR=0.236, 95%CI: 0.184–0.302) and have lower performance status (OR=0.649, 95%CI; 0.543–0.775) and older age (OR=0.836, 95%CI: 0.802–0.872) have lower odds for remaining cancer recurrence free within 2 years than their counterparts without these features. We notice that these findings are consistent with those from the analysis based on the original dataset.

Our work is motivated by the need for developing an efficient method to determine that non-small cell lung cancer patients with EGFR mutations benefit much more from being treated with EGFR inhibition drugs. We developed an estimated likelihood method to accommodate the OADS design as illustrated in the motivating example. Nevertheless, it is worth noticing a potentially broader applicability of the proposed method. For example, the proposed method can be used to evaluate the role of biomarker in predicting clinical outcomes in an observational study as well as to evaluate the treatment effect for patients with positive biomarker in an randomized clinical trial. For the later setting, see Baker and Kramer (2005) for more examples and discussion. The method is also applicable to studies in which the problem of missing or mismeasured covariates exists as long as *P*(*δ* = 1|/*Y,W,X,Z*) = *P*(*δ* = 1|*Y,W,Z*) holds where *δ* is the selection indicator.

In the presentation, we assume the existence of an auxiliary variable *W* that is correlated with the expensive or invasive biomarker *X*, and *W* is conveniently available to all subjects of the study cohort. Efficiency gain can be achieved by utilizing the auxiliary variable *W* in study design. As compared to an outcome-dependent subsampling (ODS) and a simple random subsampling (SRS), the outcome and auxiliary-dependent subsampling (OADS) generally improves the efficiency in estimating the effect of the biomarker *X* with a subsample of a fixed size. Besides the role of the auxiliary variable *W* in study design, additional efficiency gain can be achieved by incorporating the auxiliary variable *W* in statistical inference. The proposed estimator is studied under two cases. When the auxiliary variable *W* is continuous, a nonparametric kernel smoother is used to estimate the unknown quantity of the likelihood. We derived asymptotic distribution theory for the proposed estimator when the auxiliary variable is continuous and the optimal bandwidth rate is applied. The *ad hoc* bandwidth used in the simulation is a convenient choice. In analyzing real data with continuous auxiliary variable, it is useful to consider some bandwidth selection criteria such as the generalized cross validation (GCV). Boundary points are not a big problem based on our simulation, but one may consider a boundary kernel (Eubank, 1999) or the locally linear smoother (Fan, 1992). When the auxiliary variable *W* is discrete, we show that a similar empirical estimator based on the level of the auxiliary variable *W* applies and similar asymptotic properties hold. As compared to the competing methods developed for the two-stage studies, the proposed estimator uses the exact value of the auxiliary variable *W*, allowing a more precise estimation of the unknown quantity *P*(*Y* |*Z ,W*) and consequently a better precision for the estimates of regression parameters. Also, the proposed estimator is able to incorporate the information of the model covariates *Z* from the subjects not being selected onto the subsample, resulting in a considerable efficiency gain in estimating the regression parameters of *Z*.

When *W** has multiple continuous variables, a multivariate kernel smoother has to be used to estimate *G*(*X*|*W**). For the reason of curse of dimensionality, the proposed method will not work well if the dimension of the kernel smoother is high (e.g. *p* > 3). One way to avoid the potential issue due to high dimensionality, the auxiliary variable can be created using a predicted model with possible multiple predictors, such as the predicted likelihood score of EGFR mutations in the motivating example, such that a single dimensional kernel smoother is applicable. Finally, we should point out that although it is desirable to have the auxiliary variable *W* as a stratification factor in subsampling and as an intermediate variable for the proposed estimator, the existence of an auxiliary variable for the biomarker of interest is not a perquisite to use the proposed method.

The first author was supported by National Cancer Institute grant CA-131596 and the Duke Clinical and Translational Science Award UL1-RR024128. The second author was supported by National Institutes of Health grant CA-79949.

**Supplementary Materials**

The appendix gives a sketched proof of the asymptotic properties of the proposed estimator. It is available at the Biometrics website http://www.biometrics.tibs.org as a supplementary material. The authors would like to thank the editor, the associate editor and two referees for their valuable comments.

- Baker SG, Kramer BS. Statistics for weighing benefits and harms in a proposed genetic substudy of a randomized cancer prevention trial. Journal of the Royal Statistical Society Series C. 2005;54:941–954.
- Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
- Breslow NE, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Applied Statistics. 1999;48:457468.
- Breslow NE, Day NE. Statistical methods in cancer research I: The analysis of case-control studies. Lyon: International Agency for Research on Cancer; 1980.
- Breslow NC, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, B. 1997;59:447–461.
- Carroll RJ, Wand MP. Semiparametric estimation in Logistic measurement error models. Journal of the Royal Statistical Society, B. 1991;53:573–585.
- Cosslett SR. Maximum likelihood estimator for choice-based samples. Econometrica. 1981;49:12891316.
- Eberhard DA, Giaccone G, Johnson BE. on behalf of the Molecular Assays in NonSmall-Cell Lung Cancer Working Group. Biomarkers of response to epidermal growth factor receptor inhibitors in nonsmall-cell lung cancer: Standardization for use in the clinical trial Setting. Journal of Clinical Oncology. 2008 in press. [PubMed]
- Eubank RL. Nonparametric Regression and Spline Smoothing. Marcel Dekker, Inc.; 1999.
- Fan J. Local linear regression smoothers and their minimax efficiencies. Annals of Statistics. 1993;21:196–216.
- Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine. 1991;10:739–747. [PubMed]
- Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
- Jänne PA, Wang XF, Kratzke R. Evaluation of EGFR and K-ras mutations in patients with non-small-cell lung cancer. unpublished CALGB Study Concept. 2008
- Lynch TJ, Bell DW, Sordella R, et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. New England Journal of Medicine. 2004;350:2129–2139. [PubMed]
- Maddaus MA, Wang XF, Vollmer RT, Abraham NZ, D’Cunha J, Herzan DL, Patterson A, Kohman LJ, Green MR, Kratzke RA. CALGB 9761: A prospective analysis of IHC and PCR based detection of occult metastatic disease in stage I NSCLC. it Journal of Clinical Oncology, 2006 ASCO Annual Meeting Proceedings Part I. 2006;Vol 24 No. 18S (June 20 Supplement), 2006: 7030.
- Nadaraya EA. On estimating regression. Theory of Probability and Its Applications. 1964;10:186–190.
- Paez JG, Jäanne PA, Lee JC, et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304:1497–1500. [PubMed]
- Pepe MS, Fleming RR. A nonparametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
- Prentice RL. Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine. 1989;8:431440. [PubMed]
- Prentice RL, Pyke R. Logistic disease incidence models and case control studies. Biometrika. 1979;66:403–411.
- Schill W, Jöckel K-H, Drescher K, Timm J. Logistic analysis in case-control studies under validation sampling. Biometrika. 1993;80:339–352.
- Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71.
- Wacholder S, Weinberg CR. Flexible maximum likelihood methods for assessing joint effects in case-control studies with complex sampling. Biometrics. 1994;50:350–357. [PubMed]
- Wand MP, Jones MC. Kernel Smoothing. London: Chapman and Hall; 1995.
- Wang XF, Zhou HB. A semiparametric empirical likelihood method for biased sampling schemes in epidemiologic studies with auxiliary covariates. Biometrics. 2006;62(4):1149–1160. [PubMed]
- Watson GS. Smooth regression analysis. Sankhya, Ser. A. 1964;26:359–372.
- Weaver MA, Zhou HB. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.
- Wild CJ. Fitting prospective regression models to case-control data. Biometrika. 1991;78:705–717.
- White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. [PubMed]
- Zhao LP, Lipsitz S. Designs and analysis of two-stage studies. Statistics in Medicine. 1992;11:769–782. [PubMed]
- Zhou H, Chen J, Cai J. Random effects logistic regression analysis with auxiliary covari-ates. Biometrics. 2002;58:352–360. [PubMed]
- Zhou H, Chen J, Rissanen T, Korrick S, Hu H, Salonen JT, Longnecker MP. An efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology. 2007;18(4):461–468. [PMC free article] [PubMed]
- Zhou H, Pepe MS. Auxiliary covariate data in failure time regression. Biometrika. 1995;82:139–149.
- Zhou H, Wang CY. Failure time regression analysis with measurement error in Covariates. Journal of the Royal Statistics Society, B. 2000;62:657–665.
- Zhou H, Weaver MA. Outcome dependent selection models. Encyclopedia of Environmetrics. 2001 3:14991502.
- Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome-dependent sampling design with a continuous outcome. Biometrics. 2002;58:413–421. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |