Biostatistics. Jul 2011; 12(3): 521–534.
Published online Jan 20, 2011. doi:  10.1093/biostatistics/kxq080
PMCID: PMC3114654
Semiparametric inference for a 2-stage outcome-auxiliary-dependent sampling design with continuous outcome
Haibo Zhou*
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7420, USA; zhou@bios.unc.edu
Yuanshan Wu
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7420, USA and School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China
Yanyan Liu
School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China
Jianwen Cai
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7420, USA
*To whom correspondence should be addressed.
Received February 25, 2010; Revised December 5, 2010; Accepted December 6, 2010.
The two-stage design has long been recognized as a cost-effective way of conducting biomedical studies. In many trials, auxiliary covariate information may also be available, and it is of interest to exploit these auxiliary data to improve the efficiency of inferences. In this paper, we propose a 2-stage design with continuous outcome where the second-stage data are sampled with an “outcome-auxiliary-dependent sampling” (OADS) scheme. We propose an estimator that maximizes an estimated likelihood function. We show that the proposed estimator is consistent and asymptotically normally distributed. The simulation study indicates that greater study efficiency gains can be achieved under the proposed 2-stage OADS design by utilizing the auxiliary covariate information when compared with alternative sampling schemes. We illustrate the proposed method by analyzing a data set from an environmental epidemiologic study.
Keywords: Auxiliary covariate, Kernel smoothing, Outcome-auxiliary-dependent sampling, 2-stage sampling design
Biomedical studies are often designed to assess the relationship between an exposure X of interest and the corresponding outcome Y of an individual, adjusted for confounding covariates Z. In many situations, a limited budget makes it infeasible to measure X on all subjects under study. One useful way to address this issue is a 2-stage stratified sampling design, originally introduced by Neyman (1938), which enhances study efficiency while minimizing costs. At the first stage of a typical 2-stage design, a relatively large random sample is drawn and the outcome Y and covariates Z, which are easier to measure, are recorded; at the second stage, X is ascertained for a subsample drawn randomly, without replacement.
There is a large literature on variations of 2-stage sampling designs with binary outcomes. For example, White (1982) proposed a stratified case–control design of a rare disease (i.e. Y) and a rare exposure (i.e. X), where a large preliminary random sample is drawn at the first stage, from which strata are identified on the basis of both the disease and the exposure. At the second stage, a subsample is drawn from within the strata identified in the first stage and measurements of the potential confounding variables are made on the subsample. Compared with simple random sampling at the second stage regardless of either the disease or the exposure status, great efficiency gains can be achieved by selecting the desired number of cases and controls within each stratum identified in the first stage. Rathouz and others (2002) considered a matched case–control study with binary outcome using the conditional logistic regression method. Recently, Schildcrout and Rathouz (2010) extended this stratified case–control design to a more general case where the response is a longitudinal binary variable.
On the other hand, when there exists an additional auxiliary variable W for the expensive X, which is easily obtained for all subjects under study at the first stage, it is necessary to incorporate the information implied by W into the statistical analysis. For instance, in a lung cancer biomarker study, one of the aims is to assess the epidermal growth factor receptor (EGFR) mutations (X) as a predictive biomarker for whether a subject responds to a greater extent to EGFR inhibitor drugs (Y). Due to high cost of genotyping EGFR genes, it is prohibitive to ascertain the genotype of EGFR genes on all samples at the first stage. However, the likelihood score of EGFR mutations (W) obtained by a designed questionnaire has been shown to relate to the EGFR mutations and can be easily observed for all patients in Paez and others (2004). Wang and Zhou (2010) considered inference of the 2-stage outcome-auxiliary-dependent sampling (OADS) design to increase the study efficiency by utilizing the auxiliary covariate information when the outcome is categorical. Zhang and others (2008) and Lu and Tsiatis (2008) also showed that using the available baseline auxiliary covariate information can achieve more efficient estimators in the analysis of randomized clinical trials and survival data, respectively.
As the scope of biomedical inquiry grows, it is important to investigate the relationships between continuous biological outcomes and an exposure of interest, adjusted for other covariates. It is cost-effective to adopt a 2-stage design when the exposure is hard to obtain. However, most current 2-stage designs have been developed for categorical outcomes, and statistical methods for 2-stage designs with a continuous outcome are limited. When an auxiliary W does not exist, Chatterjee and others (2003) considered a pseudoscore estimator for the regression parameter with 2-stage sampling. Weaver and Zhou (2005) proposed a 2-stage outcome-dependent sampling (ODS) design for continuous outcome regression models, wherein the subsample was drawn at the second stage within strata obtained by subdividing the range of the continuous outcome variable into class intervals.
In this paper, we propose a 2-stage OADS design for the case in which the outcome Y is continuous and an auxiliary variable W is available at the first stage. Specifically, the outcome Y, the auxiliary variable W for exposure X, and other covariates Z are all observed for all patients at the first stage. We then select a subsample within each stratum defined by the partition of the domain of (Y,W) to ascertain the value of X at the second stage. An estimated likelihood function, obtained by estimating the infinite-dimensional nuisance parameter with a kernel smoother, is proposed, and the maximizer of the estimated likelihood is used to estimate the regression parameter. The proposed 2-stage OADS design with continuous outcome is shown to be more efficient than alternative competing sampling schemes.
The rest of this paper is structured as follows. We describe the 2-stage OADS design, data structure, and the model in Section 2. The estimated likelihood function method and the asymptotic properties of the resulting estimator are presented in Section 3. We conduct a simulation study to assess the small sample approximation under the 2-stage OADS design in Section 4. In Section 5, a real data example is analyzed to illustrate our proposed method. Some conclusions are provided in Section 6, and the proofs of the asymptotic properties of the proposed estimator are given in the supplementary material available at Biostatistics online.
2.1. Two-stage OADS design and data structure
To fix notation, let Y denote a continuous outcome variable, {Z,X} be a covariate vector, and W be a continuous auxiliary variable for X. We assume that the conditional distribution of Y given Z and X is known up to a finite vector of unknown parameters, that is,
Y | Z, X ∼ f(y|z,x;β0), (2.1)
where β0 is the true value of q-vector regression parameter β of interest. Assume that W offers no additional information regarding the outcome Y given covariate X.
Assume that the domain of (Y,W) is denoted by 𝒴×𝒲. Let 𝒴 be partitioned into J mutually exclusive and exhaustive strata by the known constants −∞ = a0 < a1 < ⋯ < aJ−1 < aJ = ∞, and let the jth stratum be denoted by Aj = (aj−1,aj], for j = 1,…,J. Similarly, let 𝒲 be partitioned into T mutually exclusive and exhaustive strata by the known constants −∞ = b0 < b1 < ⋯ < bT−1 < bT = ∞, and let the tth stratum be denoted by Bt = (bt−1,bt], for t = 1,…,T. For later use, we define B0 = (−∞,∞) when T = 0, which indicates that there is no partition on 𝒲. Therefore, we have 𝒴×𝒲 partitioned into J×T mutually exclusive and exhaustive rectangles Aj×Bt, for j = 1,…,J and t = 1,…,T. For simplicity, we rewrite these rectangles as Δk for k = 1,…,K. Hence, {Aj×Bt: j = 1,…,J and t = 1,…,T} = {Δk: k = 1,…,K} and [formula fx2].
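The partition into rectangles can be sketched in a few lines of Python. This is only an illustration: the function name `stratum_index`, the use of NumPy, and the row-major numbering of the rectangles Δk are assumptions, since the text does not fix a particular ordering of the Δk.

```python
import numpy as np

def stratum_index(y, w, a_cuts, b_cuts):
    """Return a label k in {1, ..., J*T} for each (y, w) pair.

    a_cuts and b_cuts hold the interior cutpoints a_1 < ... < a_{J-1}
    and b_1 < ... < b_{T-1}; the outer limits are -inf and +inf.
    """
    a_cuts = np.asarray(a_cuts, dtype=float)
    b_cuts = np.asarray(b_cuts, dtype=float)
    # right=True matches the half-open intervals (a_{j-1}, a_j].
    j = np.digitize(y, a_cuts, right=True)  # 0-based stratum of Y
    t = np.digitize(w, b_cuts, right=True)  # 0-based stratum of W
    T = len(b_cuts) + 1
    # Row-major numbering of the rectangles (an assumed convention).
    return j * T + t + 1
```

With J = T = 3 and this numbering, the four corner rectangles A1×B1, A1×B3, A3×B1, and A3×B3 receive the labels 1, 3, 7, and 9.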
At the first stage, N subjects are sampled at random from a population, with (Yi,Zi,Wi), i = 1,…,N, being observed. Suppose that there are Nk observations of (Y,W) falling into stratum Δk; then N1 + ⋯ + NK = N. The second-stage sample, on which X is observed, comprises 2 components: (i) a simple random sample (SRS) of size n0 and (ii) a supplemental OADS sample of size nk from the kth stratum Δk for k = 1,…,K. Let Ri indicate whether Xi is observed (Ri = 1) or not (Ri = 0) for the ith subject. Let n0k denote the number of subjects in the SRS falling into the kth stratum Δk. Furthermore, let [formula fx4] denote all the subjects in the SRS and define Vk = {i: Ri = 1, (Yi,Wi) ∈ Δk} and [formula fx5]; then nk + n0k = |Vk| and [formula fx6], where here and hereafter we use the notation |A| to denote the cardinality of a set A. Let [formula fx7] represent the supplemental OADS samples in the stratum Δk, where A∖B is defined as the set consisting of elements that are in set A but not in set B. Let V = V1 ∪ ⋯ ∪ VK and [formula fx8] represent the validation set (the set with X observed, i.e. the second-stage set) and the nonvalidation set (i.e. the first-stage subjects that are not sampled at the second stage), respectively. Hence, the observed data structure for the proposed 2-stage OADS design with continuous outcome can be summarized as follows: the first stage: {Yi,Zi,Wi} for i = 1,…,N; the second stage: (i) the SRS sample: {Yi,Xi,Zi,Wi} for i ∈ [formula fx4]; (ii) the OADS sample: {Xi|(Yi,Wi) ∈ Δk,Zi} for [formula fx9] and k = 1,…,K; and (iii) the nonvalidation sample: {Yi,Zi,Wi} for [formula fx10].
To better illustrate the proposed OADS design with continuous outcome, we present Figure 1 for the case J = T = 3. At the second stage, in addition to the SRS samples, the supplemental OADS samples are selected within the strata at the 4 corners, Δ1 = A1×B1, Δ2 = A1×B3, Δ3 = A3×B1, and Δ4 = A3×B3, based on the consideration that these combinations of the extreme values of both Y and W contain more information about the relationship of interest between outcome Y and exposure X. Hence, the advantage of such a 2-stage OADS design is that, while providing overall information about the population from the SRS samples, it allows the investigator to oversample certain segments of the population that are believed to be more informative.
Fig. 1.
Illustration for the proposed 2-stage OADS design with continuous outcome. Y-axis denotes outcome variable Y. X-axis denotes auxiliary variable W.
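The second-stage selection described above can be sketched as follows, assuming a stratum label has already been computed for every first-stage subject. The function name `second_stage_sample`, its arguments, and the labels used for the corner strata are hypothetical and serve only to illustrate the SRS-plus-supplement structure.

```python
import numpy as np

def second_stage_sample(stratum, n0, n_extra, rng):
    """Return a boolean array R with R[i] = True when X_i is to be measured.

    stratum : stratum label of each first-stage subject
    n0      : size of the SRS component
    n_extra : dict {stratum label: supplemental sample size n_k}
    rng     : numpy.random.Generator
    """
    stratum = np.asarray(stratum)
    N = stratum.size
    R = np.zeros(N, dtype=bool)
    # (i) simple random sample, without replacement, from the whole cohort
    R[rng.choice(N, size=n0, replace=False)] = True
    # (ii) supplemental sample from each chosen stratum, drawn only from
    #      subjects that were not already selected
    for k, n_k in n_extra.items():
        pool = np.flatnonzero((stratum == k) & ~R)
        if pool.size < n_k:
            raise ValueError(f"stratum {k} has only {pool.size} unsampled subjects")
        R[rng.choice(pool, size=n_k, replace=False)] = True
    return R
```

Because the supplemental draws exclude subjects already selected, the validation set has exactly n0 plus the sum of the nk members.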
The 2-stage ODS design proposed by Weaver and Zhou (2005) assumed that only the outcome variable is observed in the first stage and that the covariates are ascertained for a subsample drawn at the second stage from strata defined by the outcome. Our proposed 2-stage OADS design includes this design when T = 0 and the information in Z and W is discarded. We call this design a 2-stage ODS design with only the outcome observed at the first stage. However, in many studies, some covariates, such as age, gender, and race, can be observed for all subjects in the cohort study. We therefore extend the design of Weaver and Zhou (2005) to this more practical situation. When the auxiliary information is available for all subjects, our proposed 2-stage OADS design can accommodate the 2-stage ODS design with the outcome, some covariates, and the auxiliary variable observed at the first stage by letting T = 0. It is worth noting that the subsequent methodological development for the 2-stage OADS design remains valid for the 2-stage ODS design in the scenarios mentioned above.
2.2. Likelihood function
Let G(x|z,w) and g(x|z,w) be the conditional cumulative distribution function and the conditional probability function of X given (Z,W). We will construct the likelihood function based on all the observations under the 2-stage OADS design. First, the contribution from the SRS at the second stage to the full likelihood is proportional to
[formula fx11] (2.2)
Second, the likelihood for the supplemental OADS sample at the second stage can be shown to be proportional to (Zhou and others, 2002)
[formula fx12] (2.3)
Furthermore, the observations in the nonvalidation sample contribute the following term to the full-information likelihood function:
[formula fx13] (2.4)
where f(Y|Z,W;β) = ∫𝒳 f(Y|Z,x;β) dG(x|Z,W).
Finally, as shown by Weaver and Zhou (2005), conditional on the component size of the OADS being fixed, the kth stratum size for the nonvalidation sample [formula fx14] follows a multinomial law such that
[formula fx15] (2.5)
Conditional on the observed size [formula fx16], the observations in the nonvalidation sample are independent of those in the validation sample. After combining and simplifying terms (2.2)–(2.5), we have derived the full likelihood based on all the observations under the 2-stage OADS design, which is proportional to
[formula fx17] (2.6)
The presence of the nuisance function G(x|z,w) makes the inference for β challenging. Obviously, direct maximization of LF(β) is not feasible since the function G(x|z,w) cannot be factored out. A simple method is to assume a parametric distribution for G(x|z,w), but this could lead to a biased conclusion if the underlying model is misspecified in that, generally, the relationship between W and X may not be known to be specified through a parametric model. A more attractive approach is to model it nonparametrically.
In the estimated likelihood method, an unspecified nuisance parameter, such as the conditional distribution function G(x|z,w) in (2.6), is replaced by a consistent estimator. When the validation sample is a simple random sample, one could estimate G(x|z,w) using data from validation sample by an empirical imputation method for discrete auxiliary (Pepe and Fleming, 1991) and by kernel smoothing (Carroll and Wand, 1991) for continuous auxiliary. Zhou and Pepe (1995), Zhou and Wang (2000), and Liu and others (2009) applied the estimated likelihood approach to time-to-event data subject to random censoring.
Due to the 2-stage OADS design, the validation sample is not a simple random sample, so we cannot use a simple global empirical distribution function to estimate G(x|z,w). Hence, one should account for the sampling mechanism under the 2-stage OADS design to estimate G(x|z,w) nonparametrically. Let S denote the informative components of (Z,W) in the sense that G(X|Z,W) = G(X|S) almost surely. Without loss of generality, assume that S is a continuous variable with dimension d. Note that G(x|s) = π1(s)G1(x|s) + ⋯ + πK(s)GK(x|s), where πk(s) = Pr((Y,W) ∈ Δk|s) and Gk(x|s) = G(x|s,(Y,W) ∈ Δk). Then we estimate πk(s) and Gk(x|s), respectively, by [formula fx19] and [formula fx20], where I(·) is an indicator function and [formula fx21] is a d-dimensional kernel function with the bandwidth hN. For simplicity, we suppress the subscript of hN hereafter. Hence, G(x|s) can be subsequently estimated by [formula fx22], which is a consistent estimator as shown in the supplementary material available at Biostatistics online.
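To make the stratum-wise smoothing idea concrete, here is a minimal Python sketch for scalar S. It assumes kernel-weighted (Nadaraya–Watson type) forms: πk(s) is smoothed over all first-stage subjects, and Gk(x|s) is smoothed over the validation subjects in stratum Δk. These forms, the Gaussian kernel, and the names `gauss_kernel` and `G_hat` are assumptions for illustration and may differ in detail from the estimators defined in the text.

```python
import numpy as np

def gauss_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def G_hat(x, s, S, X, stratum, R, h):
    """Estimate G(x | s) as sum_k pi_k(s) * G_k(x | s) for scalar S.

    S, stratum : observed for all N first-stage subjects
    X          : exposure; only entries with R True are used
    R          : boolean validation indicator
    h          : bandwidth
    """
    S, X, stratum, R = map(np.asarray, (S, X, stratum, R))
    kw = gauss_kernel((S - s) / h)            # kernel weight of every subject
    total = 0.0
    for k in np.unique(stratum):
        in_k = stratum == k
        pi_k = kw[in_k].sum() / kw.sum()      # smoothed Pr((Y, W) in Delta_k | s)
        val_k = in_k & R                      # validation subjects in stratum k
        denom = kw[val_k].sum()
        if denom > 0:                         # strata without validation data add nothing
            G_k = (kw[val_k] * (X[val_k] <= x)).sum() / denom
            total += pi_k * G_k
    return total
```

Weighting each stratum-specific estimate by the smoothed stratum probability is what corrects for the oversampling of particular strata at the second stage; a pooled empirical distribution of the validation X values would not.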
The estimated likelihood function is obtained by substituting G(x|s) in (2.6) with [formula fx23], and the corresponding estimated log-likelihood function is denoted by [formula fx24], where
[formula fx25]
with
[formula fx26]
and [formula fx27], which does not depend on β.
The solution to the estimated score equations [formula fx28], denoted by [formula fx29], is used to estimate β0, where
[formula fx30]
with f(y|z,x;β) = ∂f(y|z,x;β)/∂β. One can adopt the Newton–Raphson iteration method to obtain the estimator [formula fx29]. A simple ad hoc bandwidth selection [formula fx31] can be used if S = W almost surely, where [formula fx32] is the sample standard error of {Wi, i ∈ Vk}.
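The Newton–Raphson step can be sketched generically. The code below solves U(β) = 0 for user-supplied score and Jacobian functions; the names `newton_raphson`, `score`, and `hessian` are hypothetical, and in the present setting one would pass in the estimated score and its derivative with respect to β, which are not reproduced here.

```python
import numpy as np

def newton_raphson(score, hessian, beta_init, tol=1e-8, max_iter=50):
    """Solve score(beta) = 0 by Newton-Raphson iteration.

    score   : callable returning the q-vector U(beta)
    hessian : callable returning the q x q matrix dU/dbeta
    """
    beta = np.asarray(beta_init, dtype=float)
    for _ in range(max_iter):
        # Newton update: beta_new = beta - [dU/dbeta]^{-1} U(beta)
        step = np.linalg.solve(hessian(beta), score(beta))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            return beta
    raise RuntimeError("Newton-Raphson did not converge")
```

For a Gaussian linear model the score is linear in β, so the iteration reaches the least squares solution after a single update, which makes a convenient check of the implementation.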
The true values of the parameters are indicated by superscript “0.” Let Ek denote a conditional expectation given (Y,W) ∈ Δk, under the true parameters. Assume that |V|/N → ρV > 0 and nk/|V| → ρk ≥ 0 for k = 0,…,K, as N → ∞. Let γk = Pr{(Y,W) ∈ Δk}. The regularity conditions needed to derive the asymptotic properties are given in the supplementary material available at Biostatistics online. Then the asymptotic properties of the proposed estimator [formula fx29] are summarized in the following theorem.
THEOREM 1.
Under the regularity conditions, [formula fx29] converges in probability to β0, while [formula fx33] converges weakly to a normal distribution with mean zero and covariance Σ(β0), where
[formula fx34]
The proof of Theorem 1 is provided in the separate supplementary material available at Biostatistics online. The consistent variance estimator is stated in the following theorem.
THEOREM 2.
Under the regularity conditions, a consistent estimator for the asymptotic covariance matrix Σ(β0) is
[formula fx35]
where [formula fx36] and [formula fx37] with [formula fx38].
We conducted a simulation study to assess the small sample performance of our proposed estimator. The data were generated from a linear regression model of the form:
Y = β0 + β1X + β2Z + 2ς,
where X, Z, and ς were generated independently from standard normal distribution. Thus, the conditional distribution of Y given X and Z is normal with mean β0 + β1X + β2Z and variance 4. Let W = X + ϵ, where ϵ was generated from a zero-mean normal distribution with variance σ2. Note that the value of σ2 indicates the strength of information contained in W for X. We set σ = 1 in simulation, which represents a moderate association between the W and X. Here, we take S = W.
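The data-generating mechanism just described can be sketched in Python as follows. The function name `simulate_first_stage` and its arguments are hypothetical; the error term is scaled by 2 so that the conditional variance of Y given (X, Z) equals 4, as stated above.

```python
import numpy as np

def simulate_first_stage(N, beta, sigma, rng):
    """Generate (Y, X, Z, W) for N subjects.

    beta  : (beta0, beta1, beta2)
    sigma : standard deviation of the error in W = X + eps
    rng   : numpy.random.Generator
    """
    beta0, beta1, beta2 = beta
    X = rng.standard_normal(N)
    Z = rng.standard_normal(N)
    # Error standard deviation 2, so Var(Y | X, Z) = 4.
    Y = beta0 + beta1 * X + beta2 * Z + 2.0 * rng.standard_normal(N)
    # Auxiliary variable: X measured with independent normal error.
    W = X + sigma * rng.standard_normal(N)
    return Y, X, Z, W
```

With σ = 1 the correlation between W and X is 1/√2, about 0.71, which matches the description of a moderate association. In an actual 2-stage analysis X would be treated as unobserved outside the validation set.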
Suppose there are N subjects available at the first stage. Let ai and bi denote the (i/3)th quantile of Y and W, respectively, for i = 1,2. First, we use the method depicted in Figure 1 to obtain the second-stage samples for the 2-stage OADS design. Then the size of the validation set is |V| = n0 + n1 + n2 + n3 + n4. Second, while selecting the same SRS sample of size n0, we also select the 2 supplemental ODS samples in the stratum A1 of size n1 + n2 and stratum A3 of size n3 + n4, respectively, to mimic the design proposed by Weaver and Zhou (2005). Note that the sizes of the validation set V obtained at the second stage through the above 2 sampling designs are the same.
Having obtained the data under the 2-stage OADS design, we denote the proposed estimator by [formula fx41]. We also denote the reduced proposed estimator by [formula fx42] for the 2-stage ODS design with (Y,Z,W) observed at the first stage. We compare estimators [formula fx42] and [formula fx41] with some competing estimators. The first estimator, denoted by [formula fx43], is the inverse probability weighted estimator (Horvitz and Thompson, 1952) based on the 2-stage OADS design. The second set of estimators to be compared, as discussed in Section 2.1, comprises the estimator [formula fx44] for the 2-stage ODS design with (Y,Z) observed at the first stage and, similarly, the estimator [formula fx45] for the 2-stage ODS design with only Y observed at the first stage and (X,Z) observed at the second stage. The bandwidth [formula fx46] is used for these estimators involving kernel smoothing, where [formula fx32] is the sample standard error of {Wi, i ∈ Vk}. Finally, as a benchmark, we also consider the efficient linear regression estimator, denoted by [formula fx47], which corresponds to a hypothetical situation in which all subjects at the first stage have X observed, and the ordinary linear regression estimator, denoted by [formula fx48], from a simple random sample of the same size as the validation set at the second stage. Note that the efficiency difference for methods βY1, βY2, βP1, and βP2 should be attributed to the study design instead of the estimating procedure. However, βP2 and βW are different estimating procedures under the same 2-stage OADS design.
For narrative simplicity, we define an allocation function, denoted by allocation(μ,ν), that allocates a validation set of size μ + 4ν at the second stage, which means that n0 = μ and n1 = n2 = n3 = n4 = ν under the 2-stage OADS design as illustrated in Figure 1. Under the 2-stage ODS design, allocation(μ,ν) means that an SRS sample of size μ and 2 supplemental ODS samples, one in stratum A1 of size 2ν and one in stratum A3 of size 2ν, are allocated. We also investigate the impact on the parameter estimation of different allocations of the total validation sample size between the SRS sample and the supplemental OADS (ODS) samples at the second stage, with (N,β0,β1,β2) = (1500,0.5,0.3,0.5) fixed.
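The allocation rule is simple enough to write down directly. The sketch below returns the second-stage sample sizes implied by allocation(μ,ν) under either design; the function signature and the string labels for the strata are hypothetical.

```python
def allocation(mu, nu, design="OADS"):
    """Return second-stage sample sizes for allocation(mu, nu).

    Under OADS, nu subjects are added in each of the 4 corner rectangles;
    under ODS, 2*nu subjects are added in each of the outcome strata A1 and A3.
    """
    if design == "OADS":
        sizes = {"SRS": mu, "A1xB1": nu, "A1xB3": nu, "A3xB1": nu, "A3xB3": nu}
    elif design == "ODS":
        sizes = {"SRS": mu, "A1": 2 * nu, "A3": 2 * nu}
    else:
        raise ValueError("design must be 'OADS' or 'ODS'")
    return sizes

# Both designs use a validation set of the same total size, mu + 4*nu.
total_oads = sum(allocation(120, 30, "OADS").values())
total_ods = sum(allocation(120, 30, "ODS").values())
```

For example, allocation(120,30) gives a validation set of 120 + 4 × 30 = 240 subjects under both designs, so differences between the designs are not driven by sample size.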
For each simulation configuration, 1000 replicated samples were generated, and the results are presented in Table 1. Under the model studied, we make the following observations on the estimator [formula fx49], the parameter of interest. Note that the estimator [formula fx50] works quite well in all scenarios. First, all the methods in all the scenarios yield consistent estimators, the variance estimators accurately reflect the true variations, and the confidence intervals have proper coverage probabilities. Second, the proposed estimators [formula fx51] and [formula fx54] are more efficient than the estimators [formula fx52] and [formula fx53], which indicates that taking auxiliary information into consideration indeed yields substantial gains in estimation efficiency. Furthermore, [formula fx54] is more efficient than [formula fx51]. This fits our expectation since [formula fx54] not only utilizes the auxiliary variable in the stratification (i.e. study design) but also incorporates it into the estimation procedure, while [formula fx51] uses it only in the estimation procedure. On the other hand, although the precision of estimator [formula fx53] and that of [formula fx52] are almost the same in the scenarios considered, the efficiency gains of [formula fx55] over [formula fx56] are apparent because the covariate Z is observed for all subjects in [formula fx44]. The estimator [formula fx57] is less efficient than [formula fx54] since [formula fx57] utilizes only the second-stage sample and the sampling probability under the 2-stage OADS design. Third, when we increase the size of the validation set from |V| = 240 to |V| = 360, more accurate estimators (including [formula fx51], [formula fx54], [formula fx52], [formula fx53], [formula fx57], and [formula fx58]) are obtained, as expected. Here, we consider 3 different ways to add the additional 120 samples to the validation set of size |V| = 240. Moving from allocation(120,30) to allocation(180,45), that is, putting half of the additional 120 samples into the SRS part and spreading the other half evenly over the OADS part, yields larger efficiency gains than moving from allocation(120,30) to allocation(240,30), that is, putting all of the additional 120 samples into the SRS part. Efficiency gains are also achieved by moving from allocation(120,30) to allocation(120,60), which spreads the additional 120 samples evenly over the OADS part. These allocation patterns indicate that adding the additional samples to both the SRS part and the supplemental OADS part, or to the supplemental OADS part alone, is better than adding them to the SRS part only. Finally, under allocation(120,60), when the cutpoints vary from [formula fx59] to [formula fx60], that is, when the product sample space 𝒴×𝒲 is stratified by more extreme cutpoints, more precise estimators (including [formula fx52], [formula fx53], [formula fx51], and [formula fx54]) are obtained, and the efficiency advantage of [formula fx51] over [formula fx54] becomes more obvious. We also investigate the effect of the strength of W for X, represented by σ, on the efficiency of estimator [formula fx49] under the methods considered; see Figure A.1 in the supplementary material available at Biostatistics online.
Table 1.
Simulation study for the proposed estimators. Results are based on 1000 replicated data sets with 1500 subjects at the first stage for each data set
It should be noted that in the above simulations the covariate X was generated independently of Z. Therefore, we took S = W and used univariate kernel smoothing to estimate the function g(X|Z,W) = g(X|W) nonparametrically. As suggested by one of the referees, we also investigate our proposed estimators when g(X|W) is specified parametrically instead of being estimated by kernel smoothing. Note that in the above simulation setups g(X|W) is a normal density function with mean W and variance 2. The resulting estimate is denoted by [inline formula biostskxq080fx65_ht.jpg]. Furthermore, we also consider this estimate in a misspecified situation in which X was generated from the model [inline formula biostskxq080fx66_ht.jpg] but the working model remains X = W + ϵ. The results are reported in Table 2. As expected, when g(X|W) is correctly specified, the estimate [inline formula biostskxq080fx65_ht.jpg] outperforms the nonparametric methods. However, when g(X|W) is misspecified, the estimate [inline formula biostskxq080fx65_ht.jpg] is biased with low coverage probability, while the nonparametric smoothing estimates, including our proposed estimates [inline formula biostskxq080fx42_ht.jpg] and [inline formula biostskxq080fx41_ht.jpg], still work well.
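To make the contrast between kernel smoothing and a parametric specification of g(X|W) concrete, the following sketch estimates the conditional distribution of X given W with Gaussian kernel weights and compares it with the normal model with mean W and variance 2 stated above. It is only an illustration: the kernel-weighted empirical distribution function, the fixed bandwidth h = 0.3, the standard normal distribution for W, and all function names are assumptions, not the estimator or settings used in the paper.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_weights(w0, w_obs, h):
    """Normalized kernel weights of the observed W values around w0."""
    raw = [gaussian_kernel((w0 - w) / h) for w in w_obs]
    total = sum(raw)
    return [r / total for r in raw]


def cond_cdf(x0, w0, x_obs, w_obs, h):
    """Kernel-weighted empirical estimate of P(X <= x0 | W = w0)."""
    weights = kernel_weights(w0, w_obs, h)
    return sum(wt for wt, x in zip(weights, x_obs) if x <= x0)


def normal_cdf(x, mean, var):
    """Parametric alternative: normal distribution function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


rng = random.Random(0)
n = 2000
w_obs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed distribution of W
x_obs = [w + rng.gauss(0.0, math.sqrt(2.0)) for w in w_obs]  # X = W + eps, Var(eps) = 2

h = 0.3  # illustrative bandwidth, not the paper's choice
estimate = cond_cdf(1.0, 1.0, x_obs, w_obs, h)
truth = normal_cdf(1.0, 1.0, 2.0)
print(round(estimate, 3), truth)
```

When the working model is correct, the parametric distribution function needs no smoothing and is therefore less variable; the kernel estimate pays for its robustness with extra variance but does not rely on the form X = W + ϵ.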
Table 2.
Simulation study for the proposed estimators. Results are based on 1000 replicated data sets with 1500 subjects at the first stage and allocation pattern allocation(120, 60) at the second stage under the cutpoints [inline formula biostskxq080fx59_ht.jpg] for each data set
On the other hand, as suggested by another referee, in practice d, the dimension of W, could be greater than one, in which case multivariate kernel smoothing would be involved. Hence, it is of practical importance to see how sensitive the resulting inference on the parameters of interest is to the dimension d of the kernel smoothing. We explore this issue with some modifications of the simulation models, where we generate Z from model Z = W2 + ϵ2, where W and ϵ2 are both generated independently from a standard normal distribution. We keep the remaining parametric simulation settings unchanged. We use 2-dimensional product standard normal kernels to estimate g(X|Z,W) with bandwidth matrix diag(h1,h2), where [inline formula biostskxq080fx67_ht.jpg], h2 is defined in a similar way, and [inline formula biostskxq080fx68_ht.jpg] is the sample standard error of {Zi, i ∈ Vk}. The corresponding estimates are listed in Table 3. It can be seen that when the dimension of kernel smoothing d equals 2, the resulting estimates of β1, the parameter of main interest, are slightly biased with low coverage probability, except for the inverse probability estimate [inline formula biostskxq080fx42_ht.jpg]. Even then, our proposed estimators [inline formula biostskxq080fx1_ht.jpg] and [inline formula biostskxq080fx41_ht.jpg] outperform [inline formula biostskxq080fx45_ht.jpg] and [inline formula biostskxq080fx44_ht.jpg].
Table 3.
Simulation study for the proposed estimators. Results are based on 1000 replicated data sets with 1500 subjects at the first stage and allocation pattern allocation(120, 60) at the second stage under the cutpoints [inline formula biostskxq080fx59_ht.jpg] for each data set with S = (Z, W)
As an illustration, we applied our proposed method to a data set from the Collaborative Perinatal Project (CPP) to evaluate the effect of a mother's serum level of polychlorinated biphenyls (PCB) during pregnancy on her children's intelligence quotient (IQ) test performance. Pregnant mothers were enrolled through university-affiliated medical clinics, and data were collected on the mothers at each prenatal visit. The children born during the study were also followed for various outcomes for up to 8 years. One hypothesis is that PCB levels are related to performance on the Wechsler Intelligence Scale for Children at 7 years of age (Longnecker and others, 1997). To investigate in utero exposure to PCB in relation to neurodevelopmental abnormality, the PCB levels were measured by analyzing the third-trimester blood serum specimens that had been preserved from mothers in the CPP study. Because the blood serum assay used to measure the PCB level is expensive, the study investigators decided to assess the PCB levels for an overall simple random sample of 849 subjects from the underlying population. In addition to the PCB level as the exposure variable of interest, other confounding variables available for all subjects under study include the socioeconomic status of the child's family (SES), the gender (SEX) and race (RACE) of the child, coded as indicators for female and black, respectively, and the mother's education (EDU) and age (AGE).
To illustrate our methods, we treat the simple random sample of 849 subjects as our underlying population and construct a 2-stage OADS design on this base population. The first-stage sample is the 849 subjects, that is, N = 849. We first explore the relationship between SES and PCB based on the first-stage sample data. A linear model fit for PCB given SES yields a slope estimate of 0.154 (p < 0.0001), which indicates a linear association between SES and PCB. Moreover, from a practical standpoint in environmental epidemiology, higher SES usually leads to a higher PCB level. Hence, we use SES as the auxiliary variable for PCB.
The 1/3 and 2/3 sample quantiles of IQ are 3.7 and 5.3, and the 1/3 and 2/3 sample quantiles of SES are 90 and 101, respectively. Hence, we take a1 = 3.7, a2 = 5.3, b1 = 90, and b2 = 101. For the second-stage sample, we assume that 60 SRS samples and 30 supplemental OADS samples in each corner are selected under the allocation pattern allocation(60,30). We use the chi-square statistic to test the independence between IQ and SES, given PCB. In particular, we discretize PCB by dPCB = (PCB > median(PCB)) and define dIQ and dSES in the same way. Under the condition dPCB = 0, the chi-square test of independence between dIQ and dSES yields a p-value of 0.6038. Similarly, under the condition dPCB = 1, the chi-square test yields a p-value of 0.4386. Hence, we conclude that, conditional on the PCB level, IQ does not further depend on SES. The fitted model is
[display formula biostskxq080fx69_ht.jpg]
where ε is a zero-mean normal variable with unknown variance.
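The conditional independence check described before the fitted model (dichotomize PCB, IQ, and SES at their medians, then test dIQ against dSES within each level of dPCB) can be sketched as follows. The CPP data are not reproduced here, so the example runs on simulated values; the use of overall medians within each stratum, the uncorrected Pearson statistic, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics


def pearson_chi2_2x2(table):
    """Pearson chi-square statistic and p-value (1 df, no continuity correction)."""
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def above_median(values):
    """Indicator of exceeding the overall sample median."""
    med = statistics.median(values)
    return [v > med for v in values]


def stratified_independence_test(iq, ses, pcb):
    """Test dIQ against dSES separately within dPCB = 0 and dPCB = 1."""
    d_iq, d_ses, d_pcb = above_median(iq), above_median(ses), above_median(pcb)
    results = {}
    for level in (False, True):
        table = [[0, 0], [0, 0]]
        for a, b, c in zip(d_iq, d_ses, d_pcb):
            if c == level:
                table[int(a)][int(b)] += 1
        results[int(level)] = pearson_chi2_2x2(table)
    return results


# Simulated stand-in data: IQ and SES both depend on PCB but not on each other.
rng = random.Random(1)
pcb = [rng.gauss(0.0, 1.0) for _ in range(849)]
ses = [0.3 * p + rng.gauss(0.0, 1.0) for p in pcb]
iq = [0.5 * p + rng.gauss(0.0, 1.0) for p in pcb]

results = stratified_independence_test(iq, ses, pcb)
for level, (stat, p_value) in results.items():
    print("dPCB =", level, "chi2 =", round(stat, 3), "p =", round(p_value, 4))
```

Large p-values in both strata, as reported for the CPP data, are consistent with IQ not depending on SES once PCB is given, although dichotomizing continuous variables makes this only a coarse check.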
The results for the CPP data analysis are summarized in Table 4. Note that since the other confounding covariates such as EDU, SES, and AGE are observed for all subjects, the method βY1, which assumes that only the outcome is observed at the first stage, is not considered in the data analysis. First, we are interested in the estimate for PCB under the various methods. All the analyses confirm that the PCB level of the mother's third-trimester blood serum specimen is not significantly related to the IQ scores of children at 7 years of age. Second, method βP2 yields a more precise 95% confidence interval (−0.432, 1.002) for the estimate of PCB than methods βW, βY2, and βP1, whose 95% confidence intervals are (−0.425, 1.253), (−0.531, 1.303), and (−0.791, 1.149), respectively. Meanwhile, the estimated standard error for the estimate of PCB in the hypothetical case βE is the smallest among all the methods considered, and the method βE yields the narrowest 95% confidence interval (−0.190, 0.702) for the estimate of PCB. Third, the estimates for the remaining covariates are almost the same under all the methods considered, as was also seen in the simulation study. Finally, although slightly different conclusions are obtained under methods βR and βW, the methods βE, βY2, βP1, and βP2 all confirm that SES, EDU, and RACE have a positive impact on the IQ scores of children, while there is no evidence that either AGE or SEX has any effect on the IQ scores.
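The precision comparison for the PCB estimate can be made explicit by computing the width of each reported 95% confidence interval. The sketch below uses only the interval endpoints quoted above; the dictionary keys are shorthand labels for the methods, not notation from the paper.

```python
# 95% confidence intervals for the PCB effect, as reported in the text.
intervals = {
    "betaE": (-0.190, 0.702),
    "betaP2": (-0.432, 1.002),
    "betaW": (-0.425, 1.253),
    "betaY2": (-0.531, 1.303),
    "betaP1": (-0.791, 1.149),
}

# Interval width is a simple summary of precision.
widths = {name: round(hi - lo, 3) for name, (lo, hi) in intervals.items()}

# An interval containing 0 means no significant PCB effect at the 5% level.
covers_zero = {name: lo < 0.0 < hi for name, (lo, hi) in intervals.items()}

ranking = sorted(widths, key=widths.get)
for name in ranking:
    print(name, widths[name], covers_zero[name])
```

The widths order the methods as βE, βP2, βW, βY2, βP1 from narrowest to widest, and every interval contains zero, which matches the conclusion that PCB is not significantly related to IQ.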
Table 4.
Analysis results for the CPP study
We proposed a new 2-stage OADS design in which the selected supplemental samples at the second stage are allowed to depend on both a continuous outcome variable and a continuous auxiliary variable. This 2-stage OADS design can be easily reduced to the 2-stage ODS design with auxiliary covariate information. An estimated likelihood function based on nonparametric kernel smoothing method is developed to accommodate the 2-stage OADS design with continuous outcome variable. The proposed estimator is shown to be consistent and asymptotically normal. The simulation study suggests that greater efficiency can be gained in estimating the effect of the exposure variable on the outcome using the proposed 2-stage OADS design over the existing or other competing 2-stage ODS designs. Additionally, using the available auxiliary data information can also substantially improve the efficiency of the study. A real data analysis is provided to illustrate our proposed method.
When the dimension d of S is moderately large (e.g. d ≥ 3), the proposed method will not work well because of the curse of dimensionality. One possible remedy is to specify g(X|S) parametrically. However, this parametric approach could lead to biased results when g(X|S) is misspecified. In practice, we suggest using our proposed method when d ≤ 2 and using the parametric method when d ≥ 3.
The proposed 2-stage OADS design allows investigators to focus their attention on the subjects who are more informative for the study aims. In general, how the domain of Y×W is divided to obtain the strata Δks may affect the efficiency of the estimators. Taking the CPP data as an example, we want to select as many subjects as possible with very high or very low IQ scores and SES values. On the other hand, the number of subjects available for sampling decreases as the values of both the IQ scores and SES become more extreme. Hence, one needs to balance these 2 considerations when using a 2-stage OADS design. Our experience shows that cutpoints consisting of the 1/3 (or 1/4) and 2/3 (or 3/4) quantiles of both the outcome and the auxiliary variable are usually feasible in practice.
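The trade-off between extreme cutpoints and the number of subjects available in each stratum can be explored with a small sampling sketch. It draws an SRS and then supplemental samples from the 4 corner strata defined by tertile cutpoints of the outcome and the auxiliary variable. The definition of a corner (both variables at or below the lower cutpoint, or above the upper one), the sampling without replacement from subjects not already in the SRS, the simulated data, and all names are assumptions made for illustration.

```python
import random
import statistics


def corner_label(y, w, a1, a2, b1, b2):
    """Return the corner stratum of (y, w), or None if it lies in the middle."""
    y_side = "low" if y <= a1 else "high" if y > a2 else None
    w_side = "low" if w <= b1 else "high" if w > b2 else None
    if y_side is None or w_side is None:
        return None
    return (y_side, w_side)


def draw_oads_sample(y, w, n_srs, n_per_corner, rng):
    """Draw an SRS plus supplemental samples from each corner stratum."""
    a1, a2 = statistics.quantiles(y, n=3)  # 1/3 and 2/3 quantiles of the outcome
    b1, b2 = statistics.quantiles(w, n=3)  # 1/3 and 2/3 quantiles of the auxiliary
    ids = list(range(len(y)))
    srs = set(rng.sample(ids, n_srs))
    selected = {"srs": sorted(srs)}
    corners = [("low", "low"), ("low", "high"), ("high", "low"), ("high", "high")]
    for corner in corners:
        pool = [
            i for i in ids
            if i not in srs and corner_label(y[i], w[i], a1, a2, b1, b2) == corner
        ]
        # A corner may hold fewer subjects than requested when cutpoints are extreme.
        take = min(n_per_corner, len(pool))
        selected[corner] = sorted(rng.sample(pool, take))
    return (a1, a2, b1, b2), selected


rng = random.Random(2)
y = [rng.gauss(0.0, 1.0) for _ in range(849)]  # simulated outcome
w = [rng.gauss(0.0, 1.0) for _ in range(849)]  # simulated auxiliary variable

cutpoints, selected = draw_oads_sample(y, w, n_srs=60, n_per_corner=30, rng=rng)
total = sum(len(v) for v in selected.values())
print(cutpoints, total)
```

Replacing n=3 with n=4 in the quantile calls, and keeping the first and last cutpoints, would give the 1/4 and 3/4 alternative mentioned above; the corner pools then shrink, which is the practical limit on how extreme the cutpoints can be.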
FUNDING
National Institutes of Health (R01 CA79949 to H.Z., Y.W.; R01 HL57444 to J.C.); China Postdoctoral Science Foundation (20100480877 to Y.W.); National Nature Science Fund of China (10771163 to Y.L.).
Supplementary Material
Supplementary Data
Acknowledgments
The authors are very grateful for the valuable comments and suggestions from the editor and the referees. They also thank Ms. Beth Horton for careful reading of the manuscript. Conflict of Interest: None declared.
  • Carroll RJ, Wand MP. Semiparametric estimation in logistic measurement error models. Journal of the Royal Statistical Society, Series B. 1991;53:573–585.
  • Chatterjee N, Chen Y-H, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association. 2003;98:158–168.
  • Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
  • Liu Y, Zhou H, Cai J. Estimated pseudo-likelihood method for correlated failure time data with auxiliary covariates. Biometrics. 2009;65:1184–1193.
  • Longnecker M, Klebanoff M, Zhou H, Wilcox A, Berendes H, Hoffman H. Proposal to Study in Utero Exposure to DDE and PCBs in Relation to Male Birth Defects and Neurodevelopmental Outcomes in the Collaborative Perinatal Project. Study Proposal. Washington, DC: National Institute of Environmental Health Science; 1997.
  • Lu X, Tsiatis AA. Improving the efficiency of the log-rank test using auxiliary covariates. Biometrika. 2008;95:679–694.
  • Neyman J. Contribution to the theory of sampling from human populations. Journal of the American Statistical Association. 1938;33:101–116.
  • Paez JG, Jänne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, and others. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304:1497–1500.
  • Pepe MS, Fleming TR. A nonparametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
  • Rathouz PJ, Satten GA, Carroll RJ. Semiparametric inference in matched case-control studies with missing covariate data. Biometrika. 2002;89:905–916.
  • Schildcrout JS, Rathouz PJ. Longitudinal studies of binary response data following case-control and stratified case-control sampling: design and analysis. Biometrics. 2010;66:365–373.
  • Wang X, Zhou H. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics. 2010;66:502–511.
  • Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.
  • White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128.
  • Zhang M, Tsiatis AA, Davidian M. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics. 2008;64:707–715.
  • Zhou H, Pepe MS. Auxiliary covariate data in failure time regression analysis. Biometrika. 1995;82:139–149.
  • Zhou H, Wang CY. Failure time regression with continuous covariates measured with error. Journal of the Royal Statistical Society, Series B. 2000;62:657–665.
  • Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421.
Articles from Biostatistics (Oxford, England) are provided here courtesy of
Oxford University Press