Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC3114650

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. DATA STRUCTURE, MODEL, AND THE PENALIZED LOG-LIKELIHOOD
- 3. MAXIMUM ESTIMATED PENALIZED LIKELIHOOD ESTIMATION
- 4. SIMULATION STUDIES
- 5. ANALYSIS OF THE COLLABORATIVE PERINATAL PROJECT DATA SET
- 6. DISCUSSION
- FUNDING
- SUPPLEMENTARY MATERIAL
- References

Authors

Related links

Biostatistics. 2011 July; 12(3): 506–520.

Published online 2010 December 14. doi: 10.1093/biostatistics/kxq070

PMCID: PMC3114650

Guoyou Qin

Department of Biostatistics, School of Public Health and Key Laboratory of Public Health Safety, Ministry of Education of China, Fudan University, Shanghai 200032, People's Republic of China

Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, USA

Haibo Zhou^{*}

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA, Email: ude.cnu.soib@uohz

Received 2010 July 26; Revised 2010 October 25; Accepted 2010 October 26.

Copyright © The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

This article has been cited by other articles in PMC.

The outcome-dependent sampling (ODS) design, which allows observation of exposure variable to depend on the outcome, has been shown to be cost efficient. In this article, we propose a new statistical inference method, an estimated penalized likelihood method, for a partial linear model in the setting of a 2-stage ODS with a continuous outcome. We develop the asymptotic properties and conduct simulation studies to demonstrate the performance of the proposed estimator. A real environmental study data set is used to illustrate the proposed method.

Two-stage design has been widely recognized as an efficient study design for epidemiological studies. For the discrete outcome, White (1982) proposed a 2-stage case–control design for a rare disease and exposure scenario, where a large simple random sample (SRS) sample is drawn in the first stage, and further subsamples with additional potential confounding variables are drawn in the second stage from the strata identified based on the disease status and the exposure from the first-stage sample. Greater efficiency can be obtained through the 2-stage sampling design (e.g., Breslow and Cain, 1988; Zhao and Lipsitz, 1992; Langholz and Borgan, 1995; Breslow *and others*, 2003).

For the continuous case, Weaver and Zhou (2005) considered a 2-stage outcome-dependent sampling (ODS) design, where in the second stage, in addition to an SRS sample, some supplemental samples were collected in the second stage via an ODS way, that is, the subsamples were drawn from the strata identified by the outcome from the first stage. The ODS design has attracted much attention in the last several decades since it is a cost-effective way to improve study efficiency. For example, the case–control study is a well-known ODS design with a binary outcome variable. One can ignore the sampling scheme when the underlying model is logistic. However, this is not true for other link function or the nonbinary outcomes. For inference of the data from ODS design with a continuous response, the usual methods will be not appropriate since the ODS is a biased sampling scheme. There are several methods that are developed to inference the data from ODS design, for example, the weighted estimating equation methods by Horvitz and Thompson (1952), the semiparametric empirical likelihood method proposed by Zhou *and others* (2002), the pseudoscore estimation methods by Chatterjee *and others* (2003), and the estimated likelihood method proposed by Weaver and Zhou (2005).

A recent epidemiological study (Gray *and others*, 2005) employed the 2-stage ODS design of Weaver and Zhou. In this study, the investigators were interested in how the children's intelligence quotient (IQ) at 7 years of age is related to an environmental pollutant, polychlorinated biphenyls (PCBs). The study subjects are children who were born into the Collaborative Perinatal Project (CPP) which is a prospective cohort designed to provide precise data for studies of a wide variety of neuropsychological outcomes and birth defects (Niswander and Gordon, 1972). Since it is expensive to obtain the PCB measurement for all the first-stage sample, the investigators decided to obtain the PCB measurement for a sample that was sampled in an ODS way from the first-stage sample based on the observed IQ scores (Gray *and others*, 2005). Then the ODS sample with measured PCBs consists of the second-stage sample. The literature on the ODS (Zhou *and others*, 2002, Weaver and Zhou, 2005) generally assumes that the effect on the outcome of the exposure is linear, which is chosen mainly for its mathematically convenience. However, in practice, the true relation between them is usually unknown. For this CPP data, significant relation between the children's IQ and PCB was not found in the framework of linear model although the efficient ODS design was adopted (e.g., Zhou *and others*, 2002, Weaver, 2001).

The partial linear model (PLM) for continuous outcomes (Zeger and Diggle, 1994, He *and others*, 2002), where the outcome is assumed to depend on some covariate *W* nonparametrically and some other covariates *Z* parametrically, is an important inference tool and has been widely applied in many fields. It would be a particular advantage in the study of Gray *and others* (2005) to have a more flexible PLM approach to investigate the relation of low-level PCB exposure and childrens cognitive development. Motivated by the CPP study, in this article, we studied the inference of a PLM in a 2-stage ODS design with a continuous outcome, where the first-stage data are an SRS, but the second-stage data are collected via an ODS design. The softwares to fit the generalized additive model (GAM) (Ruppert *and others*, 2003) were generally developed based on methods that assume SRS design, and consequently the likelihood composition would not handle the complex data structures of ODS design. Hence, they cannot be directly applied to the proposed 2-stage ODS design nor a minor modification to softwares for GAM can accomplished the task.

In our simulation study, we compared the proposed estimator, denoted by the P estimator, with 3 competing methods: the semiparametric empirical likelihood estimator incorporating the P-spline method based on the second-stage sample denoted by the SEPLE-ODS estimator, which is an extension of Zhou *and others* (2002), the penalized inverse-probability weighted estimator denoted by the IPW estimator, which is the extension of Horvitz and Thompson (1952) and defined as the maximizer of the following penalized weighted likelihood function:

(1.1)

where and the penalized maximum likelihood method based on the SRS sample with the sample size equal to the second-stage sample, denoted by the PMLE-SRS estimator. Compared with the SEPLE-ODS method, the proposed method incorporates the outcome information of the first-stage sample. Compared with the PMLE-SRS method, a more efficient ODS design, than the SRS design, is used to sample second-stage data. Further more outcome information of the first-stage sample are also used for inference. The proposed method is preferred to the IPW method because we utilize the actual observation of the outcome variable, while the IPW method only uses the strata proportion.

The rest of this article is organized as follows. In Section 2, we describe the 2-stage ODS design and the PLM. The penalized likelihood is proposed. The inference method and main asymptotic results of the proposed estimator will be given in Section 3. In Section 4, we present simulation studies to investigate the performance of the proposed method. In Section 5, we illustrate our proposed method with the analysis of the CPP data set. We conclude with a brief discussion in Section 6.

Let *Y* denote a continuous outcome variable. Assume that the domain of *Y* is a union of *K* mutually exclusive intervals: with *c*_{k} being some known constants such that Thus, the collection of intervals {*C*_{k},*k* = 1,…,*K*} partition the study population into *K* strata. We consider a 2-stage ODS design as follows. In the first stage, an SRS sample is drawn from the underlying population of interests with sample size *N*. It is assumed that the outcome variable for the *N* individuals is observed. In the second stage, an ODS sample is drawn from the *N* individuals, which consists of an overall SRS of size *n*_{0} and stratified random samples from these *K* strata (Supplemental samples) with size *n*_{k} for the supplemental sample from the *k*th stratum. Both the outcome and the covariates will be observed for the individuals who are selected in the second stage, that is, the ODS sample, whereas the covariates will not be observed for those not selected in the second stage.

To fix notations, let be the total size of the second-stage sample for which we observe both the outcome and covariates, and let be the number of individuals in the population for whom only the outcome variable is observed but the covariates are not observed. We refer to the *n*_{V} complete observations as the second-stage sample and the incomplete observations as the incomplete first-stage sample. Let *V* denote the index set of all observations in the second-stage sample and denote the index set of all observations in the incomplete first-stage sample. Furthermore, let the *V*_{k} and denote the index sets for the observations in the second-stage sample and incomplete first-stage sample from the *k*th stratum.

We denote *X* as the covariates, then the data structure can be summarized as

- Stage 1: {
*Y*_{i}}, for*i*= 1,…,*N*; - Stage 2: SRS sample: {
*Y*_{i},*X*_{i}}, for*i*= 1,…,*n*_{0}; - Supplemental sample: {
*Y*_{i},*X*_{i}|*Y*_{i}*C*_{k},*k*= 1,…,*K*}, for*i*=*n*_{0}+ 1,…,*n*_{V}.

We assume that the conditional mean of the outcome is related to covariates *X* = (*W*^{T},*Z*^{T})^{T} as

(2.1)

where *g*(·) is an unknown nonparametric function of the exposure variable *W* and *γ* is a vector of *p*-dimensional regression coefficients. Our goal in this article is to inference *g*(·) and *γ* in model (2.1).

To further introduce additional notations: let *F*_{X} denote the distribution function of *X*, *f*(*Y*|*X*;*θ*) and *f*_{Y}(*Y*|*θ*) denote the conditional and marginal density functions of *Y*.

Several nonparametric smoothing methods can be adopted to estimate the nonparametric function *g*(·). An incomplete list of publications include: Lin and Carroll (2001), Yu and Ruppert (2002), Huang *and others* (2007), and Zhu *and others* (2008). As in Yu and Ruppert (2002) who studied the estimation of partially linear single-index model, we adopt penalized splines (P-spline) to estimate the nonparametric function *g*(·). P-spline is an extension of smoothing spline allowing more flexible choice of knots and penalty. The penalty is used to achieve a smooth fit of the nonparametric function. Following Yu and Ruppert (2002), under the working assumption that *g*(·) is a *r*th degree spline function with *T* fixed knots *t*_{1},…,*t*_{T} , we then have *g*(*w*) = *π*^{T}(*w*)*α*, where is a *r*-degree truncated power spline basis with knots and *α* is a *r* + *T* + 1-dimensional vector. Thus, we have

(2.2)

where *D*_{i} = (*π*^{T}(*W*_{i}),*Z*_{i}^{T})^{T} and *θ* = (*α*^{T},*γ*^{T})^{T}.

For the second-stage sample, the likelihood is

(2.3)

where

The likelihood for the incomplete first-stage sample is

(2.4)

Finally, note that the stratum size for the incomplete first-stage sample follows a multinomial law such that

(2.5)

where *n*_{0,k} is a random variable representing the number of observation in the SRS that belong to the *k*th stratum.

As noted by Weaver and Zhou (2005), by combining (2.3–2.5), the full information likelihood of both the first- and the second-stage samples is shown proportional to

(2.6)

To achieve a smooth fit, we introduce a penalized term into the log-likelihood, and our penalized log-likelihood is expressed as follows:

(2.7)

where *θ*^{T}*ψ** ^{T}* is a common quadratic penalty function and

The penalized log-likelihood (2.7) involves an unspecified marginal distribution function of *x*, *F*_{X}(*x*). Inference of *θ* will depend on how one handles *F*_{X}(*x*). A commonly used idea is to replace the *F*_{X}(*x*) in the likelihood with a consistent estimator and this will lead to an estimated likelihood. Such idea has been widely used in the statistical literature, for example, Pepe and Fleming (1991), Reilly and Pepe (1995), Reilly and Pepe (1997), Lawless *and others* (1999), Zhou and Pepe (1995), Zhou and Wang (2000), and Weaver and Zhou (2005).

We adopt the following estimator, which is proposed by Weaver and Zhou (2005) to specially accommodate the ODS sampling nature in the second stage, as our estimator of *F*_{X}(*x*):

(3.1)

where

(3.2)

is the empirical cumulative distribution function for the covariates based on all sampled observations from the *k*th stratum. Note that the *X*_{i} is a vector representing the covariates {*W*_{i},*Z*_{i}}. The value of *I*{*X*_{i} ≤ *x*} in (3.2) is equal to 1 if each component of *X*_{i} is less than or equal to the corresponding component of *x*, otherwise *I*{*X*_{i} ≤ *x*} = 0.

Using (3.1), we can obtain a consistent estimate of the marginal distribution of *Y* as

(3.3)

assuming that *n*_{k} + *n*_{0,k} > 0 for all *k* = 1,…,*K*.

Substituting (3.3) into (2.7), we obtain the following estimated penalized log-likelihood function for *θ*:

(3.4)

We define the maximum estimated penalized likelihood estimator to be the maximizer for the estimated penalized likelihood functions, and it can be obtained by invoking the Newton–Raphson procedure.

Under some regularity conditions, the asymptotic property of is summarized in the following theorem.

THEOREM 3.1

(i) If the smoothing parameter *λ* = o(1), converges to *θ*_{0} with probability one. (ii) If the smoothing parameter , the maximizer is asymptotically distributed as a normal distribution , where

with ,

with *E*_{k} denote expectation conditional on *Y**C*_{k}, and

with

All the quantities involved are evaluated at the true parameter value *θ*_{0}.

The regularity conditions and a brief proof for Theorem 3.1 is provided in Section 1 of the supplementary material available at *Biostatistics* online. A consistent estimator of the covariance matrix Σ is

(3.5)

where .

REMARK 3.2

One can make inference on the nonparametric function and the regression coefficients using Theorem 3.1. In particular, one can construct joint confidence region and test hypotheses. For example, if we want to test the null hypothesis *H*_{0}:*B**θ*_{0} − *s*_{0} = 0 where *B* is a *d*_{1}×*d*(*d* = *T* + *r* + 1 + *p*) matrix with full rank *d*_{1} ≤ *d*, then the test can be implemented by using the Wald test, with the test statistic , that has an asymptotic chi-square distribution with *d*_{1} degree of freedom.

REMARK 3.3

Particular interest of the Wald test is that one can use it to test if the nonparametric function describing the relation between an exposure variable *W* and a response *Y* is linear, that is, whether a linear model is enough to model the relation between *Y* and *W*. To do so, we repress the *T* + *r* + 1-dimensional vector *α* as (*α*_{1}^{T},*α*_{2}^{T}), where *α*_{1} = (*α*_{11},*α*_{12})^{T} is a 2D vector and *α*_{2} is a *T* + *r* − 1-dimensional vector. We are interested in testing the following null hypothesis: *α*_{2} = *α*_{2}^{0} = (0,…,0)^{T}. Under this null hypothesis, we have *g*(*W*) = *α*_{11} + *α*_{12}*W*, that is, the exposure variable *W* is related to response *Y* in a linear fashion.

In this section, we use simulated data to evaluate the performance of our proposed estimator. We compare the proposed estimator, denoted by the P estimator, with 3 competing methods: the semiparametric empirical likelihood estimator incorporating the P-spline method based on the second-stage sample denoted by the SEPLE-ODS estimator, the penalized inverse-probability weighted estimator denoted by the IPW estimator,which is the extension of the Horvitz and Thompson (1952) and defined as the maximizer of the following penalized weighted likelihood function:

where and the penalized maximum likelihood method based on the SRS sample with the sample size equal to the second-stage sample, denoted by the PMLE-SRS estimator. For all simulations, we generate 1000 simulated data sets. We consider 2 PLMs with different nonparametric functions. One is a monotonic function, while the other is a unimode function in threshold regions. We adopt a 3-degree truncated power spline basis and choose 10 fixed knots selected as the sample quantiles.

*Study 1.* The data were generated according to the following partial linear mixed model,

where *g*_{1}(*W*) = 3Φ(3.2*W*) is a standard normal distribution function, *W* denotes a continuous exposure variable of interest and *e*~*N*(0,*σ*_{0}^{2}). We assumed that *W*~*N*(0,0.25^{2}) and *Z*~*N*(0,0.3^{2}). We fixed *γ* = 1 and *σ*_{0}^{2} = 0.4. In all the simulation presented here, we allowed *Y* to be observed for the entire study population but assumed that both *W* and *Z* were observed only for the second-stage sample. The size of the first-stage sample was set to be *N* = 2000, and the second-stage sample fraction was set to *ρ* = 0.20, that is, the size of the second-stage sample is 400 = 2000×0.20.

The second-stage sample consists of an SRS sample (*n*_{0}) supplemented with additional samples from individuals with *Y* values in the upper and lower tails of the marginal distribution (i.e., *n*_{1} = *n*_{3} and *n*_{2} = 0) with cut-points *μ*_{Y}±*a**σ*_{Y}, where *μ*_{Y} and *σ*_{Y} represent the mean and standard deviation of *Y* and *a* is taken to be 0.7 and 1.0, respectively. We considered 2 cases of the allocation of the sample sizes as *n*_{0} = 300,*n*_{1} = *n*_{3} = 50, and *n*_{0} = 200,*n*_{1} = *n*_{3} = 100.

We computed the averages of the mean square error (MSE), the absolute value of the bias and the Monte Carlo variance of the estimated nonparametric function *g*_{1}(*x*) over 401 equal spaced grid points in [ − 0.75,0.75] (the mean of *X* minus and plus 3 times the standard deviation of *X*) over 1000 replications. The relative efficiency (REF) of the P estimator of nonparametric function over the the other 2 estimators is also calculated. The REF is defined by where M denote P, SEPLE-ODS, IPW, and PMLE-SRS estimators. Moreover, we calculated mean, Monte Carlo standard error, estimated standard error using large sample properties and coverage probability of 95% nominal confidence interval for the estimator of regression coefficient *γ*.

The results are summarized in Table 1. From Table 1, we can find that the proposed estimator of nonparametric function is most efficient with the smallest average mean square error (AMSE over 401 grid points) among all the estimators compared. For the estimation of the regression coefficient *γ*, the proposed estimators are generally more efficient than the other 2 estimators with smaller variance. It was also found that the nominal 95% confidence intervals based on the proposed standard errors for the regression coefficient *γ* provide good coverage. The proposed estimator is more efficient than the SEPLE-ODS estimator, which indicates that efficiency gain can be achieved by incorporating actual observation of the outcome from the first stage. Moreover, the SEPLE-ODS estimator is more efficient than the PMLE-SRS one, which means that the ODS design can achieve smaller standard errors of the interested estimates using the same sample size as the SRS design. Therefore, the ODS design is a cost-effective way to improve the study efficiency.

Figure 1 presents curves of the true function and the average P-spline estimates of *g*_{1} by P, SEPLE-ODS, IPW, and PMLE-SRS estimators over 1000 simulation and gives the confidence bands obtained by the P, SEPLE-ODS, IPW, and PMLE-SRS estimators for comparison. The confidence bands are based on a normal approximation using the Monte Carlo standard error and the cut-point equal to 0.7 and 1.0. In Figure 1, we can find that the confidence bands by the SEPLE-ODS, IPW, and PMLE-SRS estimators are obviously wider than those by the P estimator, specially obvious in the tail of the distribution of *X*. The reason is that the actual observation of the outcome from the first stage can provide more useful information in efficiency improvement.

Confidence band comparison in Study 1. The plots from the first and second lines correspond to cut point *a* = 0.7 and 1.0, respectively. In each plot, the thicker dotted curve in the middle is the true function. The solid, dashed, dot-dashed, and dotted **...**

In addition, it appears that there exists some finite-sample bias (see Figure 1). We conducted additional simulation with larger sample size and found that the bias will be reduced when the sample size is increased. The additional simulation suggests that while the method delivers consistent estimates (as *n* approaches infinity), there is a potential for appreciable finite-sample bias. The additional simulation results are presented in Section 4 of the supplementary material available at *Biostatistics* online.

*Study 2.* The data were generated according to the following PLM,

where *g*_{2}(*W*) = 1.5sin(1.5*W*), *W* denotes a continuous exposure variable of interest. *e*~*N*(0,*σ*_{0}^{2}). We assumed that *W*~*N*(1,0.5^{2}) and *Z*~*N*(0,0.3^{2}). Then we fixed *γ* and *σ*_{0}^{2} the same value as those in Study 1.

The averages of the MSE, the absolute value of the bias and the Monte Carlo variance of the estimated nonparametric function *g*_{2}(*x*) were also calculated over 401 equal spaced grid points in [ − 0.5,2.5] (the mean of *X* minus and plus 3 times the standard deviation of *X*) over 1000 replications. We also plotted the average P-spline estimate of *g*_{2} by P, SEPLE-ODS, IPW, and PMLE-SRS estimators over 1000 replicates, and the confidence bands obtained by these estimators. The corresponding results including the table and figure are presented in Section 3 of the supplementary material available at *Biostatistics* online, and the conclusions are similar to Study 1.

The proposed method is applied to CPP data set to assess the relationship between maternal pregnancy serum level of PCBs and children's subsequent IQ test performance. We use the Weschler Intelligence Scales for children at 7 years of age (IQ) as outcome variable and the PCBs level as the exposure variable. We are mainly interested in the effect of PCB on IQ measurement. In addition to PCB, other covariates include socioeconomic status of the child's family (SES), the gender (SEX) and race (RACE) of the child, and the parents eduction (EDU).

There are 38709 subjects consisting of the first-stage sample with completely observed variables IQ, SES, SEX, RACE, and EDU. In the second stage, an ODS design was conducted based on the first-stage sample with the sample size of 1038. The samples obtained by the ODS design are the completely observed samples with additional measured exposure variable of interest PCB. The second-stage sample consists of an SRS of 849 subjects and 2 supplemental subgroups which are defined by children's IQ scores that are one standard deviation (14) above and below the mean (96) of the population IQ scores, with 81 subjects in the low-IQ group and 108 subjects in the high-IQ group. Let *Y* denote IQ, *W* denote PCB, and *Z* denote (SES, SEX, RACE, and EDU). Then the CPP data structure can be summarized as

- Stage 1: {
*Y*_{i},*Z*_{i}},*N*= 38709; - Stage 2: SRS sample: {
*Y*_{i},*X*_{i},*Z*_{i}},*n*_{0}= 849; - Supplemental sample:
- 1) {
*Y*_{i},*X*_{i},*Z*_{i}|*Y*_{i}< 96.06 − 14.29},*n*_{1}= 81; - 2)
*n*_{2}= 0; - 3) {
*Y*_{i},*X*_{i},*Z*_{i}|*Y*_{i}> 96.06 + 14.29},*n*_{3}= 108.

A statistical description for the study variables is presented in Table 2.

Zhou *and others* (2002) analyzed the validation sample of the CPP data in the framework of linear model. We consider a PLM using a nonparametric function to describe the relation between PCB and IQ as

where *e* is a normal error with zero mean. To estimate the nonparametric function *g*(.), we here adopt a 2-degree truncated power function basis with 10 fixed knots. Then the above model can be rewritten as IQ = *π*^{T}(*P**C**B*)*α* + *β*_{1}*E**D**U* + *β*_{2}*S**E**S* + *β*_{3}*R**A**C**E* + *β*_{4}*S**E**X* + *e*, where *α* = (*α*_{1},…,*α*_{13})^{T}. We first made the lack of fit test for the considered PLM using the SRS sample and found that the *P* value of the test is 0.29 which indicates that the PLM is suitable. Then we adopt our proposed method to fit this model. The estimates of the regression coefficient is given in Table 3, and the estimate of nonparametric function is presented in Figure 2.

The estimated function *g* on PCB for CPP data. Thicker solid, dashed, and dot-dashed curves correspond to estimates obtained by the Proposed, SEPLE-ODS, and IPW methods, respectively. And the solid dashed and dot-dashed curves correspond to estimated confidence **...**

We further conduct the proposed the Wald test to test whether the nonparametric function is a linear function or a quadratic function as follows:

*Test 1*: *H*_{0}:*α*_{3} = *α*_{4} = = *α*_{13} = 0 vs *H*_{1}: at least an *α*_{i}≠0 for some *i* ≥ 3.

The *H*_{0} can be also written as *g*(PCB) = *α*_{1} + *α*_{2}PCB. The Wald test statistic *W* = 30.85 > *χ*_{0.05}^{2}(11) = 19.68, *P*-value = 0.0012. Therefore, the linear relation (*H*_{0}) is rejected at the 5% level of significance.

*Test 2*: *H*_{0}:*α*_{4} = *α*_{5} = = *α*_{13} = 0 vs *H*_{1}: at least one *α*_{i}≠0 for some *i* ≥ 4.

The *H*_{0} can be also written as *g*(PCB) = *α*_{1} + *α*_{2}*P**C**B* + *α*_{3}*P**C**B*^{2}. The Wald test statistic *W* = 21.83 > *χ*_{0.05}^{2}(10) = 18.31, *P*-value = 0.0160. Therefore, the quadratic relation (*H*_{0}) is rejected at the 5% level of significance, and we can conclude that a simple quadratic function is not enough to describe the relation between the IQ and the PCB.

The estimated curve in Figure 2 detects the nonlinear relation between the IQ score and the PCB level. At first, the IQ score increases with the increased PCB level but when the PCB level is higher than 7.32(according to the nonparametric estimate by the proposed estimator), the IQ score begins to decrease. In addition, it can be found that the proposed estimator (P estimator) provides more precise confidence bands than those obtained by the SEPLE-ODS and IPW estimators.

While the results in Figure 3 may suggest a positive relation of IQ and PCB intake in lower PCB level. We caution not to overinterpret this result as this is likely due to that the effect of low level of PCBs may reflect beneficial aspects of lifestyle not captured by other covariates. For example, fish intake during pregnancy is related to both higher IQ in offspring (Daniels *and others*, 2004) and to higher levels of serum PCBs (Halldorsson *and others*, 2008). Thus, finding the positive slope at the lower range of exposure alerts us to the possibility of confounding of the PCB coefficient.

From Table 3, the P, SEPLE-ODS, and IPW estimates are similar and the variance estimates for the proposed P estimators are the smallest. All the estimates confirm that the socioeconomic status and education of the parents have a positive impact on the IQ of the children, while there is no evidence that the gender has any effect on the IQ.

In this article, we considered a PLM in the setting of a 2-stage ODS design with a continuous outcome. Compared with Zhou *and others* (2002) in which only the ODS sample (the second-stage sample in this article) is used to make inference, the first-stage sample is incorporated into the proposed method for efficiency improvement. The inference of the PLM is different from the usual method because the ODS is a biased sampling scheme. We proposed an estimated penalized likelihood method to achieve the inference of PLM. Our simulation results demonstrate the efficiency improvement of the proposed method over some alternative methods can be used in these situations, such as the SEPLE-DOS method and the traditional PMLE-SRS method.

The IPW method is built on the completely observed data (those with (*Y*_{i},*X*_{i}) observed, see formula (1.1)). For those observation with missing *X*, they are reflected in the IPW method through a weight that is proportional to the missing proportion. The real value of *Y* for those with missing *X* is not used, while our proposed method utilizes the actual observation of *Y* whatever its corresponding *X* is observed or not. Therefore, the proposed method performs better than the IPW method as more information of the data is used. Our method do rely on the specification of *f*(*Y*|*X*) to be known. Choosing the error distribution as normal is a common practice in linear model analysis. Generally speaking, IPW method and empirical likelihood (EL) method do not require *f*(*Y*|*X*) to be known but only need to specify the conditional expectation *E*(*Y*|*X*). However, the IPW estimator compared in our paper relies on the specification of *f*(*Y*|*X*) because it is really a weighted likelihood estimator and not the usual semiparametric IPW estimator. Likewise, the semiparametric EL estimator compared in our paper requires *f*(*Y*|*X*) to be known because it is also based on the likelihood and the semiparametric EL inference procedure is used to deal with the unknown distribution of the covariates involved in the likelihood, which is treated as a nuisance parameter. The efficiency gain of the proposed method over SEPLE-ODS is due to the inclusion of individuals in the nonvalidation set (those with missing *X*) in the inference.

All the covariates were completely unobserved for the individuals of the incomplete first-stage sample through the article. However, in practice, some important covariates are easy and cheap to measure, so they can be observed for every individual of the population. How to incorporate the information into the inference is interesting. More details about this issue are referred to Weaver and Zhou (2005).

Some issues on the design of ODS such as the optimal size for the second-stage ODS sample, the optimal allocation of the cut points, and the optimal allocation of the second-stage ODS sample across different strata are deserved further investigation in the future study. Moreover, there are several interesting topics on the PLM for the ODS design that deserve further study in the future. For example, how to apply the doubly robust estimation method to our case and how to conduct the lack of fit test for the PLM in the ODS design.

National Institute of Health (CA79949 to H.Z., G.Q.); National Natural Science Foundation of China (10801039 to G.Q.).

Supplementary material is available at http://www.biostatistics.oxfordjournals.org.

We thank the editor and the associate editor for their constructive suggestions that largely improved the presentation of this paper. *Conflict of Interest:* None declared.

- Breslow N, McNeney B, Wellner JA. Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. The Annals of Statistics. 2003;31:1110–1139.
- Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
- Carroll RJ, Wand MP. Semiparametric estimation in logistic measurement error models. Journal of the American Statistical Association. 1991;98:158–168.
- Chatterjee N, Chen YH, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal of Royal Statistics Society, Series B. 2003;53:573–585.
- Daniels JL, Longnecker MP, Rowland AS, Golding J. ALSPAC Study Team. University of Bristol Institute of Child Health. Fish intake during pregnancy and early cognitive development of offspring. Epidemiology. 2004;15:394–402. [PubMed]
- Gray KA, Klebanoff MA, Brock JW, Zhou H, Darden R, Needham L, Longnecker MP. In utreo exposure to background levels of polychlorinated biphenyls and cognitive functioning among school-age children. American Journal of Epidemiology. 2005;162:17–26. [PubMed]
- Halldorsson TI, Thorsdottir I, Meltzer HM, Nielsen F, Olsen SF. Linking exposure to polychlorinated biphenyls with fatty fish consumption and reduced fetal growth among Danish pregnant women: a cause for concern? American Journal of Epidemiology. 2008;168:958–965. [PubMed]
- He X, Zhu ZY, Fung WK. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika. 2002;89:579–590.
- Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
- Huang JZ, Zhang L, Zhou L. Efficient estimation in marginal partially linear models for longitudinal/clustered data using spline. Scandinavian Journal of Statistics. 2007;34:451–477.
- Langholz B, Borgan Ø. Counter-matching: a stratified nested case-control sampling method. Biometrika. 1995;82:69–79.
- Lawless JF, Kalbfleisch JD, Wild CJ. Semiparametric methods for response-selective and missing-data problems in regression. Journal of Royal Statistics Society, Series B. 1999;61:413–438.
- Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001;96:1045–1056.
- Niswander KR, Gordon M. The Women and Their Pregnancies. U.S. Department of Health, Education, and Welfare Publication (NIH) Washington, DC: Government Printing Office; 1972. pp. 73–379.
- Pepe MS, Fleming TR. A nonparametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
- Reilly M, Pepe MS. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995;82:299–314.
- Reilly M, Pepe MS. The relationship between hot-deck multiple imputation and weighted likelihood. Statistics in Medicine. 1997;16:5–19. [PubMed]
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. New York: Cambridge University Press; 2003.
- Weaver MA. Semiparametric Methods for Continuous Outcome Regression Models with Covariate Data from an Outcome-Dependent Subsample, [Doctoral Dissertation] Chapel Hill, NC: University of North Carolina; 2001.
- Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.
- White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. [PubMed]
- Yu Y, Ruppert D. Penalized spline estimation for partially linear single index models. Journal of the American Statistical Association. 2002;97:1042–1054.
- Zeger SL, Diggle PJ. Semiparametric model for longitudinal data with application to CD4 cell numbers in HIV seroconvertiers. Biometrics. 1994;50:689–699. [PubMed]
- Zhao LP, Lipsitz S. Designs and analysis of two-stage studies. Statistics in Medicine. 1992;11:769–782. [PubMed]
- Zhou H, Pepe MS. Auxiliary covariate data in failure time regression. Biometrika. 1995;82:139–149.
- Zhou H, Wang CY. Failure time regression analysis with measurement error in covariates. Journal of Royal Statistics Society, Series B. 2000;62:657–665.
- Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421. [PubMed]
- Zhu ZY, Fung WK, He X. On the asymptotics of marginal regression splines with longitudinal data. Biometrika. 2008;95:907–917.

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |