Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3182522

Formats

Article sections

- Summary
- 1. Introduction
- 2. Models and Likelihood Function for PLM with ODS Data
- 3. A Semiparametric Empirical Likelihood Approach
- 4. Simulation Studies
- 5. Analysis of the Collaborative Perinatal Project Data Set
- 6. Discussion
- Supplementary Material
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2012 September 1.

Published in final edited form as:

Published online 2010 October 29. doi: 10.1111/j.1541-0420.2010.01500.x

PMCID: PMC3182522

NIHMSID: NIHMS238903

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

Outcome-dependent sampling (ODS) has been widely used in biomedical studies because it is a cost effective way to improve study efficiency. However, in the setting of a continuous outcome, the representation of the exposure variable has been limited to the framework of linear models, due to the challenge in terms of both theory and computation. Partial linear models (PLM) are a powerful inference tool to nonparametrically model the relation between an outcome and the exposure variable. In this article, we consider a case study of a partial linear model for data from an ODS design. We propose a semiparametric maximum likelihood method to make inferences with a PLM. We develop the asymptotic properties and conduct simulation studies to show that the proposed ODS estimator can produce a more efficient estimate than that from a traditional simple random sampling design with the same sample size. Using this newly developed method, we were able to explore an open question in epidemiology: whether *in utero* exposure to background levels of PCBs is associated with children’s intellectual impairment. Our model provides further insights into the relation between low-level PCB exposure and children’s cognitive function. The results shed new light on a body of inconsistent epidemiologic findings.

Polychlorinated biphenyls (PCBs) are ubiquitous, persistent environmental pollutants. PCBs were once made in large quantities for use in capacitors, transformers, carbonless copy paper, and other applications (Dobson and van Esch, 1993). Although production has been banned worldwide due to persistence and toxicity, low-level human exposure to these ubiquitous environment contaminants continues via diet (Syracuse Research, 2000). Early life exposure has been associated with cognitive deficits among children (Schantz et al., 2003). Exposure to PCBs in utero has been implicated as neurotoxic more often than has subsequent exposure (Patandin et al., 1999). *In vitro* and *in vivo* models demonstrate that PCBs cause altered synaptic transmission in the central nervous system, reduction in the striatum dopamine concentration, and decreased thyroxine levels (e.g., Gilbert and Liang, 1998). Although it has been demonstrated in rodent and primate studies that perinatal PCB exposure is usually associated with impairment of learning and cognition, human observational studies of perinatal exposure to background level PCBs in relation to cognitive functioning in children, on the other hand, have given inconsistent results. For example, a Michigan study of children followed from birth found that prenatal PCB exposure was associated with lower performance on the verbal and memory scales for children at 4 years of age and with lower full-scale and verbal intelligence quotient (IQ) scores at age 11 years (Jacobson and Jacobson, 1996). In a North Carolina birth cohort (Rogan and Gladen, 1991), however, performance on similar tests at ages 3, 4, and 5 years was unrelated to PCB exposure. These and other studies (Patandin et al., 1999; Schantz et al., 2003) suggest that in order to improve assessment of risks associated with PCB exposure, a better understanding of the relation between low-level PCB exposure and performance on examinations of cognitive function is needed. A more flexible and powerful statistical approach could play a critical role in understanding the risks associated with PCB exposure.

Our research was motivated by addressing the above scientific question through taking advantage of a large study conducted by the NIH and reanalyzing the data of Gray et al. (2005), where the maternal third trimester serum PCB levels were obtained in an outcome-dependent way for a subgroup of children born in the Collaborative Perinatal Project (CPP). The CPP study was a prospective study designed to identify determinants of neurodevelopmental deficits in children. Pregnant women were recruited from 1959 to 1965 from 12 US study centers resulting about 55,908 pregnancies recruited into the study (Longnecker et al., 2001). Gray et al. (2005) employed the outcome-dependent sample design (ODS) design developed by Zhou et al. (2002) based on eligible children who met the following criteria: 1) they were liveborn, 2) they were singletons, and 3) a 3-ml third trimester maternal serum specimen was available. From the 43,628 eligible children, 1,256 subjects were selected at random. An additional 207 children were randomly selected from eligible children whose 7-year full scale IQ score was either one or more standard deviations below the mean or one or more standard deviations above the mean. The mother’s third trimester’s blood specimen for the subgroup of subjects selected in the above ODS fashion were then thawed and the PCB levels were assessed at the US Centers for Disease Control and Prevention laboratory in Atlanta.

In the remainder of this section we give brief overviews of the two statistical areas that play a key role in this paper: the ODS design and the partial linear model (PLM).

To understand the determinants of diseases in humans, epidemiologic observational studies are conducted to characterize the relation between individuals’ exposure and a disease outcome. As large cohort studies are usually very costly to conduct, small studies using the case-control design are often preferred, especially for rare diseases, because they can include equal number of individuals with a rare disease in a smaller overall study. The case-control design (Cornfield, 1951) is a binary special case of a general ODS design where investigators can observe the exposure or covariates with a probability that depends on an outcome variable where the outcome can be either continuous or discrete. The principle idea of ODS is to concentrate resources on the parts of the studied population having the greatest amount of information. By allowing selection probability of each individual in the ODS samples to depend on the outcome, the investigators can improve the efficiency and reduce the cost of the studies. Historically, the ODS design has been implemented in the binary case (White, 1982; Rathouz et al., 2002; Wang and Zhou, 2006; Haneuse and Wakefield, 2007; Schildcrout and Heagerty, 2008), while development of the ODS design with a continuous outcome has lagged behind until recently. For example, Zhou et al. (2002) considered a semiparametric empirical likelihood method for a continuous ODS design. Chatterjee et al. (2003) considered a pseudoscore estimation method and Weaver and Zhou (2005) proposed an estimated likelihood method for inference based on data from the two-stage ODS design. The gains in statistical power and the reduction in sample size for a given budget relative to a traditional simple random sample is shown in Zhou et al. (2007).

All models considered in the literature are limited to the framework of a linear model. The PLM for continuous outcomes (He et al., 2002; Wang et al., 2005), where the outcome is assumed to depend on some covariate *X* nonparametrically and some other covariates *Z* parametrically, is an important inference tool and has been widely applied in many fields. It would be a particular advantage in the Gray et al. (2005) study to have a more flexible PLM approach to investigate the relation of low level PCB exposure and children’s cognitive development. In this article, we consider the PLM for data from an ODS design. The functions offered by standard commercial software for the PLM were generally developed under the simple random sample setting, therefore can not be directly applied to our case. We propose a semiparametric maximum likelihood estimation method to achieve the inference of the PLM using penalized splines (P-spline).

The rest of this article is organized as follows. In section 2, we propose the PLM and derive the penalized likelihood for the data from an ODS design. In section 3, we propose a semiparametric empirical likelihood estimator and present its asymptotics. Some simulation studies are conducted in section 4 to investigate the performance of the proposed method, and in section 5 we show that with our new method, we identified some new findings in the data from the Gray et al. study that will shed some light on the ongoing debate on the relationship of PCB and children’s cognitive development.

Let *Y* denote a continuous outcome, *X* an exposure variable and *Z* a *p*-dimensional covariate vector. The notation *A ^{T}* denotes the transpose of the matrix or vector

$$E(Y\mid X,Z)=g\left(X\right)+{Z}^{T}\gamma ,$$

(1)

where *g*(·) is an unknown smooth function of *X* and *γ* is a *p*-dimensional vector of unknown regression coefficients. Model (1) is the PLM considered in our ODS design. The introduction of the *g*(·) is a key difference with the previous literature on the ODS design.

Various nonparametric smoothing methods can be applied to estimate the nonparametric function *g*(.) such as kernel and spline methods, e.g., Lin and Carroll (2001); Yu and Ruppert (2002); Huang et al. (2007). Particularly, Yu and Ruppert (2002) studied estimation of a partially linear single-index model by penalized splines (P-spline) to estimate the nonparametric function, which is also adopted in this article. P-spline is an extension of smoothing splines, allowing a more flexible choice of knots and penalty. Following Yu and Ruppert (2002), under the working assumption that *g*(.) is a *r*-degree spline function with *T* fixed knots *t*_{1}, …, *t _{T}*, we then have

We consider the Zhou et al. sampling scheme with two components (Zhou et al., 2002). First, an overall random sample of the population is taken. Second, one or more supplemental random samples are taken, in which the probability of selection depends on the level of the outcome variables. Assume that the domain of *Y* is the union of *K* non-overlapping intervals: *C _{k}* = (

Denote *F*(*y|x, z*; *θ*) and f(*y|x, z*; *θ*) as the conditional cumulative distribution function and density function for *Y* given *X* and *Z*. Define *ψ _{k}*(

$$\left\{\prod _{j=1}^{{n}_{0}}f({y}_{0j}\mid {x}_{0j},{z}_{0j};\theta ){f}_{X,Z}({x}_{0j},{z}_{0j})\right\}\left\{\prod _{k=1}^{K}\prod _{j=1}^{{n}_{k}}f({y}_{kj},{x}_{kj},{z}_{kj}\mid {y}_{kj}\in {C}_{k};\theta )\right\},$$

(2)

where *f*(*y _{kj}, x_{kj}, z_{kj}|y_{kj}*

$${L}_{n}\left(\theta \right)=\left\{\prod _{j=1}^{{n}_{0}}f({y}_{0j}\mid {x}_{0j},{z}_{0j};\theta ){f}_{X,Z}({x}_{0j},{z}_{0j})\right\}\times \prod _{k=1}^{K}\left\{\prod _{j=1}^{{n}_{k}}\frac{f({y}_{kj}\mid {x}_{kj},{z}_{kj};\theta ){f}_{X,Z}({x}_{kj},{z}_{kj})}{\int {\psi}_{k}(x,z,\theta )d{F}_{X,Z}(x,z)}\right\}.$$

Denote *π _{k}* = ∫

$${l}_{n}[\theta ,\{{p}_{kj}\},\{{\pi}_{k}\left\}\right]={l}_{1n}\left(\theta \right)+\sum _{k=0}^{K}\sum _{j=1}^{{n}_{k}}\mathit{log}{p}_{kj}-\sum _{k=1}^{K}{n}_{k}\mathit{log}{\pi}_{k},$$

(3)

where ${l}_{1n}\left(\theta \right)={\sum}_{k=0}^{K}{\sum}_{j=1}^{{n}_{k}}\mathit{log}f({y}_{kj}\mid {x}_{kj},{z}_{kj};\theta )$ is a function only involving *θ*.

As a P-spline is used to estimate the nonparametric function in model (1), incorporating the penalty into (3), we obtain the following penalized log-likelihood function,

$$p{l}_{n}[\theta ,\{{p}_{kj}\},\{{\pi}_{k}\};{q}_{s}]={l}_{1n}\left(\theta \right)+\sum _{k=0}^{K}\sum _{j=1}^{{n}_{k}}\mathit{log}{p}_{kj}-\sum _{k=1}^{K}{n}_{k}\mathit{log}{\pi}_{k}-\frac{1}{2}n{q}_{s}{\theta}^{T}\Psi \theta ,$$

(4)

where $\Psi =\mathrm{diag}\left\{{({0}_{(r+1)\times 1}^{T},{1}_{T\times 1}^{T},{0}_{p\times 1}^{T})}^{T}\right\}$ and *θ*^{T}Ψ*θ* is a common quadratic penalty function. Here *q _{s}* is the smoothing parameter.

To estimate *θ* through (4), methods for handling *f _{X,Z}*(

$$pp{l}_{n}(\theta ;{q}_{s})=\underset{{p}_{kj}}{\mathrm{sup}}\phantom{\rule{thinmathspace}{0ex}}p{l}_{n}[\theta ,\{{p}_{kj}\},\{{\pi}_{k}\};{q}_{s}\}].$$

(5)

To get (5), it suffices to maximize

$${l}_{np}\left[\right\{{p}_{kj}\left\}\right]=\sum _{k=0}^{K}\sum _{j=1}^{{n}_{k}}\mathrm{log}{p}_{kj}-\sum _{k=1}^{K}{n}_{k}\mathrm{log}{\pi}_{k},$$

(6)

for fixed [*θ*, {*π _{k}*}] subject to ${\sum}_{k=0}^{K}{\sum}_{j=1}^{{n}_{k}}{p}_{kj}\left\{{\psi}_{i}\right({x}_{kj},{z}_{kj};\theta )-{\pi}_{i}\}=0$,

$$H[\theta ,\{{p}_{kj}\},\{{\pi}_{i}\left\}\right]={l}_{np}\left[\right\{{p}_{kj}\left\}\right]+\rho (1-\sum _{k,j}{p}_{kj})+n\sum _{i=1}^{K-1}{\lambda}_{i}\sum _{k,j}{p}_{kj}\left\{{\psi}_{i}\right({x}_{kj},{z}_{kj};\theta )-{\pi}_{i}\},$$

where *ρ* and *λ′s* are Lagrange multipliers. Taking derivatives of *H* with respect to {*p _{kj}*} and solving the score equations, we can obtain that

Typically, the true value of the Lagrange multiplier is zero in unbiased sampling problems. However, due to the nature of the ODS sampling design, the Lagrange multipliers, *λ′s* are not centered around zero. To unify the notation, we center them by reparameterizing as ${v}_{k}={\lambda}_{k}-\frac{{n}_{k}}{n{\pi}_{k}}$, *k* = 1, …, *K* − 1. Denote = (*π ^{T}*,

$$\begin{array}{cc}\hfill pp{l}_{n}(\eta ;{q}_{s})& ={l}_{1n}\left(\theta \right)-\sum _{k=0}^{K}\sum _{j=1}^{{n}_{k}}\mathrm{log}\{1+{v}^{T}h({x}_{kj},{z}_{kj}\left)\right\}-\sum _{k=1}^{K}\sum _{j=1}^{{n}_{k}}\mathrm{log}\left\{Q\right({x}_{kj},{z}_{kj}\left)\right\}\hfill \\ \hfill & -\sum _{k=1}^{K}{n}_{k}\mathrm{log}{\pi}_{k}-\frac{1}{2}n{q}_{s}{\theta}^{T}\Psi \theta ,\hfill \end{array}$$

(7)

where *h*(*x _{kj}, z_{kj}*) = {

(1) Choose initial values *η*^{(0)} = (*θ*^{(0)T}, *π*^{(0)T}, *v*^{(0)T})^{T}. The initial value *θ*^{(0)} can be selected as the penalized maximum likelihood estimate based on the SRS sample from the ODS design, *π*^{(0)} can be chosen as the estimates of probability *P*(*y* *C _{k}*),

(2) Compute ${\eta}^{(j+1)}={\eta}^{\left(j\right)}-{\left\{\frac{{\partial}^{2}}{\partial \eta \partial {n}^{T}}pp{l}_{n}\right({\eta}^{\left(j\right)};{q}_{s}\left)\right\}}^{-1}\frac{\partial}{\partial \eta}pp{l}_{n}({\eta}^{\left(j\right)};{q}_{s})$. Set *j* = *j* + 1.

(3) Continue step 2 until convergence is achieved. Choose *η*^{(j+1)} as the final estimate.

The following theorem summarizes the asymptotic properties for the proposed estimators of the PLM from an ODS design.

Theorem 1: Under conditions A1-A4, (i) If the smoothing parameter q_{s} = o(1), $\widehat{\eta}$ converges to η_{0} with probability one. (ii) If the smoothing parameter ${q}_{s}=o(1\u2215\sqrt{n})$. The maximizer $\widehat{\eta}={({\widehat{\theta}}^{T},{\widehat{\pi}}^{T},{\widehat{v}}^{T})}^{T}$ is asymptotically distributed as a normal distribution as $\sqrt{n}(\widehat{\eta}-{\eta}_{0})\to N({0}_{d\times 1},\Sigma )$.

The conditions A1-A4 and a brief proof for Theorem 1 are given in the Supplementary Material (Web Appendix B). A consistent estimator $\widehat{\Sigma}$ of the covariance matrix Σ is ${\widehat{V}}^{-1}\widehat{U}{\widehat{V}}^{-1}$, where $\widehat{U}$ and $\widehat{V}$ are also defined in the Web Appendix. The conditions on smoothing parameter *q _{s}* in Theorem 1 are commonly assumed (Yu and Ruppert, 2002; Qu and Li, 2006). The generalized cross-validation (GCV) method, given in the Supplementary Material (Web Appendix A), is a very popular method to select a smoothing parameter and performs quite well in practice. However, it is a challenge to obtain the rate of the smoothing parameter chosen by the GCV method, which is an important theoretical problem and deserves further study. We note that the asymptotic properties developed in this paper are for fixed knots P-spline, which means that the number of the knots does not grow as sample size grows to infinity.

From Theorem 1, we can construct a joint confidence region and test hypotheses, e.g., with a Wald test. For example, if we want to test the null hypothesis *H*_{0} : *Bθ*_{0} − *s*_{0} = 0 where *B* is a *d*_{1} × {*d* − 2(*K* − 1)} matrix with full rank *d*_{1} ≤ {*d* − 2(*K* − 1)}, then the test can be based on the Wald test statistic $W={(B\widehat{\theta}-{s}_{0})}^{T}{\left(B{\widehat{\Sigma}}_{\theta}{B}^{T}\right)}^{-1}(B\widehat{\theta}-{s}_{0})$ with ${\widehat{\Sigma}}_{\theta}$ being the consistent estimate of the covariance matrix of $\widehat{\theta}$, which has an asymptotic chi-square distribution with *d*_{1} degrees of freedom.

For this PLM, it is of particular interest to test whether the nonparametric function describing the relation between exposure variable *X* and response *Y* is linear. We can use the previously mentioned Wald test to deal with this problem. To do so, we re-express the *T*+*r*+1-dimensional vector *α* as $({\alpha}_{1}^{T},{\alpha}_{2}^{T})$, where *α*_{1} = (*α*_{11}, *α*_{12})^{T} is a two-dimensional vector and *α*_{2} is a *T* + *r* − 1 dimensional vector. We are interested in testing the null hypothesis: ${\alpha}_{2}={\alpha}_{2}^{0}={(0,\dots ,0)}^{T}$. Under this null hypothesis, we have *g*(*X*) = *α*_{11} + *α*_{12}*X*, i.e. the exposure variable *X* is related to response *Y* in a linear fashion.

We conducted simulations to investigate the performance of our proposed estimator. For all simulations, we generated 1000 simulated datasets, each with 500 independent subjects. We consider two models with different nonparametric functions. The first one is a monotonic function while the second is a unimode function in threshold regions. In the simulation, we adopt a three-degree truncated power spline basis and choose ten fixed knots selected as the equally spaced sample quantiles. The computational aspects of the simulation studies are given in last paragraph of this section.

*Study 1*. The data were generated according to the following PLM,

$$Y=3\Phi \left(3.2X\right)+{Z}_{\gamma}+{e}_{0},$$

where Φ(.) is a standard normal distribution function, *X* denotes a continuous exposure variable of interest, ${e}_{0}\sim N(0,{\sigma}_{0}^{2})$. We assume that *X* ~ *N*(0, 0.25^{2}) and *Z* ~ *N*(0, 0.5^{2}). Then $f(y\mid x,z;\theta )\sim N\left({g}_{1}\right(x)+z\gamma ,{\sigma}_{0}^{2})$, where *g*_{1}(*x*) = 3Φ(3.2*x*). We take *γ* = 1, ${\sigma}_{0}^{2}=0.2$.

In this simulation, we investigated the effect of different cut points for creating the supplemental samples and the impact on the estimation efficiency through varying the allocation of the sample size between the SRS and supplemental samples. Similar to Zhou et al. (2002), our ODS design consists of a SRS sample of *n*_{0}, a supplemental sample of *n*_{1}+*n*_{3} from individuals with *Y* values in the two tails of the marginal distribution of *Y* (*n*_{1} = *n*_{3}, *n*_{2} = 0), defined by cut points *μ _{Y}* ±

We compared the proposed estimator based on the ODS sample including both the SRS sample and supplemental samples (P estimator) with two additional estimators: (i) a penalized maximum likelihood estimator (PMLE) that is based on the SRS samples only from the ODS sample (V estimator); and (ii) a PMLE (S estimator) that is based on a SRS sample but the same sample size as the ODS sample. The comparison against the V estimator will demonstrate the efficiency gain of the P estimator using more information, and the comparison against the S estimator will demonstrate the efficiency gain of the ODS design over the SRS design with the same sample size. We computed the averages of the mean square error(MSE), the absolute value of the bias and the Monte Carlo variance of the estimated nonparametric function *g*_{1}(*x*) over 501 equal spaced grid points in [−0.75, 0.75] (the mean of *X* minus and plus three times standard deviation of *X*) over 1000 replications. Moreover, we calculated mean, Monte Carlo standard error, and estimated standard error using large sample properties and coverage probability of the 95% nominal confidence interval for the estimator of regression coefficient *γ*.

For the sample size *n*_{0} = 300, *n*_{1} = *n*_{3} = 100, we summarize the results in Table 1. From Table 1, we find that the proposed estimator of the nonparametric function is most efficient with the smallest AMSE (average MSE over 501 grid points) among all the estimators compared. For the estimation of the regression coefficient *γ*, the proposed estimator is generally more efficient than the other two estimators with smaller variance when *a* is 1.0 and 1.3. It was also found that the nominal 95% confidence intervals based on the proposed standard errors for the regression coefficient *γ* provide good coverage. The proposed estimator is more efficient than the S estimator, which supports the notion that ODS design can be more efficient than the SRS design.

Figure 1 presents curves of the true function and the average P-spine estimates of *g*_{1} by the P, V and S estimators over 1000 simulations. The difference between the true function *g*_{1} and its estimates is barely visible, which shows good fit of the P-spline estimates. Figure 1 also gives the confidence bands obtained by the P estimator, V estimator and the S estimator for comparison. The confidence bands are based on a normal approximation using the Monte Carlo standard error with the sample size *n*_{0} = 300, *n*_{1} = *n*_{3} = 100 and *n*_{0} = 200, *n*_{1} = *n*_{3} = 150, and the cut point equal to 0.7 and 1.3. In Figure 1, we see that in the tail of the distribution of *X*, the confidence bands by the S estimator are wider than those by the P estimator, and this is more apparent when the cut point is further out or the proportion of the SRS samples in the ODS is decreased. This is because, compared with SRS design, more individuals in the tails of the distribution of *X* can be sampled in the ODS design, which leads to improvement of the estimation efficiency of the P estimator. In addition, the confidence bands by the V estimator are the widest ones and the bands become wider when the proportion of the SRS sample is decreased.

Simulation results in Study 1. Confidence band comparison with sample size allocation of [300, 100, 100] and [200, 150, 150]. The plots from the first and second lines correspond to cut point *a* = 0.7 and 1.3 respectively. In each plot, the dotted curve **...**

Under the same simulation model, we study the relative efficiency of the P estimator of nonparametric function over the S estimator for the proportion of the SRS sample in the ODS (*ρ* = *n*_{0}/*n*). Table 2 shows the relative efficiency $\left(\mathit{AMSE}\right({\widehat{g}}_{S})\u2215\mathit{AMSE}({\widehat{g}}_{P}\left)\right)$ comparison across *ρ* = *n*_{0}/*n* under *a* = 0.7, 1 and 1.3, where ${\widehat{g}}_{S}$ and ${\widehat{g}}_{P}$ denote the P and S estimators of *g* respectively. The total sample size is *n* and the sample size of *n*_{1} and *n*_{3} are determined by (1 − *ρ*)*n*/2. We can find that the proposed estimator is more efficient than the S estimator and the efficiency can generally be further improved when the proportion decreases and the cut point is selected to be further out.

*Study 2*. The data were generated according to the following PLM,

$$Y=\mathrm{sin}\left(1.5X\right)+{Z}_{\gamma}+{e}_{0},$$

where *X* denote a continuous exposure variable of interest, ${e}_{0}\sim N(0,{\sigma}_{0}^{2})$. We assume that *X* ~ *N*(1, 0.5^{2}) and *Z* ~ *N*(0, 0.5^{2}). Then $f(y\mid x,z;\theta )\sim N\left({g}_{2}\right(x)+z\gamma ,{\sigma}_{0}^{2})$, where *g*_{2}(*x*) = *sin*(1.5*x*). We take *γ* = 1, ${\sigma}_{0}^{2}=0.2$. The results are calculated over 501 equal spaced grid points in [−0.5, 2.5] (*X* ±3*SD _{X}*) over 1000 replications. The corresponding results are similar to Study 1, and similar conclusions to Study 1 can be drawn.

In the setting of our simulation, convergence is relatively fast and the number of iterations required for convergence is generally less than 25. Non-convergence may arise occasionally, e.g., when the sample size is small and the cut points are selected further out (a=1.3 in our case), but this problem can be avoided by increasing the sample size. At a sample size of 1000, we do not have any problem with the CPP data analysis. The codes for simulation and real data analysis were written in Matlab and can be downloaded from http://www.bios.unc.edu/~zhou/software/Zhou-Qin-Longnecker-Biometrics/.

As introduced in the beginning of the paper, we hope a more flexible partial linear approach in the ODS setting could provide additional insights into the relationship between the low dose maternal pregnancy serum level of polychlorinated biphenyls (PCBs) and children’s subsequent IQ test performance based on the CPP data set. Among the 1463 subjects selected by the ODS design (Gray et al., 2005), only 1038 subjects were observed during pregnancy and at age 7 years and had complete data on all covariates used in the analysis. The details of the data structure available to us are such that, in the data set, there is a simple random sample (SRS) of 849 subjects and two supplemental subsamples. The first supplemental sample is a simple random sample of the children in the CPP population whose IQ scores were at least one standard deviation (14) above the mean (96). The second subsample is a simple random sample from those whose IQ scores were at least one standard below the mean. In all, we have 81 subjects in the low IQ subsample and 108 subjects in the high IQ subsample.

The Weschler Intelligence Scale for children at 7 years of age (IQ) was the outcome variable and the prenatal PCB level, measured in *μ*g/liter, was the exposure variable. We built a partial linear model of these data using our proposed method. In addition to PCBs, other covariates include socioeconomic status of the child’s family (SES), the gender (SEX) and race (RACE) of the child and the mother’s eduction (EDU). The covariate EDU is the highest education level obtained by the mother at the child’s birth divided into 18 levels from the lowest level 1 (EDU=1) to the highest level (EDU=18). Here we introduced two dummy variables EDU1 and EDU2 to replace the original covariate EDU in our analysis to achieve a reasonable fit because the relationship between IQ and EDU is nonlinear. These two dummy variables are coded as: (EDU1,EDU2)=(0,0) corresponds to (EDU≤8) representing low education level; (EDU1,EDU2)=(1,0) corresponds to (9 ≤EDU ≤12) representing middle education level and (EDU1,EDU2)=(0,1) corresponds to (EDU≥13) representing high education level. The covariate SEX was coded 0 for males and 1 for females and the covariate RACE was coded 1 for Black and 0 for others. Table 3 describes the basic characteristics of the data in these three samples.

Zhou et al. (2002) reported no association between PCB exposure and the children’s IQ score. Gray et al. (2005), using a probability weighted approach, did not find either a linear or quadratic association. Here, we consider the following PLM to describe the relation between PCB and IQ:

$$IQ=g\left(\mathit{PCB}\right)+{\beta}_{1}\mathit{EDU}1+{\beta}_{2}\mathit{EDU}2+{\beta}_{3}\mathit{SES}+{\beta}_{4}\text{RACE}+{\beta}_{5}\text{SEX}+e,$$

where *e* is a normal error with zero mean. To estimate the nonparametric function *g*(·), we adopted a two-degree truncated power function basis $M\left(x\right)={(1,x,{x}^{2},{(x-{T}_{1})}^{2},\dots ,{(x-{T}_{10})}_{+}^{2})}^{T}$ with ten fixed knots *T*_{1}, …, *T*_{10} selected as the equally spaced sample quantiles of PCB level, i.e. 1.16, 1.70, 2.17, 2.61, 3.05, 3.52, 4.06, 4.63, 5.54, 6.88. Under these specifications, the above model can be rewritten as *IQ* = *M ^{T}*(

Analysis results of CPP data. The estimated function *g* on PCB. Thicker solid and dashed curves correspond to estimates obtained by the P and V methods, and the thicker dot-dashed line corresponds to the linear fit using Zhou et al. (2002) method. And **...**

From Figure 2, it can be seen that the IQ is related to PCB nonlinearly. The detection of this nonlinear relationship is due to the use of P-splines and has nothing to do with the sampling design. To formally testing this nonlinear effect, we conduct the proposed Wald test to test the hypothesis: *H*_{0} : *α*_{3} = *α*_{4} = … = *α*_{13} = 0 which is equivalent to testing *H*_{0} : *g*(*PCB*) = *α*_{1} + *α*_{2}*PCB*. The *H*_{0} is rejected at the 5% significance level as the Wald statistic $W=26.37>{\chi}_{0.05}^{2}\left(11\right)=19.68$. The corresponding p-value is 0.006.

Next, we test whether the *g*(·) is a simple quadratic function. The corresponding null hypothesis is *H*_{0} : *α*_{4} = *α*_{5} = … = *α*_{13} = 0. This null is rejected at 5% level with a p-value at 0.029 based on the *χ*^{2} distribution with 10 degrees of freedom. This result indicates that although the relationship in Figure 2 is unimodal, a simple quadratic function is not enough to describe the relation between the IQ score and the PCB level. Apart from the difference in the fitted PCB effect, the PLM model (*β _{P}* ) and the linear model (

The fitted relationship in Figure 2 suggests that the level of PCB exposure was positively related to the children’s IQ performance within the lower range of exposure, which is possibly due to residual confounding, by either socioeconomic status Gray et al. (2005) or other factors not captured by the existing covariates. This relation began to diminish as the level of PCB exposure moved closer to the mean PCB level and then it became detrimental in the higher range of PCB exposure. The measured PCB level in most of the studies, the CPP study included, was considered background level exposure, which is low compared with the PCB exposure levels in the Yucheng poisoning incident, in which the lower IQ among children exposed at higher levels in utero was clear Rogan et al. (1988). With a large data set and a more powerful and flexible modeling approach, we were able to obtain further insight on the relationship of children’s IQ and prenatal PCB exposure.

The contribution of this paper is twofold. First we introduced a PLM for ODS data and developed a semiparametric inference procedure that is based on a penalized likelihood function. We established the asymptotic properties of the proposed method and most importantly through simulation, we showed that the proposed ODS design with PLM will yield more efficient estimates than the design using a SRS with the same sample size. The advantages of the ODS design is clearly demonstrated in this paper and in practice; this advantage will translate into more cost-effective studies, or, for a fixed budget, more powerful studies.

Second, the application of the proposed method to the analysis of *in utero* exposure to background levels of polychlorinated biphenyls (PCBs) with children’s intellectual development measured by IQ at 7 years of age showed some interesting results. Epidemiologic studies on *in utero* exposure to PCBs have been controversial and inconclusive with some studies suggesting that PCBs cause intellectual impairment among children and others either not supporting this or suggesting opposite effects. We thought the reasons for this phenomenon were due to the following: a) the sample sizes in these studies were usually small (200 to 400); b) the PCB exposure level in Western countries are all background level and thus primarily within the lower range; and c) the statistical approaches were relatively simple and may not be the most powerful.

We had an unique opportunity in that 1) we had a large sample size compared with those published, 2) the study was designed using an ODS strategy such that it is more informative than a usual study with the same sample size; 3) we have developed a flexible yet efficient analytical tool (PLM) to take full advantage of the data from the ODS structure. Indeed, our analysis of the CPP data reveals something that has not been reported before. The nonlinear curvature of the fitted model helps to explain why in the previous smaller studies some find an effect while others do not.

Our simulation results demonstrate the efficiency improvement of the ODS design over some alternative methods that can be used in these situations such as the traditional SRS design with the same sample size. Generally, the efficiency gain from the ODS design is higher with a low proportion of subjects in the SRS sample and high concentration of sampling at the extreme tails of the outcome. In practice, the choice of *ρ* and cut points may be affected by other considerations such as study purpose. Some detailed discussion about this is in Zhou et al. (2002).

We thank the Editor, the Associate Editor and two referees for their constructive suggestions that largely improved the presentation of this paper. This work is supported by National Institute of Health grants CA 79949 (Zhou and Qin), the intramural research program of the NIH, National Institute of Environmental Health Sciences (Longnecker) and the National Nature Science Fund of China grants of China 10801039 (Qin).

Supplementary Materials Web Appendices A and B referenced in Section 3, are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

Chatterjee N, Chen YH, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association. 2003;98:158–168.

Cornfield J. A method of estimating comparative rates from clinical data. Applications to cancer of lung, breast and cervix. Journal of the National Cancer Institute. 1951;11:1269–1275. [PubMed]

Dobson S, van Esch GJ. Polychlorinated Biphenyls and Terphenyls. 2nd ed World Health Organization; Geneva: 1993. pp. 79–84.

Gilbert ME, Liang D. Alterations in synaptic transmission and plasticity in hippocampus by a complex PCB mixture, Aroclor 1254. Neurotoxicology and Teratology. 1998;20:383–389. [PubMed]

Gladen BC, Rogan WJ. Effects of perinatal polychlorinated biphenyls and dichlorodiphenyl dichloroethene on later development. The Journal of Pediatrics. 1991;119:58–63. [PubMed]

Gray KA, Klebanoff MA, Brock JW, Zhou H, Darden R, Needham L, Longnecker MP. In utreo exposure to background levels of polychlorinated biphenyls and cognitive functioning among school-age children. American Journal of Epidemiology. 2005;162:17–26. [PubMed]

Haneuse S, Wakefield J. Hierarchical models for combining ecological and case-control data. Biometrics. 2007;63:128–136. [PubMed]

He X, Zhu ZY, Fung WK. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika. 2002;89:579–590.

Holt D, Smith TMF, Winter PD. Regression analysis of data from complex survey. Journal of Royal Statistical Society, Ser.A. 1980;143:474–487.

Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.

11. Huang JZ, Zhang L, Zhou L. Efficient estimation in marginal partially linear models for longitudinal/clustered data using spline. Scandinavian Journal of Statistics. 2007;34:451–477.

Jacobson JL, Jacobson SW. Intellectual impairment in children exposed to polychlorinated biphenyls in utero. New England Journal of Medicine. 1996;335:783–789. [PubMed]

Lehmann E. Theory of point estimation. Wiley; New York: 1983.

Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001;96:1045–1056.

Longnecker MP, Klebanoff MA, Zhou H, Brock JW. Association between maternal serum concentration of the DDT metabolite DDE and preterm and small-for-gestational-age babies at birth. Lancet. 2001;358:110–114. [PubMed]

Patandin S, Lanting CI, Mulder PG, et al. Effects of environmental exposure to polychlorinated biphenyls and dioxins on cognitive abilities in Dutch children at 42 months of age. Journal of Pediatrics. 1999;134:33–41. [PubMed]

Qin J, Lawless JF. Empirical Likelihood and General Estimating Equations. The Annals of Statistics. 1994;22:300–325.

Qu A, Li R. Quadratic inference functions for varying coefficient models with longitudinal data. Biometrics. 2006;62:379–391. [PMC free article] [PubMed]

Rathouz PJ, Satten GA, Carroll RJ. Semiparametric inference in matched case-control studies with missing covariate data. Biometrika. 2002;89:905–916.

Rogan WJ, Gladen BC, Hung KL, Koong SL, Shih LY, Taylor JS, Wu YC, Yang D, Ragan NB, Hsu CC. Congenital poisoning by polychlorinated biphenyls and their contaminants in Taiwan. Science. 1988;241:334–336. [PubMed]

Rogan WJ, Gladen BC. PCBs, DDE, and child development at 18 and 24 months. Annals of Epidemiology. 1991;1:407–413. [PubMed]

Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11:735–757.

Ruppert D, Carroll R. Cornell University, School of Operations Research and Industrial Engineering; 1997. Penalized Regression Splines. working paper. (Available at www.orie.cornell.edu/david/papers)

Schantz SL, Widholm JJ, Rice DC. Effects of PCB exposure on neuropsychological function in children. Environmental Health Perspectives. 2003;111:357–376. [PMC free article] [PubMed]

Schildcrout JS, Heagerty PJ. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749. [PMC free article] [PubMed]

Syracuse Research Toxicological Profile for Polychlorinated Biphenyls (PCBs) 2000. pp. 556–576. prepared for U.S. Department of Health and Human Services, Public Health Service, Agency for Toxic Substances and Disease Registry, U.S. Department of Health and Human Services, Public Health Service, Agency for Toxic Substances and Disease Registry, Atlanta, GA.

Wang N, Carroll RJ, Lin X. Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association. 2005;100:147–157.

Wang X, Zhou H. A Semiparametric Empirical Likelihood Method For Biased Sampling Schemes In Epidemiologic Studies With Auxiliary Covariates. Biometrics. 2006;62:1149–1160. [PubMed]

Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.

White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. [PubMed]

Yu Y, Ruppert D. Penalized Spline Estimation for Partially Linear Single Index Models. Journal of the American Statistical Association. 2002;97:1042–1054.

Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421. [PubMed]

Zhou H, Chen J, Rissanen TH, Korrick SA, Hu H, Salonen JT, Longnecker MP. Outcome dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome. Epidemology. 2007;18:461–468. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |