Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J R Stat Soc Ser C Appl Stat. Author manuscript; available in PMC 2012 August 1.
Published in final edited form as:
J R Stat Soc Ser C Appl Stat. 2011 August; 60(4): 559–574.
doi:  10.1111/j.1467-9876.2010.00756.x
PMCID: PMC3181132

A Partially Linear Regression Model for Data from an Outcome-Dependent Sampling Design


The outcome dependent sampling scheme has been gaining attention in both the statistical literature and applied fields. Epidemiological and environmental researchers have been using it to select the observations for more powerful and cost-effective studies. Motivated by a study of the effect of in utero exposure to polychlorinated biphenyls on children’s IQ at age 7, in which the effect of an important confounding variable is nonlinear, we consider a semi-parametric regression model for data from an outcome-dependent sampling scheme where the relationship between the response and covariates is only partially parameterized. We propose a penalized spline maximum likelihood estimation (PSMLE) for inference on both the parametric and the nonparametric components and develop their asymptotic properties. Through simulation studies and an analysis of the IQ study, we compare the proposed estimator with several competing estimators. Practical considerations of implementing those estimators are discussed.

Keywords: Outcome dependent sampling, Estimated likelihood, Semiparametric method, Penalized spline

1 Introduction

An outcome dependent sampling (ODS) design is an attempt to enhance study efficiency in a cost-effective way. Under an ODS design, the primary covariate, the exposure variable, is observed only on some subsets of the study subjects, conditional on the values of the response variable and possibly some other auxiliary covariates for the exposure. The principle motivation for ODS designs is to concentrate resources where there is the greatest amount of information. By allowing the selection probability of each individual in the ODS sample to be dependent on outcome, the investigators can enhance the efficiency and reduce the cost of the study.

In a recent example that employed the ODS design(Gray et al. 2005), investigators were interested in how children’s IQ at 7 years of age is related to polychlorinated biphenyls (PCBs). The study subjects are children who were born into the Collaborative Perinatal Project (CPP) which is a prospective cohort designed to provide precise data for studies of a wide variety of neuropsychological outcomes and birth defects (Niswander and Gordon, 1972). Since it was too expensive to assay the PCB exposure for the entire study population of 44,075 subjects, the investigators decided to obtain exposure measurements for a sample that was sampled in an ODS way from the population based on the observed IQ scores (Gray et al., 2005). In the following, for simplicity, we refer to the data set including the PCB measurements (Gray et al., 2005) as the CPP data set.

Several authors have studied statistical inference for data from an ODS design. For example, Breslow and Holubkov (1997) developed maximum likelihood estimation of logistic regression coefficients for a hybrid two-phase design. Lawless, Kabfleisch and Wild (1999) considered a full semiparametric likelihood method. For a continuous outcome variable, Zhou et al. (2002) considered a general two-component ODS scheme where an overall simple random sample and additional supplementary samples are observed. Chatterjee, Chen and Breslow (2003) proposed a pseudoscore estimation method for regression problems with two-phase ODS. Breslow, McNeney and Wellner (2003) derived a large sample theory for semiparametric regression models. Weaver and Zhou (2005) proposed an estimated maximum likelihood method using the estimated likelihood technique (e.g., Carroll and Wand 1991, Pepe and Fleming 1991, Zhou and Pepe 1995) for incorporating additional information in the non-ODS sample. Wang and Zhou (2006) considered the case of an ordinal outcome variable with an auxiliary covariate. Zhou et al. (2007) further demonstrated the improved efficiency obtained by using the ODS design, and its applicability in a wide range of settings.

These existing methods are based on the assumption that the effect on the outcome of the covariates is linear. This assumption is chosen mainly for mathematical convenience. In practice, the true parametric relationship between the outcome and covariates is rarely known. For example, in the above mentioned epidemiological study (Gray et al. 2005), investigators were interested in identifying the relationship of the children’s IQ at age 7 to in utero exposure to PCBs, after adjusting for potential confounders, including the highest education level attained by the mother. Maternal education is often the strongest confounding factor in studies of environmental determinants of child IQ (Walkowiak et al. 1998; Angelsen et al. 2001; Bohm et al. 2002). The relation of maternal education to child IQ is not linear, with mother’s years in college having a much greater effect on child IQ than do years of education in primary and secondary school (e.g., Breslau et al. 2005; Oddy et al. 2003). Given the strength of this confounding factor, the manner in which education is modeled could affect the amount of bias in the coefficient for the exposure of interest. Thus, we were motivated to develop a partial linear method of modeling a covariate in the ODS setting.

Handling the nonparametric component in semiparametric models is generally challenging. One approach is to use nonparametric tools, e.g. the kernel estimator (Speckman 1988). This method is computationally intensive. Another method is to parametrize the nonparametric component using some flexible functions and then use some parametric tools, such as Fourier series approximation, Demmler-Reinsch series approximation, or wavelets. However, selecting the truncation parameter and allocating the knots in these methods can be challenging. An alternative approach is the penalized spline method, using a roughness penalty for the nonparametric regression function (e.g, Eilers and Marx 1996). The idea of a roughness penalty on splines is not new (e.g, O’Sullivan 1986), though the technique has become popular recently due to its effectiveness with penalized splines. With penalized splines, the number and location of knots are no longer crucial as long as the minimum number of knots is reached and the smoothing parameter is used to balance the goodness-of-fit and smoothness (Ruppert and Carroll 2000, Yu and Ruppert 2002, Wu and Yu 2004).

To model nonlinear covariate effects under the ODS sampling scheme, we consider a penalized spline maximum likelihood estimator (PSMLE) for the parametric and nonparametric components in a partially linear regression model. Using a simulation study and an analysis of the Collaborative Perinatal Project (CPP) data set, we present a case-study for comparing the PSMLE with several competing methods.

2 Penalized Spline Maximum Likelihood Estimation for ODS Design

2.1 Partial Linear Model and ODS Data Structure

To fix notation, let Y denote an outcome variable, and X and Z denote covariates. Let fY|X,Z (y|x, z) be the conditional density of Y given X = x and Z = z. We specify fY|X,Z (y|x, z) through the following regression model


where m(·) is a known link function, β is an unknown parameter vector corresponding to X and α(·) is an unknown function to be estimated. For simplicity, we write the conditional model as f(y|x, z; β, α(·)). Model (1) can be viewed as a semiparametric model if α(·) is unspecified. It is also referred to as a partial linear model since part of the covariate vector (Z) is modeled as a nonlinear function.

We assume that the observed data are from an ODS sampling scheme (Zhou et al. 2002; Weaver and Zhou 2005). Specifically, assume that the domain of Y is union of K nonoverlapping intervals Ck = (ak−1, ak], with ak being known constants satisfying a0 [equivalent] −∞ < a1 < a2 < … < aK [equivalent] ∞. The choice of the number of the intervals generally depends on the regions in the domain of the outcome variable which may contain great amount of information. A three Interval ODS scheme (K=3) can be selected for simplicity in practice (e.g., Gray et al. 2005). This will also be an over-representation of the tails of the distribution of the outcome that would be otherwise missing in a standard SRS scheme.

We assume there exists a base population (of sample size N) that is a simple random sample of the underlying study population, on which we observe {Y,Z}. The exposure variable X is only observed for a subset of this base population that is selected in an ODS way. In particular, observation of X comes from two components: First, we observe X on a simple random sample (SRS) of size n0. Secondly, for each stratum defined by {Y [set membership] Ck}, k = 1,…,K, we observe X on a supplementary random sample of size nk. The first component is sometimes omitted, so that n0 = 0. Hence the data set where X is observed is

{Xi,i=1,,n0}SRS component and{Xi|YiCk,i=1,,nk}Supplementary component,k=1,,K.

Let nV=n0+k=1Knk denote the size of the ODS subsample for which we observe (Y,X,Z), and let nV = NnV be the number of individuals for whom only (Y,Z) is observed. To borrow some terms from the measurement error literature, we will refer to the nV complete observations as the validation sample, and nV incomplete observations as the nonvalidation sample. Let V represent the index set of all validation observations, and let V represent the index set of all nonvalidation observations. Further, let Vk and Vk represent the index sets for observations in the kth stratum (Y [set membership] Ck) in validation and nonvalidation samples, respectively.

Of the 44,075 children from the previously mentioned CPP data set, 38,709 have complete data (no missing data other than PCB), which will represent our study population. Furthermore, a sample of size 1038 with measured PCB levels is obtained from the study population through the above ODS design. In particular, the domain of the outcome IQ was divided into 3 intervals, i.e., C1 = (−∞, 82], C2 = (82, 110] and C3 = (110,∞), where 82 and 110 equal to the mean of IQ (96) minus or plus one standard deviation of IQ (14). It was anticipated that a sampling design in which children with extreme IQ scores were oversampled would enhance the efficiency of the study relative to an SRS design of the same size. Therefore, in addition to a SRS sample of size 849, 81 children with IQ < 82 and 108 children with IQ > 110 are randomly selected from intervals C1 and C3. Thus, in the sampling notation, we have that nV = 1038, nV = 37671, a1 = 82, a2 = 110, n0 = 849, n1 = 81, n2 = 0 and n3 = 108. Table 1 gives summary of the specific data structure.

Table 1
Number of subjects in the CPP dataset according to subgroup membership.

When data are collected through the ODS scheme described above, several levels of information could be used for inference about β and α(·). The simplest possibility is to use only those observations that make up the SRS portion (if this exists) of the ODS design. Alternatively, one could try to use the complete data portion (V). Clearly, a more efficient estimate can be achieved if one uses all available data.

Using Bayes formula and the multinomial distribution for finite population sampling, Weaver and Zhou (2005) show that the full-information likelihood based on all available data is proportional to

(β,α(·),GX|Z)[iVf(Yi|Xi,Zi;β,α(·))]  [jV¯fY(Yj|Zj;β,α(·))]



and GX|Z(x|Z) is the conditional distribution of X given Z.

Since the sampling mechanism used to obtain X in the validation sample is not a simple random sample, we cannot use a simple global empirical distribution function to estimate GX|Z. Proper accommodation for the ODS nature of the validation sample is needed. By the Law of Total Probability, the distribution function of X|Z can be written as


Hence, we can estimate GX|Z(x|z) by the kernel smoother (e.g., Nadaraya 1964, Watson 1964):


with Ĝk(x|z) = Σi[set membership]Vk I(Xix)L(Ziz)/Σi[set membership]Vk L(Ziz), where L(·) = L(·/ℎ)/ℎ and ℎ > 0 is the bandwidth. L(·) is called the kernel function and is a piecewise smooth function satisfying ∫ L(u)du = 1. We use a standard normal density function in our computations. For further details on the kernel smoother see Eubank (1988).

Then we can obtain an estimator of f(Yj|Zj, β, α(·)) as


2.2 Penalized Spline for Modeling α(·)

For convenience of presentation, we assume Zi is an univariate variable. The unknown function α(·) can be estimated by a penalized spline (Ruppert and Carroll 2000 and Ruppert 2002). Assume that


where {ϑk}k=1κ are spline knots. Model (2) uses the so-called truncated power function basis, though other bases (e.g., B-splines) could also be used. Define the spline coefficient vector δ = (δ, δ1,…,δm)τ and spline basis


Our spline model is α(z) = δτB(z). Denote ζ = (βτ, δτ)τ. The PSMLE of [zeta] = ([beta]τ, [delta with circumflex]τ)τ is defined as ζ that maximizes



^F(β,δ)=iVln f(Yi|Xi,Zi;β,δ)+jV¯ln f^(Yj|Xi,Zi;β,δ)=iVln f(Yi|Xi,Zi;β,δ)+jV¯lnk=1K(iVkf(Yj|Xi,Zj;β,δ)L(ZiZj)))  iVkV¯kL(ZiZj)iVkL(ZiZj)iVV¯L(ZiZj),

λ ≥ 0 is a smoothing parameter, and D is an appropriate positive semi-definite symmetric matrix such that δτDδ=min(Zi)max(Zi)[α(z)]2dz, which yields the usual quadratic integral penalty (Ruppert 2002).

We now describe some asymptotic results that are summarized in three theorems with outline proofs, in the Appendix.

First, under some regularity conditions, the proposed estimator is consistent and asymptotically normally distributed. Furthermore, suppose we are interested in testing a constraint on the parameters as in the hypothesis


We define a likelihood ratio statistic R(ζ) that is based on the Qλ,N (β, δ) as


where ψ(·) is a q × 1 vector function (q < p + m + κ + 1). The likelihood ratio statistic is asymptotically distributed as χq2 under H0. Our likelihood ratio test (LRT) differs from that in Hastie and Tibshirani (1990) in the computation of the degrees of freedom. Specifically, the degrees of freedom for our proposed LRT is computed by the difference between the number of the parameters in the null model and the unrestricted model rather than the effective degrees of freedom in Hastie and Tibshirani (1990).

For the partial linear model, it is of particular interest to test whether the nonparametric function is linear. We can use the previously described likelihood ratio test to do this by re-expressing the m + κ + 1-dimensional vector δ as (δ1T,δ2T), where δ1 = (δ11, δ12)T is a two-dimensional vector and δ2 is a m + κ −1 dimensional vector. We are then interested in testing the null hypothesis H0:δ2=δ20=(0,,0)T,i.e.,H0:ψ(ζ)=δ2=δ20=0. Under H0, xτβ + α(z) = xτβ + δ11 + δ12z, i.e. the variable Z is related to response Y linearly.

2.3 Selection of smoothing Parameter, Knots and Penalty

To implement the proposed method in practice, it is desirable to have an automatic data-driven method for estimating the smoothing parameter λ. Generalized cross-validation (GCV) is an attractive way to choose λ since it is computationally expedient and does not need a prior estimate of error variance. Following the conventional technique of penalized least squares (e.g., Ruppert 2002), we define

e(λ)=tr [{QN,λ(ζ^)}1F(ζ^)]

where {QN,λ(ζ^)}1F(ζ^) is the smoothing or hat matrix. In nonparametric regression, the trace of the smoothing matrix is often called the degrees of freedom of the fit. It has the rough interpretation as the equivalent number of parameters (Yu and Ruppert 2002). The GCV statistic is defined by


where RSS = 2[ell]F ([zeta]) is the residual sum of squares corresponding to [zeta], given λ. We select [lambda with circumflex] = argminλ{GCV(λ)}.

Since the complicated knot selection problem is reduced to the choice of a single smoothing parameter λ the selection of the number of knots and knot locations is no longer crucial for the penalized spline. Ruppert (2002), Wu and Yu (2004) and Yu (2008) have observed that the choice of the number of knots κ is not too important, provided it is large enough. Ruppert (2002) and Wu and Yu (2004) suggested choosing approximately min(n/4, 35) or min(n/4, 40) knots, respectively, where n is the number of distinct values of the sample of the nonparametric covariate. However, in many practical situations where the regression function is smooth and either monotonic or unimodal, 10 to 20 knots are very adequate, as suggested in Yu (2008). Given our choice of nonlinear function, the number of knots are chosen from 10 to 30 in our simulation which works quite well (see Figure 1). Given a fixed number of knots, Wu and Yu (2004) and Yu (2008) recommended that the knots are placed at equally-spaced sample quantiles of the index Z.

Figure 1
The estimated function α(·) in Cases 1 and 2 in the simulation studies. Left column: [alpha]P (z)(solid line), [alpha]HT (z) (dashed line), [alpha]BC (z) (dash-dotted line) and true α(z)(bold dotted ...

As in Wu and Yu (2004) and Yu (2008), we take a quadratic penalty of the form λδT in our paper. When the nonparametric function has discontinuity, the nonquadratic penalty functions may be a better choice. Ruppert and Carroll (1997) gave a general penalty of the form t=1T|αr+t|γ,γ>0, and pointed out that penalties with γ ≤ 1 can perform better than a quadratic penalty for discontinuous functions. Otherwise, the quadratic penalty is preferred.

3 Simulation Studies

We investigate the small sample behavior of the proposed method through simulation studies, comparing the proposed estimator with several potential competing estimators. Mimicking the design of CPP study, we generate data according to the following regression model:

Yi=β1X1i+β2X2i+α(Zi)+εi,  i=1,,

where X1i ~ N(1, 0.25), X2i ~ N(0, 1), β1 = 1, β2 = 1.5, Zi = ξi + X1iI(|X1i| ≤ 1) with ξi ~ U(0, 1), and ξi ~ N(0, 1). We take (n0, n1, n2,N) = (100, 25, 25, 600), (200, 25, 25, 500), (200, 50, 50, 1200) and C1 = (−∞, µy − σy) and C2 = (µY + σY,∞). We consider two choices for α(z) that represent nonlinear forms commonly observed in practice:

  • Case 1: α(z) = sin(2πz),
  • Case 2: α(z) = (0.02 exp(10(z − 1)))/(1 + exp(8(z − 1.5))).

The first case has a cyclic pattern (Hickey et al. 1984, Strum and Pinsky 2006, Elkum et al. 2008) while the second has a flat response at the beginning and a sharp rise at the end of the range (e.g, Figure 1(a) and (c)).

Under each setting, we compare three estimators:

  • [beta]P: the proposed penalized spline maximum likelihood estimator.
  • [beta]HT: the modified Horvitz-Thompson weighted-likelihood method. The estimator of ζ = (βτ, δτ)τ maximizes
    k=1K1/(n0/N+nk/Nk)iVkln f(Yi|Xi,Zi;β,δ)λNδτDδ
    where λδτDδ has the same definition as in (3).
  • [beta]BC: the modified Breslow-Cain pseudo-likelihood method. The estimator of ζ = (βτ, δτ)τ maximizes
    iV{ln f(Yi|Xi,Zi;β,δ)lnk=1K1/(n0/N+nk/Nk)Pr(YiCk|Xi)}λNδτDδ.

We take the number of knots as 15, 13 and 30 respectively corresponding to the sample size 600, 500 and 1200 in the simulations. The means, standard errors (SE), estimators of standard errors (SE^) and coverages of 95% nominal confidence intervals (CI) were calculated from 1,000 independent runs. Table 2 lists results for the above estimators under various configurations.

Table 2
The finite sample performance for the parametric components.

Clearly, [beta]P, [beta]HT and [beta]BC are approximately unbiased for both β1 and β2. The proposed estimator ([beta]P) is always more efficient than [beta]HT and [beta]BC. This supports the notion that taking the nonvalidation sample into account can improve the efficiency of estimation. The nominal 95% confidence intervals based on the proposed standard errors provide good coverage for the cases studied for [beta]P, [beta]HT, [beta]BC.

Figure 1 (a) and (b) show the estimators of α(z) = sin(2πz) and their corresponding pointwise SEs. Figure 1 (c) and (d) show the estimators of α(z) = (0.02 exp(10(z−1)))/(1+exp(8(z − 1.5))) and their corresponding pointwise SEs. From Figure 1, we see that [alpha]P (z), [alpha]HT (z) and [alpha]HT(z) are approximately unbiased. Furthermore, [alpha]P (z) has smallest pointwise SE among [alpha]HT (z) and [alpha]BC(z), suggesting that the proposed method is indeed a more efficient approach.

4 Analysis of the CPP Data

We analyze the CPP data to identify the relationship of the children’s IQ at 7 years of age to in utero exposure to polychlorinated biphenyls (PCBs), after adjusting for potential confounders, including the highest education level attained by the mother (EDU).

Additional covariates in the analysis are socioeconomic status of the child’s family (SES), the gender of the child (SEX, with female=1 and male=0) and the race of the child (RACE, with black=1 and other=0). To model the nonlinear effect of education noted in the Introduction, we consider the following partial linear model,


where α(EDU) is an unspecified function to be estimated along with βi, i = 1,…,4. To estimate the nonparametric function α(·), we adopted a three-degree truncated power function basis M(z)=(1,z,z2,z3,(zϑ1)+3,,(zϑ5)+3)T with five fixed knots [theta]1,…,[theta]5 selected as the equally spaced sample quantiles of EDU, i.e., 2,5,9,12,15. Under these specifications, the above model can be rewritten as IQ = β1PCB2SES3RACE4SEX+MT (EDU)δ+ε, where δ = (δ0,…, δ8)T is the parameter vector associated with the nonparametric function α(·).

We analyzed the CPP data with the following methods using a penalized spline for α(EDU): the proposed method (P), the modified Horvitz-Thompson weighted likelihood method (HT), the modified Breslow-Cain pseudo-likelihood method (BC), and the MLE estimator based on the SRS sample (MLE-SRS). The smoothing parameter was selected as 0.0853 by the proposed GCV method. Additionally, for the proposed method, we also considered modeling α(EDU) as a linear, quadratic, or cubic function of EDU. Furthermore, we considered using a restricted cubic spline (Herndon and Harrell 1990) for α(EDU) and obtained the corresponding estimate through maximizing [l with hat]F (β, δ). The restricted cubic spline, which has a linearly constrained tails which is slightly different from the general cubic spline function and can be used directly to fit models without penalty.

The estimated [alpha](EDU) from the different methods and their corresponding 95% confidence intervals given in Figure 2. The fitted [alpha](EDU) tell a similar story in that there is a clear nonlinear trend present in all fitted lines. The most noticeable difference is the width of the confidence interval band, which indicates which method is more efficient. A careful inspection of the trend of [alpha](EDU) reveals that the rate of rise of [alpha](EDU) is much faster after around year 12 (i.e. after high school education). This agrees with the previous published results (e.g., Breslau et al., 2005; Oddy et al., 2003) that mother’s years in college have a much greater effect on child IQ.

Figure 2
The estimated function α(·) on EDU for CPP data. Plot a: the curve obtained by proposed method; Plot b: the curve obtained by HT method; Plot c: the curve obtained by BC method; Plot d: the curve obtained by MLE method based on the SRS ...

We conducted likelihood ratio tests for testing if the nonlinear fit of α(EDU) from the proposed method can be represented by a simple polynomial function. The following three tests on the form of α(EDU) are for linear, quadratic, and cubic functions, respectively.

  • Test 1: H0 : α(EDU) = δ0 + δ1EDU,
  • Test 2: H0 : α(EDU) = δ0 + δ1EDU + δ2EDU2,
  • Test 3: H0 : α(EDU) = δ0 + δ1EDU + δ1EDU2 + δ3EDU3.

The test statistic for the Test 1–3 are: for linear α(EDU), T1=314.40>χ0.952(7)=14.07 with p< 0.001; for quadratic α(EDU), T2=94.90>χ0.952(6)=12.59 with p< 0.001; and for cubic α(EDU), T3=65.97>χ0.952(5)=11.07 with p< 0.001, respectively. These results suggest that, although the cubic fit in Figure 2(e) may be sufficiently close to the fully nonparametric fit in Figure 2(a) for practical purposes, there is still statistical evidence suggesting that [alpha](EDU) may be more complex than a cubic function.

The parameter estimates from the six methods are presented in Table 4 which also includes the analysis with a linear effect for EDU using the Zhou et al. (2002) method which is based on the ODS data only. Overall, the point estimates from the above methods are similar. The most obvious difference across the methods is that the standard error estimates from the proposed methods (the P, cubic and restricted cubic spline) are much smaller for the covariates. This reflects the fact that these three methods utilized the real values of the covariates in the entire study cohort while the others only used the fraction as weight in the inference. In addition, we computed the values of the penalized log-likelihood function (3) for these three methods, which are Qp = −150441.651, Qc = −150474.636, QR = −150452.816 respectively corresponding to the P, cubic and restricted cubic spline methods, indicating that the P method is more suitable for this CPP data set than the other two methods. However, for practical purposes, the restricted cubic spline method is a viable alterative in this case.

Table 4
Analysis results for the CPP data set.

5 Discussion

In this paper we proposed a semiparametric regression model to analyze data obtained by outcome dependent sampling. We only partially parametrize the relationship between the response variable and the covariates. By combining the estimated semiparametric likelihood and penalized spline techniques, we propose a penalized spline maximum likelihood estimation method for the key parametric and nonparametric components. The resulting estimators were shown to be consistent and asymptotically normal. In practice, our penalized spline maximum likelihood estimation offers a few additional advantages. For example, as a direct approach, penalized spline maximum likelihood estimates can be obtained through standard penalized maximum likelihood estimation. The algorithm is efficient and convergence is fast; moreover, as a global smoothing method, the penalized spline maximum likelihood approach yields a parsimonious model, which is convenient for inference and forecasting.

The smoothing parameter λ is used to balance goodness-of-fit and smoothness. Compared with the restricted cubic spline, use of the penalized spline can help avoid undersmoothing in some cases, e.g., large number of knots are needed when the nonparametric function has many local mimima and maxima. We only focused on the setting where Z is univariate. When Z is bivariate or multivariate, we can approximate the unknown function α(·) by bivariate or multivariate basis functions (Ruppert, Wand and Carroll 2003). In addition, when Z is bivariate or multivariate, to avoid excess dimensionality one can use structural nonparametric regression models such as a varying-coefficient model, additive model, or a single-index model. We believe the proposed penalized spline maximum likelihood estimation can be extended to these corresponding semiparametric regression models without significant modification.

Table 3
Simple summary statistics for the CPP data set.


This work is supported by a grant from National Institute of Health (R01CA 79949) (for Zhou, You and Qin) and in part by the intramural research program of the NIH, National Institute of Environmental Health Sciences (for Longnecker). Qin’s work is also partially supported by the NSF of China (10801039).


Outline of Proofs for the Main Results

The asymptotic properties of the proposed estimators based on the estimated penalized likelihood (3) and model (2) are summarized in the following theorems.

Theorem 1 Under some regularity conditions (Weaver and Zhou 2005), if the smoothing parameter λ = o(1), then [zeta] is a strong consistent estimator of ζ.

Theorem 2 Under same regularity conditions above, if the smoothing parameter λ = o(N−1/2), then [zeta] has asymptotic distribution,

N(ζ^ζ)DNp+m+κ+1(0,Ω)    as  N,



where dk = πk(1 − ρ0ρV) − ρkρV, πk = πk(β, α(·),GX,Z), ρk = limN→∞ nk/nV and ρV = limN→∞ nV /N,

I(ζ)=k=0KρkρVEk[2log f(Y|X,Z;ζ)ζζτ]k=1KdkEk[2log f(Y|Z;ζ)ζζτ],

Ek denotes expectation conditional on Y [set membership] Ck,




A consistent estimator for Ω can be constructed using sample quantities.

Theorem 3 Under same regularity conditions above, if the smooth parameter λ = o(N−1/2), then for testing the null hypothesis H0 : ψ(ζ) = 0, we have Rχq2.

Proof of Theorem 1. Let the full-information log likelihood be

F(β,δ)=k=0KiVkln f(Yi|Xi,Zi;β,δ)+k=1KjV¯kχf(Yj|Zj,x;β,δ)dGX|Z(x|Zj).



According to the proof of Theorem 3.1 in Weaver (2001) it holds that

1N(^F(β,δ)ζF(β,δ)ζ)p0  and  1NF(β,δ)ζp0  as  N.

Therefore, combining the fact that λ = o(1) we have [partial differential]QN(β, δ)/[partial differential]ζp 0 as N → ∞. In addition,


According to the proof of Theorem 3.1 in Weaver (2001) it holds that

1N(2^F(β,δ)ζζτ2F(β,δ)ζζτ)p0(p+m+κ+1)×(p+m+κ+1)  and  1N2F(β,δ)ζζτpI(ζ)

uniformly for ζ [set membership] Θ as N → ∞. Therefore, combining the fact that λ = o(1) we have

2QN,λ(β,δ)ζζτpI(ζ)  uniformly for  ζΘ  as  N.

Thus, if we let fN(ζ)=QN,λ(β,δ)ζ we can apply Lemma 3.3 in Weaver (2001) to conclude that ζ^=fN(1)(0) exists in the set Θ with probability approaching one as N → ∞, and since the size of Θ is arbitrarily small, that [zeta]p ζ. Furthermore, the sequence of estimators {[zeta]} is unique in the sense that any other sequence {[zeta]} such that QN,λ(β¯,δ¯)ζ=0 and [theta w/ macron] = [theta w/ hat] must lie outside of the set Θ with probability going to one as N → ∞.

Proof of Theorem 2. For consistent estimator [zeta], using a first order Taylor expansion near ζ yields that


where ζ* is a vector between [zeta] and ζ. By standard manipulation, we have


Thus, to prove the asymptotic normality of N(ζ^ζ), we need only show 1/NQN,λ(β,δ)ζ has an asymptotic normal distribution and that 1/N2QN,λ(β*,δ*)ζζτ converges in probability to an invertible matrix.

According to the definition of QN(β, δ) we have


Note that λ = o(N−1/2). Therefore, we have 2λblockdiag(0p×p,D)(0pτ,δτ)τ=o(N12) . Thus, by the same argument as the proof of Theorem 3.2 ofWeaver (2001), we can show that Theorem 2 holds.

Proof of Theorem 3. Note that when λ = o(N−1/2), the penalty term in R(ζ) tends to zero with a rate of o(N−1/2). Then through the similar procedure for proof of the asymptotic distribution of the classical likelihood ratio statistics, Theorem 3 can be obtained. We have omitted the details here.

Contributor Information

Haibo Zhou, University of North Carolina at Chapel Hill, USA.

Jinhong You, University of North Carolina at Chapel Hill, USA.

Guoyou Qin, University of North Carolina at Chapel Hill, USA and Fudan University, CHINA.

Matthew P. Longnecker, National Institute of Environmental Health Sciences, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, NC, USA.


  • Angelsen NK, Vik T, Jacobsen G, Bakketeig LS. Breast feeding and cognitive development at age 1 and 5 years. Arch Dis Child. 2001;85:183–188. [PMC free article] [PubMed]
  • Bohm B, Katz-Salamon M, Institute K, Smedler AC, Lagercrantz H, Forssberg H. Developmental risks and protective factors for influencing cognitive outcome at 5 1/2 years of age in very-low-birthweight children. Dev Med Child Neurol. 2002;44:508–516. [PubMed]
  • Breslau N, Paneth N, Lucia VC, Paneth-Pollak R. Maternal smoking during pregnancy and offspring IQ. Int J Epidemiol. 2005;34:1047–1053. [PubMed]
  • Breslow N, McNeney B, Wellner JA. Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. Ann. Statist. 2003;31:1110–1139.
  • Breslow N, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J. Roy. Statist. Soc., B. 1997;59:447–461.
  • Carroll RJ, Wand MP. Semiparametric estimation in Logistic measurement error model. Journal of the Royal Statistics Society, B. 1991;53:573–585.
  • Chatterjee N, Chen Y, Breslow N. A pseudoscore estimator for regression problems with two-phase sampling. J. Amer. Statist. Assoc. 2003;98:158–168.
  • Eilers PHC, Marx BD. Flexible smoothing with B-spline and penalties. Statist. Science. 1996;11:89–102.
  • Elkum NB, Myles JD, Kumar P. Analyzing biological rhythms in clinical trials. Contemp Clin Trials. 2008;29:720–726. [PubMed]
  • Gray KA, Longnecker MP, Klebanoff MA, Brock JW, Zhou H, Needham L. In Utero exposure to background levels of Polychlorinated Biphenls and cognitive functioning among school-aged children. Am J Epidemiology. 2005;162:17–26. [PubMed]
  • Hastie TJ, Tibshirani RJ. Generalized Additive Models. London: Chapman and Hall; 1990.
  • Herndon JE, Harrell RE. The restricted cubic spline hazard model. Communications in Statistics: Theory and Methods. 1990;19:639–694.
  • Hickey DS, Kirkland JL, Lucas SB, Lye M. Analysis of circadian rhythms by fitting a least squares sine curve. Comput Biol Med. 1984;14:217–223. [PubMed]
  • Hurvich CM, Simonoff JS, Tsai CL. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. R. Stat. Soc. Ser. B. 1998;60:271–293.
  • Lawless JF, Kabfleisch JD, Wild CJ. Semiparametric methods for response-selective and missing data problems in regression. J. Roy. Statist. Soc., B. 1999;61:413–438.
  • Oddy WH, Kendall GE, Blair E, De Klerk NH, Stanley FJ, Landau LI, Silburn S, Zubrick S. Breast feeding and cognitive development in childhood: a prospective birth cohort study. Paediatr Perinat Epidemiol. 2003;17:81–90. [PubMed]
  • O’Sullivan F. A statistical perspective on ill-posed inverse problems. With comments and a rejoinder by the author. Statist. Sci. 1986;1:502–527.
  • Pepe MS, Fleming TR. A nonparametric method for dealing with mismeasured covariate data. J. Amer. Statist. Assoc. 1991;86:108–113.
  • Ruppert D. Selecting the number of knots for penalized splines. J. Comput. Graph. Statist. 2002;11:735–757.
  • Ruppert D, Carroll R. Spatially-adaptive penalties for spline fitting. Austr. and New Zeal. J. Statist. 2000;42:205–223.
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric regression. Cambridge University Press; 2003.
  • Speckman P. Kernel smoothing in partial linear models. J. Roy. Statist. Soc. Ser. B. 1988;50:413–436.
  • Strum DP, Pinsky MR. Modeling ischemia-induced dyssynchronous myocardial contraction. Anesth Analg. 2006;103:846–853. [PubMed]
  • Wang X, Zhou H. Semiparametric Empirical Likelihood for ODS with Ordinal Outcome Variable. Biometrics. 2006;62:1149–1160. [PubMed]
  • Walkowiak J, Altmann L, Kramer U, Sveinsson K, Turfeld M, Weishoff-Houben M, Winneke G. Cognitive and sensorimotor functions in 6-year-old children in relation to lead and mercury levels: adjustment for intelligence and contrast sensitivity in computerized testing. Neurotoxicol Teratol. 1998;20:511–521. [PubMed]
  • Weaver MA. Unpublished doctoral dissertation. University of North Carolina; 2001. Semiparametric methods for continuous outcome regression models with covariate data from an outcome-dependent subsample.
  • Weaver MA, Zhou H. An Estimated Likelihood Method for Continuous Outcome Regression Models With Outcome-Dependent Sampling. J. Amer. Statist. Assoc. 2005;100:459–469.
  • Wu Z, Yu Y. Single-index varying coefficient models with dependent data. University of Cincinnati: Working paper; 2004.
  • Yu Y. Penalized spline estimation for generalized partially linear single-index models. University of Cincinnati: Working paper; 2008.
  • Yu Y, Ruppert D. Penalized spline estimation for partially linear single-index models. J. Amer. Statist. Assoc. 2002;97:1042–1054.
  • Zhou H, Pepe MS. Auxiliary covariate data in failure time regression. Biometrika. 1995;85:139–149.
  • Zhou H, Chen J, Cai J. Random effects logistic regression analysis with auxiliary covariates. Biometrics. 2002;58:352–360. [PubMed]
  • Zhou H, Chen J, Rissanen TH, Korrick SA, Hu H, Salonen JT, Longnecker MP. Outcome-dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology. 2007;18:461–468. [PubMed]
  • Zhou H, Weaver MA, Qin J, Longnecker MP, Wang MC. A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421. [PubMed]