
Article sections

- Abstract
- 1. Introduction
- 2. Double-Penalized Least Squares Estimators and Their Asymptotics
- 3. Computational Algorithm and Parameter Tuning
- 4. Frequentist and Bayesian Covariance Estimates
- 5. Simulation Studies
- 6. Real Data Application
- 7. Discussion
- References


J Multivar Anal. Author manuscript; available in PMC 2010 October 1.

Published in final edited form as:

J Multivar Anal. 2009 October 1; 100(9): 2100–2111.

doi: 10.1016/j.jmva.2009.06.009. PMCID: PMC2766091

NIHMSID: NIHMS135122

Department of Statistics, North Carolina State University

Xiao Ni: xni@stat.ncsu.edu; Hao Helen Zhang: hzhang2@stat.ncsu.edu; Daowen Zhang: zhang@stat.ncsu.edu


We propose and study a unified procedure for variable selection in partially linear models. A new type of double-penalized least squares is formulated, using the smoothing spline to estimate the nonparametric part and applying a shrinkage penalty on parametric components to achieve model parsimony. Theoretically we show that, with proper choices of the smoothing and regularization parameters, the proposed procedure can be as efficient as the oracle estimator (Fan and Li, 2001). We also study the asymptotic properties of the estimator when the number of parametric effects diverges with the sample size. Frequentist and Bayesian estimates of the covariance and confidence intervals are derived for the estimators. One great advantage of this procedure is its linear mixed model (LMM) representation, which greatly facilitates its implementation by using standard statistical software. Furthermore, the LMM framework enables one to treat the smoothing parameter as a variance component and hence conveniently estimate it together with other regression coefficients. Extensive numerical studies are conducted to demonstrate the effective performance of the proposed procedure.

Partially linear models are popular semiparametric modeling techniques which assume that the mean response depends linearly on some covariates, while its relation to an additional variable is characterized by a nonparametric function. In particular, we consider a partially linear model *Y* = **X**^{T}**β** + *f*(*T*) + *ε*, where **X** is a *d*-dimensional vector of linear covariates, *T* is a scalar covariate entering the model through an unknown smooth function *f*, and *ε* is a random error.

Oftentimes the number of potential explanatory variables, *d*, is large, but only a subset of them is predictive of the response. Variable selection is necessary to improve the prediction accuracy and interpretability of final models. In this paper, we treat *f*(*T*) as a nuisance effect and mainly focus on automatic selection, estimation and inference for important linear effects in the presence of *T*. For linear models, numerous variable selection methods have been developed, such as stepwise selection, best subset selection, and shrinkage methods like the nonnegative garrote (Breiman, 1995), the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), least angle regression (Efron et al., 2004), and the adaptive LASSO (Zou, 2006; Zhang and Lu, 2006). Information criteria commonly used for model comparison include Mallows’ *C*_{p} (Mallows, 1973), Akaike’s information criterion (AIC; Akaike, 1973) and the Bayesian information criterion (BIC; Schwarz, 1978). Thorough reviews of variable selection for linear models are given in Linhart and Zucchini (1986) and Miller (2002).

Though there is a vast literature on variable selection for linear models, limited work has been done on model selection for partially linear models, as noted in Fan and Li (2004). Model selection for partially linear models is challenging, since it consists of several interrelated estimation and selection problems: nonparametric estimation, smoothing parameter selection, and variable selection and estimation for the linear covariates. Fan and Li (2004) did some pioneering work in this area. In the framework of kernel smoothing, Fan and Li (2004) proposed an effective kernel estimator for the nonparametric function while using the SCAD penalty for variable selection; they were among the first to extend the shrinkage selection idea to partially linear models. Bunea (2004) proposed a class of sieve estimators based on penalized least squares for semiparametric model selection and established the consistency of the estimator. Bunea and Wegkamp (2004) suggested another two-stage estimation procedure and proved that the estimator is minimax adaptive under some regularity conditions. Recently, variable selection for high dimensional data, where either *d* diverges with *n* or *d* > *n*, has been actively studied. Fan and Peng (2004) established asymptotic properties of nonconcave penalized likelihood estimators for linear model variable selection when *d* increases with the sample size. Xie and Huang (2007) studied SCAD-penalized regression for partially linear models with high dimensional data, where polynomial regression splines are employed for model estimation.

In this work, we propose a new regularization approach for model selection in the context of partial smoothing spline models and study its theoretical and computational properties. As we show in the paper, the elegant smoothing spline theory and formulation can be used to develop a simple yet effective procedure for joint function estimation and variable selection. Inspired by Fan and Li (2004), we adopt the SCAD penalty for model parsimony due to its nice theoretical properties. We show that the new estimator has the oracle property if both the smoothing and regularization parameters are chosen properly as *n* → ∞, when the dimension *d* is fixed. In the more challenging case when *d*_{n} → ∞ as *n* → ∞, we establish the consistency of the estimator in both estimation and variable selection, provided *d*_{n} grows at a suitable rate.

The rest of the article is organized as follows. In Section 2 we propose the double-penalized least squares method for joint variable selection and model estimation, and establish the asymptotic properties of the resulting estimator $\widehat{\mathit{\beta}}$. We further study the large-sample properties of the estimator, such as estimation consistency and variable selection consistency, in situations when the input dimension increases with the sample size *n*. In Section 3 we suggest a linear mixed model (LMM) representation for the proposed procedure, which leads to an iterative algorithm with easy implementation. We also discuss how to select the tuning parameters. In Section 4, we derive the covariance estimates for $\widehat{\mathit{\beta}}$ and $\widehat{\mathbf{f}}$, from both frequentist and Bayesian perspectives. Sections 5 and 6 present simulation results and a real data application. Section 7 concludes the article with a discussion.

Suppose that the sample consists of *n* observations. For the *i*th observation, denote by *y*_{i} the response, by **x**_{i} = (*x*_{i1}, …, *x*_{id})^{T} the *d*-dimensional vector of linear covariates, and by *t*_{i} the scalar covariate associated with the nonparametric effect, scaled to lie in [0, 1]. We consider the partially linear model

$${y}_{i}={\mathit{x}}_{i}^{T}\mathit{\beta}+f({t}_{i})+{\epsilon}_{i},\phantom{\rule{0.38889em}{0ex}}i=1,\dots ,n,$$

(2.1)

where **β** is a *d* × 1 vector of regression coefficients, *f*(·) is an unknown smooth function, and the *ε*_{i} are independent random errors with mean zero and variance *σ*^{2}.

To simultaneously achieve the estimation of the nonparametric function *f*(*t*) and the selection of important variables, we propose a double-penalized least squares (DPLS) approach by minimizing

$${L}_{dp}(\mathit{\beta},f(\cdot);\mathbf{Y})=\frac{1}{2}\sum _{i=1}^{n}{\{{y}_{i}-{\mathit{x}}_{i}^{T}\mathit{\beta}-f({t}_{i})\}}^{2}+\frac{n{\lambda}_{1}}{2}{\int}_{0}^{1}{\{{f}^{\prime \prime}(t)\}}^{2}dt+n\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ).$$

(2.2)

The first penalty term in (2.2) penalizes the roughness of the nonparametric fit *f*(*t*), and the second penalty *p*_{λ2}(|*β*_{j}|) is a shrinkage penalty on the linear coefficients that produces sparse solutions; *λ*_{1} and *λ*_{2} are the smoothing and regularization parameters, respectively. Throughout we use the SCAD penalty (Fan and Li, 2001), defined through its first derivative

$${p}_{{\lambda}_{2}}^{\prime}(\omega )={\lambda}_{2}\left\{I(\omega \le {\lambda}_{2})+\frac{{(a{\lambda}_{2}-\omega )}_{+}}{(a-1){\lambda}_{2}}I(\omega >{\lambda}_{2})\right\}\phantom{\rule{0.38889em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}\omega >0,$$

(2.3)

where *a* > 2 is also a tuning parameter. Fan and Li (2001) showed that the SCAD penalty function results in consistent, sparse and continuous estimators in linear models.
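As a concrete illustration, the SCAD derivative (2.3) is straightforward to evaluate numerically; the sketch below (in Python, with hypothetical argument names) uses the default *a* = 3.7 recommended by Fan and Li (2001):

```python
import numpy as np

def scad_deriv(omega, lam2, a=3.7):
    """First derivative p'_{lambda2}(omega) of the SCAD penalty, eq. (2.3),
    for omega > 0; a > 2 is the second tuning parameter."""
    omega = np.asarray(omega, dtype=float)
    inner = lam2 * (omega <= lam2)                               # flat, lasso-like rate
    outer = np.maximum(a * lam2 - omega, 0.0) / (a - 1.0) * (omega > lam2)
    return inner + outer                                         # tapers to 0 past a*lam2
```

For ω ≤ λ_{2} the derivative equals λ_{2} (lasso-like shrinkage); it decays linearly on (λ_{2}, *a*λ_{2}] and vanishes beyond *a*λ_{2}, which is what leaves large coefficients nearly unbiased.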

First we lay out regularity conditions on the *x*_{i}, the *t*_{i} and the function *f* that are needed for the asymptotic analysis.

Define **X** = (*x*_{1}, …, *x*_{n})^{T}, **Y** = (*y*_{1}, …, *y*_{n})^{T} and **f** = {*f*(*t*_{1}), …, *f*(*t*_{n})}^{T}. Then the double-penalized least squares (2.2) can be expressed in matrix form as

$${L}_{dp}(\mathit{\beta},\mathbf{f};\mathbf{Y})=\frac{1}{2}{(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathbf{f})}^{T}(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathbf{f})+\frac{n{\lambda}_{1}}{2}{\mathbf{f}}^{T}\mathbf{Kf}+n\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),$$

(2.4)

where **K** is the nonnegative definite smoothing matrix defined in Green and Silverman (1994). Given *λ*_{1}, *λ*_{2}, and **β**, the minimizer of (2.4) with respect to **f** is $\widehat{\mathbf{f}}=\mathbf{A}({\lambda}_{1})(\mathbf{Y}-\mathbf{X}\mathit{\beta})$, where $\mathbf{A}({\lambda}_{1})={(\mathbf{I}+n{\lambda}_{1}\mathbf{K})}^{-1}$ is the smoother matrix. Substituting $\widehat{\mathbf{f}}$ back into (2.4) yields the profiled objective

$$Q(\mathit{\beta})=\frac{1}{2}{(\mathbf{Y}-\mathbf{X}\mathit{\beta})}^{T}\{\mathbf{I}-\mathbf{A}({\lambda}_{1})\}(\mathbf{Y}-\mathbf{X}\mathit{\beta})+n\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ).$$

We call the quadratic term in *Q*(**β**) the profile least squares and denote it by *L*(**β**), so that *Q*(**β**) = *L*(**β**) + *n*∑_{j=1}^{d}*p*_{λ2}(|*β*_{j}|).
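To make the profiling step concrete, the following sketch computes the profile least squares term of *Q*(**β**), assuming the standard smoothing-spline form **A**(λ_{1}) = (**I** + *n*λ_{1}**K**)^{−1}; the matrix **K** is assumed supplied by the user (e.g., constructed as in Green and Silverman, 1994):

```python
import numpy as np

def profile_objective(Y, X, beta, K, lam1):
    """Profile least squares 0.5 * (Y - X beta)^T {I - A(lam1)} (Y - X beta),
    where A(lam1) = (I + n*lam1*K)^{-1} is the smoother obtained by
    minimizing (2.4) over f for fixed beta (a sketch under that assumption)."""
    n = len(Y)
    A = np.linalg.inv(np.eye(n) + n * lam1 * K)   # smoother matrix A(lam1)
    r = Y - X @ beta                              # partial residuals
    return 0.5 * r @ (np.eye(n) - A) @ r
```

Adding the SCAD term *n*∑_{j} *p*_{λ2}(|β_{j}|) to this quantity gives *Q*(**β**).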

In the following, we establish the asymptotic theory for our estimator in terms of both estimation and variable selection. Proofs of these results involve the second-order Taylor expansion of *p*_{λ2}(|*β*|), and we adapt the derivations of Fan and Li (2001) to our partially linear model context. Compared to the linear models studied in Fan and Li (2001), the major difficulty here is the appearance of the nonparametric component *f* in (2.1), which can affect the linear estimate $\widehat{\mathit{\beta}}$ through the smoother matrix **A**(*λ*_{1}).

**Lemma 1.** Let L′(**β**_{0}) and L″(**β**_{0}) be the gradient vector and Hessian matrix of L, respectively, evaluated at **β**_{0}. Assume that the **X**_{i} are independent and identically distributed with finite fourth moments. If λ_{1n} → 0 and
$n{\lambda}_{1n}^{1/4}\to \infty $ as n → ∞, then

- ${n}^{-1/2}{L}^{\prime}({\mathit{\beta}}_{0})\stackrel{d}{\to}N(\mathbf{0},{\sigma}^{2}\mathbf{R})$,
- ${n}^{-1}{L}^{\prime \prime}({\mathit{\beta}}_{0})\stackrel{p}{\to}\mathbf{R}$.

From Lemma 1, we have *n*^{−1/2}*L*′(**β**_{0}) = *O*_{p}(1) and *n*^{−1}*L*″(**β**_{0}) = **R** + *o*_{p}(1), which provide the key ingredients for the following results.

**Theorem 1.** As n → ∞, if *λ*_{1n} → 0,
$n{\lambda}_{1n}^{1/4}\to \infty $ and λ_{2n} → 0, then there exists a local minimizer $\widehat{\mathit{\beta}}$ of Q(**β**) such that $\parallel \widehat{\mathit{\beta}}-{\mathit{\beta}}_{0}\parallel ={O}_{p}({n}^{-1/2})$.

Theorem 1 says that if we choose proper sequences of *λ*_{1n} and *λ*_{2n}, then there exists a root-*n* consistent double-penalized least squares estimator of **β**.

**Lemma 2.** As n → ∞, if λ_{1n} → 0,
$n{\lambda}_{1n}^{1/4}\to \infty $, λ_{2n} → 0, and n^{1/2}λ_{2n} → ∞, then with probability tending to 1, for any **β**_{1} satisfying $\parallel {\mathit{\beta}}_{1}-{\mathit{\beta}}_{10}\parallel ={O}_{p}({n}^{-1/2})$ and any constant C,

$$Q\{({\mathit{\beta}}_{1},\mathbf{0})\}=\underset{\parallel {\mathit{\beta}}_{2}\parallel \le C{n}^{-1/2}}{min}Q\{({\mathit{\beta}}_{1},{\mathit{\beta}}_{2})\}.$$

**Theorem 2.** As n → ∞, if λ_{1n} → 0,
$n{\lambda}_{1n}^{1/4}\to \infty $, λ_{2n} → 0, and n^{1/2}λ_{2n} → ∞, then with probability tending to 1, the local minimizer
$\widehat{\mathit{\beta}}={({\widehat{\mathit{\beta}}}_{1}^{T},{\widehat{\mathit{\beta}}}_{2}^{T})}^{T}$ in Theorem 1 must satisfy:

- Sparsity: ${\widehat{\mathit{\beta}}}_{2}=\mathbf{0}$.
- Asymptotic normality: ${n}^{1/2}({\widehat{\mathit{\beta}}}_{1}-{\mathit{\beta}}_{10})\stackrel{d}{\to}N\{\mathbf{0},{\sigma}^{2}{\mathbf{R}}_{11}^{-1}\}$, where **R**_{11} is the q × q upper-left sub-matrix of **R**.

In this section, we study the sampling properties of the DPLSEs in the situation where the number of linear predictors tends to ∞ as the sample size *n* goes to ∞. Similar to Fan and Peng (2004), we show that under certain regularity conditions, the DPLSEs are
$\sqrt{n/{d}_{n}}$-consistent and also consistent in selecting important variables, where *d*_{n} is the dimension of **β**_{n}.

- (C1) The elements *β*_{n10,j} of **β**_{n10} satisfy
$$min\{\mid {\beta}_{n10,j}\mid ,1\le j\le {q}_{n}\}/{\lambda}_{2n}\to \infty .$$
- (C2) There exist constants *c*_{1} and *c*_{2} such that

$$0<{c}_{1}<{\mathrm{\Lambda}}_{min}(\mathbf{R})\le {\mathrm{\Lambda}}_{max}(\mathbf{R})<{c}_{2}<\infty ,\phantom{\rule{0.38889em}{0ex}}\forall n.$$

Both conditions above are adopted from Fan and Peng (2004), the first work to study the large-sample properties of nonconcave penalized estimators for linear models when the dimension of the data diverges with the sample size *n*. As pointed out by Fan and Peng (2004), condition (C1) gives the rate at which the penalized estimator can distinguish nonvanishing parameters from 0. Condition (C2) assumes that **R** is positive definite and that its eigenvalues are uniformly bounded.

**Theorem 3.** Under the conditions (C1) and (C2), as n → ∞, if λ_{1n} → 0,
$n{\lambda}_{1n}^{1/4}\to \infty $, λ_{2n} →0, and
${d}_{n}=o({n}^{1/2}\wedge n{\lambda}_{1n}^{1/4})$, then there exists a local minimizer ${\widehat{\mathit{\beta}}}_{n}$ of Q(**β**_{n}) such that
$\parallel {\widehat{\mathit{\beta}}}_{n}-{\mathit{\beta}}_{n0}\parallel \phantom{\rule{0.16667em}{0ex}}={O}_{p}(\sqrt{{d}_{n}/n})$.

Theorem 3 says that if we choose proper sequences of *λ*_{1n} and *λ*_{2n}, the DPLSE remains $\sqrt{n/{d}_{n}}$-consistent even when the dimension *d*_{n} diverges with *n* at the stated rate.

**Theorem 4.** Under the regularity conditions (C1) and (C2), as n → ∞, if λ_{1n} → 0,
$n{\lambda}_{1n}^{1/4}\to \infty $, λ_{2n} →0,
$\sqrt{n/{d}_{n}}{\lambda}_{2n}\to \infty $, and
${d}_{n}=o({n}^{1/2}\wedge n{\lambda}_{1n}^{1/4})$, then with probability tending to 1, the local minimizer
${\widehat{\mathit{\beta}}}_{n}={({\widehat{\mathit{\beta}}}_{n1}^{T},{\widehat{\mathit{\beta}}}_{n2}^{T})}^{T}$ in Theorem 3 must satisfy ${\widehat{\mathit{\beta}}}_{n2}=\mathbf{0}$.

We reformulate the DPLS into a linear mixed model (LMM) representation for ease of computation. The LMM allows us to treat the smoothing parameter as a variance component and provides a unified estimation and inferential framework. An iterative algorithm is then outlined.

Let **t** = (*t*_{1}, …, *t*_{n})^{T}, **f** = {*f*(*t*_{1}), …, *f*(*t*_{n})}^{T} and **ε** = (*ε*_{1}, …, *ε*_{n})^{T}. Then model (2.1) can be written in matrix notation as

$$\mathbf{Y}=\mathbf{X}\mathit{\beta}+\mathbf{f}+\mathit{\epsilon}.$$

(3.1)

If the *ε*_{i}'s were normally distributed, then minimizing (2.4) with respect to (**β**, **f**) would be equivalent to maximizing the double-penalized log-likelihood

$${\ell}_{dp}(\mathit{\beta},\mathbf{f};\mathbf{Y})=\ell (\mathit{\beta},\mathbf{f};\mathbf{Y})-\frac{n{\lambda}_{1}}{2{\sigma}^{2}}{\mathbf{f}}^{T}\mathbf{Kf}-\frac{n}{{\sigma}^{2}}\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),$$

(3.2)

where ℓ(**β**, **f**; **Y**) is the log-likelihood of model (3.1) under normality. Following Zhang et al. (1998), we reparameterize the nonparametric vector as **f** = **Tδ** + **Ba**, where **T** = [**1**, **t**], **δ** is a 2 × 1 coefficient vector, **B** = **L**(**L**^{T}**L**)^{−1} with **L** a full-rank matrix satisfying **K** = **LL**^{T}, and **a** is the corresponding vector of spline coefficients; then **f**^{T}**Kf** = **a**^{T}**a**, and (3.2) becomes

$$\begin{array}{l}{\ell}_{dp}(\mathit{\beta},\mathit{\delta},\mathbf{a};\mathbf{Y})=-\frac{n}{2}log{\sigma}^{2}-\frac{1}{2{\sigma}^{2}}{(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast}-\mathbf{Ba})}^{T}(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast}-\mathbf{Ba})\\ -\frac{n{\lambda}_{1}}{2{\sigma}^{2}}{\mathbf{a}}^{T}\mathbf{a}-\frac{n}{{\sigma}^{2}}\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),\end{array}$$

(3.3)

where **X**_{*} = [**T**, **X**] and **β**_{*} = (**δ**^{T}, **β**^{T})^{T}.

For fixed *β*_{*} (and given *λ*_{1},*λ*_{2},*σ*^{2}), (3.3) can be treated as the joint log-likelihood for the following linear mixed model (LMM) subject to the SCAD penalty on *β*

$$\mathbf{Y}={\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast}+\mathbf{Ba}+\mathit{\epsilon},$$

(3.4)

where **β**_{*} represents the fixed effects, and **a** are random effects with **a** ~ *N*(**0**, *τ***I**) and **ε** ~ *N*(**0**, *σ*^{2}**I**_{n}); the smoothing parameter *λ*_{1} enters through the variance component *τ*. Integrating out the random effects **a** yields, up to a constant, the marginal penalized log-likelihood

$${\ell}_{dp}({\mathit{\beta}}_{\ast};\mathbf{Y})=-\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast})}^{T}{\mathbf{V}}^{-1}(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast})-\frac{n}{{\sigma}^{2}}\sum _{j=1}^{d}{p}_{{\lambda}_{2}}(\mid {\beta}_{j}\mid ),$$

(3.5)

where **V** = *σ*^{2}**I**_{n} + *τ***BB**^{T} is the marginal covariance matrix of **Y** under the LMM (3.4).
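As a sketch of this marginal form, the code below builds **V** = σ²**I** + τ**BB**^{T} and evaluates the quadratic part of (3.5); the basis matrix **B** and all parameter values are placeholders here, since the actual **B** comes from the spline reparameterization of Zhang et al. (1998):

```python
import numpy as np

def marginal_quadratic(Y, Xstar, beta_star, B, tau, sigma2):
    """Quadratic part of the marginal penalized log-likelihood (3.5):
    -0.5 (Y - X* b*)^T V^{-1} (Y - X* b*), with V = sigma2*I + tau*B B^T."""
    n = len(Y)
    V = sigma2 * np.eye(n) + tau * B @ B.T        # marginal covariance of Y
    r = Y - Xstar @ beta_star
    return -0.5 * r @ np.linalg.solve(V, r)
```

Subtracting the SCAD term (*n*/σ²)∑_{j} *p*_{λ2}(|β_{j}|) from this quantity gives ℓ_{dp}(**β**_{*}; **Y**).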

The SCAD penalty function defined by (2.3) is not differentiable at the origin, causing difficulty in maximizing (3.5) with gradient based methods such as the Newton-Raphson. Following Fan and Li (2001, 2004), we use a local quadratic approximation (LQA) approach. Assuming ${\widehat{\mathit{\beta}}}_{\ast}^{0}$ is an initial value close to the maximizer of (3.5), we have the following local approximation:

$$[{p}_{{\lambda}_{2}}(\mid {\beta}_{\ast j}\mid ){]}^{\prime}={p}_{{\lambda}_{2}}^{\prime}(\mid {\beta}_{\ast j}\mid )\text{sign}({\beta}_{\ast j})\approx \frac{{p}_{{\lambda}_{2}}^{\prime}(\mid {\widehat{\beta}}_{\ast j}^{0}\mid )}{\mid {\widehat{\beta}}_{\ast j}^{0}\mid}{\beta}_{\ast j},\phantom{\rule{0.38889em}{0ex}}\text{for}\mid {\beta}_{\ast j}^{0}\mid \phantom{\rule{0.16667em}{0ex}}\ge \xi ,\phantom{\rule{0.38889em}{0ex}}j\ge 3,$$

where *ξ* is a pre-specified threshold.

Using Taylor expansions, we can approximate (3.5) by

$$\begin{array}{l}{\ell}_{dp}({\mathit{\beta}}_{\ast}\mid {\widehat{\mathit{\beta}}}_{\ast}^{0})\approx -\frac{1}{2}{(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast})}^{T}{\mathbf{V}}^{-1}(\mathbf{Y}-{\mathbf{X}}_{\ast}{\mathit{\beta}}_{\ast})-\frac{n}{2{\sigma}^{2}}{\mathit{\beta}}_{\ast}^{T}{\mathbf{\sum}}_{{\lambda}_{2}}({\widehat{\mathit{\beta}}}_{\ast}^{0}){\mathit{\beta}}_{\ast}\\ -\frac{n}{{\sigma}^{2}}\sum _{j=3}^{d+2}\left\{{p}_{{\lambda}_{2}}(\mid {\widehat{\beta}}_{\ast j}^{0}\mid )-\frac{1}{2}\frac{{p}_{{\lambda}_{2}}^{\prime}(\mid {\widehat{\beta}}_{\ast j}^{0}\mid )}{\mid {\widehat{\beta}}_{\ast j}^{0}\mid}{({\widehat{\beta}}_{\ast j}^{0})}^{2}\right\},\end{array}$$

(3.6)

where
${\mathbf{\sum}}_{{\lambda}_{2}}({\mathit{\beta}}_{\ast})=\text{diag}\{0,0,{p}_{{\lambda}_{2}}^{\prime}(\mid {\beta}_{1}\mid )/\mid {\beta}_{1}\mid ,\dots ,{p}_{{\lambda}_{2}}^{\prime}(\mid {\beta}_{d}\mid )/\mid {\beta}_{d}\mid \}$. For fixed **θ** = (*τ*, *σ*^{2})^{T}, maximizing (3.6) with respect to **β**_{*} yields the ridge-type solution

$${\widehat{\mathit{\beta}}}_{\ast}={\left\{{\mathbf{X}}_{\ast}^{T}{\mathbf{V}}^{-1}(\mathit{\theta}){\mathbf{X}}_{\ast}+n{\mathbf{\sum}}_{{\lambda}_{2}}({\widehat{\mathit{\beta}}}_{\ast}^{0})/{\sigma}^{2}\right\}}^{-1}{\mathbf{X}}_{\ast}^{T}{\mathbf{V}}^{-1}(\mathit{\theta})\mathbf{Y}.$$

(3.7)

It is easy to recognize that (3.7) is equivalent to an iterative ridge regression algorithm.
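A minimal sketch of this iterative ridge regression (assuming the first two columns of **X**_{*} are the unpenalized spline fixed effects, and holding **V** and σ² fixed) might look as follows; a production version would also drop zeroed-out columns from the design and update the variance components between iterations:

```python
import numpy as np

def scad_deriv(omega, lam2, a=3.7):
    # SCAD first derivative (2.3) for omega > 0
    return lam2 * (omega <= lam2) + np.maximum(a * lam2 - omega, 0.0) / (a - 1.0) * (omega > lam2)

def lqa_ridge(Y, Xstar, V, lam2, sigma2, max_iter=100, xi=1e-6, tol=1e-8):
    """Iterative ridge regression implementing (3.7) via local quadratic
    approximation: components 0-1 (delta) are unpenalized; the rest carry
    the SCAD penalty through the LQA weight matrix Sigma_{lam2}."""
    n, p = Xstar.shape
    Vi = np.linalg.inv(V)
    XtViX, XtViY = Xstar.T @ Vi @ Xstar, Xstar.T @ Vi @ Y
    beta = np.linalg.solve(XtViX + 1e-8 * np.eye(p), XtViY)  # unpenalized start
    for _ in range(max_iter):
        w = np.zeros(p)
        ab = np.abs(beta[2:])
        active = ab >= xi                    # coefficients below threshold xi are zeroed
        w[2:][active] = scad_deriv(ab[active], lam2) / ab[active]
        beta_new = np.linalg.solve(XtViX + n * np.diag(w) / sigma2, XtViY)
        beta_new[2:][~active] = 0.0
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

With λ_{2} = 0 the weight matrix vanishes and the iteration reduces to generalized least squares; as λ_{2} grows, small coefficients are shrunk and eventually set to exactly zero.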

We propose to alternately estimate (**β**_{*}, **a**) via the ridge-type solution (3.7) and update the variance components **θ** = (*τ*, *σ*^{2})^{T} by maximizing the likelihood of the working LMM (3.4), iterating until convergence.

Although the smoothing parameter *λ*_{1} (or equivalently *τ*) is readily estimated in the LMM framework, we still need to estimate the SCAD tuning parameters (*λ*_{2}, *a*). To find their optimal values, one common approach could be a two-dimensional grid search using some data-driven criteria, such as CV and GCV (Craven and Wahba, 1979), which can be rather computationally prohibitive. Fan and Li (2001) showed numerically that *a* = 3.7 minimizes the Bayesian risk and recommended its use in practice. Thus we set *a* = 3.7 and only tune *λ*_{2} in our implementation.

Many selection criteria, such as cross validation (CV), generalized cross validation (GCV), BIC and AIC, can be used for parameter tuning. Wang et al. (2007) suggested using the BIC for the SCAD estimator in linear models and partially linear models, and proved its model selection consistency, i.e. the optimal parameter chosen by BIC can identify the true model with probability tending to one. We also use the BIC to select the optimal *λ*_{2} from a gridded range, under a working normal distributional assumption for the *ε*_{i}.

Given *λ*_{2}, suppose *q* variables are selected by the algorithm in Section 3. Let **X**_{1} be the sub-matrix of **X** corresponding to the *q* important variables and **β**_{1} the corresponding *q* × 1 regression coefficient vector. Then we may use the estimation method of Zhang et al. (1998) to solve the partially linear model (2.1). Consequently **Ŷ** = **SY**, where **S** is a smoother matrix with *q*_{1} = trace(**S**). The BIC criterion is then computed as BIC(*λ*_{2}) = −2$\widehat{\ell}$ + *q*_{1} log *n*, where $\widehat{\ell}$ = −(*n*/2) log(2*π*${\widehat{\sigma}}^{2}$) − (**Y** − **X**_{1}${\widehat{\mathit{\beta}}}_{1}$ − $\widehat{\mathbf{f}}$)^{T}(**Y** − **X**_{1}${\widehat{\mathit{\beta}}}_{1}$ − $\widehat{\mathbf{f}}$)/(2${\widehat{\sigma}}^{2}$) is the maximized working log-likelihood.
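Under the working normal model, the BIC computation reduces to a few lines; the function below is a sketch with hypothetical argument names, where `q1` plays the role of trace(**S**):

```python
import numpy as np

def bic_lambda2(Y, fitted, q1, sigma2_hat):
    """BIC(lambda2) = -2*loglik + q1*log(n) under the working normal model,
    with loglik = -(n/2)*log(2*pi*sigma2_hat) - RSS/(2*sigma2_hat)."""
    n = len(Y)
    rss = np.sum((Y - fitted) ** 2)
    loglik = -(n / 2) * np.log(2 * np.pi * sigma2_hat) - rss / (2 * sigma2_hat)
    return -2 * loglik + q1 * np.log(n)
```

One would evaluate this score over a grid of λ_{2} values (refitting the model at each) and keep the minimizer.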

We derive frequentist and Bayesian covariance formulas for $\widehat{\mathit{\beta}}$ and $\widehat{\mathbf{f}}$, parallel to Sections 3.4 and 3.5 in Zhang et al. (1998), except that we also take into account the bias introduced by the penalty imposed for variable selection. Using these covariance estimates, we are able to construct confidence intervals for the regression coefficients and the nonparametric function. The proposed covariance estimates are evaluated via simulation in Section 5.

From the frequentist point of view, cov(**Y**|*t*, **x**) = *σ*^{2}**I**, and at convergence of (3.7) we can write ${\widehat{\mathit{\beta}}}_{\ast}={({\widehat{\mathit{\delta}}}^{T},{\widehat{\mathit{\beta}}}^{T})}^{T}={({\mathbf{Q}}_{1}^{T},{\mathbf{Q}}_{2}^{T})}^{T}\mathbf{Y}$, so that $\widehat{\mathit{\delta}}={\mathbf{Q}}_{1}\mathbf{Y}$ and $\widehat{\mathit{\beta}}={\mathbf{Q}}_{2}\mathbf{Y}$. Therefore

$${\widehat{cov}}_{F}(\widehat{\mathit{\beta}}\mid t,\mathbf{x})={\mathbf{Q}}_{2}cov(\mathbf{Y}){\mathbf{Q}}_{2}^{T}={\widehat{\sigma}}^{2}{\mathbf{Q}}_{2}{\mathbf{Q}}_{2}^{T},$$

(4.1)

where ^{2} is the estimated error variance. It is easy to show that the empirical BLUP estimate of **a** is **â** = **Ã**(**Y** − **X**_{*}*β*_{*}) = **S**_{a}**Y**, where **S*** _{a}*=

$${\widehat{cov}}_{F}(\widehat{\mathbf{f}}\mid t,\mathbf{x})={\widehat{\sigma}}^{2}(\mathbf{T}{\mathbf{Q}}_{1}+\mathbf{B}{\mathbf{S}}_{a}){(\mathbf{T}{\mathbf{Q}}_{1}+\mathbf{B}{\mathbf{S}}_{a})}^{T}.$$

(4.2)
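The frequentist formulas are easy to evaluate once the estimator is written as a linear map of **Y**. The sketch below forms **Q** from the ridge-type solution (3.7), holding the LQA penalty matrix fixed at its converged value (an assumption of the sketch), and returns σ̂²**QQ**^{T} as in (4.1):

```python
import numpy as np

def freq_cov_beta(Xstar, V, Sigma, sigma2, n):
    """Frequentist covariance (4.1): writing beta*_hat = Q Y with
    Q = {X*^T V^{-1} X* + n*Sigma/sigma2}^{-1} X*^T V^{-1} (from (3.7),
    Sigma fixed at its converged LQA value), and cov(Y) = sigma2*I,
    we get cov(beta*_hat) = sigma2 * Q Q^T."""
    Vi = np.linalg.inv(V)
    Q = np.linalg.solve(Xstar.T @ Vi @ Xstar + n * Sigma / sigma2, Xstar.T @ Vi)
    return sigma2 * Q @ Q.T
```

With **Σ** = **0** and **V** = σ²**I** this reduces to the usual least squares covariance σ²(**X**_{*}^{T}**X**_{*})^{−1}, a useful sanity check.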

The LMM representation in Section 3.1 and (3.3) suggests a prior for *f*(*t*) of the form **f** = **Tδ** + **Ba**, with a flat prior on **δ** and **a** ~ *N*(**0**, *τ***I**). Under this Bayesian formulation, the posterior covariances of the estimates are

$${cov}_{B}(\widehat{\mathit{\beta}})=cov{\{{\widehat{\mathit{\beta}}}_{1}^{T},{({\widehat{\mathit{\beta}}}_{2}-{\mathit{\beta}}_{2})}^{T}\}}^{T},$$

(4.3)

$${cov}_{B}(\widehat{\mathbf{f}})=[\mathbf{T},\mathbf{B}]cov{\{{\widehat{\mathit{\delta}}}^{T},{(\widehat{\mathbf{a}}-\mathbf{a})}^{T}\}}^{T}{[\mathbf{T},\mathbf{B}]}^{T}.$$

(4.4)

These Bayesian variance estimates can be viewed as accounting for the bias in $\widehat{\mathit{\beta}}$ and $\widehat{\mathbf{f}}$ due to the imposed penalties (Wahba, 1983).

We conduct Monte Carlo simulation studies to evaluate the finite-sample performance of the proposed DPLS method in terms of both model estimation and variable selection. Furthermore, we compare our procedure with the SCAD and LASSO methods proposed by Fan and Li (2004). In the following, these three methods are referred to as “DPLSE”, “SCAD” and “LASSO”, respectively. When implementing Fan and Li (2004), we adopt their approach to choosing the kernel bandwidth: first compute the difference-based estimator (DBE) for **β** and then select the bandwidth using the plug-in method of Ruppert et al. (1995). To select the SCAD and LASSO tuning parameters, we tried both BIC and GCV and found that BIC generally gave better performance, so BIC was used for tuning in the SCAD and LASSO.

We simulate data from the partially linear model *y* = **x**^{T}**β** + *f*(*t*) + *ε*.

As in Fan and Li (2004), we use the mean squared error (MSE) of $\widehat{\mathit{\beta}}$ and $\widehat{f}$ to evaluate the goodness of fit of the parametric and nonparametric estimates, respectively. For the parametric part, MSE($\widehat{\mathit{\beta}}$) = *E*(‖$\widehat{\mathit{\beta}}$ − **β**‖^{2}); the MSE for $\widehat{f}$ is defined analogously through the averaged squared errors at the design points.

Table 5.1 compares three variable selection procedures when *σ*^{2} = 1. The DPLSE outperforms other methods in terms of both estimation and variable selection in all scenarios, and SCAD performs better than LASSO. Overall, the DPLSE achieves a sparser model, with both “Corr.” and “Inc.” closer to the oracle (5 & 0 respectively). In our implementation for the SCAD and LASSO, the bandwidth selected using the plug-in method occasionally caused numerical problems and failed to converge. Therefore, the results of SCAD and LASSO are only based on converged cases.

Table 5.2 presents the results for a high-variance case, *σ*^{2} = 9. We notice that, as *σ*^{2} increases from 1 to 9, although there is a substantial increase in the MSEs, the DPLSE still maintains very good performance in model selection. The MSEs of the DPLSE are consistently smaller than those of SCAD and LASSO (not reported here to save space). Incorrect zero coefficients occur only rarely for *n* = 100 and never for *n* = 200.

Table 5.3 presents the point estimates, relative biases, empirical standard errors, and model-based frequentist and Bayesian standard errors of the estimates. To save space, we only report the point estimation results for the parameters that are truly nonzero. The point estimate is the MC sample average and the empirical standard error is the MC standard deviation. Relative bias is the ratio of the bias to the true value.

We report the results in four scenarios with varying *n*, *σ*^{2} and *f*(*t*); the results in other scenarios are similar and hence omitted. We observe that $\widehat{\mathit{\beta}}$ is roughly unbiased in all scenarios. Both Bayesian and frequentist SEs of the ${\widehat{\beta}}_{j}$'s obtained from (4.1) and (4.3) agree well with the empirical SEs; all SEs decrease as *n* increases.

In Figure 5.1 we plot the pointwise estimates and biases for estimating *f*_{1}(*t*) and *f*_{2}(*t*) when *n* = 200 and *σ*^{2} = 1 for all three methods.

Plots of $\widehat{f}(t)$ and pointwise biases (SCAD and LASSO are based on converged MC samples). Plots (a) and (b) are for *f*_{1}; plots (c) and (d) are for *f*_{2}. Here *n* = 200 and *σ*^{2} = 1. The horizontal axis is *t* in all four plots.

In plots (a) and (c), the averaged fitted curves are almost indistinguishable from the true nonparametric functions, indicating small biases in $\widehat{f}(t)$ for all three methods. Pointwise biases are magnified in plots (b) and (d), which show that the DPLSE overall has smaller bias than the other two methods. The SCAD and LASSO fits have slightly larger and rougher pointwise biases, which indicates under-smoothing due to a small bandwidth selected by the plug-in method. Our method is more advantageous in that it automatically estimates the smoothing parameter and controls the amount of smoothing more appropriately by treating *τ* = 1/(*nλ*_{1}) as a variance component.

Figure 5.2 depicts the pointwise standard errors and pointwise coverage probabilities of confidence intervals given by the covariance formulas (4.2) and (4.4). Here *n* = 200 and *σ*^{2} = 1; (a) and (b) are for *f*_{1} with *t*_{6} errors, and (c) and (d) are for *f*_{2} with mixture normal errors.

Plots of pointwise frequentist and Bayesian standard errors and coverage probability rates. Plots (a) and (b) are for *f*_{1}(*t*); plots (c) and (d) are for *f*_{2}(*t*). The horizontal axis is *t* in all plots.

We note that the frequentist pointwise SEs interlace with the empirical SEs, whereas the Bayesian pointwise SEs are a little larger than the frequentist counterparts. Accordingly, as shown in plots (b) and (d), the pointwise coverage probability rates for frequentist confidence intervals are around the nominal level, whereas most of the Bayesian coverage probabilities are higher than 95%.

We apply the proposed DPLS method to the *Ragweed Pollen Level* data analyzed in Ruppert et al. (2003). The data were collected in Kalamazoo, Michigan during the 1993 ragweed season and consist of 87 daily observations of the ragweed pollen level and relevant covariates. The main interest is to develop accurate models to forecast the daily ragweed pollen level. The raw response *ragweed* is the daily ragweed pollen level (grains/*m*^{3}). Among the explanatory variables, *x*_{1} is an indicator of significant rain (*x*_{1} = 1 if there are at least 3 hours of steady rain or brief but intense rain, and *x*_{1} = 0 otherwise); *x*_{2} is temperature (°*F*); and *x*_{3} is wind speed (knots). The *x*-covariates are standardized first. Since the raw response is rather skewed, Ruppert et al. (2003) suggested a square root transformation
$y=\sqrt{\mathit{ragweed}}$. Marginal plots suggest a strong nonlinear relationship between *y* and the day number in the current ragweed pollen season. Consequently, a semiparametric regression model with a nonparametric baseline *f*(*day*) is reasonable. Ruppert et al. (2003) fitted a semiparametric model with *x*_{1}, *x*_{2} and *x*_{3}, whereas we add quadratic and interaction terms and consider the more complex model:

$$y=f(\mathit{day})+{\beta}_{1}{x}_{1}+{\beta}_{2}{x}_{2}+{\beta}_{3}{x}_{3}+{\beta}_{22}{x}_{2}^{2}+{\beta}_{33}{x}_{3}^{2}+{\beta}_{12}{x}_{1}{x}_{2}+{\beta}_{13}{x}_{1}{x}_{3}+{\beta}_{23}{x}_{2}{x}_{3}+\epsilon .$$

The tuning parameter selected by BIC is *λ*_{2} = 0.177. Table 6.1 gives the DPLSE for the regression coefficients and their corresponding frequentist and Bayesian standard errors.

For comparison, we also fitted the full model via a traditional partial smoothing spline with only the roughness penalty on *f*. Table 6.1 shows that the final fitted model is *ŷ* = $\widehat{f}$(*day*) + ${\widehat{\beta}}_{1}$*x*_{1} + ${\widehat{\beta}}_{2}$*x*_{2} + ${\widehat{\beta}}_{3}$*x*_{3}, indicating that the linear main-effects model suffices. All the estimated coefficients are positive, suggesting that the ragweed pollen level increases as each of the covariates increases. The shrinkage estimates have relatively smaller standard errors than those under the full model. Figure 6.1 depicts the estimated nonparametric function $\widehat{f}$(*day*) and its frequentist and Bayesian 95% pointwise confidence intervals. The plot indicates that the baseline *f*(*day*) climbs rapidly to its peak around day 25, plunges until day 60, and decreases steadily thereafter.

We propose a new regularization method for simultaneous variable selection and model estimation in partially linear models via double-penalized least squares. Under certain regularity conditions, the DPLSE $\widehat{\mathit{\beta}}$ is root-*n* consistent and has the oracle property. To facilitate computation, we reformulate the problem into a linear mixed model (LMM) framework, which allows us to estimate the smoothing parameter *λ*_{1} as an additional variance component instead of conducting the conventional two-dimensional grid search together with the other tuning parameter *λ*_{2}. Another advantage of the LMM representation is that standard software can be used to implement the DPLS. Simulation studies show that the new method works effectively in terms of both variable selection and model estimation. We have derived both frequentist and Bayesian covariance formulas for the DPLSEs, and empirical results favor the frequentist SE formulas for *f*(*t*). Furthermore, our empirical results suggest that the DPLSE is robust to the distributional assumption of the errors, giving strong support for its application in general situations.

In this paper, we have studied the large sample properties of the new estimators when the dimension *d* satisfies: (i) *d* is fixed, or (ii) *d*_{n} → ∞ as *n* → ∞ at the rate ${d}_{n}=o({n}^{1/2}\wedge n{\lambda}_{1n}^{1/4})$ given in Theorem 3.

The proposed DPLS method assumes that the errors are uncorrelated. In future research, we will generalize it to model selection for correlated data such as longitudinal data. Another interesting problem is model selection for generalized semiparametric models, e.g. *E*(*Y*) = *g*{*Xβ* + *f*(*t*)}, where *g* is a link function. In that case we will consider the double-penalized likelihood and investigate asymptotic properties for the resulting estimators.

The authors thank the editor, the associate editor, and two reviewers for their constructive comments and suggestions. The research of Hao Zhang was supported in part by National Science Foundation grant DMS-0645293 and by National Institutes of Health grant R01 CA085848-08. The research of Daowen Zhang was supported by National Institutes of Health grant R01 CA85848-08.

*Proof of Lemma 1*. Differentiating the profile least squares *L*(**β**) with respect to **β**, we obtain

$$-{L}^{\prime}({\mathit{\beta}}_{0})={\mathbf{X}}^{T}\{\mathbf{I}-\mathbf{A}({\lambda}_{1})\}(\mathbf{Y}-\mathbf{X}{\mathit{\beta}}_{0}),$$

(A1)

$${L}^{\prime \prime}({\mathit{\beta}}_{0})={\mathbf{X}}^{T}\{\mathbf{I}-\mathbf{A}({\lambda}_{1})\}\mathbf{X}.$$

(A2)

For the partially linear model, we have **Y** − **Xβ**_{0} = **f** + **ε**, and hence

$$\begin{array}{l}-{n}^{-1/2}{L}^{\prime}({\mathit{\beta}}_{0})={n}^{-1/2}{\mathbf{X}}^{T}\{\mathbf{I}-\mathbf{A}({\lambda}_{1})\}(\mathbf{f}+\mathit{\epsilon})\\ ={n}^{-1/2}{\mathbf{X}}^{T}[\{\mathbf{I}-\mathbf{A}({\lambda}_{1})\}\mathbf{f}+\mathit{\epsilon}]-{n}^{-1/2}{\mathbf{X}}^{T}\mathbf{A}({\lambda}_{1})\mathit{\epsilon}.\end{array}$$

(A3)

Now, the proof of Theorem 1 in Heckman (1986) and its four propositions can be used. Under regularity conditions, we have that if *λ*_{1}* _{n}* → 0 and
$n{\lambda}_{1n}^{1/4}\to \infty $, then

$${n}^{-1/2}{\mathbf{X}}^{T}[\{\mathbf{I}-\mathbf{A}({\lambda}_{1})\}\mathbf{f}+\mathit{\epsilon}]\stackrel{d}{\to}N(\mathbf{0},{\sigma}^{2}\mathbf{R}),$$

(A4)

$${n}^{-1/2}{\mathbf{X}}^{T}\mathbf{A}({\lambda}_{1})\mathit{\epsilon}\stackrel{p}{\to}\mathbf{0}.$$

(A5)

Parts (a) and (b) are obtained by applying Slutsky’s theorem to (A1) and (A2).

To prove Theorems 3 and 4, we need the following lemma. Its proof can be derived in a similar fashion to Lemma 1 above and Theorem 1 of Heckman (1986). To save space, we only state the results below and omit the proof. For any vector **v**, we use [**v**]_{i} to denote its *i*th element.

**Lemma 3.** Under the regularity conditions (C1) and (C2), if *λ*_{1n} → 0, then

- ${[{L}^{\prime}({\mathit{\beta}}_{n0})]}_{i}={O}_{p}({n}^{1/2})$,
- ${[{L}^{\prime \prime}({\mathit{\beta}}_{n0})]}_{ij}=n{R}_{ij}+{O}_{p}({n}^{1/2}\vee {\lambda}_{1n}^{-1/4})$.

Let
${c}_{n}=\sqrt{{d}_{n}/n}$. We need to show that for any given *ε* > 0, there exists a large constant *C* such that

$$P\{\underset{\parallel \mathit{r}\parallel \ge C}{inf}Q({\mathit{\beta}}_{n0}+{c}_{n}\mathit{r})>Q({\mathit{\beta}}_{n0})\}\ge 1-\epsilon .$$

(A6)

Let ${\mathrm{\Delta}}_{n}(\mathit{r})=Q({\mathit{\beta}}_{n0}+{c}_{n}\mathit{r})-Q({\mathit{\beta}}_{n0})$. Since ${p}_{{\lambda}_{2n}}(0)=0$ and the penalty function is nonnegative, dropping the penalty terms for $j>{q}_{n}$ gives

$$\begin{array}{l}{\mathrm{\Delta}}_{n}(\mathit{r})\ge L({\mathit{\beta}}_{n0}+{c}_{n}\mathit{r})-L({\mathit{\beta}}_{n0})+n\sum _{j=1}^{{q}_{n}}\{{p}_{{\lambda}_{2n}}(\mid {\beta}_{n10,j}+{c}_{n}{r}_{j}\mid )-{p}_{{\lambda}_{2n}}(\mid {\beta}_{n10,j}\mid )\}\\ \ge {c}_{n}{\mathit{r}}^{T}{L}^{\prime}({\mathit{\beta}}_{n0})+\frac{1}{2}{c}_{n}^{2}{\mathit{r}}^{T}{L}^{\u2033}({\mathit{\beta}}_{n0})\mathit{r}+\sum _{j=1}^{{q}_{n}}[n{c}_{n}{p}_{{\lambda}_{2n}}^{\prime}(\mid {\beta}_{n10,j}\mid )\text{sign}({\beta}_{n10,j}){r}_{j}]\\ +\sum _{j=1}^{{q}_{n}}[n{c}_{n}^{2}{{p}^{\u2033}}_{{\lambda}_{2n}}(\mid {\beta}_{n10,j}\mid ){r}_{j}^{2}\{1+o(1)\}]\\ \equiv {I}_{1}+{I}_{2}+{I}_{3}+{I}_{4}.\end{array}$$

By Lemma 3(a), we have

$$\mid {I}_{1}\mid \phantom{\rule{0.16667em}{0ex}}=\phantom{\rule{0.16667em}{0ex}}\mid {c}_{n}{\mathit{r}}^{T}{L}^{\prime}({\mathit{\beta}}_{n0})\mid \phantom{\rule{0.16667em}{0ex}}\le {c}_{n}\parallel {L}^{\prime}({\mathit{\beta}}_{n0})\parallel \phantom{\rule{0.16667em}{0ex}}\parallel \mathit{r}\parallel \phantom{\rule{0.16667em}{0ex}}={O}_{p}({c}_{n}\sqrt{n{d}_{n}})\parallel \mathit{r}\parallel ={O}_{p}(n{c}_{n}^{2})\parallel \mathit{r}\parallel .$$

By Lemma 3(b), under the regularity condition (C2), we have

$$\begin{array}{l}{I}_{2}=\frac{1}{2}{c}_{n}^{2}{\mathit{r}}^{T}{L}^{\u2033}({\mathit{\beta}}_{n0})\mathit{r}=\frac{1}{2}n{c}_{n}^{2}\{{\mathit{r}}^{T}\mathbf{R}\mathit{r}+{O}_{p}({d}_{n}{n}^{-1/2}\vee {d}_{n}{n}^{-1}{\lambda}_{1n}^{-1/4})\}\\ =\frac{1}{2}n{c}_{n}^{2}\{{\mathit{r}}^{T}\mathbf{R}\mathit{r}+{o}_{p}(1){\parallel \mathit{r}\parallel}^{2}\},\end{array}$$

where the last equality follows from the dimension condition
${d}_{n}=o({n}^{1/2}\wedge n{\lambda}_{1n}^{1/4})$. With regard to *I*_{3} and *I*_{4}, we have

$$\mid {I}_{3}\mid \phantom{\rule{0.16667em}{0ex}}\le \sum _{j=1}^{{q}_{n}}\mid n{c}_{n}{p}_{{\lambda}_{2n}}^{\prime}(\mid {\beta}_{n10,j}\mid )\text{sign}({\beta}_{n10,j}){r}_{j}\mid \le n{c}_{n}^{2}\parallel \mathit{r}\parallel ,$$

and

$$\mid {I}_{4}\mid \phantom{\rule{0.16667em}{0ex}}=\sum _{j=1}^{{q}_{n}}n{c}_{n}^{2}{{p}^{\u2033}}_{{\lambda}_{2n}}(\mid {\beta}_{n10,j}\mid ){r}_{j}^{2}\{1+o(1)\}\le 2\cdot \underset{1\le j\le {q}_{n}}{max}{p}_{{\lambda}_{2n}}^{\u2033}(\mid {\beta}_{n10,j}\mid )\cdot n{c}_{n}^{2}{\parallel \mathit{r}\parallel}^{2}.$$

Under the condition (C1),
${max}_{1\le j\le {q}_{n}}{p}_{{\lambda}_{2n}}^{\prime}(\mid {\beta}_{n10,j}\mid )=0$ and
${max}_{1\le j\le {q}_{n}}{p}_{{\lambda}_{2n}}^{\u2033}(\mid {\beta}_{n10,j}\mid )=0$ when *n* is large enough and *λ*_{2*n*} → 0. So both *I*_{3} and *I*_{4} vanish for large *n*. Since **R** is positive definite, choosing a sufficiently large *C* makes *I*_{2} dominate *I*_{1} uniformly on ‖**r**‖ = *C*, and (A6) follows.
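The vanishing of the penalty derivatives for large coefficients is the defining property of the SCAD penalty of Fan and Li (2001): *p*′_{λ}(θ) is identically zero for θ ≥ *a*λ (with *a* = 3.7), so coefficients bounded away from zero incur no penalty once λ is small. A quick numerical check (the value `beta_min = 0.5` is illustrative):

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lambda(theta) of the SCAD penalty (Fan & Li, 2001), theta >= 0."""
    theta = np.asarray(theta, dtype=float)
    return lam * ((theta <= lam) +
                  np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam) * (theta > lam))

beta_min = 0.5                      # smallest nonzero true coefficient (illustrative)
# once a * lam < beta_min, the penalty derivative at beta_min is exactly zero
assert scad_deriv(beta_min, 0.05) == 0.0   # small lambda: no penalty on beta_min
assert scad_deriv(beta_min, 0.2) > 0.0     # larger lambda: beta_min still penalized
```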

Let
${\gamma}_{n}=C\sqrt{{d}_{n}/n}$. It suffices to show that, as *n* → ∞, with probability tending to 1, for any *β*_{n}_{1} satisfying
$\parallel {\mathit{\beta}}_{n1}-{\mathit{\beta}}_{n10}\parallel =O(\sqrt{{d}_{n}/n})$ and for *j* = *q _{n}* + 1, …, *d _{n}*,

$$\frac{\partial Q(\mathit{\beta})}{\partial {\beta}_{nj}}<0\phantom{\rule{0.38889em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}{\beta}_{nj}\in (-{\gamma}_{n},0),$$

(A7)

$$>0\phantom{\rule{0.38889em}{0ex}}\text{for}\phantom{\rule{0.16667em}{0ex}}{\beta}_{nj}\in (0,{\gamma}_{n}).$$

(A8)

By Taylor expansion and the fact that *L*(*β** _{n}*) is quadratic in *β** _{n}*, we have

$$\begin{array}{l}\frac{\partial Q({\mathit{\beta}}_{n})}{\partial {\beta}_{nj}}=\frac{\partial L({\mathit{\beta}}_{n})}{\partial {\beta}_{nj}}+n{p}_{{\lambda}_{2n}}^{\prime}(\mid {\beta}_{nj}\mid )\text{sign}({\beta}_{nj})\\ =\frac{\partial L({\mathit{\beta}}_{n0})}{\partial {\beta}_{nj}}+\sum _{k=1}^{{d}_{n}}\frac{{\partial}^{2}L({\mathit{\beta}}_{n0})}{\partial {\beta}_{nj}\partial {\beta}_{nk}}({\beta}_{nk}-{\beta}_{nk0})+n{p}_{{\lambda}_{2n}}^{\prime}(\mid {\beta}_{nj}\mid )\text{sign}({\beta}_{nj})\\ \equiv {J}_{1}+{J}_{2}+{J}_{3}.\end{array}$$

By Lemma 3 and the regularity conditions (C1) and (C2), we have

$${J}_{1}={O}_{p}({n}^{1/2})={O}_{p}(\sqrt{n{d}_{n}}),\phantom{\rule{0.38889em}{0ex}}{J}_{2}={O}_{p}(\sqrt{n{d}_{n}}),$$

so ${J}_{1}+{J}_{2}={O}_{p}(\sqrt{n{d}_{n}})$. Since $\sqrt{{d}_{n}/n}/{\lambda}_{2n}\to 0$, we can write

$$\frac{\partial Q({\mathit{\beta}}_{n})}{\partial {\beta}_{nj}}=n{\lambda}_{2n}\left\{\frac{{p}_{{\lambda}_{2n}}^{\prime}(\mid {\beta}_{nj}\mid )}{{\lambda}_{2n}}\text{sign}({\beta}_{nj})+{O}_{p}\left(\sqrt{{d}_{n}/n}/{\lambda}_{2n}\right)\right\},$$

we can see that, for *n* large, the sign of
${\scriptstyle \frac{\partial Q({\mathit{\beta}}_{n})}{\partial {\beta}_{nj}}}$ is completely determined by the sign of *β _{nj}*. Therefore, (A7) and (A8) hold for *j* = *q _{n}* + 1, …, *d _{n}*, which completes the proof.
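The mechanism behind (A7) and (A8) is easiest to see in one dimension: when the penalty level dominates the *O _{p}*(√(*d _{n}*/*n*)) stochastic term, the derivative of the penalized criterion takes the sign of the coefficient itself, so the minimizer sits exactly at zero. A one-dimensional sketch (using the lasso penalty for simplicity; SCAD behaves the same near zero since *p*′_{λ}(0+) = λ; the values of `z`, `lam`, `gamma` are illustrative):

```python
import numpy as np

# One-dimensional analogue of (A7)-(A8): Q(b) = 0.5*(b - z)**2 + lam*|b|.
# Here z plays the role of the O_p(sqrt(d_n/n)) term; lam >> |z| mirrors
# the condition sqrt(d_n/n)/lambda_2n -> 0.
def dQ(b, z, lam):
    """Derivative of the penalized criterion away from b = 0."""
    return (b - z) + lam * np.sign(b)

z, lam, gamma = 0.1, 0.5, 0.3
for b in np.linspace(-gamma, -1e-6, 25):
    assert dQ(b, z, lam) < 0         # Q decreasing to the left of zero
for b in np.linspace(1e-6, gamma, 25):
    assert dQ(b, z, lam) > 0         # Q increasing to the right of zero
# Hence Q attains its minimum over (-gamma, gamma) at b = 0: sparsity.
```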

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

- Akaike H. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika. 1973;60:255–265.
- Breiman L. Better subset selection using the nonnegative garrote. Technometrics. 1995;37:373–384.
- Bunea F. Consistent covariate selection and post model selection inference in semiparametric regression. Annals of Statistics. 2004;32:898–927.
- Bunea F, Wegkamp M. Two-stage model detection procedures in partially linear regression. The Canadian Journal of Statistics. 2004;32:105–118.
- Craven P, Wahba G. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–451.
- Engle R, Granger C, Rice J, Weiss A. Nonparametric estimates of the relation between weather and electricity sales. Journal of American Statistical Association. 1986;81:310–320.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. New estimation and model selection procedures for semi-parametric modeling in longitudinal data analysis. Journal of American Statistical Association. 2004;99:710–723.
- Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics. 2004;32:928–961.
- Green PJ. Penalized likelihood for general semi-parametric regression models. Int Statist Rev. 1987;55:245–260.
- Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. Chapman and Hall; London: 1994.
- Gu C. Smoothing Spline ANOVA Models. Springer; New York: 2002.
- Hastie T, Tibshirani RJ. Generalized Additive Models. Chapman and Hall; London: 1990.
- Heckman NE. Spline smoothing in a partly linear model. J R Statist Soc B. 1986;48:244–248.
- Liang H. Estimation in partially linear models and numerical comparisons. Computational Statistics and Data Analysis. 2006;50:675–687.
- Linhart H, Zucchini W. Model Selection. New York: Wiley; 1986.
- Mallows C. Some comments on *C _{p}*. Technometrics. 1973;15:661–675.
- Meier L, Van de Geer S, Bühlmann P. High-dimensional additive modeling. 2008. http://arxiv.org/abs/0806.4115v1.
- Miller AJ. Subset Selection in Regression. Chapman and Hall; London: 2002.
- Ravikumar P, Liu H, Lafferty J, Wasserman L. SpAM: Sparse additive models. Advances in Neural Information Processing Systems. 2008;20:1202–1208.
- Ruppert D, Sheather S, Wand M. An effective bandwidth selector for local least squares regression. Journal of American Statistical Association. 1995;90:1257–1270.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; Cambridge, New York: 2003.
- Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Speckman P. Kernel smoothing in partial linear models. J R Statist Soc B. 1988;50:413–436.
- Speed T. Discussion of ‘blup is a good thing: the estimation of random effects’ by G. K. Robinson. Statistical Science. 1991;6:15–51.
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Statist Soc B. 1996;58:267–288.
- Wahba G. Bayesian ‘confidence intervals’ for the cross-validated smoothing spline. J R Statist Soc B. 1983;45:133–150.
- Wahba G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics. 1985;13:1378–1402.
- Wahba G. Spline Models for Observational Data. Society for Industrial and Applied Mathematics; Philadelphia: 1990.
- Wang H, Li R, Tsai CL. Tuning parameter selector for SCAD. Biometrika. 2007;94:553–568.
- Zhang D, Lin X, Raz J, Sowers M. Semiparametric stochastic mixed models for longitudinal data. Journal of American Statistical Association. 1998;93:710–719.
- Zhang H, Lu W. Adaptive lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
- Zou H. The adaptive lasso and its oracle properties. Journal of American Statistical Association. 2006;101:1418–1429.
