Ann Stat. Author manuscript; available in PMC 2011 January 1.

Published in final edited form as: Ann Stat. 2010 August 1; 38(4): 2092–2117. doi: 10.1214/09-AOS780. PMCID: PMC2928655. NIHMSID: NIHMS175610.

Pang Du, Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA. Email: pangdu@vt.edu


We study the Cox models with semiparametric relative risk, which can be partially linear with one nonparametric component, or multiple additive or nonadditive nonparametric components. A penalized partial likelihood procedure is proposed to simultaneously estimate the parameters and select variables for both the parametric and the nonparametric parts. Two penalties are applied sequentially. The first penalty, governing the smoothness of the multivariate nonlinear covariate effect function, provides a smoothing spline ANOVA framework that is exploited to derive an empirical model selection tool for the nonparametric part. The second penalty, either the smoothly-clipped-absolute-deviation (SCAD) penalty or the adaptive LASSO penalty, achieves variable selection in the parametric part. We show that the resulting estimator of the parametric part possesses the oracle property, and that the estimator of the nonparametric part achieves the optimal rate of convergence. The proposed procedures are shown to work well in simulation experiments, and then applied to a real data example on sexually transmitted diseases.

In survival analysis, a problem of interest is to identify relevant risk factors and evaluate their contributions to survival time. The Cox proportional hazards (PH) model is a popular approach to studying the influence of covariates on survival outcomes. Conventional PH models assume that covariates have a log-linear effect on the hazard function; these models have been studied by numerous authors (see, e.g., the references in [15]). The log-linear assumption can be too rigid in practice, especially when continuous covariates are present. This limitation motivates PH models with nonparametric relative risk; some examples are [6, 11, 12, 23, 35]. However, nonparametric models may suffer from the curse of dimensionality, and they lack the easy interpretation of parametric risk models. PH models with semiparametric relative risk strike a good balance by allowing nonparametric risk for some covariates and parametric risk for others. The benefits of such models are twofold. First, they retain the merits of models with parametric risk, including easy interpretation, estimation and inference. Second, their nonparametric part allows a flexible form for continuous covariates whose patterns are unexplored and whose contributions cannot be assessed by simple parametric models. For example, [10] proposed efficient estimation for a partially linear Cox model with additive nonlinear covariate effects. [3] studied partially linear hazard regression for multivariate survival data with time-dependent covariates via a profile pseudo-partial likelihood approach, where the only nonlinear covariate effect was estimated by local polynomials. But these models are limited to one nonparametric component or additive nonparametric components, ignoring possible interactions between different nonparametric components.
[30] proposed a partially linear additive hazard model whose nonlinear varying coefficients represent the interaction between the time-dependent nonlinear covariate and other covariates.

Variable selection in survival data has drawn much attention in the past decade. Traditional procedures such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), as noted by [2], lack stability and fail to account for the stochastic errors inherited from the variable selection stage. [26] and [32] extended, respectively, the LASSO and the adaptive LASSO variable selection procedures to the Cox model. [8] extended the nonconcave penalized likelihood approach ([7]) to Cox PH models. [4] studied variable selection for multivariate survival data. The Cox models considered in these three papers all assumed a linear form of covariate effects in the relative risk. More recently, [13] and [14] proposed procedures for selecting variables in semiparametric linear regression models for censored data, where the dependence of the response on the covariates was also assumed to be linear. Hence, the aforementioned variable selection procedures are limited by their rigid assumption of parametric covariate effects, which may not be realistic in practice. We fill these gaps in three aspects: (i) our models are flexible, with a semiparametric relative risk that allows nonadditive nonparametric components rather than being limited to single or additive nonlinear covariate effects; (ii) our approach simultaneously estimates the parametric coefficient vector and selects the contributing parametric components; and (iii) our approach also provides a model selection tool for the nonparametric components.

Let the hazard function for a subject be

$$h(t)=h_0(t)\exp[\beta^T U+\eta(W)],$$

(1.1)

where *h*_{0} is the unknown baseline hazard, *Z^{T}* = (*U^{T}*, *W^{T}*) is the covariate vector, *β* is the coefficient vector for the parametric covariates *U*, and *η* is an unspecified smooth function of the covariates *W*.

The rest of the article is organized as follows. Section 2 gives the details of the proposed method, in the order of model description and estimation procedure (§2.1), model selection in the nonparametric part (§2.2), asymptotic properties (§2.3), and miscellaneous issues (§2.4) like standard error estimates and smoothing parameter selection. Section 3 presents the empirical studies, and Section 4 gives an application study. Remarks in Section 5 conclude the article.

Let *T* be the failure time and *C* be the right-censoring time. Assume that *T* and *C* are conditionally independent given the covariate. The observable random variable is (*X*, Δ, *Z*), where *X* = min(*T, C*), Δ = *I*_{[T≤C]}, and *Z* = (*U, W*) is the covariate vector with *U* ∈ ℝ^{d} and *W* ∈ ℝ^{q}.

Let *Y _{i}*(*t*) = *I*_{[X_i ≥ t]} be the at-risk process for subject *i*. Given *β*, the estimate of *η* minimizes the penalized partial likelihood

$$l_\beta(\eta)\equiv-\frac{1}{n}\sum_{i=1}^{n}\Delta_i\Big\{U_i^T\beta+\eta(W_i)-\log\sum_{k=1}^{n}Y_k(X_i)\exp[U_k^T\beta+\eta(W_k)]\Big\}+\lambda J(\eta),$$

(2.1)

where the summation is the negative log partial likelihood representing the goodness-of-fit, *J*(*η*) is a roughness penalty specifying the smoothness of *η*, and *λ* > 0 is a smoothing parameter controlling the tradeoff. A popular choice for *J* is the *L*_{2}-penalty which yields tensor product cubic splines (see, e.g., [9]) for multivariate *W*. Note that *η* in (2.1) is identifiable up to a constant, so we use the constraint *∫ η* = 0.
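The goodness-of-fit term of (2.1) is directly computable from the data. Below is a minimal numerical sketch, assuming time-independent covariates and precomputed values of *η* at the data points; the function name and argument layout are illustrative, not from the paper:

```python
import numpy as np

def neg_log_partial_likelihood(beta, eta_vals, X, delta, U):
    """First (goodness-of-fit) term of (2.1): the negative log partial
    likelihood, averaged over the n subjects.

    beta     : (d,) parametric coefficients
    eta_vals : (n,) values eta(W_i) of the nonparametric part at the data
    X        : (n,) observed times X_i = min(T_i, C_i)
    delta    : (n,) event indicators Delta_i
    U        : (n, d) parametric covariates
    """
    n = len(X)
    risk = U @ beta + eta_vals                # U_i^T beta + eta(W_i)
    total = 0.0
    for i in range(n):
        if delta[i]:
            at_risk = X >= X[i]               # Y_k(X_i) = I[X_k >= X_i]
            log_denom = np.log(np.sum(np.exp(risk[at_risk])))
            total += risk[i] - log_denom
    return -total / n
```

Since every failed subject belongs to its own risk set, each summand is nonpositive, so the returned value is always nonnegative.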

Once an estimate of *η* is obtained, the estimator of *β* is then the maximizer of the penalized profile partial likelihood

$$l_{\widehat{\eta}}(\beta)\equiv\sum_{i=1}^{n}\Delta_i\Big\{U_i^T\beta+\widehat{\eta}(W_i)-\log\sum_{k=1}^{n}Y_k(X_i)\exp[U_k^T\beta+\widehat{\eta}(W_k)]\Big\}-n\sum_{j=1}^{d}p_{\theta_j}(|\beta_j|),$$

(2.2)

where *p _{θj}* (| · |) is the SCAD penalty on *β _{j}* with tuning parameter *θ _{j}*.

The detailed algorithm for our estimation procedure is as follows.

- Step 1. Find a proper initial estimate β̂^{(0)}. We note that, as long as the initial estimate is reasonable, convergence to the true optimizer can be achieved. Different choices of the initial estimate will affect the number of iterations needed but not the convergence itself.
- Step 2. Let β̂^{(k−1)} be the estimate of *β* before the *k*th iteration. Plug β̂^{(k−1)} into (2.1) and solve for *η* by minimizing the penalized partial likelihood *l*_{β̂^{(k−1)}}(*η*). Let η̂^{(k)} be the estimate thus obtained.
- Step 3. Plug η̂^{(k)} into (2.2) and solve for *β* by maximizing the penalized profile partial likelihood *l*_{η̂^{(k)}}(*β*). Let β̂^{(k)} be the estimate thus obtained.
- Step 4. Replace β̂^{(k−1)} in Step 2 by β̂^{(k)} and repeat Steps 2 and 3 until convergence to obtain the final estimates β̂ and η̂.

Our experience shows that the algorithm usually converges quickly within a few iterations. As in the classical Cox proportional hazards model, the estimation of baseline hazard function is of less interest and not required in our estimation procedure.
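The alternation of Steps 1-4 can be sketched as a generic loop; the two solvers, standing in for the minimizer of (2.1) and the maximizer of (2.2), are supplied by the caller and are purely illustrative here:

```python
def alternate(beta0, solve_eta, solve_beta, tol=1e-8, max_iter=50):
    """Skeleton of Steps 1-4: alternate the eta-update (Step 2) and the
    beta-update (Step 3), stopping when beta stabilizes (Step 4)."""
    beta = list(beta0)                 # Step 1: initial estimate
    eta = None
    for _ in range(max_iter):
        eta = solve_eta(beta)          # Step 2: minimize l_beta(eta) in (2.1)
        beta_new = solve_beta(eta)     # Step 3: maximize l_eta(beta) in (2.2)
        if max(abs(b - bn) for b, bn in zip(beta, beta_new)) < tol:
            return beta_new, eta       # converged
        beta = beta_new                # Step 4: feed back and repeat
    return beta, eta
```

With reasonable solvers the loop typically terminates well inside the iteration cap, mirroring the fast convergence observed in practice.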

In Step 3, we use a one-step approximation to the SCAD penalty ([34]). It transforms the SCAD penalty problem into a LASSO-type optimization, where the celebrated LARS algorithm proposed in [5] can be used. Let *l*_{η̂}(*β*) denote the unpenalized part of (2.2), approximated near β̂^{(k−1)} by a least-squares problem with pseudo-design matrix *V*; let *A* collect the indices of coefficients left unpenalized (those with vanishing SCAD penalty derivative at β̂^{(k−1)}) and *B* the remaining indices, whose penalty is linearized. Step 3 then proceeds as follows.

- (Step 3a) Let *y* = *V*β̂^{(k−1)}. Then for each *j* ∈ *B*, replace the *j*-th column of *V* by setting ${\upsilon}_{j}={\upsilon}_{j}\,\theta_j/p'_{\theta_j}\big(|\widehat{\beta}_j^{(k-1)}|\big)$.
- (Step 3b) Let ${H}_{A}={V}_{A}({V}_{A}^{T}{V}_{A})^{-1}{V}_{A}^{T}$ be the projection matrix onto the column space of *V _{A}*. Compute *y** = *y* − *H _{A}y* and *V _{B}** = *V _{B}* − *H _{A}V _{B}*.
- (Step 3c) Apply the LARS algorithm to solve $$\widehat{\beta}_B^{\ast}=\arg\min_{\beta}\Big\{\frac{1}{2}\|y^{\ast}-V_B^{\ast}\beta\|^2+n\sum_{j\in B}\theta_j|\beta_j|\Big\}.$$
- (Step 3d) Compute $\widehat{\beta}_A^{\ast}=({V}_{A}^{T}{V}_{A})^{-1}{V}_{A}^{T}(y-{V}_{B}\widehat{\beta}_B^{\ast})$ to obtain $\widehat{\beta}^{\ast}=(\widehat{\beta}_A^{\ast T},\widehat{\beta}_B^{\ast T})^{T}$.
- (Step 3e) For *j* ∈ *A*, set $\widehat{\beta}_j^{(k)}=\widehat{\beta}_j^{\ast}$. For *j* ∈ *B*, set $\widehat{\beta}_j^{(k)}=\widehat{\beta}_j^{\ast}\,\theta_j/p'_{\theta_j}\big(|\widehat{\beta}_j^{(k-1)}|\big)$.
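Steps 3a-3e can be sketched as follows, assuming a quadratic approximation has already produced the pseudo-response *y* and pseudo-design *V*; the plain coordinate-descent LASSO below is a stand-in for the LARS solver, and all names are illustrative:

```python
import numpy as np

def lasso_cd(X, y, weights, n_iter=500):
    """Coordinate descent for  min_b 0.5*||y - X b||^2 + sum_j w_j |b_j|.
    A simple stand-in for the LARS algorithm of [5]."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - weights[j], 0.0) / col_sq[j]
    return b

def one_step_scad(V, beta_prev, theta, scad_deriv, A, B):
    """Steps 3a-3e: rescale penalized columns (3a), project out the
    unpenalized block (3b), solve a weighted LASSO (3c), recover the
    unpenalized block (3d), and undo the rescaling (3e).
    scad_deriv(theta_j, b) should return p'_{theta_j}(|b|) > 0 on B."""
    y = V @ beta_prev                                       # Step 3a
    n = V.shape[0]
    scale = np.array([theta[j] / scad_deriv(theta[j], beta_prev[j]) for j in B])
    Vt = V.copy()
    Vt[:, B] = Vt[:, B] * scale                             # rescaled columns
    VA, VB = Vt[:, A], Vt[:, B]
    HA = VA @ np.linalg.solve(VA.T @ VA, VA.T)              # Step 3b
    y_star, VB_star = y - HA @ y, VB - HA @ VB
    bB = lasso_cd(VB_star, y_star, n * theta[B])            # Step 3c
    bA = np.linalg.solve(VA.T @ VA, VA.T @ (y - VB @ bB))   # Step 3d
    beta_new = np.zeros_like(beta_prev)
    beta_new[A], beta_new[B] = bA, bB * scale               # Step 3e
    return beta_new
```

As a sanity check, when the penalty weights are negligible the procedure reduces to least squares and returns β̂^{(k−1)} essentially unchanged, since *y* = *V*β̂^{(k−1)} exactly.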

While the SCAD penalty takes care of variable selection for the parametric components, we still need an approach to assess the structure of the nonparametric components. In this section, we will first transform the profile partial likelihood problem in (2.1) to a density estimation problem with biased sampling, and then derive a model selection tool based on the Kullback-Leibler geometry. In this part, we treat *β* as fixed, taking the value from the previous step in the algorithm.

Let (*i*_{1}, … , *i _{N}*) be the indices of the failed subjects. Then the profile partial likelihood in (2.1) for estimating *η* can be written as

$$\prod_{i=1}^{n}\left[\frac{e^{U_i^T\beta+\eta(W_i)}}{\sum_{k=1}^{n}Y_k(X_i)\,e^{U_k^T\beta+\eta(W_k)}}\right]^{\Delta_i}=\prod_{p=1}^{N}\left[\frac{e^{U_{i_p}^T\beta+\eta(W_{i_p})}}{\sum_{k=1}^{n}Y_k(X_{i_p})\,e^{U_k^T\beta+\eta(W_k)}}\right].$$

Consider the empirical measure ${\mathrm{P}}_{n}^{w}$ on the discrete domain *W _{n}* = {*W*_{1}, … , *W _{n}*}, assigning mass 1/*n* to each point, and define the biased-sampling weights $a_p(W_k)=Y_k(X_{i_p})\exp(U_k^T\beta)$. Each factor in the product above is then proportional to $a_p(W_{i_p})e^{\eta(W_{i_p})}/\int a_p(w)e^{\eta(w)}\,d\mathrm{P}_n^w$, so estimating *η* can be viewed as a density estimation problem under biased sampling.

For two density estimates *η*_{1} and *η*_{2} in the above pseudo biased sampling density estimation problem, define their Kullback-Leibler distance as

$$\mathrm{KL}(\eta_1,\eta_2)=\frac{1}{N}\sum_{p=1}^{N}\left\{\frac{\int(\eta_1(w)-\eta_2(w))\,a_p(w)\,e^{\eta_1(w)}\,d\mathrm{P}_n^w}{\int a_p(w)\,e^{\eta_1(w)}\,d\mathrm{P}_n^w}-\log\int a_p(w)\,e^{\eta_1(w)}\,d\mathrm{P}_n^w+\log\int a_p(w)\,e^{\eta_2(w)}\,d\mathrm{P}_n^w\right\}.$$

(2.3)

Let *η*_{0} be the true function. Suppose the estimation of *η* is performed in a function space ℋ, and one wants to check whether *η*_{0} lies in a reduced model space ℋ_{c} ⊂ ℋ. Let η̂ be the estimate in the full space and η̃ the Kullback-Leibler projection of η̂ onto ℋ_{c}, i.e., the minimizer of KL(η̂, *η*_{c}) over *η*_{c} ∈ ℋ_{c}. Setting the derivative of KL(η̂, η̃ + *α*(η̃ − *η*_{c})) with respect to *α* to zero at *α* = 0 gives the stationarity condition

$$\sum_{p=1}^{N}\frac{\int(\tilde{\eta}(w)-\eta_c(w))\,a_p(w)\,e^{\widehat{\eta}(w)}\,d\mathrm{P}_n^w}{\int a_p(w)\,e^{\widehat{\eta}(w)}\,d\mathrm{P}_n^w}=\sum_{p=1}^{N}\frac{\int(\tilde{\eta}(w)-\eta_c(w))\,a_p(w)\,e^{\tilde{\eta}(w)}\,d\mathrm{P}_n^w}{\int a_p(w)\,e^{\tilde{\eta}(w)}\,d\mathrm{P}_n^w},$$

which, through straightforward calculation, yields

$$\mathrm{KL}(\widehat{\eta},{\eta}_{c})=\mathrm{KL}(\widehat{\eta},\stackrel{\sim}{\eta})+\mathrm{KL}(\stackrel{\sim}{\eta},{\eta}_{c}).$$

Hence the ratio KL(η̂, η̃)/KL(η̂, *η*_{c}) can be used to diagnose the feasibility of a reduced model: a ratio close to zero indicates that the reduced model loses little relative to the full fit.
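A small numerical sketch of the distance (2.3) on the discrete domain, assuming the function values and the biased-sampling weights *a _{p}* have been tabulated on *W _{n}* (the array layout is illustrative):

```python
import numpy as np

def kl_dist(eta1, eta2, a):
    """Kullback-Leibler distance (2.3); integrals over the empirical
    measure P_n^w become averages over the n support points.
    eta1, eta2 : (n,) function values on W_n
    a          : (N, n) biased-sampling weights a_p evaluated on W_n"""
    e1, e2 = np.exp(eta1), np.exp(eta2)
    m1 = (a * e1).mean(axis=1)                    # int a_p e^{eta1} dP_n^w
    m2 = (a * e2).mean(axis=1)                    # int a_p e^{eta2} dP_n^w
    cross = (a * (eta1 - eta2) * e1).mean(axis=1)
    return float(np.mean(cross / m1 - np.log(m1) + np.log(m2)))
```

Each summand is a genuine Kullback-Leibler divergence between two normalized biased densities, so the result is nonnegative and vanishes when the two functions coincide. The diagnostic is then the ratio `kl_dist(eta_hat, eta_tilde, a) / kl_dist(eta_hat, eta_c, a)`.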

Denote by ℋ^{m}(*W*) the *m*th-order Sobolev space on the domain *W*. Let

$$\mathscr{H}=\{\eta \in {\mathscr{H}}^{m}(W),{\mathit{\int}}_{W}\phantom{\rule{0.2em}{0ex}}\eta (w)dw=0\},$$

and let η̂ be the estimate of *η*_{0} in ℋ that minimizes the penalized partial likelihood

$$-\frac{1}{n}\sum_{i=1}^{n}\int_{\mathrm{T}}\Big\{U_i^T\beta+\eta(W_i)-\log\sum_{k=1}^{n}Y_k(t)\exp[U_k^T\beta+\eta(W_k)]\Big\}\,dN_i(t)+\frac{\lambda}{2}J(\eta).$$

(2.4)

Note that ℋ is an infinite-dimensional function space. Hence, in practice, the minimization of (2.4) is usually performed in a data-adaptive finite-dimensional space

$${\mathscr{H}}_{n}={N}_{J}\oplus \mathrm{span}\{{R}_{J}\phantom{\rule{0.1em}{0ex}}({W}_{{i}_{l}},\cdot ):l=1,\dots ,{q}_{n}\}$$

where *N _{J}* = {*η* ∈ ℋ : *J*(*η*) = 0} is the null space of the penalty *J*, *R _{J}*(·, ·) is the reproducing kernel associated with *J*, and {*W*_{i_1}, … , *W*_{i_{q_n}}} is a random subset of the observed covariates {*W*_{1}, … , *W _{n}*}.

Let
${s}_{n}[f;\beta ,\eta ](t)=\frac{1}{n}\sum_{k=1}^{n}Y_k(t)\,f(U_k,W_k)\exp(U_k^T\beta+\eta(W_k))$ and write *s _{n}*[*β*, *η*](*t*) = *s _{n}*[1; *β*, *η*](*t*). The population counterpart is

$$\begin{array}{lll}s[f;\beta ,\eta ](t)& =& E[Y(t)\phantom{\rule{0.1em}{0ex}}f(U,W)\phantom{\rule{0.2em}{0ex}}\mathrm{exp}({U}^{T}\beta +\eta (W))]\\ & =& \mathit{\int}\mathit{\int}\phantom{\rule{0.2em}{0ex}}f(u,w)\phantom{\rule{0.1em}{0ex}}{e}^{{u}^{T}\beta +\eta (w)}\phantom{\rule{0.1em}{0ex}}q(t,u,w)\mathit{dudw}.\end{array}$$

For any functions *f* and *g*, define

$$V(f,g)={\mathit{\int}}_{{\rm T}}\phantom{\rule{0.1em}{0ex}}\{\frac{s[fg;\beta ,{\eta}_{0}](t)}{s[\beta ,{\eta}_{0}](t)}-\frac{s[f;\beta ,{\eta}_{0}](t)}{s[\beta ,{\eta}_{0}](t)}\frac{s[g;\beta ,{\eta}_{0}](t)}{s[\beta ,{\eta}_{0}](t)}\}s[\beta ,{\eta}_{0}](t)d{\mathrm{\Lambda}}_{0}(t).$$

(2.5)

Write *V*(*f*) ≡ *V*(*f, f*). Let η̂* be the estimate that minimizes (2.4) in ℋ* _{n}*. Then we have

*Under Conditions A1-A7 in the Appendix,*

$$(V+\lambda J)({\widehat{\eta}}^{\ast}-{\eta}_{0})={O}_{P}({n}^{-r/(r+1)})\phantom{\rule{0.2em}{0ex}}\mathit{and}\phantom{\rule{0.2em}{0ex}}(V+\lambda J)(\widehat{\eta}-{\eta}_{0})={O}_{P}({n}^{-r/(r+1)}).$$

This is the optimal convergence rate for the estimate of a nonparametric function. In view of Lemma 6.1, this theorem also indicates the same convergence rate in terms of the *L*_{2}-norm. Also note that although a higher order of *q _{n}* such as

Let ${\mathcal{L}}_{P}(\beta )={l}_{p}(\beta )-n{\sum}_{j=1}^{d}{p}_{{\theta}_{j}}(|{\beta}_{j}|)$, where

$${l}_{p}(\beta )=\sum _{i=1}^{n}\mathit{\int}\{{U}_{i}^{T}\beta +\widehat{\eta}({W}_{i})-\mathrm{log}\phantom{\rule{0.2em}{0ex}}{s}_{n}[\beta ,\widehat{\eta}](t)\}d{N}_{i}(t).$$

Let
${\beta}_{0}={({\beta}_{10},\dots ,{\beta}_{d0})}^{T}={({\beta}_{10}^{T},\phantom{\rule{0.2em}{0ex}}{\beta}_{20}^{T})}^{T}$ be the true coefficient vector. Without loss of generality, assume that *β*_{20} = 0. Let *s* be the number of nonzero components in *β*_{0}. Define
${a}_{n}=\max_j\{|p'_{\theta_j}(|\beta_{j0}|)|:\beta_{j0}\ne 0\}$, ${b}_{n}=\max_j\{|p''_{\theta_j}(|\beta_{j0}|)|:\beta_{j0}\ne 0\}$, and

$$\begin{array}{lll}\mathbf{b}& =& {({p}_{{\theta}_{1}}^{\prime}({\beta}_{10})\phantom{\rule{0.1em}{0ex}}\mathrm{sgn}({\beta}_{10}),\dots ,{p}_{{\theta}_{s}}^{\prime}({\beta}_{s0})\phantom{\rule{0.1em}{0ex}}\mathrm{sgn}({\beta}_{s0}))}^{T},\\ {\sum}_{\theta}& =& \mathrm{diag}({p}_{{\theta}_{1}}^{\prime}(|{\beta}_{10}|)/|{\beta}_{10}|,\dots ,{p}_{{\theta}_{s}}^{\prime}(|{\beta}_{s0}|)/|{\beta}_{s0}|).\end{array}$$

Define *π*_{1} : ℝ^{d+q} → ℝ^{s} such that *π*_{1}(*u, w*) = *u*_{1}, where *u*_{1} is the vector of the first *s* components of *u*. Let *V*_{0}(*π*_{1}) be defined like *V*(*π*_{1}) in (2.5) but with *β* replaced by *β*_{0}.

*Under Conditions A1-A7 in the Appendix, if a _{n}* = *O*(*n*^{−1/2}) *and b _{n}* → 0, *then:*

(i) *There exists a local maximizer* β̂ *of* ℒ_{P}(*β*) *such that* ‖β̂ − *β*_{0}‖ = *O _{p}*(*n*^{−1/2}).

(ii) *Further assume that for all* 1 ≤ *j* ≤ *d*, *θ _{j}* = *o*(1), ${\theta}_{j}^{-1}=o\left({n}^{1/2}\right)$, *and*

$$\liminf_{n\to\infty}\,\liminf_{u\to 0^{+}}\,\theta_j^{-1}\,p'_{\theta_j}(u)>0.$$

*With probability approaching one, the root-n consistent estimator* β̂ *in (i) must satisfy* β̂_{2} = 0 *and*

$$\sqrt{n}\,(V_0(\pi_1)+\Sigma_\theta)\{\widehat{\beta}_1-\beta_{10}+(V_0(\pi_1)+\Sigma_\theta)^{-1}\mathbf{b}\}\to N(\mathbf{0},V_0(\pi_1)).$$

In this section, we will propose the standard error estimates for both the parametric and the nonparametric components, and discuss the selection of the smoothing parameters *θ* and *λ*.

Let *l _{p}*(*β*) be the log profile partial likelihood defined above, with gradient ∇*l _{p}* and Hessian ∇²*l _{p}*. Define

$$\Sigma_\theta(\beta)=\mathrm{diag}\{p'_{\theta_1}(|\beta_1|)/|\beta_1|,\dots,p'_{\theta_d}(|\beta_d|)/|\beta_d|\}.$$

Then the standard errors for the nonzero coefficients of β̂ are given by the sandwich formula

$$\mathrm{c}\widehat{\mathrm{o}}\mathrm{v}(\widehat{\beta})={\{{\nabla}^{2}{l}_{p}(\widehat{\beta})-n{\sum}_{\theta}\phantom{\rule{0.1em}{0ex}}(\widehat{\beta})\}}^{-1}\mathrm{c}\widehat{\mathrm{o}}\mathrm{v}\{\nabla {l}_{p}(\widehat{\beta})\}{\{{\nabla}^{2}{l}_{p}(\widehat{\beta})-n{\sum}_{\theta}\phantom{\rule{0.1em}{0ex}}(\widehat{\beta})\}}^{-1}.$$

Sometimes the standard errors for zero coefficients are also of interest; a discussion of this problem is given in Section 5.
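The sandwich formula can be sketched directly, assuming the score covariance and the Hessian restricted to the nonzero coefficients are available and that a user-supplied `scad_deriv` returns p′_θ; all names are illustrative:

```python
import numpy as np

def scad_sandwich_cov(score_cov, hessian, beta_hat, theta, scad_deriv, n):
    """Sandwich covariance for the nonzero components of beta-hat:
    {grad^2 l_p - n*Sigma_theta}^{-1} cov{grad l_p} {grad^2 l_p - n*Sigma_theta}^{-1}.
    score_cov : (s, s) estimated covariance of the score
    hessian   : (s, s) Hessian of l_p at beta-hat"""
    sigma = np.diag([scad_deriv(t, b) / abs(b) for t, b in zip(theta, beta_hat)])
    bread = np.linalg.inv(hessian - n * sigma)
    return bread @ score_cov @ bread
```

When the penalty derivative vanishes at the nonzero coefficients, the formula collapses to the usual unpenalized sandwich estimator.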

In (2.1), *η* can be decomposed as *η* = *η*^{[0]} + *η*^{[1]}, where *η*^{[0]} lies in the null space of the penalty *J*, representing the lower order part, and *η*^{[1]} lies in the complement space, representing the higher order part. A Bayes model interprets (2.1) as a posterior likelihood when *η*^{[0]} is assigned an improper constant prior and *η*^{[1]} is assigned a Gaussian prior with zero mean and a certain covariance matrix. The minimizer of (2.1) then becomes the posterior mode. When the minimization is carried out in a data-adaptive function space ℋ* _{n}* with basis functions, the corresponding posterior variances yield approximate Bayesian confidence intervals for η̂.

As shown in [33], the effective degrees of freedom of an *l*_{1}-penalized model is well approximated by the number of nonzero coefficients. Note that our SCAD procedure is implemented via a LASSO approximation at each step. Hence, letting *Â* be the set of nonzero coefficients, the AIC score for selecting *θ* in Step 3 is

$$\mathrm{AIC}=-2\,l_p(\widehat{\beta})+2|\widehat{A}|,$$

where |*Â*| is the cardinality of *Â*.

As illustrated in Section 2.2, the estimation of *η* in Step 2 can be cast as a density estimation problem with biased sampling. Let KL(*η*_{0}, η̂* _{λ}*) be the Kullback-Leibler distance, as defined in (2.3), between the true “density”
${e}^{{\eta}_{0}}/\mathit{\int}\phantom{\rule{0.1em}{0ex}}{e}^{{\eta}_{0}}\phantom{\rule{0.1em}{0ex}}d{\mathrm{P}}_{n}^{w}$ and the estimate
${e}^{{\hat{\eta}}_{\lambda}}/\int{e}^{{\hat{\eta}}_{\lambda}}\,d{\mathrm{P}}_{n}^{w}$. An optimal *λ* would minimize this distance; dropping the terms that do not involve η̂* _{λ}* yields the relative Kullback-Leibler distance

$$\mathrm{RKL}(\eta_0,\widehat{\eta}_\lambda)=\frac{1}{N}\sum_{p=1}^{N}\Big\{\frac{\int(\eta_0(w)-\widehat{\eta}_\lambda(w))\,a_p(w)\,e^{\eta_0(w)}\,d\mathrm{P}_n^w}{\int a_p(w)\,e^{\eta_0(w)}\,d\mathrm{P}_n^w}+\log\int a_p(w)\,e^{\widehat{\eta}_\lambda(w)}\,d\mathrm{P}_n^w\Big\}.$$

(2.6)

The second term of (2.6) is directly computable from the estimate η̂* _{λ}*. But the first term involves the unknown *η*_{0} and needs to be estimated. A cross-validation estimate of (2.6) is

$$-\frac{1}{N}\sum_{p=1}^{N}\Big\{\widehat{\eta}_\lambda(W_{i_p})-\log\int a_p(w)\,e^{\widehat{\eta}_\lambda(w)}\,d\mathrm{P}_n^w\Big\}+\frac{\mathrm{tr}(P_1Q^TH^{-1}QP_1)}{N(N-1)},$$

where *P*_{1} = *I* − **11**^{T}/*N*, and *Q* and *H* are, respectively, gradient- and Hessian-related matrices of the penalized likelihood functional evaluated at η̂* _{λ}*.

In the simulations, we generated failure times from the exponential hazard model with *h*(*t*|*U, W*) = exp[*U^{T}β*_{0} + *η*_{0}(*W*)], i.e., with constant baseline hazard.

The theory in Section 2.3 gives the sufficient order for *q _{n}*, the number of knots in our smoothing spline estimation of *η*.

The nonparametric part had one covariate *W* generated from Uniform(0, 1). Two different *η*_{0} were used:

$$\eta_{0a}(w)=1.5\sin\!\Big(2\pi w-\frac{\pi}{2}\Big)\quad\text{or}\quad\eta_{0b}(w)=4(w-0.3)^2+4.7e^{-w}-3.4643.$$

Note that both functions satisfy ${\int}_{0}^{1}{\eta}_{0}(w)\,dw=0$. Given *U* and *W*, the censoring times were generated from exponential distributions such that the censoring rates were respectively 23% and 40%. Sample sizes *n* = 150 and 500 were considered. One thousand data replicates were generated for each of the four combinations of *η*_{0} and *n*.
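The data-generating step can be sketched as follows; the Gaussian design for *U* and the censoring scale are illustrative stand-ins rather than the exact configuration used in the paper:

```python
import numpy as np

def simulate_cox(n, beta, eta, censor_scale, rng):
    """Draw (X, Delta, U, W) from h(t | U, W) = exp(U^T beta + eta(W)):
    a hazard constant in t, so T is exponential with that rate."""
    U = rng.normal(size=(n, len(beta)))        # stand-in parametric covariates
    W = rng.uniform(size=n)                    # nonparametric covariate
    rate = np.exp(U @ beta + eta(W))
    T = rng.exponential(1.0 / rate)            # failure times
    C = rng.exponential(censor_scale, size=n)  # exponential censoring
    return np.minimum(T, C), (T <= C).astype(int), U, W

# the first test function of Section 3.1; it integrates to zero on [0, 1]
eta_0a = lambda w: 1.5 * np.sin(2 * np.pi * w - np.pi / 2)
```

In practice the censoring scale would be tuned until the empirical censoring rate matches the target (23% or 40%).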

For a prediction procedure *M* and the estimator (β̂* _{M}*, η̂* _{M}*) yielded by the procedure, an appropriate measure of goodness-of-fit under the Cox model is the model error of (β̂* _{M}*, η̂* _{M}*). We considered the following procedures:

- *M _{A}*: procedure with partial oracle and misspecified parametric *η*_{0}; that is, (*U*_{1}, *U*_{4}, *U*_{7}, *W*) are known to be the only contributing covariates but *η*_{0} is misspecified to be of the parametric form *η*_{0}(*W*) = *β _{W}W*, and *β _{W}* is estimated together with the coefficients for (*U*_{1}, *U*_{4}, *U*_{7});
- *M _{B}*: procedure with partial oracle and estimated *η*_{0}; that is, (*U*_{1}, *U*_{4}, *U*_{7}, *W*) are known to be the only contributing covariates but the form of *η*_{0} is unknown, and *η*_{0} is estimated together with the coefficients for (*U*_{1}, *U*_{4}, *U*_{7}) by penalized profile partial likelihood;
- *M _{C}*: the proposed partial linear procedure with the SCAD penalty on *β*;
- *M _{D}*: the proposed partial linear procedure with the adaptive LASSO penalty on *β*.

Procedure *M _{A}* has a misspecified covariate effect; we intend to show that the estimation results can be unsatisfactory if the semiparametric form of the covariate effect is mistakenly specified as parametric. Procedure *M _{B}* serves as a benchmark in which the contributing covariates, but not the form of *η*_{0}, are known in advance.

For each combination of *η*_{0} and *n*, we computed the following quantities over the 1000 data replicates: the median RMEs of the complete oracle procedure *M*_{0} versus the procedures *M _{A}* to *M _{D}*.

To examine the estimation of *η*_{0}, we computed the point-wise estimates on the grid *w* = seq(0, 1, by = 0.01) for each data replicate. Then at each grid point, the mean and the 0.025 and 0.975 quantiles of the 1000 estimates, together with the mean of the 1000 95% confidence intervals, were computed. The results are in Figure 1. The plots show satisfactory nonparametric fits and standard error estimates.
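The pointwise summary over replicates is a routine computation on the replicate-by-grid matrix of estimates; the sketch below uses placeholder data in place of the 1000 fitted curves:

```python
import numpy as np

rng = np.random.default_rng(0)
# placeholder: 1000 replicate estimates of eta_0 on a 101-point grid
est = rng.normal(size=(1000, 101))
mean_curve = est.mean(axis=0)                        # pointwise mean
lo, hi = np.quantile(est, [0.025, 0.975], axis=0)    # empirical 95% band
```

The same three curves (mean, lower and upper quantiles) are what Figure 1 overlays on the averaged confidence intervals.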

In this section, we present some simulations to evaluate the model selection tool for the nonparametric part introduced in Section 2.2. We used the SCAD penalty on the parametric components in this section. Two covariates *W*_{1} and *W*_{2}, independently generated from Uniform(0, 1), were used. We considered two scenarios for the true model of the nonparametric part: (i) nonparametric univariate model *η*_{0}(*W*) = *η*_{01}(*W*_{1}) and (ii) nonparametric bivariate additive model *η*_{0}(*W*) = *η*_{01}(*W*_{1}) + *η*_{02}(*W*_{2}). For scenario (i), the data sets generated in the last section were used, with *W*_{1} being the existing *W* covariate and *W*_{2} being an additional noise covariate. The fitted model was nonparametric additive in *W*_{1} and *W*_{2}. The ratios KL(η̂, η̃)/KL(η̂, *η*_{c}) for the projections onto the univariate models were computed. For scenario (ii), the true functions were

$$\begin{array}{l}{\eta}_{0}({w}_{1},{w}_{2})=0.7{\eta}_{0a}({w}_{1})+0.3{\eta}_{0b}({w}_{2})\phantom{\rule{0.4em}{0ex}}\mathrm{or}\\ {\eta}_{0}({w}_{1},{w}_{2})={\eta}_{0a}({w}_{1})+{\eta}_{0b}({w}_{2}),\end{array}$$

where *η*_{0a} and *η*_{0b} are as defined in Section 3.1. The censoring times were generated from exponential distributions such that the resulting censoring rates were respectively 25% and 39%. Note that both choices of *η*_{0} are additive in *w*_{1} and *w*_{2}. The fitted model was the nonparametric bivariate full model with both the main effects and the interaction. Then the ratios KL(η̂, η̃)/KL(η̂, *η*_{c}) for the projections onto the bivariate additive model and the two univariate models were computed. In both scenarios, we claim a reduced model is feasible when the corresponding ratio KL(η̂, η̃)/KL(η̂, *η*_{c}) is small.

For each of the eight combinations of *η*_{0} and *n*, we simulated 1000 data replicates and computed the proportions of replicates that produced the following results in the reduced model: selected the main effect of *W*_{1}, selected the main effect of *W*_{2}, selected the interaction *W*_{1} : *W*_{2}, under-fitted the model by excluding at least one truly significant effect, correctly fitted the model by reducing to the exact subset model, and over-fitted the model by including all the truly significant effects and some irrelevant effects. These proportion results are summarized in Table 3. They show that the variable selection tool for the nonparametric component works very well, with better performance associated with larger sample sizes and lower censoring rates.

An example in [17] is a study on two sexually transmitted diseases: gonorrhea and chlamydia. The purpose of the study was to identify factors that are related to time until reinfection by gonorrhea or chlamydia given an initial infection of either disease. A sample of 877 individuals with an initial diagnosis of gonorrhea or chlamydia were followed for reinfection. Recorded for each individual were follow-up time, indicator of reinfection, demographic variables including race (white or black, *U*_{1}), marital status (divorced/separated, married, or single, *U*_{2} and *U*_{3}), age at initial infection (*W*_{1}), years of schooling (*W*_{2}), and type of initial infection (gonorrhea, chlamydia, or both, *U*_{4} and *U*_{5}), behavior factors at the initial diagnosis including number of partners in the last 30 days (*U*_{6}), indicators of oral sex within past 12 months and within past 30 days (*U*_{7} and *U*_{8}), indicators of rectal sex within past 12 months and within past 30 days (*U*_{9} and *U*_{10}), and condom use (always, sometimes, or never, *U*_{11} and *U*_{12}), symptom variables at time of initial infection including presence of abdominal pain (*U*_{13}), sign of discharge (*U*_{14}), sign of dysuria (*U*_{15}), sign of itch (*U*_{16}), sign of lesion (*U*_{17}), sign of rash (*U*_{18}), and sign of lymph involvement (*U*_{19}), and symptom variables at time of examination including vaginal involvement at exam (*U*_{20}), discharge at exam (*U*_{21}), and abnormal node at exam (*U*_{22}).

We used *q _{n}* = 10 · 877

$${h}_{i}(t|Z)={h}_{0}(t)\phantom{\rule{0.2em}{0ex}}\mathrm{exp}\left\{\sum _{j=1}^{3}{\eta}_{j}({W}_{ji})+\sum _{k=1}^{22}{U}_{ki}\phantom{\rule{0.1em}{0ex}}{\beta}_{k}\right\},$$

where *η*_{3}(*W*_{3i}) = *η*_{3}(*W*_{1i}, *W*_{2i}) is the interaction term between *W*_{1} and *W*_{2}. However, the interaction term was found to be negligible, with the ratio KL(η̂, η̃)/KL(η̂, *η*_{c}) = 0.003. Hence, we took out this interaction term and refitted the model. In this model, neither

Nonparametric Component Estimates for Sexually Transmitted Diseases Data. Left: Effect of age at initial infection. Right: Effect of years of schooling. Solid lines are the estimates; dashed lines are 95% confidence intervals.

We have proposed a Cox PH model with semiparametric relative risk. The nonparametric part of the risk is estimated by a smoothing spline ANOVA model, with a model selection procedure derived from a Kullback-Leibler geometry. The parametric part of the risk is estimated by penalized profile partial likelihood, with variable selection achieved through a nonconcave penalty. Both theoretical and numerical studies show promising results for the proposed method. An important question in using the method in practice is which covariate effects should be treated as parametric. We suggest the following guideline. As a starting point, the effects of all the continuous covariates are put in the nonparametric part and those of the discrete covariates in the parametric part. If the estimation results show that some of the continuous covariate effects can be described by certain parametric forms, such as a linear form, then a new model can be fitted with those continuous covariate effects moved to the parametric part. In this way, one can take full advantage of the flexible exploratory analysis provided by the proposed method.

We thank a referee for raising the interesting question of standard error estimates for the coefficients estimated to be 0 in β̂. [25] and [7] suggested setting these standard errors to 0 based on the belief that covariates with zero coefficient estimates are not important; this is the approach adopted here. When such a belief is in doubt, nonzero standard errors are preferred even for coefficients estimated to be 0. This problem has been addressed in only a few papers. [22] looked at the problem for the LASSO, but the treatment is based on a smooth approximation. [24] presented a Bayesian approach and pointed out that no fully satisfactory frequentist solution has been proposed so far, whether the LASSO or the SCAD variable selection procedure is considered. This problem presents an interesting challenge that we hope to address in future work.

Another choice of *p _{θ}*(| · |) is the adaptive LASSO penalty ([31]). Our simulations in §3.1 indicate a performance similar to that of the SCAD penalty, so we do not present the details here.

Although our method is presented for time-independent covariates, a lengthier argument modifying [23] can yield similar theoretical results for external time-dependent covariates ([15]). However, the implementation of such extension is more complicated and not pursued here.

A recently proposed nonparametric component selection procedure in a penalized likelihood framework is the COSSO method in [19], where the penalty switches from *J*(*η*) to *J*^{1/2}(*η*). Taking advantage of the smoothing spline ANOVA decomposition, the COSSO method does model selection by applying a soft-thresholding-type operation to the function components. An extension of COSSO to the Cox proportional hazards model with nonparametric relative risk is available in [18]. Although a similar extension to our proportional hazards model with semiparametric relative risk is of interest, it is not clear whether the theoretical properties of the COSSO method, such as the existence and the convergence rate of the COSSO estimator, can be transferred to the estimation of *η* under our semiparametric setting. Furthermore, the dimension of the function space in COSSO is *O*(*n*), too big to allow an entropy bound that is critical in deriving the asymptotic properties of η̂.

For *z* = (*u, w*), let *p_{z}*(*t*) denote the conditional density of *T* at *t* given *Z* = *z*, and *q*(*t, u, w*) the joint density of (*T, Z*).

- A1. The true coefficient *β*_{0} is an interior point of a bounded subset of ℝ^{d}.
- A2. The domain *Ƶ* of the covariate is a compact set in ℝ^{d+q}.
- A3. Failure time *T* and censoring time *C* are conditionally independent given the covariate *Z*.
- A4. The observations lie in a finite time interval [0, *τ*], and the baseline hazard function *h*_{0}(*t*) is bounded away from zero and infinity.
- A5. There exist constants *k*_{2} > *k*_{1} > 0 such that *k*_{1} < *q*(*t, u, w*) < *k*_{2} and $\left|\frac{\partial}{\partial t}q(t,u,w)\right|<{k}_{2}$.
- A6. The true function *η*_{0} is smooth, with *J*(*η*_{0}) < ∞, and integrates to zero. For any *η* in a sufficiently big convex neighborhood *B*_{0} of *η*_{0}, there exist constants *c*_{1}, *c*_{2} > 0 such that *c*_{1}*e*^{η0(w)} ≤ *e*^{η(w)} ≤ *c*_{2}*e*^{η0(w)} for all *w*.
- A7. The smoothing parameter satisfies *λ* ≍ *n*^{−r/(r+1)}.

Condition A1 requires that *β*_{0} is not on the boundary of the parameter space. Condition A2 is a common boundedness assumption on the covariates. Condition A3 assumes noninformative censoring. Condition A4 is the common boundedness assumption on the baseline hazard. Condition A5 bounds the joint density of (*T, Z*) and thus also the derivatives of the partial likelihood. Condition A6 assumes that *η*_{0} has a proper level of smoothness and integrates to zero. The neighborhood *B*_{0} in Condition A6 should be big enough to contain all the estimates of *η*_{0} considered below; when the members of *B*_{0} are all uniformly bounded, Condition A6 is automatically satisfied. The order for *λ* in Condition A7 matches that in standard smoothing spline problems.

We first show the equivalence between *V*(·) and the squared *L*_{2}-norm ${\parallel \cdot \parallel}_{2}^{2}$.

*Let f be an element of the model space. Then there exist constants* 0 < *c*_{3} ≤ *c*_{4} < ∞ *such that*

$${c}_{3}{\parallel f\parallel}_{2}^{2}\le V(f)\le {c}_{4}{\parallel f\parallel}_{2}^{2}.$$

For *z* = (*u, w*), let *p_{z}*(*t*) be the conditional density of *T* at *t* given *Z* = *z*. Then

$$V(f)={\mathit{\int}}_{{\rm T}}\left\{\mathit{\int}\mathit{\int}{(f(u,w)-{\overline{f}}_{t})}^{2}{p}_{z}(t)\mathit{dudw}\right\}s[\beta ,{\eta}_{0}](t)d{\mathrm{\Lambda}}_{0}(t).$$

By Conditions A4 and A5, there exist positive constants *c*_{1} and *c*_{2} such that

$$\begin{array}{l}{c}_{1}{\mathit{\int}}_{{\rm T}}\{\mathit{\int}\mathit{\int}{(f(u,w)-{\overline{f}}_{t})}^{2}\mathit{dudw}\}d{\mathrm{\Lambda}}_{0}(t)\\ \phantom{\rule{7em}{0ex}}\le V(f)\le {c}_{2}{\mathit{\int}}_{{\rm T}}\{\mathit{\int}\mathit{\int}{(f(u,w)-{\overline{f}}_{t})}^{2}\mathit{dudw}\}d{\mathrm{\Lambda}}_{0}(t)\end{array}$$

Let *m*(*Ƶ*) < ∞ be the Lebesgue measure of *Ƶ*. Then

$$\begin{array}{l}{\mathit{\int}}_{{\rm T}}\{\mathit{\int}\mathit{\int}{(f(u,w)-{\overline{f}}_{t})}^{2}\mathit{dudw}\}d{\mathrm{\Lambda}}_{0}(t)\\ \phantom{\rule{6em}{0ex}}={\mathrm{\Lambda}}_{0}(\tau )\mathit{\int}\mathit{\int}\phantom{\rule{0.2em}{0ex}}{f}^{2}(u,w)\mathit{dudw}+m(Ƶ){\mathit{\int}}_{{\rm T}}{[{\overline{f}}_{t}]}^{2}d{\mathrm{\Lambda}}_{0}(t).\end{array}$$

The lemma follows from the Cauchy–Schwarz inequality and Condition A4.

We will prove the results via an eigenvalue analysis in three steps. In the first step (linear approximation), we derive the convergence rate of a linearized estimator $\stackrel{\sim}{\eta}$ that minimizes a quadratic approximation to the penalized partial likelihood. In the second step (approximation error), we bound ${\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}$, where ${\widehat{\eta}}^{\ast}$ is the minimizer in the full model space. In the third step, we transfer the rate to the minimizer $\widehat{\eta}$ in the finite-dimensional approximating space.

A quadratic functional *B* is said to be *completely continuous* with respect to another quadratic functional *A* if, for any *ε* > 0, there exists a finite number of linear functionals *L*_{1}, … , *L _{k}* such that *L _{j}*(*f*) = 0, *j* = 1, … , *k*, implies *B*(*f*) ≤ *εA*(*f*).

We first present two lemmas without proof. The first one follows directly from the results in Section 8.1 of [9] and Lemma 6.1. The second one is exactly Lemma 8.1 in [9].

*V is completely continuous with respect to J, and the eigenvalues ρ _{ν} of J with respect to V satisfy* ${\rho}_{\nu}^{-1}=O\left({\nu}^{-r}\right)$ *as ν* → ∞.

*As λ* → 0, *the sums*
${\sum}_{\nu}\frac{\lambda {\rho}_{\nu}}{{(1+\lambda {\rho}_{\nu})}^{2}},{\sum}_{\nu}\frac{1}{{(1+\lambda {\rho}_{\nu})}^{2}}$, *and*
${\sum}_{\nu}\frac{1}{1+\lambda {\rho}_{\nu}}$ *are all of order O*(*λ*^{−1/r}).
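For intuition on Lemma 6.3, take *ρ_{ν}* = *ν*^{r}, the growth rate consistent with Lemma 6.2 (r = 4 here is only an illustrative choice, roughly corresponding to cubic smoothing splines); a numerical sketch confirms the *λ*^{−1/r} scaling of the simplest of the three sums:

```python
import numpy as np

def spectral_sum(lam, r=4, n_terms=200_000):
    """Evaluate sum_nu 1/(1 + lam * rho_nu) with rho_nu = nu**r,
    truncated at n_terms (the tail is negligible for these lam)."""
    nu = np.arange(1, n_terms + 1, dtype=float)
    return np.sum(1.0 / (1.0 + lam * nu ** r))

# Lemma 6.3 predicts O(lam**(-1/r)) growth: shrinking lam by a factor
# of 100 should inflate the sum by roughly 100**(1/4) ~ 3.2.
ratio = spectral_sum(1e-6) / spectral_sum(1e-4)
```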

A linear approximation $\stackrel{\sim}{\eta}$ to the full-space minimizer ${\widehat{\eta}}^{\ast}$ is obtained by minimizing the quadratic approximation to (2.4),

$$\begin{array}{l}-\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}\{\eta ({W}_{i})-{s}_{n}^{-1}[\beta ,{\eta}_{0}](t){s}_{n}[\eta -{\eta}_{0};\beta ,{\eta}_{0}](t)\}d{N}_{i}(t)\\ \phantom{\rule{16em}{0ex}}+\frac{1}{2}V(\eta -{\eta}_{0})+\frac{\lambda}{2}J(\eta ).\end{array}$$

(6.1)

Let *η* = *Σ _{ν} η_{ν}ϕ_{ν}* and *η*_{0} = *Σ _{ν} η*_{ν,0}*ϕ_{ν}*, where the *ϕ_{ν}* are the eigenfunctions from Lemma 6.2, normalized so that *V*(*ϕ_{ν}, ϕ_{μ}*) = *δ_{νμ}* and *J*(*ϕ_{ν}, ϕ_{μ}*) = *ρ_{ν}δ_{νμ}*. Then (6.1) reduces to

$$\sum _{\nu}\{-{\eta}_{\nu}{\gamma}_{\nu}+\frac{1}{2}{({\eta}_{\nu}-{\eta}_{\nu ,0})}^{2}+\frac{\lambda}{2}{\rho}_{\nu}{\eta}_{\nu}^{2}\},$$

(6.2)

where
${\gamma}_{\nu}=\frac{1}{n}{\sum}_{i=1}^{n}{\mathit{\int}}_{{\rm T}}\{{\varphi}_{\nu}({W}_{i})-{s}_{n}^{-1}[\beta ,{\eta}_{0}](t){s}_{n}[{\varphi}_{\nu};\beta ,{\eta}_{0}](t)\}d{N}_{i}(t)$. The Fourier coefficients that minimize (6.2) are ${\stackrel{\sim}{\eta}}_{\nu}=({\gamma}_{\nu}+{\eta}_{\nu ,0})/(1+\lambda {\rho}_{\nu})$, and direct calculation gives

$$\begin{array}{c}E[V(\stackrel{\sim}{\eta}-{\eta}_{0})]=\frac{1}{n}\sum _{\nu}\frac{1}{{(1+\lambda {\rho}_{\nu})}^{2}}+\lambda \sum _{\nu}\frac{\lambda {\rho}_{\nu}}{{(1+\lambda {\rho}_{\nu})}^{2}}{\rho}_{\nu}{\eta}_{\nu ,0}^{2},\\ E[\lambda J(\stackrel{\sim}{\eta}-{\eta}_{0})]=\frac{1}{n}\sum _{\nu}\frac{\lambda {\rho}_{\nu}}{{(1+\lambda {\rho}_{\nu})}^{2}}+\lambda \sum _{\nu}\frac{{(\lambda {\rho}_{\nu})}^{2}}{{(1+\lambda {\rho}_{\nu})}^{2}}{\rho}_{\nu}{\eta}_{\nu ,0}^{2}.\end{array}$$

(6.3)

Combining Lemma 6.3 and (6.3), we obtain that $(V+\lambda J)(\stackrel{\sim}{\eta}-{\eta}_{0})={O}_{p}({n}^{-1}{\lambda}^{-1/r}+\lambda )$; under Condition A7, this is ${O}_{p}({n}^{-r/(r+1)})$.
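Each summand of (6.2) is a free-standing scalar quadratic in *η_{ν}*, so the minimization is coordinate-wise. A quick numerical check of the stationarity condition (a sketch, not the paper's code):

```python
def coeff_min(gamma_nu, eta_nu0, lam, rho_nu):
    """Minimizer of one summand of (6.2),
        -eta*gamma + 0.5*(eta - eta0)**2 + 0.5*lam*rho*eta**2,
    a scalar quadratic whose stationarity condition
        -gamma + (eta - eta0) + lam*rho*eta = 0
    gives eta = (gamma + eta0) / (1 + lam*rho)."""
    return (gamma_nu + eta_nu0) / (1.0 + lam * rho_nu)
```

Perturbing the returned value in either direction increases the summand, confirming it is the minimizer.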

We now investigate the approximation error ${\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}$ and prove the convergence rate of ${\widehat{\eta}}^{\ast}$. Let ${A}_{f,g}(\alpha )$ denote (2.4) evaluated at $\eta =f+\alpha g$, and ${B}_{f,g}(\alpha )$ denote (6.1) evaluated at $\eta =f+\alpha g$. Their derivatives at $\alpha =0$ are

$${\dot{A}}_{f,g}(0)=-\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}\{g({W}_{i})-{s}_{n}^{-1}[\beta ,f](t){s}_{n}[g;\beta ,f](t)\}d{N}_{i}(t)+\lambda J(f,g),$$

(6.4)

$$\begin{array}{lll}{\dot{B}}_{f,g}(0)& =& -\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}\{g({W}_{i})-{s}_{n}^{-1}[\beta ,{\eta}_{0}](t){s}_{n}[g;\beta ,{\eta}_{0}](t)\}d{N}_{i}(t)\\ & & +V(f-{\eta}_{0},g)+\lambda J(f,g).\end{array}$$

(6.5)

Set $f={\widehat{\eta}}^{\ast}$ and $g={\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}$ in (6.4), and set $f=\stackrel{\sim}{\eta}$ and $g={\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}$ in (6.5). Subtracting the resulting equations then gives

$$\begin{array}{l}{\mu}_{{\widehat{\eta}}^{\ast}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})-{\mu}_{\stackrel{\sim}{\eta}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})+\lambda J({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})\\ \phantom{\rule{8em}{0ex}}=V(\stackrel{\sim}{\eta}-{\eta}_{0},{\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})+{\mu}_{{\eta}_{0}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})-{\mu}_{\stackrel{\sim}{\eta}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}),\end{array}$$

(6.6)

where ${\mu}_{f}(g)\equiv \frac{1}{n}{\sum}_{i=1}^{n}{\mathit{\int}}_{{\rm T}}{s}_{n}^{-1}[\beta ,f](t){s}_{n}[g;\beta ,f](t)d{N}_{i}(t)$. Define

$${S}_{n}[f,g](t)=\frac{{s}_{n}[fg;\beta ,{\eta}_{0}](t)}{{s}_{n}[\beta ,{\eta}_{0}](t)}-\frac{{s}_{n}[f;\beta ,{\eta}_{0}](t)}{{s}_{n}[\beta ,{\eta}_{0}](t)}\frac{{s}_{n}[g;\beta ,{\eta}_{0}](t)}{{s}_{n}[\beta ,{\eta}_{0}](t)}$$

and let *S*[*f, g*](*t*) be its limit. The following lemma is needed to proceed.

$$\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}{S}_{n}[f,g](t)d{N}_{i}(t)=V(f,g)+{o}_{p}({\{(V+\lambda J)(f)(V+\lambda J)(g)\}}^{1/2}).$$

Let *f* = *Σ _{ν}f_{ν}ϕ_{ν}* and *g* = *Σ _{μ}g_{μ}ϕ_{μ}*. Note that *s_{n}* converges to *s* uniformly over [0, *τ*] by the law of large numbers, and that

$$M(t)\equiv M(t\mid Z)=N(t)-{\mathit{\int}}_{0}^{t}s[\beta ,{\eta}_{0}](\tau )d{\mathrm{\Lambda}}_{0}(\tau )$$

defines a local martingale with mean zero. Combining the above uniform convergence result and the martingale property with the boundedness condition, we obtain that for any *ν* and *μ*,

$$E\left[{\left\{{\mathit{\int}}_{{\rm T}}S[{\varphi}_{\nu},{\varphi}_{\mu}](t)dN(t)-V({\varphi}_{\nu},{\varphi}_{\mu})\right\}}^{2}\right]<\infty $$

Then from the Cauchy–Schwarz inequality and Lemma 6.3,

$$\begin{array}{l}\left|\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}{S}_{n}[f,g](t)\phantom{\rule{0.1em}{0ex}}d{N}_{i}(t)-V(f,g)\right|\\ \phantom{\rule{4em}{0ex}}=\left|\sum _{\nu}\sum _{\mu}{f}_{\nu}\phantom{\rule{0.1em}{0ex}}{g}_{\mu}\{\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}{S}_{n}[{\varphi}_{\nu},{\varphi}_{\mu}](t)\phantom{\rule{0.1em}{0ex}}d{N}_{i}(t)-V({\varphi}_{\nu},{\varphi}_{\mu})\}\right|\\ \phantom{\rule{4em}{0ex}}\le \{\sum _{\nu}\sum _{\mu}\frac{1}{1+\lambda {\rho}_{\nu}}\frac{1}{1+\lambda {\rho}_{\mu}}\\ \phantom{\rule{5.5em}{0ex}}{\times {\{\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}{S}_{n}[{\varphi}_{\nu},{\varphi}_{\mu}](t)\phantom{\rule{0.1em}{0ex}}d{N}_{i}(t)-V({\varphi}_{\nu},{\varphi}_{\mu})\}}^{2}\}}^{1/2}\\ \phantom{\rule{5em}{0ex}}\times {\left\{\sum _{\nu}\sum _{\mu}(1+\lambda {\rho}_{\nu})(1+\lambda {\rho}_{\mu})\phantom{\rule{0.1em}{0ex}}{f}_{\nu}^{2}\phantom{\rule{0.1em}{0ex}}{g}_{\mu}^{2}\right\}}^{1/2}\\ \phantom{\rule{4em}{0ex}}={O}_{p}({n}^{-1/2}{\lambda}^{-1/r}){\{(V+\lambda J)(f)(V+\lambda J)(g)\}}^{1/2}.\end{array}$$

A Taylor expansion at *η*_{0} gives

$$\begin{array}{c}{\mu}_{{\widehat{\eta}}^{\ast}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})-{\mu}_{\stackrel{\sim}{\eta}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})=\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}{S}_{n}[{\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta},{\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}](t)d{N}_{i}(t)(1+{o}_{p}(1)),\\ {\mu}_{\stackrel{\sim}{\eta}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})-{\mu}_{{\eta}_{0}}({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})=\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}{S}_{n}[\stackrel{\sim}{\eta}-{\eta}_{0},{\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta}](t)d{N}_{i}(t)(1+{o}_{p}(1)).\end{array}$$

Then by the mean value theorem, Condition A6, Lemma 6.4, and (6.6),

$$\begin{array}{l}({c}_{1}V+\lambda J)({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})(1+{o}_{p}(1))\\ \phantom{\rule{3em}{0ex}}\le {\{(|1-c|V+\lambda J)({\widehat{\eta}}^{\ast}-\stackrel{\sim}{\eta})\}}^{1/2}{O}_{p}({\{(|1-c|V+\lambda J)(\stackrel{\sim}{\eta}-{\eta}_{0})\}}^{1/2})\end{array}$$

for some $c\in [{c}_{1},{c}_{2}]$. The convergence rate of ${\widehat{\eta}}^{\ast}$ then follows from that of $\stackrel{\sim}{\eta}$ proved in the previous step.

Our last goal is the convergence rate for the minimizer $\widehat{\eta}$ in the finite-dimensional approximating space. For any *h* in this space, an argument parallel to the proof of Lemma 6.4 yields

$$\begin{array}{l}\left|\frac{1}{{q}_{n}}\sum _{l=1}^{{q}_{n}}{\mathit{\int}}_{{\rm T}}{S}_{{q}_{n}}[h,h](t)d{N}_{{i}_{l}}(t)-V(h)\right|\\ \phantom{\rule{8.5em}{0ex}}={O}_{p}({q}_{n}^{-1/2}{\lambda}^{-1/r})(V+\lambda J)(h)={o}_{p}(\lambda J(h)),\end{array}$$

(6.7)

where the last equality follows from the assumed growth rate of *q _{n}*.

Let ${\eta}^{\ast}$ be the projection of ${\widehat{\eta}}^{\ast}$ in the approximating space. Setting $f={\widehat{\eta}}^{\ast}$ and $g={\widehat{\eta}}^{\ast}-{\eta}^{\ast}$ in (6.4), and using $J({\eta}^{\ast},{\widehat{\eta}}^{\ast}-{\eta}^{\ast})=0$, gives

$$\begin{array}{l}\lambda J({\widehat{\eta}}^{\ast}-{\eta}^{\ast})=\{\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})({W}_{i})d{N}_{i}(t)-{\mu}_{{\eta}_{0}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\}\\ \phantom{\rule{13em}{0ex}}-\{{\mu}_{{\widehat{\eta}}^{\ast}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})-{\mu}_{{\eta}_{0}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\}\end{array}$$

(6.8)

Recall that
${\gamma}_{\nu}=\frac{1}{n}{\sum}_{i=1}^{n}{\mathit{\int}}_{{\rm T}}{\varphi}_{\nu}({W}_{i})d{N}_{i}(t)-{\mu}_{{\eta}_{0}}({\varphi}_{\nu})$ with *E*[*γ _{ν}*] = 0 and
$E\left[{\gamma}_{\nu}^{2}\right]=1/n$. An application of the Cauchy–Schwarz inequality and Lemma 6.3 shows that the first term in (6.8) is of order ${\{(V+\lambda J)({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\}}^{1/2}{O}_{p}({n}^{-1/2}{\lambda}^{-1/(2r)})$.

Note that $J({\widehat{\eta}}^{\ast}-{\eta}^{\ast},{\eta}^{\ast})=J({\widehat{\eta}}^{\ast}-{\eta}^{\ast},\widehat{\eta})=0$, so $J({\widehat{\eta}}^{\ast},{\widehat{\eta}}^{\ast}-\widehat{\eta})=J({\widehat{\eta}}^{\ast}-{\eta}^{\ast})+J({\eta}^{\ast},{\widehat{\eta}}^{\ast}-\widehat{\eta})$. Set $f=\widehat{\eta}$ and $g=\widehat{\eta}-{\eta}^{\ast}$ in (6.4), and set $f={\widehat{\eta}}^{\ast}$ and $g={\widehat{\eta}}^{\ast}-{\eta}^{\ast}$ in (6.4). Adding the resulting equations yields

$$\begin{array}{l}{\mu}_{\widehat{\eta}}(\widehat{\eta}-{\eta}^{\ast})-{\mu}_{{\eta}_{0}}(\widehat{\eta}-{\eta}^{\ast})+\lambda J(\widehat{\eta}-{\eta}^{\ast})+\lambda J({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\\ \phantom{\rule{4.5em}{0ex}}=\left\{\frac{1}{n}\sum _{i=1}^{n}{\mathit{\int}}_{{\rm T}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})({W}_{i})d{N}_{i}(t)-{\mu}_{{\eta}_{0}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\right\}\\ \phantom{\rule{6em}{0ex}}-\{{\mu}_{{\widehat{\eta}}^{\ast}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})-{\mu}_{{\eta}_{0}}({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\}+\{{\mu}_{{\widehat{\eta}}^{\ast}}(\widehat{\eta}-{\eta}^{\ast})-{\mu}_{{\eta}_{0}}(\widehat{\eta}-{\eta}^{\ast})\}.\end{array}$$

(6.9)

By the mean value theorem, Condition A6, and Lemma 6.4, the left-hand side of (6.9) is bounded from below by

$$({c}_{1}V+\lambda J)(\widehat{\eta}-{\eta}^{\ast})(1+{o}_{p}(1))+\lambda J({\widehat{\eta}}^{\ast}-{\eta}^{\ast}).$$

For the right-hand side, the terms in the first and second brackets are both of the order ${\{(V+\lambda J)({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\}}^{1/2}{O}_{p}({n}^{-1/2}{\lambda}^{-1/(2r)})$, while the term in the third bracket is bounded by

$${\{(V+\lambda J)(\widehat{\eta}-{\eta}^{\ast})\}}^{1/2}{o}_{p}({\{\lambda J({\widehat{\eta}}^{\ast}-{\eta}^{\ast})\}}^{1/2})$$

by Condition A3, Lemma 6.4, and (6.7). Putting all these together, one obtains $(V+\lambda J)(\widehat{\eta}-{\eta}^{\ast})={O}_{p}({n}^{-1}{\lambda}^{-1/r}+\lambda )$; the convergence rate of $\widehat{\eta}$ then follows from the rates established in the previous steps.

Let *P _{n}* be the empirical measure of the observed data and *P* its population counterpart, so that (*P _{n}* − *P*)*f* denotes the empirical process indexed by *f*.

*Let m*_{0}(*t, u, w; β, η*) = *u ^{T} β* + *η*(*w*), *with m*_{1} *and m*_{2} *the analogous functions indexed further by s* ∈ T *and h. Define*

$$\begin{array}{l}{\mathcal{M}}_{0}(\delta )=\{{m}_{0}:\parallel \beta -{\beta}_{0}\parallel \le \delta ,{\parallel \eta -{\eta}_{0}\parallel}_{2}\le \delta \},\\ {\mathcal{M}}_{1}(\delta )=\{{m}_{1}:s\in {\rm T},\parallel \beta -{\beta}_{0}\parallel \le \delta ,{\parallel \eta -{\eta}_{0}\parallel}_{2}\le \delta \},\\ {\mathcal{M}}_{2}(\delta )=\{{m}_{2}:s\in {\rm T},\parallel \beta -{\beta}_{0}\parallel \le \delta ,{\parallel \eta -{\eta}_{0}\parallel}_{2}\le \delta ,{\parallel h\parallel}_{2}\le \delta \}.\end{array}$$

*Then*
${J}_{\left[\right]}(\delta ,{\mathcal{M}}_{0},{L}_{2}(P))\le {c}_{0}{q}_{n}^{1/2}\delta $ *and* ${J}_{\left[\right]}(\delta ,{\mathcal{M}}_{j},{L}_{2}(P))\le {c}_{0}{q}_{n}^{1/2}\delta $ *for j* = 1, 2.

The proof is similar to that of Corollary A.1 in [10] and thus omitted here.

$$\underset{t\in {\rm T}}{\mathrm{sup}}|{s}^{-1}[{\beta}_{0},\widehat{\eta}](t)s[U;{\beta}_{0},\widehat{\eta}](t)-{s}_{n}^{-1}[{\beta}_{0},\widehat{\eta}](t){s}_{n}[U;{\beta}_{0},\widehat{\eta}](t)|={o}_{p}({n}^{-1/2}).$$

Write

$$\begin{array}{l}{s}^{-1}[{\beta}_{0},\widehat{\eta}](t)s[U;{\beta}_{0},\widehat{\eta}](t)-{s}_{n}^{-1}[{\beta}_{0},\widehat{\eta}](t){s}_{n}[U;{\beta}_{0},\widehat{\eta}](t)=\\ \phantom{\rule{13em}{0ex}}\frac{s[{\beta}_{0},\widehat{\eta}](t){A}_{1n}(t)-s[U;{\beta}_{0},\widehat{\eta}](t){A}_{2n}(t)}{s[{\beta}_{0},\widehat{\eta}](t){s}_{n}[{\beta}_{0},\widehat{\eta}](t)},\end{array}$$

where ${A}_{1n}(t)=s[U;{\beta}_{0},\widehat{\eta}](t)-{s}_{n}[U;{\beta}_{0},\widehat{\eta}](t)$ and ${A}_{2n}(t)=s[{\beta}_{0},\widehat{\eta}](t)-{s}_{n}[{\beta}_{0},\widehat{\eta}](t)$.

Let ${\gamma}_{n}={n}^{-1/2}+{a}_{n}$, where ${a}_{n}={\mathrm{max}}_{j}\{{p}_{{\theta}_{j}}^{\prime}(|{\beta}_{j0}|):{\beta}_{j0}\ne 0\}$, and let ${b}_{n}={\mathrm{max}}_{j}\{|{p}_{{\theta}_{j}}^{\mathrm{\prime \prime}}(|{\beta}_{j0}|)|:{\beta}_{j0}\ne 0\}$. It suffices to show that for any *δ* > 0 there exists a large constant *C* such that

$$P\{\underset{\parallel \upsilon \parallel =C}{\mathrm{sup}}{\mathcal{L}}_{P}({\beta}_{0}+{\gamma}_{n}\upsilon )<{\mathcal{L}}_{P}({\beta}_{0})\}\ge 1-\delta .$$

Consider ${\mathcal{L}}_{P}({\beta}_{0}+{\gamma}_{n}\upsilon )-{\mathcal{L}}_{P}({\beta}_{0})={D}_{n1}+{D}_{n2}$, where ${D}_{n1}$ collects the log partial likelihood terms and ${D}_{n2}$ the penalty terms. A Taylor expansion bounds the penalty difference by

$$\sqrt{s}{\gamma}_{n}{a}_{n}\parallel \upsilon \parallel +{\gamma}_{n}^{2}{b}_{n}{\parallel \upsilon \parallel}^{2}\le C{\gamma}_{n}^{2}(\sqrt{s}+{b}_{n}C),$$

(6.10)

where *s* is the number of nonzero elements in *β*_{0}.

Applying the second order Taylor expansion to *n*^{−1} *D*_{n1} gives

$${n}^{-1}{D}_{n1}={\gamma}_{n}{\mathbf{v}}^{T}{\mathbf{J}}_{1n}-\frac{1}{2}{\gamma}_{n}^{2}{\mathbf{v}}^{T}{\mathbf{J}}_{2n}\mathbf{v}+{o}_{p}({n}^{-1})$$

(6.11)

with
${\mathbf{J}}_{1n}={P}_{n}\{U-{s}_{n}^{-1}[{\beta}_{0},\widehat{\eta}](t){s}_{n}[U;{\beta}_{0},\widehat{\eta}](t)\}$, and
${\mathbf{J}}_{2n}={P}_{n}\{{s}_{n}^{-2}[{\beta}_{0},\widehat{\eta}](t)[{s}_{n}[{\mathit{UU}}^{T};{\beta}_{0},\widehat{\eta}](t){s}_{n}[{\beta}_{0},\widehat{\eta}](t)-{s}_{n}[U;{\beta}_{0},\widehat{\eta}](t){s}_{n}[U;{\beta}_{0},\widehat{\eta}]{(t)}^{T}]\}$, where *U*(*u, w*) ≡ *u*.

Let ${\overline{U}}_{n}={\sum}_{i=1}^{n}{U}_{i}/n$. Note that ${s}_{n}^{-1}[{\beta}_{0},\widehat{\eta}](t){s}_{n}[{\overline{U}}_{n};{\beta}_{0},\widehat{\eta}](t)={\overline{U}}_{n}$. We have

$$\begin{array}{ll}{\mathbf{J}}_{1n}& ={P}_{n}\{U-{\overline{U}}_{n}-{s}_{n}^{-1}[{\beta}_{0},\widehat{\eta}](t){s}_{n}[U-{\overline{U}}_{n};{\beta}_{0},\widehat{\eta}](t)\}\\ & \equiv {I}_{1n}+{I}_{2n}+{I}_{3n},\end{array}$$

(6.12)

where

$$\begin{array}{lll}{I}_{1n}& =& ({P}_{n}-P)\{U-{\overline{U}}_{n}-{s}^{-1}[{\beta}_{0},\widehat{\eta}](t)s[U-{\overline{U}}_{n};{\beta}_{0},\widehat{\eta}](t)\},\\ {I}_{2n}& =& {P}_{n}\{{s}^{-1}[{\beta}_{0},\widehat{\eta}](t)s[U;{\beta}_{0},\widehat{\eta}](t)-{s}_{n}^{-1}[{\beta}_{0},\widehat{\eta}](t){s}_{n}[U;{\beta}_{0},\widehat{\eta}](t)\},\\ {I}_{3n}& =& P\{U-{\overline{U}}_{n}-{s}^{-1}[{\beta}_{0},\widehat{\eta}](t)s[U-{\overline{U}}_{n};{\beta}_{0},\widehat{\eta}](t)\}.\end{array}$$

Lemma 3.4.2 of [27] and Lemma 6.5 indicate that ${I}_{1n}={o}_{p}({n}^{-1/2})$. The preceding lemma gives ${I}_{2n}={o}_{p}({n}^{-1/2})$ directly, and the convergence rate of $\widehat{\eta}$ controls ${I}_{3n}$.

Next, we shall show the sparsity of $\widehat{\beta}$. It suffices to show that, for any given ${\beta}_{1}$ satisfying $\parallel {\beta}_{1}-{\beta}_{10}\parallel ={O}_{p}({n}^{-1/2})$ and any ${\beta}_{j}$ in the zero part of ${\beta}_{0}$, the derivative of the penalized log partial likelihood has the sign opposite to that of ${\beta}_{j}$ in a shrinking neighborhood of 0. Direct differentiation gives

$${n}^{-1}\partial {\mathcal{L}}_{P}(\beta )/\partial {\beta}_{j}={P}_{n}\{{U}_{j}-{s}_{n}^{-1}[\beta ,\widehat{\eta}](t)\phantom{\rule{0.2em}{0ex}}{s}_{n}[{U}_{j};\beta ,\widehat{\eta}](t)\}-{p}_{{\theta}_{j}}^{\prime}(|{\beta}_{j}|)\mathrm{sgn}({\beta}_{j}).$$

Similar to bounding **J**_{1n}, the first term can be shown to be ${O}_{p}({n}^{-1/2})$. Since ${n}^{-1/2}={o}({\theta}_{j})$ and ${\mathrm{lim}\phantom{\rule{0.1em}{0ex}}\mathrm{inf}}_{{\beta}_{j}\to 0}\phantom{\rule{0.1em}{0ex}}{p}_{{\theta}_{j}}^{\prime}(|{\beta}_{j}|)/{\theta}_{j}>0$, the sign of the derivative is determined by sgn(*β_{j}*) near 0, which forces the maximizer to be exactly 0.
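The sparsity argument hinges on the SCAD derivative ${p}_{\theta}^{\prime}$ staying of the order *θ* near the origin while vanishing for large coefficients. A sketch of this derivative in the form given by [7], with the conventional choice *a* = 3.7 (an illustration, not the paper's code):

```python
import numpy as np

def scad_deriv(beta, theta, a=3.7):
    """First derivative p'_theta(beta) of the SCAD penalty for beta >= 0:
    equal to theta on [0, theta], decaying linearly to 0 on (theta, a*theta],
    and identically 0 beyond a*theta, so large coefficients are unpenalized."""
    beta = np.asarray(beta, dtype=float)
    return theta * (
        (beta <= theta)
        + np.maximum(a * theta - beta, 0.0) / ((a - 1.0) * theta) * (beta > theta)
    )
```

Near zero the derivative equals *θ*, which dominates the ${O}_{p}({n}^{-1/2})$ likelihood term and drives small coefficient estimates exactly to zero.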

Lastly, we show the asymptotic normality of ${\widehat{\beta}}_{1}$ using the result in [21]. Let ${z}_{i}$ denote the data observed for subject *i*. The estimate ${\widehat{\beta}}_{1}$ satisfies

$$\sum _{i=1}^{n}M({z}_{i},{\beta}_{1},\widehat{\eta})-n{\zeta}_{1}=0,$$

(6.13)

where $M(z,{\beta}_{1},\eta )=\mathit{\int}\{{U}_{1}-{s}_{n}^{-1}[{\beta}_{1},\eta ](t){s}_{n}[{U}_{1};{\beta}_{1},\eta ](t)\}dN(t)$ and ${\zeta}_{1}={({p}_{{\theta}_{1}}^{\prime}(|{\beta}_{1}|)\mathrm{sgn}({\beta}_{1}),\dots ,{p}_{{\theta}_{s}}^{\prime}\left(|{\beta}_{s}|\right)\mathrm{sgn}({\beta}_{s}))}^{T}$. Let

$$D(z,h)=\mathit{\int}\{\frac{{s}_{n}\left[{U}_{1}h;{\beta}_{10},{\eta}_{0}\right](t)}{{s}_{n}\left[{\beta}_{10},{\eta}_{0}\right](t)}-\frac{{s}_{n}\left[{U}_{1};{\beta}_{10},{\eta}_{0}\right](t)}{{s}_{n}\left[{\beta}_{10},{\eta}_{0}\right](t)}\frac{{s}_{n}\left[h;{\beta}_{10},{\eta}_{0}\right](t)}{{s}_{n}\left[{\beta}_{10},{\eta}_{0}\right](t)}\}dN(t)$$

be the Fréchet derivative of *M*(*z, β*_{10}, *η*) at *η*_{0}. Since the convergence rate of $\widehat{\eta}$ is *n*^{−r/[2(r+1)]} = *o*(*n*^{−1/4}), the linearization assumption (Assumption 5.1) in [21] is satisfied. A derivation similar to bounding (6.12) can verify the stochastic assumption (Assumption 5.2) in [21]. Direct calculation yields *E*[*D*(*z, η* − *η*_{0})] = 0 for *η* close to *η*_{0}. Then the mean-square continuity assumption (Assumption 5.3) in [21] also holds with *α*(*z*) ≡ 0. By Lemma 5.1 in [21], ${\widehat{\beta}}_{1}$ thus has the same distribution as the solution to the equation

$$\sum _{i=1}^{n}M({z}_{i},{\beta}_{1},{\eta}_{0})-n{\zeta}_{1}=\mathbf{0}.$$

A straightforward simplification yields the result.

We would like to thank the Associate Editor and two referees for their insightful comments that have improved the article.

Pang Du, Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA, Email: pangdu@vt.edu.

Shuangge Ma, Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA, Email: shuangge.ma@yale.edu.

Hua Liang, Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642, USA, Email: hliang@bst.rochester.edu.

1. Andersen PK, Gill RD. Cox’s regression model for counting processes: A large sample study. Ann Statist. 1982;10:1100–1120.

2. Breiman L. Heuristics of instability and stabilization in model selection. Ann Statist. 1996;24(6):2350–2383.

3. Cai J, Fan J, Jiang J, Zhou H. Partially linear hazard regression for multivariate survival data. J Amer Statist Assoc. 2007;102(478):538–551.

4. Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2005;92:303–316.

5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). Ann Statist. 2004;32(2):407–499.

6. Fan J, Gijbels I, King M. Local likelihood and local partial likelihood in hazard regression. Ann Statist. 1997;25(4):1661–1690.

7. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96(456):1348–1360.

8. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30(1):74–99.

9. Gu C. Smoothing Spline ANOVA Models. Springer-Verlag; New York: 2002.

10. Huang J. Efficient estimation of the partly linear additive Cox model. Ann Statist. 1999;27(5):1536–1563.

11. Huang JZ, Kooperberg C, Stone CJ, Truong YK. Functional ANOVA modeling for proportional hazards regression. Ann Statist. 2000;28(4):961–999.

12. Huang JZ, Liu L. Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics. 2006;62:793–802.

13. Johnson BA. Variable selection in semiparametric linear regression with censored data. J Roy Statist Soc Ser B. 2008;70(2):351–370.

14. Johnson BA, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. J Amer Statist Assoc. 2008;103(482):672–680.

15. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 2002.

16. Kim Y-J, Gu C. Smoothing spline Gaussian regression: More scalable computation via efficient approximation. J Roy Statist Soc Ser B. 2004;66:337–356.

17. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Springer-Verlag Inc; 1997.

18. Leng C, Zhang HH. Model selection in nonparametric hazard regression. Journal of Nonparametric Statistics. 2006;18(7-8):417–429.

19. Lin Y, Zhang HH. Component selection and smoothing in smoothing spline analysis of variance models. Ann Statist. 2006;34(5):2272–2297.

20. Murphy SA, van der Vaart AW. On profile likelihood (with discussion). J Amer Statist Assoc. 2000;95(450):449–465.

21. Newey WK. The asymptotic variance of semiparametric estimators. Econometrica. 1994;62:1349–1382.

22. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J Comput Graph Statist. 2000;9(2):319–337.

23. O’Sullivan F. Nonparametric estimation in the Cox model. Ann Statist. 1993;21:124–145.

24. Park T, Casella G. The Bayesian Lasso. J Amer Statist Assoc. 2008;103(482):681–686.

25. Tibshirani RJ. Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B. 1996;58:267–288.

26. Tibshirani R. The lasso method for variable selection in the Cox model. Statist in Med. 1997;16:385–395.

27. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer-Verlag Inc; 1996.

28. Wahba G. Spline Models for Observational Data. Volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM; Philadelphia: 1990.

29. Weinberger HF. Variational Methods for Eigenvalue Approximation. Volume 15 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM; Philadelphia: 1974.

30. Yin G, Li H, Zeng D. Partially linear additive hazards regression with varying coefficients. J Amer Statist Assoc. 2008;103(483):1200–1213.

31. Zou H. The adaptive LASSO and its oracle properties. J Amer Statist Assoc. 2006;101(476):1418–1429.

32. Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95(1):241–247.

33. Zou H, Hastie T, Tibshirani R. On the “degree of freedom” of the LASSO. Ann Statist. 2007;35(5):2173–2192.

34. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann Statist. 2008;36(4):1509–1533.

35. Zucker DM, Karr AF. Nonparametric survival analysis with time-dependent covariate effects: A penalized partial likelihood approach. Ann Statist. 1990;18:329–353.
