J Multivar Anal. Author manuscript; available in PMC 2010 April 28.

Published in final edited form as:

J Multivar Anal. 2009 March 1; 100(3): 345–362.

PMCID: PMC2860882

NIHMSID: NIHMS195772


The penalized profile sampler for semiparametric inference is an extension of the profile sampler method [9] obtained by profiling a penalized log-likelihood. The idea is to base inference on the posterior distribution obtained by multiplying a profiled penalized log-likelihood by a prior for the parametric component, where the profiling and penalization are applied to the nuisance parameter. Because the prior is not applied to the full likelihood, the method is not strictly Bayesian. A benefit of this approximately Bayesian method is that it circumvents the need to put a prior on the possibly infinite-dimensional nuisance components of the model. We investigate the first and second order frequentist performance of the penalized profile sampler, and demonstrate that the accuracy of the procedure can be adjusted by the size of the assigned smoothing parameter. The theoretical validity of the procedure is illustrated for two examples: a partly linear model with normal error for current status data and a semiparametric logistic regression model. Simulation studies are used to verify the theoretical results.

Semiparametric models are statistical models indexed by both a finite dimensional parameter of interest *θ* and an infinite dimensional nuisance parameter *η*. In order to make statistical inference about *θ* separately from *η*, we estimate the nuisance parameter with ${\widehat{\eta}}_{\theta}$, its maximum likelihood estimate at each fixed *θ*:

$${\widehat{\eta}}_{\theta}={\text{argmax}}_{\eta \in \mathcal{H}}{\mathit{lik}}_{n}(\theta ,\eta ),$$

where ${\mathit{lik}}_{n}(\theta ,\eta )$ is the likelihood of the semiparametric model given *n* observations. The profile likelihood for *θ* is then defined as

$${pl}_{n}(\theta )=\underset{\eta \in \mathcal{H}}{sup}{\mathit{lik}}_{n}(\theta ,\eta ).$$

The convergence rate of the nuisance parameter *η* is the order of $d({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n}},{\eta}_{0})$, where *d*(·, ·) is some metric on the parameter set of *η* and ${\stackrel{\sim}{\theta}}_{n}$ is any sequence satisfying ${\stackrel{\sim}{\theta}}_{n}={\theta}_{0}+{o}_{P}(1)$:

$$d({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n}},{\eta}_{0})={O}_{P}(||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||+{n}^{-r}),$$

(1)

where ||·|| is the Euclidean norm and *r >* 1/4. Of course, a smaller value of *r* leads to a slower convergence rate of the nuisance parameter. For instance, the nuisance parameter in the Cox proportional hazards model with right censored data, the cumulative hazard function, has the parametric rate, i.e., *r* = 1/2. If current status data is applied to the Cox model instead, then the convergence rate will be slower, with *r* = 1/3, due to the loss of information provided by this kind of data.

The profile sampler is the procedure of sampling from the posterior of the profile likelihood in order to estimate and draw inference on the parametric component *θ* in a semiparametric model, where the profiling is done over the possibly infinite-dimensional nuisance parameter *η*. [9] show that the profile sampler gives a first order correct approximation to the maximum likelihood estimator ${\widehat{\theta}}_{n}$ and consistent estimation of the efficient Fisher information for *θ*.

The motivation for studying second order asymptotic properties of the profile sampler comes from observed simulation differences in the Cox model with different types of data, i.e. right censored data [2] and current status data [9]. The profile sampler based on the first model yields much more accurate estimation results compared to the second model when the sample size is relatively small. [2] and [3] have explored the theoretical reasons behind this phenomenon by establishing the relation between the estimation accuracy of the profile sampler, measured in terms of second order asymptotics, and the convergence rate of the nuisance parameter. Specifically, the profile sampler generated from a semiparametric model with a faster convergence rate usually yields more precise frequentist inference for *θ*. These second order results are verified in [2] and [3] for several examples, including the proportional odds model, case-control studies with missing covariates, and the partly linear model; the convergence rates for these models range from the parametric rate (*r* = 1/2) to the cube root rate (*r* = 1/3). The work in [3] shows clearly that the accuracy of inference for *θ* based on the profile sampler method is intrinsically determined by the semiparametric model specifications through its entropy number.

In many semiparametric models involving a smooth nuisance parameter, it is often convenient and beneficial to perform estimation using penalization. One motivation for this is that, in the absence of any restrictions on the form of the function *η*, maximum likelihood estimation for some semiparametric models leads to over-fitting. Seminal applications of penalized maximum likelihood estimation include estimation of a probability density function in [18] and nonparametric linear regression in [19]. Note that penalized likelihood is a special case of penalized quasi-likelihood studied in [13]. Under certain reasonable regularity conditions, penalized semiparametric log-likelihood estimation can yield fully efficient estimates for *θ* (see, for example, [13]). As far as we are aware, the only general procedure for inference for *θ* in this context known to be theoretically valid is a weighted bootstrap with bounded random weights (see [11]). It is even unclear whether the usual nonparametric bootstrap will work in this context when the nuisance parameter has a convergence rate *r <* 1/2.

The purpose of this paper is to ask a natural question: does sampling from the exponential of a profiled penalized log-likelihood (a procedure we hereafter refer to as the penalized profile sampler) yield first and even second order accurate frequentist inference? The conclusion of this paper is that the answer is yes and, moreover, that the accuracy of the inference depends in a fairly simple way on the size of the smoothing parameter.

The unknown parameters in the semiparametric models we study in this paper include *θ*, which we assume belongs to some compact set $\mathrm{\Theta}\subset {\mathbb{R}}^{d}$, and the nuisance parameter *η*, which we assume belongs to $\mathcal{H}$, a class of smooth functions with Sobolev norm *J*(*η*) of degree *k*. The penalized log-likelihood is defined as

$$log{\mathit{lik}}_{{\lambda}_{n}}(\theta ,\eta )=log\mathit{lik}(\theta ,\eta )-n{\lambda}_{n}^{2}{J}^{2}(\eta ),$$

(2)

where $log\mathit{lik}(\theta ,\eta )$ is the log-likelihood of the *n* observations and *λ _{n}* is a smoothing parameter, possibly depending on the data.

For the purpose of establishing first order accuracy of inference for *θ* based on the penalized profile sampler, we assume that the smoothing parameter satisfies the following bounds:

$${\lambda}_{n}={o}_{P}({n}^{-1/4})\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{\lambda}_{n}^{-1}={O}_{P}({n}^{k/(2k+1)}).$$

(3)

The condition (3) is assumed to hold throughout this paper. One way to ensure (3) in practice is simply to set ${\lambda}_{n}={n}^{-k/(2k+1)}$, which satisfies both bounds for every *k* ≥ 1.
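Setting ${\lambda}_{n}={n}^{-k/(2k+1)}$, for example, satisfies both bounds in (3) for every *k* ≥ 1, as the following short calculation confirms:

```latex
% With \lambda_n = n^{-k/(2k+1)}:
n^{1/4}\,\lambda_n
  = n^{\frac{1}{4}-\frac{k}{2k+1}}
  = n^{-\frac{2k-1}{4(2k+1)}}
  \longrightarrow 0
  \qquad (k \ge 1),
\qquad
\lambda_n^{-1} = n^{k/(2k+1)} = O\!\big(n^{k/(2k+1)}\big).
```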

The log-profile penalized likelihood is defined as follows:

$$log{pl}_{{\lambda}_{n}}(\theta )=log\mathit{lik}(\theta ,{\widehat{\eta}}_{\theta ,{\lambda}_{n}})-n{\lambda}_{n}^{2}{J}^{2}({\widehat{\eta}}_{\theta ,{\lambda}_{n}}),$$

(4)

where ${\widehat{\eta}}_{\theta ,{\lambda}_{n}}$ is the maximizer of the penalized log-likelihood (2) over $\eta \in \mathcal{H}$ for each fixed *θ*.

Our conclusions about the penalized profile sampler can be summarized as follows:

- *Distribution Approximation*: The posterior distribution with respect to ${pl}_{{\lambda}_{n}}(\theta )$ can be approximated by the normal distribution with mean the maximum penalized likelihood estimator of *θ* and variance the inverse of the efficient information matrix, with error ${O}_{P}({n}^{1/2}{\lambda}_{n}^{2})$;
- *Moment Approximation*: The maximum penalized likelihood estimator of *θ* can be approximated by the mean of the MCMC chain with error ${O}_{P}({\lambda}_{n}^{2})$. The efficient information matrix can be approximated by the inverse of the variance of the MCMC chain with error ${O}_{P}({n}^{1/2}{\lambda}_{n}^{2})$;
- *Confidence Interval Approximation*: An exact frequentist confidence interval of Wald’s type for *θ* can be estimated by the credible set obtained from the MCMC chain with error ${O}_{P}({\lambda}_{n}^{2})$.

Obviously, given any smoothing parameter satisfying the upper bound in (3), the penalized profile sampler yields first order valid frequentist inference for *θ*, similar to what was shown for the profile sampler in [9]. Moreover, the above conclusions are actually second order frequentist valid results, whose approximation accuracy is directly controlled by the smoothing parameter. Note that the corresponding results for the usual (non-penalized) profile sampler with nuisance parameter convergence rate *r* in [3] are obtained by replacing ${O}_{P}({n}^{1/2}{\lambda}_{n}^{2})$ in the above with ${O}_{P}({n}^{1/2-2r})$ and ${O}_{P}({\lambda}_{n}^{2})$ with ${O}_{P}({n}^{-2r})$.

Our results are the first general higher order frequentist inference results for penalized semiparametric estimation. We also note, however, that some results on second order efficiency of semiparametric estimators were derived in [4]. The layout of the article is as follows. The next section, section 2, introduces the two main examples we will be using for illustration: partly linear regression for current status data and semiparametric logistic regression. Some background is given in section 3, including the concept of a least favorable submodel as well as the main model assumptions. One preliminary theorem concerning second order asymptotic expansions of the log-profile penalized likelihood is also presented in section 3. The main results and implications are discussed in section 4, and all remaining model assumptions are verified for the examples in section 5. A brief discussion of future work is given in section 6. We postpone all technical tools and proofs to the last section, section 7.

In this example, we study the partly linear regression model with normal residual error. The continuous outcome *Y*, conditional on the covariates $(U,V)\in {\mathbb{R}}^{d}\times \mathbb{R}$, is modeled as

$$Y={\theta}^{T}U+f(V)+\epsilon ,$$

(5)

where *f* is an unknown smooth function, and *ε* ~ *N*(0, *σ*^{2}) with finite variance *σ*^{2}. For simplicity, we assume for the rest of the paper that *σ* = 1. The theory we propose also works when *σ* is unknown, but the added complexity would detract from the main issues. We also assume that only the current status of the response *Y* is observed at a random censoring time $C\in \mathbb{R}$. In other words, we observe *X* = (*C*, Δ, *U*, *V*), where the indicator Δ = 1{*Y* ≤ *C*}. Current status data may occur due to study design or measurement limitations. Examples of such data arise in several fields, including demography, epidemiology and econometrics. For simplicity of exposition, *θ* is assumed to be one dimensional.
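To make the observation scheme concrete, the following sketch simulates current status data from model (5); the design (*θ* = 1, *f*(*v*) = sin(*πv*), *U* ~ *Unif*[0, 1], *V* ~ *Unif*[−1, 1], *C* ~ *Unif*[0, 2]) is the one used in the simulation study of section 5, and the function name is our own.

```python
import numpy as np

def simulate_current_status(n, theta=1.0, f=lambda v: np.sin(np.pi * v), seed=0):
    """Draw current-status observations X = (C, Delta, U, V) from model (5).

    The latent response Y = theta*U + f(V) + eps is never observed directly;
    only the indicator Delta = 1{Y <= C} at the censoring time C is recorded.
    """
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, n)
    V = rng.uniform(-1.0, 1.0, n)
    C = rng.uniform(0.0, 2.0, n)
    eps = rng.standard_normal(n)          # sigma = 1, as assumed in the text
    Y = theta * U + f(V) + eps            # latent continuous outcome
    Delta = (Y <= C).astype(int)          # current status indicator
    return C, Delta, U, V

C, Delta, U, V = simulate_current_status(500)
```

Note that *Y* itself is discarded: the likelihood (6) below depends on the data only through (*C*, Δ, *U*, *V*).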

Under the model (5), and given that the joint distribution of (*C*, *U*, *V*) does not involve the parameters (*θ*, *f*), the log-likelihood for a single observation at *X* = *x* ≡ (*c*, *δ*, *u*, *v*) is

$${\mathit{loglik}}_{\theta ,f}(x)=\delta log\{\mathrm{\Phi}(c-\theta u-f(v))\}+(1-\delta )log\{1-\mathrm{\Phi}(c-\theta u-f(v))\},$$

(6)

where Φ is the cdf of the standard normal distribution. The parameter of interest, *θ*, is assumed to belong to some compact set in ${\mathbb{R}}^{1}$. The nuisance parameter is the function *f*, which belongs to the Sobolev function class of degree *k*. We further make the following assumptions on this model. We assume that *Y* and *C* are independent given (*U*, *V*). The covariates (*U*, *V*) are assumed to belong to some compact set, and the support of the random censoring time *C* is an interval $[{l}_{c},{u}_{c}]$.

Let *X*_{1} = (*Y*_{1}, *W*_{1}, *Z*_{1}), *X*_{2} = (*Y*_{2}, *W*_{2}, *Z*_{2}), … be independent copies of *X* = (*Y*, *W*, *Z*), where *Y* is a dichotomous variable with conditional expectation $P(Y\mid W,Z)=F({\theta}^{T}W+\eta (Z))$, $F(u)={(1+{e}^{-u})}^{-1}$ is the logistic distribution function, and *η* is an unknown smooth function belonging to the Sobolev class of degree *k*. The likelihood of a single observation *X* = *x* ≡ (*y*, *w*, *z*) is

$${\mathit{lik}}_{\theta ,\eta}(x)=F{({\theta}^{T}w+\eta (z))}^{y}{(1-F({\theta}^{T}w+\eta (z)))}^{1-y}{f}^{(W,Z)}(w,z).$$

(7)

This example is a special case of quasi-likelihood in partly linear models when the conditional variance of the response *Y* is taken to be some quadratic function of the conditional mean of *Y*. In the absence of any restrictions on the form of the function *η*, maximum likelihood estimation of this simple model often leads to over-fitting. Hence [5] propose maximizing instead the penalized likelihood of the form $log\mathit{lik}(\theta ,\eta )-n{\lambda}_{n}^{2}{J}^{2}(\eta )$, and [13] showed the asymptotic consistency of the maximum penalized likelihood estimators of *θ* and *η*. For simplicity, we will restrict ourselves to the case where $\mathrm{\Theta}\subset {\mathbb{R}}^{1}$ and (*W*, *Z*) have bounded support, say [0, 1]^{2}. To ensure the identifiability of the parameters, we assume that *PVar*(*W|Z*) is positive and that the support of *Z* contains at least *k* distinct points in [0, 1]; see lemma 7.1 in [15].
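To fix ideas, the sketch below evaluates the penalized log-likelihood criterion of (2) for the logistic model (7), with *k* = 2. The grid representation of *η*, the linear interpolation, and the finite-difference approximation of ${J}^{2}(\eta )=\int {({\eta}^{″})}^{2}$ are our own illustrative choices, and the covariate density *f*^{(W,Z)} is dropped because it involves neither *θ* nor *η*.

```python
import numpy as np

def penalized_loglik(theta, eta_grid, y, w, z, lam, grid=None):
    """Penalized log-likelihood for the logistic model (7), cf. (2).

    eta_grid holds the values of eta on an equally spaced grid over [0, 1];
    eta(z_i) is obtained by linear interpolation, and the Sobolev penalty
    J^2(eta) = int (eta'')^2 (k = 2) is approximated by second differences.
    The covariate density f^{(W,Z)} is omitted: it involves neither theta
    nor eta.
    """
    n = len(y)
    if grid is None:
        grid = np.linspace(0.0, 1.0, len(eta_grid))
    eta_z = np.interp(z, grid, eta_grid)          # eta evaluated at the data
    lin = theta * w + eta_z
    # sum of y*log F(lin) + (1-y)*log(1-F(lin)) for the logistic cdf F,
    # written in the numerically stable form y*lin - log(1 + e^{lin})
    loglik = np.sum(y * lin - np.log1p(np.exp(lin)))
    h = grid[1] - grid[0]
    second_diff = np.diff(eta_grid, n=2) / h**2   # eta'' on interior points
    J2 = np.sum(second_diff**2) * h               # approx. int (eta'')^2 dz
    return loglik - n * lam**2 * J2
```

A linear *η* has zero penalty, so the penalized and unpenalized criteria agree; any curvature in *η* is charged at rate $n{\lambda}_{n}^{2}$.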

Another interesting potential example to which we may apply the penalized profile sampler method is the classic proportional hazards model with current status data, penalizing the cumulative hazard function by its Sobolev norm. There are two motivations for penalizing the cumulative hazard function in the Cox model. One is that the estimated step functions from the unpenalized estimation cannot easily be used for other estimation or inference purposes. The other is that, without stronger continuity assumptions, the unpenalized approach cannot achieve uniform consistency even on a compact set [10]. The asymptotic properties of the corresponding penalized M-estimators have been studied in [12].

In this section, we present some necessary preliminary material concerning least favorable submodels, and state the structural requirements needed to achieve the second order asymptotic expansion (21) of the log-profile penalized likelihood.

In this subsection, we briefly review the concept of a least favorable submodel. A submodel $t\mapsto {\mathit{lik}}_{t,{\eta}_{t}}$ is defined to be least favorable at $(\theta ,\eta )$ if its score for *t* at $t=\theta $ equals the efficient score function for *θ*. In terms of such a submodel, the log-profile penalized likelihood can be rewritten as

$$log{pl}_{{\lambda}_{n}}(\theta )=n({\mathbb{P}}_{n}\ell (\theta ,\theta ,{\widehat{\eta}}_{\theta ,{\lambda}_{n}})-{\lambda}_{n}^{2}{J}^{2}({\eta}_{\theta}(\theta ,{\widehat{\eta}}_{\theta ,{\lambda}_{n}}))),$$

(8)

where $\ell (t,\theta ,\eta )(x)\equiv log{\mathit{lik}}_{t,{\eta}_{t}(\theta ,\eta )}(x)$, and $t\mapsto {\eta}_{t}(\theta ,\eta )$ is a map into the nuisance parameter set satisfying ${\eta}_{\theta}(\theta ,\eta )=\eta $.

The derivatives of the function $\ell (t,\theta ,\eta )$ are taken with respect to its first argument, *t*. For derivatives relative to the argument *θ*, we use the shortened notation ${\ell}_{t,\theta}(t,\theta ,\eta )\equiv (\partial /\partial \theta )\stackrel{.}{\ell}(t,\theta ,\eta )$, and similarly for higher order mixed derivatives such as ${\ell}_{t,t,\theta}$ and ${\ell}_{t,\theta ,\theta}$; ${\ell}^{(3)}$ denotes the third derivative with respect to *t*. We also abbreviate ${\stackrel{.}{\ell}}_{0}\equiv \stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},{\eta}_{0})$ and ${\ddot{\ell}}_{0}\equiv \ddot{\ell}({\theta}_{0},{\theta}_{0},{\eta}_{0})$, and write ${\stackrel{\sim}{\ell}}_{0}$ for the efficient score at the true parameters.

The set of structural conditions about the least favorable submodel are the “no-bias” conditions:

$$P\stackrel{.}{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}{({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)}^{2},$$

(9)

$$P\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})=P{\ddot{\ell}}_{0}+{O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(10)

for any sequence ${\stackrel{\sim}{\theta}}_{n}$ satisfying

$$d({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}},{\eta}_{0})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||).$$

(11)

The form of *d*(*η*, *η*_{0}) may vary for different situations and does not need to be specified in this subsection beyond the given conditions. Condition (11) implies that ${\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}$ is consistent for ${\eta}_{0}$ whenever ${\stackrel{\sim}{\theta}}_{n}$ is consistent for ${\theta}_{0}$. Condition (9) is usually verified by first establishing that

$$P\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},\eta )=O({d}^{2}(\eta ,{\eta}_{0})),$$

(12)

which is usually implied by a bounded Fréchet derivative of $\eta \mapsto \stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},\eta )$ and second order Fréchet differentiability of the map $\eta \mapsto log\mathit{lik}({\theta}_{0},\eta )$.

The empirical version of the no-bias conditions,

$${\mathbb{P}}_{n}\stackrel{.}{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={\mathbb{P}}_{n}{\stackrel{\sim}{\ell}}_{0}+{O}_{P}{({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)}^{2},$$

(13)

$${\mathbb{P}}_{n}\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})=P{\ddot{\ell}}_{0}+{O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(14)

where ${\mathbb{P}}_{n}$ represents the empirical distribution of the observations, ensures that the penalized profile likelihood asymptotically behaves like a penalized likelihood in a parametric model and therefore yields a second order asymptotic expansion of the penalized profile log-likelihood. Obviously the empirical no-bias conditions are built upon (9) and (10) by assuming that the sizes of the collections of the functions $\stackrel{.}{\ell}$ and $\ddot{\ell}$ are manageable. This condition is expressed in the language of empirical processes. In particular, (14) follows from (10) provided that

$${\mathbb{G}}_{n}(\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\ddot{\ell}}_{0})={o}_{P}(1),$$

(15)

where ${\mathbb{G}}_{n}\equiv \sqrt{n}({\mathbb{P}}_{n}-P)$ is used for the empirical processes of the observations. If we further assume that

$${\mathbb{G}}_{n}({\ell}_{t,\theta}({\theta}_{0},{\overline{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\ell}_{t,\theta}({\theta}_{0},{\theta}_{0},{\eta}_{0}))={o}_{P}(1),$$

(16)

$${\mathbb{G}}_{n}(\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\stackrel{.}{\ell}}_{0})={O}_{P}({n}^{{\scriptstyle \frac{1}{4k+2}}}({\lambda}_{n}+\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|)),$$

(17)

for any sequence ${\stackrel{\sim}{\theta}}_{n}$ satisfying ${\stackrel{\sim}{\theta}}_{n}={\theta}_{0}+{o}_{P}(1)$ and any ${\overline{\theta}}_{n}$ on the line segment between ${\theta}_{0}$ and ${\stackrel{\sim}{\theta}}_{n}$, then (13) follows from (9).

**Theorem 1**. Let (13) and (14) be satisfied and suppose that

$$({\mathbb{P}}_{n}-P){\ell}^{(3)}({\overline{\theta}}_{n},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={o}_{P}(1),$$

(18)

$${\lambda}_{n}J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(19)

for any sequences ${\stackrel{\sim}{\theta}}_{n}$ and ${\overline{\theta}}_{n}$ satisfying ${\stackrel{\sim}{\theta}}_{n}={\theta}_{0}+{o}_{P}(1)$ and ${\overline{\theta}}_{n}={\theta}_{0}+{o}_{P}(1)$. If ${\theta}_{0}$ is an interior point of Θ and ${\widehat{\theta}}_{{\lambda}_{n}}$ is consistent, then we have

$$\sqrt{n}({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})=\frac{1}{\sqrt{n}}\sum _{i=1}^{n}{\stackrel{\sim}{I}}_{0}^{-1}{\stackrel{\sim}{\ell}}_{0}({X}_{i})+{O}_{P}({n}^{1/2}{\lambda}_{n}^{2}),$$

(20)

$$log{pl}_{{\lambda}_{n}}({\stackrel{\sim}{\theta}}_{n})=log{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})-\frac{n}{2}{({\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}})}^{T}{\stackrel{\sim}{I}}_{0}({\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}})+{O}_{P}({g}_{{\lambda}_{n}}(\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}}\left|\right|)),$$

(21)

where ${g}_{{\lambda}_{n}}(w)=n{w}^{3}+n{w}^{2}{\lambda}_{n}+nw{\lambda}_{n}^{2}+{n}^{1/2}{\lambda}_{n}^{2}$, provided the efficient information ${\stackrel{\sim}{I}}_{0}$ is positive definite.

For the verification of (18), we need to make use of a Glivenko–Cantelli theorem for classes of functions that change with *n*, which is a modification of theorem 2.4.3 in [22] and is explained in the appendix. Moreover, (19) implies that $J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}(1+{\lambda}_{n}^{-1}||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)$, which by (3) is ${O}_{P}(1)$ whenever ${\stackrel{\sim}{\theta}}_{n}$ is $\sqrt{n}$-consistent.

The results in theorem 1 are useful in their own right for inference about *θ*. In particular, (20) is a second order frequentist result for penalized semiparametric estimation, establishing the asymptotic linearity of the maximum penalized likelihood estimator of *θ*.

We now state the main results on the penalized posterior profile distribution. A preliminary result, theorem 2 with corollary 1 below, shows that the penalized posterior profile distribution is asymptotically close enough to the distribution of a normal random variable with mean ${\widehat{\theta}}_{{\lambda}_{n}}$ and variance ${(n{\stackrel{\sim}{I}}_{0})}^{-1}$.

Let ${\stackrel{\sim}{P}}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}$ be the penalized posterior profile distribution of *θ* with respect to the prior *ρ*(*θ*). Define

$${\mathrm{\Delta}}_{{\lambda}_{n}}(\theta )={n}^{-1}\{log{pl}_{{\lambda}_{n}}(\theta )-log{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})\}.$$

**Theorem 2**. Let (20) and (21) be satisfied and suppose that

$${\mathrm{\Delta}}_{{\lambda}_{n}}({\stackrel{\sim}{\theta}}_{n})={o}_{P}(1)\phantom{\rule{0.38889em}{0ex}}\mathit{implies}\phantom{\rule{0.38889em}{0ex}}{\stackrel{\sim}{\theta}}_{n}={\theta}_{0}+{o}_{P}(1),$$

(22)

for every random sequence $\{{\stackrel{\sim}{\theta}}_{n}\}\subset \mathrm{\Theta}$. If the prior *ρ* is proper, *ρ*(*θ*_{0}) > 0, and *ρ*(·) has a continuous and finite first order derivative in some neighborhood of *θ*_{0}, then we have

$$\underset{\xi \in {\mathbb{R}}^{d}}{sup}|{\stackrel{\sim}{P}}_{\theta |\stackrel{\sim}{X}}^{{\lambda}_{n}}(\sqrt{n}{\stackrel{\sim}{I}}_{0}^{1/2}(\theta -{\widehat{\theta}}_{{\lambda}_{n}})\le \xi )-{\mathrm{\Phi}}_{d}(\xi )|={O}_{P}({n}^{1/2}{\lambda}_{n}^{2}),$$

(23)

where Φ_{d}(·) is the distribution function of the *d*-dimensional standard normal random variable.

**Corollary 1**. Under the assumptions of theorem 2, we have that if *θ* has a finite second absolute moment, then

$${\widehat{\theta}}_{{\lambda}_{n}}={E}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\theta )+{O}_{P}({\lambda}_{n}^{2}),$$

(24)

$${\stackrel{\sim}{I}}_{0}={n}^{-1}{(Va{r}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\theta ))}^{-1}+{O}_{P}({n}^{1/2}{\lambda}_{n}^{2}),$$

(25)

where ${E}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\theta )$ and $Va{r}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\theta )$ are the penalized posterior profile mean and penalized posterior profile covariance matrix, respectively.

We now present another second order asymptotic frequentist property of the penalized profile sampler, in terms of quantiles. The *α*-th quantile of the penalized posterior profile distribution, *τ _{nα}*, is defined as ${\tau}_{n\alpha}=inf\{\xi :{\stackrel{\sim}{P}}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\theta \le \xi )\ge \alpha \}$, where the infimum is taken componentwise. Without loss of generality, we can assume ${\stackrel{\sim}{P}}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\theta \le {\tau}_{n\alpha})=\alpha $ because of the assumed smoothness of both the prior and the likelihood in our setting. We can also define ${\kappa}_{n\alpha}\equiv \sqrt{n}({\tau}_{n\alpha}-{\widehat{\theta}}_{{\lambda}_{n}})$, i.e., ${\stackrel{\sim}{P}}_{\theta \mid \stackrel{\sim}{X}}^{{\lambda}_{n}}(\sqrt{n}(\theta -{\widehat{\theta}}_{{\lambda}_{n}})\le {\kappa}_{n\alpha})=\alpha $. Note that neither ${\tau}_{n\alpha}$ nor ${\kappa}_{n\alpha}$ is unique when the dimension of *θ* is larger than one.

**Theorem 3**. Under the assumptions of theorem 2, and assuming that ${\stackrel{\sim}{\ell}}_{0}(X)$ has a finite third moment with a nondegenerate distribution, there exists a ${\widehat{\kappa}}_{n\alpha}$ based on the data such that $P(\sqrt{n}({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})\le {\widehat{\kappa}}_{n\alpha})=\alpha $ and ${\widehat{\kappa}}_{n\alpha}-{\kappa}_{n\alpha}={O}_{P}({n}^{1/2}{\lambda}_{n}^{2})$ for each choice of ${\kappa}_{n\alpha}$.

Theorem 3 ensures that, for each fixed ${\tau}_{n\alpha}$, there exists a unique frequentist *α*-th quantile for *θ* up to ${O}_{P}({\lambda}_{n}^{2})$. Note that ${\tau}_{n\alpha}$ is not unique if the dimension of *θ* is larger than one.

Theorem 2, corollary 1 and theorem 3 above show that the penalized profile sampler generates second order asymptotic frequentist valid results in terms of distributions, moments and quantiles. Moreover, the second order accuracy of this procedure is controlled by the smoothing parameter.
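The moment and quantile approximations can be illustrated numerically. In the sketch below, a normal sample stands in for an actual MCMC chain from the penalized posterior, and the function name is our own.

```python
import numpy as np

def wald_and_credible(chain, alpha=0.05):
    """Compare a Wald-type interval built from the chain's first two moments
    with the equal-tailed credible interval taken from the chain's quantiles.

    Theorems 2-3 say these agree up to O_P(lambda_n^2) when the chain is
    drawn from the penalized posterior profile distribution of theta.
    """
    mean, sd = chain.mean(), chain.std()
    zcrit = 1.959963984540054                      # Phi^{-1}(0.975)
    wald = (mean - zcrit * sd, mean + zcrit * sd)
    cred = tuple(np.quantile(chain, [alpha / 2, 1 - alpha / 2]))
    return wald, cred

# synthetic stand-in for an MCMC chain from the penalized posterior
chain = np.random.default_rng(2).normal(1.0, 0.1, size=20_000)
wald, cred = wald_and_credible(chain)
```

For an exactly normal chain the two intervals coincide up to Monte Carlo error; for a chain from the penalized posterior the gap is the ${O}_{P}({\lambda}_{n}^{2})$ term of theorem 3.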

Another interpretation of the role of λ_{n} in the penalized profile sampler is that λ_{n} induces a prior on J(η), or on η to some extent. To see this, we can write lik_{λn} (θ, η) in the following form:

$${\mathit{lik}}_{{\lambda}_{n}}(\theta ,\eta )={\mathit{lik}}_{n}(\theta ,\eta )\times exp\left[-\frac{{J}^{2}(\eta )}{2({\scriptstyle \frac{1}{2n{\lambda}_{n}^{2}}})}\right]$$

This idea can be traced back to [23]. In other words, the prior on J(η) is a normal distribution with mean zero and variance ${(2n{\lambda}_{n}^{2})}^{-1}$. Hence it is natural to expect λ_{n} to have some effect on the convergence rate of η. Other possible priors on the functional parameter include Dirichlet and Gaussian processes, which are more commonly used in nonparametric Bayesian methodology.

We now illustrate verification of the assumptions in section 3.2 with the two examples that were introduced in section 2. Thus this section is a continuation of the earlier examples.

In this section we verify the regularity conditions for the partly linear model with current status data as well as present a small simulation study to gain insight into the moderate sample size agreement with the asymptotic theory.

We will concentrate on the estimation of the regression coefficient *θ*, considering the infinite dimensional parameter
$f\in {\mathcal{H}}_{k}^{M}$ as a nuisance parameter. The strengthened condition on *f*, together with the requirement that the density of the joint distribution of (*U*, *V*, *C*) is strictly positive and finite, is needed to verify the rate assumptions (27) and (28) in lemma 1 below. The score function for *θ*, ${\stackrel{.}{\ell}}_{\theta ,f}$, is given as follows:

$${\stackrel{.}{\ell}}_{\theta ,f}(x)=uQ(x;\theta ,f),$$

where

$$Q(X;\theta ,f)=(1-\mathrm{\Delta})\frac{\phi ({q}_{\theta ,f}(X))}{1-\mathrm{\Phi}({q}_{\theta ,f}(X))}-\mathrm{\Delta}\frac{\phi ({q}_{\theta ,f}(X))}{\mathrm{\Phi}({q}_{\theta ,f}(X))},$$

${q}_{\theta ,f}(X)\equiv C-\theta U-f(V)$, and *φ* is the density of the standard normal distribution. The least favorable direction at the true parameters is

$${h}_{0}(v)=\frac{{E}_{0}(U{Q}^{2}(X;\theta ,f)\mid V=v)}{{E}_{0}({Q}^{2}(X;\theta ,f)\mid V=v)},$$

where *E*_{0} is the expectation relative to the true parameters. The derivation of ${\stackrel{.}{\ell}}_{\theta ,f}$ and of the least favorable direction ${h}_{0}$ follows from standard projection arguments. The least favorable submodel can then be written as

$$\ell (t,\theta ,f)=log\mathit{lik}(t,{f}_{t}(\theta ,f)),$$

(26)

where ${f}_{t}(\theta ,f)=f+(\theta -t){h}_{0}$.

**Lemma 1**. Under the above set-up for the partly linear normal model with current status data, we have, for ${\lambda}_{n}$ satisfying (3) and any ${\stackrel{\sim}{\theta}}_{n}\stackrel{p}{\to}{\theta}_{0}$,

$${\left|\right|{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}-{f}_{0}\left|\right|}_{2}={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(27)

$${\lambda}_{n}J({\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(28)

where ||·||_{2} represents the regular L_{2} norm. Moreover, if we also assume that $f\in \{g:{||g||}_{\infty}+J(g)\le M\}$ for some known $M<\infty $, then

$${\left|\right|{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n}}-{f}_{0}\left|\right|}_{2}={O}_{P}({n}^{-k/(2k+1)}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(29)

provided condition (3) holds.

Comparing (27) and (29), lemma 1 implies that the convergence rate of the penalized estimator of the nuisance parameter is slower than that of the regular (non-penalized) estimator. This result is not surprising, since the slower rate is the trade-off for a smoother nuisance parameter estimator. However, the advantage of the penalized profile sampler is that we can control the convergence rate by assigning different rates to the smoothing parameter. To obtain the convergence rate of the non-penalized estimated nuisance parameter, we would need to assume that the Sobolev norm of the nuisance parameter has some known upper bound. Thus the penalized method enables a relaxation of the assumptions needed for the nuisance parameter. Lemma 1 also indicates that ${||{\widehat{f}}_{{\widehat{\theta}}_{{\lambda}_{n}},{\lambda}_{n}}-{f}_{0}||}_{2}={O}_{P}({\lambda}_{n})$ and ${||{\widehat{f}}_{{\widehat{\theta}}_{n}}-{f}_{0}||}_{2}={O}_{P}({n}^{-k/(2k+1)})$. Note that the convergence rate of the maximum penalized likelihood estimator, ${O}_{P}({\lambda}_{n})$, is deemed the optimal rate in [23]. Similar remarks also hold for lemma 4 in the semiparametric logistic regression example below.

Lemmas 1 and 4 imply that $J({\widehat{f}}_{{\widehat{\theta}}_{{\lambda}_{n}},{\lambda}_{n}})={O}_{P}(1)$ and $J({\widehat{\eta}}_{{\widehat{\theta}}_{{\lambda}_{n}},{\lambda}_{n}})={O}_{P}(1)$, respectively. Thus the maximum penalized likelihood estimators of the nuisance parameters in the two examples of this paper are consistent in the uniform norm, i.e. ${||{\widehat{f}}_{{\widehat{\theta}}_{{\lambda}_{n}},{\lambda}_{n}}-{f}_{0}||}_{\infty}={o}_{P}(1)$ and ${||{\widehat{\eta}}_{{\widehat{\theta}}_{{\lambda}_{n}},{\lambda}_{n}}-{\eta}_{0}||}_{\infty}={o}_{P}(1)$, since these sequences of estimators consist of smooth functions defined on a compact set with asymptotically bounded first order derivatives.

**Lemma 2**. Under the above set-up for the partly linear normal model with current status data, assumptions (13), (14) and (18) are satisfied.

**Lemma 3**. Under the above set-up for the partly linear normal model with current status data, condition (22) is satisfied.

In this subsection, we conducted simulations for the partly linear model with two different sizes of the smoothing parameter, i.e. ${\lambda}_{n}={n}^{-1/3}$ and ${\lambda}_{n}={n}^{-2/5}$, corresponding to *k* = 1 and *k* = 2.

We next discuss the computation of ${\widehat{f}}_{\theta ,{\lambda}_{n}}$ in the simulations. For each fixed *θ*, the maximizer of the penalized log-likelihood over *f* is a spline-type estimator and can be computed iteratively by a Newton-type algorithm.

In the following, the simulations are run for various sample sizes under a Lebesgue prior. For each sample size, 200 datasets were analyzed. The regression coefficient is *θ* = 1 and *f*(*v*) = sin(*πv*). We generate *U* ~ *Unif*[0, 1], *V* ~ *Unif*[−1, 1] and *C* ~ *Unif*[0, 2]. For each dataset, Markov chains of length 20,000 with a burn-in period of 5,000 were generated using the Metropolis algorithm. The jumping density for the coefficient was normal with mean the current iteration and variance tuned to yield an acceptance rate of 20%–40%. The approximate variance of the estimator of *θ* was computed by numerical differentiation with step size proportional to *n*^{−1/3} (*n*^{−2/5}) for the model with smoothing parameter ${\lambda}_{n}={n}^{-1/3}$ (${\lambda}_{n}={n}^{-2/5}$).
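The Metropolis step just described can be sketched as follows. A toy quadratic stands in for the actual log-profile penalized likelihood $log{pl}_{{\lambda}_{n}}(\theta )$ (computing the real one requires maximizing over *f* at each proposed *θ*), and a fixed step size replaces the adaptive variance tuning used in the simulations.

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=20_000, burn=5_000, step=0.1, seed=0):
    """Random-walk Metropolis for a one-dimensional posterior.

    The jumping density is normal, centered at the current iterate; `step`
    would be tuned in practice until the acceptance rate is roughly 20-40%.
    """
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    theta, lp = theta0, log_post(theta0)
    accepted = 0
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject step
            theta, lp = prop, lp_prop
            accepted += 1
        chain[i] = theta
    return chain[burn:], accepted / n_iter

# Toy stand-in for log pl_{lambda_n}(theta): a quadratic centered at 1.0,
# mimicking the normal limit in (23); profiling over f is omitted here.
log_pl = lambda th: -0.5 * (th - 1.0) ** 2 / 0.05 ** 2
chain, acc_rate = metropolis(log_pl, theta0=0.0, step=0.15)
```

The post-burn-in mean and variance of `chain` are then the penalized posterior profile mean and variance appearing in (24) and (25).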

Table 1 (Table 2) below summarizes the simulation results for *θ* with smoothing parameter ${\lambda}_{n}={n}^{-1/3}$ (${\lambda}_{n}={n}^{-2/5}$).

In the semiparametric logistic regression model, we can obtain the score function for *θ* by an analysis similar to that in the first example, i.e. ${\stackrel{.}{\ell}}_{\theta ,\eta}(x)=(y-F(\theta w+\eta (z)))w$. The least favorable direction at the true parameters is

$${h}_{0}(z)=\frac{{P}_{0}[W\stackrel{.}{F}({\theta}_{0}W+{\eta}_{0}(Z))\mid Z=z]}{{P}_{0}[\stackrel{.}{F}({\theta}_{0}W+{\eta}_{0}(Z))\mid Z=z]},$$

where $\stackrel{.}{F}(u)=F(u)(1-F(u))$. The above assumptions, plus the requirement that *J*(*h*_{0}) < ∞, ensure the identifiability of the parameters. Thus the least favorable submodel can be written as:

$$\ell (t,\theta ,\eta )=log\mathit{lik}(t,{\eta}_{t}(\theta ,\eta )),$$

where ${\eta}_{t}(\theta ,\eta )=\eta +(\theta -t){h}_{0}$. Direct calculations yield the following derivatives:

$$\begin{array}{l}\stackrel{.}{\ell}(t,\theta ,\eta )=(y-F(tw+\eta (z)+(\theta -t){h}_{0}(z)))(w-{h}_{0}(z)),\\ \ddot{\ell}(t,\theta ,\eta )=-\stackrel{.}{F}(tw+\eta (z)+(\theta -t){h}_{0}(z)){(w-{h}_{0}(z))}^{2},\\ {\ell}_{t,\theta}(t,\theta ,\eta )=-\stackrel{.}{F}(tw+\eta (z)+(\theta -t){h}_{0}(z))(w-{h}_{0}(z)){h}_{0}(z),\\ {\ell}^{(3)}(t,\theta ,\eta )=-\ddot{F}(tw+\eta (z)+(\theta -t){h}_{0}(z)){(w-{h}_{0}(z))}^{3},\\ {\ell}_{t,t,\theta}(t,\theta ,\eta )=-\ddot{F}(tw+\eta (z)+(\theta -t){h}_{0}(z)){(w-{h}_{0}(z))}^{2}{h}_{0}(z),\\ {\ell}_{t,\theta ,\theta}(t,\theta ,\eta )=-\ddot{F}(tw+\eta (z)+(\theta -t){h}_{0}(z))(w-{h}_{0}(z)){h}_{0}^{2}(z),\end{array}$$

where $\ddot{F}(\cdot )$ is the second derivative of the function *F*(·). The rate assumptions will be shown in lemma 4. The remaining assumptions are verified in the last two lemmas:

**Lemma 4**. Under the above set-up for the semiparametric logistic regression model, we have, for ${\lambda}_{n}$ satisfying condition (3) and any ${\stackrel{\sim}{\theta}}_{n}\stackrel{p}{\to}{\theta}_{0}$, that

$${\left|\right|{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}-{\eta}_{0}\left|\right|}_{2}={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(30)

$${\lambda}_{n}J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||).$$

(31)

If we also assume that $\eta \in \{g:{||g||}_{\infty}+J(g)\le M\}$ for some known $M<\infty $, then

$${\left|\right|{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n}}-{\eta}_{0}\left|\right|}_{2}={O}_{P}({n}^{-k/(2k+1)}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),$$

(32)

provided condition (3) holds.

**Lemma 5**. Under the above set-up for the semiparametric logistic regression model, assumptions (13), (14) and (18) are satisfied.

**Lemma 6**. Under the above set-up for the semiparametric logistic regression model, condition (22) is satisfied.

Our paper evaluates the penalized profile sampler method from the frequentist viewpoint and discusses the effect of the smoothing parameter on estimation accuracy. One problem of interest is to sharpen the upper bound for the convergence rate of the approximation error in this paper, in the manner of typical second order asymptotic results obtained from Edgeworth expansions; see, for example, [1]. A formal study of the higher order comparisons between the profile sampler procedure and a fully Bayesian procedure [17], which assigns priors to both the finite dimensional parameter and the infinite dimensional nuisance parameter, is also of interest. We expect that the involvement of a suitable prior on the infinite dimensional parameter would at least not decrease the estimation accuracy of the parameter of interest.

Another worthwhile avenue of research is to develop analogs of the profile sampler and penalized profile sampler for likelihood estimation under model misspecification and for general M-estimation. Some first order results for this setting, in the case where the nuisance parameter may not be root-*n* consistent, have been developed for a weighted bootstrap procedure in [11]. Studies of second order asymptotics under mild model misspecification could also provide theoretical insights into semiparametric model selection problems.

The authors thank Dr. Joseph Kadane for several insightful discussions.

We first state classical definitions for the covering number (entropy number) and bracketing number (bracketing entropy number) for a class of functions, and then present some technical tools about entropy calculations and increments of empirical processes which will be employed in the proofs that follow. The notations ≲ and ≳ mean smaller than, or greater than, up to a universal constant.

Let ℱ be a subset of a (pseudo-)metric space (ℒ, *d*) of real-valued functions. The *δ*-covering number *N*(*δ*, ℱ, *d*) of ℱ is the smallest *N* for which there exist functions *a*_{1}, …, *a _{N}* in ℒ, such that for each *f* ∈ ℱ, *d*(*f*, *a _{i}*) ≤ *δ* for some *i* ∈ {1, …, *N*}. The *δ*-bracketing number *N _{B}*(*δ*, ℱ, *d*) is the smallest *N* for which there exist pairs of functions [*a _{i}*^{L}, *a _{i}*^{U}] with *d*(*a _{i}*^{L}, *a _{i}*^{U}) ≤ *δ*, *i* = 1, …, *N*, such that for each *f* ∈ ℱ there is an *i* with *a _{i}*^{L} ≤ *f* ≤ *a _{i}*^{U}. The corresponding entropy numbers are *H*(*δ*, ℱ, *d*) ≡ log *N*(*δ*, ℱ, *d*) and *H _{B}*(*δ*, ℱ, *d*) ≡ log *N _{B}*(*δ*, ℱ, *d*).

T1. For each 0 *< C <* ∞ and *δ >* 0 we have

$${H}_{B}(\delta ,\{\eta :{\left|\right|\eta \left|\right|}_{\infty}\le C,J(\eta )\le C\},{\left|\right|\xb7\left|\right|}_{\infty})\lesssim {(\frac{C}{\delta})}^{1/k},$$

(33)

$$H(\delta ,\{\eta :{\left|\right|\eta \left|\right|}_{\infty}\le C,J(\eta )\le C\},{\left|\right|\xb7\left|\right|}_{\infty})\lesssim {(\frac{C}{\delta})}^{1/k}.$$

(34)

T2. Let ℱ be a class of measurable functions such that *Pf*^{2} *< δ*^{2} and ||*f*||_{∞} ≤ *M* for every *f* in ℱ. Then

$${E}_{P}^{\ast}{\left|\right|{\mathbb{G}}_{n}\left|\right|}_{\mathcal{F}}\lesssim K(\delta ,\mathcal{F},{L}_{2}(P))\left(1+\frac{K(\delta ,\mathcal{F},{L}_{2}(P))}{{\delta}^{2}\sqrt{n}}M\right),$$

where ${\left|\right|{\mathbb{G}}_{n}\left|\right|}_{\mathcal{F}}={sup}_{f\in \mathcal{F}}\mid {\mathbb{G}}_{n}f\mid $ and
$K(\delta ,\mathcal{F},||\xb7||)={\int}_{0}^{\delta}\sqrt{1+{H}_{B}(\epsilon ,\mathcal{F},||\xb7||)}d\epsilon $.
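For the specific entropy bound (33), *H _{B}*(*ε*) ≲ (*C*/*ε*)^{1/k}, the bracketing integral *K* in T2 can be evaluated in closed form; this standard calculation (included here for orientation, for *δ* ≤ 1 so that 1 ≤ *ε*^{−1/k} on the domain of integration) is what lies behind the *n*^{−k/(2k+1)} rates appearing later:

```latex
% Standard calculation; the integral converges since 1/(2k) < 1 for k > 1/2:
K(\delta,\mathcal{F},L_2(P))
  \le \int_0^\delta \sqrt{1+\epsilon^{-1/k}}\,d\epsilon
  \lesssim \int_0^\delta \epsilon^{-1/(2k)}\,d\epsilon
  = \frac{2k}{2k-1}\,\delta^{1-\frac{1}{2k}}.
% Balancing \sqrt{n}\,\delta_n^2 \asymp K(\delta_n,\mathcal{F},L_2(P)) then gives
% \delta_n^{1+\frac{1}{2k}} \asymp n^{-1/2}, i.e. \delta_n \asymp n^{-k/(2k+1)}.
```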

T3. Let ℱ = {*f _{t}*: *t* ∈ *T*} be a class of functions satisfying |*f _{s}*(*x*) − *f _{t}*(*x*)| ≤ *d*(*s*, *t*)*F*(*x*) for every *s*, *t* ∈ *T* and every *x*, for some metric *d* on *T* and measurable function *F*. Then, for any norm ||·||,

$${N}_{B}(2\varepsilon |\left|F\right||,\mathcal{F},||\xb7||)\le N(\epsilon ,T,d).$$

T4. Let ℱ be a class of measurable functions *f*: 𝒟 × 𝒲 ↦ ℝ on a product of a finite set 𝒟 and an arbitrary measurable space (𝒲, 𝒜). Let *P* be a probability measure on 𝒟 × 𝒲 and let *P*_{𝒲} be its marginal on 𝒲. For every *d* ∈ 𝒟, let ℱ* _{d}* be the set of functions *w* ↦ *f*(*d*, *w*), *f* ∈ ℱ.

T5. Let ℱ be a uniformly bounded class of measurable functions such that for some measurable *f*_{0}, sup_{f∈ℱ} ||*f* − *f*_{0}||_{∞} < ∞. Moreover, assume that *H _{B}*(*ε*, ℱ, *L*_{2}(*P*)) ≲ *ε*^{−α} for some *α* ∈ (0, 2). Then

$$\underset{f\in \mathcal{F}}{sup}\left[\frac{\mid ({\mathbb{P}}_{n}-P)(f-{f}_{0})\mid}{{\left|\right|f-{f}_{0}\left|\right|}_{2}^{1-\alpha /2}\vee {n}^{(\alpha -2)/[2(2+\alpha )]}}\right]={O}_{P}({n}^{-1/2}).$$

T6. For a probability measure *P*, let ℱ_{1} be a class of measurable functions *f*_{1}: 𝒳 ↦ ℝ, let ℱ_{2} denote a class of continuous nondecreasing functions *f*_{2}: ℝ ↦ [0, 1], and let ℱ_{2}(ℱ_{1}) be the class of compositions *f*_{2} ∘ *f*_{1}. Then,

$${H}_{B}(\epsilon ,{\mathcal{F}}_{2}({\mathcal{F}}_{1}),{L}_{2}(P))\le 2{H}_{B}(\epsilon /3,{\mathcal{F}}_{1},{L}_{2}(P))+\underset{Q}{sup}{H}_{B}(\epsilon /3,{\mathcal{F}}_{2},{L}_{2}(Q)).$$

T7. Let ℱ and 𝒢 be classes of measurable functions. Then for any probability measure *Q* and any 1 ≤ *r* ≤ ∞,

$${H}_{B}(2\epsilon ,\mathcal{F}+\mathcal{G},{L}_{r}(Q))\le {H}_{B}(\epsilon ,\mathcal{F},{L}_{r}(Q))+{H}_{B}(\epsilon ,\mathcal{G},{L}_{r}(Q)),$$

(35)

and, provided ℱ and 𝒢 are bounded by 1 in terms of ||·||_{∞},

$${H}_{B}(2\epsilon ,\mathcal{F}\xb7\mathcal{G},{L}_{r}(Q))\le {H}_{B}(\epsilon ,\mathcal{F},{L}_{r}(Q))+{H}_{B}(\epsilon ,\mathcal{G},{L}_{r}(Q)),$$

(36)

where ℱ · 𝒢 ≡ {*f* × *g*: *f* ∈ ℱ and *g* ∈ 𝒢}.

The proof of T1 can be found in [22]. T1 implies that the Sobolev class of functions with known bounded Sobolev norm is *P*-Donsker. T2 and T3 are, respectively, lemma 3.4.2 and theorem 2.7.11 in [22]. T4 is lemma 9.2 in [16]. T5 is a result presented on page 79 of [20] and is a special case of lemma 5.13 on the same page, whose proof appears on pages 79–80. T6 and T7 are, respectively, lemmas 15.2 and 9.24 in [7].

We first show (20); we then state one lemma before proceeding to the proof of (21). For the proof of (20), note that

$$0={\mathbb{P}}_{n}\stackrel{.}{\ell}({\widehat{\theta}}_{{\lambda}_{n}},{\widehat{\theta}}_{{\lambda}_{n}},{\widehat{\eta}}_{{\lambda}_{n}})+2{\lambda}_{n}^{2}{\int}_{\mathcal{Z}}{\widehat{\eta}}_{{\lambda}_{n}}^{(k)}(z){h}_{0}^{(k)}(z)dz.$$

Combining the third order Taylor expansion of log *pl*_{λn}(·) around *θ*_{0} with the above score equation yields

$$\frac{1}{\sqrt{n}}\sum _{i=1}^{n}{\stackrel{\sim}{I}}_{0}^{-1}{\stackrel{\sim}{\ell}}_{0}({X}_{i})=\sqrt{n}({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})+{O}_{P}({n}^{1/2}{({\lambda}_{n}+||{\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0}||)}^{2}).$$

(37)

The right-hand-side of (37) is of the order
${O}_{P}(\sqrt{n}{\lambda}_{n}^{2}+\sqrt{n}{w}_{n}(1+{w}_{n}+{\lambda}_{n}))$, where *w _{n}* denotes $||{\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0}||$.

We next prove (21). Expanding $log\phantom{\rule{0.16667em}{0ex}}{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})$ around *θ*_{0}, we obtain

$$log{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})=log{pl}_{{\lambda}_{n}}({\theta}_{0})+n{({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})}^{T}{\mathbb{P}}_{n}{\stackrel{\sim}{\ell}}_{0}-\frac{n}{2}{({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})}^{T}{\stackrel{\sim}{I}}_{0}({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})+{O}_{P}({n}^{1/2}{\lambda}_{n}^{2}).$$

(38)

Taking the difference between (38) and (56) yields

$$log{pl}_{{\lambda}_{n}}({\stackrel{\sim}{\theta}}_{n})=log{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})+n{({\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}})}^{T}\left({\mathbb{P}}_{n}{\stackrel{\sim}{\ell}}_{0}-{\stackrel{\sim}{I}}_{0}({\widehat{\theta}}_{{\lambda}_{n}}-{\theta}_{0})\right)-\frac{n}{2}{({\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}})}^{T}{\stackrel{\sim}{I}}_{0}({\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}})+{O}_{P}({g}_{{\lambda}_{n}}(\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}}\left|\right|)).$$

Suppose that *F _{λn}*(·) is the penalized posterior profile distribution of
$\sqrt{n}{\varrho}_{n}$ with respect to the prior *ρ*(·):

$${F}_{{\lambda}_{n}}(\xi )=\frac{{\int}_{{\varrho}_{n}\in (-\infty ,{n}^{-1/2}\xi ]\cap {\mathrm{\Xi}}_{n}}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n}){\scriptstyle \frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}}d{\varrho}_{n}}{{\int}_{{\varrho}_{n}\in {\mathrm{\Xi}}_{n}}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n}){\scriptstyle \frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}}d{\varrho}_{n}}.$$

(39)

Note that *dϱ _{n}* in the above is short notation for *dϱ*_{n1} × ⋯ × *dϱ _{nd}*.

Choose a positive decreasing sequence *r _{n}* → 0. Then

$${\int}_{\left|\right|{\varrho}_{n}\left|\right|>{r}_{n}}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})\frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}d{\varrho}_{n}={O}_{P}({n}^{-M}),$$

(40)

for any positive number *M*.

Fix *r >* 0. We then have

$${\int}_{\left|\right|{\varrho}_{n}\left|\right|>r}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})\frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}d{\varrho}_{n}\lesssim I\{{\mathrm{\Delta}}_{{\lambda}_{n}}^{r}<-{n}^{-{\scriptstyle \frac{1}{2}}}\}exp(-\sqrt{n}){\int}_{\mathrm{\Theta}}\rho (\theta )d\theta +I\{{\mathrm{\Delta}}_{{\lambda}_{n}}^{r}\ge -{n}^{-{\scriptstyle \frac{1}{2}}}\},$$

where
${\mathrm{\Delta}}_{{\lambda}_{n}}^{r}={sup}_{\left|\right|{\varrho}_{n}\left|\right|>r}{\mathrm{\Delta}}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\varrho}_{n}{\stackrel{\sim}{I}}_{0}^{-1/2})$. According to lemma 3.2 in [2],
$I\{{\mathrm{\Delta}}_{{\lambda}_{n}}^{r}\ge -{n}^{-{\scriptstyle \frac{1}{2}}}\}={O}_{P}({n}^{-M})$ for any positive *M* and fixed *r* > 0. Note that the above inequality holds uniformly over decreasing sequences *r _{n}* → 0 that converge sufficiently slowly. Therefore, we can choose a positive decreasing sequence *r _{n}* → 0 for which (40) holds.

Choose *r _{n}* as in lemma 2.1. Then

$${\int}_{\left|\right|{\varrho}_{n}\left|\right|\le {r}_{n}}\left|\frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}(\widehat{\theta})}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})-exp\left(-\frac{n}{2}{\varrho}_{n}^{T}{\varrho}_{n}\right)\rho ({\widehat{\theta}}_{{\lambda}_{n}})\right|\times d{\varrho}_{n}={O}_{P}({n}^{-(d-1)/2}{\lambda}_{n}^{2}).$$

(41)

The posterior mass over the region ${\left|\right|{\varrho}_{n}\left|\right|}_{2}\le {r}_{n}$ is bounded by the sum of the following two terms:

$${\int}_{{\left|\right|{\varrho}_{n}\left|\right|}_{2}\le {r}_{n}}\left|\frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}\rho ({\widehat{\theta}}_{{\lambda}_{n}})-exp\left(-\frac{n}{2}{\varrho}_{n}^{T}{\varrho}_{n}\right)\rho ({\widehat{\theta}}_{{\lambda}_{n}})\right|d{\varrho}_{n}\phantom{\rule{0.38889em}{0ex}}(\ast )+{\int}_{{\left|\right|{\varrho}_{n}\left|\right|}_{2}\le {r}_{n}}\left|\frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})-\frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}\rho ({\widehat{\theta}}_{{\lambda}_{n}})\right|d{\varrho}_{n}.\phantom{\rule{0.38889em}{0ex}}(\ast \ast )$$

By (21), we obtain

$$(\ast )={\int}_{{\left|\right|{\varrho}_{n}\left|\right|}_{2}\le {r}_{n}}\left[\rho ({\widehat{\theta}}_{{\lambda}_{n}})exp\left(-\frac{n{\varrho}_{n}^{T}{\varrho}_{n}}{2}\right)\mid exp({O}_{P}({g}_{{\lambda}_{n}}(|\left|{\varrho}_{n}\right||)))-1\mid \right]d{\varrho}_{n}.$$

Obviously the order of (*) depends on that of |exp(*O _{P}*(*g _{λn}*(||*ϱ _{n}*||))) − 1|.

We next give the formal proof of theorem 2. Combining lemma 2.1 and lemma 2.2, we know the denominator of (39) equals

$${\int}_{\{{\left|\right|{\varrho}_{n}\left|\right|}_{2}\le {r}_{n}\}\cap {\mathrm{\Xi}}_{n}}\left[exp\left(-\frac{n}{2}{\varrho}_{n}^{T}{\varrho}_{n}\right)\rho ({\widehat{\theta}}_{{\lambda}_{n}})\right]d{\varrho}_{n}+{O}_{P}({n}^{-(d-1)/2}{\lambda}_{n}^{2}).$$

The first term in the above display equals

$${n}^{-d/2}\rho ({\widehat{\theta}}_{{\lambda}_{n}}){\int}_{\{{\left|\right|{u}_{n}\left|\right|}_{2}\le \sqrt{n}{r}_{n}\}\cap \sqrt{n}{\mathrm{\Xi}}_{n}}{e}^{-{u}_{n}^{T}{u}_{n}/2}d{u}_{n}={n}^{-d/2}\rho ({\widehat{\theta}}_{{\lambda}_{n}}){\int}_{{\mathbb{R}}^{d}}{e}^{-{u}_{n}^{T}{u}_{n}/2}d{u}_{n}+O({n}^{-(d-1)/2}{\lambda}_{n}^{2}),$$

where
${u}_{n}=\sqrt{n}{\varrho}_{n}$. The above equality follows from the inequality that
${\int}_{x}^{\infty}{e}^{-{y}^{2}/2}dy\le {x}^{-1}{e}^{-{x}^{2}/2}$ for any *x* > 0. Consolidating the above analyses, we deduce that the denominator of (39) equals
${n}^{-{\scriptstyle \frac{d}{2}}}\rho ({\widehat{\theta}}_{{\lambda}_{n}}){(2\pi )}^{d/2}+{O}_{P}({n}^{-(d-1)/2}{\lambda}_{n}^{2})$. The same analysis also applies to the numerator, thus completing the whole proof.
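The tail inequality just used is elementary: for *y* ≥ *x* > 0 we have $e^{-y^2/2}\le (y/x)e^{-y^2/2}$, and the majorant integrates to exactly $x^{-1}e^{-x^2/2}$. A quick numeric check (the helper `gauss_tail` is ours, built from the complementary error function):

```python
import math

def gauss_tail(x):
    # Exact upper tail of exp(-y^2/2): sqrt(pi/2) * erfc(x / sqrt(2)).
    return math.sqrt(math.pi / 2) * math.erfc(x / math.sqrt(2))

for x in [0.5, 1.0, 2.0, 5.0]:
    bound = math.exp(-x * x / 2) / x
    # The bound holds for every x > 0 and becomes tight as x grows.
    assert gauss_tail(x) <= bound
```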

We only show (24) in what follows. (25) can be verified similarly. Showing (24) is equivalent to establishing ${\stackrel{\sim}{E}}_{\theta \mid x}^{{\lambda}_{n}}({\varrho}_{n})={O}_{P}({\lambda}_{n}^{2})$. Note that ${\stackrel{\sim}{E}}_{\theta \mid x}^{{\lambda}_{n}}({\varrho}_{n})$ can be written as:

$${\stackrel{\sim}{E}}_{\theta \mid x}^{{\lambda}_{n}}({\varrho}_{n})=\frac{{\int}_{{\varrho}_{n}\in {\mathrm{\Xi}}_{n}}{\varrho}_{n}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n}){\scriptstyle \frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}}d{\varrho}_{n}}{{\int}_{{\varrho}_{n}\in {\mathrm{\Xi}}_{n}}\rho ({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n}){\scriptstyle \frac{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}}+{\stackrel{\sim}{I}}_{0}^{-{\scriptstyle \frac{1}{2}}}{\varrho}_{n})}{{pl}_{{\lambda}_{n}}({\widehat{\theta}}_{{\lambda}_{n}})}}d{\varrho}_{n}}.$$

By analysis similar to that applied in the proof of theorem 2, we know the denominator in the above display is ${n}^{-d/2}{(2\pi )}^{d/2}\rho ({\widehat{\theta}}_{{\lambda}_{n}})+{O}_{P}({n}^{-(d-1)/2}{\lambda}_{n}^{2})$ and the numerator is a random vector of order ${O}_{P}({n}^{-d/2}{\lambda}_{n}^{2})$. This yields the conclusion.

Note that (23) implies
${\kappa}_{n\alpha}={\stackrel{\sim}{I}}_{0}^{-1/2}{z}_{\alpha}+{O}_{P}({n}^{1/2}{\lambda}_{n}^{2})$, for any *ξ* < *α* < 1 − *ξ*, where
$\xi \in (0,{\scriptstyle \frac{1}{2}})$. Note also that the *α*-th quantile of a *d* dimensional standard normal distribution, *z _{α}*, is not unique if *d* > 1.

We first present a technical lemma before the formal proof of lemma 1. In lemma 1.1 we define

$$\mathcal{K}=\left\{\frac{{\ell}_{\theta ,\eta}(X)-{\ell}_{0}(X)}{1+J(\eta )}:\left|\right|\theta -{\theta}_{0}\left|\right|\le {C}_{1},{\left|\right|\eta -{\eta}_{0}\left|\right|}_{\infty}\le {C}_{1},J(\eta )<\infty \right\},$$

for a known constant *C*_{1} < ∞. Combining with T5, we use condition (42) below to control the order of the increments of the empirical processes indexed by *ℓ*_{θ,η}:

$${H}_{B}(\epsilon ,\mathcal{K},{L}_{2}(P))\lesssim {\epsilon}^{-1/k}.$$

(42)

We next assume two smoothness conditions on the criterion function (*θ*, *η*) ↦ *Pℓ*_{θ,η}, i.e.,

$${\left|\right|{\ell}_{\theta ,\eta}-{\ell}_{0}\left|\right|}_{2}\lesssim \left|\right|\theta -{\theta}_{0}\left|\right|+{d}_{\theta}(\eta ,{\eta}_{0}),$$

(43)

$$P({\ell}_{\theta ,\eta}-{\ell}_{\theta ,{\eta}_{0}})\lesssim -{d}_{\theta}^{2}(\eta ,{\eta}_{0})+{\left|\right|\theta -{\theta}_{0}\left|\right|}^{2}.$$

(44)

Here ${d}_{\theta}^{2}(\eta ,{\eta}_{0})$ can be thought of as the square of a distance, but the following lemma is valid for arbitrary functions $\eta \mapsto {d}_{\theta}^{2}(\eta ,{\eta}_{0})$. Finally, we assume a somewhat stronger assumption on the density, i.e.,

$${p}_{\theta ,\eta}/{p}_{\theta ,{\eta}_{0}}\phantom{\rule{0.16667em}{0ex}}\text{is}\phantom{\rule{0.16667em}{0ex}}\text{bounded}\phantom{\rule{0.16667em}{0ex}}\text{away}\phantom{\rule{0.16667em}{0ex}}\text{from}\phantom{\rule{0.16667em}{0ex}}\text{zero}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}\text{infinity}.$$

(45)

Note that (45) is trivially satisfied in our first model.

Assume conditions (42)–(45) above hold for every *θ* ∈ Θ* _{n}* and every admissible *η*. Then

$$\begin{array}{r}{d}_{{\stackrel{\sim}{\theta}}_{n}}({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}},{\eta}_{0})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),\\ {\lambda}_{n}J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||),\end{array}$$

for any ${\stackrel{\sim}{\theta}}_{n}\stackrel{p}{\to}{\theta}_{0}$ and *λ _{n}* satisfying condition (3).

The definition of ${\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}$ implies that

$$\begin{array}{l}{\lambda}_{n}^{2}{J}^{2}({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})\le {\lambda}_{n}^{2}{J}^{2}({\eta}_{0})+({\mathbb{P}}_{n}-P)\left({\ell}_{{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}}-{\ell}_{{\stackrel{\sim}{\theta}}_{n},{\eta}_{0}}\right)+P\left({\ell}_{{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}}-{\ell}_{{\stackrel{\sim}{\theta}}_{n},{\eta}_{0}}\right)\\ ={\lambda}_{n}^{2}{J}^{2}({\eta}_{0})+I+II.\end{array}$$

Note that by T5 and assumption (42), we have

$$\begin{array}{l}I\le (1+J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})){O}_{P}({n}^{-1/2})\times \left\{{\Vert \frac{{\ell}_{{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}}-{\ell}_{0}}{1+J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})}\Vert}_{2}^{1-{\scriptstyle \frac{1}{2k}}}\vee {n}^{-{\scriptstyle \frac{2k-1}{2(2k+1)}}}\right\}\\ +(1+J({\eta}_{0})){O}_{P}({n}^{-1/2})\times \left\{{\Vert \frac{{\ell}_{{\stackrel{\sim}{\theta}}_{n},{\eta}_{0}}-{\ell}_{0}}{1+J({\eta}_{0})}\Vert}_{2}^{1-{\scriptstyle \frac{1}{2k}}}\vee {n}^{-{\scriptstyle \frac{2k-1}{2(2k+1)}}}\right\}.\end{array}$$

By assumption (44), we have

$$II\lesssim -{d}_{{\stackrel{\sim}{\theta}}_{n}}^{2}({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}},{\eta}_{0})+{\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}^{2}.$$

Combining with the above, we can deduce that

$$\begin{array}{l}{\widehat{d}}_{n}^{2}+{\lambda}_{n}^{2}{\widehat{J}}_{n}^{2}\lesssim (1+{\widehat{J}}_{n}){O}_{P}({n}^{-1/2})\times \left\{{\left(\frac{{\widehat{d}}_{n}+\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}{1+{\widehat{J}}_{n}}\right)}^{1-{\scriptstyle \frac{1}{2k}}}\vee {n}^{-{\scriptstyle \frac{2k-1}{2(2k+1)}}}\right\}\\ +(1+{J}_{0}){O}_{P}({n}^{-1/2})\times \left\{{\left(\frac{\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}{1+{J}_{0}}\right)}^{1-{\scriptstyle \frac{1}{2k}}}\vee {n}^{-{\scriptstyle \frac{2k-1}{2(2k+1)}}}\right\}\\ +{\lambda}_{n}^{2}{J}_{0}^{2}+{\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}^{2},\end{array}$$

(46)

where ${\widehat{d}}_{n}={d}_{{\stackrel{\sim}{\theta}}_{n}}({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}},{\eta}_{0})$, ${\widehat{J}}_{n}=J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})$ and *J*_{0} = *J*(*η*_{0}). Rearranging (46) yields

$${u}_{n}^{2}={O}_{P}(1)+{O}_{P}(1){u}_{n}^{1-{\scriptstyle \frac{1}{2k}}},$$

(47)

$${v}_{n}={v}_{n}^{-1}{O}_{P}({\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}^{2})+{u}_{n}^{1-{\scriptstyle \frac{1}{2k}}}{O}_{P}({\lambda}_{n})+{O}_{P}({n}^{-{\scriptstyle \frac{1}{2}}}{\lambda}_{n}^{-1}{\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}^{1-{\scriptstyle \frac{1}{2k}}}),$$

(48)

where ${u}_{n}=({\widehat{d}}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)/({\lambda}_{n}(1+{\widehat{J}}_{n}))$ and ${v}_{n}={\lambda}_{n}(1+{\widehat{J}}_{n})$. Now (47) gives *u _{n}* = *O _{P}*(1), and substituting this into (48) yields the conclusion.
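The passage from (47) to the boundedness of *u _{n}* uses an elementary fact about inequalities of this form, spelled out here for completeness (a standard step, stated under the convention *A*, *B* ≥ 0):

```latex
% If u^2 <= A + B u^gamma with 0 <= gamma < 2, then at least one of the two
% terms on the right-hand side must be >= u^2 / 2, so
%   u^2 <= 2A    or    u^2 <= 2 B u^gamma,
% and in either case
u \;\le\; \sqrt{2A}\;\vee\;(2B)^{1/(2-\gamma)}.
% Applying this with gamma = 1 - 1/(2k) < 2 and A, B = O_P(1) in (47)
% yields u_n = O_P(1).
```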

We now apply lemma 1.1 to derive the related convergence rates in the partly linear model. Conditions (43)–(45) can be verified easily in this example because *ℓ*_{θ,f} has finite second moment. For condition (42), consider the class

$$\mathcal{O}\equiv \left\{\frac{{\ell}_{\theta ,f}(X)}{1+J(f)}:\left|\right|\theta -{\theta}_{0}\left|\right|\le {C}_{1},{\left|\right|f-{f}_{0}\left|\right|}_{\infty}\le {C}_{1},J(f)<\infty \right\},$$

for some constant *C*_{1}. Note that *ℓ*_{θ,f}(*X*)/(1 + *J*(*f*)) can be written as

$$\mathrm{\Delta}{A}^{-1}log\mathrm{\Phi}({\overline{q}}_{\theta ,f}A)+(1-\mathrm{\Delta}){A}^{-1}log(1-\mathrm{\Phi}({\overline{q}}_{\theta ,f}A)),$$

(49)

where *A* = 1 + *J*(*f*) and ${\overline{q}}_{\theta ,f}={q}_{\theta ,f}/A$ with *q*_{θ,f} ∈ 𝒪_{1}, where

$${\mathcal{O}}_{1}\equiv \left\{\frac{{q}_{\theta ,f}(X)}{1+J(f)}:\left|\right|\theta -{\theta}_{0}\left|\right|\le {C}_{1},{\left|\right|f-{f}_{0}\left|\right|}_{\infty}\le {C}_{1},J(f)<\infty \right\},$$

and where we know ${H}_{B}(\epsilon ,{\mathcal{O}}_{1},{\left|\right|\xb7\left|\right|}_{\infty})\lesssim {\epsilon}^{-1/k}$ by T1.

We next calculate the *ε*-bracketing entropy number with *L*_{2} norm for the class of functions *R*_{1} {*k _{a}*(

For the proof of (29), we apply arguments similar to those used in the proof of lemma 1.1, but with *λ _{n}* set to zero and the known bound on *J*(*f*) used in its place.

Based on the discussions of (13) and (14), we need to verify the smoothness conditions and the asymptotic equicontinuity conditions, i.e. (15)–(17), for the function ℓ(*t*, *θ*, *η*) and its related derivatives. The first set of conditions is verified in lemma 5 of [3]. For the verification of (15)–(17), we first show condition (17). Without loss of generality, we assume that *λ _{n}* is bounded below by a multiple of *n*^{−k/(2k+1)}.

$$P{\left(\frac{\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\stackrel{.}{\ell}}_{0}}{{n}^{{\scriptstyle \frac{1}{4k+2}}}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)}\right)}^{2}\lesssim \frac{{\left|\right|{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}-{f}_{0}\left|\right|}_{2}^{2}}{{n}^{{\scriptstyle \frac{1}{2k+1}}}{({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)}^{2}}={O}_{P}\left({n}^{-{\scriptstyle \frac{1}{2k+1}}}\right),$$

where the equality in the above expression follows from (27).

By (28), we know that $J({\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}(1+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||/{\lambda}_{n})$. We thus consider the class

$$\left\{\frac{\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},f)-{\stackrel{.}{\ell}}_{0}}{{n}^{{\scriptstyle \frac{1}{4k+2}}}({\lambda}_{n}+||\theta -{\theta}_{0}||)}:J(f)\le {C}_{n}(1+\frac{\left|\right|\theta -{\theta}_{0}\left|\right|}{{\lambda}_{n}}),{\left|\right|f\left|\right|}_{\infty}\le M,\left|\right|\theta -{\theta}_{0}\left|\right|\le \delta \right\}\cap \left\{g\in {L}_{2}(P):P{g}^{2}\le {C}_{n}{n}^{-{\scriptstyle \frac{1}{2k+1}}}\right\},$$

for some *δ* > 0. Obviously the function $(\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\stackrel{.}{\ell}}_{0})/({n}^{1/(4k+2)}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||))$ belongs to this class with probability tending to one. Moreover, the class is contained in ${\mathcal{R}}_{n}$, defined as

$$\{{H}_{n}(f):J({H}_{n}(f))\lesssim {\lambda}_{n}^{-1}{n}^{-1/(4k+2)},{\left|\right|{H}_{n}(f)\left|\right|}_{\infty}\lesssim {\lambda}_{n}^{-1}{n}^{-1/(4k+2)}\},$$

where *H _{n}*(*f*) denotes the above normalized function indexed by *f*. By T1 and the displayed bounds on *J*(*H _{n}*(*f*)) and ||*H _{n}*(*f*)||_{∞}, we know that

$$H(\epsilon ,{\mathcal{R}}_{n},{L}_{2}(P))\lesssim {(({\lambda}_{n}^{-1}{n}^{-1/(4k+2)})/\epsilon )}^{1/k}.$$

Note that ${\delta}_{n}^{2}={C}_{n}{n}^{-1/(2k+1)}$ in the application of T2. Combining the above entropy bound with T2, we conclude that ${E}^{\ast}{\left|\right|{\mathbb{G}}_{n}\left|\right|}_{{\mathcal{R}}_{n}}\to 0$, which establishes (17).

For the proof of (15), we only need to show that, for any ${\stackrel{\sim}{\theta}}_{n}\stackrel{p}{\to}{\theta}_{0}$,

$${\mathbb{G}}_{n}(\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\ddot{\ell}}_{0})={o}_{P}(1+{n}^{1/3}||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)={o}_{P}(1).$$

By the rate assumptions (27), we have

$$P{\left(\frac{\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\ddot{\ell}}_{0}}{1+{n}^{1/3}\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}\right)}^{2}\lesssim \frac{{\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}\left|\right|}^{2}+{\left|\right|{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}-{f}_{0}\left|\right|}_{2}^{2}}{{(1+{n}^{1/3}||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)}^{2}}={O}_{P}({n}^{-1/2}).$$

We next define the class ${\overline{\mathcal{R}}}_{n}$ as follows:

$$\left\{\frac{\ddot{\ell}({\theta}_{0},\theta ,f)-{\ddot{\ell}}_{0}}{1+{n}^{1/3}\left|\right|\theta -{\theta}_{0}\left|\right|}:J(f)\le {C}_{n}(1+\frac{\left|\right|\theta -{\theta}_{0}\left|\right|}{{\lambda}_{n}}),{\left|\right|f\left|\right|}_{\infty}\le M,\left|\right|\theta -{\theta}_{0}\left|\right|<\delta \right\}\cap \left\{g\in {L}_{2}(P):P{g}^{2}\le {C}_{n}{n}^{-{\scriptstyle \frac{1}{2}}}\right\}.$$

Obviously the function $(\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\ddot{\ell}}_{0})/(1+{n}^{1/3}||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)$ belongs to ${\overline{\mathcal{R}}}_{n}$ with probability tending to one, and ${\overline{\mathcal{R}}}_{n}$ is contained in a class of the form

$$\{{H}_{n}(f):J({H}_{n}(f))\lesssim 1+{({n}^{1/3}{\lambda}_{n})}^{-1},{\left|\right|{H}_{n}(f)\left|\right|}_{\infty}\lesssim 1+{({n}^{1/3}{\lambda}_{n})}^{-1}\},$$

where *H _{n}*(*f*) denotes the above normalized function indexed by *f*. By T1, we have

$$H(\epsilon ,{\overline{\mathcal{R}}}_{n},{L}_{2}(P))\lesssim {((1+{n}^{-1/3}{\lambda}_{n}^{-1})/\epsilon )}^{1/k}.$$

Then by analysis similar to that used in the proof of (17), we can show that ${E}^{\ast}{\left|\right|{\mathbb{G}}_{n}\left|\right|}_{{\overline{\mathcal{R}}}_{n}}\to 0$ as *n* → ∞ in view of T2. This completes the proof of (15).

For the proof of (16), it suffices to show that the corresponding conditions hold for ${\stackrel{.}{\ell}}_{t,\theta}$, which follows by arguments analogous to those above.

In the last part, we show (18). It suffices to verify that the sequence of classes of functions ${\mathcal{F}}_{n}$, consisting of the third derivatives ${\ell}^{(3)}(t,\theta ,f)$ over the relevant neighborhood, is *P*-Glivenko-Cantelli. Consider the class

$${\stackrel{\sim}{\mathcal{F}}}_{n}\equiv \{\phi ({q}_{t,{f}_{t}(\theta ,f)}(x))R({q}_{t,{f}_{t}(\theta ,f)}(x)):(t,\theta )\in {V}_{{\theta}_{0}},{\lambda}_{n}J(f)\le C,{\left|\right|f\left|\right|}_{\infty}\le M\}.$$

By arguments similar to those used in lemma 7.2 of [15], we know that
${sup}_{Q}H(\epsilon ,{\stackrel{\sim}{\mathcal{F}}}_{n},{L}_{1}(Q))\lesssim {(1+{\lambda}_{n}^{-1}/\epsilon )}^{1/k}={o}_{P}(n)$. Moreover, the ${\stackrel{\sim}{\mathcal{F}}}_{n}$ are uniformly bounded since
$f\in {\mathcal{H}}_{k}^{M}$. Considering the fact that the probability that ${\mathcal{F}}_{n}$ is contained in ${\stackrel{\sim}{\mathcal{F}}}_{n}$ tends to one, the desired Glivenko-Cantelli property follows.

By the assumption that ${\mathrm{\Delta}}_{{\lambda}_{n}}({\stackrel{\sim}{\theta}}_{n})\ge {o}_{P}(1)$, we have

$${n}^{-1}\sum _{i=1}^{n}log\left[\frac{\mathit{lik}({\stackrel{\sim}{\theta}}_{n},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}},{X}_{i})}{\mathit{lik}({\theta}_{0},{\widehat{f}}_{{\theta}_{0},{\lambda}_{n}},{X}_{i})}\right]-{\lambda}_{n}^{2}[{J}^{2}({\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{J}^{2}({\widehat{f}}_{{\theta}_{0},{\lambda}_{n}})]\ge {o}_{P}(1)$$

By considering assumption (19), the above inequality simplifies to

$${n}^{-1}\sum _{i=1}^{n}log\left[\frac{H({\stackrel{\sim}{\theta}}_{n},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}};{X}_{i})}{H({\theta}_{0},{\widehat{f}}_{{\theta}_{0},{\lambda}_{n}};{X}_{i})}\right]\ge {o}_{P}(1),$$

where *H*(*θ*, *f*; *X*) = ΔΦ(*C* − *θU* − *f*(*V*)) + (1 − Δ)(1 − Φ(*C* − *θU* − *f*(*V*))). By arguments similar to those used in lemma 2 and by T4, combined with the concavity inequality log(1 + *α*(*x* − 1)) ≥ *α* log *x* for *α* ∈ (0, 1], we obtain

$$Plog\left[1+\alpha \left(\frac{H({\stackrel{\sim}{\theta}}_{n},{\widehat{f}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}};{X}_{i})}{H({\theta}_{0},{\widehat{f}}_{{\theta}_{0},{\lambda}_{n}};{X}_{i})}-1\right)\right]\ge {o}_{P}(1).$$

(50)

The remainder of the proof follows the proof of lemma 6 in [3].
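The bounded surrogate appearing in (50) is standard in such consistency arguments, and rests on the elementary concavity bound log(1 + *α*(*x* − 1)) ≥ *α* log *x* for *x* > 0 and *α* ∈ [0, 1] (take the convex combination (1 − *α*) · 1 + *α* · *x* inside the concave logarithm). A brute-force numeric check over an arbitrary grid:

```python
import math

# Concavity of log: log((1-a)*1 + a*x) >= (1-a)*log(1) + a*log(x) = a*log(x).
def lhs(a, x):
    return math.log(1 + a * (x - 1))

worst = min(
    lhs(a, x) - a * math.log(x)
    for a in [0.0, 0.1, 0.5, 0.9, 1.0]
    for x in [1e-6, 0.1, 0.5, 1.0, 2.0, 10.0, 1e6]
)
# Equality cases (a = 0, a = 1, or x = 1) hit zero up to rounding error.
assert worst >= -1e-9
```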

The boundedness condition (45) in lemma 1.1 cannot be satisfied in the semiparametric logistic regression model. Hence we propose lemma 4.1 below, which relaxes this condition by choosing a more suitable criterion function *m*_{θ,η}.

Assume that, for any given *θ* ∈ Θ* _{n}* and *η* ∈ 𝒱* _{n}*,

$$P({m}_{\theta ,\eta}-{m}_{\theta ,{\eta}_{0}})\lesssim -{d}_{\theta}^{2}(\eta ,{\eta}_{0})+{\left|\right|\theta -{\theta}_{0}\left|\right|}^{2},$$

(51)

$${E}^{\ast}\underset{\theta \in {\mathrm{\Theta}}_{n},\eta \in {\mathcal{V}}_{n},\left|\right|\theta -{\theta}_{0}\left|\right|<\epsilon ,{d}_{\theta}(\eta ,{\eta}_{0})<\epsilon}{sup}\mid {\mathbb{G}}_{n}({m}_{\theta ,\eta}-{m}_{\theta ,{\eta}_{0}})\mid \lesssim \phantom{\rule{0.16667em}{0ex}}{\phi}_{n}(\epsilon ).$$

(52)

Suppose that (52) is valid for functions *φ _{n}* such that *δ* ↦ *φ _{n}*(*δ*)/*δ*^{β} is decreasing for some *β* < 2. Then the rate conclusions of lemma 1.1 continue to hold with *m*_{θ,η} in place of *ℓ*_{θ,η}.

Lemma 4.2 below is presented to verify the modulus of continuity condition for the empirical process in (52). Let ${\mathcal{S}}_{\delta}=\{x\mapsto {m}_{\theta ,\eta}(x)-{m}_{\theta ,{\eta}_{0}}(x):\theta \in {\mathrm{\Theta}}_{n},\eta \in {\mathcal{V}}_{n},\left|\right|\theta -{\theta}_{0}\left|\right|<\delta ,{d}_{\theta}(\eta ,{\eta}_{0})<\delta \}$ and define

$$K(\delta ,{\mathcal{S}}_{\delta},{L}_{2}(P))={\int}_{0}^{\delta}\sqrt{1+{H}_{B}(\epsilon ,{\mathcal{S}}_{\delta},{L}_{2}(P))}d\epsilon .$$

(53)

Suppose the functions (*x*, *θ*, *η*) ↦ *m*_{θ,η}(*x*) are uniformly bounded for (*θ*, *η*) in a neighborhood of (*θ*_{0}, *η*_{0}) and satisfy

$$P{({m}_{\theta ,\eta}-{m}_{{\theta}_{0},{\eta}_{0}})}^{2}\lesssim {d}_{\theta}^{2}(\eta ,{\eta}_{0})+{\left|\right|\theta -{\theta}_{0}\left|\right|}^{2}.$$

Then condition (52) is satisfied for any functions *φ _{n}* such that

$${\phi}_{n}(\delta )\ge K(\delta ,{\mathcal{S}}_{\delta},{L}_{2}(P))\left(1+\frac{K(\delta ,{\mathcal{S}}_{\delta},{L}_{2}(P))}{{\delta}^{2}\sqrt{n}}\right)$$

Consequently, in the conclusion of the above lemma, we may use *K*(*δ*, ${\mathcal{S}}_{\delta}$, *L*_{2}(*P*)) rather than *φ _{n}*(*δ*).

We then apply lemma 4.1 to the penalized semiparametric logistic regression model by including *λ* in *θ*, i.e.
${m}_{\theta ,\lambda ,\eta}={m}_{\theta ,\eta}-{\scriptstyle \frac{1}{2}}{\lambda}^{2}({J}^{2}(\eta )-{J}^{2}({\eta}_{0}))$, in the proof of lemma 4. First, lemma 7.1 in [15] establishes that

$${\Vert {p}_{{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}}-{p}_{{\theta}_{0},{\eta}_{0}}\Vert}_{2}+{\lambda}_{n}J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)$$

(54)

after choosing

$${m}_{\theta ,\lambda ,\eta}=log\frac{{p}_{\theta ,\eta}+{p}_{\theta ,{\eta}_{0}}}{2{p}_{\theta ,{\eta}_{0}}}-\frac{1}{2}{\lambda}^{2}({J}^{2}(\eta )-{J}^{2}({\eta}_{0}))$$

in lemma 4.1. Note that, by the identifiability of the model, the map (*θ*, *η*) ↦ *p*_{θ,η} satisfies

$${\left|\right|{p}_{\theta ,\eta}-{p}_{{\theta}_{0},{\eta}_{0}}\left|\right|}_{2}\gtrsim (||\theta -{\theta}_{0}||\wedge 1+{\left|\right|\mid \eta -{\eta}_{0}\mid \wedge 1\left|\right|}_{2})\wedge 1.$$

(55)

Thus we have proved (30). For (32), we replace the criterion *m*_{θ,λ,η} above with its unpenalized version and repeat the argument, using the known bound on *J*(*η*).

The proof of lemma 5 follows that of lemma 2. The smoothness conditions of ℓ(*t*, *θ*, *η*) and its related derivatives can be shown similarly since *F*(·), *Ḟ*(·) and *F̈*(·) are all uniformly bounded on (−∞, +∞), and *h*_{0}(·) is bounded over [0, 1]. Note that we can show (12) directly by the following analysis. *P*ℓ̇(*θ*_{0}, *θ*_{0}, *η*) can be written as *P*(*F*(*θ*_{0}*w*+*η*_{0}) − *F*(*θ*_{0}*w*+*η*(*z*)))(*w*−*h*_{0}(*z*)) since *P*ℓ̇_{0} = 0. Note that *P*(*w*−*h*_{0}(*z*))*Ḟ*(*θ*_{0}*w*+*η*_{0}(*z*))(*η*−*η*_{0})(*z*) = 0. This implies that *P*ℓ̇(*θ*_{0}, *θ*_{0}, *η*) = *P*(*F*(*θ*_{0}*w*+*η*_{0}) − *F*(*θ*_{0}*w*+*η*(*z*)) + *Ḟ*(*θ*_{0}*w*+*η*_{0}(*z*))(*η*−*η*_{0})(*z*))(*w*−*h*_{0}(*z*)). However, by a standard Taylor expansion, we have |*F*(*θ*_{0}*w* + *η*) − *F*(*θ*_{0}*w* + *η*_{0}) − *Ḟ*(*θ*_{0}*w* + *η*_{0})(*η* − *η*_{0})| ≤ ||*F̈*||_{∞}|*η* − *η*_{0}|^{2}. This proves (12).
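The final Taylor bound above can be checked numerically. The sketch below assumes *F* is the logistic cdf (an assumption for illustration; any *F* with bounded second derivative works the same way) and verifies the Lagrange form of the remainder, |*F*(*a*+*h*) − *F*(*a*) − *Ḟ*(*a*)*h*| ≤ ½ sup|*F̈*| · *h*^{2}, which is stronger than the displayed inequality:

```python
import math

def F(x):   # logistic cdf (illustrative choice of F)
    return 1 / (1 + math.exp(-x))

def dF(x):  # F' = F(1 - F) for the logistic cdf
    p = F(x)
    return p * (1 - p)

# sup|F''| for the logistic cdf: F'' = F(1-F)(1-2F), maximized at F = (3-sqrt(3))/6,
# where |F''| = 1/(6*sqrt(3)).
sup_F2 = 1 / (6 * math.sqrt(3))

worst = 0.0
for a in [-3.0, -1.0, 0.0, 0.7, 2.5]:
    for h in [-0.5, -0.1, 0.05, 0.3, 1.0]:
        remainder = abs(F(a + h) - F(a) - dF(a) * h)
        # Lagrange remainder bound: |R| <= sup|F''| * h^2 / 2.
        worst = max(worst, remainder - 0.5 * sup_F2 * h * h)
assert worst <= 1e-12
```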

We next verify the asymptotic equicontinuity conditions, i.e. (15)–(17). For (17), we first apply analysis similar to that used in the proof of lemma 2 to obtain

$$P{\left(\frac{\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\stackrel{.}{\ell}}_{0}}{{n}^{{\scriptstyle \frac{1}{4k+2}}}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||)}\right)}^{2}\lesssim {O}_{P}\left({n}^{-{\scriptstyle \frac{1}{2k+1}}}\right).$$

By lemma 7.1 in [15], we know that $J({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})={O}_{P}(1+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||/{\lambda}_{n})$. We then consider the class

$$\{\frac{\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},\eta )-{\stackrel{.}{\ell}}_{0}}{{n}^{{\scriptstyle \frac{1}{4k+2}}}({\lambda}_{n}+||\theta -{\theta}_{0}||)}:J(\eta )\le {C}_{n}(1+\frac{\left|\right|\theta -{\theta}_{0}\left|\right|}{{\lambda}_{n}}),{\left|\right|\eta \left|\right|}_{\infty}\le {C}_{n}(1+J(\eta )),\left|\right|\theta -{\theta}_{0}\left|\right|<\delta \}\cap \left\{g\in {L}_{2}(P):P{g}^{2}\le {C}_{n}{n}^{-{\scriptstyle \frac{1}{2k+1}}}\right\}.$$

Clearly, the probability that the function $(\stackrel{.}{\ell}({\theta}_{0},{\theta}_{0},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\stackrel{.}{\ell}}_{0})/({n}^{1/(4k+2)}({\lambda}_{n}+||{\stackrel{\sim}{\theta}}_{n}-{\theta}_{0}||))$ is contained in the above class tends to one, and the remainder of the argument follows the proof of (17) in lemma 2.

The proof of (15) and (16) follows arguments quite similar to those used in the proof of lemma 2. In other words, we can show that $(\ddot{\ell}({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{\ddot{\ell}}_{0})$ satisfies the analogous second moment and entropy bounds.

Next we define the class of third derivatives ${\ell}^{(3)}(t,\theta ,\eta )$ over the relevant neighborhood and proceed as before.

The proof of lemma 6 is analogous to that of lemma 3.

Under the assumptions of theorem 1, we have

$${\mathit{logpl}}_{{\lambda}_{n}}({\stackrel{\sim}{\theta}}_{n})=log{pl}_{{\lambda}_{n}}({\theta}_{0})+n{({\stackrel{\sim}{\theta}}_{n}-{\theta}_{0})}^{T}{\mathbb{P}}_{n}{\stackrel{\sim}{\ell}}_{0}-\frac{n}{2}{({\stackrel{\sim}{\theta}}_{n}-{\theta}_{0})}^{T}{\stackrel{\sim}{I}}_{0}({\stackrel{\sim}{\theta}}_{n}-{\theta}_{0})+{O}_{P}({g}_{{\lambda}_{n}}(\left|\right|{\stackrel{\sim}{\theta}}_{n}-{\widehat{\theta}}_{{\lambda}_{n}}\left|\right|)),$$

(56)

for any sequence ${\stackrel{\sim}{\theta}}_{n}$ satisfying ${\stackrel{\sim}{\theta}}_{n}={\theta}_{0}+{o}_{P}(1)$.

Note that ${n}^{-1}(log\phantom{\rule{0.16667em}{0ex}}{pl}_{{\lambda}_{n}}({\stackrel{\sim}{\theta}}_{n})-log\phantom{\rule{0.16667em}{0ex}}{pl}_{{\lambda}_{n}}({\theta}_{0}))$ is bounded above and below by

$${\mathbb{P}}_{n}(\ell ({\stackrel{\sim}{\theta}}_{n},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-\ell ({\theta}_{0},{\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}}))-{\lambda}_{n}^{2}({J}^{2}({\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})-{J}^{2}({\eta}_{{\theta}_{0}}({\stackrel{\sim}{\theta}}_{n},{\widehat{\eta}}_{{\stackrel{\sim}{\theta}}_{n},{\lambda}_{n}})))$$

and

$${\mathbb{P}}_{n}(\ell ({\stackrel{\sim}{\theta}}_{n},{\theta}_{0},{\widehat{\eta}}_{{\theta}_{0},{\lambda}_{n}})-\ell ({\theta}_{0},{\theta}_{0},{\widehat{\eta}}_{{\theta}_{0},{\lambda}_{n}}))-{\lambda}_{n}^{2}({J}^{2}({\eta}_{{\stackrel{\sim}{\theta}}_{n}}({\theta}_{0},{\widehat{\eta}}_{{\theta}_{0},{\lambda}_{n}}))-{J}^{2}({\widehat{\eta}}_{{\theta}_{0},{\lambda}_{n}})),$$

respectively. By the third-order Taylor expansion of $\mathbb{P}_n\,\ell({}$ …

$$\begin{array}{l}\lambda_n^2\left(J^2(\hat{\eta}_{\tilde{\theta}_n,\lambda_n})-J^2(\eta_{\theta_0}(\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n}))\right)=-2\lambda_n^2(\tilde{\theta}_n-\theta_0)^T\int_{\mathcal{Z}}\hat{\eta}^{(k)}_{\tilde{\theta}_n,\lambda_n}h_0^{(k)}\,dz+2\lambda_n^2(\tilde{\theta}_n-\theta_0)^T\int_{\mathcal{Z}}h_0^{(k)}h_0^{(k)T}\,dz\,(\tilde{\theta}_n-\theta_0)\\ \quad=O_P(n^{-1}g_{\lambda_n}(\|\tilde{\theta}_n-\hat{\theta}_{\lambda_n}\|))\end{array}$$

by Taylor expansion. The last equality holds by assumptions (3) and (19). A similar analysis applies to the lower bound. This proves (56).
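To give concrete intuition for an expansion of the form (56), the following toy sketch checks the analogous quadratic approximation numerically in a fully parametric normal model, where the variance plays the role of the profiled-out nuisance parameter. This is only an illustration of the expansion's structure, not the paper's semiparametric setting; the model, the sample size, and all variable names are assumptions of the sketch.

```python
import numpy as np

# Toy model: X ~ N(theta, sigma^2), with sigma^2 treated as the nuisance
# parameter and profiled out. The profile log-likelihood then admits a
# quadratic expansion around theta_0 analogous to (56):
#   log pl(theta) - log pl(theta_0)
#     ~ n (theta - theta_0) * P_n(efficient score)
#       - (n/2) (theta - theta_0)^2 * (efficient information) + remainder.
rng = np.random.default_rng(0)
n = 20000
theta0, sigma0 = 1.0, 2.0
x = rng.normal(theta0, sigma0, n)

def log_profile_lik(theta):
    # Profile out the nuisance variance: sigma^2_hat(theta) = mean((x - theta)^2).
    return -0.5 * n * np.log(np.mean((x - theta) ** 2))  # up to an additive constant

# Efficient score and information for theta at (theta0, sigma0^2) in this model.
score = np.mean(x - theta0) / sigma0 ** 2
info = 1.0 / sigma0 ** 2

delta = 0.05
lhs = log_profile_lik(theta0 + delta) - log_profile_lik(theta0)
rhs = n * delta * score - 0.5 * n * delta ** 2 * info
print(lhs, rhs)  # the two sides agree up to a small remainder
```

In this smooth parametric toy the remainder is of lower order than the linear and quadratic terms, mirroring the role played by the $O_P(g_{\lambda_n}(\|\tilde{\theta}_n-\hat{\theta}_{\lambda_n}\|))$ remainder in (56).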

1. Bentkus V, Götze F, van Zwet WR. An Edgeworth Expansion for Symmetric Statistics. Annals of Statistics. 1997;25:851–896.

2. Cheng G, Kosorok MR. Higher order semiparametric frequentist inference with the profile sampler. Annals of Statistics. 2007 In Press.

3. Cheng G, Kosorok MR. General Frequentist Properties of the Posterior Profile Distribution. Annals of Statistics. 2007 In Press.

4. Dalalyan A, Golubev G, Tsybakov A. Penalized Maximum Likelihood and Semiparametric Second-Order Efficiency. Annals of Statistics. 2006;34:169–201.

5. Good IJ, Gaskins RA. Non-parametric roughness penalties for probability densities. Biometrika. 1971;58:255–277.

6. Huang J. Efficient estimation of the partly linear Cox model. Annals of Statistics. 1999;27:1536–1563.

7. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer; New York: 2008.

8. Kuo HH. Gaussian Measures in Banach Spaces. Lecture Notes in Mathematics. Springer; Berlin: 1975.

9. Lee BL, Kosorok MR, Fine JP. The profile sampler. Journal of the American Statistical Association. 2005;100:960–969.

10. Ma S, Kosorok MR. Penalized Log-likelihood Estimation for Partly Linear Transformation Models with Current Status Data. Annals of Statistics. 2005;33:2256–2290.

11. Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis. 2005;96:190–217.

12. Ma S, Kosorok MR. Adaptive penalized M-estimation with current status data. Annals of the Institute of Statistical Mathematics. 2006;58:511–526.

13. Mammen E, van de Geer S. Penalized quasi-likelihood estimation in partial linear models. Annals of Statistics. 1997;25:1014–1035.

14. Murphy SA. Asymptotic Theory for the Frailty Model. Annals of Statistics. 1995;23:182–198.

15. Murphy SA, van der Vaart AW. Observed information in semiparametric models. Bernoulli. 1999;5:381–412.

16. Murphy SA, van der Vaart AW. Semiparametric mixtures in case-control studies. Journal of Multivariate Analysis. 2001;79:1–32.

17. Shen X. Asymptotic normality in semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association. 2002;97:222–235.

18. Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. Annals of Statistics. 1982;10:795–810.

19. Silverman BW. Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with discussion). Journal of the Royal Statistical Society, Series B. 1985;47:1–52.

20. van de Geer S. Empirical Processes in M-estimation. Cambridge University Press; Cambridge: 2000.

21. van der Vaart AW. Maximum Likelihood Estimation with Partially Censored Observations. Annals of Statistics. 1994;22:1896–1916.

22. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.

23. Wahba G. Spline Models for Observational Data. SIAM; Philadelphia: 1990.

24. Xiang D, Wahba G. Approximate smoothing spline methods for large data sets in the binary case. ASA Proceedings of the Biometrics Section. :94–99.