J Multivar Anal. Author manuscript; available in PMC 2010 April 28.
Published in final edited form as:
J Multivar Anal. 2009 March 1; 100(3): 345–362.
PMCID: PMC2860882
NIHMSID: NIHMS195772

The Penalized Profile Sampler

Abstract

The penalized profile sampler for semiparametric inference is an extension of the profile sampler method [9] obtained by profiling a penalized log-likelihood. The idea is to base inference on the posterior distribution obtained by multiplying a profiled penalized log-likelihood by a prior for the parametric component, where the profiling and penalization are applied to the nuisance parameter. Because the prior is not applied to the full likelihood, the method is not strictly Bayesian. A benefit of this approximately Bayesian method is that it circumvents the need to put a prior on the possibly infinite-dimensional nuisance components of the model. We investigate the first and second order frequentist performance of the penalized profile sampler, and demonstrate that the accuracy of the procedure can be adjusted by the size of the assigned smoothing parameter. The theoretical validity of the procedure is illustrated for two examples: a partly linear model with normal error for current status data and a semiparametric logistic regression model. Simulation studies are used to verify the theoretical results.

Keywords: Penalized Likelihood, Posterior Distribution, Profile Likelihood, Semiparametric Inference, Smoothing Parameter

1 Introduction

Semiparametric models are statistical models indexed by both a finite-dimensional parameter of interest θ and an infinite-dimensional nuisance parameter η. In order to make statistical inference about θ separately from η, we estimate the nuisance parameter by $\hat\eta_\theta$, its maximum likelihood estimator at each fixed θ, i.e.

$$\hat\eta_\theta = \operatorname*{argmax}_{\eta\in\mathcal{H}} \mathrm{lik}_n(\theta, \eta),$$

where $\mathrm{lik}_n(\theta, \eta)$ is the likelihood of the semiparametric model given n observations and $\mathcal{H}$ is the parameter space for η. Frequentist inference about θ can then be based on the profile likelihood, typically defined as

$$\mathrm{pl}_n(\theta) = \sup_{\eta\in\mathcal{H}} \mathrm{lik}_n(\theta, \eta).$$

The convergence rate of the nuisance parameter η is the order of $d(\hat\eta_{\tilde\theta_n}, \eta_0)$, where d(·, ·) is some metric on the parameter space for η, $\tilde\theta_n$ is any sequence satisfying $\tilde\theta_n = \theta_0 + o_P(1)$, and $(\eta_0, \theta_0)$ is the true value of (η, θ). Typically,

$$d(\hat\eta_{\tilde\theta_n}, \eta_0) = O_P\big(\|\tilde\theta_n - \theta_0\| + n^{-r}\big),$$
(1)

where ||·|| is the Euclidean norm and r > 1/4. A smaller value of r corresponds to a slower convergence rate of the nuisance parameter. For instance, in the Cox proportional hazards model with right-censored data, the nuisance parameter, the cumulative hazard function, attains the parametric rate, i.e., r = 1/2. If the Cox model is instead applied to current status data, the convergence rate slows to r = 1/3, owing to the reduced information provided by this kind of data.

The profile sampler is the procedure of sampling from the posterior of the profile likelihood in order to estimate and draw inference on the parametric component θ in a semiparametric model, where the profiling is done over the possibly infinite-dimensional nuisance parameter η. [9] show that the profile sampler gives a first order correct approximation to the maximum likelihood estimator $\hat\theta_n$ and consistent estimation of the efficient Fisher information for θ, even when the nuisance parameter is not estimable at the $\sqrt{n}$ rate. Another Bayesian procedure for semiparametric estimation is considered in [17], who study the marginal semiparametric posterior distribution for a parameter of interest. In particular, [17] show that marginal semiparametric posterior distributions are asymptotically normal and centered at the corresponding maximum likelihood estimates or posterior means, with covariance matrix equal to the inverse of the Fisher information. Unfortunately, this fully Bayesian method requires specification of a prior on η, which is quite challenging since for some models there is no direct extension of the concept of a Lebesgue dominating measure for the infinite-dimensional parameter set involved [8]. The advantages of the profile sampler for estimating θ compared with other methods are discussed extensively in [2], [3] and [9].

The motivation for studying second order asymptotic properties of the profile sampler comes from observed simulation differences in the Cox model with different types of data, i.e. right-censored data [2] and current status data [9]. The profile sampler based on the first model yields much more accurate estimation results compared with the second model when the sample size is relatively small. [2] and [3] have explored the theoretical reasons behind this phenomenon by establishing the relation between the estimation accuracy of the profile sampler, measured in terms of second order asymptotics, and the convergence rate of the nuisance parameter. Specifically, the profile sampler generated from a semiparametric model with a faster nuisance parameter convergence rate usually yields more precise frequentist inference for θ. These second order results are verified in [2] and [3] for several examples, including the proportional odds model, case-control studies with missing covariates, and the partly linear model; the convergence rates for these models range from the parametric rate (r = 1/2) to the cubic rate (r = 1/3). The work in [3] shows clearly that the accuracy of inference for θ based on the profile sampler is intrinsically determined by the semiparametric model specification through its entropy number.

In many semiparametric models involving a smooth nuisance parameter, it is often convenient and beneficial to perform estimation using penalization. One motivation for this is that, in the absence of any restrictions on the form of the function η, maximum likelihood estimation for some semiparametric models leads to over-fitting. Seminal applications of penalized maximum likelihood estimation include estimation of a probability density function in [18] and nonparametric linear regression in [19]. Note that penalized likelihood is a special case of penalized quasi-likelihood studied in [13]. Under certain reasonable regularity conditions, penalized semiparametric log-likelihood estimation can yield fully efficient estimates for θ (see, for example, [13]). As far as we are aware, the only general procedure for inference for θ in this context known to be theoretically valid is a weighted bootstrap with bounded random weights (see [11]). It is even unclear whether the usual nonparametric bootstrap will work in this context when the nuisance parameter has a convergence rate r < 1/2.

The purpose of this paper is to ask the somewhat natural question: does sampling from the exponential of a profiled penalized log-likelihood (a procedure we refer to hereafter as the penalized profile sampler) yield first and even second order accurate frequentist inference? The conclusion of this paper is that the answer is yes and, moreover, that the accuracy of the inference depends in a fairly simple way on the size of the smoothing parameter.

The unknown parameters in the semiparametric models we study in this paper include θ, which we assume belongs to some compact set $\Theta \subset \mathbb{R}^d$, and η, which we assume to be a function in the Sobolev class $\mathcal{H}_k$ or its subset $\mathcal{H}_k^M \equiv \mathcal{H}_k \cap \{\eta : \|\eta\| \le M\}$ for some known M < ∞, supported on some compact set of the real line. The Sobolev class of functions $\mathcal{H}_k$ of degree k is defined as the set $\{\eta : J^2(\eta) \equiv \int (\eta^{(k)}(z))^2\,dz < \infty\}$, where $\eta^{(j)}$ is the j-th derivative of η with respect to z. Obviously $J^2(\eta)$ is a measure of the complexity of η. The penalized log-likelihood in this context is:

$$\log\mathrm{lik}_{\lambda_n}(\theta, \eta) = \log\mathrm{lik}(\theta, \eta) - n\lambda_n^2 J^2(\eta),$$
(2)

where $\log\mathrm{lik}(\theta, \eta) \equiv n\mathbb{P}_n \ell_{\theta,\eta}(X)$, $\ell_{\theta,\eta}(X)$ is the log-likelihood of the single observation X, and λn is a smoothing parameter, possibly data-dependent. In practice, λn can be obtained by cross-validation [23] or by inspecting the fitted curves for different values of λn. The penalized maximum likelihood estimators of θ and η depend on the choice of the smoothing parameter λn; consequently, we use the notation $\hat\theta_{\lambda_n}$ and $\hat\eta_{\lambda_n}$ for the remainder of this paper to denote the estimators obtained by maximizing (2). In particular, a larger smoothing parameter usually leads to a smoother penalized estimator of η0. It is of interest to establish the asymptotic properties of the proposed penalized profile sampler procedure with a data-driven λn; further study of this issue is needed, but it is beyond the scope of this paper.

For the purpose of establishing first order accuracy of inference for θ based on the penalized profile sampler, we assume that the bounds for the smoothing parameter are in the form below:

$$\lambda_n = o_P(n^{-1/4}) \quad\text{and}\quad \lambda_n^{-1} = O_P\big(n^{k/(2k+1)}\big).$$
(3)

Condition (3) is assumed to hold throughout this paper. One way to ensure (3) in practice is simply to set $\lambda_n = n^{-k/(2k+1)}$. Alternatively, we can choose $\lambda_n = n^{-1/3}$, which is independent of k and satisfies (3) for every k ≥ 1. It turns out that the upper bound guarantees that $\hat\theta_{\lambda_n}$ is $\sqrt{n}$-consistent, while the lower bound controls the convergence rate of the penalized nuisance parameter estimator. Another approach to controlling the estimators is to use sieve estimates with assumptions on the derivatives (see [6]). We will not pursue this further here.
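As a quick illustrative check (ours, not from the paper), the deterministic choice $\lambda_n = n^{-1/3}$ indeed sits between the two bounds of (3) in the cubic-spline case k = 2, where the lower bound is $n^{-k/(2k+1)} = n^{-2/5}$:

```python
# Numeric illustration that n**(-2/5) <= n**(-1/3) <= n**(-1/4) for several n,
# i.e. lambda_n = n**(-1/3) respects both bounds of condition (3) when k = 2.
for n in (100, 1000, 10000):
    lam = n ** (-1 / 3)
    assert n ** (-2 / 5) <= lam <= n ** (-1 / 4)
    print(n, round(lam, 4))
```

The same comparison of exponents (2/5 ≥ 1/3 ≥ 1/4) shows the choice works for any k ≥ 1.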

The log-profile penalized likelihood is defined as follows:

$$\log\mathrm{pl}_{\lambda_n}(\theta) = \log\mathrm{lik}(\theta, \hat\eta_{\theta,\lambda_n}) - n\lambda_n^2 J^2(\hat\eta_{\theta,\lambda_n}),$$
(4)

where $\hat\eta_{\theta,\lambda_n} = \operatorname*{argmax}_{\eta\in\mathcal{H}_k} \log\mathrm{lik}_{\lambda_n}(\theta, \eta)$ for fixed θ and λn. Note that $J(\hat\eta_{\tilde\theta_n,0}) \ge J(\hat\eta_{\tilde\theta_n,\lambda_n})$, where $\hat\eta_{\theta,0} = \hat\eta_\theta \equiv \operatorname*{argmax}_{\eta\in\mathcal{H}} \log\mathrm{lik}(\theta, \eta)$ for fixed θ: this follows by combining the inequality $\log\mathrm{lik}_{\lambda_n}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,0}) \le \log\mathrm{lik}_{\lambda_n}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})$ with $\log\mathrm{lik}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,0}) \ge \log\mathrm{lik}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})$. Hence we again see that the smoothing parameter λn plays a role in determining the complexity of the estimated nuisance parameter. The penalized profile sampler is simply the procedure of sampling from the posterior distribution of $\mathrm{pl}_{\lambda_n}(\theta)$ obtained by assigning a prior to θ. By analyzing the corresponding MCMC chain from the frequentist point of view, our paper obtains the following conclusions:

  1. Distribution Approximation: The posterior distribution with respect to $\mathrm{pl}_{\lambda_n}(\theta)$ can be approximated by a normal distribution whose mean is the maximum penalized likelihood estimator of θ and whose variance is the inverse of the efficient information matrix, with error $O_P(n^{1/2}\lambda_n^2)$;
  2. Moment Approximation: The maximum penalized likelihood estimator of θ can be approximated by the mean of the MCMC chain with error $O_P(\lambda_n^2)$. The efficient information matrix can be approximated by the inverse of the variance of the MCMC chain with error $O_P(n^{1/2}\lambda_n^2)$;
  3. Confidence Interval Approximation: An exact frequentist confidence interval of Wald's type for θ can be estimated by the credible set obtained from the MCMC chain with error $O_P(\lambda_n^2)$.

Obviously, given any smoothing parameter satisfying the upper bound in (3), the penalized profile sampler yields first order frequentist valid inference for θ, similar to what was shown for the profile sampler in [9]. Moreover, the above conclusions are actually second order frequentist valid results, whose approximation accuracy is directly controlled by the smoothing parameter. Note that the corresponding results for the usual (non-penalized) profile sampler with nuisance parameter convergence rate r in [3] are obtained by replacing, in all respective occurrences above, $O_P(n^{1/2}\lambda_n^2)$ with $O_P(n^{-1/2} \vee n^{-2r+1/2})$ and $O_P(\lambda_n^2)$ with $O_P(n^{-1} \vee n^{-2r})$, where r is as defined in (1).
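Conclusions 1–3 can be made concrete with a small simulation. The following is a minimal, hypothetical sketch of the penalized profile sampler on a toy fully observed Gaussian partly linear model (deliberately simpler than the current status likelihood of section 2.1): for each proposed θ, the nuisance function is profiled out in closed form over a spline-type basis, with a crude quadratic stand-in for $J^2(\eta)$, and a random-walk Metropolis chain is run on θ under a flat prior. All variable names, the basis, and the penalty matrix are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 200, 1.0
U = rng.uniform(0, 1, n)
V = rng.uniform(-1, 1, n)
Y = theta0 * U + np.sin(np.pi * V) + rng.normal(size=n)  # toy Gaussian model

# Truncated-power cubic basis for eta (an illustrative choice, not the paper's)
knots = np.linspace(-0.8, 0.8, 8)
B = np.column_stack([np.ones(n), V, V**2, V**3] +
                    [np.clip(V - k, 0, None)**3 for k in knots])
# Penalize only the truncated-power coefficients: a crude stand-in for J^2(eta)
Omega = np.diag([0.0] * 4 + [1.0] * len(knots))
lam = n ** (-1 / 3)  # smoothing parameter within the bounds of (3) for k = 2

def log_pl(theta):
    """Profiled penalized log-likelihood: eta maximized out by ridge algebra."""
    r = Y - theta * U
    c = np.linalg.solve(B.T @ B + 2 * n * lam**2 * Omega, B.T @ r)
    resid = r - B @ c
    return -0.5 * resid @ resid - n * lam**2 * (c @ Omega @ c)

# Random-walk Metropolis on theta; the flat prior cancels in the ratio
theta = 0.5
lp = log_pl(theta)
chain = []
for _ in range(5000):
    prop = theta + 0.2 * rng.standard_normal()
    lp_prop = log_pl(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    chain.append(theta)
draws = np.asarray(chain[1000:])
print(f"posterior mean {draws.mean():.2f}, posterior sd {draws.std():.2f}")
```

Per conclusions 1–3, the chain mean approximates the penalized MLE and the chain quantiles give Wald-type confidence limits, with errors governed by λn.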

Our results are the first general higher order frequentist inference results for penalized semiparametric estimation. We also note, however, that some results on the second order efficiency of semiparametric estimators were derived in [4]. The layout of the article is as follows. The next section, section 2, introduces the two main examples we use for illustration: partly linear regression for current status data and semiparametric logistic regression. Some background is given in section 3, including the concept of a least favorable submodel as well as the main model assumptions. A preliminary theorem concerning second order asymptotic expansions of the log-profile penalized likelihood is also presented in section 3. The main results and implications are discussed in section 4, and all remaining model assumptions are verified for the examples in section 5. A brief discussion of future work is given in section 6. We postpone all technical tools and proofs to the last section, section 7.

2 Examples

2.1 Partly Linear Normal Model with Current Status Data

In this example, we study the partly linear regression model with normal residual error. The continuous outcome Y, conditional on the covariates $(U, V) \in \mathbb{R}^d \times \mathbb{R}$, is modeled as

$$Y = \theta^T U + f(V) + \varepsilon,$$
(5)

where f is an unknown smooth function, and $\varepsilon \sim N(0, \sigma^2)$ with finite variance σ². For simplicity, we assume for the rest of the paper that σ = 1. The theory we propose also works when σ is unknown, but the added complexity would detract from the main issues. We also assume that only the current status of the response Y is observed at a random censoring time $C \in \mathbb{R}$. In other words, we observe X = (C, Δ, U, V), where the indicator Δ = 1{Y ≤ C}. Current status data may occur due to study design or measurement limitations. Examples of such data arise in several fields, including demography, epidemiology and econometrics. For simplicity of exposition, θ is assumed to be one dimensional.

Under the model (5) and given that the joint distribution for (C, U, V) does not involve parameters (θ, f), the log-likelihood for a single observation at X = x [equivalent] (c, δ, u, v) is

$$\log\mathrm{lik}_{\theta,f}(x) = \delta\,\log\big\{\Phi(c - \theta u - f(v))\big\} + (1-\delta)\,\log\big\{1 - \Phi(c - \theta u - f(v))\big\},$$
(6)

where Φ is the cdf of the standard normal distribution. The parameter of interest, θ, is assumed to belong to some compact set in $\mathbb{R}^1$. The nuisance parameter is the function f, which belongs to the Sobolev function class of degree k. We further make the following assumptions on this model. We assume that Y and C are independent given (U, V). The covariates (U, V) are assumed to belong to some compact set, and the support of the random censoring time C is an interval $[l_c, u_c]$, where $-\infty < l_c < u_c < \infty$. In addition, $P\mathrm{Var}(U|V)$ is strictly positive and $Pf(V) = 0$. The first order asymptotic behavior of the penalized log-likelihood estimates of a slightly more general version of this model has been extensively studied in [10].
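For concreteness, the single-observation log-likelihood (6) is straightforward to evaluate numerically; the following is a small sketch with helper names of our own choosing:

```python
from math import erf, log, sqrt

def Phi(x):
    """Standard normal cdf, computed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def loglik_one(c, delta, u, v, theta, f):
    """Log-likelihood (6) of one current status observation x = (c, delta, u, v);
    f is the nuisance function, passed here as a Python callable."""
    q = c - theta * u - f(v)
    return delta * log(Phi(q)) + (1 - delta) * log(1.0 - Phi(q))
```

For example, at θ = 0, f ≡ 0 and c = 0 the contribution reduces to log Φ(0) = log(1/2) whichever value Δ takes.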

2.2 Semiparametric Logistic Regression

Let X1 = (Y1, W1, Z1), X2 = (Y2, W2, Z2), … be independent copies of X = (Y, W, Z), where Y is a dichotomous variable with conditional expectation $P(Y = 1|W, Z) = F(\theta^T W + \eta(Z))$ and $F(u) = e^u/(e^u + 1)$ is the logistic distribution function. The likelihood for a single observation is then of the following form:

$$\mathrm{lik}_{\theta,\eta}(x) = F\big(\theta^T w + \eta(z)\big)^{y}\,\big(1 - F(\theta^T w + \eta(z))\big)^{1-y}\, f_{(W,Z)}(w, z).$$
(7)

This example is a special case of quasi-likelihood in partly linear models in which the conditional variance of the response Y is taken to be a quadratic function of its conditional mean. In the absence of any restrictions on the form of the function η, maximum likelihood estimation in this simple model often leads to over-fitting. Hence [5] propose maximizing instead the penalized log-likelihood $\log\mathrm{lik}(\theta, \eta) - n\lambda_n^2 J^2(\eta)$, and [13] establish the asymptotic consistency of the maximum penalized likelihood estimators of θ and η. For simplicity, we restrict ourselves to the case where $\Theta \subset \mathbb{R}^1$ and (W, Z) has bounded support, say [0, 1]². To ensure identifiability of the parameters, we assume that $P\mathrm{Var}(W|Z)$ is positive and that the support of Z contains at least k distinct points in [0, 1]; see lemma 7.1 in [15].
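The penalized objective $\log\mathrm{lik}(\theta,\eta) - n\lambda_n^2 J^2(\eta)$ for this model can be sketched as follows, under the simplifying assumption (ours, not the paper's) that η is represented by its values on a grid of z-points and $J^2(\eta)$ is replaced by a crude discrete second-difference penalty:

```python
import numpy as np

def pen_loglik(theta, eta_vals, y, w, z_idx, lam, n):
    """Penalized log-likelihood for the logistic model (7).
    eta_vals: values of eta on a z-grid; z_idx: grid index of each observation.
    The discrete second-difference penalty is a stand-in for J^2(eta)."""
    lin = theta * w + eta_vals[z_idx]
    # y*lin - log(1 + e^lin) equals y*log F + (1-y)*log(1-F) for logistic F
    loglik = np.sum(y * lin - np.logaddexp(0.0, lin))
    J2 = np.sum(np.diff(eta_vals, 2) ** 2)
    return loglik - n * lam**2 * J2
```

With θ = 0, η ≡ 0 and λn = 0, each observation contributes −log 2, matching F(0) = 1/2.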

Remark 1

Another interesting potential example to which the penalized profile sampler method may be applied is the classical proportional hazards model with current status data, penalizing the cumulative hazard function via its Sobolev norm. There are two motivations for penalizing the cumulative hazard function in the Cox model. First, the estimated step functions from unpenalized estimation cannot easily be used for other estimation or inference purposes. Second, without stronger continuity assumptions, the unpenalized approach cannot achieve uniform consistency even on a compact set [10]. The asymptotic properties of the corresponding penalized M-estimators have been studied in [12].

3 Preliminaries

In this section, we present some necessary preliminary material on least favorable submodels, and we state the structural assumptions required to achieve the second order asymptotic expansion of the log-profile penalized likelihood given in (21).

3.1 Least favorable submodels

In this subsection, we briefly review the concept of a least favorable submodel. A submodel $t \mapsto \mathrm{lik}_{t,\eta_t}$ is defined to be least favorable at (θ, η) if $\tilde\ell_{\theta,\eta} = (\partial/\partial t)\log\mathrm{lik}_{t,\eta_t}\big|_{t=\theta}$, where $\tilde\ell_{\theta,\eta}$ is the efficient score function for θ. The efficient score function for θ can be viewed as the projection of the score function for θ onto the orthocomplement of the tangent space for η, and its variance is exactly the efficient information matrix $\tilde{I}_{\theta,\eta}$. We hereafter abbreviate $\tilde\ell_{\theta_0,\eta_0}$ and $\tilde{I}_{\theta_0,\eta_0}$ by $\tilde\ell_0$ and $\tilde{I}_0$, respectively. The "direction" along which $\eta_t$ approaches η in the least favorable submodel is called the least favorable direction. An insightful review of least favorable submodels and efficient score functions can be found in chapter 3 of [7]. We assume that in our setting a least favorable submodel always exists. By the above construction of the least favorable submodel, $\log\mathrm{pl}_{\lambda_n}(\theta)$ can be rewritten in the following form:

$$\log\mathrm{pl}_{\lambda_n}(\theta) = n\Big(\mathbb{P}_n\,\ell(\theta, \theta, \hat\eta_{\theta,\lambda_n}) - \lambda_n^2 J^2\big(\eta_\theta(\theta, \hat\eta_{\theta,\lambda_n})\big)\Big),$$
(8)

where $\ell(t, \theta, \eta)(x) = \ell_{t,\eta_t(\theta,\eta)}(x)$, and $t \mapsto \eta_t(\theta, \eta)$ is a general map from a neighborhood of θ into the parameter set for η, with $\eta_\theta(\theta, \eta) = \eta$. The concrete form of (8) depends on the situation.

The derivatives of the function $\ell(t, \theta, \eta)$ are taken with respect to its first argument, t. For derivatives with respect to the argument θ, we use the following shorthand: $\ell_\theta(t, \theta, \eta)$ denotes the first derivative of $\ell(t, \theta, \eta)$ with respect to θ, and $\ell_{t,\theta}(t, \theta, \eta)$ denotes the derivative of $\dot\ell(t, \theta, \eta)$ with respect to θ. Also, $\ddot\ell(t, \cdot, \eta)$ and $\ell_{t,\theta}(t, \theta, \cdot)$ indicate the maps $\theta \mapsto \ddot\ell(t, \theta, \eta)$ and $\eta \mapsto \ell_{t,\theta}(t, \theta, \eta)$, respectively. For brevity, we write $\dot\ell_0 = \dot\ell(\theta_0, \theta_0, \eta_0)$, $\ddot\ell_0 = \ddot\ell(\theta_0, \theta_0, \eta_0)$ and $\ell_0^{(3)} = \ell^{(3)}(\theta_0, \theta_0, \eta_0)$, where θ0 and η0 are the true values of θ and η. Of course, we can write $\tilde\ell_0(X)$ as $\dot\ell_0(X)$ by the construction of the least favorable submodel. All necessary derivatives of $\ell(t, \theta, \eta)$ with respect to t or θ in this paper are assumed to have integrable envelope functions in some neighborhood of (θ0, θ0, η0). In the following, we use $P_{\theta,\eta}U$ to denote the expectation of a random variable U at the parameter (θ, η), and write PU for $P_{\theta_0,\eta_0}U$ for simplicity.

3.2 Main Assumptions

The set of structural conditions on the least favorable submodel consists of the "no-bias" conditions:

$$P\dot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) = O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big)^2,$$
(9)

$$P\ddot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) = P\ddot\ell_0 + O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big),$$
(10)

for any sequence $\tilde\theta_n$ satisfying $\tilde\theta_n = \theta_0 + o_P(1)$. Verification of (9) and (10) depends on the smoothness of $\ell(t, \theta, \eta)$ and on the convergence rate of the penalized nuisance parameter estimator, via functional Taylor expansions around the true values. The convergence rate typically has the following upper bound:

$$d(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0) = O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big).$$
(11)

The form of d(η, η0) may vary across situations and does not need to be specified in this subsection beyond the given conditions. (11) implies that $\hat\eta_{\tilde\theta_n,\lambda_n}$ is consistent for η0 as $\tilde\theta_n \to \theta_0$ in probability. Hence (9) and (10) hold provided the Fréchet derivatives of the maps $\eta \mapsto \ddot\ell(\theta_0, \theta_0, \eta)$ and $\eta \mapsto \ell_{t,\theta}(\theta_0, \theta_0, \eta)$ are bounded, and

$$P\dot\ell(\theta_0, \theta_0, \eta) = O\big(d^2(\eta, \eta_0)\big),$$
(12)

which is usually implied by a bounded Fréchet derivative of the map $\eta \mapsto \dot\ell(\theta_0, \theta_0, \eta)$ and second order Fréchet differentiability of the map $\eta \mapsto \mathrm{lik}(\theta_0, \eta)$.

The empirical version of the no-bias conditions,

$$\mathbb{P}_n\dot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) = \mathbb{P}_n\dot\ell_0 + O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big)^2,$$
(13)

$$\mathbb{P}_n\ddot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) = P\ddot\ell_0 + O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big),$$
(14)

where $\mathbb{P}_n$ denotes the empirical distribution of the observations, ensure that the penalized profile likelihood asymptotically behaves like a penalized likelihood in a parametric model, and therefore yield a second order asymptotic expansion of the penalized profile log-likelihood. The empirical no-bias conditions are built upon (9) and (10) by assuming that the sizes of the collections of the functions $\dot\ell$ and $\ddot\ell$ are manageable; this requirement is expressed in the language of empirical processes. Provided that $\ddot\ell_0$ and $\ell_{t,\theta}(\theta_0, \theta_0, \eta_0)$ are square integrable, (14) follows from (10) if we assume

$$\mathbb{G}_n\big(\ddot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0\big) = o_P(1),$$
(15)

where $\mathbb{G}_n \equiv \sqrt{n}(\mathbb{P}_n - P)$ denotes the empirical process of the observations. If we further assume that

$$\mathbb{G}_n\big(\ell_{t,\theta}(\theta_0, \bar\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ell_{t,\theta}(\theta_0, \theta_0, \eta_0)\big) = o_P(1),$$
(16)

$$\mathbb{G}_n\big(\dot\ell(\theta_0, \theta_0, \hat\eta_{\tilde\theta_n,\lambda_n}) - \dot\ell_0\big) = O_P\big(n^{-1/(4k+2)}(\lambda_n + \|\tilde\theta_n - \theta_0\|)\big),$$
(17)

for any sequence $\bar\theta_n$ satisfying $\bar\theta_n = \theta_0 + o_P(1)$, then (13) follows. Note that conditions (15)–(17) concern the asymptotic equicontinuity of the empirical processes indexed by $\ddot\ell$, $\ell_{t,\theta}$ and $\dot\ell$, respectively. Thus we will be able to use technical tools T2 and T5 given in the appendix to show (15)–(17). We next present the preliminary theorem on the second order asymptotic expansion of the log-profile penalized likelihood, which prepares us for deriving the main results on the higher order structure of the penalized profile sampler in the next section.

Theorem 1

Let (13) and (14) be satisfied and suppose that

$$(\mathbb{P}_n - P)\,\ell^{(3)}(\bar\theta_n, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) = o_P(1),$$
(18)

$$\lambda_n J(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big),$$
(19)

for any sequences $\tilde\theta_n$ and $\bar\theta_n$ satisfying $\tilde\theta_n = \theta_0 + o_P(1)$ and $\bar\theta_n = \theta_0 + o_P(1)$. If θ0 is an interior point of Θ and $\hat\theta_{\lambda_n}$ is consistent, then we have

$$\sqrt{n}\,(\hat\theta_{\lambda_n} - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde{I}_0^{-1}\tilde\ell_0(X_i) + O_P(n^{1/2}\lambda_n^2),$$
(20)

$$\log\mathrm{pl}_{\lambda_n}(\tilde\theta_n) = \log\mathrm{pl}_{\lambda_n}(\hat\theta_{\lambda_n}) - \frac{n}{2}(\tilde\theta_n - \hat\theta_{\lambda_n})^T \tilde{I}_0 (\tilde\theta_n - \hat\theta_{\lambda_n}) + O_P\big(g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|)\big),$$
(21)

where $g_{\lambda_n}(w) = nw^3 + nw^2\lambda_n + nw\lambda_n^2 + n^{1/2}\lambda_n^2$, provided the efficient information $\tilde{I}_0$ is positive definite.

For the verification of (18), we need a Glivenko–Cantelli theorem for classes of functions that change with n; this is a modification of theorem 2.4.3 in [22] and is explained in the appendix. Moreover, (19) implies that $J(\hat\eta_{\lambda_n}) = O_P(1)$ when $\hat\theta_{\lambda_n}$ is asymptotically normal, which is shown in (20).

Remark 2

The results in theorem 1 are useful in their own right for inference about θ. In particular, (20) is a second order frequentist result in penalized semiparametric estimation, establishing the higher order asymptotic linearity of the maximum penalized likelihood estimator of θ.

4 Main Results and Implications

We now state the main results on the penalized posterior profile distribution. A preliminary result, theorem 2 together with corollary 1 below, shows that the penalized posterior profile distribution is asymptotically close to a normal distribution with mean $\hat\theta_{\lambda_n}$ and variance $(n\tilde{I}_0)^{-1}$, with second order accuracy controlled by the smoothing parameter. Similar conclusions also hold for the penalized posterior moments. Another main result, theorem 3, shows that the penalized posterior profile log-likelihood can be used to achieve second order accurate frequentist inference for θ.

Let $P_{\theta|X}^{\lambda_n}$ be the penalized posterior profile distribution of θ with respect to the prior ρ(θ). Define

$$\Delta_{\lambda_n}(\theta) = n^{-1}\big\{\log\mathrm{pl}_{\lambda_n}(\theta) - \log\mathrm{pl}_{\lambda_n}(\hat\theta_{\lambda_n})\big\}.$$

Theorem 2

Let (20) and (21) be satisfied and suppose that

$$\Delta_{\lambda_n}(\tilde\theta_n) = o_P(1) \quad\text{implies}\quad \tilde\theta_n = \theta_0 + o_P(1),$$
(22)

for every random sequence $\{\tilde\theta_n\} \subset \Theta$. If the prior ρ is proper with ρ(θ0) > 0, and ρ(·) has a continuous and finite first derivative in some neighborhood of θ0, then we have

$$\sup_{\xi\in\mathbb{R}^d}\Big|P_{\theta|X}^{\lambda_n}\big(\sqrt{n}\,\tilde{I}_0^{1/2}(\theta - \hat\theta_{\lambda_n}) \le \xi\big) - \Phi_d(\xi)\Big| = O_P(n^{1/2}\lambda_n^2),$$
(23)

where $\Phi_d(\cdot)$ is the distribution function of the d-dimensional standard normal random vector.

Corollary 1

Under the assumptions of theorem 2, we have that if θ has finite second absolute moment, then

$$\hat\theta_{\lambda_n} = E_{\theta|X}^{\lambda_n}(\theta) + O_P(\lambda_n^2),$$
(24)

$$\tilde{I}_0 = n^{-1}\big(\mathrm{Var}_{\theta|X}^{\lambda_n}(\theta)\big)^{-1} + O_P(n^{1/2}\lambda_n^2),$$
(25)

where $E_{\theta|X}^{\lambda_n}(\theta)$ and $\mathrm{Var}_{\theta|X}^{\lambda_n}(\theta)$ are the penalized posterior profile mean and the penalized posterior profile covariance matrix, respectively.

We now present another second order asymptotic frequentist property of the penalized profile sampler, in terms of quantiles. The α-th quantile of the penalized posterior profile distribution, $\tau_n^\alpha$, is defined as $\tau_n^\alpha = \inf\{\xi : P_{\theta|X}^{\lambda_n}(\theta \le \xi) \ge \alpha\}$, where the infimum is taken componentwise. Without loss of generality, we can assume $P_{\theta|X}^{\lambda_n}(\theta \le \tau_n^\alpha) = \alpha$ because of the assumed smoothness of both the prior and the likelihood in our setting. We also define $\kappa_n^\alpha \equiv \sqrt{n}(\tau_n^\alpha - \hat\theta_{\lambda_n})$, i.e., $P_{\theta|X}^{\lambda_n}(\sqrt{n}(\theta - \hat\theta_{\lambda_n}) \le \kappa_n^\alpha) = \alpha$. Note that neither $\tau_n^\alpha$ nor $\kappa_n^\alpha$ is unique if the dimension of θ is larger than one.

Theorem 3

Under the assumptions of theorem 2, and assuming that $\tilde\ell_0(X)$ has a finite third moment with a nondegenerate distribution, there exists a $\hat\kappa_n^\alpha$ based on the data such that $P(\sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) \le \hat\kappa_n^\alpha) = \alpha$ and $\hat\kappa_n^\alpha - \kappa_n^\alpha = O_P(n^{1/2}\lambda_n^2)$ for each choice of $\kappa_n^\alpha$.

Remark 3

Theorem 3 ensures that, in the frequentist set-up, there exists an α-th quantile for θ that is unique up to $O_P(\lambda_n^2)$ for each fixed $\tau_n^\alpha$. Note that $\tau_n^\alpha$ is not unique if the dimension of θ is larger than one.

Remark 4

Theorem 2, corollary 1 and theorem 3 above show that the penalized profile sampler generates second order asymptotic frequentist valid results in terms of distributions, moments and quantiles. Moreover, the second order accuracy of this procedure is controlled by the smoothing parameter.

Remark 5

Another interpretation of the role of λn in the penalized profile sampler is that λn can be viewed as inducing a prior on J(η), and thus on η to some extent. To see this, we can write $\mathrm{lik}_{\lambda_n}(\theta, \eta)$ in the following form:

$$\mathrm{lik}_{\lambda_n}(\theta, \eta) = \mathrm{lik}_n(\theta, \eta) \times \exp\left[-\frac{J^2(\eta)}{2\,(2n\lambda_n^2)^{-1}}\right].$$

This idea can be traced back to [23]. In other words, the prior on J(η) is a normal distribution with mean zero and variance $(2n\lambda_n^2)^{-1}$. Hence it is natural to expect λn to have some effect on the convergence rate of η. Other possible priors on the functional parameter include Dirichlet and Gaussian processes, which are more commonly used in nonparametric Bayesian methodology.

5 Examples (Continued)

We now illustrate verification of the assumptions in section 3.2 with the two examples that were introduced in section 2. Thus this section is a continuation of the earlier examples.

5.1 Partly Linear Normal Model with Current Status Data

In this section we verify the regularity conditions for the partly linear model with current status data as well as present a small simulation study to gain insight into the moderate sample size agreement with the asymptotic theory.

5.1.1 Verification of conditions

We concentrate on estimation of the regression coefficient θ, treating the infinite-dimensional parameter $f \in \mathcal{H}_k^M$ as a nuisance parameter. The strengthened condition on f, together with the requirement that the density of the joint distribution of (U, V, C) be strictly positive and finite, is needed to verify the rate assumptions (27) and (28) in lemma 1 below. The score function for θ, $\dot\ell_{\theta,f}$, is given as follows:

$$\dot\ell_{\theta,f}(x) = u\,Q(x; \theta, f),$$

where

$$Q(X; \theta, f) = (1-\Delta)\,\frac{\varphi(q_{\theta,f}(X))}{1 - \Phi(q_{\theta,f}(X))} - \Delta\,\frac{\varphi(q_{\theta,f}(X))}{\Phi(q_{\theta,f}(X))},$$

$q_{\theta,f}(x) = c - \theta u - f(v)$, and φ is the density of the standard normal distribution. The least favorable direction at the true parameter value is:

$$h_0(v) = \frac{E_0\big(U Q^2(X; \theta, f) \mid V = v\big)}{E_0\big(Q^2(X; \theta, f) \mid V = v\big)},$$

where E0 is the expectation relative to the true parameters. The derivation of [l with dot above]θ,f and h0(·) is given in [3]. Thus, the least favorable submodel can be constructed as follows:

$$\ell(t, \theta, f) = \log\mathrm{lik}(t, f_t(\theta, f)),$$
(26)

where $f_t(\theta, f) = f + (\theta - t)h_0$. The concrete forms of $\ell(t, \theta, f)$ and the related derivatives are given in [3], which considers a more rigid model with a known upper bound on the L2 norm of the k-th derivative. The remaining assumptions are verified in the following three lemmas:

Lemma 1

Under the above set-up for the partly linear normal model with current status data, for λn satisfying (3) and $\tilde\theta_n \to_p \theta_0$, we have

$$\|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2 = O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big),$$
(27)

$$\lambda_n J(\hat f_{\tilde\theta_n,\lambda_n}) = O_P\big(\lambda_n + \|\tilde\theta_n - \theta_0\|\big),$$
(28)

where $\|\cdot\|_2$ denotes the usual L2 norm. Moreover, if we also assume that $f \in \{g : \|g\| + J(g) \le \tilde{M}\}$ for some known $\tilde{M}$, then

$$\|\hat f_{\tilde\theta_n} - f_0\|_2 = O_P\big(n^{-k/(2k+1)} + \|\tilde\theta_n - \theta_0\|\big),$$
(29)

provided condition (3) holds.

Remark 6

Lemma 1 implies, by comparison of (27) and (29), that the convergence rate of the penalized estimator of the nuisance parameter is slower than that of the unpenalized estimator. This result is not surprising, since the slower rate is the trade-off for a smoother estimator of the nuisance parameter. The advantage of the penalized profile sampler, however, is that we can control the convergence rate by assigning smoothing parameters with different rates. To obtain the convergence rate of the unpenalized estimator of the nuisance parameter, we would need to assume that the Sobolev norm of the nuisance parameter has a known upper bound; thus the penalized method permits a relaxation of the assumptions on the nuisance parameter. Lemma 1 also indicates that $\|\hat f_{\lambda_n} - f_0\|_2 = O_P(\lambda_n)$ and $\|\hat f_n - f_0\|_2 = O_P(n^{-k/(2k+1)})$. Note that the convergence rate of the maximum penalized likelihood estimator, $O_P(\lambda_n)$, is deemed the optimal rate in [23]. Similar remarks hold for lemma 4 in the semiparametric logistic regression example below.

Lemmas 1 and 4 imply that $J(\hat f_{\lambda_n}) = O_P(1)$ and $J(\hat\eta_{\lambda_n}) = O_P(1)$, respectively. Thus the maximum penalized likelihood estimators of the nuisance parameters in the two examples of this paper are consistent in the uniform norm, i.e. $\|\hat\eta_{\lambda_n} - \eta_0\|_\infty = o_P(1)$ and $\|\hat f_{\lambda_n} - f_0\|_\infty = o_P(1)$, since the sequences $\hat\eta_{\lambda_n}$ and $\hat f_{\lambda_n}$ consist of smooth functions defined on a compact set with asymptotically bounded first-order derivatives.

Lemma 2

Under the above set-up for the partly linear normal model with current status data, assumptions (13), (14) and (18) are satisfied.

Lemma 3

Under the above set-up for the partly linear normal model with current status data, condition (22) is satisfied.

5.1.2 Simulation study

In this subsection, we conduct simulations for the partly linear model with two different sizes of smoothing parameter, λn = n^{−1/3} and λn = n^{−2/5}. Since we assume $f \in \mathcal{H}_2^M$ in the model, both smoothing parameters satisfy (3). Our experience indicates that, in applications involving moderate sample sizes, specification of M is not needed and λn = n^{−1/3} (n^{−2/5}) appears to work most of the time. Using cross-validation to choose λn might improve the performance of the estimator in some settings, but evaluating this issue requires further study and is beyond the scope of the current paper. The contrast between the two simulations agrees with our theoretical result that the accuracy of inference based on the penalized profile sampler can be controlled by adjusting the smoothing parameter.

We next discuss the computation of $\hat f_{\theta,\lambda_n}$ in the simulations. For the special case $k = 2$, we can use a cubic spline to estimate $f$ for fixed $\theta$ and $\lambda_n$. In practice, we take the computational sieve approach suggested by Xiang and Wahba [24]: an estimate whose number of basis functions grows at least at the rate $O(n^{1/5})$ achieves the same asymptotic precision as the estimate over the full space; see section 8.2 in [10] for details.
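As a concrete illustration of this penalized smoothing step, the following is a minimal numpy sketch of a discrete penalized least-squares smoother with a second-difference roughness penalty, a simplified stand-in for the cubic smoothing spline (the actual profile step maximizes the penalized current status likelihood rather than a Gaussian one; the grid, penalty form and $\lambda$ value below are illustrative assumptions):

```python
import numpy as np

def penalized_smoother(y, lam):
    """Minimize ||y - f||^2 + lam * ||D2 f||^2 over the vector f of function
    values at the sorted design points, where D2 takes second differences.
    This mimics the roughness-penalized fit of f at a fixed theta."""
    n = len(y)
    D2 = np.zeros((n - 2, n))  # discrete second-difference operator
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    # normal equations of the penalized least-squares problem
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

rng = np.random.default_rng(0)
v = np.sort(rng.uniform(-1, 1, 200))             # design points, as in the simulation
y = np.sin(np.pi * v) + rng.normal(0, 0.3, 200)  # noisy observations of f(v)
f_hat = penalized_smoother(y, lam=5.0)
```

Increasing `lam` moves the fit toward a straight line (the null space of the second-difference penalty), mirroring how a larger $\lambda_n$ enforces a smoother nuisance estimate.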

In the following, the simulations are run for various sample sizes under a Lebesgue prior. For each sample size, 200 datasets were analyzed. The regression coefficient is $\theta = 1$ and $f(v) = \sin(\pi v)$. We generate $U \sim \mathrm{Unif}[0, 1]$, $V \sim \mathrm{Unif}[-1, 1]$ and $C \sim \mathrm{Unif}[0, 2]$. For each dataset, Markov chains of length 20,000 with a burn-in period of 5,000 were generated using the Metropolis algorithm. The jumping density for the coefficient was normal, centered at the current iterate, with variance tuned to yield an acceptance rate of 20%–40%. The approximate variance of the estimator of $\theta$ was computed by numerical differentiation with step size proportional to $n^{-1/3}$ ($n^{-2/5}$) for the model with smoothing parameter $\lambda_n = n^{-1/3}$ ($n^{-2/5}$), according to (21); see remark 1 in [3] for details.
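The Metropolis scheme just described can be sketched as follows; the quadratic `log_post` at the bottom is a hypothetical stand-in for $\theta \mapsto \log pl_{\lambda_n}(\theta)$ plus the log prior, since evaluating the true target would require re-profiling the penalized likelihood at every proposal:

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=20000, burn=5000, step=0.1, seed=0):
    """Random-walk Metropolis with a normal jumping density centered at the
    current iterate, returning the post-burn-in chain and acceptance rate."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
            theta, lp, accepted = prop, lp_prop, accepted + 1
        chain.append(theta)
    return np.array(chain[burn:]), accepted / n_iter

# toy target in place of the penalized profile log-likelihood
chain, acc = metropolis(lambda t: -0.5 * (t - 1.0) ** 2 / 0.04, theta0=1.0, step=0.4)
```

In practice `step` is tuned until the acceptance rate falls in the 20%–40% range reported above.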

Tables 1 and 2 below summarize the simulation results for $\theta$ with smoothing parameters $\lambda_n = n^{-1/3}$ and $\lambda_n = n^{-2/5}$, respectively, giving the average across 200 samples of the penalized maximum likelihood estimate (PMLE), the mean of the penalized profile sampler (CM), estimated standard errors based on MCMC (SEM), estimated standard errors based on numerical derivatives (SEN), and the boundaries of the two-sided 95% confidence interval for $\theta$ generated by numerical differentiation and by MCMC. LM (LN) and UM (UN) denote the lower and upper bounds of the confidence interval from the MCMC chain (numerical derivative). According to the above theoretical results, the terms $n^{2/3}|\mathrm{PMLE} - \mathrm{CM}|$ ($n^{4/5}|\mathrm{PMLE} - \mathrm{CM}|$), $n^{1/6}|\mathrm{SEM} - \mathrm{SEN}|$ ($n^{3/10}|\mathrm{SEM} - \mathrm{SEN}|$), $n^{2/3}|\mathrm{LM} - \mathrm{LN}|$ ($n^{4/5}|\mathrm{LM} - \mathrm{LN}|$) and $n^{2/3}|\mathrm{UM} - \mathrm{UN}|$ ($n^{4/5}|\mathrm{UM} - \mathrm{UN}|$) in Table 1 (2) are bounded in probability. The realizations of these terms summarized in Tables 1 and 2 clearly illustrate this boundedness. Furthermore, we conclude that the penalized profile sampler with different sizes of smoothing parameter yields statistical inference with correspondingly different degrees of accuracy.

Table 1
Partly Linear Model with λn = n−1/30 = 1 and 200 samples)
Table 2
Partly Linear Model with λn = n−2/50 = 1 and 200 samples)
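The SEN column described above can be reproduced mechanically: the variance estimate is the negative inverse of a central second difference of the profiled log-likelihood at the PMLE, with the step size tied to the smoothing rate. Below is a small sketch; the quadratic log-likelihood used in the check is a toy stand-in for $\log pl_{\lambda_n}(\cdot)$:

```python
import numpy as np

def se_numeric(log_pl, theta_hat, h):
    """SE of theta_hat via numerical differentiation:
    Var ~ -1 / (d^2/dtheta^2) log pl(theta_hat), central second difference."""
    d2 = (log_pl(theta_hat + h) - 2.0 * log_pl(theta_hat) + log_pl(theta_hat - h)) / h ** 2
    return np.sqrt(-1.0 / d2)

n = 200
h = n ** (-1.0 / 3.0)  # step proportional to n^{-1/3}, matching lambda_n
# toy profiled log-likelihood with curvature -25, so the true SE is 0.2
se = se_numeric(lambda t: -(t - 1.0) ** 2 / 0.08, 1.0, h)  # ~ 0.2
```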

5.2 Semiparametric Logistic Regression

In the semiparametric logistic regression model, we can obtain the score functions for $\theta$ and $\eta$ by analysis similar to that performed in the first example, i.e. $\dot\ell_{\theta,\eta}(x) = (y - F(\theta w + \eta(z)))w$ and $A_{\theta,\eta}h_{\theta,\eta}(x) = (y - F(\theta w + \eta(z)))h_{\theta,\eta}(z)$ for $J(h) < \infty$, where $A_{\theta,\eta}$ and $h_{\theta,\eta}$ are the score operator for $\eta$ and the least favorable direction at $(\theta, \eta)$, respectively. The least favorable direction at the true parameter is given in [15]:

$$h_0(z) = \frac{P_0[W\,\dot F(\theta_0W + \eta_0(Z)) \mid Z = z]}{P_0[\dot F(\theta_0W + \eta_0(Z)) \mid Z = z]},$$

where $\dot F(u) = F(u)(1 - F(u))$. The above assumptions, plus the requirement that $J(h_0) < \infty$, ensure the identifiability of the parameters. Thus the least favorable submodel can be written as:

$$\ell(t, \theta, \eta) = \log lik(t, \eta_t(\theta, \eta)),$$

where ηt(θ, η) = η + (θt)h0. By differentiating [ell](t, θ, η) with respect to t or θ, we obtain,

$$\begin{aligned}
\dot\ell(t,\theta,\eta) &= \big(y - F(tw + \eta(z) + (\theta - t)h_0(z))\big)(w - h_0(z)),\\
\ddot\ell(t,\theta,\eta) &= -\dot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))^2,\\
\ell_{t,\theta}(t,\theta,\eta) &= -\dot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))h_0(z),\\
\ell^{(3)}(t,\theta,\eta) &= -\ddot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))^3,\\
\ell_{t,t,\theta}(t,\theta,\eta) &= -\ddot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))^2h_0(z),\\
\ell_{t,\theta,\theta}(t,\theta,\eta) &= -\ddot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))h_0^2(z),
\end{aligned}$$

where $\ddot F(\cdot)$ is the second derivative of the function $F(\cdot)$. The rate assumptions are shown in Lemma 4, and the remaining assumptions are verified in the last two lemmas below.
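As a sanity check on the first identity above, one can compare the analytic $\dot\ell$ with a finite difference of $\ell$ in $t$, taking $F$ to be the logistic distribution function; the particular $\eta$, $h_0$ and evaluation point below are arbitrary illustrative choices:

```python
import math

def F(u):
    """Logistic distribution function, so that F'(u) = F(u)(1 - F(u))."""
    return 1.0 / (1.0 + math.exp(-u))

def ell(t, theta, w, z, y, eta, h0):
    """Least favorable submodel log-likelihood log lik(t, eta_t(theta, eta))."""
    u = t * w + eta(z) + (theta - t) * h0(z)
    return y * math.log(F(u)) + (1 - y) * math.log(1.0 - F(u))

def ell_dot(t, theta, w, z, y, eta, h0):
    """Analytic derivative in t: (y - F(u)) * (w - h0(z))."""
    u = t * w + eta(z) + (theta - t) * h0(z)
    return (y - F(u)) * (w - h0(z))

eta, h0 = (lambda z: math.sin(z)), (lambda z: 0.3 * z)  # illustrative choices
t, theta, w, z, y, d = 0.4, 1.0, 0.7, 0.5, 1, 1e-6
fd = (ell(t + d, theta, w, z, y, eta, h0) - ell(t - d, theta, w, z, y, eta, h0)) / (2 * d)
```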

Lemma 4

Under the above set-up for the semiparametric logistic regression model, we have for $\lambda_n$ satisfying condition (3) and any $\tilde\theta_n \to \theta_0$ in probability that

$$\|\hat\eta_{\tilde\theta_n,\lambda_n} - \eta_0\|_2 = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|),$$
(30)
$$\lambda_nJ(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|).$$
(31)

If we also assume that $\eta \in \{g: \|g\| + J(g) \le \tilde M\}$ for some known $\tilde M$, then

$$\|\hat\eta_{\tilde\theta_n} - \eta_0\|_2 = O_P(n^{-k/(2k+1)} + \|\tilde\theta_n - \theta_0\|),$$
(32)

provided condition (3) holds.

Lemma 5

Under the above set-up for the semiparametric logistic regression model, assumptions (13), (14) and (18) are satisfied.

Lemma 6

Under the above set-up for the semiparametric logistic regression model, condition (22) is satisfied.

6 Future Work

Our paper evaluates the penalized profile sampler method from the frequentist viewpoint and discusses the effect of the smoothing parameter on estimation accuracy. One problem of interest is to sharpen the upper bound for the convergence rate of the approximation error in this paper, in the manner of the typical second-order asymptotic results obtained from Edgeworth expansions; see, for example, [1]. A formal study of higher order comparisons between the profile sampler procedure and a fully Bayesian procedure [17], which assigns priors to both the finite dimensional parameter and the infinite dimensional nuisance parameter, would also be interesting. We expect that a suitable prior on the infinite dimensional parameter would at least not decrease the estimation accuracy for the parameter of interest.

Another worthwhile avenue of research is to develop analogs of the profile sampler and penalized profile sampler for likelihood estimation under model misspecification and for general M-estimation. Some first order results for the setting in which the nuisance parameter may not be root-n consistent have been developed for a weighted bootstrap procedure in [11]. Studies of second order asymptotics under mild model misspecification could provide theoretical insights into semiparametric model selection problems.

Acknowledgments

The authors thank Dr. Joseph Kadane for several insightful discussions.

7 Appendix

We first state the classical definitions of the covering number (entropy number) and bracketing number (bracketing entropy number) for a class of functions, and then present some technical tools for entropy calculations and increments of empirical processes which will be employed in the proofs that follow. The notations $\gtrsim$ and $\lesssim$ mean greater than, or smaller than, up to a universal constant.

Definition

Let $\mathcal{A}$ be a subset of a (pseudo-)metric space $(L, d)$ of real-valued functions. The $\delta$-covering number $N(\delta, \mathcal{A}, d)$ of $\mathcal{A}$ is the smallest $N$ for which there exist functions $a_1, \ldots, a_N$ in $L$ such that for each $a \in \mathcal{A}$, $d(a, a_j) \le \delta$ for some $j \in \{1, \ldots, N\}$. The $\delta$-bracketing number $N_B(\delta, \mathcal{A}, d)$ is the smallest $N$ for which there exist pairs of functions $\{[a_j^L, a_j^U]\}_{j=1}^N \subset L$, with $d(a_j^L, a_j^U) \le \delta$, $j = 1, \ldots, N$, such that for each $a \in \mathcal{A}$ there is a $j \in \{1, \ldots, N\}$ such that $a_j^L \le a \le a_j^U$. The $\delta$-entropy number ($\delta$-bracketing entropy number) is defined as $H(\delta, \mathcal{A}, d) = \log N(\delta, \mathcal{A}, d)$ ($H_B(\delta, \mathcal{A}, d) = \log N_B(\delta, \mathcal{A}, d)$).
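To make the covering-number definition concrete, the sketch below computes a greedy upper bound on $N(\delta, \mathcal{A}, d)$ for a small finite class of functions evaluated on a grid, under the supremum metric (choosing centers greedily from the class itself gives an upper bound on the covering number, not necessarily the exact minimum; the class of linear functions used here is an arbitrary illustration):

```python
def covering_number_upper(points, delta, dist):
    """Greedy upper bound on the delta-covering number of a finite set:
    scan the points, opening a new center whenever a point is farther than
    delta from all existing centers."""
    centers = []
    for p in points:
        if all(dist(p, c) > delta for c in centers):
            centers.append(p)
    return len(centers)

# functions f_a(x) = a * x on [0, 1], represented by their values on a grid
grid = [i / 10 for i in range(11)]
funcs = [[a * x for x in grid] for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
sup = lambda f, g: max(abs(u - v) for u, v in zip(f, g))  # supremum metric
n_cov = covering_number_upper(funcs, 0.3, sup)  # 3 centers suffice here
```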

T1. For each $0 < C < \infty$ and $\delta > 0$ we have

$$H_B(\delta, \{\eta: \|\eta\|_\infty \le C, J(\eta) \le C\}, \|\cdot\|_\infty) \lesssim (C/\delta)^{1/k},$$
(33)

$$H(\delta, \{\eta: \|\eta\|_\infty \le C, J(\eta) \le C\}, \|\cdot\|_\infty) \lesssim (C/\delta)^{1/k}.$$
(34)

T2. Let $\mathcal{F}$ be a class of measurable functions such that $Pf^2 < \delta^2$ and $\|f\|_\infty \le M$ for every $f$ in $\mathcal{F}$. Then

$$E_P\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim K(\delta, \mathcal{F}, L_2(P))\left(1 + \frac{K(\delta, \mathcal{F}, L_2(P))}{\delta^2\sqrt{n}}M\right),$$

where $\|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}}|\mathbb{G}_nf|$ and $K(\delta, \mathcal{F}, \|\cdot\|) = \int_0^\delta\sqrt{1 + H_B(\varepsilon, \mathcal{F}, \|\cdot\|)}\,d\varepsilon$.

T3. Let $\mathcal{F} = \{f_t: t \in T\}$ be a class of functions satisfying $|f_s(x) - f_t(x)| \le d(s, t)F(x)$ for every $s$ and $t$ and some fixed function $F$. Then, for any norm $\|\cdot\|$,

$$N_B(2\varepsilon\|F\|, \mathcal{F}, \|\cdot\|) \le N(\varepsilon, T, d).$$

T4. Let $\mathcal{F}$ be a class of measurable functions $f: \mathcal{D} \times \mathcal{W} \mapsto \mathbb{R}$ on the product of a finite set $\mathcal{D}$ and an arbitrary measurable space $(\mathcal{W}, \mathcal{B})$. Let $P$ be a probability measure on $\mathcal{D} \times \mathcal{W}$ and let $P_W$ be its marginal on $\mathcal{W}$. For every $d \in \mathcal{D}$, let $\mathcal{F}_d$ be the set of functions $w \mapsto f(d, w)$ as $f$ ranges over $\mathcal{F}$. If every class $\mathcal{F}_d$ is $P_W$-Donsker with $\sup_{f\in\mathcal{F}}|P_Wf(d, W)| < \infty$ for every $d$, then $\mathcal{F}$ is $P$-Donsker.

T5. Let $\mathcal{F}$ be a uniformly bounded class of measurable functions such that, for some measurable $f_0$, $\sup_{f\in\mathcal{F}}\|f - f_0\|_\infty < \infty$. Moreover, assume that $H_B(\varepsilon, \mathcal{F}, L_2(P)) \le K\varepsilon^{-\alpha}$ for some $K < \infty$, some $\alpha \in (0, 2)$ and all $\varepsilon > 0$. Then

$$\sup_{f\in\mathcal{F}}\left[\frac{(P_n - P)(f - f_0)}{\|f - f_0\|_2^{1-\alpha/2} \vee n^{(\alpha-2)/[2(2+\alpha)]}}\right] = O_P(n^{-1/2}).$$

T6. For a probability measure $P$, let $\mathcal{F}_1$ be a class of measurable functions $f_1: \mathcal{X} \mapsto \mathbb{R}$, and let $\mathcal{F}_2$ denote a class of continuous nondecreasing functions $f_2: \mathbb{R} \mapsto [0, 1]$. Then,

$$H_B(\varepsilon, \mathcal{F}_2(\mathcal{F}_1), L_2(P)) \le 2H_B(\varepsilon/3, \mathcal{F}_1, L_2(P)) + \sup_QH_B(\varepsilon/3, \mathcal{F}_2, L_2(Q)).$$

T7. Let $\mathcal{F}$ and $\mathcal{G}$ be classes of measurable functions. Then for any probability measure $Q$ and any $1 \le r \le \infty$,

$$H_B(2\varepsilon, \mathcal{F} + \mathcal{G}, L_r(Q)) \le H_B(\varepsilon, \mathcal{F}, L_r(Q)) + H_B(\varepsilon, \mathcal{G}, L_r(Q)),$$
(35)

and, provided $\mathcal{F}$ and $\mathcal{G}$ are bounded by 1 in terms of $\|\cdot\|_\infty$,

$$H_B(2\varepsilon, \mathcal{F}\cdot\mathcal{G}, L_r(Q)) \le H_B(\varepsilon, \mathcal{F}, L_r(Q)) + H_B(\varepsilon, \mathcal{G}, L_r(Q)),$$
(36)

where $\mathcal{F}\cdot\mathcal{G} \equiv \{f \times g: f \in \mathcal{F}\ \text{and}\ g \in \mathcal{G}\}$.

Remark 7

The proof of T1 can be found in [22]. T1 implies that the Sobolev class of functions with known bounded Sobolev norm is P-Donsker. T2 and T3 are lemma 3.4.2 and theorem 2.7.11 in [22], respectively. T4 is lemma 9.2 in [16]. T5 is a result presented on page 79 of [20]; it is a special case of lemma 5.13 on the same page, whose proof can be found on pages 79–80. T6 and T7 are lemma 15.2 and lemma 9.24 in [7], respectively.

Proof of theorem 1

We first show (20); we then state one lemma before proceeding to the proof of (21). For the proof of (20), note that

$$0 = P_n\dot\ell(\hat\theta_{\lambda_n}, \hat\theta_{\lambda_n}, \hat\eta_{\lambda_n}) + 2\lambda_n^2\int_{\mathcal{Z}}\hat\eta_{\lambda_n}^{(k)}(z)h_0^{(k)}(z)\,dz.$$

Combining the third order Taylor expansion of $\hat\theta_{\lambda_n} \mapsto P_n\dot\ell(\hat\theta_{\lambda_n}, \theta, \eta)$ around $\theta_0$, where $\theta = \hat\theta_{\lambda_n}$ and $\eta = \hat\eta_{\lambda_n}$, with conditions (13), (14) and (18), the first term on the right-hand side of the above display equals $P_n\tilde\ell_0 - \tilde I_0(\hat\theta_{\lambda_n} - \theta_0) + O_P(\lambda_n + \|\hat\theta_{\lambda_n} - \theta_0\|)^2$. By the inequality $2\lambda_n^2\int_{\mathcal{Z}}\hat\eta_{\lambda_n}^{(k)}(z)h_0^{(k)}(z)\,dz \le \lambda_n^2(J^2(\hat\eta_{\lambda_n}) + J^2(h_0))$ and assumption (19), the second term on the right-hand side is also equal to $O_P(\lambda_n + \|\hat\theta_{\lambda_n} - \theta_0\|)^2$. Combining everything, we obtain the following:

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n\tilde I_0^{-1}\tilde\ell_0(X_i) = \sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) + O_P\big(\sqrt{n}(\lambda_n + \|\hat\theta_{\lambda_n} - \theta_0\|)^2\big).$$
(37)

The right-hand side of (37) is of the order $O_P(\sqrt{n}\lambda_n^2 + \sqrt{n}w_n(1 + w_n + \lambda_n))$, where $w_n$ denotes $\|\hat\theta_{\lambda_n} - \theta_0\|$, while its left-hand side is trivially $O_P(1)$. Considering the fact that $\sqrt{n}\lambda_n^2 = o_P(1)$, we can deduce that $\hat\theta_{\lambda_n} - \theta_0 = O_P(n^{-1/2})$. Inserting this into the previous display completes the proof of (20).

We next prove (21). Note that $\hat\theta_{\lambda_n} - \theta_0 = O_P(n^{-1/2})$. Hence the orders of the remainder terms in (13) and (14) become $O_P(\lambda_n + \|\tilde\theta_n - \hat\theta_{\lambda_n}\|)^2$ and $O_P(\lambda_n + \|\tilde\theta_n - \hat\theta_{\lambda_n}\|)$, respectively. Expression (56) in lemma 7 below implies that

$$\log pl_{\lambda_n}(\hat\theta_{\lambda_n}) = \log pl_{\lambda_n}(\theta_0) + n(\hat\theta_{\lambda_n} - \theta_0)^TP_n\tilde\ell_0 - \frac{n}{2}(\hat\theta_{\lambda_n} - \theta_0)^T\tilde I_0(\hat\theta_{\lambda_n} - \theta_0) + O_P(n^{1/2}\lambda_n^2).$$
(38)

The difference between (38) and (56) yields

$$\log pl_{\lambda_n}(\tilde\theta_n) = \log pl_{\lambda_n}(\hat\theta_{\lambda_n}) + n(\tilde\theta_n - \hat\theta_{\lambda_n})^T\big(P_n\tilde\ell_0 - \tilde I_0(\hat\theta_{\lambda_n} - \theta_0)\big) - \frac{n}{2}(\tilde\theta_n - \hat\theta_{\lambda_n})^T\tilde I_0(\tilde\theta_n - \hat\theta_{\lambda_n}) + O_P(g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|)).$$

Expression (21) is now immediately obtained upon considering (20).

Proof of theorem 2

Suppose that $F_{\lambda_n}(\cdot)$ is the penalized posterior profile distribution of $\sqrt{n}\,\varrho_n$ with respect to the prior $\rho(\theta)$, where the vector $\varrho_n$ is defined as $\tilde I_0^{1/2}(\theta - \hat\theta_{\lambda_n})$. The parameter set for $\varrho_n$ is $\Xi_n$. $F_{\lambda_n}(\cdot)$ can be expressed as:

$$F_{\lambda_n}(\xi) = \frac{\int_{\varrho_n\in(-\infty,\,n^{-1/2}\xi]\cap\Xi_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\,\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}{\int_{\varrho_n\in\Xi_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\,\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}.$$
(39)

Note that $d\varrho_n$ in the above is short notation for $d\varrho_{n1} \times \cdots \times d\varrho_{nd}$. To prove theorem 2, we first partition the parameter set $\Xi_n$ as $\{\Xi_n \cap \{\|\varrho_n\| > r_n\}\} \cup \{\Xi_n \cap \{\|\varrho_n\| \le r_n\}\}$. By choosing the proper order of $r_n$, the posterior mass in the first region is of arbitrarily small order, as verified in lemma 2.1 immediately below, while the mass in the second region can be approximated by a stochastic polynomial in powers of $n^{-1/2}$ with an error whose order depends on the smoothing parameter, as verified in lemma 2.2 below. This technique applies to both the denominator and the numerator, yielding the quotient series, which gives the desired result.

lemma 2.1

Choose $r_n = o(n^{-1/3})$ with $\sqrt{n}\,r_n \to \infty$. Under the conditions of theorem 2, we have

$$\int_{\|\varrho_n\| > r_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\,\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n = O_P(n^{-M}),$$
(40)

for any positive number $M$.

Proof

Fix $r > 0$. We then have

$$\int_{\|\varrho_n\| > r}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\,\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n \le 1\{\Delta_{\lambda_n}^r < -n^{-1/2}\}\exp(-\sqrt{n})\int_\Theta\rho(\theta)\,d\theta + 1\{\Delta_{\lambda_n}^r \ge -n^{-1/2}\},$$

where $\Delta_{\lambda_n}^r = \sup_{\|\varrho_n\| > r}\Delta_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)$. According to lemma 3.2 in [2], $1\{\Delta_{\lambda_n}^r \ge -n^{-1/2}\} = O_P(n^{-M})$ for any positive number $M$, and the above inequality holds uniformly over positive decreasing sequences $r_n \downarrow 0$. Therefore, we can choose a positive decreasing sequence $r_n = o(n^{-1/3})$ with $\sqrt{n}\,r_n \to \infty$ such that (40) holds.

lemma 2.2

Choose $r_n = o(n^{-1/3})$ with $\sqrt{n}\,r_n \to \infty$. Under the conditions of theorem 2, we have

$$\int_{\|\varrho_n\| \le r_n}\left|\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n) - \exp\Big(-\frac{n}{2}\varrho_n^T\varrho_n\Big)\rho(\hat\theta_{\lambda_n})\right|d\varrho_n = O_P(n^{-(d-1)/2}\lambda_n^2).$$
(41)

Proof

The posterior mass over the region $\|\varrho_n\| \le 2r_n$ is bounded by

$$\int_{\|\varrho_n\| \le 2r_n}\left|\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n}) - \exp\Big(-\frac{n}{2}\varrho_n^T\varrho_n\Big)\rho(\hat\theta_{\lambda_n})\right|d\varrho_n \quad (*)$$

$${} + \int_{\|\varrho_n\| \le 2r_n}\left|\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n) - \frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n})\right|d\varrho_n. \quad (**)$$

By (21), we obtain

$$(*) = \int_{\|\varrho_n\| \le 2r_n}\left[\rho(\hat\theta_{\lambda_n})\exp\Big(-\frac{n\varrho_n^T\varrho_n}{2}\Big)\Big|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1\Big|\right]d\varrho_n.$$

Obviously the order of $(*)$ depends on that of $|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1|$ for $\lambda_n$ satisfying (3) and $\|\varrho_n\| \le r_n$. To analyze this order, we partition the set $\{\lambda_n = o_P(n^{-1/4})\ \text{and}\ \lambda_n^{-1} = O_P(n^{k/(2k+1)})\}$ by means of the set $\{\lambda_n = O_P(n^{-1/3})\}$, i.e. $U_n = \{\lambda_n = o_P(n^{-1/4})\ \text{and}\ \lambda_n^{-1} = O_P(n^{k/(2k+1)})\} \cap \{\lambda_n = O_P(n^{-1/3})\}$ and $L_n = \{\lambda_n = o_P(n^{-1/4})\ \text{and}\ \lambda_n^{-1} = O_P(n^{k/(2k+1)})\} \cap \{\lambda_n = O_P(n^{-1/3})\}^C$. On the set $U_n$, we have $|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1| = g_{\lambda_n}(\|\varrho_n\|) \times O_P(1)$. On the set $L_n$, we have $O_P(g_{\lambda_n}(\|\varrho_n\|)) = O_P(n\|\varrho_n\|\lambda_n^2 + n^{1/2}\lambda_n^2)$, and we can take $r_n = n^{-1-\delta}\lambda_n^{-2}$ for some $\delta > 0$ such that $\sqrt{n}\,r_n \to \infty$ and $r_n = o(n^{-1/3})$; then $\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1 = (n\|\varrho_n\|\lambda_n^2 + n^{1/2}\lambda_n^2) \times O_P(1)$. Combining the above, we conclude that $(*) = O_P(n^{-(d-1)/2}\lambda_n^2)$. A similar analysis shows that $(**)$ has the same order. This completes the proof of lemma 2.2.

We now begin the formal proof of theorem 2. By combining lemma 2.1 and lemma 2.2, we know the denominator of (39) equals

$$\int_{\{\|\varrho_n\| \le 2r_n\}\cap\Xi_n}\left[\exp\Big(-\frac{n}{2}\varrho_n^T\varrho_n\Big)\rho(\hat\theta_{\lambda_n})\right]d\varrho_n + O_P(n^{-(d-1)/2}\lambda_n^2).$$

The first term in the above display equals

$$n^{-d/2}\rho(\hat\theta_{\lambda_n})\int_{\{\|u_n\| \le 2\sqrt{n}r_n\}\cap\sqrt{n}\Xi_n}e^{-u_n^Tu_n/2}\,du_n = n^{-d/2}\rho(\hat\theta_{\lambda_n})\int_{\mathbb{R}^d}e^{-u_n^Tu_n/2}\,du_n + O(n^{-(d-1)/2}\lambda_n^2),$$

where $u_n = \sqrt{n}\,\varrho_n$. The above equality follows from the inequality $\int_x^\infty e^{-y^2/2}\,dy \le x^{-1}e^{-x^2/2}$ for any $x > 0$. Consolidating the above analyses, we deduce that the denominator of (39) equals $n^{-d/2}\rho(\hat\theta_{\lambda_n})(2\pi)^{d/2} + O_P(n^{-(d-1)/2}\lambda_n^2)$. The same analysis applies to the numerator, which completes the proof.

Proof of corollary 1

We only show (24) in what follows; (25) can be verified similarly. Showing (24) is equivalent to establishing $E_{\lambda_n}^{\theta|x}(\varrho_n) = O_P(\lambda_n^2)$. Note that $E_{\lambda_n}^{\theta|x}(\varrho_n)$ can be written as:

$$E_{\lambda_n}^{\theta|x}(\varrho_n) = \frac{\int_{\varrho_n\in\Xi_n}\varrho_n\,\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\,\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}{\int_{\varrho_n\in\Xi_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\,\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}.$$

By analysis similar to that applied in the proof of theorem 2, the denominator in the above display is $n^{-d/2}(2\pi)^{d/2}\rho(\hat\theta_{\lambda_n}) + O_P(n^{-(d-1)/2}\lambda_n^2)$ and the numerator is a random vector of order $O_P(n^{-d/2}\lambda_n^2)$. This yields the conclusion.

Proof of theorem 3

Note that (23) implies $\kappa_{n\alpha} = \tilde I_0^{-1/2}z_\alpha + O_P(n^{-1/2}\lambda_n^2)$ for any $\xi < \alpha < 1 - \xi$, where $\xi \in (0, 1/2)$. Note also that the $\alpha$-th quantile of a $d$ dimensional standard normal distribution, $z_\alpha$, is not unique if $d > 1$. The classical Edgeworth expansion implies that $P(n^{-1/2}\sum_{i=1}^n\tilde I_0^{-1/2}\tilde\ell_0(X_i) \le z_\alpha + a_n(\alpha)) = \alpha$, where $a_n(\alpha) = O(n^{-1/2})$, for $\xi < \alpha < 1 - \xi$. Note that $a_n(\alpha)$ is uniquely determined for each fixed $z_\alpha$ since $\tilde\ell_0(X_i)$ has at least one absolutely continuous component. Let $\hat\kappa_{n\alpha} = \tilde I_0^{-1/2}z_\alpha + \big(\sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) - n^{-1/2}\sum_{i=1}^n\tilde I_0^{-1}\tilde\ell_0(X_i)\big) + \tilde I_0^{-1/2}a_n(\alpha)$. Then $P(\sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) \le \hat\kappa_{n\alpha}) = \alpha$. Combining with (20), we obtain $\hat\kappa_{n\alpha} = \kappa_{n\alpha} + O_P(n^{-1/2}\lambda_n^2)$. The uniqueness of $\kappa_{n\alpha}$ up to order $O_P(n^{-1/2}\lambda_n^2)$ follows from that of $a_n(\alpha)$ for each chosen $z_\alpha$.

Proof of lemma 1

We first present a technical lemma before the formal proof of lemma 1. In lemma 1.1 we define

$$\mathcal{K} = \left\{\frac{\ell_{\theta,\eta}(X) - \ell_0(X)}{1 + J(\eta)}: \|\theta - \theta_0\| \le C_1, \|\eta - \eta_0\| \le C_1, J(\eta) < \infty\right\},$$

for a known constant $C_1 < \infty$. Combined with T5, we use condition (42) below to control the order of the increments of the empirical processes indexed by $\ell_{\theta,\eta}$:

$$H_B(\varepsilon, \mathcal{K}, L_2(P)) \lesssim \varepsilon^{-1/k}.$$
(42)

We next assume two smoothness conditions on the criterion function $(\theta, \eta) \mapsto P\ell_{\theta,\eta}$, i.e.,

$$\|\ell_{\theta,\eta} - \ell_{\theta,\eta_0}\|_2 \lesssim \|\theta - \theta_0\| + d_\theta(\eta, \eta_0),$$
(43)

$$P(\ell_{\theta,\eta} - \ell_{\theta,\eta_0}) \lesssim -d_\theta^2(\eta, \eta_0) + \|\theta - \theta_0\|^2.$$
(44)

Here $d_\theta^2(\eta, \eta_0)$ can be thought of as the square of a distance, but the following lemma is valid for arbitrary functions $\eta \mapsto d_\theta^2(\eta, \eta_0)$. Finally, we assume a somewhat stronger condition on the density, i.e.,

$$p_{\theta,\eta}/p_{\theta,\eta_0}\ \text{is bounded away from zero and infinity.}$$
(45)

Condition (45) is trivially satisfied in our first model.

Lemma 1.1

Assume conditions (42)–(45) above hold for every $\theta \in \Theta_n$ and $\eta \in V_n$. Then we have

$$d_{\tilde\theta_n}(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|), \qquad \lambda_nJ(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|),$$

for $(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})$ satisfying $P(\tilde\theta_n \in \Theta_n, \hat\eta_{\tilde\theta_n,\lambda_n} \in V_n) \to 1$.

Proof of lemma 1.1

The definition of $\hat\eta_{\tilde\theta_n,\lambda_n}$ implies that

$$\lambda_n^2J^2(\hat\eta_{\tilde\theta_n,\lambda_n}) \le \lambda_n^2J^2(\eta_0) + (P_n - P)(\ell_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - \ell_{\tilde\theta_n,\eta_0}) + P(\ell_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - \ell_{\tilde\theta_n,\eta_0}) \equiv \lambda_n^2J^2(\eta_0) + I + II.$$

Note that by T5 and assumption (42), we have

$$I \lesssim (1 + J(\hat\eta_{\tilde\theta_n,\lambda_n}))\,O_P(n^{-1/2})\left\{\left\|\frac{\ell_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - \ell_0}{1 + J(\hat\eta_{\tilde\theta_n,\lambda_n})}\right\|_2^{1-\frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\} + (1 + J(\eta_0))\,O_P(n^{-1/2})\left\{\left\|\frac{\ell_{\tilde\theta_n,\eta_0} - \ell_0}{1 + J(\eta_0)}\right\|_2^{1-\frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\}.$$

By assumption (44), we have

$$II \lesssim -d_{\tilde\theta_n}^2(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0) + \|\tilde\theta_n - \theta_0\|^2.$$

Combining the above, we can deduce that

$$\hat d_n^2 + \lambda_n^2\hat J_n^2 \lesssim (1 + \hat J_n)\,O_P(n^{-1/2})\left\{\left(\frac{\hat d_n + \|\tilde\theta_n - \theta_0\|}{1 + \hat J_n}\right)^{1-\frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\} + (1 + J_0)\,O_P(n^{-1/2})\left\{\left(\frac{\|\tilde\theta_n - \theta_0\|}{1 + J_0}\right)^{1-\frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\} + \lambda_n^2J_0^2 + \|\tilde\theta_n - \theta_0\|^2,$$
(46)

where $\hat d_n = d_{\tilde\theta_n}(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0)$, $J_0 = J(\eta_0)$ and $\hat J_n = J(\hat\eta_{\tilde\theta_n,\lambda_n})$; the above inequality also uses assumption (43). Combining all of the above inequalities, we can deduce that

$$u_n^2 = O_P(1) + O_P(1)\,u_n^{1-\frac{1}{2k}},$$
(47)

$$v_n = v_n^{-1}O_P(\|\tilde\theta_n - \theta_0\|^2) + u_n^{1-\frac{1}{2k}}O_P(\lambda_n) + O_P\big(n^{-1/2}\lambda_n^{-1}\|\tilde\theta_n - \theta_0\|^{1-\frac{1}{2k}}\big),$$
(48)

where $u_n = (\hat d_n + \|\tilde\theta_n - \theta_0\|)/(\lambda_n + \lambda_n\hat J_n)$ and $v_n = \lambda_n\hat J_n + \lambda_n$. Equation (47) implies that $u_n = O_P(1)$. Inserting $u_n = O_P(1)$ into (48), we find that $v_n = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$, which together with $u_n = O_P(1)$ yields the desired orders. This completes the proof.

We now apply lemma 1.1 to derive the related convergence rates in the partly linear model. Conditions (43)–(45) can be verified easily in this example because $\ddot\ell_{\theta,f}$ has finite second moment and $p_{\theta,f}$ is bounded away from zero and infinity uniformly for $(\theta, f)$ ranging over the whole parameter space. Note that $d_\theta(f, f_0) = \|p_{\theta,f} - p_0\|_2 \gtrsim \|q_{\theta,f} - q_{\theta_0,f_0}\|_2$ by Taylor expansion. Then, by the assumption that $P\mathrm{Var}(U|V)$ is positive definite, we know that $\|q_{\tilde\theta_n,\hat f_{\tilde\theta_n,\lambda_n}} - q_{\theta_0,f_0}\|_2 = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$ implies $\|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2 = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$. Thus we only need to show that the $\varepsilon$-bracketing entropy number of the function class $\mathcal{O}$ defined below is of order $\varepsilon^{-1/k}$ to complete the proof of (27)–(28):

$$\mathcal{O} \equiv \left\{\frac{\ell_{\theta,f}(X)}{1 + J(f)}: \|\theta - \theta_0\| \le C_1, \|f - f_0\| \le C_1, J(f) < \infty\right\},$$

for some constant $C_1$. Note that $\ell_{\theta,f}(X)/(1 + J(f))$ can be rewritten as:

$$\Delta A^{-1}\log\Phi(\bar q_{\theta,f}A) + (1 - \Delta)A^{-1}\log(1 - \Phi(\bar q_{\theta,f}A)),$$
(49)

where $A = 1 + J(f)$ and $\bar q_{\theta,f} \in \mathcal{O}_1$, where

$$\mathcal{O}_1 \equiv \left\{\frac{q_{\theta,f}(X)}{1 + J(f)}: \|\theta - \theta_0\| \le C_1, \|f - f_0\| \le C_1, J(f) < \infty\right\},$$

and where we know $H_B(\varepsilon, \mathcal{O}_1, L_2(P)) \lesssim \varepsilon^{-1/k}$ by T1.

We next calculate the $\varepsilon$-bracketing entropy number in the $L_2$ norm for the class of functions $R_1 \equiv \{k_a(t): t \mapsto a^{-1}\log\Phi(at)\ \text{for}\ a \ge 1\ \text{and}\ t \in \mathbb{R}\}$. Some analysis shows that $k_a(t)$ is strictly decreasing in $a$ for $t \in \mathbb{R}$, and $\sup_{t\in\mathbb{R}}|k_a(t) - k_b(t)| \lesssim |a - b|$ because $|\partial/\partial a(k_a(t))|$ is bounded uniformly over $t \in \mathbb{R}$. In addition, we know that $\sup_{a,b\ge A_0,\,t\in\mathbb{R}}|k_a(t) - k_b(t)| \lesssim A_0^{-1}$ because the function $u \mapsto u\log\Phi(u^{-1}t)$ has bounded derivative for $0 < u \le 1$ uniformly over $t \in \mathbb{R}$. These two inequalities imply that the $\varepsilon$-bracketing number in the uniform norm is of order $O(\varepsilon^{-2})$ for $a \in [1, \varepsilon^{-1}]$ and is 1 for $a > \varepsilon^{-1}$. Thus we know $H_B(\varepsilon, R_1, L_2) = O(\log\varepsilon^{-2})$. By applying a similar analysis to $R_2 \equiv \{k_a(t): t \mapsto a^{-1}\log(1 - \Phi(at))\ \text{for}\ a \ge 1\ \text{and}\ t \in \mathbb{R}\}$, we obtain that $H_B(\varepsilon, R_2, L_2) = O(\log\varepsilon^{-2})$. Combining this with T6 and T7, we deduce that $H_B(\varepsilon, \mathcal{O}, L_2) \lesssim \varepsilon^{-1/k}$. This completes the proof of (27)–(28).

For the proof of (29), we apply arguments similar to those used in the proof of lemma 1.1, but with $\lambda_n$, $J_0$ and $\hat J_n$ set to zero in (46). We then obtain the following equality: $\hat d_n^2 = O_P(n^{-2k/(2k+1)}) + \|\tilde\theta_n - \theta_0\|^2 + O_P(n^{-1/2})\|\tilde\theta_n - \theta_0\|^{1-1/2k} + O_P(n^{-1/2})(\|\tilde\theta_n - \theta_0\| + \hat d_n)^{1-1/2k}$. By treating the cases $\|\tilde\theta_n - \theta_0\| \le n^{-k/(2k+1)}$ and $\|\tilde\theta_n - \theta_0\| > n^{-k/(2k+1)}$ separately in this equality, we obtain (29).

Proof of lemma 2

Based on the discussion of (13) and (14), we need to verify the smoothness conditions and the asymptotic equicontinuity conditions, i.e. (15)–(17), for the function $\ell(t, \theta, \eta)$ and its related derivatives. The smoothness conditions are verified in lemma 5 of [3]. For the verification of (15)–(17), we first show condition (17). Without loss of generality, we assume that $\lambda_n$ is bounded below by a multiple of $n^{-k/(2k+1)}$ and bounded above by $n^{-1/4}$, in view of (3). Thus

$$P\left(\frac{\dot\ell(\theta_0, \theta_0, \hat f_{\tilde\theta_n,\lambda_n}) - \dot\ell_0}{n^{\frac{1}{4k+2}}(\lambda_n + \|\tilde\theta_n - \theta_0\|)}\right)^2 \lesssim \frac{\|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2^2}{n^{\frac{1}{2k+1}}(\lambda_n + \|\tilde\theta_n - \theta_0\|)^2} = O_P(n^{-\frac{1}{2k+1}}),$$

where the equality in the above expression follows from (27).

By (28), we know that $J(\hat f_{\tilde\theta_n,\lambda_n}) = O_P(1 + \|\tilde\theta_n - \theta_0\|/\lambda_n)$ and $\|\hat f_{\tilde\theta_n,\lambda_n}\|_\infty$ is bounded by some constant, since $f \in H_k^M$. We then define the set $\mathcal{Q}_n$ as follows:

$$\mathcal{Q}_n \equiv \left\{\frac{\dot\ell(\theta_0, \theta_0, f) - \dot\ell_0}{n^{\frac{1}{4k+2}}(\lambda_n + \|\theta - \theta_0\|)}: J(f) \le C_n\Big(1 + \frac{\|\theta - \theta_0\|}{\lambda_n}\Big), \|f\|_\infty \le M, \|\theta - \theta_0\| \le \delta\right\} \cap \left\{g \in L_2(P): Pg^2 \le C_nn^{-\frac{1}{2k+1}}\right\},$$

for some $\delta > 0$. Obviously the function $n^{-1/(4k+2)}(\dot\ell(\theta_0, \theta_0, \hat f_{\tilde\theta_n,\lambda_n}) - \dot\ell_0)/(\lambda_n + \|\tilde\theta_n - \theta_0\|)$ belongs to $\mathcal{Q}_n$ on a set of probability arbitrarily close to one as $C_n \to \infty$. If we can show $\lim_{n\to\infty}E^*\|\mathbb{G}_n\|_{\mathcal{Q}_n} < \infty$ by T2, then assumption (17) is verified. Note that $\dot\ell(\theta_0, \theta_0, f)$ depends on $f$ in a Lipschitz manner. Consequently, in view of T3, we can bound $H_B(\varepsilon, \mathcal{Q}_n, L_2(P))$ by a constant multiple of $H(\varepsilon, R_n, L_2(P))$, where $R_n$ is defined as

$$R_n \equiv \left\{H_n(f): J(H_n(f)) \lesssim \lambda_n^{-1}n^{-1/(4k+2)}, \|H_n(f)\|_\infty \lesssim \lambda_n^{-1}n^{-1/(4k+2)}\right\},$$

where $H_n(f) = f/(n^{1/(4k+2)}(\lambda_n + \|\theta - \theta_0\|))$. By [22], we know that

$$H(\varepsilon, R_n, L_2(P)) \lesssim \left(\frac{\lambda_n^{-1}n^{-1/(4k+2)}}{\varepsilon}\right)^{1/k}.$$

Note that $\delta_n = n^{-1/(4k+2)}$ and $M_n = n^{(2k-1)/(4k+2)}$ in T2. Thus by calculation we know that $K(\delta_n, \mathcal{Q}_n, L_2(P)) \lesssim \lambda_n^{-1/2k}n^{-1/(4k+2)}$. Then by T2 we can show that $\lim_{n\to\infty}E^*\|\mathbb{G}_n\|_{\mathcal{Q}_n} < \infty$.

For the proof of (15), we only need to show that (15) holds for $\tilde\theta_n = \hat\theta_n + o(n^{-1/3})$, based on the arguments in lemma 2.2. We then show that

$$\mathbb{G}_n(\ddot\ell(\theta_0, \tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0) = o_P(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|) = o_P(1).$$

By the rate assumption (27), we have

$$P\left(\frac{\ddot\ell(\theta_0, \tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0}{1 + n^{1/3}\|\tilde\theta_n - \theta_0\|}\right)^2 \lesssim \frac{\|\tilde\theta_n - \theta_0\|^2 + \|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2^2}{(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|)^2} = O_P(n^{-1/2}).$$

We next define $\bar{\mathcal{Q}}_n$ as follows:

$$\bar{\mathcal{Q}}_n \equiv \left\{\frac{\ddot\ell(\theta_0, \theta, f) - \ddot\ell_0}{1 + n^{1/3}\|\theta - \theta_0\|}: J(f) \le C_n\Big(1 + \frac{\|\theta - \theta_0\|}{\lambda_n}\Big), \|f\|_\infty \le M, \|\theta - \theta_0\| < \delta\right\} \cap \left\{g \in L_2(P): Pg^2 \le C_nn^{-\frac{1}{2}}\right\}.$$

Obviously the function $(\ddot\ell(\theta_0, \tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0)/(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|)$ belongs to $\bar{\mathcal{Q}}_n$ on a set of probability arbitrarily close to one as $C_n \to \infty$. If we can show $E^*\|\mathbb{G}_n\|_{\bar{\mathcal{Q}}_n} \to 0$ as $n \to \infty$ by T2, then the proof of (15) is complete. Accordingly, note that $\ddot\ell(\theta_0, \theta, f)$ depends on $(\theta, f)$ in a Lipschitz manner. Consequently, in view of T3, we can bound $H_B(\varepsilon, \bar{\mathcal{Q}}_n, L_2(P))$ by a constant multiple of $(H(\varepsilon, \bar R_n, L_2(P)) + \log(1/\varepsilon))$, where $\bar R_n$ is defined as

$$\bar R_n \equiv \left\{\bar H_n(f): J(\bar H_n(f)) \lesssim 1 + (n^{1/3}\lambda_n)^{-1}, \|\bar H_n(f)\|_\infty \lesssim 1 + (n^{1/3}\lambda_n)^{-1}\right\},$$

where $\bar H_n(f) = f/(1 + n^{1/3}\|\theta - \theta_0\|)$. By [22], we know that

$$H(\varepsilon, \bar R_n, L_2(P)) \lesssim \left(\frac{1 + (n^{1/3}\lambda_n)^{-1}}{\varepsilon}\right)^{1/k}.$$

Then, by analysis similar to that used in the proof of (17), we can show that $E^*\|\mathbb{G}_n\|_{\bar{\mathcal{Q}}_n} \to 0$ as $n \to \infty$, in view of T2. This completes the proof of (15).

For the proof of (16), it suffices to show that $\mathbb{G}_n(\ell_{t,\theta}(\theta_0, \bar\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ell_{t,\theta}(\theta_0, \theta_0, f_0)) = o_P(1)$ for $\tilde\theta_n = \hat\theta_n + o(n^{-1/3})$ and for $\bar\theta_n$ between $\tilde\theta_n$ and $\theta_0$, in view of lemma 2.2. We can then show that $\mathbb{G}_n(\ell_{t,\theta}(\theta_0, \bar\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ell_{t,\theta}(\theta_0, \theta_0, f_0)) = o_P(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|) = o_P(1)$ by analysis similar to that used in the proof of (15).

In the last part, we show (18). It suffices to verify that the sequence of classes of functions $V_n \equiv \{\ell^{(3)}(\bar\theta_n, \tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n})(x)\}$ is P-Glivenko–Cantelli for every random sequence $\bar\theta_n \to \theta_0$ and $\tilde\theta_n \to \theta_0$ in probability. A Glivenko–Cantelli theorem for classes of functions that change with $n$ is needed. By revising theorem 2.4.3 in [22] with minor notational changes, we obtain the following suitable extension of the uniform entropy Glivenko–Cantelli theorem: let $\mathcal{F}_n$ be suitably measurable classes of uniformly integrable functions with $H(\varepsilon, \mathcal{F}_n, L_1(P_n)) = o_P(n)$ for every $\varepsilon > 0$; then $\|P_n - P\|_{\mathcal{F}_n} \to 0$ in probability. We then apply this revised theorem to the set $\mathcal{F}_n$ of functions $\ell^{(3)}(t, \theta, f)$ with $t$ and $\theta$ ranging over a neighborhood of $\theta_0$ and $\lambda_nJ(f)$ bounded by a constant. By the form of $\ell^{(3)}(t, \theta, f)$, the entropy number for $V_n$ is equal to that of

$$\bar{\mathcal{F}}_n \equiv \left\{\varphi(q_{t,f_t(\theta,f)}(x))\,R(q_{t,f_t(\theta,f)}(x)): (t, \theta) \in V_{\theta_0}, \lambda_nJ(f) \le C, \|f\|_\infty \le M\right\}.$$

By arguments similar to those used in lemma 7.2 of [15], we know that $\sup_QH(\varepsilon, \bar{\mathcal{F}}_n, L_1(Q)) \lesssim ((1 + \lambda_n^{-1})/\varepsilon)^{1/k} = o_P(n)$. Moreover, the $\mathcal{F}_n$ are uniformly bounded since $f \in H_k^M$. Considering the fact that the probability that $V_n$ is contained in $\mathcal{F}_n$ tends to 1, we have completed the proof of (18).

Proof of lemma 3

By the assumption that $\Delta_{\lambda_n}(\tilde\theta_n) = o_P(1)$, we have $\Delta_{\lambda_n}(\tilde\theta_n) - \Delta_{\lambda_n}(\theta_0) \ge o_P(1)$. Thus the following inequality holds:

$$n^{-1}\sum_{i=1}^n\log\left[\frac{lik(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}, X_i)}{lik(\theta_0, \hat f_{\theta_0,\lambda_n}, X_i)}\right] - \lambda_n^2\big[J^2(\hat f_{\tilde\theta_n,\lambda_n}) - J^2(\hat f_{\theta_0,\lambda_n})\big] \ge o_P(1).$$

In view of assumption (19), the above inequality simplifies to

$$n^{-1}\sum_{i=1}^n\log\left[\frac{H(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}; X_i)}{H(\theta_0, \hat f_{\theta_0,\lambda_n}; X_i)}\right] \ge o_P(1),$$

where $H(\theta, f; X) = \Delta\Phi(C - \theta U - f(V)) + (1 - \Delta)(1 - \Phi(C - \theta U - f(V)))$. By arguments similar to those used in lemma 2 and by T4, we know that $H(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}; X_i)$ belongs to some P-Donsker class. Combining this conclusion with the inequality $\alpha\log x \le \log(1 + \alpha\{x - 1\})$ for some $\alpha \in (0, 1)$ and any $x > 0$, we can show that

$$P\log\left[1 + \alpha\left(\frac{H(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}; X_i)}{H(\theta_0, \hat f_{\theta_0,\lambda_n}; X_i)} - 1\right)\right] \ge o_P(1).$$
(50)

The remainder of the proof follows the proof of lemma 6 in [3].

Proof of lemma 4

The boundedness condition (45) in lemma 1.1 cannot be satisfied in the semiparametric logistic regression model. Hence we propose lemma 4.1 below to relax this condition, choosing the criterion function $m_{\theta,\eta} = \log[(p_{\theta,\eta} + p_{\theta,\eta_0})/2p_{\theta,\eta_0}]$. Obviously, $m_{\theta,\eta}$ is trivially bounded below. It is also bounded above for $(\theta, \eta)$ around the true values if $p_{\theta,\eta_0}(x)$ is bounded away from zero uniformly in $x$ and $p_{\theta,\eta}$ is bounded above. The first condition is satisfied if the map $\theta \mapsto p_{\theta,\eta_0}(x)$ is continuous around $\theta_0$ and $p_0(x)$ is uniformly bounded away from zero. The second condition is trivially satisfied in the semiparametric logistic regression model by the given form of the density. The boundedness of $m_{\theta,\eta}$ thus permits the application of lemma 4.2 below, which is used to verify condition (52) in lemma 4.1. Note that lemma 4.1 and lemma 4.2 are theorem 3.2 and lemma 3.3 in [15], respectively.

Lemma 4.1

Assume for any given θ [set membership] Θn, [eta w/ hat]θ satisfies Pnmθ,[eta w/ hat]θPnmθ,η0 for given measurable functions x [mapsto] mθ,η(x). Assume conditions (51) and (52) below hold for every θ [set membership] Θn, every η [set membership] An external file that holds a picture, illustration, etc.
Object name is nihms195772ig9.jpg and every ε > 0:

$$P(m_{\theta,\eta} - m_{\theta,\eta_0}) \lesssim -d_\theta^2(\eta,\eta_0) + \|\theta - \theta_0\|^2,$$
(51)

$$E^*\sup_{\theta\in\Theta_n,\,\eta\in\mathcal{V}_n,\,\|\theta-\theta_0\|<\varepsilon,\,d_\theta(\eta,\eta_0)<\varepsilon}\left|\mathbb{G}_n(m_{\theta,\eta} - m_{\theta,\eta_0})\right| \lesssim \varphi_n(\varepsilon).$$
(52)

Suppose that (52) is valid for functions $\varphi_n$ such that $\delta \mapsto \varphi_n(\delta)/\delta^\alpha$ is decreasing for some $\alpha < 2$, and for sets $\Theta_n \times \mathcal{V}_n$ such that $P(\tilde\theta \in \Theta_n, \hat\eta_{\tilde\theta} \in \mathcal{V}_n) \to 1$. Then $d_{\tilde\theta}(\hat\eta_{\tilde\theta}, \eta_0) \le O_P(\delta_n + \|\tilde\theta - \theta_0\|)$ for any sequence of positive numbers $\delta_n$ such that $\varphi_n(\delta_n) \le \sqrt{n}\,\delta_n^2$ for every $n$.

Lemma 4.2 below is presented to verify the modulus of continuity condition for the empirical process in (52). Let $\mathcal{S}_\delta = \{x \mapsto m_{\theta,\eta}(x) - m_{\theta,\eta_0}(x): d_\theta(\eta,\eta_0) < \delta,\ \|\theta - \theta_0\| < \delta\}$ and write

$$K(\delta, \mathcal{S}_\delta, L_2(P)) = \int_0^\delta \sqrt{1 + H_B(\varepsilon, \mathcal{S}_\delta, L_2(P))}\, d\varepsilon.$$
(53)

Lemma 4.2

Suppose the functions $(x, \theta, \eta) \mapsto m_{\theta,\eta}(x)$ are uniformly bounded for $(\theta, \eta)$ ranging over a neighborhood of $(\theta_0, \eta_0)$ and that

$$P(m_{\theta,\eta} - m_{\theta_0,\eta_0})^2 \lesssim d_\theta^2(\eta,\eta_0) + \|\theta - \theta_0\|^2.$$

Then condition (52) is satisfied for any functions $\varphi_n$ such that

$$\varphi_n(\delta) \ge K(\delta, \mathcal{S}_\delta, L_2(P))\left(1 + \frac{K(\delta, \mathcal{S}_\delta, L_2(P))}{\delta^2\sqrt{n}}\right).$$

Consequently, in the conclusion of lemma 4.1, we may use $K(\delta, \mathcal{S}_\delta, L_2(P))$ rather than $\varphi_n(\delta)$.

We then apply lemma 4.1 to the penalized semiparametric logistic regression model by including $\lambda$ in $\theta$, i.e. by taking $m_{\theta,\lambda,\eta} = m_{\theta,\eta} - \frac{1}{2}\lambda^2(J^2(\eta) - J^2(\eta_0))$, in the proof of lemma 4. First, lemma 7.1 in [15] establishes that

$$\left\|p_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - p_{\theta_0,\eta_0}\right\|_2 + \lambda_n J(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$$
(54)

after choosing

$$m_{\theta,\lambda,\eta} = \log\frac{p_{\theta,\eta} + p_{\theta,\eta_0}}{2p_{\theta,\eta_0}} - \frac{1}{2}\lambda^2\left(J^2(\eta) - J^2(\eta_0)\right)$$

in lemma 4.1. Note that the map $\theta \mapsto p_{\theta,\eta_0}/f_{W,Z}(w,z)$ is uniformly bounded away from zero at $\theta = \theta_0$ and continuous in a neighborhood of $\theta_0$. Hence $m_{\theta,\lambda,\eta}$ is well defined. Moreover, $\mathbb{P}_n m_{\theta,\lambda,\hat\eta_{\theta,\lambda}} \ge \mathbb{P}_n m_{\theta,\lambda,\eta_0}$ by the inequality $((p_{\theta,\eta} + p_{\theta,\eta_0})/2p_{\theta,\eta_0})^2 \ge p_{\theta,\eta}/p_{\theta,\eta_0}$. Now (54) directly implies (31). For the proof of (30), we need the conclusion of lemma 7.4 (i) in [15], which states that
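The inequality $((p_{\theta,\eta} + p_{\theta,\eta_0})/2p_{\theta,\eta_0})^2 \ge p_{\theta,\eta}/p_{\theta,\eta_0}$ used here is elementary; clearing denominators reduces it to a perfect square:

```latex
\left(\frac{p_{\theta,\eta} + p_{\theta,\eta_0}}{2\,p_{\theta,\eta_0}}\right)^2
  \ge \frac{p_{\theta,\eta}}{p_{\theta,\eta_0}}
\iff \left(p_{\theta,\eta} + p_{\theta,\eta_0}\right)^2
  \ge 4\, p_{\theta,\eta}\, p_{\theta,\eta_0}
\iff \left(p_{\theta,\eta} - p_{\theta,\eta_0}\right)^2 \ge 0.
```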

$$\|p_{\theta,\eta} - p_{\theta_0,\eta_0}\|_2 \gtrsim \left(\|\theta - \theta_0\| \wedge 1 + \|\eta - \eta_0\|_2 \wedge 1\right) \wedge 1.$$
(55)

Thus we have proved (30). For (32), we simply replace $m_{\theta,\lambda,\eta}$ with $m_{\theta,0,\eta}$ in the proof of lemma 7.1 in [15]. Thus we can show that $d_\theta(\eta, \eta_0) = \|p_{\theta,\eta} - p_{\theta_0,\eta_0}\|_2$. By combining lemma 4.2 and (55), we know that $\|\hat\eta_{\tilde\theta_n} - \eta_0\|_2 = O_P(\delta_n + \|\tilde\theta_n - \theta_0\|)$ for $\delta_n$ satisfying $K(\delta_n, \mathcal{S}_{\delta_n}, L_2(P)) \le \sqrt{n}\,\delta_n^2$, where $K(\delta, \mathcal{S}_\delta, L_2(P))$ is as defined in (53). By analysis similar to that used in the proof of lemma 7.1 in [15], together with the strengthened assumption on $\eta$, we then find that $K(\delta_n, \mathcal{S}_{\delta_n}, L_2(P)) \lesssim \delta_n^{1-1/(2k)}$, which leads to the desired convergence rate given in (32).
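The rate implicit in (32) can be read off from lemma 4.1: with the entropy bound $\varphi_n(\delta) = \delta^{1-1/(2k)}$, the smallest $\delta_n$ satisfying $\varphi_n(\delta_n) \le \sqrt{n}\,\delta_n^2$ solves

```latex
\delta_n^{1 - 1/(2k)} \le \sqrt{n}\,\delta_n^{2}
\iff \delta_n^{-(1 + 1/(2k))} \le \sqrt{n}
\iff \delta_n \ge n^{-\frac{1}{2}\cdot\frac{2k}{2k+1}} = n^{-k/(2k+1)},
```

i.e. $\delta_n = n^{-k/(2k+1)}$, the usual nonparametric rate for a $k$-times differentiable nuisance parameter.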

Proof of lemma 5

The proof of lemma 5 follows that of lemma 2. The smoothness conditions on $\ell(t, \theta, \eta)$ and its related derivatives can be shown similarly, since $F(\cdot)$, $\dot F(\cdot)$ and $\ddot F(\cdot)$ are all uniformly bounded on $(-\infty, +\infty)$, and $h_0(\cdot)$ is bounded on $[0,1]$. Note that we can show (12) directly by the following analysis. $P\dot\ell(\theta_0, \theta_0, \eta)$ can be written as $P(F(\theta_0 w + \eta_0) - F(\theta_0 w + \eta(z)))(w - h_0(z))$ since $P\dot\ell_0 = 0$. Note also that $P(w - h_0(z))\dot F(\theta_0 w + \eta_0(z))(\eta - \eta_0)(z) = 0$. This implies that $P\dot\ell(\theta_0, \theta_0, \eta) = P(F(\theta_0 w + \eta_0) - F(\theta_0 w + \eta(z)) + \dot F(\theta_0 w + \eta_0(z))(\eta - \eta_0)(z))(w - h_0(z))$. However, by the usual Taylor expansion, $|F(\theta_0 w + \eta) - F(\theta_0 w + \eta_0) - \dot F(\theta_0 w + \eta_0)(\eta - \eta_0)| \le \|\ddot F\|_\infty |\eta - \eta_0|^2$. This proves (12).
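The Taylor bound $|F(a + h) - F(a) - \dot F(a)h| \le \|\ddot F\|_\infty h^2$ can be spot-checked numerically. The sketch below uses the logistic cdf as an illustrative choice of $F$ (an assumption made only for this check; the argument requires nothing beyond uniform boundedness of $\ddot F$):

```python
import numpy as np

def F(x):
    """Logistic cdf, an illustrative uniformly smooth link function."""
    return 1.0 / (1.0 + np.exp(-x))

def Fdot(x):
    """First derivative of the logistic cdf: F'(x) = F(x)(1 - F(x))."""
    return F(x) * (1.0 - F(x))

# Numerical sup-norm of the second derivative F''(x) = F'(x)(1 - 2F(x)).
grid = np.linspace(-20.0, 20.0, 200001)
Fddot_sup = np.max(np.abs(Fdot(grid) * (1.0 - 2.0 * F(grid))))

# Check |F(a+h) - F(a) - F'(a)h| <= ||F''||_inf * h^2 over a grid of (a, h).
a = np.linspace(-10.0, 10.0, 401)[:, None]
h = np.linspace(-3.0, 3.0, 401)[None, :]
remainder = np.abs(F(a + h) - F(a) - Fdot(a) * h)
worst_gap = np.max(remainder - Fddot_sup * h**2)
print(worst_gap <= 0)  # the bound holds everywhere on the grid
```

The mean-value form of the Taylor remainder actually gives the sharper constant $\|\ddot F\|_\infty/2$, so the displayed bound holds with room to spare.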

We next verify the asymptotic equicontinuity conditions, i.e. (15)–(17). For (17), we first apply analysis similar to that used in the proof of lemma 2 to obtain

$$P\left(\frac{\dot\ell(\theta_0, \theta_0, \hat\eta_{\tilde\theta_n,\lambda_n}) - \dot\ell_0}{n^{1/(4k+2)}(\lambda_n + \|\tilde\theta_n - \theta_0\|)}\right)^2 = O_P(n^{-1/(2k+1)}).$$

By lemma 7.1 in [15], we know that $J(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(1 + \|\tilde\theta_n - \theta_0\|/\lambda_n)$ and that $\|\hat\eta_{\tilde\theta_n,\lambda_n}\|$ is bounded in probability by a multiple of $J(\hat\eta_{\tilde\theta_n,\lambda_n}) + 1$. Now we construct the set $\mathcal{Q}_n$ as follows:

$$\mathcal{Q}_n = \left\{\frac{\dot\ell(\theta_0, \theta_0, \eta) - \dot\ell_0}{n^{1/(4k+2)}(\lambda_n + \|\theta - \theta_0\|)}: J(\eta) \le C_n\left(1 + \frac{\|\theta - \theta_0\|}{\lambda_n}\right),\ \|\eta\| \le C_n(1 + J(\eta)),\ \|\theta - \theta_0\| < \delta\right\} \cap \left\{g \in L_2(P): Pg^2 \le C_n n^{-1/(2k+1)}\right\}.$$

Clearly, the probability that the function $(\dot\ell(\theta_0, \theta_0, \hat\eta_{\tilde\theta_n,\lambda_n}) - \dot\ell_0)/(n^{1/(4k+2)}(\lambda_n + \|\tilde\theta_n - \theta_0\|))$ belongs to $\mathcal{Q}_n$ approaches 1 as $C_n \to \infty$. We next show that $\lim_{n\to\infty} E^*\|\mathbb{G}_n\|_{\mathcal{Q}_n} < \infty$ by T2. Note that $\dot\ell(\theta_0, \theta_0, \eta)$ depends on $\eta$ in a Lipschitz manner. Consequently, we can bound $H_B(\varepsilon, \mathcal{Q}_n, L_2(P))$ by the product of some constant and $H(\varepsilon, R_n, L_2(P))$ in view of T3, where $R_n$ is as defined in the proof of lemma 2. By calculations similar to those performed in lemma 2, we can obtain $K(\delta_n, \mathcal{Q}_n, L_2(P)) \lesssim \lambda_n^{-1/(2k)} n^{-1/(4k+2)}$. Thus $\lim_{n\to\infty} E^*\|\mathbb{G}_n\|_{\mathcal{Q}_n} < \infty$, and (17) follows.

The proofs of (15) and (16) follow arguments quite similar to those used in the proof of lemma 2. In other words, we can show that $\mathbb{G}_n(\ddot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0) = o_P(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|) = o_P(1)$ and $\mathbb{G}_n(\ell_{t,\theta}(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ell_{t,\theta}(\theta_0, \theta_0, \eta_0)) = o_P(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|)$.

Next we define $\bar{\mathcal{V}}_n \equiv \{\ell^{(3)}(\bar\theta_n, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})(x)\}$. Arguments similar to those used in the proof of lemma 2 can be directly applied to the verification of (18) in this second model. By the form of $\ell^{(3)}(t, \theta, \eta)$, the entropy number for $\bar{\mathcal{V}}_n$ is bounded above by that of $\bar{\mathcal{F}}_n \equiv \{F(tw + \eta(z) + (\theta - t)h_0(z)): (t, \theta) \in V_{\theta_0},\ \lambda_n J(\eta) \le C_n,\ \|\eta\| \le C_n(1 + J(\eta))\}$. Similarly, we know that $\sup_Q H(\varepsilon, \bar{\mathcal{V}}_n, L_1(Q)) \lesssim \sup_Q H(\varepsilon, \bar{\mathcal{F}}_n, L_1(Q)) \lesssim ((1 + \lambda_n^{-1})/\varepsilon)^{1/k} = o_P(n)$. Moreover, the classes $\bar{\mathcal{F}}_n$ are uniformly bounded. This completes the proof of (18) and concludes the proof of lemma 5.

Proof of lemma 6

The proof of lemma 6 is analogous to that of lemma 3.

Lemma 7

Under the assumptions of theorem 1, we have

$$\log pl_{\lambda_n}(\tilde\theta_n) = \log pl_{\lambda_n}(\theta_0) + n(\tilde\theta_n - \theta_0)^T\mathbb{P}_n\tilde\ell_0 - \frac{n}{2}(\tilde\theta_n - \theta_0)^T\tilde I_0(\tilde\theta_n - \theta_0) + O_P\left(g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|)\right),$$
(56)

for any sequence $\tilde\theta_n$ satisfying $\tilde\theta_n = \theta_0 + o_P(1)$.

Proof

The quantity $n^{-1}(\log pl_{\lambda_n}(\tilde\theta_n) - \log pl_{\lambda_n}(\theta_0))$ is bounded above and below by

$$\mathbb{P}_n\left(\ell(\tilde\theta_n, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})\right) - \lambda_n^2\left(J^2(\hat\eta_{\tilde\theta_n,\lambda_n}) - J^2(\eta_{\theta_0}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}))\right)$$

and

$$\mathbb{P}_n\left(\ell(\tilde\theta_n, \theta_0, \hat\eta_{\theta_0,\lambda_n}) - \ell(\theta_0, \theta_0, \hat\eta_{\theta_0,\lambda_n})\right) - \lambda_n^2\left(J^2(\eta_{\tilde\theta_n}(\theta_0, \hat\eta_{\theta_0,\lambda_n})) - J^2(\hat\eta_{\theta_0,\lambda_n})\right),$$

respectively. By the third order Taylor expansion of $t \mapsto \mathbb{P}_n\ell(t, \theta, \eta)$ around $\theta_0$, with $\theta = \tilde\theta_n$ and $\eta = \hat\eta_{\tilde\theta_n,\lambda_n}$, together with (18) and the above empirical no-bias conditions (13) and (14), we find that the difference between $\mathbb{P}_n(\ell(\tilde\theta_n, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}))$ and $(\tilde\theta_n - \theta_0)^T\mathbb{P}_n\tilde\ell_0 - (\tilde\theta_n - \theta_0)^T(\tilde I_0/2)(\tilde\theta_n - \theta_0)$ is of order $O_P(n^{-1}g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|))$. Similarly, we have

$$\lambda_n^2\left(J^2(\hat\eta_{\tilde\theta_n,\lambda_n}) - J^2(\eta_{\theta_0}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}))\right) = -2\lambda_n^2(\tilde\theta_n - \theta_0)^T\int_{\mathcal{Z}}\hat\eta^{(k)}_{\tilde\theta_n,\lambda_n}h_0^{(k)}\,dz - \lambda_n^2(\tilde\theta_n - \theta_0)^T\int_{\mathcal{Z}}h_0^{(k)}h_0^{(k)T}\,dz\,(\tilde\theta_n - \theta_0) = O_P\left(n^{-1}g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|)\right)$$

by Taylor expansion. The last equality holds because of assumptions (3) and (19). Similar analysis applies to the lower bound. This proves (56).
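The display above comes from expanding the squared penalty along the least favorable direction. Assuming, consistently with the form $F(tw + \eta(z) + (\theta - t)h_0(z))$ of the least favorable submodel, that $\eta_{\theta_0}(\tilde\theta_n, \eta) = \eta + (\tilde\theta_n - \theta_0)^T h_0$, and writing $\hat\eta \equiv \hat\eta_{\tilde\theta_n,\lambda_n}$ with $J^2(\eta) = \int_{\mathcal{Z}}(\eta^{(k)})^2\,dz$, a sketch of the expansion is:

```latex
J^2(\hat\eta) - J^2\bigl(\hat\eta + (\tilde\theta_n - \theta_0)^T h_0\bigr)
  = -2(\tilde\theta_n - \theta_0)^T \int_{\mathcal{Z}} \hat\eta^{(k)} h_0^{(k)}\,dz
    - (\tilde\theta_n - \theta_0)^T \int_{\mathcal{Z}} h_0^{(k)} h_0^{(k)T}\,dz\,
      (\tilde\theta_n - \theta_0),
```

an exact quadratic identity (no remainder), since $J^2$ is quadratic in its argument.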

References

1. Bentkus V, Gotze F, van Zwet WR. An Edgeworth Expansion for Symmetric Statistics. Annals of Statistics. 1997;25:851–896.
2. Cheng G, Kosorok MR. Higher order semiparametric frequentist inference with the profile sampler. Annals of Statistics. 2007 In Press.
3. Cheng G, Kosorok MR. General Frequentist Properties of the Posterior Profile Distribution. Annals of Statistics. 2007 In Press.
4. Dalalyan A, Golubev G, Tsybakov A. Penalized Maximum Likelihood and Semiparametric Second-Order Efficiency. Annals of Statistics. 2006;34:169–201.
5. Good IJ, Gaskins RA. Non-parametric roughness penalties for probability densities. Biometrika. 1971;58:255–277.
6. Huang J. Efficient estimation of the partly linear Cox model. Annals of Statistics. 1999;27:1536–1563.
7. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer; New York: 2008.
8. Kuo HH. Lecture Notes in Mathematics. Berlin: Springer; 1975. Gaussian Measure on Banach Spaces.
9. Lee BL, Kosorok MR, Fine JP. The profile sampler. Journal of the American Statistical Association. 2005;100:960–969.
10. Ma S, Kosorok MR. Penalized Log-likelihood Estimation for Partly Linear Transformation Models with Current Status Data. Annals of Statistics. 2005;33:2256–2290.
11. Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis. 2005;96:190–217.
12. Ma S, Kosorok MR. Adaptive penalized M-estimation with current status data. Annals of the Institute of Statistical Mathematics. 2006;58:511–526.
13. Mammen E, van de Geer S. Penalized quasi-likelihood estimation in partial linear models. Annals of Statistics. 1997;25:1014–1035.
14. Murphy SA. Asymptotic Theory for the Frailty Model. Annals of Statistics. 1995;23:182–198.
15. Murphy SA, Van der Vaart AW. Observed information in semiparametric models. Bernoulli. 1999;5:381–412.
16. Murphy SA, Van der Vaart AW. Semiparametric mixtures in case-control studies. Journal of Multivariate Analysis. 2001;79:1–32.
17. Shen X. Asymptotic normality in semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association. 2002;97:222–235.
18. Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. Annals of Statistics. 1982;10:795–810.
19. Silverman BW. Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with discussion) Journal of the Royal Statistical Society Series B. 1985;47:1–52.
20. van de Geer S. Empirical Processes in M-estimation. Cambridge University Press; Cambridge: 2000.
21. van der Vaart AW. Maximum Likelihood Estimation with Partially Censored Observations. Annals of Statistics. 1994;22:1896–1916.
22. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.
23. Wahba G. Spline Models for Observational Data. SIAM; Philadelphia: 1990.
24. Xiang D, Wahba G. Approximate smoothing spline methods for large data sets in the binary case. ASA Proc of the Biometrics Section. :94–99.