J Multivar Anal. Author manuscript; available in PMC 2010 April 28.
Published in final edited form as:
J Multivar Anal. 2009 March 1; 100(3): 345–362.
PMCID: PMC2860882
NIHMSID: NIHMS195772

# The Penalized Profile Sampler

## Abstract

The penalized profile sampler for semiparametric inference is an extension of the profile sampler method [9] obtained by profiling a penalized log-likelihood. The idea is to base inference on the posterior distribution obtained by multiplying a profiled penalized log-likelihood by a prior for the parametric component, where the profiling and penalization are applied to the nuisance parameter. Because the prior is not applied to the full likelihood, the method is not strictly Bayesian. A benefit of this approximately Bayesian method is that it circumvents the need to put a prior on the possibly infinite-dimensional nuisance components of the model. We investigate the first and second order frequentist performance of the penalized profile sampler, and demonstrate that the accuracy of the procedure can be adjusted by the size of the assigned smoothing parameter. The theoretical validity of the procedure is illustrated for two examples: a partly linear model with normal error for current status data and a semiparametric logistic regression model. Simulation studies are used to verify the theoretical results.

Keywords: Penalized Likelihood, Posterior Distribution, Profile Likelihood, Semiparametric Inference, Smoothing Parameter

## 1 Introduction

Semiparametric models are statistical models indexed by both a finite dimensional parameter of interest θ and an infinite dimensional nuisance parameter η. In order to make statistical inference about θ separately from η, we estimate the nuisance parameter with $\hat{\eta}_\theta$, its maximum likelihood estimate at each fixed θ, i.e.

$\hat{\eta}_\theta=\operatorname{argmax}_{\eta\in\mathcal{H}}\,lik_n(\theta,\eta),$

where $lik_n(\theta,\eta)$ is the likelihood of the semiparametric model given n observations and $\mathcal{H}$ is the parameter space for η. Therefore we can do frequentist inference about θ based on the profile likelihood, which is typically defined as

$pl_n(\theta)=\sup_{\eta\in\mathcal{H}}lik_n(\theta,\eta).$

The convergence rate of the nuisance parameter η is the order of $d(\hat{\eta}_{\tilde{\theta}_n},\eta_0)$, where d(·, ·) is some metric on η, $\tilde{\theta}_n$ is any sequence satisfying $\tilde{\theta}_n=\theta_0+o_P(1)$, and (η0, θ0) is the true value of (η, θ). Typically,

$d(\hat{\eta}_{\tilde{\theta}_n},\eta_0)=O_P(\|\tilde{\theta}_n-\theta_0\|+n^{-r}),$
(1)

where ||·|| is the Euclidean norm and r > 1/4. Of course, a smaller value of r leads to a slower convergence rate of the nuisance parameter. For instance, the nuisance parameter in the Cox proportional hazards model with right censored data, the cumulative hazard function, has the parametric rate, i.e., r = 1/2. If current status data is applied to the Cox model instead, then the convergence rate will be slower, with r = 1/3, due to the loss of information provided by this kind of data.

The profile sampler is the procedure of sampling from the posterior of the profile likelihood in order to estimate and draw inference on the parametric component θ in a semiparametric model, where the profiling is done over the possibly infinite-dimensional nuisance parameter η. [9] show that the profile sampler gives a first order correct approximation to the maximum likelihood estimator $\hat{\theta}_n$ and consistent estimation of the efficient Fisher information for θ even when the nuisance parameter is not estimable at the $\sqrt{n}$ rate. Another Bayesian procedure for semiparametric estimation is considered in [17], who study the marginal semiparametric posterior distribution for a parameter of interest. In particular, [17] show that marginal semiparametric posterior distributions are asymptotically normal and centered at the corresponding maximum likelihood estimates or posterior means, with covariance matrix equal to the inverse of the Fisher information. Unfortunately, this fully Bayesian method requires specification of a prior on η, which is quite challenging since for some models there is no direct extension of the concept of a Lebesgue dominating measure for the infinite-dimensional parameter set involved [8]. The advantages of the profile sampler for estimating θ compared to other methods are discussed extensively in [2], [3] and [9].

The motivation for studying second order asymptotic properties of the profile sampler comes from the observed simulation differences in the Cox model with different types of data, i.e. right censored data [2] and current status data [9]. The profile sampler based on the first model yields much more accurate estimation results than the one based on the second model when the sample size is relatively small. [2] and [3] have explained this phenomenon theoretically by establishing the relation between the estimation accuracy of the profile sampler, measured in terms of second order asymptotics, and the convergence rate of the nuisance parameters. Specifically, the profile sampler generated from a semiparametric model with a faster convergence rate usually yields more precise frequentist inference for θ. These second order results are verified in [2] and [3] for several examples, including the proportional odds model, case-control studies with missing covariates, and the partly linear model. The convergence rates for these models range from the parametric to the cubic. The work in [3] has shown clearly that the accuracy of the inference for θ based on the profile sampler method is intrinsically determined by the semiparametric model specifications through its entropy number.

In many semiparametric models involving a smooth nuisance parameter, it is often convenient and beneficial to perform estimation using penalization. One motivation for this is that, in the absence of any restrictions on the form of the function η, maximum likelihood estimation for some semiparametric models leads to over-fitting. Seminal applications of penalized maximum likelihood estimation include estimation of a probability density function in [18] and nonparametric linear regression in [19]. Note that penalized likelihood is a special case of penalized quasi-likelihood studied in [13]. Under certain reasonable regularity conditions, penalized semiparametric log-likelihood estimation can yield fully efficient estimates for θ (see, for example, [13]). As far as we are aware, the only general procedure for inference for θ in this context known to be theoretically valid is a weighted bootstrap with bounded random weights (see [11]). It is even unclear whether the usual nonparametric bootstrap will work in this context when the nuisance parameter has a convergence rate r < 1/2.

The purpose of this paper is to ask a natural question: does sampling from the exponential of a profiled penalized log-likelihood (a procedure we refer to hereafter as the penalized profile sampler) yield first and even second order accurate frequentist inference? The conclusion of this paper is that the answer is yes and, moreover, the accuracy of the inference depends in a fairly simple way on the size of the smoothing parameter.

The unknown parameters in the semiparametric models we study in this paper include θ, which we assume belongs to some compact set $\Theta\subset\mathbb{R}^d$, and η, which we assume to be a function in the Sobolev class of functions $\mathcal{H}_k$ or its subset $\mathcal{H}_k^M\equiv\mathcal{H}_k\cap\{\eta:\|\eta\|_\infty\le M\}$ for some known M < ∞, supported on some compact set on the real line. The Sobolev class of functions $\mathcal{H}_k$ is defined as the set $\{\eta: J^2(\eta)\equiv\int(\eta^{(k)}(z))^2\,dz<\infty\}$, where $\eta^{(j)}$ is the j-th derivative of η with respect to z. Obviously $J^2(\eta)$ is a measure of the complexity of η. We denote $\mathcal{H}_k$ as the Sobolev function class with degree k. The penalized log-likelihood in this context is:

$\log lik_{\lambda_n}(\theta,\eta)=\log lik(\theta,\eta)-n\lambda_n^2J^2(\eta),$
(2)

where $\log lik(\theta,\eta)\equiv n\mathbb{P}_n\ell_{\theta,\eta}(X)$, $\ell_{\theta,\eta}(X)$ is the log-likelihood of the single observation X, and $\lambda_n$ is a smoothing parameter, possibly dependent on data. In practice, $\lambda_n$ can be obtained by cross-validation [23] or by inspecting the various curves for different values of $\lambda_n$. The penalized maximum likelihood estimators $\hat{\theta}_n$ and $\hat{\eta}_n$ depend on the choice of the smoothing parameter $\lambda_n$. Consequently we use the notation $\hat{\theta}_{\lambda_n}$ and $\hat{\eta}_{\lambda_n}$ for the remainder of this paper to denote the estimators obtained from maximizing (2). In particular, a larger smoothing parameter usually leads to a less rough penalized estimator of η0. It is of interest to establish the asymptotic properties of the proposed penalized profile sampler procedure with a data-driven $\lambda_n$. Further study of this issue is needed, but it is beyond the scope of this paper.

For the purpose of establishing first order accuracy of inference for θ based on the penalized profile sampler, we assume that the bounds for the smoothing parameter are in the form below:

$\lambda_n=o_P(n^{-1/4})\quad\text{and}\quad\lambda_n^{-1}=O_P(n^{k/(2k+1)}).$
(3)

The condition (3) is assumed to hold throughout this paper. One way to ensure (3) in practice is simply to set $\lambda_n=n^{-k/(2k+1)}$. Alternatively, we can choose $\lambda_n=n^{-1/3}$, which is independent of k. It turns out that the upper bound guarantees that $\hat{\theta}_{\lambda_n}$ is $\sqrt{n}$-consistent, while the lower bound controls the convergence rate of the penalized nuisance parameter estimator. Another approach to controlling estimators is to use sieve estimates with assumptions on the derivatives (see [6]). We will not pursue this further here.
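For deterministic polynomial choices $\lambda_n=n^{-a}$, condition (3) reduces to the two inequalities $1/4<a\le k/(2k+1)$. The following minimal sketch checks them with exact rational arithmetic. The function name `satisfies_condition_3` is hypothetical, and the check covers only this polynomial family, not data-driven smoothing parameters.

```python
from fractions import Fraction


def satisfies_condition_3(a, k):
    """Check (3) for lambda_n = n**(-a) and Sobolev degree k.

    Upper bound: lambda_n = o(n^{-1/4})        <=>  a > 1/4
    Lower bound: 1/lambda_n = O(n^{k/(2k+1)})  <=>  a <= k/(2k+1)
    """
    a = Fraction(a)
    return Fraction(1, 4) < a <= Fraction(k, 2 * k + 1)


# lambda_n = n^{-1/3} and lambda_n = n^{-k/(2k+1)} for a few degrees k
for k in (1, 2, 3):
    print(k,
          satisfies_condition_3(Fraction(1, 3), k),
          satisfies_condition_3(Fraction(k, 2 * k + 1), k))
```

Under this reading, $a=1/3$ passes for every $k\ge 1$ because $k/(2k+1)\ge 1/3$ exactly when $k\ge 1$.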

The log-profile penalized likelihood is defined as follows:

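$\log pl_{\lambda_n}(\theta)=\log lik(\theta,\hat{\eta}_{\theta,\lambda_n})-n\lambda_n^2J^2(\hat{\eta}_{\theta,\lambda_n}),$
(4)

where $\hat{\eta}_{\theta,\lambda_n}$ is $\operatorname{argmax}_{\eta\in\mathcal{H}_k}\log lik_{\lambda_n}(\theta,\eta)$ for fixed θ and $\lambda_n$. Note that $J(\hat{\eta}_{\tilde{\theta}_n,0})\ge J(\hat{\eta}_{\tilde{\theta}_n,\lambda_n})$, where $\hat{\eta}_{\theta,0}=\hat{\eta}_\theta=\operatorname{argmax}_{\eta\in\mathcal{H}_k}\log lik(\theta,\eta)$ for a fixed θ, based on the inequality $\log lik_{\lambda_n}(\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,0})\le\log lik_{\lambda_n}(\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})$. Hence again we verify that the smoothing parameter $\lambda_n$ plays a role in determining the complexity of the estimated nuisance parameter. The penalized profile sampler is just the procedure of sampling from the posterior distribution of $pl_{\lambda_n}(\theta)$ by assigning a prior on θ. By analyzing the corresponding MCMC chain from the frequentist's point of view, our paper obtains the following conclusions: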

1. Distribution Approximation: The posterior distribution with respect to $pl_{\lambda_n}(\theta)$ can be approximated by the normal distribution with mean the maximum penalized likelihood estimator of θ and variance the inverse of the efficient information matrix, with error $O_P(n^{1/2}\lambda_n^2)$;
2. Moment Approximation: The maximum penalized likelihood estimator of θ can be approximated by the mean of the MCMC chain with error $O_P(\lambda_n^2)$. The efficient information matrix can be approximated by the inverse of the variance of the MCMC chain with error $O_P(n^{1/2}\lambda_n^2)$;
3. Confidence Interval Approximation: An exact frequentist confidence interval of Wald's type for θ can be estimated by the credible set obtained from the MCMC chain with error $O_P(\lambda_n^2)$.
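The procedure just described can be sketched in a few lines. The example below is a simplified stand-in and not the models analyzed in this paper: it assumes a fully observed Gaussian partly linear model so that the penalized nuisance fit has a closed form, represents f by a hypothetical piecewise-linear basis with a second-difference penalty in place of $J^2$, and uses a flat prior with a random-walk Metropolis step. All function names and tuning values are illustrative assumptions.

```python
import numpy as np


def make_penalized_profile(y, u, v, lam, n_knots=10):
    """Return theta -> log pl_lambda(theta) for a toy Gaussian partly linear model."""
    n = len(y)
    # Hypothetical basis: piecewise-linear "hat" functions on equally spaced knots.
    knots = np.linspace(v.min(), v.max(), n_knots)
    h = knots[1] - knots[0]
    B = np.maximum(0.0, 1.0 - np.abs(v[:, None] - knots[None, :]) / h)
    # Second-difference roughness penalty, a discrete stand-in for J^2(f) with k = 2.
    D = np.diff(np.eye(n_knots), n=2, axis=0) / h**2
    Omega = h * D.T @ D
    A = B.T @ B + 2.0 * n * lam**2 * Omega

    def log_pl(theta):
        r = y - theta * u
        beta = np.linalg.solve(A, B.T @ r)  # penalized nuisance fit at this theta
        resid = r - B @ beta
        return -0.5 * resid @ resid - n * lam**2 * beta @ Omega @ beta

    return log_pl


def penalized_profile_sampler(log_pl, theta_init, n_iter=4000, step=0.2, seed=0):
    """Random-walk Metropolis on exp(log_pl) under a flat prior (scalar theta)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    theta, lp = theta_init, log_pl(theta_init)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_pl(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # flat prior cancels in the ratio
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain


# Toy data: theta0 = 1, f(v) = sin(pi v), unit-variance Gaussian noise.
rng = np.random.default_rng(1)
n = 200
u = rng.uniform(0.0, 1.0, n)
v = rng.uniform(-1.0, 1.0, n)
y = 1.0 * u + np.sin(np.pi * v) + rng.standard_normal(n)

log_pl = make_penalized_profile(y, u, v, lam=n ** (-1 / 3))
chain = penalized_profile_sampler(log_pl, theta_init=0.0)[1000:]  # drop burn-in
print(chain.mean(), chain.std())
```

The chain mean and variance are the quantities that conclusions 1 and 2 compare with the maximum penalized likelihood estimator and the efficient information; for the models treated in this paper the inner maximization would have to be done numerically rather than by a linear solve.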

Obviously, given any smoothing parameter satisfying the upper bound in (3), the penalized profile sampler can yield first order frequentist valid inference for θ, similar to what was shown for the profile sampler in [9]. Moreover, the above conclusions are actually second order frequentist valid results, whose approximation accuracy is directly controlled by the smoothing parameter. Note that the corresponding results for the usual (non-penalized) profile sampler with nuisance parameter convergence rate r in [3] are obtained by replacing in the above $O_P(n^{1/2}\lambda_n^2)$ with $O_P(n^{-1/2}\vee n^{-2r+1/2})$ and $O_P(\lambda_n^2)$ with $O_P(n^{-1}\vee n^{-2r})$, for all respective occurrences, where r is as defined in (1).
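As a worked instance of the penalized rates, consider the two deterministic choices $\lambda_n=n^{-1/3}$ and $\lambda_n=n^{-2/5}$, both of which satisfy (3) when k = 2. Substituting into the error terms of the three conclusions gives:

$\lambda_n=n^{-1/3}:\quad O_P(n^{1/2}\lambda_n^2)=O_P(n^{-1/6}),\qquad O_P(\lambda_n^2)=O_P(n^{-2/3}),$

$\lambda_n=n^{-2/5}:\quad O_P(n^{1/2}\lambda_n^2)=O_P(n^{-3/10}),\qquad O_P(\lambda_n^2)=O_P(n^{-4/5}).$

The smaller smoothing parameter therefore gives the sharper second order approximation, which is the sense in which the accuracy is controlled by $\lambda_n$.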

Our results are the first general higher order frequentist inference results for penalized semiparametric estimation. We also note, however, that some results on second order efficiency of semiparametric estimators were derived in [4]. The layout of the article is as follows. The next section, section 2, introduces the two main examples we will be using for illustration: partly linear regression for current status data and semiparametric logistic regression. Some background is given in section 3, including the concept of a least favorable submodel as well as the main model assumptions. One preliminary theorem concerning second order asymptotic expansions of the log-profile penalized likelihood is also presented in section 3. The main results and implications are discussed in section 4, and all remaining model assumptions are verified for the examples in section 5. A brief discussion of future work is given in section 6. We postpone all technical tools and proofs to the last section, section 7.

## 2 Examples

### 2.1 Partly Linear Normal Model with Current Status Data

In this example, we study the partly linear regression model with normal residual error. The continuous outcome Y, conditional on the covariates $(U,V)\in\mathbb{R}^d\times\mathbb{R}$, is modeled as

$Y=\theta^TU+f(V)+\varepsilon,$
(5)

where f is an unknown smooth function, and $\varepsilon\sim N(0,\sigma^2)$ with finite variance $\sigma^2$. For simplicity, we assume for the rest of the paper that σ = 1. The theory we propose also works when σ is unknown, but the added complexity would detract from the main issues. We also assume that only the current status of response Y is observed at a random censoring time $C\in\mathbb{R}$. In other words, we observe X = (C, Δ, U, V), where the indicator Δ = 1{Y ≤ C}. Current status data may occur due to study design or measurement limitations. Examples of such data arise in several fields, including demography, epidemiology and econometrics. For simplicity of exposition, θ is assumed to be one dimensional.

Under the model (5) and given that the joint distribution for (C, U, V) does not involve parameters (θ, f), the log-likelihood for a single observation at X = x ≡ (c, δ, u, v) is

$\log lik_{\theta,f}(x)=\delta\log\{\Phi(c-\theta u-f(v))\}+(1-\delta)\log\{1-\Phi(c-\theta u-f(v))\},$
(6)

where Φ is the cdf of the standard normal distribution. The parameter of interest, θ, is assumed to belong to some compact set in $\mathbb{R}^1$. The nuisance parameter is the function f, which belongs to the Sobolev function class of degree k. We further make the following assumptions on this model. We assume that Y and C are independent given (U, V). The covariates (U, V) are assumed to belong to some compact set, and the support for the random censoring time C is an interval $[l_c,u_c]$, where $-\infty<l_c<u_c<\infty$. In addition, $P\,\mathrm{Var}(U|V)$ is strictly positive and $Pf(V)=0$. The first order asymptotic behaviors of the penalized log-likelihood estimates of a slightly more general version of this model have been extensively studied in [10].
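The single-observation log-likelihood (6) is straightforward to evaluate. The sketch below uses only the standard library; the function names are hypothetical, θ is taken to be scalar as in the text, and the clipping of the probability away from 0 and 1 is an added numerical safeguard rather than part of the model.

```python
import math


def std_normal_cdf(x):
    """Phi(x), the standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def loglik_current_status(theta, f, c, delta, u, v):
    """Log-likelihood (6) for one observation x = (c, delta, u, v)."""
    q = c - theta * u - f(v)
    p = std_normal_cdf(q)
    # Assumption: clip p so that the logarithms stay finite for extreme q.
    p = min(max(p, 1e-300), 1.0 - 1e-16)
    return delta * math.log(p) + (1 - delta) * math.log(1.0 - p)


# Example with theta = 1 and f(v) = sin(pi v)
print(loglik_current_status(1.0, lambda v: math.sin(math.pi * v), 1.2, 1, 0.5, 0.25))
```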

### 2.2 Semiparametric Logistic Regression

Let X1 = (Y1, W1, Z1), X2 = (Y2, W2, Z2), … be independent copies of X = (Y, W, Z), where Y is a dichotomous variable with conditional expectation $P(Y|W,Z)=F(\theta^TW+\eta(Z))$. F(u) is the logistic distribution defined as $e^u/(e^u+1)$. Obviously the likelihood for a single observation is of the following form:

$lik_{\theta,\eta}(x)=F(\theta^Tw+\eta(z))^y(1-F(\theta^Tw+\eta(z)))^{1-y}f^{(W,Z)}(w,z).$
(7)

This example is a special case of quasi-likelihood in partly linear models when the conditional variance of response Y is taken to have some quadratic form of the conditional mean of Y. In the absence of any restrictions on the form of the function η, maximum likelihood estimation for this simple model often leads to over-fitting. Hence [5] propose maximizing instead the penalized likelihood of the form $\log lik(\theta,\eta)-n\lambda_n^2J^2(\eta)$, and [13] showed the asymptotic consistency of the maximum penalized likelihood estimators for θ and η. For simplicity, we will restrict ourselves to the case where $\Theta\subset\mathbb{R}^1$ and (W, Z) have bounded support, say $[0,1]^2$. To ensure the identifiability of the parameters, we assume that $P\,\mathrm{Var}(W|Z)$ is positive and that the support of Z contains at least k distinct points in [0, 1]; see lemma 7.1 in [15].
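The part of the log of (7) that involves (θ, η) can be written as $y\,s-\log(1+e^{s})$ with $s=\theta w+\eta(z)$, since the density $f^{(W,Z)}$ does not depend on the parameters. The following sketch evaluates that term for scalar θ; the function names are hypothetical and the overflow-safe rearrangement is an implementation choice, not something prescribed by the model.

```python
import math


def logistic_cdf(s):
    """F(s) = e^s / (e^s + 1), arranged to avoid overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def loglik_logistic(theta, eta, y, w, z):
    """Parameter-dependent part of log lik_{theta,eta}(x) in (7), scalar theta."""
    s = theta * w + eta(z)
    # log(1 + e^s) computed stably as max(s, 0) + log1p(exp(-|s|))
    return y * s - (max(s, 0.0) + math.log1p(math.exp(-abs(s))))


# Example with theta = 2 and eta(z) = z
print(loglik_logistic(2.0, lambda z: z, 1, 0.5, 0.3))
```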

#### Remark 1

Another interesting potential example we may apply the penalized profile sampler method to is the classic proportional hazards model with current status data by penalizing the cumulative hazard function with its Sobolev norm. There are two motivations for us to penalize the cumulative hazard function in the Cox model. One is that the estimated step functions from the unpenalized estimation cannot be used easily for other estimation or inference purposes. Another issue with the unpenalized approach is that without making stronger continuity assumptions, we cannot achieve uniform consistency even on a compact set [10]. The asymptotic properties of the corresponding penalized M-estimators have been studied in [12].

## 3 Preliminaries

In this section, we present some necessary preliminary material concerning least favorable sub-models and assume some structural requirements to achieve second order asymptotic expansion of the log-profile penalized likelihood (21).

### 3.1 Least favorable submodels

In this subsection, we briefly review the concept of a least favorable submodel. A submodel $t\mapsto lik_{t,\eta_t}$ is defined to be least favorable at (θ, η) if $\tilde{\ell}_{\theta,\eta}=\partial/\partial t\,\log lik_{t,\eta_t}$, given t = θ, where $\tilde{\ell}_{\theta,\eta}$ is the efficient score function for θ. The efficient score function for θ can be viewed as the projection of the score function for θ onto the orthogonal complement of the tangent space of η. The inverse of its variance is exactly the efficient information matrix $\tilde{I}_{\theta,\eta}$. We abbreviate hereafter $\tilde{\ell}_{\theta_0,\eta_0}$ and $\tilde{I}_{\theta_0,\eta_0}$ with $\tilde{\ell}_0$ and $\tilde{I}_0$, respectively. The "direction" along which $\eta_t$ approaches η in the least favorable submodel is called the least favorable direction. An insightful review about least favorable submodels and efficient score functions can be found in Chapter 3 of [7]. We assume that in our setting a least favorable submodel always exists. By the above construction of the least favorable submodel, $\log pl_{\lambda_n}(\theta)$ can be rewritten in the following form:

$\log pl_{\lambda_n}(\theta)=n\left(\mathbb{P}_n\ell(\theta,\theta,\hat{\eta}_{\theta,\lambda_n})-\lambda_n^2J^2(\eta_\theta(\theta,\hat{\eta}_{\theta,\lambda_n}))\right),$
(8)

where $\ell(t,\theta,\eta)(x)=\ell_{t,\eta_t(\theta,\eta)}(x)$, $t\mapsto\eta_t(\theta,\eta)$ is a general map from the neighborhood of θ into the parameter set for η, with $\eta_\theta(\theta,\eta)=\eta$. The concrete forms of (8) will depend on the situation.

The derivatives $\dot{\ell}$, $\ddot{\ell}$ and $\ell^{(3)}$ of the function $\ell(t,\theta,\eta)$ are with respect to its first argument, t. For the derivatives relative to the argument θ, we use the following shortened notation: $\ell_\theta(t,\theta,\eta)$ indicates the first derivative of $\ell(t,\theta,\eta)$ with respect to θ and $\ell_{t,\theta}(t,\theta,\eta)$ denotes the derivative of $\dot{\ell}(t,\theta,\eta)$ relative to θ. Also, $\ell_{t,t}(\theta)$ and $\ell_{t,\theta}(\eta)$ indicate the maps $\theta\mapsto\ddot{\ell}(t,\theta,\eta)$ and $\eta\mapsto\ell_{t,\theta}(t,\theta,\eta)$, respectively. For brevity, we denote $\dot{\ell}_0=\dot{\ell}(\theta_0,\theta_0,\eta_0)$, $\ddot{\ell}_0=\ddot{\ell}(\theta_0,\theta_0,\eta_0)$ and $\ell_0^{(3)}=\ell^{(3)}(\theta_0,\theta_0,\eta_0)$, where θ0 and η0 are the true values of θ and η. Of course, we can write $\dot{\ell}_0(X)$ as $\tilde{\ell}_0(X)$ based on the construction of the least favorable submodel. All the necessary derivatives of $\ell(t,\theta,\eta)$ w.r.t. t or θ in this paper are assumed to have integrable envelope functions in some neighborhood of (θ0, θ0, η0). In the following, we use $P_{\theta,\eta}U$ to denote the expectation of a random variable U at the parameter (θ, η), and use PU to represent $P_{\theta_0,\eta_0}U$ for simplicity.

### 3.2 Main Assumptions

The set of structural conditions about the least favorable submodel are the “no-bias” conditions:

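$P\dot{\ell}(\theta_0,\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})=O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|)^2,$
(9)

$P\ddot{\ell}(\theta_0,\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})=P\ddot{\ell}_0+O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|),$
(10)

for any sequence $\tilde{\theta}_n$ satisfying $\tilde{\theta}_n=\theta_0+o_P(1)$. The verifications of (9) and (10) depend on the smoothness of $\ell(t,\theta,\eta)$ and the convergence rate of the penalized nuisance parameter based on the functional Taylor expansions around the true values. The convergence rate typically has the following upper bound:

$d(\hat{\eta}_{\tilde{\theta}_n,\lambda_n},\eta_0)=O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|).$
(11)

The form of d(η, η0) may vary for different situations and does not need to be specified in this subsection beyond the given conditions. (11) implies that $\hat{\eta}_{\tilde{\theta}_n,\lambda_n}$ is consistent for η0 as $\tilde{\theta}_n\to\theta_0$ in probability. Hence (9) and (10) hold provided the Fréchet derivatives of the maps $\eta\mapsto\ddot{\ell}(\theta_0,\theta_0,\eta)$ and $\eta\mapsto\ell_{t,\theta}(\theta_0,\theta_0,\eta)$ are bounded, and

$P\dot{\ell}(\theta_0,\theta_0,\eta)=O(d^2(\eta,\eta_0)),$
(12)

which is usually implied by a bounded Fréchet derivative of $\eta\mapsto\dot{\ell}(\theta_0,\theta_0,\eta)$ and second order Fréchet differentiability of the map $\eta\mapsto lik(\theta_0,\eta)$.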

The empirical version of the no-bias conditions,

$\mathbb{P}_n\dot{\ell}(\theta_0,\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})=\mathbb{P}_n\tilde{\ell}_0+O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|)^2,$
(13)

$\mathbb{P}_n\ddot{\ell}(\theta_0,\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})=P\ddot{\ell}_0+O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|),$
(14)

where $\mathbb{P}_n$ represents the empirical distribution of the observations, ensures that the penalized profile likelihood behaves like a penalized likelihood in the parametric model asymptotically and therefore yields a second order asymptotic expansion of the penalized profile log-likelihood. Obviously the empirical no-bias conditions are built upon (9) and (10) by assuming the sizes of the collections of the functions $\dot{\ell}$ and $\ddot{\ell}$ are manageable. This condition is expressed in the language of empirical processes. Provided that $\ddot{\ell}_0$ and $\ell_{t,\theta}(\theta_0,\theta_0,\eta_0)$ are square integrable, (14) follows from (10) if we assume

$\mathbb{G}_n(\ddot{\ell}(\theta_0,\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})-\ddot{\ell}_0)=o_P(1),$
(15)

where $\mathbb{G}_n\equiv\sqrt{n}(\mathbb{P}_n-P)$ is used for the empirical processes of the observations. If we further assume that

$\mathbb{G}_n(\ell_{t,\theta}(\theta_0,\bar{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})-\ell_{t,\theta}(\theta_0,\theta_0,\eta_0))=o_P(1),$
(16)

$\mathbb{G}_n(\dot{\ell}(\theta_0,\theta_0,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})-\dot{\ell}_0)=O_P(n^{\frac{1}{4k+2}}(\lambda_n+\|\tilde{\theta}_n-\theta_0\|)),$
(17)

for any sequence $\tilde{\theta}_n$ satisfying $\tilde{\theta}_n=\theta_0+o_P(1)$, then (13) follows. Note that the conditions (15)–(17) are concerned with the asymptotic equicontinuity of the empirical process measure of $\ddot{\ell}$, $\ell_{t,\theta}$ and $\dot{\ell}$, respectively. Thus we will be able to use technical tools T2 and T5 given in the appendix to show (15)–(17). We next present the preliminary theorem about the second order asymptotic expansion of the log-profile penalized likelihood, which prepares us for deriving the main results about the higher order structure of the penalized profile sampler in the next section.

#### Theorem 1

Let (13) and (14) be satisfied and suppose that

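$(\mathbb{P}_n-P)\ell^{(3)}(\bar{\theta}_n,\tilde{\theta}_n,\hat{\eta}_{\tilde{\theta}_n,\lambda_n})=o_P(1),$
(18)

$\lambda_nJ(\hat{\eta}_{\tilde{\theta}_n,\lambda_n})=O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|),$
(19)

for any sequences $\bar{\theta}_n$ and $\tilde{\theta}_n$ satisfying $\bar{\theta}_n=\theta_0+o_P(1)$ and $\tilde{\theta}_n=\theta_0+o_P(1)$. If θ0 is an interior point in Θ and $\hat{\theta}_{\lambda_n}$ is consistent, then we have

$\sqrt{n}(\hat{\theta}_{\lambda_n}-\theta_0)=\frac{1}{\sqrt{n}}\sum_{i=1}^n\tilde{I}_0^{-1}\tilde{\ell}_0(X_i)+O_P(n^{1/2}\lambda_n^2),$
(20)

$\log pl_{\lambda_n}(\tilde{\theta}_n)=\log pl_{\lambda_n}(\hat{\theta}_{\lambda_n})-\frac{n}{2}(\tilde{\theta}_n-\hat{\theta}_{\lambda_n})^T\tilde{I}_0(\tilde{\theta}_n-\hat{\theta}_{\lambda_n})+O_P(g_{\lambda_n}(\|\tilde{\theta}_n-\hat{\theta}_{\lambda_n}\|)),$
(21)

where $g_{\lambda_n}(w)=nw^3+nw^2\lambda_n+nw\lambda_n^2+n^{1/2}\lambda_n^2$, provided the efficient information $\tilde{I}_0$ is positive definite.

For the verification of (18), we need to make use of a Glivenko-Cantelli theorem for classes of functions that change with n, which is a modification of theorem 2.4.3 in [22] and is explained in the appendix. Moreover, (19) implies that $J(\hat{\eta}_{\lambda_n})=O_P(1)$ if $\hat{\theta}_{\lambda_n}$ is asymptotically normal, which has been shown in (20).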

#### Remark 2

The results in theorem 1 are useful in their own right for inference about θ. (20) is a second order frequentist result in penalized semiparametric estimation regarding the asymptotic linearity of the maximum penalized likelihood estimator of θ.
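One practical use of the quadratic expansion (21) is a numerical-derivative estimate of the efficient information: for a small step s, $\log pl_{\lambda_n}(\hat{\theta}_{\lambda_n}\pm s)-\log pl_{\lambda_n}(\hat{\theta}_{\lambda_n})\approx-\tfrac{n}{2}s^2\tilde{I}_0$. The sketch below implements this for scalar θ. The function names are hypothetical, and averaging the two sides is an illustrative choice that may differ from the exact scheme described in [3].

```python
import math


def efficient_info_from_profile(log_pl, theta_hat, n, step):
    """Second-difference estimate of the efficient information based on (21)."""
    drop = 0.5 * (log_pl(theta_hat + step) + log_pl(theta_hat - step)) - log_pl(theta_hat)
    return -2.0 * drop / (n * step**2)


def standard_error(info, n):
    """Standard error of theta-hat implied by an information estimate."""
    return 1.0 / math.sqrt(n * info)


# Sanity check on an exactly quadratic log-profile with information 2.5
n, info, theta_hat = 100, 2.5, 0.7
quad = lambda t: 3.0 - 0.5 * n * info * (t - theta_hat) ** 2
est = efficient_info_from_profile(quad, theta_hat, n, step=n ** (-1 / 3))
print(est, standard_error(est, n))
```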

## 4 Main Results and Implications

We now state the main results on the penalized posterior profile distribution. A preliminary result, theorem 2 with corollary 1 below, shows that the penalized posterior profile distribution is asymptotically close to the distribution of a normal random variable with mean $\hat{\theta}_{\lambda_n}$ and variance $(n\tilde{I}_0)^{-1}$ with second order accuracy, which is controlled by the smoothing parameter. Similar conclusions also hold for the penalized posterior moments. Another main result, theorem 3, shows that the penalized posterior profile log-likelihood can be used to achieve second order accurate frequentist inference for θ.

Let $\tilde{P}^{\lambda_n}_{\theta|\tilde{X}}$ be the penalized posterior profile distribution of θ with respect to the prior ρ(θ). Define

$\Delta_{\lambda_n}(\theta)=n^{-1}\{\log pl_{\lambda_n}(\theta)-\log pl_{\lambda_n}(\hat{\theta}_{\lambda_n})\}.$

### Theorem 2

Let (20) and (21) be satisfied and suppose that

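$\Delta_{\lambda_n}(\tilde{\theta}_n)=o_P(1)\ \text{implies}\ \tilde{\theta}_n=\theta_0+o_P(1),$
(22)

for every random $\{\tilde{\theta}_n\}\in\Theta$. If the prior ρ is proper with ρ(θ0) > 0 and ρ(·) has a continuous and finite first order derivative in some neighborhood of θ0, then we have

$\sup_{\xi\in\mathbb{R}^d}\left|\tilde{P}^{\lambda_n}_{\theta|\tilde{X}}\left(\sqrt{n}\tilde{I}_0^{1/2}(\theta-\hat{\theta}_{\lambda_n})\le\xi\right)-\Phi_d(\xi)\right|=O_P(n^{1/2}\lambda_n^2),$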
(23)

where Φd(·) is the distribution of the d-dimensional standard normal random variable.

### Corollary 1

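Under the assumptions of theorem 2, we have that if θ has finite second absolute moment, then

$\hat{\theta}_{\lambda_n}=E^{\lambda_n}_{\theta|\tilde{X}}(\theta)+O_P(\lambda_n^2),$
(24)

$\tilde{I}_0=n^{-1}(Var^{\lambda_n}_{\theta|\tilde{X}}(\theta))^{-1}+O_P(n^{1/2}\lambda_n^2),$
(25)

where $E^{\lambda_n}_{\theta|\tilde{X}}(\theta)$ and $Var^{\lambda_n}_{\theta|\tilde{X}}(\theta)$ are the penalized posterior profile mean and penalized posterior profile covariance matrix, respectively.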
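In practice, (24) and (25) suggest reading the estimator and the efficient information directly off the MCMC output. The sketch below does this for scalar θ; `summarize_chain` is a hypothetical helper, and the synthetic normal draws stand in for a real chain purely to illustrate the arithmetic.

```python
import numpy as np


def summarize_chain(chain, n):
    """Chain mean as in (24) and inverse scaled chain variance as in (25)."""
    chain = np.asarray(chain, dtype=float)
    theta_hat = chain.mean()
    info_hat = 1.0 / (n * chain.var(ddof=1))
    return theta_hat, info_hat


# Synthetic stand-in for a chain: N(theta, 1 / (n * I)) with theta = 1, I = 4
rng = np.random.default_rng(0)
n, info = 400, 4.0
fake_chain = rng.normal(1.0, 1.0 / np.sqrt(n * info), size=200_000)
print(summarize_chain(fake_chain, n))
```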

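We now present another second order asymptotic frequentist property of the penalized profile sampler in terms of quantiles. The α-th quantile of the penalized posterior profile distribution, $\tau_{n\alpha}$, is defined as $\tau_{n\alpha}=\inf\{\xi:\tilde{P}^{\lambda_n}_{\theta|\tilde{X}}(\theta\le\xi)\ge\alpha\}$, where the inf is taken componentwise. Without loss of generality, we can assume $\tilde{P}^{\lambda_n}_{\theta|\tilde{X}}(\theta\le\tau_{n\alpha})=\alpha$ because of the assumed smoothness of both the prior and the likelihood in our setting. We can also define $\kappa_{n\alpha}\equiv\sqrt{n}(\tau_{n\alpha}-\hat{\theta}_{\lambda_n})$, i.e., $\tilde{P}^{\lambda_n}_{\theta|\tilde{X}}(\sqrt{n}(\theta-\hat{\theta}_{\lambda_n})\le\kappa_{n\alpha})=\alpha$. Note that neither $\tau_{n\alpha}$ nor $\kappa_{n\alpha}$ is unique if the dimension of θ is larger than one.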

### Theorem 3

Under the assumptions of theorem 2, and assuming that $\tilde{\ell}_0(X)$ has finite third moment with a nondegenerate distribution, there exists a $\hat{\kappa}_{n\alpha}$ based on the data such that $P(\sqrt{n}(\hat{\theta}_{\lambda_n}-\theta_0)\le\hat{\kappa}_{n\alpha})=\alpha$ and $\hat{\kappa}_{n\alpha}-\kappa_{n\alpha}=O_P(n^{1/2}\lambda_n^2)$ for each choice of $\kappa_{n\alpha}$.

### Remark 3

Theorem 3 ensures that there exists a unique α-th quantile for θ up to $O_P(\lambda_n^2)$ in the frequentist set-up for each fixed $\tau_{n\alpha}$. Note that $\tau_{n\alpha}$ is not unique if the dimension of θ is larger than one.

### Remark 4

Theorem 2, corollary 1 and theorem 3 above show that the penalized profile sampler generates second order asymptotic frequentist valid results in terms of distributions, moments and quantiles. Moreover, the second order accuracy of this procedure is controlled by the smoothing parameter.
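For scalar θ, the quantile result can be used by taking empirical quantiles of the chain as interval endpoints and comparing them with a Wald interval built from the chain mean and standard deviation. The sketch below is illustrative only: the helper names are hypothetical, the 1.96 multiplier hard-codes the 95% level, and standard normal draws stand in for a real chain.

```python
import numpy as np


def credible_interval(chain, level=0.95):
    """Equal-tailed credible interval from empirical chain quantiles."""
    lo, hi = np.quantile(chain, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
    return float(lo), float(hi)


def wald_interval_95(chain):
    """95% Wald-type interval using the chain mean and standard deviation."""
    chain = np.asarray(chain, dtype=float)
    center, sd = chain.mean(), chain.std(ddof=1)
    return center - 1.96 * sd, center + 1.96 * sd


rng = np.random.default_rng(0)
fake_chain = rng.standard_normal(200_000)  # stand-in for MCMC output
print(credible_interval(fake_chain), wald_interval_95(fake_chain))
```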

### Remark 5

Another interpretation for the role of $\lambda_n$ in the penalized profile sampler is that we can view $\lambda_n$ as the prior on J(η), or on η to some extent. To see this, we can write $lik_{\lambda_n}(\theta,\eta)$ in the following form:

$lik_{\lambda_n}(\theta,\eta)=lik_n(\theta,\eta)\times\exp\left[-\frac{J^2(\eta)}{2\left(\frac{1}{2n\lambda_n^2}\right)}\right]$

This idea can be traced back to [23]. In other words, the prior on J(η) is a normal distribution with mean zero and variance $(2\lambda_n^2n)^{-1}$. Hence it is natural to expect $\lambda_n$ to have some effect on the convergence rate of η. Other possible priors on the functional parameter include Dirichlet and Gaussian processes, which are more commonly used in nonparametric Bayesian methodology.

## 5 Examples (Continued)

We now illustrate verification of the assumptions in section 3.2 with the two examples that were introduced in section 2. Thus this section is a continuation of the earlier examples.

### 5.1 Partly Linear Normal Model with Current Status Data

In this section we verify the regularity conditions for the partly linear model with current status data as well as present a small simulation study to gain insight into the moderate sample size agreement with the asymptotic theory.

#### 5.1.1 Verification of conditions

We will concentrate on the estimation of the regression coefficient θ, considering the infinite dimensional parameter $f\in\mathcal{H}_k^M$ as a nuisance parameter. The strengthened condition on η, together with the requirement that the density for the joint distribution (U, V, C) is strictly positive and finite, is necessary to verify the rate assumptions (27) and (28) in lemma 1 below. The score function of θ, $\dot{\ell}_{\theta,f}$, is given as follows:

$\dot{\ell}_{\theta,f}(x)=uQ(x;\theta,f),$

where

$Q(X;\theta,f)=(1-\Delta)\frac{\phi(q_{\theta,f}(X))}{1-\Phi(q_{\theta,f}(X))}-\Delta\frac{\phi(q_{\theta,f}(X))}{\Phi(q_{\theta,f}(X))},$

$q_{\theta,f}(x)=c-\theta u-f(v)$, and φ is the density of a standard normal random variable. The least favorable direction at the true parameter value is:

$h_0(v)=\frac{E_0(UQ^2(X;\theta,f)\mid V=v)}{E_0(Q^2(X;\theta,f)\mid V=v)},$

where E0 is the expectation relative to the true parameters. The derivation of $\dot{\ell}_{\theta,f}$ and h0(·) is given in [3]. Thus, the least favorable submodel can be constructed as follows:

$\ell(t,\theta,f)=\log lik(t,f_t(\theta,f)),$
(26)

where $f_t(\theta,f)=f+(\theta-t)h_0$. The concrete forms of $\ell(t,\theta,f)$ and the related derivatives are given in [3], which considers a more rigid model with a known upper bound on the L2 norm of the kth derivative. The remaining assumptions are verified in the following three lemmas:

##### Lemma 1

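Under the above set-up for the partly linear normal model with current status data, we then have for $\lambda_n$ satisfying (3) and $\tilde{\theta}_n\overset{p}{\to}\theta_0$,

$\|\hat{f}_{\tilde{\theta}_n,\lambda_n}-f_0\|_2=O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|),$
(27)

$\lambda_nJ(\hat{f}_{\tilde{\theta}_n,\lambda_n})=O_P(\lambda_n+\|\tilde{\theta}_n-\theta_0\|),$
(28)

where $\|\cdot\|_2$ represents the regular L2 norm. Moreover, if we also assume that $f\in\{g:\|g\|_\infty+J(g)\le\tilde{M}\}$ for some known $\tilde{M}$, then

$\|\hat{f}_{\tilde{\theta}_n}-f_0\|_2=O_P(n^{-k/(2k+1)}+\|\tilde{\theta}_n-\theta_0\|),$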
(29)

provided condition (3) holds.

##### Remark 6

Comparing (27) and (29), lemma 1 implies that the convergence rate of the penalized nuisance parameter estimator is generally slower than that of the non-penalized estimator. This result is not surprising since the slower rate is the trade-off for the smoother nuisance parameter estimator. However, the advantage of the penalized profile sampler is that we can control the convergence rate by assigning the smoothing parameter different rates. To obtain the convergence rate of the non-penalized estimated nuisance parameter, we would need to assume that the Sobolev norm of the nuisance parameter has some known upper bound. Thus we can argue that the penalized method enables a relaxation of the assumptions needed for the nuisance parameter. Lemma 1 also indicates that $\|\hat{f}_{\lambda_n}-f_0\|_2=O_P(\lambda_n)$ and $\|\hat{f}_n-f_0\|_2=O_P(n^{-k/(2k+1)})$. Note that the convergence rate of the maximum penalized likelihood estimator, $O_P(\lambda_n)$, is deemed the optimal rate in [23]. Similar remarks also hold for lemma 4 in the semiparametric logistic regression example below.

Lemmas 1 and 4 imply that $J(\hat{f}_{\lambda_n})=O_P(1)$ and $J(\hat{\eta}_{\lambda_n})=O_P(1)$, respectively. Thus the maximum penalized likelihood estimators of the nuisance parameters in the two examples of this paper are consistent in the uniform norm, i.e. $\|\hat{\eta}_{\lambda_n}-\eta_0\|_\infty=o_P(1)$ and $\|\hat{f}_{\lambda_n}-f_0\|_\infty=o_P(1)$, since the sequences $\hat{\eta}_{\lambda_n}$ and $\hat{f}_{\lambda_n}$ consist of smooth functions defined on a compact set with asymptotically bounded first-order derivatives.

##### Lemma 2

Under the above set-up for the partly linear normal model with current status data, assumptions (13), (14) and (18) are satisfied.

##### Lemma 3

Under the above set-up for the partly linear normal model with current status data, condition (22) is satisfied.

#### 5.1.2 Simulation study

In this subsection, we conduct simulations for the partly linear model with two different sizes of smoothing parameter, i.e. $\lambda_n=n^{-1/3}$ and $\lambda_n=n^{-2/5}$. Since we assume that $f\in\mathcal{H}_2^M$ in the model, the above smoothing parameters satisfy (3). Our experience indicates that, in applications involving moderate sample sizes, specification of M is not needed and $\lambda_n=n^{-1/3}$ ($n^{-2/5}$) appears to work most of the time. Using cross validation to choose $\lambda_n$ may improve the performance of the estimator in some settings, but evaluating this issue requires further study and is beyond the scope of the current paper. The contrast between the two simulations agrees with our theoretical results that we can control the accuracy of inferences based on the penalized profile sampler by adjusting the related smoothing parameter.

We next discuss the computation of $\hat{f}_{\theta,\lambda_n}$ in the simulations. For the special case of k = 2, we can use a cubic spline for estimating f given a fixed θ and $\lambda_n$. In practice, we take a computational sieve approach suggested by Xiang and Wahba [24], which states that an estimate with the number of basis functions growing at least at the rate $O(n^{1/5})$ can achieve the same asymptotic precision as the full space; see section 8.2 in [10] for details.

In the following, the simulations are run for various sample sizes under a Lebesgue prior. For each sample size, 200 datasets were analyzed. The regression coefficient is θ = 1 and f(v) = sin(πv). We generate U ~ Unif[0, 1], V ~ Unif[−1, 1] and C ~ Unif[0, 2]. For each dataset, Markov chains of length 20,000 with a burn-in period of 5,000 were generated using the Metropolis algorithm. The jumping density for the coefficient was normal, centered at the current iterate, with variance tuned to yield an acceptance rate of 20%–40%. The approximate variance of the estimator of θ was computed by numerical differentiation with step size proportional to $n^{-1/3}$ ($n^{-2/5}$) for the model with smoothing parameter $\lambda_n=n^{-1/3}$ ($n^{-2/5}$) according to (21); see remark 1 in [3] for details.
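The data-generating design described above can be reproduced as follows. This is a sketch of the stated design only (θ = 1, f(v) = sin(πv), σ = 1, and the three uniform covariate distributions); the function name and the seed handling are assumptions, and the latent outcome Y is discarded because only its current status is observed.

```python
import numpy as np


def simulate_current_status(n, theta0=1.0, seed=0):
    """Draw one dataset (C, Delta, U, V) from the simulation design."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, n)
    v = rng.uniform(-1.0, 1.0, n)
    c = rng.uniform(0.0, 2.0, n)
    # Latent outcome from model (5) with sigma = 1; it is never observed directly.
    y = theta0 * u + np.sin(np.pi * v) + rng.standard_normal(n)
    delta = (y <= c).astype(int)  # current status indicator 1{Y <= C}
    return c, delta, u, v


c, delta, u, v = simulate_current_status(200)
print(delta.mean())
```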

Table 1 (2) below summarizes the simulation results for θ with smoothing parameter $\lambda_n = n^{-1/3}$ ($n^{-2/5}$), giving the average across 200 samples of the penalized maximum likelihood estimate (PMLE), the mean of the penalized profile sampler (CM), the estimated standard errors based on MCMC (SEM), the estimated standard errors based on numerical derivatives (SEN), and the boundaries of the two-sided 95% confidence interval for θ generated by numerical differentiation and by MCMC. LM (LN) and UM (UN) denote the lower and upper bounds of the confidence interval from the MCMC chain (numerical derivative). According to the above theoretical results, the terms $n^{2/3}|\mathrm{PMLE} - \mathrm{CM}|$ ($n^{4/5}|\mathrm{PMLE} - \mathrm{CM}|$), $n^{1/6}|\mathrm{SEM} - \mathrm{SEN}|$ ($n^{3/10}|\mathrm{SEM} - \mathrm{SEN}|$), $n^{2/3}|\mathrm{LM} - \mathrm{LN}|$ ($n^{4/5}|\mathrm{LM} - \mathrm{LN}|$) and $n^{2/3}|\mathrm{UM} - \mathrm{UN}|$ ($n^{4/5}|\mathrm{UM} - \mathrm{UN}|$) in Table 1 (2) are bounded in probability. The realizations of these terms summarized in Tables 1 and 2 clearly illustrate their boundedness. Furthermore, we can conclude that the penalized profile sampler with different sizes of smoothing parameter yields statistical inference with different degrees of accuracy.

Table 1. Partly Linear Model with $\lambda_n = n^{-1/3}$ ($\theta_0 = 1$ and 200 samples)
Table 2. Partly Linear Model with $\lambda_n = n^{-2/5}$ ($\theta_0 = 1$ and 200 samples)
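The scaled differences reported for the two tables follow one pattern: with $\lambda_n = n^{-a}$, the gaps in the mean and in the interval bounds are scaled by $n^{2a}$ and the gap in the standard errors by $n^{2a - 1/2}$ (so $2/3$ and $1/6$ for $a = 1/3$, and $4/5$ and $3/10$ for $a = 2/5$). The sketch below computes these quantities. The function name and the numbers in the usage line are hypothetical and are not taken from the tables.

```python
def scaled_gaps(n, a, pmle, cm, sem, sen, lower_mcmc, lower_num, upper_mcmc, upper_num):
    """Return the scaled MCMC-versus-numerical gaps for lambda_n = n**(-a)."""
    mean_scale = n ** (2 * a)        # n^{2/3} or n^{4/5}
    se_scale = n ** (2 * a - 0.5)    # n^{1/6} or n^{3/10}
    return {
        "mean": mean_scale * abs(pmle - cm),
        "se": se_scale * abs(sem - sen),
        "lower": mean_scale * abs(lower_mcmc - lower_num),
        "upper": mean_scale * abs(upper_mcmc - upper_num),
    }


# Hypothetical inputs, for illustration only.
print(scaled_gaps(64, 1 / 3, 1.00, 1.01, 0.20, 0.21, 0.60, 0.61, 1.40, 1.42))
```

If the theory holds, each of these four numbers should stay bounded as the sample size grows.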

### 5.2 Semiparametric Logistic Regression

In the semiparametric logistic regression model, we can obtain the score functions for θ and η by an analysis similar to that performed in the first example, i.e. $\dot\ell_{\theta,\eta}(x) = (y - F(\theta w + \eta(z)))w$ and $A_{\theta,\eta}h_{\theta,\eta}(x) = (y - F(\theta w + \eta(z)))h_{\theta,\eta}(z)$ for $J(h_{\theta,\eta}) < \infty$, where $A_{\theta,\eta}$ and $h_{\theta,\eta}$ are the score operator for η and the least favorable direction at (θ, η), respectively. The least favorable direction at the true parameter is given in [15]:

$h_0(z) = \frac{P_0[W\dot F(\theta_0 W + \eta_0(Z)) \mid Z = z]}{P_0[\dot F(\theta_0 W + \eta_0(Z)) \mid Z = z]},$

where $\dot F(u) = F(u)(1 - F(u))$. The above assumptions plus the requirement that $J(h_0) < \infty$ ensure the identifiability of the parameters. Thus the least favorable submodel can be written as:

$\ell(t,\theta,\eta) = \log \mathrm{lik}(t, \eta_t(\theta,\eta)),$

where $\eta_t(\theta,\eta) = \eta + (\theta - t)h_0$. By differentiating $\ell(t,\theta,\eta)$ with respect to t or θ, we obtain,

$\begin{aligned}
\dot\ell(t,\theta,\eta) &= (y - F(tw + \eta(z) + (\theta - t)h_0(z)))(w - h_0(z)),\\
\ddot\ell(t,\theta,\eta) &= -\dot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))^2,\\
\ell_{t,\theta}(t,\theta,\eta) &= -\dot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))h_0(z),\\
\ell^{(3)}(t,\theta,\eta) &= -\ddot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))^3,\\
\ell_{t,t,\theta}(t,\theta,\eta) &= -\ddot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))^2 h_0(z),\\
\ell_{t,\theta,\theta}(t,\theta,\eta) &= -\ddot F(tw + \eta(z) + (\theta - t)h_0(z))(w - h_0(z))h_0^2(z),
\end{aligned}$

where $\ddot F(\cdot)$ is the second derivative of the function F(·). The rate assumptions will be shown in lemma 4. The remaining assumptions are verified in the last two lemmas:

#### Lemma 4

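Under the above set-up for the semiparametric logistic regression model, we have for $\lambda_n$ satisfying condition (3) and any $\tilde\theta_n \overset{p}{\to} \theta_0$ that

$\|\hat\eta_{\tilde\theta_n,\lambda_n} - \eta_0\|_2 = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|),$
(30)
$\lambda_n J(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|).$
(31)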

If we also assume that $\eta \in \{g: \|g\|_\infty + J(g) \le \tilde M\}$ for some known $\tilde M$, then

$\|\hat\eta_{\tilde\theta_n} - \eta_0\|_2 = O_P(n^{-k/(2k+1)} + \|\tilde\theta_n - \theta_0\|),$
(32)

provided condition (3) holds.

#### Lemma 5

Under the above set-up for the semiparametric logistic regression model, assumptions (13), (14) and (18) are satisfied.

#### Lemma 6

Under the above set-up for the semiparametric logistic regression model, condition (22) is satisfied.
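The derivative formulas used in this subsection can be checked numerically. The sketch below assumes the standard logistic function $F(u) = 1/(1 + e^{-u})$, for which $\dot F = F(1 - F)$ and $\ddot F = \dot F(1 - 2F)$, and compares $\ddot\ell$ with a finite-difference derivative of $\dot\ell$ in $t$. The scalar arguments `eta_z` and `h0_z` stand for the values $\eta(z)$ and $h_0(z)$ at one observation, and every number passed in is hypothetical.

```python
import math


def F(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-u))


def F_dot(u):
    """First derivative: F(u)(1 - F(u))."""
    return F(u) * (1.0 - F(u))


def F_ddot(u):
    """Second derivative: F_dot(u)(1 - 2F(u))."""
    return F_dot(u) * (1.0 - 2.0 * F(u))


def central_diff(g, u, h=1e-5):
    """Symmetric finite-difference approximation of g'(u)."""
    return (g(u + h) - g(u - h)) / (2.0 * h)


def ell_dot(t, theta, y, w, eta_z, h0_z):
    """Score of the least favorable submodel with respect to t."""
    lin = t * w + eta_z + (theta - t) * h0_z
    return (y - F(lin)) * (w - h0_z)


def ell_ddot(t, theta, y, w, eta_z, h0_z):
    """Second derivative of the least favorable submodel with respect to t."""
    lin = t * w + eta_z + (theta - t) * h0_z
    return -F_dot(lin) * (w - h0_z) ** 2


args = (1.0, 1, 0.7, 0.2, 0.4)  # theta, y, w, eta(z), h0(z): illustrative values
numeric = central_diff(lambda t: ell_dot(t, *args), 0.9)
print(numeric, ell_ddot(0.9, *args))
```

The two printed numbers should agree to many digits, which confirms the sign and the $(w - h_0(z))^2$ factor in $\ddot\ell$.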

## 6 Future Work

Our paper evaluates the penalized profile sampler method from the frequentist view and discusses the effect of the smoothing parameter on estimation accuracy. One potential problem of interest is to sharpen the upper bound for the convergence rate of the approximation error in this paper, like the typical second-order asymptotic results in Edgeworth expansions, see, for example [1]. A formal study about the higher order comparisons between the profile sampler procedure and fully Bayesian procedure [17], which assigns priors to both the finite dimensional parameter and the infinite dimensional nuisance parameter, is also interesting. We expect that the involvement of a suitable prior on the infinite dimensional parameter would at least not decrease the estimation accuracy of the parameter of interest.

Another worthwhile avenue of research is to develop analogs of the profile sampler and penalized profile sampler to likelihood estimation under model misspecification and to general M-estimation. Some first order results for this setting in the case where the nuisance parameter may not be root-n consistent have been developed for a weighted bootstrap procedure in [11]. The studies about second order asymptotics under mild model misspecifications can provide theoretical insights into semiparametric model selection problems.

## Acknowledgments

The authors thank Dr. Joseph Kadane for several insightful discussions.

## 7 Appendix

We first state classical definitions for the covering number (entropy number) and bracketing number (bracketing entropy number) for a class of functions, and then present some technical tools about the entropy calculations and increments of empirical processes which will be employed in the proofs that follow. The notations $\gtrsim$ and $\lesssim$ mean greater than, or smaller than, up to a universal constant.

#### Definition

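Let $\mathcal{A}$ be a subset of a (pseudo-) metric space $(\mathcal{L}, d)$ of real-valued functions. The δ-covering number $N(\delta, \mathcal{A}, d)$ of $\mathcal{A}$ is the smallest N for which there exist functions $a_1, \ldots, a_N$ in $\mathcal{L}$, such that for each $a \in \mathcal{A}$, $d(a, a_j) \le \delta$ for some $j \in \{1, \ldots, N\}$. The δ-bracketing number $N_B(\delta, \mathcal{A}, d)$ is the smallest N for which there exist pairs of functions $\{[a_j^L, a_j^U]\}_{j=1}^N \subset \mathcal{L}$, with $d(a_j^L, a_j^U) \le \delta$, j = 1, …, N, such that for each $a \in \mathcal{A}$ there is a $j \in \{1, \ldots, N\}$ such that $a_j^L \le a \le a_j^U$. The δ-entropy number (δ-bracketing entropy number) is defined as $H(\delta, \mathcal{A}, d) = \log N(\delta, \mathcal{A}, d)$ ($H_B(\delta, \mathcal{A}, d) = \log N_B(\delta, \mathcal{A}, d)$).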

T1. For each 0 < C < ∞ and δ > 0 we have

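$H_B(\delta, \{\eta: \|\eta\|_\infty \le C, J(\eta) \le C\}, \|\cdot\|_\infty) \lesssim \left(\frac{C}{\delta}\right)^{1/k},$
(33)
$H(\delta, \{\eta: \|\eta\|_\infty \le C, J(\eta) \le C\}, \|\cdot\|_\infty) \lesssim \left(\frac{C}{\delta}\right)^{1/k}.$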
(34)

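T2. Let $\mathcal{F}$ be a class of measurable functions such that $Pf^2 < \delta^2$ and $\|f\|_\infty \le M$ for every f in $\mathcal{F}$. Then

$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim K(\delta, \mathcal{F}, L_2(P))\left(1 + \frac{K(\delta, \mathcal{F}, L_2(P))}{\delta^2\sqrt{n}}M\right),$

where $\|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}}|\mathbb{G}_n f|$ and $K(\delta, \mathcal{F}, \|\cdot\|) = \int_0^\delta \sqrt{1 + H_B(\varepsilon, \mathcal{F}, \|\cdot\|)}\,d\varepsilon$.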
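To get a feel for the size of the entropy integral in T2, the sketch below evaluates $K(\delta)$ numerically when the bracketing entropy is taken to be exactly $(C/\varepsilon)^{1/k}$, i.e. the T1 bound treated as an equality purely for illustration. It is compared with the elementary upper bound $\delta + C^{1/(2k)}\delta^{1 - 1/(2k)}/(1 - 1/(2k))$, which follows from $\sqrt{1 + x} \le 1 + \sqrt{x}$. Both function names are hypothetical.

```python
import math


def entropy_integral(delta, k, C=1.0, steps=100000):
    """Midpoint-rule value of the integral of sqrt(1 + (C/eps)^(1/k)) over (0, delta)."""
    h = delta / steps
    total = 0.0
    for i in range(steps):
        eps = (i + 0.5) * h  # midpoints avoid the singularity at eps = 0
        total += math.sqrt(1.0 + (C / eps) ** (1.0 / k))
    return total * h


def closed_form_upper(delta, k, C=1.0):
    """Upper bound obtained from sqrt(1 + x) <= 1 + sqrt(x)."""
    p = 1.0 - 1.0 / (2 * k)
    return delta + C ** (1.0 / (2 * k)) * delta ** p / p


for d in (0.5, 0.1, 0.01):
    print(d, entropy_integral(d, k=2), closed_form_upper(d, k=2))
```

For small δ the second term dominates, which is the $\delta^{1 - 1/(2k)}$ behaviour that the rate calculations later in the appendix rely on.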

T3. Let $\mathcal{F} = \{f_t: t \in T\}$ be a class of functions satisfying $|f_s(x) - f_t(x)| \le d(s, t)F(x)$ for every s and t and some fixed function F. Then, for any norm ||·||,

$N_B(2\varepsilon\|F\|, \mathcal{F}, \|\cdot\|) \le N(\varepsilon, T, d).$

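T4. Let $\mathcal{F}$ be a class of measurable functions $f: \mathcal{D} \times \mathcal{W} \mapsto \mathbb{R}$ on a product of a finite set and an arbitrary measurable space $(\mathcal{W}, \mathcal{B})$. Let P be a probability measure on $\mathcal{D} \times \mathcal{W}$ and let $P_W$ be its marginal on $\mathcal{W}$. For every $d \in \mathcal{D}$, let $\mathcal{F}_d$ be the set of functions $w \mapsto f(d, w)$ as f ranges over $\mathcal{F}$. If every class $\mathcal{F}_d$ is $P_W$-Donsker with $\sup_{f \in \mathcal{F}}|P_W f(d, W)| < \infty$ for every d, then $\mathcal{F}$ is P-Donsker.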

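T5. Let $\mathcal{F}$ be a uniformly bounded class of measurable functions such that for some measurable $f_0$, $\sup_{f \in \mathcal{F}}\|f - f_0\|_\infty < \infty$. Moreover, assume that $H_B(\varepsilon, \mathcal{F}, L_2(P)) \le K\varepsilon^{-\alpha}$ for some K < ∞ and $\alpha \in (0, 2)$ and for all ε > 0. Then

$\sup_{f \in \mathcal{F}}\left[\frac{|(\mathbb{P}_n - P)(f - f_0)|}{\|f - f_0\|_2^{1 - \alpha/2} \vee n^{(\alpha - 2)/[2(2 + \alpha)]}}\right] = O_P(n^{-1/2}).$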

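T6. For a probability measure P, let $\mathcal{F}_1$ be a class of measurable functions $f_1: \mathcal{X} \mapsto \mathbb{R}$, and let $\mathcal{F}_2$ denote a class of continuous nondecreasing functions $f_2: \mathbb{R} \mapsto [0, 1]$. Then,

$H_B(\varepsilon, \mathcal{F}_2(\mathcal{F}_1), L_2(P)) \le 2H_B(\varepsilon/3, \mathcal{F}_1, L_2(P)) + \sup_Q H_B(\varepsilon/3, \mathcal{F}_2, L_2(Q)).$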

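T7. Let $\mathcal{F}$ and $\mathcal{G}$ be classes of measurable functions. Then for any probability measure Q and any 1 ≤ r ≤ ∞,

$H_B(2\varepsilon, \mathcal{F} + \mathcal{G}, L_r(Q)) \le H_B(\varepsilon, \mathcal{F}, L_r(Q)) + H_B(\varepsilon, \mathcal{G}, L_r(Q)),$
(35)

and, provided $\mathcal{F}$ and $\mathcal{G}$ are bounded by 1 in terms of $\|\cdot\|_\infty$,

$H_B(2\varepsilon, \mathcal{F} \cdot \mathcal{G}, L_r(Q)) \le H_B(\varepsilon, \mathcal{F}, L_r(Q)) + H_B(\varepsilon, \mathcal{G}, L_r(Q)),$
(36)

where $\mathcal{F} \cdot \mathcal{G} \equiv \{f \times g: f \in \mathcal{F} \text{ and } g \in \mathcal{G}\}$.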

##### Remark 7

The proof of T1 is found in [22]. T1 implies that the Sobolev class of functions with known bounded Sobolev norm is P-Donsker. T2 and T3 are, respectively, lemma 3.4.2 and theorem 2.7.11 in [22]. T4 is lemma 9.2 in [16]. T5 is a result presented on page 79 of [20] and is a special case of lemma 5.13 on the same page, the proof of which can be found on pages 79–80. T6 and T7 are, respectively, lemma 15.2 and lemma 9.24 in [7].

##### Proof of theorem 1

We first show (20), and then we need to state one lemma before proceeding to the proof of (21). For the proof of (20), note that

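$0 = \mathbb{P}_n\dot\ell(\hat\theta_{\lambda_n}, \hat\theta_{\lambda_n}, \hat\eta_{\lambda_n}) + 2\lambda_n^2\int_{\mathcal{Z}}\hat\eta_{\lambda_n}^{(k)}(z)h_0^{(k)}(z)\,dz.$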

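Combining the third order Taylor expansion of $\hat\theta_{\lambda_n} \mapsto \mathbb{P}_n\ell(\hat\theta_{\lambda_n}, \theta, \eta)$ around $\theta_0$, where $\theta = \hat\theta_{\lambda_n}$ and $\eta = \hat\eta_{\lambda_n}$, with conditions (13), (14) and (18), the first term on the right-hand side of the above displayed equality equals $\mathbb{P}_n\tilde\ell_0 - \tilde I_0(\hat\theta_{\lambda_n} - \theta_0) + O_P(\lambda_n + \|\hat\theta_{\lambda_n} - \theta_0\|)^2$. By the inequality $2\lambda_n^2\int_{\mathcal{Z}}\hat\eta_{\lambda_n}^{(k)}(z)h_0^{(k)}(z)\,dz \le \lambda_n^2(J^2(\hat\eta_{\lambda_n}) + J^2(h_0))$ and assumption (19), the second term on the right-hand side of the above equality is equal to $O_P(\lambda_n + \|\hat\theta_{\lambda_n} - \theta_0\|)^2$. Combining everything, we obtain the following: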

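$\frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde I_0^{-1}\tilde\ell_0(X_i) = \sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) + O_P(n^{1/2}(\lambda_n + \|\hat\theta_{\lambda_n} - \theta_0\|)^2).$
(37)

The right-hand side of (37) is of the order $O_P(\sqrt{n}\lambda_n^2 + \sqrt{n}w_n(1 + w_n + \lambda_n))$, where $w_n$ represents $\|\hat\theta_{\lambda_n} - \theta_0\|$. However, its left-hand side is trivially $O_P(1)$. Considering the fact that $\sqrt{n}\lambda_n^2 = o_P(1)$, we can deduce that $\hat\theta_{\lambda_n} - \theta_0 = O_P(n^{-1/2})$. Inserting this into the previous display completes the proof of (20).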

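We next prove (21). Note that $\hat\theta_{\lambda_n} - \theta_0 = O_P(n^{-1/2})$. Hence the orders of the remainder terms in (13) and (14) become $O_P(\lambda_n + \|\tilde\theta_n - \hat\theta_{\lambda_n}\|)^2$ and $O_P(\lambda_n + \|\tilde\theta_n - \hat\theta_{\lambda_n}\|)$, respectively. Expression (56) in lemma 7 below implies that

$\log pl_{\lambda_n}(\hat\theta_{\lambda_n}) = \log pl_{\lambda_n}(\theta_0) + n(\hat\theta_{\lambda_n} - \theta_0)^T\mathbb{P}_n\tilde\ell_0 - \frac{n}{2}(\hat\theta_{\lambda_n} - \theta_0)^T\tilde I_0(\hat\theta_{\lambda_n} - \theta_0) + O_P(n^{1/2}\lambda_n^2).$
(38)

The difference between (38) and (56) generates

$\log pl_{\lambda_n}(\tilde\theta_n) = \log pl_{\lambda_n}(\hat\theta_{\lambda_n}) + n(\tilde\theta_n - \hat\theta_{\lambda_n})^T(\mathbb{P}_n\tilde\ell_0 - \tilde I_0(\hat\theta_{\lambda_n} - \theta_0)) - \frac{n}{2}(\tilde\theta_n - \hat\theta_{\lambda_n})^T\tilde I_0(\tilde\theta_n - \hat\theta_{\lambda_n}) + O_P(g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|)).$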

(21) is now immediately obtained after considering (20).
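The inequality used above to control the penalty cross term is a two-step consequence of the Cauchy-Schwarz inequality and $2ab \le a^2 + b^2$. The derivation below assumes the usual form of the penalty, $J^2(\eta) = \int_{\mathcal{Z}}(\eta^{(k)}(z))^2\,dz$, which is what the Taylor expansions of $J^2$ in this appendix indicate.

```latex
\begin{aligned}
2\lambda_n^2 \int_{\mathcal{Z}} \hat\eta_{\lambda_n}^{(k)}(z)\, h_0^{(k)}(z)\, dz
  &\le 2\lambda_n^2
     \Big( \int_{\mathcal{Z}} \big(\hat\eta_{\lambda_n}^{(k)}(z)\big)^2 dz \Big)^{1/2}
     \Big( \int_{\mathcal{Z}} \big(h_0^{(k)}(z)\big)^2 dz \Big)^{1/2}
     && \text{(Cauchy-Schwarz)} \\
  &= 2\lambda_n^2\, J(\hat\eta_{\lambda_n})\, J(h_0) \\
  &\le \lambda_n^2 \big( J^2(\hat\eta_{\lambda_n}) + J^2(h_0) \big)
     && (2ab \le a^2 + b^2).
\end{aligned}
```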

##### Proof of theorem 2

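Suppose that $F_{\lambda_n}(\cdot)$ is the penalized posterior profile distribution of $\sqrt{n}\varrho_n$ with respect to the prior ρ(θ), where the vector $\varrho_n$ is defined as $\tilde I_0^{1/2}(\theta - \hat\theta_{\lambda_n})$. The parameter set for $\varrho_n$ is $\Xi_n$. $F_{\lambda_n}(\cdot)$ can be expressed as:

$F_{\lambda_n}(\xi) = \frac{\int_{\varrho_n \in (-\infty, n^{-1/2}\xi] \cap \Xi_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}{\int_{\varrho_n \in \Xi_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}.$
(39)

Note that $d\varrho_n$ in the above is the short notation for $d\varrho_{n1} \times \ldots \times d\varrho_{nd}$. To prove theorem 2, we first partition the parameter set $\Xi_n$ as $\{\Xi_n \cap \{\|\varrho_n\| > r_n\}\} \cup \{\Xi_n \cap \{\|\varrho_n\| \le r_n\}\}$. By choosing the proper order of $r_n$, we find the posterior mass in the first partition region is of arbitrarily small order, as verified in lemma 2.1 immediately below, and the mass inside the second partition region can be approximated by a stochastic polynomial in powers of $n^{-1/2}$ with error of order dependent on the smoothing parameter, as verified in lemma 2.2 below. This basic technique applies to both the denominator and the numerator, yielding the quotient series, which gives the desired result.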

##### lemma 2.1

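Choose $r_n = o(n^{-1/3})$ and $\sqrt{n}r_n \to \infty$. Under the conditions of theorem 2, we have

$\int_{\|\varrho_n\| > r_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n = O_P(n^{-M}),$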
(40)

for any positive number M.

##### Proof

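Fix r > 0. We then have

$\int_{\|\varrho_n\| > r}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n \lesssim I\{\Delta_{\lambda_n}^r < -n^{-1/2}\}\exp(-\sqrt{n})\int_\Theta\rho(\theta)\,d\theta + I\{\Delta_{\lambda_n}^r \ge -n^{-1/2}\},$

where $\Delta_{\lambda_n}^r = \sup_{\|\varrho_n\| > r}\Delta_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)$. According to lemma 3.2 in [2], $I\{\Delta_{\lambda_n}^r \ge -n^{-1/2}\} = O_P(n^{-M})$ for any positive decreasing r → 0. Note that the above inequality holds uniformly for any decreasing $r_n \to 0$. Therefore, we can choose a positive decreasing sequence $r_n = o(n^{-1/3})$ with $\sqrt{n}r_n \to \infty$ such that (40) holds.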

##### lemma 2.2

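Choose $r_n = o(n^{-1/3})$ and $\sqrt{n}r_n \to \infty$. Under the conditions of theorem 2, we have

$\int_{\|\varrho_n\| \le r_n}\left|\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n) - \exp\left(-\frac{n}{2}\varrho_n^T\varrho_n\right)\rho(\hat\theta_{\lambda_n})\right| \times d\varrho_n = O_P(n^{-(d-1)/2}\lambda_n^2).$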
(41)

##### Proof

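The posterior mass over the region $\|\varrho_n\|_2 \le r_n$ is bounded by

$\begin{aligned}
&\int_{\|\varrho_n\|_2 \le r_n}\left|\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n}) - \exp\left(-\frac{n}{2}\varrho_n^T\varrho_n\right)\rho(\hat\theta_{\lambda_n})\right|d\varrho_n && (*)\\
&+ \int_{\|\varrho_n\|_2 \le r_n}\left|\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n) - \frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\rho(\hat\theta_{\lambda_n})\right|d\varrho_n. && (**)
\end{aligned}$

By (21), we obtain

$(*) = \int_{\|\varrho_n\|_2 \le r_n}\left[\rho(\hat\theta_{\lambda_n})\exp\left(-\frac{n\varrho_n^T\varrho_n}{2}\right)\left|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1\right|\right]d\varrho_n.$

Obviously the order of (*) depends on that of $|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1|$ for $\lambda_n$ satisfying (3) and $\|\varrho_n\| \le r_n$. In order to analyze its order, we partition the set $\{\lambda_n = o_P(n^{-1/4})$ and $\lambda_n^{-1} = O_P(n^{k/(2k+1)})\}$ with the set $\{\lambda_n = O_P(n^{-1/3})\}$, i.e. $U_n = \{\lambda_n = o_P(n^{-1/4}) \text{ and } \lambda_n^{-1} = O_P(n^{k/(2k+1)})\} \cap \{\lambda_n = O_P(n^{-1/3})\}$ and $L_n = \{\lambda_n = o_P(n^{-1/4}) \text{ and } \lambda_n^{-1} = O_P(n^{k/(2k+1)})\} \cap \{\lambda_n = O_P(n^{-1/3})\}^C$. For the set $U_n$, we have $|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1| = g_{\lambda_n}(\|\varrho_n\|) \times O_P(1)$. For the set $L_n$, we have $O_P(g_{\lambda_n}(\|\varrho_n\|)) = O_P(n\|\varrho_n\|\lambda_n^2 + n^{1/2}\lambda_n^2)$. We can take $r_n = n^{-1-\delta}\lambda_n^{-2}$ for some δ > 0 such that $\sqrt{n}r_n \to \infty$ and $r_n = o(n^{-1/3})$. Then $|\exp(O_P(g_{\lambda_n}(\|\varrho_n\|))) - 1| = (n\|\varrho_n\|\lambda_n^2 + n^{1/2}\lambda_n^2) \times O_P(1)$. Combining with the above, we know that $(*) = O_P(n^{-(d-1)/2}\lambda_n^2)$. By similar analysis, we can also show that (**) has the same order. This completes the proof of lemma 2.2.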

We next start the formal proof of theorem 2. By considering both lemma 2.1 and lemma 2.2, we know the denominator of (39) equals

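$\int_{\{\|\varrho_n\|_2 \le r_n\} \cap \Xi_n}\left[\exp\left(-\frac{n}{2}\varrho_n^T\varrho_n\right)\rho(\hat\theta_{\lambda_n})\right]d\varrho_n + O_P(n^{-(d-1)/2}\lambda_n^2).$

The first term in the above display equals

$n^{-d/2}\rho(\hat\theta_{\lambda_n})\int_{\{\|u_n\|_2 \le \sqrt{n}r_n\} \cap \sqrt{n}\Xi_n}e^{-u_n^Tu_n/2}\,du_n = n^{-d/2}\rho(\hat\theta_{\lambda_n})\int_{\mathbb{R}^d}e^{-u_n^Tu_n/2}\,du_n + O(n^{-(d-1)/2}\lambda_n^2),$

where $u_n = \sqrt{n}\varrho_n$. The above equality follows from the inequality that $\int_x^\infty e^{-y^2/2}\,dy \le x^{-1}e^{-x^2/2}$ for any x > 0. Consolidating the above analyses, we deduce that the denominator of (39) equals $n^{-d/2}\rho(\hat\theta_{\lambda_n})(2\pi)^{d/2} + O_P(n^{-(d-1)/2}\lambda_n^2)$. The same analysis also applies to the numerator, thus completing the whole proof.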
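The Gaussian tail inequality used in the last step can be confirmed numerically. The sketch below relies on the identity $\int_x^\infty e^{-y^2/2}\,dy = \sqrt{\pi/2}\,\mathrm{erfc}(x/\sqrt{2})$ and on `math.erfc` from the Python standard library. The function names are hypothetical.

```python
import math


def gaussian_tail_integral(x):
    """Exact value of the integral of exp(-y^2 / 2) over (x, infinity)."""
    return math.sqrt(math.pi / 2.0) * math.erfc(x / math.sqrt(2.0))


def tail_bound(x):
    """The bound exp(-x^2 / 2) / x, valid for x > 0."""
    return math.exp(-x * x / 2.0) / x


for x in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    lhs, rhs = gaussian_tail_integral(x), tail_bound(x)
    print(f"x = {x:5.1f}   integral = {lhs:.3e}   bound = {rhs:.3e}")
```

The bound is loose near zero and becomes sharp as x grows, which is why truncating the integral at $\sqrt{n}r_n \to \infty$ costs only an exponentially small amount.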

##### Proof of corollary 1

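We only show (24) in what follows; (25) can be verified similarly. Showing (24) is equivalent to establishing $\tilde E_{\theta \mid x}^{\lambda_n}(\varrho_n) = O_P(\lambda_n^2)$. Note that $\tilde E_{\theta \mid x}^{\lambda_n}(\varrho_n)$ can be written as:

$\tilde E_{\theta \mid x}^{\lambda_n}(\varrho_n) = \frac{\int_{\varrho_n \in \Xi_n}\varrho_n\,\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}{\int_{\varrho_n \in \Xi_n}\rho(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)\frac{pl_{\lambda_n}(\hat\theta_{\lambda_n} + \tilde I_0^{-1/2}\varrho_n)}{pl_{\lambda_n}(\hat\theta_{\lambda_n})}\,d\varrho_n}.$

By analysis similar to that applied in the proof of theorem 2, we know the denominator in the above display is $n^{-d/2}(2\pi)^{d/2}\rho(\hat\theta_{\lambda_n}) + O_P(n^{-(d-1)/2}\lambda_n^2)$ and the numerator is a random vector of order $O_P(n^{-d/2}\lambda_n^2)$. This yields the conclusion.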

##### Proof of theorem 3

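Note that (23) implies $\kappa_{n\alpha} = \tilde I_0^{-1/2}z_\alpha + O_P(n^{1/2}\lambda_n^2)$, for any ξ < α < 1 − ξ, where $\xi \in (0, \frac{1}{2})$. Note also that the α-th quantile of a d dimensional standard normal distribution, $z_\alpha$, is not unique if d > 1. The classical Edgeworth expansion implies that $P(n^{-1/2}\sum_{i=1}^n\tilde I_0^{-1/2}\tilde\ell_0(X_i) \le z_\alpha + a_n(\alpha)) = \alpha$, where $a_n(\alpha) = O(n^{-1/2})$, for ξ < α < 1 − ξ. Note that $a_n(\alpha)$ is uniquely determined for each fixed $z_\alpha$ since $\tilde\ell_0(X_i)$ has at least one absolutely continuous component. Let $\hat\kappa_{n\alpha} = \tilde I_0^{-1/2}z_\alpha + (\sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) - n^{-1/2}\sum_{i=1}^n\tilde I_0^{-1}\tilde\ell_0(X_i)) + \tilde I_0^{-1/2}a_n(\alpha)$. Then $P(\sqrt{n}(\hat\theta_{\lambda_n} - \theta_0) \le \hat\kappa_{n\alpha}) = \alpha$. Combining with (20), we obtain $\hat\kappa_{n\alpha} = \kappa_{n\alpha} + O_P(n^{1/2}\lambda_n^2)$. The uniqueness of $\hat\kappa_{n\alpha}$ up to order $O_P(n^{1/2}\lambda_n^2)$ follows from that of $a_n(\alpha)$ for each chosen $z_\alpha$.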

##### Proof of lemma 1

We first present a technical lemma before the formal proof of lemma 1. In lemma 1.1 we define

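$\mathcal{K} = \left\{\frac{\ell_{\theta,\eta}(X) - \ell_0(X)}{1 + J(\eta)}: \|\theta - \theta_0\| \le C_1, \|\eta - \eta_0\|_\infty \le C_1, J(\eta) < \infty\right\},$

for a known constant $C_1 < \infty$. Combining with T5, we use condition (42) below to control the order of the increments of the empirical processes indexed by $\ell_{\theta,\eta}$:

$H_B(\varepsilon, \mathcal{K}, L_2(P)) \lesssim \varepsilon^{-1/k}.$
(42)

We next assume two smoothness conditions about the criterion function $(\theta, \eta) \mapsto P\ell_{\theta,\eta}$, i.e.,

$\|\ell_{\theta,\eta} - \ell_0\|_2 \lesssim \|\theta - \theta_0\| + d_\theta(\eta, \eta_0),$
(43)
$P(\ell_{\theta,\eta} - \ell_{\theta,\eta_0}) \lesssim -d_\theta^2(\eta, \eta_0) + \|\theta - \theta_0\|^2.$
(44)

Here $d_\theta^2(\eta, \eta_0)$ can be thought of as the square of a distance, but the following lemma is valid for arbitrary functions $\eta \mapsto d_\theta^2(\eta, \eta_0)$. Finally, we assume a somewhat stronger assumption on the density, i.e.,

$p_{\theta,\eta}/p_{\theta,\eta_0} \text{ is bounded away from zero and infinity}.$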
(45)

But (45) is trivial to satisfy in our first model.

##### Lemma 1.1

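Assume conditions (42)–(45) in the above hold for every $\theta \in \Theta_n$ and $\eta \in \mathcal{V}_n$. Then we have

$d_{\tilde\theta_n}(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|), \quad \lambda_nJ(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|),$

for $(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})$ satisfying $P(\tilde\theta_n \in \Theta_n, \hat\eta_{\tilde\theta_n,\lambda_n} \in \mathcal{V}_n) \to 1$.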

##### Proof of lemma 1.1

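The definition of $\hat\eta_{\tilde\theta_n,\lambda_n}$ implies that

$\lambda_n^2J^2(\hat\eta_{\tilde\theta_n,\lambda_n}) \le \lambda_n^2J^2(\eta_0) + (\mathbb{P}_n - P)(\ell_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - \ell_{\tilde\theta_n,\eta_0}) + P(\ell_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - \ell_{\tilde\theta_n,\eta_0}) \le \lambda_n^2J^2(\eta_0) + I + II.$

Note that by T5 and assumption (42), we have

$\begin{aligned}
I \le{}& (1 + J(\hat\eta_{\tilde\theta_n,\lambda_n}))O_P(n^{-1/2}) \times \left\{\left\|\frac{\ell_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - \ell_0}{1 + J(\hat\eta_{\tilde\theta_n,\lambda_n})}\right\|_2^{1 - \frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\}\\
&+ (1 + J(\eta_0))O_P(n^{-1/2}) \times \left\{\left\|\frac{\ell_{\tilde\theta_n,\eta_0} - \ell_0}{1 + J(\eta_0)}\right\|_2^{1 - \frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\}.
\end{aligned}$

By assumption (44), we have

$II \lesssim -d_{\tilde\theta_n}^2(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0) + \|\tilde\theta_n - \theta_0\|^2.$

Combining with the above, we can deduce that

$\begin{aligned}
\hat d_n^2 + \lambda_n^2\hat J_n^2 \lesssim{}& (1 + \hat J_n)O_P(n^{-1/2}) \times \left\{\left(\frac{\hat d_n + \|\tilde\theta_n - \theta_0\|}{1 + \hat J_n}\right)^{1 - \frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\}\\
&+ (1 + J_0)O_P(n^{-1/2}) \times \left\{\left(\frac{\|\tilde\theta_n - \theta_0\|}{1 + J_0}\right)^{1 - \frac{1}{2k}} \vee n^{-\frac{2k-1}{2(2k+1)}}\right\} + \lambda_n^2J_0^2 + \|\tilde\theta_n - \theta_0\|^2,
\end{aligned}$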
(46)

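where $\hat d_n = d_{\tilde\theta_n}(\hat\eta_{\tilde\theta_n,\lambda_n}, \eta_0)$, $J(\eta_0) = J_0$ and $\hat J_n = J(\hat\eta_{\tilde\theta_n,\lambda_n})$. The above inequality follows from assumption (43). Combining all of the above inequalities, we can deduce that

$u_n^2 = O_P(1) + O_P(1)u_n^{1 - \frac{1}{2k}},$
(47)

$v_n = v_n^{-1}O_P(\|\tilde\theta_n - \theta_0\|^2) + u_n^{1 - \frac{1}{2k}}O_P(\lambda_n) + O_P(n^{-\frac{1}{2}}\lambda_n^{-1}\|\tilde\theta_n - \theta_0\|^{1 - \frac{1}{2k}}),$
(48)

where $u_n = (\hat d_n + \|\tilde\theta_n - \theta_0\|)/(\lambda_n + \lambda_n\hat J_n)$ and $v_n = \lambda_n\hat J_n + \lambda_n$. Equation (47) implies that $u_n = O_P(1)$. Inserting $u_n = O_P(1)$ into (48), we find that $v_n = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$, which implies that $u_n$ has the desired order. This completes the whole proof.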
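The step from (47) to $u_n = O_P(1)$ rests on the exponent $1 - 1/(2k)$ being smaller than 2. A deterministic version of the argument, with a generic constant $C \ge 1$ standing in for the $O_P(1)$ factors (an illustrative simplification), reads as follows.

```latex
\begin{aligned}
&\text{Suppose } u \ge 0 \text{ and } u^2 \le C + C\,u^{1 - \frac{1}{2k}} \text{ with } C \ge 1. \\
&\text{If } u \le 1 \text{ there is nothing to show. If } u > 1, \text{ then } C \le C\,u^{1 - \frac{1}{2k}}, \text{ so} \\
&\qquad u^2 \le 2C\,u^{1 - \frac{1}{2k}}
   \;\Longrightarrow\; u^{1 + \frac{1}{2k}} \le 2C
   \;\Longrightarrow\; u \le (2C)^{\frac{2k}{2k+1}}. \\
&\text{Hence } u \le \max\big\{1,\ (2C)^{\frac{2k}{2k+1}}\big\},
   \text{ a bound depending only on } C \text{ and } k.
\end{aligned}
```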

We now apply lemma 1.1 to derive the related convergence rates in the partly linear model. Conditions (43)–(45) can be verified easily in this example because $\ell_{\theta,f}$ has finite second moment, and $p_{\theta,f}$ is bounded away from zero and infinity uniformly for (θ, f) ranging over the whole parameter space. Note that $d_\theta(f, f_0) = \|p_{\theta,f} - p_0\|_2 \gtrsim \|q_{\theta,f} - q_{\theta_0,f_0}\|_2$ by Taylor expansion. Then by the assumption that $P\,\mathrm{Var}(U|V)$ is positive definite, we know that $\|q_{\tilde\theta_n,\hat f_{\tilde\theta_n,\lambda_n}} - q_{\theta_0,f_0}\|_2 = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$ implies $\|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2 = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$. Thus we only need to show that the ε-bracketing entropy number of the function class $\mathcal{O}$ defined below is of order $\varepsilon^{-1/k}$ to complete the proof of (27)–(28):

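$\mathcal{O} \equiv \left\{\frac{\ell_{\theta,f}(X)}{1 + J(f)}: \|\theta - \theta_0\| \le C_1, \|f - f_0\|_\infty \le C_1, J(f) < \infty\right\},$

for some constant $C_1$. Note that $\ell_{\theta,f}(X)/(1 + J(f))$ can be rewritten as:

$\Delta A^{-1}\log\Phi(\bar q_{\theta,f}A) + (1 - \Delta)A^{-1}\log(1 - \Phi(\bar q_{\theta,f}A)),$
(49)

where $A = 1 + J(f)$ and $\bar q_{\theta,f} \in \mathcal{O}_1$, where

$\mathcal{O}_1 \equiv \left\{\frac{q_{\theta,f}(X)}{1 + J(f)}: \|\theta - \theta_0\| \le C_1, \|f - f_0\|_\infty \le C_1, J(f) < \infty\right\},$

and where we know $H_B(\varepsilon, \mathcal{O}_1, L_2(P)) \lesssim \varepsilon^{-1/k}$ by T1.

We next calculate the ε-bracketing entropy number with $L_2$ norm for the class of functions $\mathcal{R}_1 \equiv \{k_a(t): t \mapsto a^{-1}\log\Phi(at)$ for $a \ge 1$ and $t \in \mathbb{R}\}$. By some analysis we know that $k_a(t)$ is strictly decreasing in a for $t \in \mathbb{R}$, and $\sup_{t \in \mathbb{R}}|k_a(t) - k_b(t)| \lesssim |a - b|$ because $|\partial/\partial a\,(k_a(t))|$ is bounded uniformly over $t \in \mathbb{R}$. In addition, we know that $\sup_{a,b \ge A_0, t \in \mathbb{R}}|k_a(t) - k_b(t)| \lesssim A_0^{-1}$ because the function $u \mapsto u\log\Phi(u^{-1}t)$ has bounded derivative for 0 < u ≤ 1 uniformly over $t \in \mathbb{R}$. The above two inequalities imply that the ε-bracketing number with uniform norm is of order $O(\varepsilon^{-2})$ for $a \in [1, \varepsilon^{-1}]$ and is 1 for $a > \varepsilon^{-1}$. Thus we know $H_B(\varepsilon, \mathcal{R}_1, L_2) = O(\log\varepsilon^{-2})$. By applying a similar analysis to $\mathcal{R}_2 \equiv \{k_a(t): t \mapsto a^{-1}\log(1 - \Phi(at))$ for $a \ge 1$ and $t \in \mathbb{R}\}$, we obtain that $H_B(\varepsilon, \mathcal{R}_2, L_2) = O(\log\varepsilon^{-2})$. Combining this with T6 and T7, we deduce that $H_B(\varepsilon, \mathcal{O}, L_2) \lesssim \varepsilon^{-1/k}$. This completes the proof of (27)–(28).

For the proof of (29), we apply arguments similar to those used in the proof of lemma 1.1 but after setting $\lambda_n$, $J_0$ and $\hat J_n$ to zero in (46). Then we obtain the following equality: $\hat d_n^2 = O_P(n^{-2k/(2k+1)}) + \|\tilde\theta_n - \theta_0\|^2 + O_P(n^{-1/2})\|\tilde\theta_n - \theta_0\|^{1 - 1/(2k)} + O_P(n^{-1/2})(\|\tilde\theta_n - \theta_0\| + \hat d_n)^{1 - 1/(2k)}$. By treating $\|\tilde\theta_n - \theta_0\| \le n^{-k/(2k+1)}$ and $\|\tilde\theta_n - \theta_0\| > n^{-k/(2k+1)}$ differently in the above equality, we obtain (29).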

##### Proof of lemma 2

Based on the discussions of (13) and (14), we need to verify the smoothness conditions and asymptotic equicontinuity conditions, i.e. (15)–(17), for the function $\ell(t, \theta, \eta)$ and its related derivatives. The first set of conditions is verified in lemma 5 of [3]. For the verifications of (15)–(17), we first show condition (17). Without loss of generality, we assume that $\lambda_n$ is bounded below by a multiple of $n^{-k/(2k+1)}$ and bounded above by $n^{-1/4}$ in view of (3). Thus

$P\left(\frac{\dot\ell(\theta_0, \theta_0, \hat f_{\tilde\theta_n,\lambda_n}) - \dot\ell_0}{n^{\frac{1}{4k+2}}(\lambda_n + \|\tilde\theta_n - \theta_0\|)}\right)^2 \lesssim \frac{\|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2^2}{n^{\frac{1}{2k+1}}(\lambda_n + \|\tilde\theta_n - \theta_0\|)^2} = O_P(n^{-\frac{1}{2k+1}}),$

where (27) implies the equality in the above expression.

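By (28), we know that $J(\hat f_{\tilde\theta_n,\lambda_n}) = O_P(1 + \|\tilde\theta_n - \theta_0\|/\lambda_n)$ and $\|\hat f_{\tilde\theta_n,\lambda_n}\|_\infty$ is bounded by some constant, since $f \in \mathcal{H}_k^M$. We then define the set $\mathcal{Q}_n$ as follows:

$\left\{\frac{\dot\ell(\theta_0, \theta_0, f) - \dot\ell_0}{n^{\frac{1}{4k+2}}(\lambda_n + \|\theta - \theta_0\|)}: J(f) \le C_n\left(1 + \frac{\|\theta - \theta_0\|}{\lambda_n}\right), \|f\|_\infty \le M, \|\theta - \theta_0\| \le \delta\right\} \cap \left\{g \in L_2(P): Pg^2 \le C_nn^{-\frac{1}{2k+1}}\right\},$

for some δ > 0. Obviously the function $n^{-1/(4k+2)}(\dot\ell(\theta_0, \theta_0, \hat f_{\tilde\theta_n,\lambda_n}) - \dot\ell_0)/(\lambda_n + \|\tilde\theta_n - \theta_0\|)$ belongs to $\mathcal{Q}_n$ on a set of probability arbitrarily close to one, as $C_n \to \infty$. If we can show $\lim_{n\to\infty}E^*\|\mathbb{G}_n\|_{\mathcal{Q}_n} < \infty$ by T2, then assumption (17) is verified. Note that $\dot\ell(\theta_0, \theta_0, f)$ depends on f in a Lipschitz manner. Consequently we can bound $H_B(\varepsilon, \mathcal{Q}_n, L_2(P))$ by the product of some constant and $H(\varepsilon, \mathcal{R}_n, L_2(P))$ in view of T3. $\mathcal{R}_n$ is defined as

$\{H_n(f): J(H_n(f)) \lesssim \lambda_n^{-1}n^{-1/(4k+2)}, \|H_n(f)\|_\infty \lesssim \lambda_n^{-1}n^{-1/(4k+2)}\},$

where $H_n(f) = f/(n^{1/(4k+2)}(\lambda_n + \|\theta - \theta_0\|))$. By [22], we know that

$H(\varepsilon, \mathcal{R}_n, L_2(P)) \lesssim \left(\lambda_n^{-1}n^{-1/(4k+2)}/\varepsilon\right)^{1/k}.$

Note that $\delta_n = n^{-1/(4k+2)}$ and $M_n = n^{(2k-1)/(4k+2)}$ in T2. Thus by calculation we know that $K(\delta_n, \mathcal{Q}_n, L_2(P)) \lesssim \lambda_n^{-1/(2k)}n^{-1/(4k+2)}$. Then by T2 we can show that $\lim_{n\to\infty}E^*\|\mathbb{G}_n\|_{\mathcal{Q}_n} < \infty$.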

For the proof of (15), we only need to show (15) holds for n = n + o(n−1/3) based on the arguments in lemma 2.2. We then show that

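$\mathbb{G}_n(\ddot\ell(\theta_0, \tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0) = o_P(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|) = o_P(1).$

By the rate assumptions (27), we have

$P\left(\frac{\ddot\ell(\theta_0, \tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}) - \ddot\ell_0}{1 + n^{1/3}\|\tilde\theta_n - \theta_0\|}\right)^2 \lesssim \frac{\|\tilde\theta_n - \theta_0\|^2 + \|\hat f_{\tilde\theta_n,\lambda_n} - f_0\|_2^2}{(1 + n^{1/3}\|\tilde\theta_n - \theta_0\|)^2} = O_P(n^{-1/2}).$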

We next define as follows:

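$\left\{\frac{\ddot\ell(\theta_0, \theta, f) - \ddot\ell_0}{1 + n^{1/3}\|\theta - \theta_0\|}: J(f) \le C_n\left(1 + \frac{\|\theta - \theta_0\|}{\lambda_n}\right), \|f\|_\infty \le M, \|\theta - \theta_0\| < \delta\right\} \cap \left\{g \in L_2(P): Pg^2 \le C_nn^{-\frac{1}{2}}\right\}.$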

Obviously the function ((θ0, n, n,λn) − )/(1 + n1/3||nθ0||) on a set of probability arbitrarily close to one, as Cn →∞. If we can show limn→∞ E*|||| → 0 by T2, then the proof of (15) is completed. Accordingly, note that (θ0, θ, f) depends on (θ, f) in a Lipschitz manner. Consequently we can bound HB(ε, , L2(P)) by the product of some constant and (H(ε, n, L2(P)) + log(1)) in view of T3. n is defined as

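$\{H_n(f): J(H_n(f)) \lesssim 1 + (n^{1/3}\lambda_n)^{-1}, \|H_n(f)\|_\infty \lesssim 1 + (n^{1/3}\lambda_n)^{-1}\},$

where $H_n(f) = f/(1 + n^{1/3}\|\theta - \theta_0\|)$. By [22], we know that

$H(\varepsilon, \bar{\mathcal{R}}_n, L_2(P)) \lesssim \left((1 + n^{-1/3}\lambda_n^{-1})/\varepsilon\right)^{1/k}.$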

Then by analysis similar to that used in the proof of (17), we can show that limn→∞ E*|||| → 0 in view of T2. This completes the proof of (15).

For the proof of (16), it suffices to show that (t,θ(θ0, n, n,λn)− t,θ(θ0, θ0, f0)) = oP(1) for n = n + o(n−1/3) and for n between n and θ0, in view of lemma 2.2. Then we can show that (t,θ(θ0, n, n,λn) − t,θ(θ0, θ0, f0)) = oP(1 + n1/3||nθ0||) = oP(1) by similar analysis as used in the proof of (15).

In the last part, we show (18). It suffices to verify that the sequence of classes of functions is P-Glivenko-Cantelli, where {(3)(n, n, n,λn)(x)}, for every random sequence nθ0 and nθ0 in probability. A Glivenko-Cantelli theorem for classes of functions that change with n is needed. By revising theorem 2.4.3 in [22] with minor notational changes, we obtain the following suitable extension of the uniform entropy Glivenko-Cantelli theorem: Let n be suitably measurable classes of functions with uniformly integrable functions and $H(ε,ℱn,L1(Pn))=oP∗(n)$ for any ε > 0. Then ||nP||n → 0 in probability for every ε > 0. We then apply this revised theorem to the set n of functions (3)(t, θ, f) with t and θ ranging over a neighborhood of θ0 and λnJ(f) bounded by a constant. By the form of (3)(t, θ, f), the entropy number for is equal to that of

$\tilde{\mathcal{F}}_n \equiv \{\varphi(q_{t,f_t(\theta,f)}(x))R(q_{t,f_t(\theta,f)}(x)): (t, \theta) \in V_{\theta_0}, \lambda_nJ(f) \le C, \|f\|_\infty \le M\}.$

By arguments similar to those used in lemma 7.2 of [15], we know that $supQH(ε,ℱ∼n,L1(Q))≲(1+λn−1/ε)1/k=oP(n)$. Moreover, the n are uniformly bounded since $f∈HkM$. Considering the fact that the probability that is contained in n tends to 1, we have completed the proof of (18).

##### Proof of lemma 3

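By the assumption that $\Delta_{\lambda_n}(\tilde\theta_n) = o_P(1)$, we have $\Delta_{\lambda_n}(\tilde\theta_n) - \Delta_{\lambda_n}(\theta_0) \ge o_P(1)$. Thus the following inequality holds:

$n^{-1}\sum_{i=1}^n\log\left[\frac{\mathrm{lik}(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}, X_i)}{\mathrm{lik}(\theta_0, \hat f_{\theta_0,\lambda_n}, X_i)}\right] - \lambda_n^2[J^2(\hat f_{\tilde\theta_n,\lambda_n}) - J^2(\hat f_{\theta_0,\lambda_n})] \ge o_P(1).$

By considering assumption (19), the above inequality simplifies to

$n^{-1}\sum_{i=1}^n\log\left[\frac{H(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}; X_i)}{H(\theta_0, \hat f_{\theta_0,\lambda_n}; X_i)}\right] \ge o_P(1),$

where $H(\theta, f; X) = \Delta\Phi(C - \theta U - f(V)) + (1 - \Delta)(1 - \Phi(C - \theta U - f(V)))$. By arguments similar to those used in lemma 2 and by T4, we know $H(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}; X_i)$ belongs to some P-Donsker class. Combining the above conclusion and the inequality $\alpha\log x \le \log(1 + \alpha\{x - 1\})$ for some $\alpha \in (0, 1)$ and any x > 0, we can show that

$P\log\left[1 + \alpha\left(\frac{H(\tilde\theta_n, \hat f_{\tilde\theta_n,\lambda_n}; X_i)}{H(\theta_0, \hat f_{\theta_0,\lambda_n}; X_i)} - 1\right)\right] \ge o_P(1).$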
(50)

The remainder of the proof follows the proof of lemma 6 in [3].
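The elementary inequality $\alpha\log x \le \log(1 + \alpha\{x - 1\})$ invoked in this proof is simply the concavity of the logarithm, and in fact it holds for every $\alpha \in (0, 1)$ and every $x > 0$:

```latex
\begin{aligned}
\log\big(1 + \alpha\{x - 1\}\big)
  &= \log\big(\alpha x + (1 - \alpha)\cdot 1\big) \\
  &\ge \alpha \log x + (1 - \alpha)\log 1
     && \text{(concavity of } \log\text{)} \\
  &= \alpha \log x .
\end{aligned}
```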

##### Proof of lemma 4

The boundedness condition (45) in lemma 1.1 cannot be satisfied in the semiparametric logistic regression model. Hence we propose lemma 4.1 below to relax this condition by choosing the criterion function $m_{\theta,\eta} = \log[(p_{\theta,\eta} + p_{\theta,\eta_0})/(2p_{\theta,\eta_0})]$. Obviously, $m_{\theta,\eta}$ is trivially bounded below, since $(p_{\theta,\eta} + p_{\theta,\eta_0})/(2p_{\theta,\eta_0}) \ge 1/2$. It is also bounded above for (θ, η) around their true values if $p_{\theta,\eta_0}(x)$ is bounded away from zero uniformly in x and $p_{\theta,\eta}$ is bounded above. The first condition is satisfied if the map $\theta \mapsto p_{\theta,\eta_0}(x)$ is continuous around $\theta_0$ and $p_0(x)$ is uniformly bounded away from zero. The second condition is trivially satisfied in the semiparametric logistic regression model by the given form of the density. The boundedness of $m_{\theta,\eta}$ thus permits the application of lemma 4.2 below, which is used to verify condition (52) in the following lemma 4.1. Note that lemma 4.1 and lemma 4.2 are theorem 3.2 and lemma 3.3 in [15], respectively.

##### Lemma 4.1

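Assume for any given $\theta \in \Theta_n$, $\hat\eta_\theta$ satisfies $\mathbb{P}_nm_{\theta,\hat\eta_\theta} \ge \mathbb{P}_nm_{\theta,\eta_0}$ for given measurable functions $x \mapsto m_{\theta,\eta}(x)$. Assume conditions (51) and (52) below hold for every $\theta \in \Theta_n$, every $\eta \in \mathcal{V}_n$ and every ε > 0:

$P(m_{\theta,\eta} - m_{\theta,\eta_0}) \lesssim -d_\theta^2(\eta, \eta_0) + \|\theta - \theta_0\|^2,$
(51)

$E^*\sup_{\theta \in \Theta_n, \eta \in \mathcal{V}_n, \|\theta - \theta_0\| < \varepsilon, d_\theta(\eta, \eta_0) < \varepsilon}|\mathbb{G}_n(m_{\theta,\eta} - m_{\theta,\eta_0})| \lesssim \varphi_n(\varepsilon).$
(52)

Suppose that (52) is valid for functions $\varphi_n$ such that $\delta \mapsto \varphi_n(\delta)/\delta^\alpha$ is decreasing for some α < 2 and sets $\Theta_n \times \mathcal{V}_n$ such that $P(\tilde\theta \in \Theta_n, \hat\eta_{\tilde\theta} \in \mathcal{V}_n) \to 1$. Then $d_{\tilde\theta}(\hat\eta_{\tilde\theta}, \eta_0) \le O_P^*(\delta_n + \|\tilde\theta - \theta_0\|)$ for any sequence of positive numbers $\delta_n$ such that $\varphi_n(\delta_n) \le \sqrt{n}\delta_n^2$ for every n.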

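Lemma 4.2 below is presented to verify the modulus condition for the continuity of the empirical process in (52). Let $\mathcal{S}_\delta = \{x \mapsto m_{\theta,\eta}(x) - m_{\theta,\eta_0}(x): d_\theta(\eta, \eta_0) < \delta, \|\theta - \theta_0\| < \delta\}$ and write

$K(\delta, \mathcal{S}_\delta, L_2(P)) = \int_0^\delta\sqrt{1 + H_B(\varepsilon, \mathcal{S}_\delta, L_2(P))}\,d\varepsilon.$
(53)

##### Lemma 4.2

Suppose the functions $(x, \theta, \eta) \mapsto m_{\theta,\eta}(x)$ are uniformly bounded for (θ, η) ranging over a neighborhood of $(\theta_0, \eta_0)$ and that

$P(m_{\theta,\eta} - m_{\theta_0,\eta_0})^2 \lesssim d_\theta^2(\eta, \eta_0) + \|\theta - \theta_0\|^2.$

Then condition (52) is satisfied for any functions $\varphi_n$ such that

$\varphi_n(\delta) \ge K(\delta, \mathcal{S}_\delta, L_2(P))\left(1 + \frac{K(\delta, \mathcal{S}_\delta, L_2(P))}{\delta^2\sqrt{n}}\right).$

Consequently, in the conclusion of the above theorem, we may use $K(\delta, \mathcal{S}_\delta, L_2(P))$ rather than $\varphi_n(\delta)$.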

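We then apply lemma 4.1 to the penalized semiparametric logistic regression model by including λ in θ, i.e. $m_{\theta,\lambda,\eta} = m_{\theta,\eta} - \frac{1}{2}\lambda^2(J^2(\eta) - J^2(\eta_0))$, in the proof of lemma 4. First, lemma 7.1 in [15] establishes that

$\|p_{\tilde\theta_n,\hat\eta_{\tilde\theta_n,\lambda_n}} - p_{\theta_0,\eta_0}\|_2 + \lambda_nJ(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(\lambda_n + \|\tilde\theta_n - \theta_0\|)$
(54)

after choosing

$m_{\theta,\lambda,\eta} = \log\frac{p_{\theta,\eta} + p_{\theta,\eta_0}}{2p_{\theta,\eta_0}} - \frac{1}{2}\lambda^2(J^2(\eta) - J^2(\eta_0))$

in lemma 4.1. Note that the map $\theta \mapsto p_{\theta,\eta_0}/f_{W,Z}(w, z)$ is uniformly bounded away from zero at $\theta = \theta_0$ and continuous around a neighborhood of $\theta_0$. Hence $m_{\theta,\lambda,\eta}$ is well defined. Moreover, $\mathbb{P}_nm_{\theta,\lambda,\hat\eta_{\theta,\lambda}} \ge \mathbb{P}_nm_{\theta,\lambda,\eta_0}$ by the inequality that $((p_{\theta,\eta} + p_{\theta,\eta_0})/2p_{\theta,\eta_0})^2 \ge (p_{\theta,\eta}/p_{\theta,\eta_0})$. (54) now directly implies (31). For the proof of (30), we need to consider the conclusion of lemma 7.4 (i), which states that

$\|p_{\theta,\eta} - p_{\theta_0,\eta_0}\|_2 \gtrsim (\|\theta - \theta_0\| \wedge 1 + \||\eta - \eta_0| \wedge 1\|_2) \wedge 1.$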
(55)

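Thus we have proved (30). For (32), we just replace $m_{\theta,\lambda,\eta}$ with $m_{\theta,0,\eta}$ in the proof of lemma 7.1 in [15]. Thus we can show that $d_\theta(\eta, \eta_0) = \|p_{\theta,\eta} - p_{\theta_0,\eta_0}\|_2$. By combining lemma 4.2 and (55), we know that $\|\hat\eta_{\tilde\theta_n} - \eta_0\|_2 = O_P(\delta_n + \|\tilde\theta_n - \theta_0\|)$, for $\delta_n$ satisfying $K(\delta_n, \mathcal{S}_{\delta_n}, L_2(P)) \le \sqrt{n}\delta_n^2$. Note that $K(\delta, \mathcal{S}_\delta, L_2(P))$ is as defined in (53). By an analysis similar to that used in the proof of lemma 7.1 in [15] and the strengthened assumption on η, we then find that $K(\delta_n, \mathcal{S}_{\delta_n}, L_2(P)) \lesssim \delta_n^{1 - 1/(2k)}$, which leads to the desired convergence rate given in (32).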
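The density-ratio inequality used in this proof to compare the modified criterion with the log-likelihood ratio can be verified in one line. Writing $a = p_{\theta,\eta}/p_{\theta,\eta_0} \ge 0$ (a shorthand introduced here only for the derivation):

```latex
\begin{aligned}
\Big(\frac{p_{\theta,\eta} + p_{\theta,\eta_0}}{2\,p_{\theta,\eta_0}}\Big)^2 - \frac{p_{\theta,\eta}}{p_{\theta,\eta_0}}
  &= \frac{(a + 1)^2}{4} - a
   = \frac{(a - 1)^2}{4} \;\ge\; 0, \\
\text{so that, where } p_{\theta,\eta} > 0, \qquad
\log\frac{p_{\theta,\eta} + p_{\theta,\eta_0}}{2\,p_{\theta,\eta_0}}
  &\ge \frac{1}{2}\log\frac{p_{\theta,\eta}}{p_{\theta,\eta_0}} .
\end{aligned}
```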

##### Proof of lemma 5

The proof of lemma 5 follows that of lemma 2. The smoothness conditions of $\ell(t, \theta, \eta)$ and its related derivatives can be shown similarly since $F(\cdot)$, $\dot F(\cdot)$ and $\ddot F(\cdot)$ are all uniformly bounded in (−∞, +∞), and $h_0(\cdot)$ is intrinsically bounded over [0, 1]. Note that we can show (12) directly by the following analysis. $P\dot\ell(\theta_0, \theta_0, \eta)$ can be written as $P(F(\theta_0w + \eta_0) - F(\theta_0w + \eta(z)))(w - h_0(z))$ since $P\dot\ell_0 = 0$. Note that $P(w - h_0(z))\dot F(\theta_0w + \eta_0(z))(\eta - \eta_0)(z) = 0$. This implies that $P\dot\ell(\theta_0, \theta_0, \eta) = P(F(\theta_0w + \eta_0) - F(\theta_0w + \eta(z)) + \dot F(\theta_0w + \eta_0(z))(\eta - \eta_0)(z))(w - h_0(z))$. However, by the common Taylor expansion, we have $|F(\theta_0w + \eta) - F(\theta_0w + \eta_0) - \dot F(\theta_0w + \eta_0)(\eta - \eta_0)| \le \|\ddot F\|_\infty|\eta - \eta_0|^2$. This proves (12).

We next verify the asymptotic equicontinuity conditions, i.e. (15)–(17). For (17), we first apply analysis similar to that used in the proof of lemma 2 to obtain

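$P\left(\frac{\dot\ell(\theta_0, \theta_0, \hat\eta_{\tilde\theta_n,\lambda_n}) - \dot\ell_0}{n^{\frac{1}{4k+2}}(\lambda_n + \|\tilde\theta_n - \theta_0\|)}\right)^2 \lesssim O_P(n^{-\frac{1}{2k+1}}).$

By lemma 7.1 in [15], we know that $J(\hat\eta_{\tilde\theta_n,\lambda_n}) = O_P(1 + \|\tilde\theta_n - \theta_0\|/\lambda_n)$ and $\|\hat\eta_{\tilde\theta_n,\lambda_n}\|_\infty$ is bounded in probability by a multiple of $J(\hat\eta_{\tilde\theta_n,\lambda_n}) + 1$. Now we construct the set $\tilde{\mathcal{Q}}_n$ as follows:

$\left\{\frac{\dot\ell(\theta_0, \theta_0, \eta) - \dot\ell_0}{n^{\frac{1}{4k+2}}(\lambda_n + \|\theta - \theta_0\|)}: J(\eta) \le C_n\left(1 + \frac{\|\theta - \theta_0\|}{\lambda_n}\right), \|\eta\|_\infty \le C_n(1 + J(\eta)), \|\theta - \theta_0\| < \delta\right\} \cap \left\{g \in L_2(P): Pg^2 \le C_nn^{-\frac{1}{2k+1}}\right\}.$

Clearly, the probability that the function $n^{-1/(4k+2)}(\dot\ell(\theta_0, \theta_0, \hat\eta_{\tilde\theta_n,\lambda_n}) - \dot\ell_0)/(\lambda_n + \|\tilde\theta_n - \theta_0\|)$ belongs to $\tilde{\mathcal{Q}}_n$ approaches 1 as $C_n \to \infty$. We next show that $\lim_{n\to\infty}E^*\|\mathbb{G}_n\|_{\tilde{\mathcal{Q}}_n} < \infty$ by T2. Note that $\dot\ell(\theta_0, \theta_0, \eta)$ depends on η in a Lipschitz manner. Consequently, we can bound $H_B(\varepsilon, \tilde{\mathcal{Q}}_n, L_2(P))$ by the product of some constant and $H(\varepsilon, \mathcal{R}_n, L_2(P))$ in view of T3, where $\mathcal{R}_n$ is as defined in the proof of lemma 2. By calculations similar to those performed in lemma 2, we can obtain $K(\delta_n, \tilde{\mathcal{Q}}_n, L_2(P)) \lesssim \lambda_n^{-1/(2k)}n^{-1/(4k+2)}$. Thus $\lim_{n\to\infty}E^*\|\mathbb{G}_n\|_{\tilde{\mathcal{Q}}_n} < \infty$, and (17) follows.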

The proof of (15) and (16) follows arguments quite similar to those used in the proof of lemma 2. In other words, we can show that ((θ0, n, n, λn) − 0) = oP (1 + n1/3||nθ0||) = oP (1) and (t,θ(θ0, n, n, λn) − t,θ(θ0, θ0, η0)) = oP (1 + n1/3||nθ0||).

Next we define {(3)(n, n, n, λn)(x)}. Similar arguments as those used in the proof of lemma 2 can be directly applied to the verification of (18) in this second model. By the form of (3)(t, θ, η), the entropy number for is bounded above by that of n {(tw + η(z) + (θt)h0(z)): (t, θ) Vθ0, λnJ(η) ≤ Cn, ||η|| Cn(1 + J(η))}. Similarly, we know $supQH(ε,V¯n,L1(Q))≤supQH(ε,ℱ¯n,L1(Q))≲((1+λn−1)/ε)1/k=oP(n)$. Moreover, the n are uniformly bounded. This completes the proof for (18). This concludes the proof.

##### Proof of lemma 6

The proof of lemma 6 is analogous to that of lemma 3.

##### Lemma 7

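Under the assumptions of theorem 1, we have

$\log pl_{\lambda_n}(\tilde\theta_n) = \log pl_{\lambda_n}(\theta_0) + n(\tilde\theta_n - \theta_0)^T\mathbb{P}_n\tilde\ell_0 - \frac{n}{2}(\tilde\theta_n - \theta_0)^T\tilde I_0(\tilde\theta_n - \theta_0) + O_P(g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|)),$
(56)

for any sequence $\tilde\theta_n$ satisfying $\tilde\theta_n = \theta_0 + o_P(1)$.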

##### Proof

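$n^{-1}(\log pl_{\lambda_n}(\tilde\theta_n) - \log pl_{\lambda_n}(\theta_0))$ is bounded above and below by

$\mathbb{P}_n(\ell(\tilde\theta_n, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})) - \lambda_n^2(J^2(\hat\eta_{\tilde\theta_n,\lambda_n}) - J^2(\eta_{\theta_0}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n})))$

and

$\mathbb{P}_n(\ell(\tilde\theta_n, \theta_0, \hat\eta_{\theta_0,\lambda_n}) - \ell(\theta_0, \theta_0, \hat\eta_{\theta_0,\lambda_n})) - \lambda_n^2(J^2(\eta_{\tilde\theta_n}(\theta_0, \hat\eta_{\theta_0,\lambda_n})) - J^2(\hat\eta_{\theta_0,\lambda_n})),$

respectively. By the third order Taylor expansion of $\tilde\theta_n \mapsto \mathbb{P}_n\ell(\tilde\theta_n, \theta, \eta)$ around $\theta_0$, for $\theta = \tilde\theta_n$ and $\eta = \hat\eta_{\tilde\theta_n,\lambda_n}$, (18) and the above empirical no-bias conditions (13) and (14), we can find that the order of the difference between $\mathbb{P}_n(\ell(\tilde\theta_n, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}) - \ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}))$ and $(\tilde\theta_n - \theta_0)^T\mathbb{P}_n\tilde\ell_0 - (\tilde\theta_n - \theta_0)^T(\tilde I_0/2)(\tilde\theta_n - \theta_0)$ is $O_P(n^{-1}g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|))$. Similarly, we have

$\lambda_n^2(J^2(\hat\eta_{\tilde\theta_n,\lambda_n}) - J^2(\eta_{\theta_0}(\tilde\theta_n, \hat\eta_{\tilde\theta_n,\lambda_n}))) = -2\lambda_n^2(\tilde\theta_n - \theta_0)^T\int_{\mathcal{Z}}\hat\eta_{\tilde\theta_n,\lambda_n}^{(k)}h_0^{(k)}\,dz + 2\lambda_n^2(\tilde\theta_n - \theta_0)^T\int_{\mathcal{Z}}h_0^{(k)}h_0^{(k)T}\,dz\,(\tilde\theta_n - \theta_0) = O_P(n^{-1}g_{\lambda_n}(\|\tilde\theta_n - \hat\theta_{\lambda_n}\|))$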

by Taylor expansion. The last equation holds because of the assumptions (3) and (19). Similar analysis also applies to the lower bound. This proves (56).

## References

1. Bentkus V, Götze F, van Zwet WR. An Edgeworth Expansion for Symmetric Statistics. Annals of Statistics. 1997;25:851–896.
2. Cheng G, Kosorok MR. Higher order semiparametric frequentist inference with the profile sampler. Annals of Statistics. 2007 In Press.
3. Cheng G, Kosorok MR. General Frequentist Properties of the Posterior Profile Distribution. Annals of Statistics. 2007 In Press.
4. Dalalyan A, Golubev G, Tsybakov A. A Penalized Maximum Likelihood and Semiparametric Second-Order Efficiency. Annals of Statistics. 2006;34:169–201.
5. Good IJ, Gaskins RA. Non-parametric roughness penalties for probability densities. Biometrika. 1971;58:255–277.
6. Huang J. Efficient estimation of the partly linear Cox model. Annals of Statistics. 1999;27:1536–1563.
7. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer; New York: 2008.
8. Kuo HH. Lecture Notes in Mathematics. Berlin: Springer; 1975. Gaussian Measure on Banach Spaces.
9. Lee BL, Kosorok MR, Fine JP. The profile sampler. Journal of the American Statistical Association. 2005;100:960–969.
10. Ma S, Kosorok MR. Penalized Log-likelihood Estimation for Partly Linear Transformation Models with Current Status Data. Annals of Statistics. 2005;33:2256–2290.
11. Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis. 2005;96:190–217.
12. Ma S, Kosorok MR. Adaptive penalized M-estimation with current status data. Annals of the Institute of Statistical Mathematics. 2006;58:511–526.
13. Mammen E, van de Geer S. Penalized quasi-likelihood estimation in partial linear models. Annals of Statistics. 1997;25:1014–1035.
14. Murphy SA. Asymptotic Theory for the Frailty Model. Annals of Statistics. 1995;23:182–198.
15. Murphy SA, Van der Vaart AW. Observed information in semiparametric models. Bernoulli. 1999;5:381–412.
16. Murphy SA, Van der Vaart AW. Semiparametric mixtures in case-control studies. Journal of Multivariate Analysis. 2001;79:1–32.
17. Shen X. Asymptotic normality in semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association. 2002;97:222–235.
18. Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. Annals of Statistics. 1982;10:795–810.
19. Silverman BW. Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with discussion). Journal of the Royal Statistical Society Series B. 1985;47:1–52.
20. van de Geer S. Empirical Processes in M-estimation. Cambridge University Press; Cambridge: 2000.
21. van der Vaart AW. Maximum Likelihood Estimation with Partially Censored Observations. Annals of Statistics. 1994;22:1896–1916.
22. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.
23. Wahba G. Spline Models for Observational Data. SIAM; Philadelphia: 1990.
24. Xiang D, Wahba G. Approximate smoothing spline methods for large data sets in the binary case. ASA Proc of the Biometrics Section. :94–99.
