Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Ann Stat. Author manuscript; available in PMC 2010 December 6.
Published in final edited form as:
Ann Stat. 2009 December; 37(6B): 4153–4183.
doi:  10.1214/09-AOS713
PMCID: PMC2997475

Local quasi-likelihood with a parametric guide*


Generalized linear models and quasi-likelihood method extend the ordinary regression models to accommodate more general conditional distributions of the response. Nonparametric methods need no explicit parametric specification and the resulting model is completely determined by the data themselves. However nonparametric estimation schemes generally have a slower convergence rate such as the local polynomial smoothing estimation of nonparametric generalized linear models studied in Fan, Heckman and Wand (1995). In this work, we propose two parametrically guided nonparametric estimation schemes by incorporating prior shape information on the link transformation of the response variable’s conditional mean in terms of the predictor variable. Asymptotic results and numerical simulations demonstrate the improvement of our new estimation schemes over the original nonparametric counterpart.

AMS 2000 subject classifications: Primary 62G08, secondary 62G20
Keywords: generalized linear model, local polynomial smoothing, parametric guide

1. Introduction

As an extension of the ordinary linear model, generalized linear model (GLM) broadens techniques of ordinary linear regression to accommodate more general conditional distributions of the response. It was first introduced by Nelder and Wedderburn (1972). Its estimation is based on the iteratively reweighted least squares (IRLS) algorithm, which only requires a relationship between the response’s conditional mean and variance instead of its full conditional distribution. This feature was noticed by Wedderburn (1974). In this important further extension, Wedderburn replaced the log-likelihood by a quasi-loglikelihood function. This is usually referred to as the quasi-likelihood method (QLM).

In generalized linear models (GLMs) (McCullagh and Nelder, 1989), a typical parametric assumption is that a transformation of the response’s conditional mean, referred to as the link function, belongs to some parametric family (say, linear or quadratic in the predictor variables). However misspecification of the parametric family can lead to a completely wrong picture of the underlying conditional mean function. This deficiency of parametric modeling has long been realized in ordinary regression and applies to GLMs as well. It calls for an extension of nonparametric regression techniques to the GLMs. Green and Yandell (1985), O’Sullivan et al. (1986), and Cox and O’Sullivan (1990) studied the extension of smoothing splines. Tibshirani and Hastie (1987) based their generalization on the “running lines” smoother. Fan, Heckman and Wand (1995) extended the local polynomial fitting technique and includes Staniswalis (1989) as a special case.

Local polynomial smoothing is a useful technique to explore unknown structure in regression and dates back to Stone (1975, 1977). This area was blossomed when Fan (1993) provided a deep theoretic understanding and discovered its elegant properties, especially the automatic boundary carpentry. Here we focus on local polynomial techniques although the idea can be extended to other nonparametric methods.

Nonparametric methods need no explicit specification of the form of the conditional mean for ordinary regression, and more generally the link transformation of the conditional mean in the context of GLMs. However they have in general a slower rate of convergence. In practice, prior knowledge or exploratory studies may provide us some prior information about the shape of the link function. This information is ready to guide us in the nonparametric modeling process. In the literature, parametrically guided nonparametric estimation methods were proposed to improve over its nonparametric counterpart in the context of density estimation (Hjort and Glad, 1995; Naito, 2004) and least squares regression (Glad, 1998; Martins-Filho et al., 2007). The idea is very easy to explain in the least squares regression case. Assume that the response Y given a covariate X has a conditional mean m(x) = E(Y|X = x). Once a parametric estimator m(x, [beta]) of m(x) is obtained, any nonparametric method can be applied on residuals {Yi/m(Xi,[beta]), i = 1,2, (...), n} and {Yi − m(Xi,[beta]), i = 1,2, (...), n} to estimate m(x)/m(x, [beta]) and m(x) − m(x,[beta]), respectively. The corresponding two final estimators are given by the product of m(x,[beta]) and the nonparametric estimator of m(x)/m(x,[beta]), which serves as a nonparametric correction of the parametric estimator m(x,[beta]), and the sum of m(x,[beta] ) and the nonparametric estimator of m(x) − m(x,[beta]), respectively Theoretically these two parametrically guided estimators are shown to achieve bias reduction comparing to the original nonparametric estimator when m(·) can be approximated by the family {m(·,β)}.

Due to its nice property of bias reduction, it is desirable to extend this parametrically guided estimation scheme to GLMs and QLM. However for response with a general distribution other than normal, the residuals Y/m(X,[beta]) and Y − m(Y,[beta]) do not have a nice statistical property to facilitate estimating m(x)/m(x, [beta]) and m(x) − m(x,[beta]) to make the straightforward extension possible. In this work, we take on this problem and extend the parametrically guided estimation scheme to GLMs and QLM. Asymptotic theory and numerical simulations are used to justify our proposed methods. In the literature, similar approaches have been used to reduce variance. Cheng et al. (2007) proposed to form a linear combination of a preliminary estimator to reduce variance in smoothing and Cheng and Hall (2003) studied variance reduction in nonparametric surface estimation.

The rest of the paper is organized as follows. Section 2 presents the fundamental framework of GLMs and QLM. Two new semiparametric estimation schemes based on additive and multiplicative correction are introduced in Section 3. Asymptotic properties are developed to show their improvement over the original nonparametric counterpart in Section 4. Section 5 gives a general pre-asymptotic bandwidth selector based on bias-variance tradeoff. Simulations in Section 6 and real data analysis in Section 7 show our new schemes’ finite sample performance in comparison to the original nonparametric method. We conclude with a short discussion in Section 8. All technical proofs are given in Appendix.

2. GLMs and quasi-likelihood models

Let (X1, Y1),(...),(Xn,Yn) be a set of i.i.d. random pairs where, for each i, Yi is a scalar response variable and Xi denotes its corresponding d-dimension explanatory covariates having density g with support supp(g) [subset, dbl equals] Rd. In GLMs, we assume that the response’s conditional distribution belongs to a one-parameter exponential family, which refers to the response’s conditional density in the format of


where b(·) and c(·, ·) are some known functions, ϕ is the dispersion parameter, and a(ϕ) typically can be written as ϕ/w with weight parameter w. The parameter θ is usually called the canonical or natural parameter. For the one-parameter exponential family (2.1), the conditional mean and variance of the response variable are given by μ(x) = b′(θ(x)) and var(Y|X = x) = a(ϕ)b″(θ(x)), respectively.

Parametric GLMs typically assume that a transformation of the conditional mean function μ(x) is linear in regression coefficients, i.e.,

η(x)=β0+xTβ and η(x)=g(μ(x)),

where g(·) is monotonic and referred to as the link function.

The canonical link refers to g = (b′)−1 in that b′(θ(x)) = μ(x). When the canonical link is used, the composition g [composite function (small circle)] b′(·) reduces to the identity function and θ(x) is the same as η(x). In this case, the conditional density can be simplified as


In common practice, there are many circumstances in which the full likelihood is unknown. However the relationship between the conditional mean and variance may be readily available. In this case, estimation of the conditional mean function can be achieved by replacing the conditional log-likelihood log fY|X(y|x) by a quasi-loglikelihood function Q(μ(x),y). When the conditional variance is modeled as var(Y|X = x) = V(μ(x)) for some known positive function V(·), the corresponding Q(μ, y) satisfies


where V(·) is called the variance function. More explicitly, Q(μ,y)=yμ(yw)/V(w)dw See Wedderburn (1974) and Chapter 9 of McCullagh and Nelder (1989) for more details. The quasi-score (2.2) possesses properties similar to those of the usual log-likelihood score function, i.e., it satisfies the first two moment conditions of Bartlett’s identities. Note that the loglikelihood of the one-parameter exponential family (2.1) is a special case of quasi-likelihood function with V(·) = a(ϕ)b* (b′)−1(·).

Due to the generality of the QLM and the fact that it includes the GLMs as a special case, we will focus on QLM. Fan, Heckman and Wand (1995) introduced nonparametric GLMs and QLM by extending the local polynomial techniques. We will follow their framework and notations. To ease our presentation, we focus on the one-dimension case as the extension to the multivariate case is straightforward. For the one-dimension case, our data consist of n pairs of observations (Xi,Yi)i=1n

To enhance flexibility, η(x) can be modeled nonparametrically. Its estimation can be achieved using the local polynomial technique. For any x0 in its domain, the local polynomial estimator of η(x0) is given by [eta w/ hat](x0) [equivalent] [eta w/ hat](x0; p, h) = [beta]0, where [beta] = ([beta]0, [beta]1, (...), [beta]p)T maximizes the locally weighted quasi-likelihood function


where, with slight abuse of notation, we define Xi = (1, Xi−x0, (...), (Xi−x0) p) T and β = (β0, β 1,(...), βp)T. Whenever there is no confusion, the extra arguments are dropped and Q(β) is used, similarly for some other notations. Here p is the order of local polynomial fitting and Kh(·) = K(· /h)/h is a re-scaling of the kernel function K(·) with a smoothing bandwidth h.

3. Nonparametric quasi-maximum likelihood with a parametric guide

As argued in the introduction, prior knowledge, physical model, or exploratory analysis may give us some useful information about the shape of η(x), which falls approximately into a parametric family {η(x, α) : α = (α1, α2,(...), αq)T [set membership] A; [subset or is implied by] Rq}. In this section, we present our two new estimation schemes by incorporating the available useful shape information of η(x) to guide us while estimating η(x). Within the parametric family η(x, α), we find the optimal fit by maximizing


with respect to α [set membership] A. Denote the best fit by η(x,[alpha]), where [alpha] is the maximizer of (3.1).

3.1. Bias reduction

In the local polynomial fitting framework, the bias is due to the approximation error of the Taylor expansion. The smaller approximation error the less bias in the local polynomial estimator. Recall that we identify some parametric family {η(x, α) : α [set membership] A} based on exploratory studies or prior knowledge and find the best fit η(x, [alpha]) within this family. As a result, η(x, [alpha]) should capture the major shape of η(x) and consequently η(x)/η(x, [alpha]) and η(x) − η(x, [alpha]) have less variation than the original η (x) does. Consequently they are easier to approximate and the approximation errors in their corresponding Taylor expansions are smaller than those of the original function η(x). For example, the true unknown η(·) is given by η0(x)=3sin(π4xπ2)+6 for x [set membership] [−2, 2] (as shown by the solid line in panel (A) of Figure 1) in our Poisson simulation Example 6.1. Nonparametric estimate [eta w/ hat](·) is given by the dotted line in Figure 2 and indicates a parabolic shape. Hence we identify a parametric family {η(x, α) = α1 + α2x + α3x2 : α = (α1, α2, α3)T [set membership] R3}, within which the best fit is given by the dotted line in panel (A) of Figure 1. The difference η (x) − η(x, [alpha]) and ratio η(x)/η (x, [alpha]) are shown in panels (B) and (C) of Figure 1, respectively. We can see that the difference and ratio functions are much flatter than the original function η(·) as we desired.

Plots of true η(·), estimated guide η(·, [alpha]), difference η(·) − η(·, [alpha]), and ratio η(·)/η(·, [alpha]) for ...
Plots of true η(·), nonparametric estimate [eta w/ hat](·), and two parametrically guided estimates [eta w/ hat]a(·) and [eta w/ hat]m(·) for one random sample in Example 6.1.

Based on the above argument, two different estimation schemes corresponding to multiplicative and additive corrections are introduced in Sections 3.2 and 3.3, respectively.

3.2. Multiplicative correction

Consider the multiplicative identity


where rm(x) = η(x)/η(x, α). When η(x, [alpha]) is a good estimate, the ratio rm(x) becomes almost flat and allows the choice of a larger bandwidth to reduce the bias. For any x0, we may estimate r(x0) by maximizing local quasi-likelihood


with respect to β and set [r with circumflex] (x0 = [beta]0, the first component of the maximizer. Then η(x0) can be estimated by η(x0, [alpha])[r with circumflex] (x0). This two-step formulation is equivalent to the following one, to which we shall prefer due to its facilitation to our theoretical development.

Locally approximating rm(·) by a polynomial function and re-scaling it by a fact η(x0, [alpha]), we have the local quasi-likelihood


We maximize (3.2) with respect to β and set the final estimator [eta w/ hat]m(x0) [equivalent] [eta w/ hat]m (x0; p, h, [alpha]) = [beta]0. In this formulation, the Taylor expansion XiTβ is supposed to approximate η(Xi)η(x0, [alpha])/η(Xi, [alpha]) locally at x = x0. This immediately justisifies setting [beta]0 as our estimator.

3.3. Additive correction

The other additive identity


with ra(x) = η(x) − η(x, α) leads to another parametrically guided nonparametric estimator [eta w/ hat]a(x0) [equivalent] [eta w/ hat]a(x0; p, h, [alpha]) = [beta]0, where [beta]0 is the first component of the maximizer of


with respect to β. Similarly the expansion XiTβ in this formulation targets at approximating η(Xi) − η(Xi, [alpha]) + η(x0, [alpha]) locally at x = x0. Hence [beta]0 estimates η(x0).

As argued in Section 3.1, η(x)η(x0, [alpha])/η(x, [alpha]) and η(x) − η(x, [alpha])+η(x0, [alpha]) are flatter and easier to approximate with a larger bandwidth. As a result our two new parametrically guided nonparametric estimation schemes lead to estimators with smaller bias. This is confirmed by our theoretical results in Section 4 and Monte Carlo studies carried out in Section 6.

4. Asymptotic properties

To proceed our theoretical justification for the bias reduction of our new methods, we assume that our data (Xi,Yi)i=1n are generated from the quasi-likelihood model with unknown true η0(x). Asymptotic properties of our final estimates are achieved in two steps: (1) establish asymptotic properties with a fixed parametric guide in Section 4.1 and (2) show that the same asymptotic properties apply to the case with an estimated parametric guide in Section 4.2.

The asymptotic properties for local polynomial estimator are different for x0 lying in the interior of supp(f) from for x0 lying near the boundary Suppose that K is supported on [−1, 1]. Then the support of Kh(x0 − ·) is εx0,h = {z : |z − x0| ≤ h}. We will call x0 an interior point of supp(f) if εx0,h [subset or is implied by] supp(f) and a boundary point otherwise. If supp(f) = [a, b], then x0 is a boundary point if and only if x0 = a + αh or x0 = b−αh for some 0 ≤ α < 1. Denote Dx0,h = {z : x0−hz [set membership] supp(f)}∩[−1, 1]. For any measurable set [mathematical script A] [set membership] R, define νl([mathematical script A]) = ∫[mathematical script A] zlK(z)dz. Let Np([mathematical script A]) be the (p +1) × (p +1) matrix having (i,j) entry equal to νi+j−2([mathematical script A]) and let Mr,p(z; [mathematical script A]) be the same as Np([mathematical script A]), but with the (r + 1)th column replaced by (1, z, (...), zp)T. Then for |Np([mathematical script A])| ≠ 0, define


When [−1, 1] [subset, dbl equals] [mathematical script A], we will suppress [mathematical script A] and simply write νl, Np, Mr,p, and Kr,p. It can be shown that (−1)r Kr,p is an order (r, s) kernel as defined by Gasser et al. (1985), where s = p + 1 if pr is odd and s = p + 2 if p − r is even. This family of kernels is useful for giving concise expressions for the asymptotic distribution of local polynomial estimator for x0 lying both in the interior of supp(f) and near its boundaries. Denote ρ(x0) = {g′(μ (x0))2V (μ(x0))}−1. Note that when the model belongs to a one-parameter exponential family and the canonical link is used, then g′ (μ (x0)) = 1/var(Y|X = x0) and ρ(x0) = var(Y|X = x0), if the variance function V(·) is correctly specified. The asymptotic variance of our local polynomial estimator depends on


4.1. Asymptotic properties with a fixed guide

Recall that our parametrically guided nonparametric estimators are achieved by maximizing Qm(β; h, x0, [alpha]) and Qa(β; h, x0, [alpha]) defined by (3.2) and (3.3) using the multiplicative and additive corrections, respectively. Note that the definitions of Qm(β; h, x0, [alpha]) and Qa(β; h, x0, [alpha]) involve [alpha] which corresponds to the best fit within the parametric family {η(x, α), α [set membership] A} and depends on our data {(Xi,Yi), i = 1,2, (...), n}. This dependency consequently makes it intractable to directly study the asymptotical properties of the maximizer of Qm(β; h, x0, [alpha]) and Qa (β; h, x0, [alpha]). To avoid the complication caused by the use of estimated [alpha], we first consider the case with a fixed guide η(x, α).

4.1.1. Asymptotic properties for multiplicative correction

For multiplicative correction with a fixed guide η(·, α), asymptotic properties of the corresponding estimator [eta w/ hat]m(x0; p, h, α) are given by Theorems 1 and 2.

Theorem 1. Let p > 0 be odd and suppose that Conditions (A1)–(A5) stated in Appendix are satisfied. Assume that h = hn → 0, nh2p+1 → ∞, and nh2p+3 < ∞ as n → ∞. If x0 is a fixed point in the interior of supp(f) satisfying η (x0, α) ≠ 0, then we have




and the subscripts m and o stand for multiplicative correction and odd p. If x 0 = xn is of the form x0 = xδ + ch satisfying η(x0, α) ≠ 0, where xδ is a point on the boundary of supp (f) and c [set membership] [−1, 1], then (4.2) holds with σ0,0,p2(x0;K) and ∫ zp+1K0,p(z)dz replaced by σ0,0,p2(x0;K,𝒟x0,h) andDx0,h zp+1 K0,p(z;Dx0,h)dz.

Theorem 2. Let p > 0 be even and suppose that Conditions (A1)–(A5) stated in Appendix are satisfied. Assume that h = hn 0, nh2p+1 → ∞, and nh2p+3 < ∞ as n → ∞. If x0 is a fixed point in the interior of supp(f) satisfying η (x0, α) ≠ 0, then we have




If x0 = xn is of the form x0 = xδ + ch satisfying η(x0, α) ≠ 0, where xδ is a point on the boundary of supp(f) and c [set membership] [−1, 1], then (4.3) holds with σ0,0,p2(x0;K) andzp+1 K0,p(z)dz replaced by σ0,0,p2(x0;K,Dx0,h) andDx0,h zp+1 K0,p(z;Dx0,h)dz.

Remark 1. Note that we use η(x0, α) in the denominator for the multiplicative correction. This poses difficulty handling any zero point of η(·, α), i.e., x0 satisfying η(x0, α) = 0. These zero points are ruled out in Theorems 1 and 2. Similar observation was made by Hjort and Glad (1995). However, this difficulty does not occur in our limited numerical experiments.

Remark 2. To simplify our presentation, we only state the asymptotic properties of the estimator for the original function η0(·). However our method can also estimate its high order derivatives, for which the asymptotic properties can be found in Propositions 1 and 2 in Appendix.

4.1.2. Asymptotic Properties for additive correction

Our next two theorems give the asymptotic properties of the estimator [eta w/ hat]a(x0; p, h, α) using a fixed guide η(·, α).

Theorem 3.. Let p > 0 be odd and suppose that Conditions (A1)–(A5) stated in Appendix are satisfied. Assume that h = hn 0, nh2p+1 → ∞, and nh2p+3 < ∞ as n → ∞. If x0 is a fixed point in the interior of supp(f), then




If x0 = xn is of the form x0 = xδ + ch, where xδ is a point on the boundary of supp(f) and c [set membership] [−1, 1], then (4.4) holds with σ0,0,p2(x0;K) andzp+1 K0,p(z)dz replaced by σ0,0,p2(x0;K,𝒟x0,h) andDx0,h zp+1 K0,p(z; Dx0,h)dz.

Theorem 4. Let p > 0 be even and suppose that Conditions (A1)–(A5) stated in Appendix are satisfied. Assume that h = hn → 0, nh2p+1 → ∞, and nh2p+3 < ∞ as n → ∞. If x0 is a fixed point in the interior of supp (f), then




If x0= xn is of the form x0 = xδ + ch, where xδ is a point on the boundary of supp(f) and c [set membership] [−1, 1], then (4.5) holds with σ0,0,p2(x0;K) andzp+1 K0,p(z)dz replaced by σ0,0,p2(x0;K,𝒟x0,h) andDx0,h zp+1 K0,p(z; Dx0,h)dz.

4.2. Asymptotic properties with an estimated guide

Note that our proposed estimation schemes use the best parametric fit η(x, [alpha]) estimated based on our data instead of a fixed guide η(x, α). Compared to the simpler case with a fixed guide, parameter estimation variability now influences the asymptotic result. However, we shall show below that asymptotically there is no precision loss caused by the additional estimation step.

Clearly the parametric family used in the first step of our estimation schemes is most likely an incorrect specification. Consequently, the first-stage parametric estimator is a maximum quasi-likelihood estimator with a misspecified model. As in Hurvich and Tsai (1995), denote the assumed parametric joint density of (Xi, Yi) and the corresponding true unknown joint density by h(x, y; α) = gX(x) exp(Q(g−1(η(x, α)), y)) and h0(x, y) = gX(x) exp(Q(g−10(x)), y)), respectively, where gX(·) is the marginal density of X. Denote α0 to be the pseudo parameter value that minimizes the Kullback-Liebler distance between h(x, y; α) and h0(x,y), i.e.,


where the expectation E is taken with respect to the unknown true density.

To proceed, we make regularity assumptions (B1)–(B5) given in Appendix to assure that the pseudo maximum quasi-likelihood estimator [alpha] is root-n consistent of α0, i.e., n(α^α0)=Op(1) (see White, 1982).

Theorem 5. Under additional Conditions (B1)–(B5), the asymptotic results, with α replaced by α0, of Theorems 1–4 continue to hold when an estimated fit η(x, [alpha]) is used.

5. Pre-asymptotic bandwidth selection

While optimizing (2.3), (3.2), and (3.3), we need to tune the corresponding smoothing bandwidths. In this work, we will use the pre-asymptotic bandwidth selection method introduced in Fan et al. (1998), which is based on the bias-variance tradeoff.

5.1. Estimating bias and variance

Without loss of generality, we use (3.3) to demonstrate the idea. It will include (2.3) as a special case by setting η(x, [alpha]) to be a constant function and can be analogously extended to handle (3.2). In the remaining of this section we denote β^=β^(x0,α^)=arg maxβQa(β;h,x0,α^). The bias of the estimate [beta] comes from the approximation error in the Taylor expansion. Denote by


the approximation error at Xi. Suppose that the (p + a + 1)th derivatives of functions η(·) and η(·, [alpha]) exist at x0 for some integer a > 0. Further expansions of η(Xi) and η(Xi, [alpha]) give


Here the choice of a, the approximation order, will affect the performance of the estimated bias. Practically it can be chosen as a = 1 or 2.

Now pretend that the approximated approximation errors ri are known. A more accurate local quasi-loglikelihood is


The maximizer of the local quasi-loglikelihood Qa*(β) is denoted by [beta]* = [beta]* (x0, [alpha]). Define


and similarly 2ββTQa*(β) to denote the gradient vector and Hessian matrix of the local quasi-likelihood Qa* respectively. Applying Taylor expansion to Qa*(β) around [beta](x0, [alpha]), we get


which implies the following approximation of the estimation bias


Next we try to access the variance of the estimate [beta]. To obtain variance, note that


where β0=(β00,β10,,βp0)T and βj0=(η(j)(x0)η(j)(x0,α^))/(j!) for j = 0,1, (...), p. This implies that


and an approximation for the conditional variance is given by


Here the Hessian matrix can be approximated by Qa(β^) and the variance term can be approximated as follows




Note that Xi has significant weight only in a neighborhood around x0 and, for such i, ξ i ≈ [(g−1)′(η(x0))]2/V (g−1(η(x0) Consequently we have




Combining the above results, we get


where the unknown η(x0) and β0 can be replaced by their estimates [eta w/ hat]a(x0) and [beta], respectively.

5.2. Bandwidth selection via bias-variance tradeoff

Based on the above arguments, we first select a pilot bandwidth ĥp+a+1,p+a* which can be chosen using the Extended Residual Squares Criterion (ERSC) (see Equation (5.6) of Fan et al., 1998). Next we fit a local polynomial with degree p + a + 1 and bandwidth ĥp+a+1,p+a* to get an estimate [beta](p+a) = ([beta]0, [beta]1, (...), [beta]p+a)T via maximizing quasi-loglikelihood function (3.3). Using [beta](p+a), we get the approximation error ri and hence the estimated bias [B with circumflex]p, 0 (x; h) and variance Vp,0(x; h) of [beta]0 the first element of [beta](x0, [alpha]). An estimator of the mean squared error(MSE) of [beta]0 is given by


which leads to our final bandwidth selector


6. Monte Carlo Study

In this section, we use simulations to illustrate the improvement of our new proposed estimators, by comparing them with the original nonparametric method. For simulations in this section and real data analysis in next section, we use the corresponding canonical link and local linear fitting by setting p = 1. To access the bias term in the pre-asymptotic bandwidth selection as discussed in Section 5, we choose the order of the approximation to Taylor expansion error to be a = 2. In simulation studies, we first generate ten independent data sets, on which the pre-asymptotic bandwidth selector based on a grid search is applied. We set our final selected bandwidth to be the median of the obtained ten bandwidths and it is fixed and used in our simulation. Different methods with their corresponding final selected bandwidths are applied to another 1000 independent data sets. Mean and standard deviation (in parenthesis) of root average squared error (RASE) over these 1000 independent samples are reported. Epanechnikov kernel is used in all of our numerical examples.

Example 6.1 Poisson

Each observation pair (X, Y) in this example is generated in two steps: (1) the predictor variable X is marginally uniformly distributed over [−2, 2]; (2) given X = x, the response Y is generated from Poisson distribution with mean exp(η0(x)), where η0(x)=3sin(π4xπ2)+6. Each sample consists of 100 i.i.d.pairs of observations. We estimate η0(·) over 100 uniform grid points {xj}j=1100 on the interval [−2, 2]. For an estimate [eta w/ hat](·), we define RASE as Σi=1100(η^(xi)η0(xi))2/100. The RASEs for different methods and different parametric guides are reported in Table 1.

Result on RASE × 100 of Example 6.1

Example 6.2 Bernoulli

In this example, we consider Bernoulli distribution. The predictor variable X is generated from Uniform[−1, 1]. Conditioning on X = x, the response Y is generated from Bernoulli distribution with success probability exp(η0(x))/(1 + exp(η0(x))), where η0(x) = 2sin(πx). In this case, we consider samples of size 500 due to two reasons: (1) the estimation of Bernoulli success probability is harder than the case of Poisson, (2) the use of a full sinusoid true η0(x) makes it even harder. Function η0(·) is estimated over a uniform grid with 100 points over [−1, 1]. The RASEs for different methods and different parametric guides are reported in Table 2.

Result on RASE × 100 of Example 6.2

In Tables 1 and and2,2, the first column summarizes the result of the original nonparametric method; the second and third columns correspond to our parametrically guided nonparametric method with additive and multiplicative corrections, respectively. For our parametrically guided methods, each row corresponds to the parametric guide specified in the fourth column with the corresponding parametric RASE given in the last column.

Our new proposed parametrically guided nonparametric estimation schemes give smaller RASEs comparing to the original nonparametric method, especially when the parametric family is correctly specified. The only exception is the case of linear guide α1 + α2x in Example 6.2. We can resort to our theoretical results to understand this exception. As we use local linear fitting, theoretically asymptotic bias depends on the second order derivative of η0(·) − η(·, α0) + η(x0, α0) and η0(·)η(x0, α0)/η(·, α0) for additive and multiplicative corrections, respectively. A linear guide cannot reduce the second order derivative of η0(·)−η(·, α0) + η(x0, α0) and consequently does not reduce bias. However, a linear guide slightly reduces the second order derivative of η0(·)η(x0, α0)/η(·, α0) and improves the corresponding performance. This is consistently evidenced by our numerical results.

For one random sample in Example 6.1, the best fit within the quadratic family is shown by the dotted line in panel (A) of Figure 1. The true unknown η0(·), nonparametric estimate [eta w/ hat](·), two parametrically guided estimates [eta w/ hat]a(·) and [eta w/ hat]m(·) are given by the solid, dotted, dashed, and dot dashed lines, respectively, in Figure 2. From this, we can see that parametrically guided estimates improve the nonparametric counterpart around x = 0, where the curvature of η0(·) is large and makes non-parametric estimation difficult.

7. Real Data Analysis

In this section, we apply our new proposed parametrically guide nonparametric estimation schemes to the Financial Aid Award Data, provided by National Longitudinal Survey of the High School Class of 1972. The data set is available online and interested readers may find more information about this data set at There are twenty variables. We are interested in using SAT score (X) to predict whether a student received financial aid grants. There are 3076 students in total with SAT score between 600 and 1300. Out of these 3076 students, 916 students received some financial aid grants. The binary response Y is coded in this way: Y = 1 means that this student received financial aid grants and Y = 0 otherwise. Our parametrically guided logistic regression is applied with a cubic guide. The pre-asymptotic bandwidth selector gives bandwidths 258.4615, 296.1538, and 296.1538 for the nonparametric, parametrically guided additive, and parametrically guided multiplicative methods, respectively. Result is summarized in Figure 3. Nonparametric estimate of the log odds ratio log logP(Y=1|X=x)P(Y=0|X=x) is given by the dot dashed line; cubic parametric estimate of the log odds ratio is given by the solid line; our parametrically guided estimates are given by the dashed and dotted lines for additive and multiplicative methods, respectively.

Plots of the nonparametric estimate, cubic estimate, and our parametrically guided estimates of the log odds ratio function for the Financial Aid Award Data.

We observe that our parametrically guided additive and multiplicative estimates follow the cubic fit very closely. This suggests that there is no model specification bias by using a cubic model. However the nonparametric estimate differs from the cubic fit for lower SAT score.

8. Discussion

In this work, we extend the methodology of parametrically guided nonparametric estimation to GLMs and QLM. Asymptotical properties and numerical evidence demonstrate its improvement over the original nonparametric estimation scheme. There are possible extensions. For example the whole estimation scheme can be easily extended to varying-coefficient GLMs and QLM.

Appendix: Conditions and proofs

Let qi(x, y) = ([partial differential]i/[partial differential]xi)Q(g−1(x), y). Note that qi is linear in y for fixed x and that q10(x0),μ (x0)) = 0 and q20(x0), μ(x0)) = −ρ(x0)

The following technical conditions are imposed.

  • (A1)
    The function q2 (x, y) < 0 for x [set membership] R and y in the range of the response variable.
  • (A2)
    The functions f,η0(p+2),p+2xp+2η(x,α) , var(Y|X = ·), V″, and g[triple prime] are continuous.
  • (A3)
    For each x [set membership] supp(f),ρ(x), var(Y|X = ·), and g′( μ (x)) are nonzero.
  • (A4)
    The kernel K is a symmetric probability density with support [−1, 1].
  • (A5)
    For each point xδ on the boundary of supp(f), there exists an interval C containing xδ having nonnull interior such that infx[set membership]C f(x) > 0.

White (1982)-type conditions:

  • (B1)
    E log(h0(x,y)) exists and there exists a m1(x,y) such that |log(h(x,y; α))| ≤ m1(x,y) for any α [set membership] A and Em1(x,y) < ∞.
  • (B2)
    E(log(h0(x,y)/h(x,y; α))) has a unique minimizer α0.
  • (B3)
    αjlogh(x,y;α) is continuously differentiable in α for j = 1, 2, (...), q.
  • (B4)
    There exist m1(x,y) and m2(x,y) such that |αilogh(x,y;α)αjlogh(x,y;α)|m2(x,y) and |2αiαjlogh(x,y;α)|m3(x,y) for any α [set membership] A, 1 ≤ i, jq. Furthermore, both Em2(X,Y) and Em3(X,Y) exist.
  • (B5)
    Assume that α0 is an interior point of A; the matrix (Eαilogh(x,y;α)αjlogh(x,y;α))1i,jq is nonsingular at α0; α0 is a regular point of matrix (E2αiαjlogh(x,y;α))1i,jq


In the following, we provide detailed proofs for Theorems 1 and 2. To save space, we skip the proofs for Theorems 3 and 4, which are similar and even simpler.

For the case of multiplicative correction with a fixed guide η(x, α), denote β^=β^(x0,α^)=argmaxβQm(β;h,x0,α). Because [beta]is calculated using Xi near x0, we expect that


Consequently we expect that [beta]0 → η0(x0) and


We define ϕα(x) = η0(x)/η(x, α) to simplify our notations. We thus study the asymptotic properties of


so that each component has the same rate of convergence. Let Qp([mathematical script A]) and Tp([mathematical script A]) be the (p + 1) × (p+ 1) matrices having (i,j)th entry equal to νi+j−1([mathematical script A]) and ∫[mathematical script A] zi+j−2 K2(z)dz. Also, define D = diag(1, 1/1!, (...), 1/p!), Σx([mathematical script A]) = ρ(x)f(x)DNp([mathematical script A])D,





Let bx0([mathematical script A]) be the (p + 1) × 1 vector having jth entry equal tonh2p+3a1,j(𝒜)+nh2p+5a2,j(𝒜)

Main Theorem Suppose that Conditions (A1)–(A5) hold and that h = hn → 0, nh2p+1 → ∞, nh2p+3 < ∞ as n → ∞. If x0 is an interior point of supp(f) and p > 0, then


If x0 = xn, is of the form x0 = xδ + hc, where c [set membership] [−1, 1] is fixed and xδ is a fixed point on the boundary of supp(f), then


The proof of the main theorem follows directly from Lemmas 1 and 2, which are stated and proved as follows. Denote


Lemma 1. Let η¯(x0,x)=η(x0,α)j=0pϕα(j)(x0)(xx0)j/j! and Wn=(nh)1/2i=1nYi* where


Then, under Conditions (A1)–(A5), nh3 → ∞, and h → 0, we have


Proof Recall that [beta] maximizes Qm(β; x0, p, α). Let




where an = (nh)−1/2. If [beta] maximizes Qm(β; x0, p, α), then [beta]* maximizes


as a function of β*. To study the asymptotic properties of [beta]* we apply the quadratic approximation lemma (Fan and Gijbels, 1995) to the maximization of the normalized function


Then [beta]* maximize ln. We remark that Condition (A1) implies that ln is concave in β*. Using a Taylor series expansion of Q(g−1(·), Yi),


where ηi is between [eta w/ macron](x0, Xi) and [eta w/ macron](x0, Xi) + anβ*T Zi. Let


Then the second term in (8.1) equals 12β*TAnβ* Now (An)ij=(EAn)ij+OP(var((An)ij)) and


since q2 is linear in y for fixed x. Because supp(K) = [−1, 1], we need only consider |X1x0| ≤ h, and thus




Similar arguments show that var{(An)ij} = O{(nh)−1} and that the last term in (8.1) is Op{(nh)−1/2}. Therefore, ln(β*)=WnTβ*12β*T(Σx0+hΛx0)β*+oP(h), because nh3 → ∞ and h → 0. Similar arguments show that ln(β*)=Wn(Σx0+hΛx0)β*+oP(h) and ln(β*)=(Σx0+hΛx0)+oP(h) The result follows directly from the quadratic approximation lemma of Fan and Gijbels (1995).

Lemma 2. Suppose that the conditions of Theorem 1 hold. For Wn as defined in Lemma 1,




Proof. We compute the mean and covariance matrix of the random vector Wn by studying Y1* as defined in Lemma 1. The mean of the ith component of Y1* is easily shown to be


Now by Taylor expansion,






Note that


By Lemma 3 of Fan and Gijbels (1995), the ith component of Σx01EWn is


Next, consider the second term in the expression


Using the fact that (Qp)kl = (Np)k,l+1 for l < p + 1, it can be shown that, for i = 2, (...) , p + 1,


and for the similar reasoning,


So by Lemma 3 of Fan and Gijbels (1995),


The statement concerning the asymptotic mean follows immediately. By (8.2), the covariance between the ith and jth component of Y1* is E((Y1*)i(Y1*)j)+O(h2p+4). By a Taylor series expansion,


Noticing that


we can derive


Therefore, Γx01/2cov(Wn)Γx01/2Ip+1. Now, we use the Cramer-Wold device to derive the asymptotic normality of Wn. For any unit vector u [set membership] Rp+1, if


then h1/2cov(Y1*)1/2(WnEWn)DN(0,Ip+1) , and so Γx01/2(WnEWn)DN(0,Ip+1). To prove (8.3). We only need to check Lyapounov’s condition for that sequence, which can be easily verified.

Noting that [beta]0 → η0(x0) and β^jη(x0,α)(η0η(,α))(j)(x0)/j! for 1 ≤ jp, we can define the estimator of η0(j)(x0) iteratively as [eta w/ hat]m,0 (x0; p, h, α) [equivalent] [eta w/ hat]m (x0; p, h, α) = [beta]0 and


where (ji)=j!/(i!(ji)!). Simple algebra leads to


Denote wj=η^m,j(x0;p,h)η0(j)(x0),υj=j!β^jη(x0,α)(η0η(,α))(j)(x0) and


Let L be a (p+1) × (p+1) matrix. For 0 ≤ i, jp, its (i + 1, j + 1)element, denoted by Li,j, is defined as follows. Set Li,j = 1 when i = j, Li,j = 0 when i < j, and


when i > j. Then wj=υji=0j1ωi,jwi=i=0jLj,jiυji=i=0jLj,iυi

With the above notations, we have


The above equation allows us to study the asymptotic bias and variance of η^m,j(x0;p,h)η0(j)(x) using those of i!β^iη(x0,α)(η0η(,α))(i)(x0) , where 0 ≤ ip.

Proposition 1. Let p − j > 0 be odd and suppose that Conditions (A1)–(A5) stated in Appendix are satisfied. Assume that h = hn → 0, nh2p+1 → ∞, and nh2p+3 < ∞ as n → ∞. If x0 is a fixed point in the interior of supp(f) satisfying η (x0, α) ≠ 0, then


where the bias term is given by


Based on (8.4), we get the asymptotic distribution result for our estimates [eta w/ hat]m,j (x0; p, h, α) as follows.


If x0 = xn is of the form x0 = xδ + ch satisfying η(x0, α) ≠ 0, where xδ is a point on the boundary of supp(f) and c [set membership] [−1, 1], then (8.6) holds with σr,s,p2(x0;K) andzp+1 Kr,p(z)dz replaced byσr,s,p2(x0;K,𝒟x0,h) andDx0,h zp+1 Kr,p (z; Dx0,h)dz.

Proposition 2.. Let p − j ≥ 0 be even and p > 0 and suppose that Conditions (A1)–(A5) stated in Appendix are satisfied. Assume that h = hn → 0, nh2p+1 → ∞, and nh2p+3 < ∞ as n → ∞. If x0 is a fixed point in the interior of supp (f) satisfying η(x0, α) ≠ 0, then


where the bias term is given by


Based on (8.4), we get the asymptotic distribution result for our estimates [eta w/ hat]m,j(x0; p, h, α) as follows.


If x0 = xn is of the form x0 = xδ + ch satisfying η(x0, α) ≠ 0, where xδ is a point on the boundary of supp(f) and c [set membership] [−1, 1], then (8.8) holds with σr,s,p2(x0;K) andzp+1 Kr,p(z)dz replaced by σr,s,p2(x0;K,𝒟x0,h) andDx0,h zp+1 Kr,p (z; Dx0,h)dz

Proof of Propositions 1 and 2. The results (8.5) and (8.7) in Propositions 1 and 2 follow from the main theorem by reading off the marginal distributions of the components of [beta]*. To calculate the asymptotic variance, we calculate the (r + 1, s + 1) entry of r!s!Np ([mathematical script A])−1Tp([mathematical script A])Np([mathematical script A])−1 as


where cij is the cofactor of {Np([mathematical script A])}ij. The equation comes from the following argument


The asymptotical results in (8.6) and (8.8) for [eta w/ hat]m,j (x0; p, h, α) are easily proved by noting that its bias and variance are dominated by those of the single term j!β^jη(x0,α)(η0()η(,α))(j)(x0) based on (8.4) since Lis a lower triangle matrix.

Proof of Theorems 1 and 2. The results of Theorems 1 and 2 are the special cases of Propositions 1 and 2 for j = 0.

Proof of Theorem 5. It is enough to prove the result for the multiplicative case as the parallel extension to the additive case is straightforward. Note first that, under Conditions (B1)–(B5), we have ‖[alpha]α0 ‖ = n−1/2 Op (1) by Theorem 3.2 of White (1982). This implies that 1n(Qm(β;h,x0,α0)Qm(β;h,x0,α^))P0. Note that Condition (A1) implies that both Qm(β; h, x0, α0) and Qm (β; h, x0, [alpha])) are strictly concave in β. Consequently β^(x0,α0)β^(x0,α^)P0 as we have obtained the asymptotic result of [beta] (x0, α0).

Note that 1n(Qm(β;h,x0,α0)Qm(β;h,x0,α^))=Op(1/n), 1nβQm(β;h,x0,α0)βQm(β;h,x0,α^)=Op(1/n), 1n2ββTQm(β;h,x0,α0)2ββTQm(β;h,x0,α^)F=Op(1/n) for every β, where ‖ · ‖F denotes matrix’s Frobenius norm defined as the square root of the sum of squares of each element. With consistency established above, we can consider a local compact set. By the standard argument of Taylor expansion used for proving asymptotic normality we get ‖ [beta] (x0, α0) − [beta] (x0, [alpha]) ‖ = n−1/2 Op (1), which is faster than the convergence rates in our Theorems 1–4. Hence using estimated [alpha] does not affect our asymptotic convergence rates as desired.


*Supported in part by NIH grant R01-GM07261 and NSF grant DMS-0704337.


  • Cheng M-Y, Hall P. Reducing variance in nonparametric surface estimation. Journal of Multivariate Analysis. 2003;86:375–397.
  • Cheng M-Y, Peng L, Wu J-S. Reducing variance in univariate smoothing. Annals of Statistics. 2007;35:522–542.
  • Cox DD, O’Sullivan F. Asymptotic analysis of penalized likelihood and related estimators. The Annals of Statistics. 1990;18:1676–1695.
  • Fan J. Local linear regression smoothers and their minimax efficiency. The Annals of Statistics. 1993;21:196–216.
  • Fan J, Farmen M, Gijbels I. Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society. Series B. Statistical Methodology. 1998;60:591–608.
  • Fan J, Gijbels I. Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. Journal of the Royal Statistical SocietySeries B (Methodological) 1995;57:371–394.
  • Fan J, Heckman NE, Wand MP. Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. J. Amer. Statist. Assoc. 1995;90:141–150.
  • Gasser T, MÜller H-G, Mammitzsch V. Kernels for nonparametric curve estimation. J. Roy. Statist. Soc. Ser. B. 1985;47:238–252.
  • Glad IK. Parametrically guided non-parametric regression. Scandinavian Journal of Statistics. Theory and Applications. 1998;25:649–668.
  • Green PJ, Yandell B. Semiparametric generalized linear models. Proceedings of the 2nd International GLIM conference; Berlin. Sprinter-Verlag; 1985.
  • Hjort NL, Glad IK. Nonparametric density estimation with a parametric start. The Annals of Statistics. 1995;23:882–904.
  • Hurvich CM, Tsai C-L. Model selection for extended quasi-likelihood models in small samples. Biometrics. 1995;51:1077–1084. [PubMed]
  • Martins-Filho C, Mishra S, Ullah A. A class of improved parametrically guided nonparametric regression estimators. Econometric Reviews. 2007 To appear.
  • McCullagh P, Nelder JA. Generalized linear models. 2nd ed. London: Chapman & Hall; 1989.
  • Naito K. Semiparametric density estimation by local L2-fitting. The Annals of Statistics. 2004;32:1162–1191.
  • Nelder JA, Wedderburn RWM. Generalized linear models. Journal of the Royal Statistical Society. Series A. 1972;135:370–384.
  • O’Sullivan F, Yandell B, Raynor W. Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association. 1986;81:96–103.
  • Staniswalis JG. The kernel estimate of a regression function in likelihood-based models. Journal of the American Statistical Association. 1989;84:276–283.
  • Stone CJ. Nearest neighbor estimators of a nonlinear regression function. Proc. Computer Science and Statistics, 8th Annual Symposium on the Interface; 1975. pp. 413–418.
  • Stone CJ. Consistent nonparametric regression, with discussion. The Annals of Statistics. 1977;5:549–645.
  • Tibshirani R, Hastie T. Local likelihood estimation. Journal of the American Statistical Association. 1987;82:559–568.
  • Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the gauss-Newton method. Biometrika. 1974;61:439–447.
  • White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.