Acta Math Appl Sin. Author manuscript; available in PMC 2010 July 1.
Published in final edited form as:
Acta Math Appl Sin. 2009 July 1; 25(3): 427–444.
doi:  10.1007/s10255-008-8813-3
PMCID: PMC2779551
NIHMSID: NIHMS104023

Local Linear Regression for Data with AR Errors

Abstract

In many statistical applications, data are collected over time and are likely correlated. In this paper, we investigate how to incorporate the correlation information into local linear regression. Under the assumption that the error process is an autoregressive process, a new estimation procedure is proposed for the nonparametric regression by combining the local linear regression method with the profile least squares technique. We further propose the SCAD penalized profile least squares method to determine the order of the autoregressive process. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed procedures and to compare them with an existing one. From our empirical studies, the newly proposed procedures can dramatically improve the accuracy of the naive local linear regression with a working-independence error structure. We illustrate the proposed methodology by an analysis of a real data set.

Keywords: Auto-regressive error, local linear regression, partially linear model, profile least squares, SCAD

1 Introduction

When data are correlated, it is of great interest to improve the efficiency of parameter estimates by including the correlation information in the estimation procedure. This issue has been well studied for longitudinal and panel data. The generalized method of moments (GMM, Hansen, 1982), the generalized estimating equation (GEE, Liang and Zeger, 1986; Zeger and Liang, 1986) and the quadratic inference function (QIF, Qu, Lindsay and Li, 2000) are well-known methods for incorporating the correlation information into estimation procedures for parametric regression models with longitudinal data. Lin and Carroll (2000) showed that kernel GEE, a direct extension of the parametric GEE, fails to incorporate the correlation information into the kernel estimate of the nonparametric function for clustered/longitudinal data. Wang (2003) proposed the marginal kernel method for longitudinal data; it achieves its efficiency by incorporating the true correlation structure. Fan, Huang and Li (2007) proposed the idea of minimizing generalized variance (MGV) to improve the efficiency of estimates of the nonparametric regression function in the longitudinal data setting under working independence.

Beyond the setting of longitudinal data, many authors have studied nonparametric regression with correlated errors. A good review of this topic is given in Opsomer, Wang and Yang (2001), in which attention is paid to the nonparametric regression model with fixed design:

$$y_t = m(x_t) + \varepsilon_t, \qquad (1.1)$$

where x_t = t/n or x_t = (t − 0.5)/n, and ε_t is a correlated error. Opsomer, Wang and Yang (2001) discussed the problems caused by correlation and illustrated that failure to account for the correlation between the errors may yield undesirable results. Pioneering works on this topic are Altman (1990) and Hart (1991). Both Altman (1990) and Hart (1991) assumed that the correlation between ε_t and ε_s is of the form ρ_n(|t − s|), and addressed how to select a bandwidth that adjusts for the correlation structure ρ_n(|t − s|), which is either assumed known or must be estimated from the observed data.

In this paper, we consider the situation in which x_t is a random design. More specifically, it is assumed that (x_t, y_t), t = 1, 2, …, is a sequence of strictly stationary random vectors; thus (x_t, y_t), t = 1, 2, …, are identically distributed. We are interested in how to incorporate the correlation into the local linear regression estimation procedure when the data are correlated. This issue was studied in Xiao, Linton, Carroll and Mammen (2003), in which the authors proposed a new estimation procedure based on a pre-whitening transformation of the dependent variable that must be estimated from the data. They also established the asymptotic distribution of their estimator under weak conditions, but they did not address critical issues related to practical implementation, such as the selection of the smoothing parameter in their nonparametric regression. They assumed that the error process is an invertible linear process, i.e., a moving average (MA) process of infinite order, but they did not discuss how to determine the order of the error process. In this paper, it is assumed that the error process is an AR process of order d, where d can be large. We propose a new estimation procedure for the regression using the profile least squares technique, and study the asymptotic properties of the resulting estimate. We discuss how to select the bandwidth in the local linear regression. We further propose the penalized profile least squares with the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) to determine the order of the AR process. Monte Carlo simulation studies are conducted to examine the finite sample performance. Our simulation results show that the proposed estimation procedure is much more efficient than the conventional local linear regression method when the error is highly correlated, and that the efficiency gain can be achieved in moderate-sized samples.

The remainder of this paper is organized as follows. In Section 2, we propose a new estimation procedure and discuss issues related to its practical implementation. Section 3 presents numerical comparisons and the analysis of a real data example. Regularity conditions and technical proofs are given in Section 4. Some discussion and concluding remarks are given in Section 5.

2 A new estimation procedure

Suppose that (x_t, y_t), t = 1, …, n, is a random sample from the nonparametric regression model

$$y_t = m(x_t) + \varepsilon_t, \qquad (2.1)$$

where the error process ε_t is a correlated random error with mean zero. Throughout this paper, it is assumed that the error process ε_t is independent of the covariate process x_t. As reviewed in Section 1, Altman (1990), Hart (1991) and Opsomer, Wang and Yang (2001) studied nonparametric regression estimation for m(·) under the 'fixed design' case, i.e., either x_t = t/n or x_t = (t − 0.5)/n and cov(ε_t, ε_{t+k}) = σ²ρ_n(|k|), and further studied how to choose the bandwidth with adjustment for the correlation. Xiao et al. (2003) proposed a local polynomial estimate for the regression function with x_t being a 'random design' following a non-degenerate distribution, and with the residual process being stationary, mean zero and having an invertible linear process representation; that is, it can be represented as a moving average of infinite order (MA(∞)). Throughout this paper, we assume that x_t is a random design and that ε_t is an autoregressive (AR) series

$$\varepsilon_t = \beta_1\varepsilon_{t-1} + \cdots + \beta_d\varepsilon_{t-d} + \eta_t,$$

where η_t is an independent and identically distributed random error with mean zero and variance σ². The order d can be large, and the selection of d will be discussed in the next section. If the values of ε_t were available, then we would work with the following partially linear model:

$$y_t = m(x_t) + \beta_1\varepsilon_{t-1} + \cdots + \beta_d\varepsilon_{t-d} + \eta_t.$$

In practice, ε_t is not available, but it may be estimated by ε̂_t = y_t − m̂_I(x_t), where m̂_I(·) is a local linear estimate of m(·) based on (2.1) obtained by pretending that the random errors ε_t are independent. We address the issue of bandwidth selection for m̂_I(·) in the next section.

Replacing the ε_t's with the ε̂_t's, we have

$$y_t = m(x_t) + e_t^T\beta + \eta_t, \qquad (2.2)$$

where e_t = (ε̂_{t−1}, …, ε̂_{t−d})^T and β = (β_1, …, β_d)^T. In Section 2.1, we propose a new estimation procedure for m(·) and β based on model (2.2); a sketch of the construction of the lagged-residual regressors is given below. We further propose an order selection procedure for the AR series using the penalized profile least squares method in Section 2.2.
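To make the construction concrete, the following minimal sketch (in Python with NumPy; the function name and interface are our own illustration, not part of the paper) builds the lagged-residual matrix E = (e_{d+1}, …, e_n)^T from the initial residuals ε̂_t.

```python
import numpy as np

def lagged_residual_matrix(resid, d):
    """Build the lagged-residual design matrix E.

    resid : array of initial residuals eps_hat_t = y_t - m_I(x_t), t = 1..n (0-based here)
    d     : assumed AR order
    Returns E with rows e_t = (eps_hat_{t-1}, ..., eps_hat_{t-d}) for t = d+1..n.
    """
    n = len(resid)
    # column j holds eps_hat_{t-j} for the retained rows t = d+1, ..., n
    E = np.column_stack([resid[d - j:n - j] for j in range(1, d + 1)])
    return E  # shape (n - d, d)
```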

2.1 Profile least squares estimate

For the partially linear model (2.2), there exist various estimation procedures, including the partial spline estimate (Wahba, 1984; Heckman, 1986; Engle et al., 1986), the partial residual method (Speckman, 1988) and the profile least squares or profile likelihood method (Severini and Staniswalis, 1994). Here we employ the profile least squares technique to estimate β and m(·).

For given β, denote y_t^* = y_t − e_t^Tβ for t = d + 1, …, n. Then

$$y_t^* = m(x_t) + \eta_t, \qquad (2.3)$$

which is a one-dimensional nonparametric regression model. We may employ existing linear smoothers, such as local polynomial regression or smoothing splines (Gu, 2002), to estimate m(·). Here we employ local linear regression. For a given x_0, we locally approximate the regression function by

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) \triangleq a + b(x - x_0)$$

for x in a local neighborhood of x_0. Thus, the local linear estimate of m(·) is obtained from the minimizer of the following weighted least squares function:

$$(\hat a, \hat b)^T = \arg\min_{(a,b)} \sum_{t=d+1}^{n}\left\{y_t^* - a - b(x_t - x_0)\right\}^2 K_h(x_t - x_0),$$

where K_h(u) = h^{−1}K(u/h) is a scaled kernel with kernel K(·) and bandwidth h. It is clear that the local linear estimate is linear in y^* = (y^*_{d+1}, …, y^*_n)^T. Let m = (m(x_{d+1}), …, m(x_n))^T. Then the estimate of m can be represented as

$$\hat{m} = S_h y^*, \qquad (2.4)$$

where S_h is an (n − d) × (n − d) smoothing matrix depending only on the x_t's and the bandwidth h. A sketch of how such a smoothing matrix can be formed is given below.
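The following sketch indicates how the local linear smoothing matrix can be assembled, assuming a Gaussian kernel for illustration (the paper only requires a bounded symmetric kernel); the helper name is ours.

```python
import numpy as np

def local_linear_smoother(x, h, kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Return the local linear smoothing matrix S_h on design points x.

    Row t of S_h gives the weights such that (S_h y)_t is the local linear fit at x_t.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = np.zeros((n, n))
    for t in range(n):
        u = x - x[t]
        w = kernel(u / h) / h                    # K_h(x_i - x_t)
        D = np.column_stack([np.ones(n), u])     # local linear design matrix
        WD = w[:, None] * D
        # [1, 0] (D^T W D)^{-1} D^T W  -> smoothing weights at x_t
        S[t, :] = np.linalg.solve(D.T @ WD, WD.T)[0, :]
    return S
```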

Substituting m(x_t) in (2.3) with m̂(x_t), we obtain a synthetic linear regression model

$$(I - S_h)y = (I - S_h)E\beta + \eta,$$

where I is the identity matrix, E = (e_{d+1}, …, e_n)^T and η = (η_{d+1}, …, η_n)^T. Thus, the profile least squares estimators for β and m are

$$\hat\beta = \left\{E^T(I - S_h)^T(I - S_h)E\right\}^{-1}E^T(I - S_h)^T(I - S_h)y, \qquad (2.5)$$

and

$$\hat{m} = S_h(y - E\hat\beta), \qquad (2.6)$$

respectively.
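Putting (2.4)–(2.6) together, a minimal sketch of the profile least squares step might look as follows, reusing local_linear_smoother from the sketch above; the interface is our own illustration.

```python
import numpy as np

def profile_least_squares(y, x, E, h):
    """Profile least squares estimates (2.5)-(2.6), sketched under our own interface.

    y : responses y_{d+1}, ..., y_n
    x : covariates x_{d+1}, ..., x_n
    E : lagged-residual matrix with rows e_t
    h : bandwidth for the local linear smoother
    """
    S = local_linear_smoother(x, h)
    I = np.eye(len(y))
    A = (I - S) @ E
    b = (I - S) @ y
    beta_hat = np.linalg.solve(A.T @ A, A.T @ b)    # eq. (2.5)
    m_hat = S @ (y - E @ beta_hat)                  # eq. (2.6)
    return beta_hat, m_hat
```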

The asymptotic distribution of β̂ and the asymptotic bias and variance of m̂(x_0) are given in the following theorem. Denote μ_i = ∫ x^i K(x) dx and ν_i = ∫ x^i K²(x) dx.

Theorem 1. Suppose that Conditions A—G listed in Section 4 hold. Then

  1. The asymptotic distribution of β̂ is given by
     $$\sqrt{n}\,(\hat\beta - \beta) \to N\big(0, \sigma^2\{E(ff^T)\}^{-1}\big),$$
     where f_t = (ε_{t−1}, …, ε_{t−d})^T and σ² = var(η_t).
  2. The asymptotic distribution of m̂(x_0, β̂), conditional on x_1, …, x_n, is given by
     $$\sqrt{nh}\left\{\hat m(x_0,\hat\beta) - m(x_0) - \tfrac12\mu_2 m''(x_0)h^2\right\} \to N\!\left(0, \frac{\nu_0\sigma^2}{f(x_0)}\right),$$
     where f(x) is the density function of x.

Note that the asymptotic distribution of β̂ is the same as that of the Yule-Walker estimator for the AR model

$$\varepsilon_t = \beta_1\varepsilon_{t-1} + \cdots + \beta_d\varepsilon_{t-d} + \eta_t$$

(see Theorem 8.1.1 of Brockwell and Davis, 1991). In other words, Theorem 1 implies that β̂ is as efficient as if one knew the true regression function m(·) in advance. The asymptotic bias and variance of m̂(·, β̂) are the same as those of the local linear regression estimate for independent and identically distributed observations. This implies that the proposed profile least squares estimate is very efficient.

2.2 SCAD procedure for the AR process

To implement the profile least squares estimation procedure, we have to determine the order of the AR process. In practice, we may start with an AR process of large order and then apply a variable selection procedure to select its order. The penalized likelihood procedure with the smoothly clipped absolute deviation (SCAD) penalty was proposed for variable selection in parametric models by Fan and Li (2001). The SCAD procedure is distinguished from traditional variable selection procedures, such as stepwise regression and best subset selection with AIC or BIC, in that it selects significant variables and estimates their coefficients simultaneously. Thus, it can be directly applied to high-dimensional data analysis. The SCAD procedure was further developed for partially linear models with longitudinal data in Fan and Li (2004). In this section, we apply the SCAD procedure to determine the complexity of the AR process.

The SCAD penalized least squares function is defined to be

$$\frac12\sum_{t=d+1}^{n}\left\{y_t - m(x_t) - e_t^T\beta\right\}^2 + n\sum_{j=1}^{d}p_{\lambda_j}(|\beta_j|),$$

where p_λ(|β|) is the SCAD penalty with tuning parameter λ, defined by

$$p_\lambda(|\beta|)=\begin{cases}\lambda|\beta|, & 0\le|\beta|<\lambda;\\[4pt] \dfrac{(a^2-1)\lambda^2-(|\beta|-a\lambda)^2}{2(a-1)}, & \lambda\le|\beta|<a\lambda;\\[4pt] \dfrac{(a+1)\lambda^2}{2}, & |\beta|\ge a\lambda.\end{cases}$$

Fan and Li (2001) suggested fixing a = 3.7 from a Bayesian argument.
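A direct transcription of the penalty and of its first derivative, which is needed for the local quadratic approximation below, might look as follows (a sketch; the vectorization and default values are our own choices).

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|), evaluated elementwise."""
    b = np.abs(beta)
    p1 = lam * b
    p2 = ((a**2 - 1) * lam**2 - (b - a * lam)**2) / (2 * (a - 1))
    p3 = (a + 1) * lam**2 / 2
    return np.where(b < lam, p1, np.where(b < a * lam, p2, p3))

def scad_derivative(beta, lam, a=3.7):
    """Derivative p'_lambda(|beta|), used in the local quadratic approximation."""
    b = np.abs(beta)
    return np.where(b <= lam, lam, np.maximum(a * lam - b, 0.0) / (a - 1))
```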

Applying the profile technique to the penalized least squares, we can derive the penalized profile least squares estimate, defined as the minimizer of the following penalized least squares function:

$$\frac12\left\|(I - S_h)y - (I - S_h)E\beta\right\|^2 + n\sum_{j=1}^{d}p_{\lambda_j}(|\beta_j|). \qquad (2.7)$$

As demonstrated in Fan and Li (2004), with a proper choice of tuning parameter, the resulting estimate contains some exact zero coefficients. This is equivalent to excluding the corresponding terms from the selected model and reduces model complexity. Since the SCAD penalty is a nonconvex function over [0, ∞), it is challenging to minimize the SCAD penalized profile least squares function. Following Fan and Li (2004), we employ the local quadratic approximation (LQA) of the SCAD penalty. Suppose we have an estimate β_j^{(k)} at the k-th step that is close to the true β_j. If |β_j^{(k)}| is close to 0, then set β̂_j = 0. Otherwise, the SCAD penalty can be locally approximated by a quadratic function whose derivative satisfies

$$\left[p_{\lambda_j}(|\beta_j|)\right]' = p'_{\lambda_j}(|\beta_j|)\operatorname{sgn}(\beta_j) \approx \left\{p'_{\lambda_j}\big(|\beta_j^{(k)}|\big)\big/|\beta_j^{(k)}|\right\}\beta_j.$$

This is equivalent to

$$p_{\lambda_j}(|\beta_j|) \approx p_{\lambda_j}\big(|\beta_j^{(k)}|\big) + \frac12\left\{p'_{\lambda_j}\big(|\beta_j^{(k)}|\big)\big/|\beta_j^{(k)}|\right\}\left(\beta_j^2 - \beta_j^{(k)2}\right).$$

With the aid of LQA, we may employ the following iterative ridge regression to find the minimizer of (2.7):

$$\beta^{(k+1)} = \left\{E^T(I - S_h)^T(I - S_h)E + n\Sigma_\lambda\big(\beta^{(k)}\big)\right\}^{-1}E^T(I - S_h)^T(I - S_h)y, \qquad (2.8)$$

where Σ_λ(β^{(k)}) = diag{p'_{λ_1}(|β_1^{(k)}|)/|β_1^{(k)}|, …, p'_{λ_d}(|β_d^{(k)}|)/|β_d^{(k)}|} for the nonvanishing components of β^{(k)}. A sketch of this iteration is given below.
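The following sketch implements the iterative ridge regression (2.8) with the LQA, reusing scad_derivative and local_linear_smoother from the sketches above; the thresholding constant and stopping rule are our own choices.

```python
import numpy as np

def penalized_profile_lqa(y, x, E, h, lam, a=3.7, tol=1e-6, max_iter=100, thresh=1e-4):
    """Iterative ridge regression (2.8) with the LQA of the SCAD penalty (a sketch).

    lam may be a scalar or a vector of per-coefficient tuning parameters.
    Coefficients whose current magnitude is below `thresh` are set to zero.
    """
    n, d = E.shape
    S = local_linear_smoother(x, h)
    I = np.eye(n)
    A = (I - S) @ E
    b = (I - S) @ y
    D = A.T @ A
    c = A.T @ b
    lam = np.broadcast_to(np.atleast_1d(lam).astype(float), (d,))
    beta = np.linalg.solve(D, c)                     # unpenalized start
    for _ in range(max_iter):
        active = np.abs(beta) > thresh               # coefficients kept in the model
        beta_new = np.zeros(d)
        if active.any():
            Sig = np.diag(scad_derivative(beta[active], lam[active], a)
                          / np.abs(beta[active]))
            beta_new[active] = np.linalg.solve(D[np.ix_(active, active)] + n * Sig,
                                               c[active])
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    m_hat = S @ (y - E @ beta)
    return beta, m_hat
```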

2.3 Tuning parameter selection and bandwidth selection

In this section, we address how to determine λ in the SCAD procedure and how to select a bandwidth for the profile least squares estimation procedure, two important issues in the practical implementation of the proposed methodology.

Tuning parameter selection

In the implementation of the SCAD procedure, we need to choose the tuning parameters λ = (λ_1, …, λ_d). Following Wang, Li and Tsai (2007), we use the BIC selector to find the optimal λ = (λ_1, …, λ_d). From (2.8), we define the effective number of parameters of the penalized least squares estimator (2.8) to be

$$e(\lambda) = \mathrm{tr}\left[\left\{D + \Sigma_\lambda(\hat\beta)\right\}^{-1}D\right],$$

where D = E^T(I − S_h)^T(I − S_h)E, computed over the nonzero components of β̂.

The BIC score is defined to be

$$\mathrm{BIC}(\lambda) = \log\left\{\frac{\mathrm{RSS}(\lambda)}{n}\right\} + e(\lambda)\frac{\log n}{n},$$

where RSS(λ) = ‖(I − S_h)y − (I − S_h)Eβ̂_λ‖² is the residual sum of squares and β̂_λ is the penalized profile least squares estimate of β with tuning parameter λ.

It is challenging to minimize BIC(λ) over a d-dimensional space of λ. Heuristically, the magnitude of λ_j should be proportional to the standard error of the profile least squares estimate of β_j. In our implementation, we set λ_j = λ·se(β̂_{j,LS}), where se(β̂_{j,LS}) is the standard error of the unpenalized profile least squares estimate and λ is a scalar. Thus, the original d-dimensional optimization becomes a one-dimensional problem. In Section 3, we minimize BIC(λ) over a grid of points evenly distributed in the interval $[0.1/\sqrt{n},\ 2\sqrt{\log n/n}]$, and set λ̂ = argmin_λ BIC(λ). A sketch of this selection procedure is given below.
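The one-dimensional BIC search might be implemented as follows, reusing the earlier sketches; the grid handling and the rough σ² estimate used for the standard errors are our own simplifications.

```python
import numpy as np

def bic_select(y, x, E, h, grid):
    """Select the scalar lambda by BIC over a one-dimensional grid (a sketch).

    Following the heuristic in the text, lambda_j = lambda * se(beta_j,LS), where
    the standard errors come from the unpenalized profile least squares fit.
    """
    n, d = E.shape
    S = local_linear_smoother(x, h)
    I = np.eye(n)
    A = (I - S) @ E
    b = (I - S) @ y
    D = A.T @ A
    beta_ls = np.linalg.solve(D, A.T @ b)
    sigma2 = np.sum((b - A @ beta_ls) ** 2) / (n - d)          # rough error variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(D)))           # se of unpenalized estimates
    best = None
    for lam_scalar in grid:
        beta, _ = penalized_profile_lqa(y, x, E, h, lam_scalar * se)
        rss = np.sum((b - A @ beta) ** 2)
        nz = np.abs(beta) > 1e-8
        if nz.any():
            Sig = np.diag(scad_derivative(beta[nz], lam_scalar * se[nz]) / np.abs(beta[nz]))
            Dnz = D[np.ix_(nz, nz)]
            e_lam = np.trace(np.linalg.solve(Dnz + Sig, Dnz))  # effective no. of parameters
        else:
            e_lam = 0.0
        bic = np.log(rss / n) + e_lam * np.log(n) / n
        if best is None or bic < best[0]:
            best = (bic, lam_scalar, beta)
    return best[1], best[2]
```

For instance, a grid such as np.linspace(0.1/np.sqrt(n), 2*np.sqrt(np.log(n)/n), 50) would mirror the interval suggested above.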

Bandwidth selection

Xiao et al. (2003) pointed out that it is challenging to select a bandwidth for their procedure, and the authors simply used the rule-of-thumb bandwidth h = 1.06 S_X n^{−1/5} to prewhiten the AR process, where S_X is the sample standard deviation of x_t. Note that h = 1.06 S_X n^{−1/5} is the rule-of-thumb bandwidth for kernel density estimation (Silverman, 1986). From our limited simulation experience, this bandwidth is not appropriate for the regression problem. Here we propose a bandwidth selector for the profile least squares estimate.

We use local linear regression with the plug-in bandwidth selector (Ruppert, Sheather and Wand, 1995) to obtain the initial estimate m̂_I(·), pretending the data are independent. Since model (2.2) is a partially linear model, we can use existing bandwidth selectors for partially linear models in the literature. Here we follow the proposal of Fan and Li (2004). Specifically, we calculate the difference-based estimate of β, denoted by β̂_DBE. Plugging the difference-based estimate into (2.3) and then applying the plug-in bandwidth selector, we can select an appropriate bandwidth for the profile least squares procedure. The selected bandwidth is also used for the SCAD procedure in (2.7).

3 Numerical comparison and application

In this section, we investigate the finite sample performance of the proposed procedures by Monte Carlo simulation, and compare the proposed procedures with existing ones in terms of the mean squared error (MSE), defined by

$$\mathrm{MSE}\{\hat m(\cdot)\} = \frac1n\sum_{t=1}^{n}\left\{\hat m(x_t) - m(x_t)\right\}^2.$$

We summarize our simulation results in terms of the relative MSE (RMSE), defined as the ratio of the MSE of an estimation procedure to the MSE of m̂_I(·), the estimate of m(·) obtained by pretending the errors ε_t are independent. We report the percentage of accuracy gain, defined by (1 − RMSE) × 100%; a small computational sketch is given below.
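In code, the two summary measures amount to the following (a trivial sketch with our own function names).

```python
import numpy as np

def mse(m_hat, m_true):
    """Mean squared error of an estimate of m over the design points."""
    return np.mean((m_hat - m_true) ** 2)

def accuracy_gain(m_hat, m_indep, m_true):
    """Percentage accuracy gain (1 - RMSE) * 100 relative to the working-independence fit."""
    rmse = mse(m_hat, m_true) / mse(m_indep, m_true)
    return (1.0 - rmse) * 100.0
```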

Example 1. In this example, a random sample of size n, either n = 100 or n = 500, is generated from

$$y_t = m(x_t) + \varepsilon_t.$$

In this example, we consider two scenarios for m(x). The first one is

$$m(x) = 4\cos(2\pi x),$$

and the second one is

$$m(x) = \exp(2x).$$

The mean function m (x) is not monotone in the first scenario, while it is monotone in the second scenario. The error process εt is an AR process of order d = 10 or d = 20, i.e.,

$$\varepsilon_t = \sum_{j=1}^{d}\beta_j\varepsilon_{t-j} + \eta_t,$$

where η_t ~ N(0, σ²) with σ = 0.5 or 1. In our simulation we consider two situations: in the first, β_1 = 0.5 or 0.7 and all other β_j's equal 0; in the second, β_1 = 0.5, β_2 = 0.4 or β_1 = 0.7, β_2 = 0.2 and all others equal 0. In the first situation the error process is in fact an AR(1), while in the second it is an AR(2). The number of replications is 500.

To understand how the sampling scheme of x_t affects the proposed procedure, we consider the following two sampling schemes in our simulation; a data-generation sketch is given after the list.

  1. x_t is independent and identically distributed according to the uniform distribution over [0, 1].
  2. u_t is independent and identically distributed according to the standard normal distribution for t = 1, 2, …. Let x_t = Φ{(a u_t + b u_{t−1})/√(a² + b²)} for t = 2, 3, …, where Φ(u) is the cumulative distribution function of the standard normal distribution. Thus, x_t is a 1-dependent process. In our simulation, we take a = 0.9 and b = 0.1.
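The following sketch generates data from Example 1 under either sampling scheme; the burn-in length, seed handling and the choice of the first mean scenario are our own, and SciPy is used only for the normal CDF.

```python
import numpy as np
from scipy.stats import norm

def simulate(n, beta, sigma, scheme=1, a=0.9, b=0.1, burn_in=200, seed=None):
    """Generate (x_t, y_t) as in Example 1 (our sketch of the design).

    beta   : AR coefficient vector (length d); zeros give a lower effective order
    scheme : 1 for i.i.d. uniform x_t, 2 for the 1-dependent transformed-normal x_t
    """
    rng = np.random.default_rng(seed)
    d = len(beta)
    # AR(d) errors with N(0, sigma^2) innovations, discarding a burn-in period
    eps = np.zeros(n + burn_in)
    eta = rng.normal(0.0, sigma, n + burn_in)
    for t in range(d, n + burn_in):
        eps[t] = np.dot(beta, eps[t - d:t][::-1]) + eta[t]
    eps = eps[burn_in:]
    if scheme == 1:
        x = rng.uniform(0.0, 1.0, n)
    else:
        u = rng.standard_normal(n + 1)
        x = norm.cdf((a * u[1:] + b * u[:-1]) / np.sqrt(a**2 + b**2))
    m = 4.0 * np.cos(2.0 * np.pi * x)        # first mean scenario
    y = m + eps
    return x, y, m
```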

For each sampling scheme, three methods are compared with regard to the efficiency improvement: the method of Xiao, Linton, Carroll and Mammen (2003) (XLCM), the profile least squares method (Profile) and the penalized profile least squares method with the SCAD penalty (SCAD). In addition, an oracle procedure, which substitutes the true autoregressive coefficients and order, is listed as a benchmark. The overall patterns for d = 10 and d = 20 are the same. To save space, we present the results with d = 20 only.

For sampling scheme I, the covariates x_t are independent and identically distributed, and only the random errors are correlated. Table 1 summarizes the simulation results for sampling scheme I with d = 20. The SCAD procedure performs better than Xiao et al's (2003) method and the profile least squares estimate, and its performance is very close to that of the oracle procedure. The performances of Xiao et al's method and the profile least squares procedure are very close to each other, and neither dominates the other.

Table 1
Simulation Results for Sampling Scheme I when d = 20

When the sample size is large, such as n = 500, the performances of Xiao et al's method, the profile least squares method and the SCAD procedure are very close to each other, although the SCAD procedure is slightly better than the other two. The gain of these three methods in terms of RMSE with the large sample size is greater than that with the smaller sample size (n = 100). This is expected because, with the large sample size, all three methods estimate β more accurately, which makes the decorrelation work better.

Simulation results for sampling scheme II are summarized in Table 2. The overall pattern of Table 2 is similar to that of Table 1, although sampling scheme II differs from sampling scheme I in that the covariates x_t are dependent in scheme II while they are independent in scheme I. For sampling scheme II, the SCAD procedure performs best among the three methods in the comparison, and its performance is very close to that of the oracle procedure. The performances of Xiao et al's method and the profile least squares procedure are similar, and neither dominates the other. In summary, the performance of the proposed profile least squares and SCAD procedures does not seem to depend on the sampling scheme of the covariate x_t.

Table 2
Simulation Results for Sampling Scheme II when d = 20

Example 2. In this example, we illustrate the proposed methodology by the analysis of a U.S. macroeconomic data set collected monthly from January 1980 to December 2006. Our interest here is to investigate the relationship between the unemployment rate and the house price index change.

In the past few years, house prices in the U.S. showed a strong upward trend, although bubble warnings were always present. Many home buyers who had neither a sound credit history nor sufficient financial capability became home owners with the help of sub-prime mortgages. They bore high levels of interest payments but believed the property would keep appreciating. In the meantime, mortgage agents packaged the debt and sold it to other institutional investors. This long chain prospered and worked well while the housing market was booming. However, when house prices began to plummet in spring 2007, borrowers defaulted and many houses went into foreclosure. Consequently, a number of big financial institutions with heavy investments in the sub-prime mortgage market claimed billions of dollars in write-offs due to the crisis.

In this example, we are interested in the effect of the unemployment rate on the house price. By classical economic theory, the unemployment rate is an important indicator of the overall economy: if many people claim unemployment, purchasing power is certainly hurt. However, to the best of our knowledge, little literature has studied the relationship between the unemployment rate and the housing market in a quantitative manner. Motivated by the sub-prime mortgage turmoil and the recent suspicion of recession, we believe that the historical data might shed some light on how these two indexes are related. Thus, we take the unemployment rate as the covariate x and the house price index change as the response variable y, and consider the following model:

$$y_t = m(x_t) + \varepsilon_t. \qquad (3.1)$$

Initial estimate and residual analysis

We temporarily ignore the correlation of the random errors and estimate (3.1) by conventional local linear regression as in Fan and Gijbels (1996). The direct plug-in bandwidth of Ruppert, Sheather and Wand (1995) is 0.2969.

Once the initial estimate m̂_I(x_t) is obtained, we can estimate the residual ε_t by ε̂_t = y_t − m̂_I(x_t). There is an obvious correlation pattern in the autocorrelation plot of ε̂_t (Figure 1(a)). The partial-autocorrelation plot (Figure 1(b)) indicates an autoregressive model in which the first lag effect is most prominent. Furthermore, the Ljung-Box-Pierce test of the white noise hypothesis has a P-value less than 0.0001, which also confirms that autocorrelation is present in ε̂_t. The sketch below illustrates these diagnostics.
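Assuming a recent version of statsmodels is available, the residual diagnostics described above can be reproduced along the following lines (the wrapper function is our own).

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox

def residual_diagnostics(eps_hat, lags=24):
    """ACF/PACF plots and the Ljung-Box test for remaining autocorrelation."""
    plot_acf(eps_hat, lags=lags)                 # analogue of Figure 1(a)
    plot_pacf(eps_hat, lags=lags)                # analogue of Figure 1(b)
    lb = acorr_ljungbox(eps_hat, lags=[lags])    # DataFrame in recent statsmodels versions
    return float(lb["lb_pvalue"].iloc[0])        # small p-value indicates autocorrelation
```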

Figure 1
Correlograms of the residuals ε̂_t and η̂_t. Plots (a) and (b) are the autocorrelation and partial autocorrelation for ε̂_t. Plots (c) and (d) are the autocorrelation and partial autocorrelation for η̂_t.

From a conservative point of view, we suspect that the house price might have up to a one-year lag. So we assume an AR(12) model for the errors and employ the penalized profile least squares method to select the AR order and estimate m(·) simultaneously. The plug-in bandwidth in the profile least squares estimation is 0.2140. By the BIC criterion, the optimal tuning parameter used in the order selection procedure is 0.000019.

As a result, an AR(1) model with a strong autocorrelation coefficient of 0.9438 is most appropriate. This means that the error has only a one-month lag, which agrees with the partial-autocorrelation plot in Figure 1(b). After accounting for the autocorrelation, the correlogram of η̂_t does not show any significant pattern (see Figure 1(c) and (d)). In addition, the P-value of the Ljung-Box-Pierce test at the first 24 lags, 0.9134, also shows that the autocorrelation has been successfully removed.

Final model

By applying the penalized profile least squares estimation method, the relationship between the House Price Index Change and the unemployment rate turns out to be

$$\hat y_t = \hat m(x_t) + 0.9438\,\hat\varepsilon_{t-1}, \qquad (3.2)$$

where m̂(·) is displayed in Figure 2. The penalized profile least squares approach yields a smoother estimate than the conventional local linear regression because it takes the correlation into account. As expected, the unemployment rate is negatively correlated with the house price index change, but this effect is most pronounced when the unemployment rate varies between 4% and 5% or between 8% and 10%.

Figure 2
Estimation of m(·). The dashed curve is the initial estimate; the solid curve is the penalized profile least squares estimate.

4 Proofs

4.1 Preliminaries

To present the regularity conditions, we need the following definitions for a sequence of random vectors {zt, t = 0, ±1, ±2, … }. The following notation and definitions are adopted from Chapter 2 of Fan and Yao (2003).

Definition 1. A sequence of random vectors {z_t, t = 0, ±1, ±2, …} is said to be strictly stationary if {z_1, …, z_n} and {z_{1+k}, …, z_{n+k}} have the same joint distribution for any integer n ≥ 1 and any integer k.

Denote by F_i^j the σ-algebra generated by the events {z_t, i ≤ t ≤ j}, and let L²(F_i^j) consist of the F_i^j-measurable random variables with finite second moments. Intuitively, F_i^j collects all information on the sequence between times i and j. Define

$$\alpha(n) = \sup_{A\in\mathcal F_{-\infty}^{0},\,B\in\mathcal F_{n}^{\infty}}\left|P(A\cap B) - P(A)P(B)\right|. \qquad (4.1)$$

Definition 2. A sequence of random vectors {zt, t = 0, ±1, ±2, … } is said to be α-mixing if it is strictly stationary and α(n) → 0 as n → ∞.

4.2 Regularity conditions and proofs

To make the argument concise, denote F = (f_1, …, f_n)^T with f_t = (ε_{t−1}, …, ε_{t−d})^T, and E = (e_1, …, e_n)^T with e_t = (ε̂_{t−1}, …, ε̂_{t−d})^T. Define Δ = E − F. Our proof follows the same strategy as that in Fan and Huang (2005). The following conditions, adopted from Fan and Huang (2005), are imposed to facilitate the proofs; they are not the weakest possible conditions.

  A. The random variable x_t has a bounded support Ω. Its density function f(·) is Lipschitz continuous and bounded away from 0 on its support.
  B. There is an s > 2 such that E‖f_t‖^s < ∞, and there exists ξ > 0 such that n^{1−2s^{−1}−2ξ}h → ∞.
  C. m(·) has a continuous second derivative on Ω.
  D. The function K(·) is a bounded symmetric density function with bounded support [−M, M], satisfying the Lipschitz condition.
  E. nh^8 → 0 and nh²/(log n)² → ∞.
  F. sup_{x∈Ω}|m̂_I(x) − m(x)| = o_P(n^{−1/4}), where m̂_I(·) is obtained by local linear regression pretending that the data are i.i.d.
  G. The sequence of random vectors (x_t, ε_t), t = 1, 2, …, is strictly stationary and satisfies the mixing condition for α-mixing processes: for some δ > 2 and a > 1 − 2/δ,
     $$\sum_{l} l^{a}[\alpha(l)]^{1-2/\delta} < \infty, \qquad E|\varepsilon_1|^{\delta} < \infty, \qquad g_{x_1|\varepsilon_1}(x|\varepsilon) \le A_1 < \infty.$$

Lemma 4.1 is taken from Lemma 6.1 of Fan and Yao (2003) and will be used repeatedly in our proofs.

Lemma 4.1. Let (x_1, ε_1), …, (x_n, ε_n) be a strictly stationary sequence satisfying the mixing condition α(l) ≤ c l^{−τ} for some c > 0 and τ > 5/2. Assume further that for some s > 2 and an interval [a, b],

$$E|\varepsilon_t|^{s} < \infty \quad\text{and}\quad \sup_{x\in[a,b]}\int|\varepsilon|^{s}g(x,\varepsilon)\,d\varepsilon < \infty,$$

where g denotes the joint density of (x_t, ε_t). In addition, assume that Condition G holds and that the conditional density satisfies g_{x_1,x_l|ε_1,ε_l}(x_1, x_l|ε_1, ε_l) ≤ A_2 < ∞ for all l ≥ 1. Let K satisfy Condition D. Then

$$\sup_{x\in[a,b]}\left|\frac1n\sum_{i=1}^{n}\left\{K_h(x_i - x)\varepsilon_i - E[K_h(x_i - x)\varepsilon_i]\right\}\right| = O_P\!\left(\left\{\frac{\log n}{nh}\right\}^{1/2}\right),$$

provided that h → 0 and that, for some ξ > 0, n^{1−2s^{−1}−2ξ}h → ∞ and n^{(τ+1.5)(s^{−1}+ξ)−τ/2+5/4}h^{τ/2−5/4} → 0.

Lemma 4.2. Under Conditions A—G, it follows that

$$\frac1nF^T(I - S)^T(I - S)F \stackrel{P}{\longrightarrow} E(ff^T).$$

Proof. Let W_x be an n × n diagonal matrix with j-th diagonal element K_h(x_j − x), and let

$$D_x = \begin{pmatrix} 1 & \dfrac{x_1 - x}{h} \\ \vdots & \vdots \\ 1 & \dfrac{x_n - x}{h} \end{pmatrix}.$$

Then the smoothing matrix S for the local linear regression can be expressed as

$$S = \begin{pmatrix} [1, 0]\{D_{x_1}^T W_{x_1} D_{x_1}\}^{-1} D_{x_1}^T W_{x_1} \\ \vdots \\ [1, 0]\{D_{x_n}^T W_{x_n} D_{x_n}\}^{-1} D_{x_n}^T W_{x_n} \end{pmatrix},$$

where

$$D_x^T W_x D_x = \begin{pmatrix} \sum_{i=1}^{n} K_h(x_i - x) & \sum_{i=1}^{n}(x_i - x)K_h(x_i - x)/h \\ \sum_{i=1}^{n}(x_i - x)K_h(x_i - x)/h & \sum_{i=1}^{n}(x_i - x)^2 K_h(x_i - x)/h^2 \end{pmatrix}.$$

The generic element of the matrix $D_x^T W_x D_x$ is of the form $\sum_{i=1}^{n}\left(\frac{x_i - x}{h}\right)^{j}K_h(x_i - x)$, $j = 0, 1, 2$. Denote $S_{n,j} = \sum_{i=1}^{n}\left(\frac{x_i - x}{h}\right)^{j}K_h(x_i - x)$. By using the formula $S_{n,j} = E(S_{n,j}) + O_P\big(\sqrt{\operatorname{Var}(S_{n,j})}\big)$, it is easy to show that if $j$ is even,

$$S_{n,j} = n\int \nu^{j}K(\nu)f(x + h\nu)\,d\nu + O_P\!\left(\sqrt{nE\{(x_1 - x)^{2j}K_h^2(x_1 - x)\}}\right) = nf(x)\mu_j + O_P\!\left(h^2 + \sqrt{1/(nh)}\right).$$

Because of the symmetry of the kernel function, μ_j = 0 for any odd j, and then $S_{n,j} = O_P\big(h + \sqrt{1/(nh)}\big)$. Indeed, with Lemma 4.1, it can be further shown that for even j,

$$S_{n,j} = nf(x)\mu_j + O_P\!\left(h^2 + \sqrt{\log n/(nh)}\right),$$

and for odd j,

$$S_{n,j} = O_P\!\left(h + \sqrt{\log n/(nh)}\right)$$

holds uniformly in x. Therefore,

$$\frac1n D_x^T W_x D_x = \begin{pmatrix} f(x)\left\{1 + O_P\big(h^2 + \sqrt{\log n/(nh)}\big)\right\} & O_P\big(h + \sqrt{\log n/(nh)}\big) \\ O_P\big(h + \sqrt{\log n/(nh)}\big) & f(x)\mu_2\left\{1 + O_P\big(h^2 + \sqrt{\log n/(nh)}\big)\right\} \end{pmatrix}$$

holds uniformly in x.

Since $h + \sqrt{\log n/(nh)} = o_P(1)$, we can regard the above matrix as approximately diagonal.

Then its inverse is

$$\left\{\frac1n D_x^T W_x D_x\right\}^{-1} = \begin{pmatrix} \{f(x)\}^{-1}\left\{1 + O_P\big(h^2 + \sqrt{\log n/(nh)}\big)\right\} & O_P\big(h + \sqrt{\log n/(nh)}\big) \\ O_P\big(h + \sqrt{\log n/(nh)}\big) & \{f(x)\mu_2\}^{-1}\left\{1 + O_P\big(h^2 + \sqrt{\log n/(nh)}\big)\right\} \end{pmatrix}$$

holds uniformly in x.

Similarly, by Lemma 4.1 and the assumption of independence between the process εt and the process xt, it follows that

$$\frac1n D_x^T W_x F = \begin{pmatrix} O_P\big(h^2 + \sqrt{\log n/(nh)}\big) \\ O_P\big(h + \sqrt{\log n/(nh)}\big) \end{pmatrix}$$

holds uniformly in x.

Consequently,

$$[1, 0]\left\{\frac1n D_x^T W_x D_x\right\}^{-1}\left\{\frac1n D_x^T W_x F\right\} = \{f(x)\}^{-1}O_P\!\left(h^2 + \sqrt{\log n/(nh)}\right)\{1 + o_P(1)\} = o_P(1).$$

Substituting this result into the smoothing matrix S, we have

$$SF = \begin{pmatrix} [1, 0]\{D_{x_1}^T W_{x_1} D_{x_1}\}^{-1} D_{x_1}^T W_{x_1} F \\ \vdots \\ [1, 0]\{D_{x_n}^T W_{x_n} D_{x_n}\}^{-1} D_{x_n}^T W_{x_n} F \end{pmatrix} = \begin{pmatrix} o_P(1) \\ \vdots \\ o_P(1) \end{pmatrix}.$$

Thus,

$$F - SF = F\{1 + o_P(1)\}.$$

Finally, by the WLLN,

$$\frac1n F^T(I - S)^T(I - S)F = \left(\frac1n\sum_{i=1}^{n}f_i f_i^T\right)\{1 + o_P(1)\}^2 \stackrel{P}{\longrightarrow} E(ff^T).$$

Lemma 4.3. Under Conditions A—G, we have

$$\frac1n E^T(I - S)^T(I - S)E \stackrel{P}{\longrightarrow} E(ff^T).$$

Proof. Since Δ = E − F, the generic element of Δ is of the form m(x_t) − m̂_I(x_t), which is of order o_P(n^{−1/4}) uniformly in x by Condition F. Thus, Δ = o_P(n^{−1/4}). Therefore,

$$\frac1n E^T(I - S)^T(I - S)E = \frac1n(F + \Delta)^T(I - S)^T(I - S)(F + \Delta).$$

By an argument similar to that in the proof of Lemma 4.2, it can be shown that

$$\frac1n E^T(I - S)^T(I - S)E = \frac1n F^T(I - S)^T(I - S)F + o_P(1).$$

Thus, Lemma 4.3 follows by Lemma 4.2.

Lemma 4.4. Suppose that Conditions A—G hold. It follows that

$$\frac1n F^T(I - S)^T(I - S)m = o_P(1).$$

Proof It is noted that

$$\frac1n F^T(I - S)^T(I - S)m = \frac1n\sum_{i=1}^{n}\{f_i - (SF)_i\}\left[m(x_i) - [1, 0]\{D_{x_i}^T W_{x_i} D_{x_i}\}^{-1} D_{x_i}^T W_{x_i} m\right]. \qquad (4.2)$$

Similar to the argument in the proof of Lemma 4.2, we can show that

$$[1, 0]\left\{\frac1n D_x^T W_x D_x\right\}^{-1}\left\{\frac1n D_x^T W_x m\right\} = m(x)\left\{1 + O_P\big(h^2 + \sqrt{\log n/(nh)}\big)\right\}$$

holds uniformly in x ∈ Ω. Plugging this into (4.2), it follows that

$$\frac1n F^T(I - S)^T(I - S)m = \frac1n\sum_{i=1}^{n}\{f_i - (SF)_i\}\left[m(x_i) - m(x_i)\left\{1 + O_P\big(h^2 + \sqrt{\log n/(nh)}\big)\right\}\right] = \frac1n\sum_{i=1}^{n}f_i m(x_i)\{1 + o_P(1)\}\,O_P\!\left(h^2 + \sqrt{\log n/(nh)}\right).$$

Note that E{f_i m(x_i)} = 0 and the covariance matrix of f_i m(x_i) is finite. Thus, using $R = E(R) + O_P\big(\sqrt{\operatorname{Var}(R)}\big)$, it follows that $\frac1n F^T(I - S)^T(I - S)m = o_P(1)$.

Lemma 4.5. Under Conditions A—G, we have

$$\frac1n E^T(I - S)^T(I - S)m = o_P(1).$$

Proof. Since E = F + Δ, we can break $\frac1n E^T(I - S)^T(I - S)m$ into two terms: $\frac1n F^T(I - S)^T(I - S)m$, which is o_P(1) by Lemma 4.4, and $\frac1n \Delta^T(I - S)^T(I - S)m$, which is also o_P(1) since Δ = o_P(n^{−1/4}).

Lemma 4.6. Suppose that Conditions A—G hold. We have

$$\frac1n E^T(I - S)^T(I - S)\Delta\beta = o_P(1).$$

Proof This is a direct result from the proof of Lemma 4.3.

Lemma 4.7. Under Conditions A—G, let η = (η1, … , ηn)T . Then

$$\sqrt{n}\,[F^T(I - S)^T(I - S)F]^{-1}F^T(I - S)^T(I - S)\eta \stackrel{\mathcal L}{\longrightarrow} N\big(0, \sigma^2\{E(ff^T)\}^{-1}\big).$$

Proof We observe that

$$F^T(I - S)^T(I - S)\eta = \sum_{i=1}^{n}f_i\left[\eta_i - [1, 0]\{D_{x_i}^T W_{x_i} D_{x_i}\}^{-1} D_{x_i}^T W_{x_i}\eta\right]\{1 + o_P(1)\}. \qquad (4.3)$$

By using Lemma 4.1 on {xi, ηi}, we can show that

$$[1, 0]\left\{\frac1n D_x^T W_x D_x\right\}^{-1}\left\{\frac1n D_x^T W_x\eta\right\} = \{f(x)\}^{-1}O_P\!\left(h^2 + \sqrt{\log n/(nh)}\right)\{1 + o_P(1)\} = o_P(1).$$

Then $\eta_i - [1, 0]\{D_{x_i}^T W_{x_i} D_{x_i}\}^{-1} D_{x_i}^T W_{x_i}\eta = \eta_i\{1 + o_P(1)\}$. Plugging this into (4.3), we obtain

$$F^T(I - S)^T(I - S)\eta = \sum_{i=1}^{n}f_i\eta_i\{1 + o_P(1)\}.$$

Note that E(f_iη_i) = 0, Var(f_iη_i) = σ²E(ff^T) < ∞, and E(f_iη_i η_j f_j^T) = 0 for i ≠ j, since η_i is independent of f_i. By the central limit theorem for strictly stationary sequences (see Theorem 2.21 of Fan and Yao, 2003),

$$\frac1{\sqrt n}F^T(I - S)^T(I - S)\eta \stackrel{\mathcal L}{\longrightarrow} N\big(0, \sigma^2 E(ff^T)\big).$$

By Lemma 4.2, $\frac1n F^T(I - S)^T(I - S)F \stackrel{P}{\longrightarrow} E(ff^T)$. Applying Slutsky's theorem, it follows that

$$\sqrt{n}\,[F^T(I - S)^T(I - S)F]^{-1}F^T(I - S)^T(I - S)\eta \stackrel{\mathcal L}{\longrightarrow} N\big(0, \sigma^2\{E(ff^T)\}^{-1}\big).$$

Lemma 4.8. Under Conditions A—G, we have

$$\sqrt{n}\,[E^T(I - S)^T(I - S)E]^{-1}E^T(I - S)^T(I - S)\eta \stackrel{\mathcal L}{\longrightarrow} N\big(0, \sigma^2\{E(ff^T)\}^{-1}\big).$$

Proof. Since E = F + Δ, we may write $E^T(I - S)^T(I - S)\eta = F^T(I - S)^T(I - S)\eta + \Delta^T(I - S)^T(I - S)\eta$. Since Δ = o_P(n^{−1/4}) by Condition F, it can be shown that

$$\frac1{\sqrt n}\Delta^T(I - S)^T(I - S)\eta = o_P(1).$$

Furthermore, we have shown in the previous lemma that $\frac1{\sqrt n}F^T(I - S)^T(I - S)\eta \stackrel{\mathcal L}{\longrightarrow} N\big(0, \sigma^2 E(ff^T)\big)$. So $\frac1{\sqrt n}E^T(I - S)^T(I - S)\eta \stackrel{\mathcal L}{\longrightarrow} N\big(0, \sigma^2 E(ff^T)\big)$ as well. The proof is completed by Slutsky's theorem and Lemma 4.3.

Proof of Theorem 1. Let us first show the asymptotic normality of β̂. According to the expression for β̂ in (2.5), we can break $\sqrt n(\hat\beta - \beta)$ into the sum of the following three terms (a), (b) and (c):

$$\begin{aligned} (a)\ & \sqrt n\left[\{E^T(I - S)^T(I - S)E\}^{-1}E^T(I - S)^T(I - S)m\right],\\ (b)\ & \sqrt n\left[\{E^T(I - S)^T(I - S)E\}^{-1}E^T(I - S)^T(I - S)\Delta\beta\right],\\ (c)\ & \sqrt n\left[\{E^T(I - S)^T(I - S)E\}^{-1}E^T(I - S)^T(I - S)\eta\right]. \end{aligned}$$

Term (a) is a product of the two factors $[\{E^T(I - S)^T(I - S)E\}/n]^{-1}$ and $[E^T(I - S)^T(I - S)m/n]$. From Lemmas 4.3 and 4.5, the asymptotic properties of these two factors lead to the conclusion that (a) = o_P(1). Similarly, applying Lemmas 4.3 and 4.6 to the two factors of term (b) gives (b) = o_P(1) as well. In addition, Lemma 4.8 states that term (c) converges in distribution to N(0, σ²{E(ff^T)}^{−1}). Putting the three terms together, we obtain the asymptotic distribution of β̂.

Next we derive the asymptotic bias and variance of m̂(·). From Lemmas 4.1–4.8, we have

$$\hat m(x_0, \hat\beta) = [1, 0]\{D_{x_0}^T W_{x_0} D_{x_0}\}^{-1} D_{x_0}^T W_{x_0}(m + \eta)\{1 + o_P(1)\}.$$

Note that E(η|χ) = 0, where χ = (x_1, …, x_n). Thus,

$$E\{\hat m(x_0, \hat\beta)\,|\,\chi\} = [1, 0]\{D_{x_0}^T W_{x_0} D_{x_0}\}^{-1} D_{x_0}^T W_{x_0} m\{1 + o_P(1)\},$$

which is the same as the conditional mean of the local linear regression estimator derived in Fan and Gijbels (1992). So the asymptotic bias is $\frac12 m''(x_0)h^2\int x^2 K(x)\,dx$.

Regarding the asymptotic variance of m̂(·), conditional on x_1, …, x_n,

$$\operatorname{Var}\{\hat m(x_0, \hat\beta)\,|\,\chi\} = [1, 0]\{D_{x_0}^T W_{x_0} D_{x_0}\}^{-1} D_{x_0}^T W_{x_0}\operatorname{Var}(\eta)W_{x_0} D_{x_0}\{D_{x_0}^T W_{x_0} D_{x_0}\}^{-1}[1, 0]^T.$$

Using the same argument as that in the proof of Lemma 4.2, we have

$$\operatorname{Var}\{\hat m(x_0, \hat\beta)\,|\,\chi\} = \frac{\sigma^2}{nhf(x_0)}\int K^2(x)\,dx.$$

As to the asymptotic normality,

$$\hat m(x_0, \hat\beta) - E\{\hat m(x_0, \hat\beta)\,|\,\chi\} = [1, 0]\{D_{x_0}^T W_{x_0} D_{x_0}\}^{-1} D_{x_0}^T W_{x_0}\eta\{1 + o_P(1)\}.$$

Thus, conditional on χ, the asymptotic normality can be established using the CLT, since the η_i are independent and identically distributed with mean zero and variance σ².

5 Discussions

In this paper, we have proposed a new estimation procedure for the nonparametric regression model with AR errors using the profile least squares technique, and we have further proposed to determine the order of the AR error process using the penalized profile least squares with the SCAD penalty. We studied the asymptotic properties of the proposed estimators and established their asymptotic normality. We conducted extensive Monte Carlo simulation studies to examine the finite sample performance of the proposed procedures and to compare them with the proposal of Xiao et al. (2003).

It is of interest to theoretically compare the asymptotic mean squared errors of the proposed procedures and of the local linear estimator that does not take the error correlation into account. It is also of interest to investigate the effect of misspecification of the error model. These questions require further research.

Acknowledgments

This article is dedicated to the 30th Anniversary of the Institute of Applied Mathematics, Chinese Academy of Sciences, Beijing. The study and work experience at the Institute has been an invaluable treasure in the first author's life. Runze Li would like to thank the faculty members, staff and friends of the Institute for their support during his stay. The authors are very grateful to Professor Jia-An Yan for his invitation. Runze Li's research was supported by National Institute on Drug Abuse grant R21 DA024260, and Yan Li was supported by National Science Foundation grant DMS 0348869 as a graduate research assistant.

Contributor Information

Runze Li, Department of Statistics and The Methodology Center, Pennsylvania State University, University Park, PA 16802-2111, USA, rli@stat.psu.edu.

Yan Li, Cards Acquisitions, Capital One Financial Inc, 1680 Capital One Dr, 19050-0701, McLean, VA 22102, USA, maggie.li@capitalone.com.

References

1. Altman NS. Kernel Smoothing of Data with Correlated Errors. Journal of the American Statistical Association. 1990;85:749–759.
2. Brockwell PJ, Davis RA. Time Series: Theory and Methods. Springer; New York: 1991.
3. Engle RF, Granger CWJ, Rice J, Weiss A. Semiparametric estimates of the relation between weather and electricity sales. J Amer Stat Assoc. 1986;81:310–320.
4. Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Chapman and Hall; London: 1996.
5. Fan J, Huang T. Profile Likelihood Inferences on Semiparametric Varying-coefficient Partially Linear Models. Bernoulli. 2005;11:1031–1059.
6. Fan J, Huang T, Li R. Analysis of Longitudinal Data with Semiparametric Estimation of Covariance Function. Journal of the American Statistical Association. 2007;102:632–641.
7. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
8. Fan J, Li R. New Estimation and Model Selection Procedures for Semiparametric Modeling in Longitudinal Data Analysis. Journal of the American Statistical Association. 2004;99:710–723.
9. Fan J, Yao Q. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer-Verlag; New York: 2003.
10. Gu C. Smoothing Spline ANOVA Models. Springer-Verlag; New York: 2002.
11. Hart JD. Kernel Regression Estimation with Time Series Errors. Journal of the Royal Statistical Society, Series B. 1991;53:173–187.
12. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50:1029–1054.
13. Heckman N. Spline smoothing in partly linear models. J Royal Stat Soc Ser B. 1986;48:244–248.
14. Liang K, Zeger S. Longitudinal Data-Analysis Using Generalized Linear-Models. Biometrika. 1986;73:13–22.
15. Lin X, Carroll R. Nonparametric Function Estimation for Clustered Data When the Predictor is Measured without/with Error. Journal of The American Statistical Association. 2000;95:520–534.
16. Opsomer J. Estimating a Function by Local Linear Regression when the Errors are Correlated. Preprint 95-42, Department of Statistics, Iowa State University; 1995.
17. Opsomer J, Wang Y, Yang Y. Nonparametric Regression with Correlated Errors. Statistical Science. 2001;16:134–153.
18. Qu A, Lindsay B, Li B. Improving Generalized Estimating Equations Using Quadratic Inference Functions. Biometrika. 2000;87:823–836.
19. Ruppert D, Sheather SJ, Wand MP. An Effective Bandwidth Selector for Local Least Squares Regression. Journal of American Statistical Association. 1995;90:1257–1270.
20. Severini TA, Staniswalis JG. Quasi-likelihood estimation in semiparametric models. J Amer Stat Assoc. 1994;89:501–511.
21. Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman and Hall; London: 1986.
22. Speckman P. Kernel smoothing in partial linear models. Journal Royal Statist Soc B. 1988;50:413–436.
23. Wahba G. Partial spline models for semiparametric estimation of functions of several variables. Statist Analysis of Time Ser; Proceedings of the Japan U.S Joint Seminar; Tokyo. 1984. pp. 319–329.
24. Wang H, Li R, Tsai C-L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568.
25. Wang N. Marginal nonparametric kernel regression accounting within-subject correlation. Biometrika. 2003;90:29–42.
26. Xiao Z, Linton O, Carroll RJ, Mammen E. More Efficient Local Polynomial Estimation in Nonparametric Regression with Autocorrelated Errors. Journal of the American Statistical Association. 2003;98:980–992.
27. Zeger S, Liang K. Longitudinal Data-Analysis for Discrete and Continuous Outcomes. Biometrics. 1986;42:121–130.