


J Am Stat Assoc. Author manuscript; available in PMC 2010 May 5.

Published in final edited form as:

J Am Stat Assoc. 2009 September 1; 104(487): 993–1014.

doi: 10.1198/jasa.2009.tm07543

PMCID: PMC2864536

NIHMSID: NIHMS184134

Raymond J. Carroll

Department of Statistics, 3143 TAMU, Texas A&M University, College Station, Texas 77843, USA

Aurore Delaigle

Department of Mathematics, University of Bristol, Bristol BS8 1TW, UK

Department of Mathematics and Statistics, University of Melbourne, VIC, 3010, Australia

Peter Hall

Department of Mathematics and Statistics, University of Melbourne, VIC, 3010, Australia

Department of Statistics, University of California at Davis, Davis, CA 95616, USA


Predicting the value of a variable *Y* corresponding to a future value of an explanatory variable *X*, based on a sample of previously observed independent data pairs (*X*_{1}, *Y*_{1}), …, (*X*_{n}, *Y*_{n}) distributed like (*X*, *Y*), is a problem of long standing in statistics.

We consider prediction in a problem of nonparametric errors-in-variables regression. In the classical errors-in-variables context, the data consist of a sample of independent and identically distributed observations (*W*_{i}, *Y*_{i}), where *W*_{i} = *X*_{i} + *U*_{i} and the errors *U*_{i} all have the same distribution.

In empirical applications, however, the above model can be too restrictive, because individuals are not necessarily observed in similar conditions. For example, the data may have been collected from different laboratories (see National Research Council, 1993), and future observations may come from yet another laboratory. In such cases, the data are a sample of independent observations (*W*_{i}, *Y*_{i}), 1 ≤ *i* ≤ *n*, generated by the model

$${Y}_{i}=g\left({X}_{i}\right)+{\u220a}_{i},\phantom{\rule{1em}{0ex}}{W}_{i}={X}_{i}+{U}_{i},$$

(1.1)

where each *W*_{i} represents a contaminated version of an unobserved variable *X*_{i}, and the errors *U*_{1}, …, *U*_{n} are independent but not necessarily identically distributed.

Heteroscedasticity in the errors arises in many different ways and has been treated by several authors. See, for example, Devanarayana and Stefanski (2002), Kulathinal et al. (2002), Thamerus (2003), Cheng and Riu (2006), Delaigle and Meister (2007), Staudenmayer et al. (2008) and Delaigle and Meister (2008). In some contexts, it is reasonable to assume that there is only a small number of different error densities. In other cases of interest, the error densities could reasonably all come from the same parametric family and differ only through the parameters of their distributions. Indeed, it is commonly assumed that all errors are centered normal random variables. See, for example, Cook and Stefanski (1994), Carroll et al. (1999), Berry et al. (2002), Devanarayana and Stefanski (2002), Kulathinal et al. (2002), Staudenmayer and Ruppert (2004) and Staudenmayer et al. (2008).

The work in this paper was originally motivated by applications where the errors in the sample *W*_{1}, …, *W*_{n} are of only two types, and the error on future observations is of one of these two types. To fix ideas, suppose the data have been rearranged such that, for *i* = 1, …, *m*, the error *U*_{i} has density *f*_{U(1)}, and, for *i* = *m* + 1, …, *n*, it has density *f*_{U(2)}.

There is an extensive literature on estimation of a regression curve from contaminated data sets. A contemporary introduction to this problem is provided by Carroll et al. (2006), and recent contributions include those of Kim and Gleser (2000), Stefanski (2000), Taupin (2001), Linton and Whang (2002), Schennach (2004a,b) and Huang et al. (2006). Nonparametric estimation of a regression curve without contamination is a much older problem, treated in monographs such as those by Wand and Jones (1995) and Simonoff (1996).

An outline of this paper is as follows. In Section 2, when the measurement error densities are known, we describe estimators of the target μ(*t*). In Section 3, we show how to extend these methods to the case where the measurement error densities are unknown. Section 4 gives the rates of convergence of our estimators, and in particular discusses cases where our estimators achieve parametric rates of convergence. Section 5 gives numerical results, both in simulations and in a nutritional epidemiology context.

Suppose we have a sample of data (*W*_{i}, *Y*_{i}), 1 ≤ *i* ≤ *n*, generated by the model (1.1), and let *T* = *X* + *U*^{F} denote the contaminated covariate of a future observation, where the error *U*^{F} has density ${f}_{{U}^{F}}$. Our goal is to estimate

$$\mu \left(t\right)=E(Y\mid T=t)=\int y{f}_{T,Y}(t,y)dy\u2215{f}_{T}\left(t\right).$$

(2.1)

The task seems challenging, as we need to estimate *T*-related quantities from a sample of *W*-related quantities. The relationship between *T* and each *W*_{j}, however, when expressed in terms of their characteristic functions, is relatively simple. Let ${f}_{V}^{\mathrm{Ft}}$ denote the characteristic function of the distribution of a random variable *V*. By Fourier inversion we can write

$${f}_{T}\left(x\right)=\frac{1}{2\pi}\int {e}^{-itx}{f}_{T}^{\mathrm{Ft}}\left(t\right)dt,$$

(2.2)

where we can write

$${f}_{T}^{\mathrm{Ft}}\left(t\right)={f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){n}^{-1}\sum _{j}\{{f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right)\u2215{f}_{{U}_{j}}^{\mathrm{Ft}}\left(t\right)\}.$$

(2.3)
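To see where (2.3) comes from, note that independence of *X*_{j} and *U*_{j} in (1.1) gives ${f}_{{W}_{j}}^{\mathrm{Ft}}={f}_{X}^{\mathrm{Ft}}{f}_{{U}_{j}}^{\mathrm{Ft}}$, so that ${f}_{X}^{\mathrm{Ft}}\left(t\right)={f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right)\u2215{f}_{{U}_{j}}^{\mathrm{Ft}}\left(t\right)$ for every *j* with ${f}_{{U}_{j}}^{\mathrm{Ft}}\left(t\right)\ne 0$. Averaging these *n* identities and using *T* = *X* + *U*^{F}, with *U*^{F} independent of *X*, yields

$${f}_{T}^{\mathrm{Ft}}\left(t\right)={f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){f}_{X}^{\mathrm{Ft}}\left(t\right)={f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){n}^{-1}\sum _{j}\{{f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right)\u2215{f}_{{U}_{j}}^{\mathrm{Ft}}\left(t\right)\},$$

which is (2.3).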

Based on these considerations, we show below how to estimate ${f}_{T}^{\mathrm{Ft}}$ and *f _{T}*. Then, we construct an estimator of the numerator of (2.1), and finally obtain an estimator of μ.

The simplest device to obtain a consistent estimator of *f*_{T} is to replace ${f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right)$ in (2.3) by its unbiased estimator *e*^{itWj}. However, (2.3) involves dividing by each individual ${f}_{{U}_{j}}^{\mathrm{Ft}}\left(t\right)$, which may vanish or take very small values. A more stable, equivalent expression, which replaces the individual divisors by a single aggregated one, is

$${f}_{T}^{\mathrm{Ft}}\left(t\right)={f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right)\sum _{j}{f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right){f}_{{U}_{j}}^{\mathrm{Ft}}(-t)\u2215\sum _{k}{\mid {f}_{{U}_{k}}^{\mathrm{Ft}}\left(t\right)\mid}^{2}=\sum _{j}{f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right){\Psi}_{j}\left(t\right){f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right).$$

Now, replacing ${f}_{{W}_{j}}^{\mathrm{Ft}}\left(t\right)$ by its unbiased estimate *e*^{itWj}, we can estimate ${f}_{T}^{\mathrm{Ft}}\left(t\right)$ from the data (*W*_{1}, *Y*_{1}), …, (*W _{n}, Y_{n}*) by

$${\widehat{f}}_{T}^{\mathrm{Ft}}\left(t\right)=\sum _{j=1}^{n}{e}^{it{W}_{j}}{\Psi}_{j}\left(t\right){f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right).$$

(2.4)

We shall show in Section 4 that this procedure leads to optimal estimators of μ.
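As a concrete illustration of (2.4), the sketch below evaluates ${\widehat{f}}_{T}^{\mathrm{Ft}}$ for a hypothetical data set in which the errors are centered normal in two groups; the group sizes, standard deviations and all function names are our own assumptions, not part of the paper. Since every characteristic function equals 1 at *t* = 0, the estimator satisfies ${\widehat{f}}_{T}^{\mathrm{Ft}}\left(0\right)=1$, which serves as a quick sanity check.

```python
import numpy as np

def psi(t, cf_errors):
    # Weights Psi_j(t) = conj(f_{U_j}^Ft(t)) / sum_k |f_{U_k}^Ft(t)|^2.
    vals = np.array([cf(t) for cf in cf_errors])
    return np.conj(vals) / np.sum(np.abs(vals) ** 2)

def fhat_T_ft(t, W, cf_errors, cf_future):
    # Estimator (2.4): sum_j exp(i t W_j) Psi_j(t) f_{U^F}^Ft(t).
    return np.sum(np.exp(1j * t * W) * psi(t, cf_errors)) * cf_future(t)

# Hypothetical two-group normal-error example (all values are illustrative).
rng = np.random.default_rng(0)
n, m = 200, 120
sigma1, sigma2 = 0.3, 0.5
X = rng.uniform(0.0, 1.0, n)
U = np.concatenate([sigma1 * rng.standard_normal(m),
                    sigma2 * rng.standard_normal(n - m)])
W = X + U
cf1 = lambda t: np.exp(-0.5 * sigma1**2 * t**2)   # char. fn. of a N(0, sigma1^2) error
cf2 = lambda t: np.exp(-0.5 * sigma2**2 * t**2)   # char. fn. of a N(0, sigma2^2) error
cf_errors = [cf1] * m + [cf2] * (n - m)
print(abs(fhat_T_ft(0.0, W, cf_errors, cf1)))     # ~1: all characteristic functions equal 1 at t = 0
```

In practice one evaluates ${\widehat{f}}_{T}^{\mathrm{Ft}}$ on a grid of *t* values and inverts numerically, as in (2.2).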

If ${f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){\Sigma}_{j}{\Psi}_{j}\left(t\right)\in {L}_{1}$ we can obtain an estimator of the denominator of (2.1) by plugging (2.4) into (2.2), which gives $\widehat{{f}_{T}}\left(x\right)={\Sigma}_{j}{f}_{T,j}(x-{W}_{j})$, where we employed the notation

$${f}_{T,j}\left(x\right)=\frac{1}{2\pi}\int {e}^{-itx}{f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){\Psi}_{j}\left(t\right)dt.$$

(2.5)

Using a similar approach, we estimate the numerator of (2.1) by ${\Sigma}_{j}{Y}_{j}{f}_{T,j}(x-{W}_{j})$, and hence we define our estimator of μ by

$$\widehat{\mu}\left(x\right)=\frac{\sum _{j}{Y}_{j}{f}_{T,j}(x-{W}_{j})}{\sum _{j}{f}_{T,j}(x-{W}_{j})}.$$

(2.6)

When ${f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){\Sigma}_{j}{\Psi}_{j}\left(t\right)\notin {L}_{1}$ we need to regularize ${\widehat{f}}_{T}^{\mathrm{Ft}}$ before plugging it into (2.2). This challenge is also encountered in relatively classical deconvolution problems. We use a kernel approach to regularize this problem. Methodology based on another nonparametric technique, such as splines, orthogonal series or the ridge technique of Hall and Meister (2007), could also be developed. While those methods might be competitive with the kernel approach, the latter benefits from being relatively accessible to asymptotic analysis. Let *K* be a kernel function with Fourier transform *K*^{Ft}, and let h > 0 be a bandwidth. Assuming that ${K}^{\mathrm{Ft}}\left(t\right){f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h){\Psi}_{j}(t\u2215h)\in {L}_{1}$, our regularized estimator of *f _{T}* is ${\tilde{f}}_{T}\left(x\right)={h}^{-1}{\Sigma}_{j}{K}_{T,j}\{(x-{W}_{j})\u2215h\}$, where

$${K}_{T,j}\left(x\right)=\frac{1}{2\pi}\int {e}^{-itx}{K}^{\mathrm{Ft}}\left(t\right){f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h){\Psi}_{j}(t\u2215h)dt.$$

Proceeding as above, we define our estimator of μ by

$$\tilde{\mu}\left(x\right)=\frac{\sum _{j}{Y}_{j}{K}_{T,j}\left(\frac{x-{W}_{j}}{h}\right)}{\sum _{j}{K}_{T,j}\left(\frac{x-{W}_{j}}{h}\right)}\cdot $$

(2.7)

In the two-error model which motivated our work, the estimators of μ are particularly simple. To keep the same notation as in the introduction, assume that the first *m* observations are contaminated by an error with density *f*_{U(1)}, that the last *n* − *m* are contaminated by an error with density *f*_{U(2)}, and that the error in the future observation is *U*^{F} ~ *f*_{U(1)}. Define ${q}^{\mathrm{Ft}}={f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\u2215{f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}$.

If 1/*q*^{Ft} ∈ *L*_{1}, we use the estimator at (2.6), which becomes

$$\widehat{\mu}\left(x\right)=\frac{\sum _{j=1}^{m}{Y}_{j}{f}_{q,1}(x-{W}_{j})+\sum _{j=m+1}^{n}{Y}_{j}{f}_{q,2}(x-{W}_{j})}{\sum _{j=1}^{m}{f}_{q,1}(x-{W}_{j})+\sum _{j=m+1}^{n}{f}_{q,2}(x-{W}_{j})},$$

(2.8)

where we used the notations ${f}_{q,1}\left(x\right)={\left(2\pi \right)}^{-1}\int {e}^{-itx}{\{m+(n-m){\mid {q}^{\mathrm{Ft}}\left(t\right)\mid}^{2}\}}^{-1}dt$ and ${f}_{q,2}\left(x\right)={\left(2\pi \right)}^{-1}\int {e}^{-itx}{q}^{\mathrm{Ft}}(-t){\{m+(n-m){\mid {q}^{\mathrm{Ft}}\left(t\right)\mid}^{2}\}}^{-1}dt$.

If 1/*q*^{Ft} ∉ *L*_{1}, we use the estimator at (2.7), which becomes

$$\tilde{\mu}\left(x\right)=\frac{\sum _{j=1}^{m}{Y}_{j}{K}_{q,1}\left(\frac{x-{W}_{j}}{h}\right)+\sum _{j=m+1}^{n}{Y}_{j}{K}_{q,2}\left(\frac{x-{W}_{j}}{h}\right)}{\sum _{j=1}^{m}{K}_{q,1}\left(\frac{x-{W}_{j}}{h}\right)+\sum _{j=m+1}^{n}{K}_{q,2}\left(\frac{x-{W}_{j}}{h}\right)},$$

(2.9)

where ${K}_{q,1}\left(x\right)={\left(2\pi \right)}^{-1}\int {e}^{-itx}{K}^{\mathrm{Ft}}\left(t\right){\{m+(n-m){\mid {q}^{\mathrm{Ft}}(t\u2215h)\mid}^{2}\}}^{-1}dt$ and ${K}_{q,2}\left(x\right)={\left(2\pi \right)}^{-1}\int {e}^{-itx}{K}^{\mathrm{Ft}}\left(t\right){q}^{\mathrm{Ft}}(-t\u2215h){\{m+(n-m){\mid {q}^{\mathrm{Ft}}(t\u2215h)\mid}^{2}\}}^{-1}dt$. Note that, in the particular case where all observations are contaminated by the error density *f*_{U(1)}, we have *m* = *n*, and the estimator at (2.9) is nothing more than the usual Nadaraya-Watson estimator, exactly as in the error-free case; see Wand and Jones (1995, p. 119).
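For concreteness, here is a small numerical sketch of (2.9), written by us under the assumption of centered normal error densities and the compactly supported kernel Fourier transform *K*^{Ft}(*t*) = (1 − *t*²)³ on [−1, 1]; the Fourier inversions are done with a simple Riemann sum, and all parameter values are illustrative. The final check exploits the reduction to the Nadaraya-Watson estimator when *m* = *n*.

```python
import numpy as np

T_GRID = np.linspace(-1.0, 1.0, 2001)   # support of K^Ft
DT = T_GRID[1] - T_GRID[0]

def kft(t):
    # Compactly supported kernel Fourier transform, K^Ft(t) = (1 - t^2)^3 on [-1, 1].
    return np.where(np.abs(t) <= 1.0, (1.0 - t**2)**3, 0.0)

def inv_fourier(fvals, x):
    # (2 pi)^{-1} \int e^{-itx} f(t) dt, by a Riemann sum over T_GRID; real part.
    integrand = np.exp(-1j * np.outer(np.atleast_1d(x), T_GRID)) * fvals
    return np.real(integrand.sum(axis=1)) * DT / (2.0 * np.pi)

def mu_tilde(x, W, Y, m, h, sigma1, sigma2):
    # Estimator (2.9) with U^(1) ~ N(0, sigma1^2), U^(2) ~ N(0, sigma2^2), so that
    # q^Ft(t) = exp{-(sigma2^2 - sigma1^2) t^2 / 2} is real and symmetric.
    n = len(W)
    qft = np.exp(-0.5 * (sigma2**2 - sigma1**2) * (T_GRID / h)**2)
    denom = m + (n - m) * np.abs(qft)**2
    u = (x - W) / h
    Kq1 = inv_fourier(kft(T_GRID) / denom, u)          # K_{q,1}{(x - W_j)/h}
    Kq2 = inv_fourier(kft(T_GRID) * qft / denom, u)    # K_{q,2}; q^Ft(-t/h) = q^Ft(t/h) here
    Kj = np.where(np.arange(n) < m, Kq1, Kq2)
    return np.sum(Y * Kj) / np.sum(Kj)

# Sanity check: with m = n, K_{q,1} = K/n and mu_tilde is the Nadaraya-Watson estimator.
rng = np.random.default_rng(1)
n = m = 50
h = 0.3
X = rng.uniform(0.0, 1.0, n)
W = X + 0.2 * rng.standard_normal(n)
Y = np.sin(2.0 * X) + 0.1 * rng.standard_normal(n)
x0 = 0.5
Ku = inv_fourier(kft(T_GRID), (x0 - W) / h)            # plain kernel K at (x0 - W_j)/h
nw = np.sum(Y * Ku) / np.sum(Ku)
print(mu_tilde(x0, W, Y, m, h, 0.2, 0.2), nw)
```

In practice *h* would be chosen by a data-driven rule such as cross-validation; the fixed value here is purely illustrative.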

In the terminology of nonparametric deconvolution, the smoothness of an error (or error density) is usually described in terms of the speed of convergence to zero of its characteristic function in the tails: the faster, the smoother. See Section 4.2 for discussion. In this context, roughly speaking, 1/*q*^{Ft} ∈ *L*_{1} implies that *f*_{U(1)} is smoother than *f*_{U(2)}. For example, if *f*_{U(1)} and *f*_{U(2)} are normal densities with mean zero and variances ${\sigma}_{1}^{2}$ and ${\sigma}_{2}^{2}$, respectively, then 1/*q*^{Ft} ∈ *L*_{1} if ${\sigma}_{1}^{2}>{\sigma}_{2}^{2}$, and 1/*q*^{Ft} ∉ *L*_{1} if ${\sigma}_{1}^{2}\le {\sigma}_{2}^{2}$. The condition ${f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right){\Sigma}_{j}{\Psi}_{j}\left(t\right)\in {L}_{1}$ can be understood in a related manner.

The estimators presented above are an extension of the Nadaraya-Watson estimator, which is nothing more than a local constant estimator appropriate for error-free data. Recently, Delaigle, Fan and Carroll (2008) solved the long-open problem of developing local polynomial estimators for errors-in-variables problems. Using their technique we can give a local polynomial version of our estimator $\tilde{\mu}$. More precisely, we define a *p*th order local polynomial estimator of μ by ${\tilde{\mu}}_{p}\left(x\right)=(1,0,\dots ,0){\widehat{\mathbf{S}}}_{\mathbf{n}}^{-1}{\widehat{\mathbf{T}}}_{\mathbf{n}}$, where ${\widehat{\mathbf{S}}}_{\mathbf{n}}={\left({\widehat{S}}_{n,j+l}\left(x\right)\right)}_{0\le j,l\le p}$ and ${\widehat{\mathbf{T}}}_{\mathbf{n}}={({\widehat{T}}_{n,0}\left(x\right),\dots ,{\widehat{T}}_{n,p}\left(x\right))}^{T}$ with

$${\widehat{S}}_{n,k}\left(x\right)=\sum _{j=1}^{n}{K}_{U,k,j;h}({W}_{j}-x)\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\widehat{T}}_{n,k}\left(x\right)=\sum _{j=1}^{n}{Y}_{j}{K}_{U,k,j;h}({W}_{j}-x),$$

where ${K}_{U,k,j;h}\left(x\right)={h}^{-1}{K}_{U,k,j}(x\u2215h)$ and

$${K}_{U,k,j}\left(x\right)={i}^{-k}\frac{1}{2\pi}\int {e}^{-itx}{\left({K}^{\mathrm{Ft}}\right)}^{\left(k\right)}\left(t\right){f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h){\Psi}_{j}(t\u2215h)dt.$$
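In particular, taking *k* = 0 above recovers the local constant weights, since *i*^{−0} = 1 and (*K*^{Ft})^{(0)} = *K*^{Ft}:

$${K}_{U,0,j}\left(x\right)=\frac{1}{2\pi}\int {e}^{-itx}{K}^{\mathrm{Ft}}\left(t\right){f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h){\Psi}_{j}(t\u2215h)dt={K}_{T,j}\left(x\right),$$

so the local polynomial estimator with *p* = 0 coincides with the local constant estimator $\tilde{\mu}$ at (2.7).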

Compared to local constant estimators (*p* = 0), local polynomial estimators for *p* > 0 have the advantage of being less biased in the presence of boundary points. On the other hand, in practice, increasing the value of *p* usually leads to an increase of variability, and using values of *p* larger than 1 is rarely useful unless the interest is in estimating derivatives of the curve μ. This, however, is usually not the case in the prediction problem.

It is straightforward to extend the local-constant methodology of Delaigle, Hall and Meister (2008) for the case of unknown error distributions to the context of general local polynomial estimators. Properties are similar, too. For example, convergence rates in the case of local linear methods are identical, under regularity conditions discussed towards the end of Section 3.1, to those for the local constant estimators treated in this paper.

There are many examples where it is too restrictive to assume that the error densities are completely known, and in such cases, these densities have to be estimated from the data. For a long time this problem was essentially ignored in the nonparametric literature, where the error distributions were systematically assumed to be known. Recently, however, several authors have shown that this problem can be tackled if every observation is replicated at least once. References include Schennach (2004a,b), Delaigle, Hall and Meister (2008) and Hu and Schennach (2008). In the next section we discuss parametric and nonparametric methods for error density estimation in the two-error model treated in Section 2.2. A procedure for the general model is given in Section 3.2.

In the two-population case, if we do not have enough data to estimate the error densities, and if *m* is not particularly small, then a consistent estimator of μ can be obtained by taking *n* = *m*, i.e. discarding all observations contaminated by *f*_{U(2)} and using the standard Nadaraya-Watson estimator. See also Remark 3.1, below. The most interesting setting is undoubtedly that where it is possible to estimate the error densities, as it is in this case that the estimator of μ will enjoy the fastest rates of convergence; see Section 4.

As discussed in the introduction, a large literature on measurement errors assumes that the errors are normal. More generally, the errors could belong to some parametric family, not necessarily normal. There, if we have a parametric model *f*_{U(1)}(·|θ_{1}) (respectively, *f*_{U(2)}(·|θ_{2})) that is identifiable from data on *U* − *U*′, where *U* and *U*′ denote two independent random variables with density *f*_{U(1)} (respectively, *f*_{U(2)}), then θ_{1} and θ_{2} can be estimated from a sample of replicated data, i.e. a sample of the form (*W*_{ij}, *Y*_{i}), *j* = 1, 2, *i* = 1, …, *n*, generated by the model

$${Y}_{i}=g\left({X}_{i}\right)+{\u220a}_{i},\phantom{\rule{1em}{0ex}}{W}_{ij}={X}_{i}+{U}_{ij},$$

(3.1)

where, for *i* = 1, …, *m*, *U*_{ij} ~ *f*_{U(1)}(·|θ_{1}), and, for *i* = *m* + 1, …, *n*, *U*_{ij} ~ *f*_{U(2)}(·|θ_{2}).

In some cases it can happen that we have no suitable parametric model for the error densities. If the characteristic functions of the error densities are positive and symmetric, as is the case for many common densities, then they can be estimated nonparametrically along the lines of Delaigle, Hall and Meister (2008). More precisely, ${f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}$ and ${f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}$, in (2.8) and (2.9), are estimated by, respectively, ${\widehat{f}}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}\left(t\right)={\mid {m}^{-1}{\sum}_{j=1}^{m}\mathrm{cos}\left\{t({W}_{j1}-{W}_{j2})\right\}\mid}^{1\u22152}$ and ${\widehat{f}}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\left(t\right)={\mid {(n-m)}^{-1}{\sum}_{j=m+1}^{n}\mathrm{cos}\left\{t({W}_{j1}-{W}_{j2})\right\}\mid}^{1\u22152}$. We then replace ${f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}$ and ${f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}$ by ${\widehat{f}}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}$ and ${\widehat{f}}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}$ in the numerators of *f*_{q,1}, *f*_{q,2}, *K*_{q,1} and *K*_{q,2}, but in the denominators, to avoid division by zero, we replace $m{\mid {f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}\mid}^{2}+(n-m){\mid {f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\mid}^{2}$ by $m{\mid {\widehat{f}}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}\mid}^{2}+(n-m){\mid {\widehat{f}}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\mid}^{2}+r$, with *r* > 0; see Delaigle, Hall and Meister (2008). More general settings are considered by Li and Vuong (1998), Schennach (2004a,b) and Hu and Schennach (2008).
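The difference-based estimator of ${f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}$ above is easy to compute from replicated measurements. The sketch below (our own illustration; the normal-error check and all parameter values are assumptions) verifies on simulated data that it recovers exp(−σ²*t*²/2) for a N(0, σ²) error:

```python
import numpy as np

def cf_from_replicates(t, W1, W2):
    # W_{j1} - W_{j2} = U_{j1} - U_{j2}, whose characteristic function is |f_U^Ft(t)|^2
    # for symmetric errors, so |mean_j cos{t (W_{j1} - W_{j2})}|^{1/2} estimates |f_U^Ft(t)|.
    return np.abs(np.mean(np.cos(t * (W1 - W2)))) ** 0.5

rng = np.random.default_rng(0)
sigma, n = 0.4, 50000
X = rng.uniform(0.0, 1.0, n)
W1 = X + sigma * rng.standard_normal(n)   # first replicate of each X_j
W2 = X + sigma * rng.standard_normal(n)   # second replicate
print(cf_from_replicates(1.0, W1, W2))    # close to exp(-sigma^2/2) for this normal error
```

Note that the unknown *X*_{j} cancels in the difference *W*_{j1} − *W*_{j2}, which is why replication makes the error distribution identifiable.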

Convergence rates in the unknown error case, and in the setting of classical errors-in-variables problems, have been given by Delaigle, Hall and Meister (2008). The results there state that, if the characteristic function of the unknown error distribution is estimated using a difference-based method, then the convergence rate is the same as in the setting of a known error distribution, provided the density of *X* is sufficiently smooth relative to the error density. This is also true in the prediction problem treated in the present paper.

When *m* is large relative to *n* − *m*, and the error densities cannot be well estimated, a classical Nadaraya-Watson estimator of μ, based on (*W*_{1}, *Y*_{1}), …, (*W*_{m}, *Y*_{m}), is likely to perform better than our estimator. For example, this could happen if the error densities had to be estimated nonparametrically from a small number of individuals for which there were replicated observations. In such cases, a conservative approach would be to use the Nadaraya-Watson estimator.

Before we show how to construct a consistent estimator in the general context of the model (1.1), it is important to realize that, whatever approach we take, in order for the function μ to be identifiable we need to be able to consistently estimate the error density *f*_{UF} of the future observations. See Section 3.1 for a discussion on how to estimate an error density. We assume that sufficient effort has been made by the experimenters to collect data permitting the construction of a consistent estimator ${\widehat{f}}_{{U}^{F}}$ of *f*_{UF}. If *f*_{UF} cannot be estimated then the prediction problem is not identifiable.

As in Section 3.1, for simplicity we address the case where there are just two replicated measurements of *X*_{i} for each *i*, that is, *W*_{ij} = *X*_{i} + *U*_{ij} for *j* = 1, 2.

Let ${\stackrel{\u2012}{W}}_{i}=({W}_{i1}+{W}_{i2})\u22152$, and note that ${f}_{X}^{\mathrm{Ft}}\left(t\right)={f}_{{\stackrel{\u2012}{W}}_{j}}^{\mathrm{Ft}}\left(t\right)\u2215{\left\{{f}_{{U}_{j}}^{\mathrm{Ft}}(t\u22152)\right\}}^{2}=\Phi \left(t\right){\Sigma}_{j}{f}_{{\stackrel{\u2012}{W}}_{j}}^{\mathrm{Ft}}\left(t\right)$, where we used the notation $\Phi \left(t\right)=1\u2215{\Sigma}_{k}{\left\{{f}_{{U}_{k}}^{\mathrm{Ft}}(t\u22152)\right\}}^{2}$. Replacing the unknown Φ(*t*)^{−1} by the estimator $\widehat{\Phi}{\left(t\right)}^{-1}={\Sigma}_{k}\mathrm{exp}(\mathit{it}({W}_{k,1}-{W}_{k,2})\u22152)$, and proceeding as in Section 2, we obtain the following versions of (2.6) and (2.7):

$${\widehat{\mu}}^{\ast}\left(x\right)=\frac{\sum _{j=1}^{n}{Y}_{j}{f}_{T}^{\ast}(x-{\stackrel{\u2012}{W}}_{j})}{\sum _{j=1}^{n}{f}_{T}^{\ast}(x-{\stackrel{\u2012}{W}}_{j})},\phantom{\rule{1em}{0ex}}{\tilde{\mu}}^{\ast}\left(x\right)=\frac{\sum _{j=1}^{n}{Y}_{j}{K}_{T}^{\ast}\left(\frac{x-{\stackrel{\u2012}{W}}_{j}}{h}\right)}{\sum _{j=1}^{n}{K}_{T}^{\ast}\left(\frac{x-{\stackrel{\u2012}{W}}_{j}}{h}\right)},$$

with ${f}_{T}^{\ast}\left(x\right)={\left(2\pi \right)}^{-1}\int {e}^{-itx}{\widehat{f}}_{{U}_{F}}^{\mathrm{Ft}}\left(t\right)\u2215\{{\widehat{\Phi}}^{-1}\left(t\right)+r\}dt$, and with

$${K}_{T}^{\ast}\left(x\right)={\left(2\pi \right)}^{-1}\int {e}^{-itx}{K}^{\mathrm{Ft}}\left(t\right){\widehat{f}}_{{U}_{F}}^{\mathrm{Ft}}(t\u2215h)\u2215\{{\widehat{\Phi}}^{-1}(t\u2215h)+r\}dt,$$

where, as before, *r* > 0 is introduced to avoid division by zero.

The properties of the estimator $\widehat{\mu}$ at (2.6) are clear. In particular, it is easy to check that the numerator and the denominator are unbiased estimators of μ(*t*)*f*_{T}(*t*) and *f*_{T}(*t*), respectively.

To study asymptotic properties of our estimator, we assume that:

$${\u220a}_{1},\dots ,{\u220a}_{n}\phantom{\rule{thickmathspace}{0ex}}\text{have zero means and uniformly bounded variances};$$

(4.1)

$$\begin{array}{l}K\text{ is symmetric},\;\mid K\left(x\right)\mid \le {C}_{3}{(1+\mid x\mid )}^{-k-1-{C}_{4}}\text{ for an integer }k\ge 2\text{ and for constants }{C}_{3},{C}_{4}>0,\\ \int {u}^{j}K\left(u\right)du=0\text{ for }1\le j\le k-1,\;\mathrm{sup}\mid {K}^{\mathrm{Ft}}\mid <\infty ,\;{K}^{\mathrm{Ft}}\left(0\right)>0,\\ \text{and, for some }{C}_{5}>0,\;{K}^{\mathrm{Ft}}\left(t\right)=0\text{ for all }\mid t\mid >{C}_{5},\end{array}$$

(4.2)

$${f}_{X},{f}_{{U}_{1}},\dots ,{f}_{{U}_{n}},{f}_{{U}^{F}}\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}g\phantom{\rule{thickmathspace}{0ex}}\text{are bounded, and}\phantom{\rule{thickmathspace}{0ex}}{f}_{X}\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}g\phantom{\rule{thickmathspace}{0ex}}\text{have}\phantom{\rule{thickmathspace}{0ex}}k\phantom{\rule{thickmathspace}{0ex}}\text{bounded derivatives};$$

(4.3)

$${\mathrm{sup}}_{k}{\mid {f}_{{U}_{k}}^{\mathrm{Ft}}\mid}^{2}\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}\mid {f}_{{U}^{F}}^{\mathrm{Ft}}\mid \phantom{\rule{thickmathspace}{0ex}}\text{are bounded and, for all t},\sum _{k}\mid {f}_{{U}_{k}}^{\mathrm{Ft}}\left(t\right)\mid >0.$$

(4.4)

$$h\to 0\phantom{\rule{thickmathspace}{0ex}}\text{as}\phantom{\rule{thickmathspace}{0ex}}n\to \infty \phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}n\u2215v\left(h\right)\to \infty \phantom{\rule{thickmathspace}{0ex}}\text{as}\phantom{\rule{thickmathspace}{0ex}}n\to \infty ,$$

(4.5)

where we defined

$$v\left(h\right)=n{h}^{-1}\int {\mid {K}^{\mathrm{Ft}}\left(t\right)\mid}^{2}{\mid {f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h)\mid}^{2}\u2215\sum _{k=1}^{n}{\mid {f}_{{U}_{k}}^{\mathrm{Ft}}(t\u2215h)\mid}^{2}dt.$$

(4.6)

Assumptions such as these are fairly standard in the nonparametric regression literature. Condition (4.1) is mild, and the smoothness of the various curves in (4.3) is imposed only to determine the order of the bias of the estimator, which depends on *k*. In particular, *k* is generally not a tuning parameter, and in empirical examples, where the smoothness of the curves is usually unknown, it is common to set *k* = 2. It is well known that, in practice, larger values of *k* increase the variability of estimators and usually make them unattractive; see, for example, Marron and Wand (1992). Of course, as in the standard error-free case, our results can be extended to cases where *f*_{X} and *g* have discontinuity points, with the obvious modifications to the bias. Condition (4.2) only concerns the kernel (which we can choose) and is satisfied by the kernels used in deconvolution problems. Condition (4.4) is a weaker version of standard conditions usually imposed in deconvolution problems (see, e.g., Fan, 1991), since, in our case, the characteristic functions of the errors are permitted to vanish. Condition (4.5) is a generalization of the standard condition *nh* → ∞ imposed in the error-free case, but it looks more complicated here because the variance of the estimator is of order *v*(*h*)/*n* rather than 1/(*nh*).
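For instance, if all the errors, including the future one, have the same distribution, so that ${f}_{{U}_{k}}^{\mathrm{Ft}}={f}_{{U}^{F}}^{\mathrm{Ft}}$ for every *k*, then, provided ${f}_{{U}^{F}}^{\mathrm{Ft}}$ does not vanish on the support of *K*^{Ft}, (4.6) gives

$$v\left(h\right)=n{h}^{-1}\int {\mid {K}^{\mathrm{Ft}}\left(t\right)\mid}^{2}\frac{{\mid {f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h)\mid}^{2}}{n{\mid {f}_{{U}^{F}}^{\mathrm{Ft}}(t\u2215h)\mid}^{2}}dt={h}^{-1}\int {\mid {K}^{\mathrm{Ft}}\left(t\right)\mid}^{2}dt,$$

so that *v*(*h*)/*n* is of order 1/(*nh*) and (4.5) reduces to the familiar condition *nh* → ∞.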

The asymptotic behavior of the estimator is described in the next theorem.

*Assume* (4.1)–(4.5). *Then, for each t such that f*_{T}(*t*) > 0,

$$\tilde{\mu}\left(t\right)=\mu \left(t\right)+{O}_{p}\left({\{v\left(h\right)\u2215n\}}^{1\u22152}+{h}^{k}\right).$$

(4.7)

Precise rates of convergence of the estimator depend on the behavior of the ratio $Q\left(t\right)\equiv n{\mid {f}_{{U}^{F}}^{\mathrm{Ft}}\left(t\right)\mid}^{2}\u2215{\Sigma}_{k}{\mid {f}_{{U}_{k}}^{\mathrm{Ft}}\left(t\right)\mid}^{2}$ in the tails. It is not possible here to consider every possible combination of error types. To get some insight into these results, consider the situation where, for all *t*, ${\Sigma}_{k}{\mid {f}_{{U}_{k}}^{\mathrm{Ft}}\left(t\right)\mid}^{2}>n\xi \left(t\right)$, where ξ is a continuous, strictly positive function. Then, if *Q*(*t*) = *o*(|*t*|^{−1}) as *t* → ∞, by taking *h* = *O*(*n*^{−1/(2k)}) the estimator converges at the fast parametric *n*^{−1/2} rate. When *Q*(*t*) = *O*(1) and |*t*|*Q*(*t*) → ∞ as *t* → ∞, the estimator converges at a rate that lies between *n*^{−1/2} and the classical nonparametric rate *n*^{−k/(2k+1)}, and in other cases it converges more slowly than *n*^{−k/(2k+1)}. An intuitive explanation of the occurrence of fast parametric rates will be given in Remark 4.2.

In the two-error problem described in Section 2.2, it is possible to provide a more detailed study of the convergence rates of the estimator $\tilde{\mu}$, which, in this case, reduces to (2.9). Precise rates of convergence depend on the values of *m* and *n*, and on the behavior of *q*^{Ft} in the tails, which itself is dictated by the behaviors of the characteristic functions ${f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}$ and ${f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}$ of the errors. In the measurement error literature it is well known that convergence rates of nonparametric estimators depend heavily on the behavior of the characteristic function of the error in the tails. This tail behavior is usually referred to as the smoothness of the error, and it is standard to divide the error distributions into two quite different categories, called ordinary smooth and supersmooth in the terminology of Fan (1991). The error densities *f*_{U(1)} and *f*_{U(2)} are ordinary smooth of orders β and α, respectively, if they satisfy, for positive constants ${C}_{1}<{C}_{1}^{\prime}$ and ${C}_{2}^{\prime}<{C}_{2}$ and for all *t*,

$$\begin{array}{c}{C}_{2}^{\prime}{(1+\mid t\mid )}^{-\beta}\le \mid {f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}\left(t\right)\mid \le {C}_{2}{(1+\mid t\mid )}^{-\beta},\\ {C}_{1}{(1+\mid t\mid )}^{-\alpha}\le \mid {f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\left(t\right)\mid \le {C}_{1}^{\prime}{(1+\mid t\mid )}^{-\alpha}.\end{array}$$

(4.8)

An error density *f*_{U} is supersmooth of order β > 0 if it satisfies, for positive constants γ, *D*_{1} < *D*_{2} and for all *t*,

$${D}_{1}\phantom{\rule{thickmathspace}{0ex}}\mathrm{exp}(-{\mid t\mid}^{\beta}\u2215\gamma )\le \mid {f}_{U}^{\mathrm{Ft}}\left(t\right)\mid \le {D}_{2}\phantom{\rule{thickmathspace}{0ex}}\mathrm{exp}(-{\mid t\mid}^{\beta}\u2215\gamma ).$$

(4.9)

For simplicity of presentation we give our main results under the assumption that, for constants α, β, *C*_{1}, *C*_{2} > 0 and all real *t*,

$${C}_{1}{(1+\mid t\mid )}^{-\alpha}\le \mid {f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\left(t\right)\mid ,\phantom{\rule{1em}{0ex}}\mid {f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}\left(t\right)\mid \le {C}_{2}{(1+\mid t\mid )}^{-\beta}.$$

(4.10)

Obviously, ordinary smooth errors satisfy both inequalities, but supersmooth errors also satisfy the second inequality for any β > 0.

Define $\delta =\alpha -\beta +{\scriptstyle \frac{1}{2}}$ and, denoting the indicator function by 1_{{·}}, let

$${v}_{1}\left(h\right)={h}^{-2\delta}\cdot {1}_{\{\delta >0\}}+\mid \mathrm{log}h\mid \cdot {1}_{\{\delta =0\}}+{1}_{\{\delta <0\}}.$$

(4.11)

Assume that, as *n* → ∞,

$$h\to 0,n\u2215{v}_{1}\left(h\right)\to \infty \phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}mh\to \infty .$$

(4.12)

The asymptotic behavior of $\tilde{\mu}$ is described in the next theorem.

*Assume* (4.1)–(4.4) and (4.10)–(4.12). *Then, for each t such that f*_{T} (*t*) > 0,

$$\tilde{\mu}\left(t\right)=\mu \left(t\right)+{O}_{P}({h}^{k}+\mathrm{min}\{{\left(mh\right)}^{-1\u22152},{(n-m)}^{-1\u22152}{v}_{1}{\left(h\right)}^{1\u22152}\}).$$

(4.13)

This result shows that our estimator $\tilde{\mu}$ converges at a rate at least as fast as the Nadaraya-Watson estimator of μ based on direct data from (*T*, *Y*), i.e. on the first *m* observations (*W*_{1}, *Y*_{1}), …, (*W*_{m}, *Y*_{m}). Indeed, the Nadaraya-Watson estimator is optimized when taking *h* ~ const. *m*^{−1/(2k+1)}, for which it converges at the rate *m*^{−k/(2k+1)}. Of course, the cases where our estimator converges faster than the Nadaraya-Watson estimator depend on the relative sizes of *m* and *n*, but also on the relative smoothness of the two errors, as we show below.

The most favorable situation is clearly the one where δ < 0. This includes the case where *f*_{U(1)} and *f*_{U(2)} are ordinary smooth of orders β and α, respectively, with $\beta >\alpha +{\scriptstyle \frac{1}{2}}$, but it also includes the case where, simultaneously, *f*_{U(1)} is supersmooth of any order, and *f*_{U(2)} is ordinary smooth of any order. When *m* = *o*{(*n* − *m*)^{(2k+1)/2k}}, we obtain the very fast parametric rate (*n* − *m*)^{−1/2}, by taking *h* = *O*{(*n* − *m*)^{−1/(2k)}}. In particular, when *m* and *n* − *m* are of the same order, or *m* = *o*(*n*), the estimator converges at the rate *n*^{−1/2}. When *m* ≠ *o*{(*n* − *m*)^{(2k+1)/2k}}, i.e. when there are many more data contaminated by *f*_{U(1)} than by *f*_{U(2)}, then, logically, the estimator converges at the same rate as the Nadaraya-Watson estimator based on the *m* first data points. More precisely, when *h* ~ const. *m*^{−1/(2k+1)} the estimator converges at the rate *m*^{−k/(2k+1)}.

The case where δ > 0 is more involved since, there, the term (*n* − *m*)^{−1/2}*v*_{1}(*h*)^{1/2} in (4.13) is only an upper bound to the contribution of the data (*W*_{i}, *Y*_{i}) for *i* = *m*+1, …, *n*, and precise characterization of convergence rates can be obtained only at the expense of more precise characterization of *q*^{Ft}. We shall assume that the errors are ordinary smooth of orders β and α, as defined at (4.8). Under this assumption, the convergence rate of the estimator is exactly of the order given in the theorem. The optimal bandwidth is thus of order *h* ~ const. *m*^{−1/(2k+1)} when (*n* − *m*)^{(2k+1)/(2k+2δ)} = *o*(*m*), and the estimator then converges at the rate *m*^{−k/(2k+1)}. When *m* = *o*{(*n*−*m*)^{(2k+1)/(2k+2δ)}}, the optimal bandwidth is of order *h* ~ const. (*n*−*m*)^{−1/(2k+2δ)} and the estimator converges at rate (*n*−*m*)^{−k/(2k+2δ)}.

Calculation of the rates for $\tilde{\mu}$ can be extended to cases more general than (4.10). For example, it can be shown that the convergence rates are of order min{*m*^{−k/(2k+1)}, (*n* − *m*)^{−1/2}} whenever *t*^{1/2}/*q*^{Ft}(*t*) < const. as *t* → ∞; they are of order min{*m*^{−k/(2k+1)}, *B*(*n*)}, where *B*(*n*) → 0 as *n* → ∞ at a speed similar to typical deconvolution rates (see e.g. Fan and Truong, 1993), when 1/*q*^{Ft}(*t*) → ∞ as *t* → ∞; and they are of an order between min{*m*^{−k/(2k+1)}, *B*(*n*)} and min{*m*^{−k/(2k+1)}, (*n*−*m*)^{−1/2}} in all other cases.

The very fast parametric rate noted above may appear counter-intuitive. It can be understood from the fact that, since we know that *f*_{T} = *f*_{X} ∗ *f*_{U^F}, the future error itself smooths the target: when *f*_{U^F} is sufficiently smooth relative to the errors in the data, the convolution with *f*_{U^F} offsets the deconvolution needed to remove those errors.

When the covariate *X* is observed without error, for past and future observations, instead of applying standard nonparametric estimators of *E*(*Y*|*X*), which only converge at the rate *n*^{−k/(2k+1)}, it may seem to be a better idea to artificially add random noise *U* ~ *f*_{U} to the future observed value of *X*, with *f*_{U} such that ${f}_{U}^{\mathrm{Ft}}\in {L}_{1}$, and predict *Y* by an estimator of *E*(*Y*|*T*), where *T* = *X* + *U*. Indeed, this would correspond to our model, in the situation where $1\u2215{q}^{\mathrm{Ft}}\left(t\right)={f}_{U}^{\mathrm{Ft}}\in {L}_{1}$, in which case we can use $\widehat{\mu}$, which converges at an *n*^{−1/2} rate. However, it is not clear that, despite the convergence rate, this would lead to better prediction of *Y* than the error-free estimator of *E*(*Y*|*X*), since *Y* is more dispersed around *E*(*Y*|*T*) than around *E*(*Y*|*X*), because *T* exhibits larger errors than *X*.

Here we indicate that in the ordinary smooth case, the convergence rates given by Theorem 4.2 are optimal when δ > 0. A simpler argument can be used to demonstrate optimality when δ < 0, and similar methods can be employed to verify optimality in supersmooth settings. We prove results only in the two-error case, but similar techniques can be used to show that, under regularity conditions, our estimator is also optimal in the general setting of model (1.1).

Let *f*_{U(2)} and *f*_{U(1)} denote symmetric densities for which

$$\underset{\mid t\mid \to \infty}{\mathrm{lim}\phantom{\rule{thinmathspace}{0ex}}\mathrm{sup}}\phantom{\rule{thickmathspace}{0ex}}\underset{j=0,1,2}{\mathrm{max}}{(1+\mid t\mid )}^{\alpha +j}\mid \frac{{d}^{j}}{d{t}^{j}}{f}_{{U}^{\left(2\right)}}^{\mathrm{Ft}}\left(t\right)\mid <\infty ,$$

(4.14)

$$\underset{\mid t\mid \to \infty}{\mathrm{lim}\phantom{\rule{thinmathspace}{0ex}}\mathrm{inf}}\phantom{\rule{thickmathspace}{0ex}}\underset{j=0,1,2}{\mathrm{min}}{(1+\mid t\mid )}^{\beta +j}\mid \frac{{d}^{j}}{d{t}^{j}}{f}_{{U}^{\left(1\right)}}^{\mathrm{Ft}}\left(t\right)\mid >0.$$

(4.15)

Given −∞ < *a* < *b* < ∞, *C* > 0 and an integer *k* ≥ 1, write $\mathcal{F}(a,b,C,k)$ for the class of densities *f _{XY}* of (*X*, *Y*).

Assume that (4.14) and (4.15) hold, and δ > 0. Then, for each real number *w*, there exists a constant *c* > 0 such that

$$\underset{n\to \infty}{\mathrm{lim}\phantom{\rule{thinmathspace}{0ex}}\mathrm{inf}}\phantom{\rule{thickmathspace}{0ex}}\underset{\stackrel{\u02d8}{\mu}\in \mathcal{C}}{\mathrm{inf}}\phantom{\rule{thickmathspace}{0ex}}\underset{{f}_{XY}\in \mathcal{F}(a,b,C,k)}{\mathrm{sup}}P\left[\mid \stackrel{\u02d8}{\mu}\left(w\right)-\mu \left(w\right)\mid >c\phantom{\rule{thickmathspace}{0ex}}\mathrm{min}\left\{{(n-m)}^{-k\u2215(2k+2\delta )},{m}^{-k\u2215(2k+1)}\right\}\right]>0.$$

(4.16)

These rates correspond exactly to those in Theorem 4.2. They show that the rate at (4.16) is achieved by the estimator $\tilde{\mu}$. Although our upper bound giving this rate was derived only for a particular fixed distribution, that bound is readily established uniformly over a function class for which (4.16) holds.

We applied our estimators in the particular case where the observations are contaminated by only two types of errors, which is the setting of our empirical example. Recall that we defined two estimators, at (2.8) and (2.9). The first, $\widehat{\mu}$, exists only when 1/*q*^{Ft} is integrable, and is simpler to calculate: it requires neither a bandwidth nor a kernel, and our numerical work showed that it systematically outperformed $\tilde{\mu}$; we therefore do not present $\tilde{\mu}$ in the cases where $\widehat{\mu}$ exists. We write $\widehat{\mu}\ast $ and $\tilde{\mu}\ast $ for the versions of $\widehat{\mu}$ and $\tilde{\mu}$, respectively, with the error variances estimated from replicated data as in Section 3, and ${\widehat{\mu}}_{\mathrm{NW}}$ for the classical Nadaraya-Watson estimator calculated from the data (*W _{i}, Y_{i}*), *i* = 1, …, *m*.
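
For reference, a Nadaraya-Watson estimate with a ridged denominator can be sketched as follows. The Gaussian kernel, the default ridge value and the function name are our own illustrative choices, not the paper's:

```python
import math

def nadaraya_watson(x, w, y, h, rho=1e-3):
    """Nadaraya-Watson estimate of E(Y|W=x) from pairs (w_i, y_i), with a
    Gaussian kernel, bandwidth h, and denominators smaller than the ridge
    rho replaced by rho (a simplified sketch of the ridging of Section 5)."""
    weights = [math.exp(-0.5 * ((x - wi) / h) ** 2) for wi in w]
    num = sum(ki * yi for ki, yi in zip(weights, y))
    den = sum(weights)
    return num / max(den, rho)
```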

We applied the various estimators introduced above to some simulated examples, corresponding to the following models, where we took the {∊*_{i}*} identically distributed as indicated below (we use Be(*p*) to denote the Bernoulli distribution with mean *p*):

- (*i*) $g\left(x\right)=3x+20\mathrm{exp}\{-100{(x-0.5)}^{2}\}\u2215\sqrt{2\pi}$, *X* ~ N(0.5, 1.0/3.92^{2}), ∊ ~ N(0, 0.673);
- (*ii*) *Y*|*X* = *x* ~ Be{*g*(*x*)}, *g*(*x*) = 0.45 sin(2π*x*) + 0.5 and *X* ~ U[0, 1];
- (*iii*) *g*(*x*) = sin(π*x*/2)/{1 + 2*x*^{2}(sgn *x* + 1)}, *X* ~ N(0, 1), ∊ ~ N(0, 0.09).
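
To make the simulation design concrete, here is a sketch of data generation for curve (*i*); the code is our own, we take both error densities normal with variance 0.2 var(*X*) (one of the configurations in Table 1), and we read N(0.5, 1.0/3.92^{2}) as variance (1/3.92)^{2}:

```python
import math
import random

def simulate_curve_i(n, m, seed=0):
    """Generate (W_i, Y_i) for curve (i): the first m observations are
    contaminated by U^(1) and the remaining n - m by U^(2); here both
    error densities are normal with variance 0.2 var(X), so the two
    error types coincide (an assumption made for this sketch)."""
    rng = random.Random(seed)
    sd_x = 1.0 / 3.92
    sd_u = math.sqrt(0.2 * sd_x ** 2)
    data = []
    for _ in range(n):
        x = rng.gauss(0.5, sd_x)
        g = 3 * x + 20 * math.exp(-100 * (x - 0.5) ** 2) / math.sqrt(2 * math.pi)
        y = g + rng.gauss(0, math.sqrt(0.673))   # eps ~ N(0, 0.673)
        w = x + rng.gauss(0, sd_u)               # contaminated covariate
        data.append((w, y))
    return data[:m], data[m:]
```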

In each case we took *U ^{F}* ~

In addition to the bandwidth *h*, necessary to calculate $\tilde{\mu}$ and the classical Nadaraya-Watson estimator ${\widehat{\mu}}_{\mathrm{NW}}$, all methods, including $\widehat{\mu}$, required the choice of a ridge parameter ρ, used in their denominators to avoid division by a number close to zero. For each method, at points *x* where the denominator of the estimator was smaller than ρ, we replaced it by ρ. For a given estimator μ^{est}, we selected (ρ, *h*) — or ρ alone for $\widehat{\mu}$ — by minimizing the following cross-validation (CV) criterion:

$$\mathrm{CV}=\sum _{j=1}^{m}{\{{Y}_{j}-{\mu}^{\mathrm{est},(-j)}\left({W}_{j}\right)\}}^{2},$$

(5.1)

where the superscript ^{(−j)} indicates that the estimator was constructed without using the *j*th observation.
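
The criterion (5.1) can be sketched as follows; `build_estimator` is a hypothetical callable that fits a curve for given smoothing parameters, not an interface from the paper:

```python
def cv_criterion(params, build_estimator, W, Y):
    """Leave-one-out cross-validation sum (5.1). build_estimator(params,
    W_minus_j, Y_minus_j) should return a function x -> estimate of
    E(Y|W=x); (rho, h), or rho alone, is chosen to minimise this sum."""
    total = 0.0
    for j in range(len(W)):
        W_j = W[:j] + W[j + 1:]          # drop the j-th observation
        Y_j = Y[:j] + Y[j + 1:]
        mu_hat = build_estimator(params, W_j, Y_j)
        total += (Y[j] - mu_hat(W[j])) ** 2
    return total
```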

Figure 1 and Table 1 compare, for various sample sizes, the results obtained for estimating curve (*i*) when *f*_{U(1)} was smoother than *f*_{U(2)}, with either both errors normal, or *f*_{U(1)} normal and *f*_{U(2)} Laplace. We compare ${\widehat{\mu}}_{\mathrm{NW}}$, $\widehat{\mu}$ and $\widehat{\mu}\ast $ (recall that the * versions of the estimators are used when the error variances are estimated from replicated observations). Our results show that the estimator $\widehat{\mu}$ outperforms ${\widehat{\mu}}_{\mathrm{NW}}$, and that $\widehat{\mu}\ast $ worked almost as well as $\widehat{\mu}$ itself, showing the limited loss incurred by estimating the error variances from replicated data. We also show the results obtained when *f*_{U(2)} was smoother than *f*_{U(1)}, with both errors normal, where we compare ${\widehat{\mu}}_{\mathrm{NW}}$, $\tilde{\mu}$, and $\tilde{\mu}\ast $. Although the new estimator still outperforms the Nadaraya-Watson estimator, here the gain is less impressive, as predicted by the theory.

Quartile curves for the estimation of curve (*i*) when *f*_{U(1)} ~ N, *f*_{U(2)} ~ Laplace, ${\sigma}_{{U}^{\left(1\right)}}^{2}={\sigma}_{{U}^{\left(2\right)}}^{2}=0.2\text{var}\left(X\right)$, *m* = *n*/2 = 125, using ${\widehat{\mu}}_{\mathrm{NW}}$ (left) or $\tilde{\mu}\ast $(right).

MISE for estimation of curve (*i*) when *f*_{U(1)}~ Normal (N) and *f*_{U(2)} ~ Laplace (L), with ${\sigma}_{{U}^{\left(1\right)}}^{2}$ = ${\sigma}_{{U}^{\left(2\right)}}^{2}$ = 0.2 var(*X*); *f*_{U(1)} and *f*_{U(2)} ~ N, with ${\sigma}_{U\left(1\right)}^{2}$ = $2{\sigma}_{U\left(2\right)}^{2}$ = 0.2 var(*X*); and *f*_{U(1)} and *f*_{U(2)} ~ N, with $2{\sigma}_{U\left(1\right)}^{2}$ **...**

Figure 2 and Table 2 show the results obtained for estimating curve (*ii*) when *f*_{U(1)} was smoother than *f*_{U(2)}, with both errors normal. We compare ${\widehat{\mu}}_{\mathrm{NW}}$, $\widehat{\mu}$ and $\widehat{\mu}\ast $ for different combinations of sample sizes. Of course, the situation where we can expect the largest gain by using the new estimator, compared to the classical Nadaraya-Watson estimator, is that where the size, *m*, of the sample of data contaminated by *f*_{U(1)} is as small as possible, relative to the total sample size, *n*. The results, however, indicate that, even when *m* = 5*n*/6, the gain can already be quite significant. In this example, $\widehat{\mu}\ast $ performed so well that it even bettered its known error version, $\widehat{\mu}$, in the majority of cases.

Quartile curves for the estimation of curve (*ii*) when *f*_{U(1)} ~ N, *f*_{U(2)} ~ N, ${\sigma}_{{U}^{\left(1\right)}}^{2}=2{\sigma}_{{U}^{\left(2\right)}}^{2}=0.2\text{var}\left(X\right)$ and *m* = *n*/2 = 250, using ${\widehat{\mu}}_{\mathrm{NW}}$ (left) or $\widehat{\mu}\ast $ (right).

MISE for estimation of curve (*ii*) when *f*_{U(1)} and *f*_{U(2)} ~ Normal with ${\sigma}_{U\left(1\right)}^{2}$ = $2{\sigma}_{U\left(2\right)}^{2}$ = 0.2 var(*X*).

Finally, Figure 3 compares the results obtained for estimating curve (*iii*) when both errors are normal with ${\sigma}_{{U}^{\left(1\right)}}^{2}=2{\sigma}_{{U}^{\left(2\right)}}^{2}=0.2\text{var}\left(X\right)$ and *m* = *n*/2 = 250. We show the quartile curves obtained for ${\widehat{\mu}}_{\mathrm{NW}}$ and $\widehat{\mu}\ast $. Again, we see the important gain that can be obtained by using the new estimator rather than the classical Nadaraya-Watson estimator, which uses only (*W*_{1}, *Y*_{1}), …, (*W*_{m}, *Y*_{m}).

Quartile curves for the estimation of curve (*iii*) when *f*_{U(1)} ~ N, *f*_{U(2)} ~ N, ${\sigma}_{{U}^{\left(1\right)}}^{2}=2{\sigma}_{{U}^{\left(2\right)}}^{2}=0.2\text{var}\left(X\right)$, *m* = *n*/2 = 250, using ${\widehat{\mu}}_{\mathrm{NW}}$ (left) or $\widehat{\mu}\ast $ (right).

In summary, our simulations showed that when *f*_{U(1)} was smoother than *f*_{U(2)}, the new estimator substantially outperformed the Nadaraya-Watson estimator. When *f*_{U(2)} was smoother than *f*_{U(1)}, the gain from using the new estimator was usually less impressive, unless *m* was relatively small, as predicted by the theoretical results in Section 4. The empirical applications we had in mind when developing the new estimators fall into the category where *f*_{U(1)} is smoother than *f*_{U(2)}; see Section 5.2.

In part, this paper arises from the following considerations. In nutritional epidemiology, the standard method for correcting for the effects of measurement error in evaluating diet-disease relationships is regression calibration (Carroll et al., 2006). Using our notation, the method works as follows. Let *N* be unobserved true long-term nutrient intake. The goal is to regress a response, say disease status *D*, on *N*. In the main study, nutrient intakes are measured by a single food frequency questionnaire (FFQ), which is what we call *W*. Here *X* is the long-term average intake as measured by the FFQ.

Because *N* is not observed, most nutritional epidemiology studies take a calibration random sub-sample of the main study population, generally much smaller than the main sample, where they typically measure repeated versions of *W*, in an effort to understand the measurement error properties of *W* in the sampled population. In addition, in the sub-sample, they observe an unbiased estimate *Y* of *N*. In regression calibration, instead of regressing *D* on the unobserved *N*, one regresses *D* on *E*(*Y*|*W*), where *E*(*Y*|*W*) is estimated from the observations in the sub-sample. Of course, one way to estimate *E*(*Y*|*W*) would be to use the classical Nadaraya-Watson estimator of *E*(*Y*|*W*) based on the direct observations on (*W, Y*), but our new approach can be used with the averaged replicated data to obtain a more efficient estimator of *E*(*Y*|*W*), as we illustrate below, on a calibration sub-study from the American Cancer Society Cancer Prevention Study II Nutrition Survey (ACS, Flagg et al., 2000).

The main study had approximately 185,000 adults, while the calibration sub-study was of size 598. In the calibration sub-study, several variables were measured, including *Y*, an average of protein intake from four food records, which is taken to be unbiased for usual intake *N*, and *W*, a log-transformed version of protein intake using a FFQ, which was measured twice with error approximately normal $\mathrm{N}(0,{\sigma}_{U}^{2})$. As above, *X* is the unobserved long-term average intake as measured by the FFQ. The data we considered were a sample of size *n* = 598 from (*W*_{i1}, *W*_{i2}, *Y _{i}*), for *i* = 1, …, *n*.
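
With two replicates *W*_{i1} = *X_i* + *U*_{i1} and *W*_{i2} = *X_i* + *U*_{i2}, the error variance ${\sigma}_{U}^{2}$ can be estimated by a standard moment device, since *W*_{i1} − *W*_{i2} = *U*_{i1} − *U*_{i2} has variance $2{\sigma}_{U}^{2}$. A minimal sketch of this kind of estimate, as used in Section 3 (the function name is ours; errors are assumed i.i.d. across replicates):

```python
def error_variance_from_replicates(w1, w2):
    """Moment estimator of sigma_U^2 from two replicated measurements:
    half the empirical variance of the within-pair differences."""
    n = len(w1)
    diffs = [a - b for a, b in zip(w1, w2)]
    mean_d = sum(diffs) / n
    return sum((d - mean_d) ** 2 for d in diffs) / (2 * (n - 1))
```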

This example is convenient for illustrating the various approaches to regression estimation, since the fact that we have direct data on the quantity of interest allows us to consider three different estimators: the new estimator $\widehat{\mu}\ast $; the classical Nadaraya-Watson estimator ${\widehat{\mu}}_{\mathrm{NW},\mathrm{dep}}$ of *E*(*Y*|*W*) based on the dependent data (*W*_{i1}, *Y _{i}*), for *i* = 1, …, *n*; and the classical Nadaraya-Watson estimator ${\widehat{\mu}}_{\mathrm{NW}}$.

Of course, here we do not know the true curve *E*(*Y*|*W*), so we cannot say which method gives the best estimator. However, the sample size is large, so one way to illustrate the performance of the procedures, in a way similar to a simulation study, is to create a large number (we took 500) of subsamples of smaller size (we took 30, 50, 75 and 100), and examine the variability of each method for each subsample size. It is not hard to show that, for our method, the squared bias is of smaller order than the variance, and since we do not know the true target, it thus seems appropriate to focus on variance. In Figure 4 we show the estimated curves for 15 subsamples of size 30 (respectively, 100), randomly selected from the 500 subsamples of that size. We see that, although all methods indicate the same trend for the relationship between *W* and *Y*, both versions of the Nadaraya-Watson estimator experience some difficulty, as some of the estimated curves are quite wiggly. To illustrate this further, in Table 3 we show, for each subsample size, the integrated variance of each method on the interval [1, 4] (calculated via the variance of the 500 replications in each case). The main message is the same: our method is less variable than both Nadaraya-Watson estimators, as predicted by the theory.
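
The integrated-variance summary used for Table 3 amounts to a Riemann sum of the pointwise variance across replications. A sketch (our own helper) for estimated curves evaluated on a common grid:

```python
def integrated_variance(estimates, grid_step):
    """Integrated variance over a grid: 'estimates' is a list of fitted
    curves, each a list of values on a common grid; the sample variance
    across curves at each grid point is summed and scaled by the grid
    spacing (a simple Riemann-sum approximation of the integral)."""
    n_curves = len(estimates)
    n_pts = len(estimates[0])
    total = 0.0
    for k in range(n_pts):
        vals = [est[k] for est in estimates]
        mean = sum(vals) / n_curves
        total += sum((v - mean) ** 2 for v in vals) / (n_curves - 1)
    return total * grid_step
```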

Estimation of μ for the American Cancer data, using the new estimator $\widehat{\mu}\ast $ (left), the classical NW estimator ${\widehat{\mu}}_{\mathrm{NW},\mathrm{dep}}$ (middle) with dependent data, or the classical NW estimator ${\widehat{\mu}}_{\mathrm{NW}}$ (right), for subsamples of size **...**

We have shown how to predict in errors-in-variables regression, when information from different sources (e.g. different laboratories) is combined, and the errors have different distributions. The methods that we suggest enjoy optimal accuracy, in the sense that the rates of convergence are best possible. However, those rates can vary particularly widely, from root-*n* rates when the problem is in effect semiparametric, to much slower rates that are characteristic of a genuinely nonparametric problem.

The problem turns out to be quite complex in other ways, too, and has a number of subtle features and apparent contradictions. For example, the results superficially suggest that on occasion it might even be beneficial to artificially add noise to some of the data. However, as explained in Remark 4.3, such a conclusion is unwarranted because it does not take account of the way in which adding noise would affect the conditioning step.

Our methods can also be applied in settings where the error distributions are not known and are instead estimated, for example from repeated measurements; see Section 3. The methodology developed there can be taken further. This, and other practically important variants of the problem, offer interesting avenues for further research.

- Berry S, Carroll RJ, Ruppert D. Bayesian Smoothing and Regression Splines for Measurement Error Problems. J. Amer. Statist. Assoc. 2002;97:160–169.
- Carroll RJ, Hall P. Optimal rates of convergence for deconvolving a density. J. Amer. Statist. Assoc. 1988;83:1184–1186.
- Carroll RJ, Maca JD, Ruppert D. Nonparametric regression in the presence of measurement error. Biometrika. 1999;86:541–554.
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. Second Edition Chapman and Hall CRC Press; 2006.
- Cheng C-L, Riu J. On estimating linear relationships when both variables are subject to heteroscedastic measurement errors. Technometrics. 2006;48:511–519.
- Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 1994;89:1314–1328.
- Delaigle A. Nonparametric density estimation from data with a mixture of Berkson and classical errors. Canad. J. Statist. 2007;35:89–104.
- Delaigle A, Fan J, Carroll RJ. Local polynomial estimator for the errors-in-variables problem. 2008. Submitted for publication.
- Delaigle A, Hall P, Meister A. On deconvolution with repeated measurements. Ann. Statist. 2008;36:665–685.
- Delaigle A, Meister A. Nonparametric regression estimation in the heteroscedastic errors-in-variables problem. J. Amer. Statist. Assoc. 2007;102:1416–1426.
- Delaigle A, Meister A. Density estimation with heteroscedastic error. Bernoulli. 2008;14:562–579.
- Devanarayan V, Stefanski LA. Empirical simulation extrapolation for measurement error models with replicate measurements. Statist. Probab. Lett. 2002;59:219–225.
- Fan J. On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist. 1991;19:1257–1272.
- Fan J, Truong YK. Nonparametric regression with errors in variables. Ann. Statist. 1993;21:1900–1925.
- Ferrari P, Kaaks R, Fahey M, Slimani N, Day NE, Pera G, Boshuizen HC, Roddam A, Boeing H, Nagel G, Thiebaut A, Orfanos P, Krogh P, Braaten T, Riboli E. Within- and between-cohort variation in measured macronutrient intakes, taking account of measurement errors, in the European Prospective Investigation into Cancer and Nutrition Study. American Journal of Epidemiology. 2004;160:814–822.
- Ferrari P, Day NE, Boshuizen HC, Roddam A, Hoffmann K, Thiebaut A, Pera G, Overvad K, Lund E, Trichopoulou A, Tumino R, Gullberg A, Norat T, Slimani N, Kaaks R, Riboli E. The evaluation of the diet/disease relation in the EPIC study: considerations for the calibration and the disease models. International Journal of Epidemiology. 2008. Advance Access published January 6, 2008.
- Flagg EW, Coates RJ, Calle EE, Potischman N, Thun M. Validation of the American Cancer Society Cancer Prevention Study II Nutrition Survey Cohort Food Frequency Questionnaire. Epidemiology. 2000;11:462–468.
- Ganse RA, Amemiya Y, Fuller WA. Prediction when both variables are subject to error, with application to earthquake magnitudes. J. Amer. Statist. Assoc. 1983;78:761–765.
- Hall P, Meister A. A ridge-parameter approach to deconvolution. Ann. Statist. 2007;35:1535–1558.
- Hu Y, Schennach SM. Identification and estimation of nonclassical nonlinear errors-in-variables models with continuous distributions. Econometrica. 2008;76:195–216.
- Huang XZ, Stefanski LA, Davidian M. Latent-model robustness in structural measurement error models. Biometrika. 2006;93:53–64.
- Kim J, Gleser LJ. SIMEX approaches to measurement error in ROC studies. Comm. Statist. Theory Meth. 2000;29:2473–2491.
- Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano R, Bingham S, Schoeller DA, Schatzkin A, Carroll RJ. The structure of dietary measurement error: results of the OPEN biomarker study. Amer. J. Epidemiology. 2003;158:14–21.
- Kulathinal SB, Kuulasmaa K, Gasbarra D. Estimation of an errors-in-variables regression model when the variances of the measurement errors vary between the observations. Statist. Medicine. 2002;21:1089–1101.
- Li T, Vuong Q. Nonparametric estimation of the measurement error model using multiple indicators. J. Multivariate Anal. 1998;65:139–165.
- Linton O, Whang YJ. Nonparametric estimation with aggregated data. Econometric Theory. 2002;18:420–468.
- Marron JS, Wand MP. Exact mean integrated squared error. Ann. Statist. 1992;20:712–736.
- National Research Council, Committee on Pesticides in the Diets of Infants and Children. Pesticides in the Diets of Infants and Children. National Academies Press; 1993.
- Schennach SM. Estimation of nonlinear models with measurement error. Econometrica. 2004a;72:33–75.
- Schennach SM. Nonparametric regression in the presence of measurement error. Econometric Theory. 2004b;20:1046–1093.
- Simonoff JS. Smoothing Methods in Statistics. Springer; New York: 1996.
- Staudenmayer J, Ruppert D. Local polynomial regression and simulation-extrapolation. J. Roy. Statist. Soc., Ser. B. 2004;66:17–30.
- Staudenmayer J, Ruppert D, Buonaccorsi J. Density estimation in the presence of heteroskedastic measurement error. J. Amer. Statist. Assoc. 2008 to appear.
- Stefanski LA. Measurement error models. J. Amer. Statist. Assoc. 2000;95:1353–1358.
- Stefanski LA, Carroll RJ. Deconvoluting kernel density estimators. Statistics. 1990;21:165–184.
- Taupin ML. Semi-parametric estimation in the nonlinear structural errors-in-variables model. Ann. Statist. 2001;29:66–93.
- Thamerus M. Fitting a mixture distribution to a variable subject to heteroscedastic measurement errors. Comput. Statist. 2003;18:1–17.
- Wand MP, Jones MC. Kernel Smoothing. Chapman and Hall; London: 1995.
