J Am Stat Assoc. Author manuscript; available in PMC 2010 May 5.
Published in final edited form as:
J Am Stat Assoc. 2009 September 1; 104(487): 993–1014.
doi:  10.1198/jasa.2009.tm07543
PMCID: PMC2864536
NIHMSID: NIHMS184134

Nonparametric Prediction in Measurement Error Models

Raymond J. Carroll
Department of Statistics, 3143 TAMU, Texas A&M University, College Station, Texas 77843, USA
Aurore Delaigle
Department of Mathematics, University of Bristol, Bristol BS8 1TW, UK
Department of Mathematics and Statistics, University of Melbourne, VIC, 3010, Australia

Abstract

Predicting the value of a variable $Y$ corresponding to a future value of an explanatory variable $X$, based on a sample of previously observed independent data pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ distributed like $(X, Y)$, is very important in statistics. In the error-free case, where $X$ is observed accurately, this problem is strongly related to that of standard regression estimation, since prediction of $Y$ can be achieved via estimation of the regression curve $E(Y \mid X)$. When the observed $X_i$'s and the future observation of $X$ are measured with error, prediction is of a quite different nature. Here, if $T$ denotes the future (contaminated) available version of $X$, prediction of $Y$ can be achieved via estimation of $E(Y \mid T)$. In practice, estimating $E(Y \mid T)$ can be quite challenging, as data may be collected under different conditions, making the measurement errors on $X_i$ and $X$ non-identically distributed. We take up this problem in the nonparametric setting and introduce estimators which allow a highly adaptive approach to smoothing. Reflecting the complexity of the problem, optimal rates of convergence of estimators can vary from the semiparametric $n^{-1/2}$ rate to much slower rates that are characteristic of nonparametric problems. Nevertheless, we are able to develop highly adaptive, data-driven methods that achieve very good performance in practice.

Keywords: Bandwidth, Contamination, Deconvolution, Errors-in-variables, Parametric rates, Regression, Ridge parameter, Smoothing

1 Introduction

We consider prediction in a problem of nonparametric errors-in-variables regression. In the classical errors-in-variables context, the data consist of a sample of independent and identically distributed observations $(W_i, Y_i)$, $i = 1, \ldots, n$, generated by the model $Y_i = g(X_i) + \epsilon_i$, $W_i = X_i + U_i$, where each $W_i$ represents a contaminated version of a variable $X_i$, $X_i$ and $U_i$ are independent, and, for $i = 1, \ldots, n$, $U_i$ has the distribution with density $f_U$, which we indicate by writing $U_i \sim f_U$. Nonparametric estimation of $g$ in this context is a difficult problem, for which optimal estimators converge at notoriously slow rates; see, e.g., Fan and Truong (1993). When the interest lies in predicting future values of $Y$, however, there is often no need to estimate the function $g$ explicitly. In particular, if future observations of $X$ are also measured with an error $U \sim f_U$, then it is rarely necessary to address the measurement error, as prediction of $Y$ can be achieved via estimation of $E(Y \mid X + U) = E(Y \mid W)$ by standard nonparametric regression estimators applied to the sample $(W_i, Y_i)$, $i = 1, \ldots, n$. See Carroll et al. (2006) for further discussion of this and related issues.

In empirical applications, however, the above model can be too restrictive, because individuals are not necessarily observed under similar conditions. For example, the data may have been collected from different laboratories (see National Research Council, 1993), and future observations may come from yet another laboratory. In such cases, the data are a sample of independent observations $(W_i, Y_i)$, $i = 1, \ldots, n$, generated by

$$Y_i = g(X_i) + \epsilon_i, \qquad W_i = X_i + U_i,$$
(1.1)

where each $W_i$ represents a contaminated version of a variable $X_i \sim f_X$, with error $U_i \sim f_{U_i}$, and where $\{X_i, U_i, \epsilon_i\}_{i=1,\ldots,n}$ are mutually independent, and future observations are of the type $T = X + U_F$, where $X$ and $U_F$ are independent, $U_F \sim f_{U_F}$ and $X$ has the same distribution as the $X_i$'s. As in the setting of the previous paragraph, since future values of $X$ are of the type $T = X + U_F$, nonparametric prediction of $Y$ can be achieved via nonparametric estimation of $\mu(t) = E(Y \mid T = t)$. Unlike in the previous paragraph, however, this cannot be done via standard nonparametric regression estimators. Indeed, given that $f_{U_1}, \ldots, f_{U_n}$ and $f_{U_F}$ can all be different, the $W_i$'s are not necessarily identically distributed, nor are they distributed like $T$. Despite this difficulty, the major purpose of this paper is to show that it is possible to estimate $\mu(t)$ nonparametrically. Moreover, the convergence rates of our estimator can be as fast as the parametric $n^{-1/2}$ rate. While we acknowledge that asymptotic convergence rates do not tell the entire story about the relative performance of estimators, and in particular that multiplicative constants can also be important, our numerical results indicate that our method also does very well in practice.
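To make the setting concrete, the following minimal sketch simulates data from model (1.1) with two sources of error; the function g, the distributions and all parameter values are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 250                        # sample size; first m points from source 1
g = lambda x: np.sin(np.pi * x / 2)    # an arbitrary regression function

X = rng.normal(0.0, 1.0, n)            # latent covariates, X_i ~ f_X
Y = g(X) + rng.normal(0.0, 0.3, n)     # responses, Y_i = g(X_i) + eps_i

sd1, sd2 = 0.5, 0.3                    # two error scales: U_i ~ f_{U_i}
U = np.concatenate([rng.normal(0.0, sd1, m), rng.normal(0.0, sd2, n - m)])
W = X + U                              # observed covariates, W_i = X_i + U_i

# A future observation is T = X + U_F, with U_F distributed like source 1:
T = rng.normal(0.0, 1.0) + rng.normal(0.0, sd1)
```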

Heteroscedasticity in the errors arises in many different ways and has been treated by several authors. See, for example, Devanarayan and Stefanski (2002), Kulathinal et al. (2002), Thamerus (2003), Cheng and Riu (2006), Delaigle and Meister (2007), Staudenmayer et al. (2008) and Delaigle and Meister (2008). In some contexts, it is reasonable to assume that there is only a small number of different error densities. In other cases of interest, the error densities could reasonably all come from the same parametric family and differ only through the parameters of their distributions. Indeed, it is commonly assumed that all errors are centered normal random variables. See, for example, Cook and Stefanski (1994), Carroll et al. (1999), Berry et al. (2002), Devanarayan and Stefanski (2002), Kulathinal et al. (2002), Staudenmayer and Ruppert (2004) and Staudenmayer et al. (2008).

The work in this paper was originally motivated by applications where the errors in the sample $W_1, \ldots, W_n$ are of only two types, and the error on future observations is of one of these two types. To fix ideas, suppose the data have been rearranged such that, for $i = 1, \ldots, m$, $U_i \sim f_{U^{(1)}}$ and, for $i = m+1, \ldots, n$, $U_i \sim f_{U^{(2)}}$, whereas future observations are of the type $T = X + U_F$ with $U_F \sim f_{U^{(1)}}$. Although it is much simpler, this model is important in practical applications (see also Carroll et al. 2006, pages 38–39), and we shall discuss it in detail. In Section 5, we apply this two-error model to a dietary data set where the goal is to predict a nutrient intake from a Food Frequency Questionnaire.

There is an extensive literature on estimation of a regression curve from contaminated data sets. A contemporary introduction to this problem is provided by Carroll et al. (2006), and recent contributions include those of Kim and Gleser (2000), Stefanski (2000), Taupin (2001), Linton and Whang (2002), Schennach (2004a,b) and Huang et al. (2006). Nonparametric estimation of a regression curve without contamination is a much older problem, treated in monographs such as those by Wand and Jones (1995) and Simonoff (1996).

An outline of this paper is as follows. In Section 2, when the measurement error densities are known, we describe estimators of the target μ(t). In Section 3, we show how to extend these methods to the case where the measurement error densities are unknown. Section 4 gives the rates of convergence of our estimators and, in particular, discusses cases where our estimators achieve parametric rates of convergence. Section 5 gives numerical results, both from simulations and in a nutritional epidemiology context.

2 Estimators

2.1 Estimators for the general case

Suppose we have a sample of data $(W_i, Y_i)$, $i = 1, \ldots, n$, generated as at (1.1); that the future observations of $X$ are of the type $T = X + U_F$, where $X$ and $U_F$ are independent and $U_F \sim f_{U_F}$; and that the error densities $f_{U_i}$ and $f_{U_F}$ are known. The case where these are totally or partially unknown will be discussed in Section 3. We wish to predict $Y$ nonparametrically, via estimation of

$$\mu(t) = E(Y \mid T = t) = \frac{\int y\, f_{T,Y}(t, y)\, dy}{f_T(t)}.$$
(2.1)

The task seems challenging, as we need to estimate $T$-related quantities from a sample of $W$-related quantities. The relationship between $T$ and each $W_j$, however, when expressed in terms of their characteristic functions, is relatively simple. Let $f_V^{\mathrm{Ft}}$ denote the characteristic function of the distribution of a random variable $V$. Then it is easy to check that $f_T^{\mathrm{Ft}}(t) = f_X^{\mathrm{Ft}}(t)\, f_{U_F}^{\mathrm{Ft}}(t)$ and $f_X^{\mathrm{Ft}}(t) = f_{W_j}^{\mathrm{Ft}}(t) / f_{U_j}^{\mathrm{Ft}}(t)$. Assuming that none of the $f_{U_j}^{\mathrm{Ft}}(t)$'s vanishes and that $f_T^{\mathrm{Ft}}$ is integrable, it follows that, by the Fourier inversion theorem,

$$f_T(x) = \frac{1}{2\pi} \int e^{-itx}\, f_T^{\mathrm{Ft}}(t)\, dt,$$
(2.2)

where we can write

$$f_T^{\mathrm{Ft}}(t) = f_{U_F}^{\mathrm{Ft}}(t)\, n^{-1} \sum_j \bigl\{ f_{W_j}^{\mathrm{Ft}}(t) / f_{U_j}^{\mathrm{Ft}}(t) \bigr\}.$$
(2.3)

Based on these considerations, we show below how to estimate $f_T^{\mathrm{Ft}}$ and $f_T$. Then, we construct an estimator of the numerator of (2.1), and finally obtain an estimator of μ.

The simplest device for obtaining a consistent estimator of $f_T$ is to replace $f_{W_j}^{\mathrm{Ft}}(t)$ in (2.3) by $e^{itW_j}$, an unbiased estimator. However, the form of a sum of ratios in (2.3) implies that the variance of the resulting estimator of $f_T$, which depends on the behavior of the $f_{U_j}^{\mathrm{Ft}}$'s in the tails, would be dominated by the variance of the least favorable errors. This simple approach thus does not lead to optimal estimators. Alternatively, to gain more precision, we can rewrite (2.3) using a ratio of sums rather than a sum of ratios. Specifically, first note that $1/f_{U_j}^{\mathrm{Ft}}(t) = \overline{f_{U_j}^{\mathrm{Ft}}(t)} \big/ |f_{U_j}^{\mathrm{Ft}}(t)|^2$. Use the notation $\Psi_j(t) = \overline{f_{U_j}^{\mathrm{Ft}}(t)} \big/ \sum_k |f_{U_k}^{\mathrm{Ft}}(t)|^2$. Since $f_{W_j}^{\mathrm{Ft}}(t) = f_X^{\mathrm{Ft}}(t)\, f_{U_j}^{\mathrm{Ft}}(t)$, it follows that $f_{W_j}^{\mathrm{Ft}}(t)\, \overline{f_{U_j}^{\mathrm{Ft}}(t)} = f_X^{\mathrm{Ft}}(t)\, |f_{U_j}^{\mathrm{Ft}}(t)|^2$, which implies that

$$f_T^{\mathrm{Ft}}(t) = f_{U_F}^{\mathrm{Ft}}(t)\, \frac{\sum_j f_{W_j}^{\mathrm{Ft}}(t)\, \overline{f_{U_j}^{\mathrm{Ft}}(t)}}{\sum_k |f_{U_k}^{\mathrm{Ft}}(t)|^2} = \sum_j f_{W_j}^{\mathrm{Ft}}(t)\, \Psi_j(t)\, f_{U_F}^{\mathrm{Ft}}(t).$$

Now, replacing $f_{W_j}^{\mathrm{Ft}}(t)$ by its unbiased estimator $e^{itW_j}$, we can estimate $f_T^{\mathrm{Ft}}(t)$ from the data $(W_1, Y_1), \ldots, (W_n, Y_n)$ by

$$\hat f_T^{\mathrm{Ft}}(t) = \sum_{j=1}^n e^{itW_j}\, \Psi_j(t)\, f_{U_F}^{\mathrm{Ft}}(t).$$
(2.4)

We shall show in Section 4 that this procedure leads to optimal estimators of μ.

If $f_{U_F}^{\mathrm{Ft}}(t) \sum_j \Psi_j(t) \in L^1$, we can obtain an estimator of the denominator of (2.1) by plugging (2.4) into (2.2), which gives $\hat f_T(x) = \sum_j f_{T,j}(x - W_j)$, where we employed the notation

$$f_{T,j}(x) = \frac{1}{2\pi} \int e^{-itx}\, f_{U_F}^{\mathrm{Ft}}(t)\, \Psi_j(t)\, dt.$$
(2.5)

Using a similar approach, we estimate the numerator of (2.1) by $\sum_j Y_j\, f_{T,j}(x - W_j)$, and we obtain an estimator of μ by taking the ratio of these two estimators:

$$\hat\mu(x) = \frac{\sum_j Y_j\, f_{T,j}(x - W_j)}{\sum_j f_{T,j}(x - W_j)}.$$
(2.6)
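To illustrate how (2.5) and (2.6) can be evaluated, the sketch below (our own, not code from the paper) treats zero-mean normal errors, whose characteristic functions are real and even, so that each Fourier inversion reduces to a cosine integral over the half-line; for normal errors the integrability condition above requires the standard deviation of $U_F$ to exceed the smallest error standard deviation in the sample.

```python
import numpy as np

cf_normal = lambda t, sd: np.exp(-0.5 * (sd * t) ** 2)   # real, even c.f.

def mu_hat(x, W, Y, sds, sd_F, t_max=40.0, ngrid=4001):
    """Estimator (2.6) at a point x, for zero-mean normal errors.
    sds[j] is the error s.d. attached to W_j; sd_F is the s.d. of U_F."""
    t = np.linspace(0.0, t_max, ngrid)
    dt = t[1] - t[0]
    cfs = np.array([cf_normal(t, s) for s in sds])        # f_{U_j}^{Ft}(t)
    psi = cfs / np.sum(cfs ** 2, axis=0)                  # Psi_j(t), real case
    integrand = cf_normal(t, sd_F) * psi                  # f_{U_F}^{Ft} Psi_j
    # f_{T,j}(u) = (1/pi) int_0^inf cos(tu) f_{U_F}^{Ft}(t) Psi_j(t) dt,
    # approximated by a Riemann sum on the truncated grid:
    fTj = (np.cos(np.outer(x - W, t)) * integrand).sum(axis=1) * dt / np.pi
    return np.sum(Y * fTj) / np.sum(fTj)
```

For instance, with the simulated data sketched in Section 1, `mu_hat(0.5, W, Y, np.r_[np.full(m, sd1), np.full(n - m, sd2)], sd1)` predicts Y at the future value T = 0.5.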

When $f_{U_F}^{\mathrm{Ft}}(t) \sum_j \Psi_j(t) \notin L^1$, we need to regularize $\hat f_T^{\mathrm{Ft}}$ before plugging it into (2.2). This challenge is also encountered in relatively classical deconvolution problems. We use a kernel approach to regularize the problem. Methodology based on another nonparametric technique, such as splines, orthogonal series or the ridge technique of Hall and Meister (2007), could also be developed. While those methods might be competitive with the kernel approach, the latter benefits from being relatively accessible to asymptotic analysis. Let $K$ be a kernel function with Fourier transform $K^{\mathrm{Ft}}$, and let $h > 0$ be a bandwidth. Assuming that $K^{\mathrm{Ft}}(t)\, f_{U_F}^{\mathrm{Ft}}(t/h)\, \Psi_j(t/h) \in L^1$, our regularized estimator of $f_T$ is $\tilde f_T(x) = h^{-1} \sum_j K_{T,j}\{(x - W_j)/h\}$, where

$$K_{T,j}(x) = \frac{1}{2\pi} \int e^{-itx}\, K^{\mathrm{Ft}}(t)\, f_{U_F}^{\mathrm{Ft}}(t/h)\, \Psi_j(t/h)\, dt.$$

Proceeding as above, we define our estimator of μ by

$$\tilde\mu(x) = \frac{\sum_j Y_j\, K_{T,j}\{(x - W_j)/h\}}{\sum_j K_{T,j}\{(x - W_j)/h\}}.$$
(2.7)

2.2 The two-error model

In the two-error model which motivated our work, the estimators of μ are particularly simple. To keep the same notation as in the introduction, assume that the first $m$ observations are contaminated by an error with density $f_{U^{(1)}}$, that the last $n - m$ are contaminated by an error with density $f_{U^{(2)}}$, and that the error in the future observation is $U_F \sim f_{U^{(1)}}$. Let $q^{\mathrm{Ft}}(t) = f_{U^{(2)}}^{\mathrm{Ft}}(t) \big/ f_{U^{(1)}}^{\mathrm{Ft}}(t)$.

If $1/q^{\mathrm{Ft}} \in L^1$, we use the estimator at (2.6), which becomes

$$\hat\mu(x) = \frac{\sum_{j=1}^m Y_j\, f_{q,1}(x - W_j) + \sum_{j=m+1}^n Y_j\, f_{q,2}(x - W_j)}{\sum_{j=1}^m f_{q,1}(x - W_j) + \sum_{j=m+1}^n f_{q,2}(x - W_j)},$$
(2.8)

where we used the notations $f_{q,1}(x) = (2\pi)^{-1} \int e^{-itx}\, \{m + (n-m)\, |q^{\mathrm{Ft}}(t)|^2\}^{-1}\, dt$ and $f_{q,2}(x) = (2\pi)^{-1} \int e^{-itx}\, q^{\mathrm{Ft}}(t)\, \{m + (n-m)\, |q^{\mathrm{Ft}}(t)|^2\}^{-1}\, dt$.

If $1/q^{\mathrm{Ft}} \notin L^1$, we use the estimator at (2.7), which becomes

$$\tilde\mu(x) = \frac{\sum_{j=1}^m Y_j\, K_{q,1}\{(x - W_j)/h\} + \sum_{j=m+1}^n Y_j\, K_{q,2}\{(x - W_j)/h\}}{\sum_{j=1}^m K_{q,1}\{(x - W_j)/h\} + \sum_{j=m+1}^n K_{q,2}\{(x - W_j)/h\}},$$
(2.9)

where $K_{q,1}(x) = (2\pi)^{-1} \int e^{-itx}\, K^{\mathrm{Ft}}(t)\, \{m + (n-m)\, |q^{\mathrm{Ft}}(t/h)|^2\}^{-1}\, dt$ and $K_{q,2}(x) = (2\pi)^{-1} \int e^{-itx}\, K^{\mathrm{Ft}}(t)\, q^{\mathrm{Ft}}(t/h)\, \{m + (n-m)\, |q^{\mathrm{Ft}}(t/h)|^2\}^{-1}\, dt$. Note that in the particular case where we have only observations contaminated by the error density $f_{U^{(1)}}$, we have $m = n$ and the estimator at (2.9) is nothing more than the usual Nadaraya-Watson estimator without any contamination; see Wand and Jones (1995, p. 119).
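For concreteness, here is a sketch of (2.9) for zero-mean normal errors, using $K^{\mathrm{Ft}}(t) = (1 - t^2)^3$ on $[-1, 1]$, a kernel commonly used in deconvolution; both the kernel and the normal-error assumption are our illustrative choices.

```python
import numpy as np

cf_normal = lambda t, sd: np.exp(-0.5 * (sd * t) ** 2)
K_ft = lambda t: (1.0 - t ** 2) ** 3            # K^{Ft}, supported on [-1, 1]

def mu_tilde(x, W, Y, m, sd1, sd2, h, ngrid=2001):
    """Estimator (2.9): the first m points carry f_{U^(1)} errors (s.d. sd1,
    shared by the future error U_F); the remaining n - m carry s.d. sd2."""
    n = len(W)
    t = np.linspace(0.0, 1.0, ngrid)            # K^{Ft} vanishes for |t| > 1
    dt = t[1] - t[0]
    q = cf_normal(t / h, sd2) / cf_normal(t / h, sd1)     # q^{Ft}(t/h)
    denom = m + (n - m) * q ** 2
    def kq(u, num):   # (1/pi) int_0^1 cos(tu) num(t)/denom(t) dt, Riemann sum
        return (np.cos(np.outer(u, t)) * num / denom).sum(axis=1) * dt / np.pi
    u1, u2 = (x - W[:m]) / h, (x - W[m:]) / h
    top = np.sum(Y[:m] * kq(u1, K_ft(t))) + np.sum(Y[m:] * kq(u2, K_ft(t) * q))
    bot = np.sum(kq(u1, K_ft(t))) + np.sum(kq(u2, K_ft(t) * q))
    return top / bot
```

When m = n the weights reduce to $K\{(x - W_j)/h\}/n$ and the ratio is the ordinary Nadaraya-Watson estimator, as noted above.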

Remark 2.1

In the terminology of nonparametric deconvolution, the smoothness of an error (or error density) is usually described by the speed at which its characteristic function converges to zero in the tails: the faster, the smoother. See Section 4.2 for discussion. In this context, roughly speaking, $1/q^{\mathrm{Ft}} \in L^1$ implies that $f_{U^{(1)}}$ is smoother than $f_{U^{(2)}}$. For example, if $f_{U^{(1)}}$ and $f_{U^{(2)}}$ are normal densities with mean zero and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, then $1/q^{\mathrm{Ft}} \in L^1$ if $\sigma_1^2 > \sigma_2^2$, and $1/q^{\mathrm{Ft}} \notin L^1$ if $\sigma_1^2 \le \sigma_2^2$. The condition $f_{U_F}^{\mathrm{Ft}}(t) \sum_j \Psi_j(t) \in L^1$ can be understood in a related manner.

2.3 Local polynomial extension

The estimators presented above are an extension of the Nadaraya-Watson estimator, which is nothing more than a local constant estimator appropriate for error-free data. Recently, Delaigle, Fan and Carroll (2008) solved the long-open problem of developing local polynomial estimators for errors-in-variables problems. Using their technique we can give a local polynomial version of our estimator $\tilde\mu$. More precisely, we define a $p$th order local polynomial estimator of μ by $\tilde\mu_p(x) = (1, 0, \ldots, 0)\, \hat S_n^{-1} \hat T_n$, where $\hat S_n = (\hat S_{n,j+l}(x))_{0 \le j, l \le p}$ and $\hat T_n = (\hat T_{n,0}(x), \ldots, \hat T_{n,p}(x))^{\mathrm{T}}$, with

$$\hat S_{n,k}(x) = \sum_{j=1}^n K_{U,k,j;h}(W_j - x) \qquad \text{and} \qquad \hat T_{n,k}(x) = \sum_{j=1}^n Y_j\, K_{U,k,j;h}(W_j - x),$$

where $K_{U,k,j;h}(x) = h^{-1} K_{U,k,j}(x/h)$ and

$$K_{U,k,j}(x) = \frac{i^{-k}}{2\pi} \int e^{-itx}\, (K^{\mathrm{Ft}})^{(k)}(t)\, f_{U_F}^{\mathrm{Ft}}(t/h)\, \Psi_j(t/h)\, dt.$$

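To illustrate the construction, here is a sketch of the local linear case (p = 1), again for zero-mean normal errors and $K^{\mathrm{Ft}}(t) = (1 - t^2)^3$ on $[-1, 1]$; this is our own rendering of the above formulas, not code of Delaigle, Fan and Carroll (2008). The parity of $(K^{\mathrm{Ft}})^{(k)}$ makes each $K_{U,k,j}$ real, and the common factor $h^{-1}$ cancels in the final ratio.

```python
import numpy as np

cf_normal = lambda t, sd: np.exp(-0.5 * (sd * t) ** 2)
K_ft  = lambda t: (1 - t ** 2) ** 3                 # K^{Ft}       (even)
K_ft1 = lambda t: -6 * t * (1 - t ** 2) ** 2        # (K^{Ft})'    (odd)
K_ft2 = lambda t: (1 - t ** 2) * (30 * t ** 2 - 6)  # (K^{Ft})''   (even)

def mu_local_linear(x, W, Y, sds, sd_F, h, ngrid=2001):
    """Local linear estimator mu~_1(x) = (1, 0) S_n^{-1} T_n."""
    t = np.linspace(0.0, 1.0, ngrid)
    dt = t[1] - t[0]
    cfs = np.array([cf_normal(t / h, s) for s in sds])
    w = cf_normal(t / h, sd_F) * cfs / np.sum(cfs ** 2, axis=0)
    u = (W - x) / h
    cos_ut, sin_ut = np.cos(np.outer(u, t)), np.sin(np.outer(u, t))
    # K_{U,k,j}(u) = i^{-k} (2 pi)^{-1} int e^{-itu} (K^{Ft})^{(k)}(t) w_j(t) dt:
    KU0 = (cos_ut * K_ft(t) * w).sum(axis=1) * dt / np.pi
    KU1 = -(sin_ut * K_ft1(t) * w).sum(axis=1) * dt / np.pi
    KU2 = -(cos_ut * K_ft2(t) * w).sum(axis=1) * dt / np.pi
    S0, S1, S2 = KU0.sum(), KU1.sum(), KU2.sum()
    T0, T1 = (Y * KU0).sum(), (Y * KU1).sum()
    return (S2 * T0 - S1 * T1) / (S0 * S2 - S1 ** 2)  # first entry of S^{-1} T
```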
Compared to local constant estimators (p = 0), local polynomial estimators with p > 0 have the advantage of being less biased near boundary points. On the other hand, in practice, increasing the value of p usually leads to an increase in variability, and using values of p larger than 1 is rarely useful unless the interest is in estimating derivatives of the curve μ, which is usually not the case in the prediction problem.

It is straightforward to extend the local-constant methodology of Delaigle, Hall and Meister (2008) for the case of unknown error distributions to the context of general local polynomial estimators. Properties are similar, too. For example, convergence rates in the case of local linear methods are identical, under regularity conditions discussed towards the end of Section 3.1, to those for the local constant estimators treated in this paper.

3 Unknown error densities

There are many examples where it is too restrictive to assume that the error densities are completely known, and in such cases, these densities have to be estimated from the data. For a long time this problem was essentially ignored in the nonparametric literature, where the error distributions were systematically assumed to be known. Recently, however, several authors have shown that this problem can be tackled if every observation is replicated at least once. References include Schennach (2004a,b), Delaigle, Hall and Meister (2008) and Hu and Schennach (2008). In the next section we discuss parametric and nonparametric methods for error density estimation in the two-error model treated in Section 2.2. A procedure for the general model is given in Section 3.2.

3.1 Procedures for the two-error model

In the two-population case, if we do not have enough data to estimate the error densities, and if m is not particularly small, then a consistent estimator of μ can be obtained by taking n = m, i.e. by discarding all observations contaminated by $f_{U^{(2)}}$ and using the standard Nadaraya-Watson estimator; see also Remark 3.1 below. The most interesting setting is undoubtedly that where it is possible to estimate the error densities, as it is in this case that the estimator of μ enjoys the fastest rates of convergence; see Section 4.

As discussed in the introduction, a large literature on measurement errors assumes that the errors are normal. More generally, the errors could belong to some parametric family, not necessarily normal. In that case, if we have a parametric model $f_{U^{(1)}}(\cdot \mid \theta_1)$ (respectively, $f_{U^{(2)}}(\cdot \mid \theta_2)$) that is identifiable from data on $U - U'$, where $U$ and $U'$ denote two independent variables with density $f_{U^{(1)}}$ (respectively, $f_{U^{(2)}}$), then $\theta_1$ and $\theta_2$ can be estimated from a sample of replicated data, i.e. a sample of the form $(W_{ij}, Y_i)$, $i = 1, \ldots, n$ and $j = 1, 2$, generated by the model

$$Y_i = g(X_i) + \epsilon_i, \qquad W_{ij} = X_i + U_{ij},$$
(3.1)

where, for $i = 1, \ldots, m$, $U_{ij} \sim f_{U^{(1)}}$, whereas, for $i = m+1, \ldots, n$, $U_{ij} \sim f_{U^{(2)}}$, and the $U_{ij}$'s are all independent, and independent of each $X_i$. For example, if $\theta_i = \sigma^2_{U^{(i)}}$, $i = 1, 2$, then we can take $\hat\theta_i$ equal to a weighted average of the within-subject sample variances; see equation (4.3) of Carroll et al. (2006). Once the unknown parameters have been estimated, the resulting estimated characteristic functions $\hat f_{U^{(1)}}^{\mathrm{Ft}}$ and $\hat f_{U^{(2)}}^{\mathrm{Ft}}$ are plugged into the estimators (2.8) and (2.9) to produce our estimators of μ.
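For two replicates per subject and $\theta_i = \sigma^2_{U^{(i)}}$, a simple moment-based variant of this step is sketched below (the weighting used by the full estimator referenced in Carroll et al. (2006) is omitted here).

```python
import numpy as np

def error_variances(W1, W2, m):
    """With W_{ij} = X_i + U_{ij} and independent replicates, the difference
    W_{i1} - W_{i2} = U_{i1} - U_{i2} has variance 2 sigma^2, so half the
    empirical variance of the differences estimates each error variance."""
    d = W1 - W2
    return 0.5 * np.var(d[:m], ddof=1), 0.5 * np.var(d[m:], ddof=1)
```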

In some cases it can happen that we have no suitable parametric model for the error densities. If the characteristic functions of the error densities are positive and symmetric, as is the case for many common densities, then they can be estimated nonparametrically along the lines of Delaigle, Hall and Meister (2008). More precisely, $f_{U^{(1)}}^{\mathrm{Ft}}$ and $f_{U^{(2)}}^{\mathrm{Ft}}$, in (2.8) and (2.9), are estimated by, respectively, $\hat f_{U^{(1)}}^{\mathrm{Ft}}(t) = \bigl| m^{-1} \sum_{j=1}^m \cos\{t (W_{j1} - W_{j2})\} \bigr|^{1/2}$ and $\hat f_{U^{(2)}}^{\mathrm{Ft}}(t) = \bigl| (n-m)^{-1} \sum_{j=m+1}^n \cos\{t (W_{j1} - W_{j2})\} \bigr|^{1/2}$. We then replace $f_{U^{(1)}}^{\mathrm{Ft}}$ and $f_{U^{(2)}}^{\mathrm{Ft}}$ by $\hat f_{U^{(1)}}^{\mathrm{Ft}}$ and $\hat f_{U^{(2)}}^{\mathrm{Ft}}$ in the numerators of $f_{q,1}$, $f_{q,2}$, $K_{q,1}$ and $K_{q,2}$, but in the denominators, to avoid division by zero, we replace $m\, |f_{U^{(1)}}^{\mathrm{Ft}}|^2 + (n-m)\, |f_{U^{(2)}}^{\mathrm{Ft}}|^2$ by $m\, |\hat f_{U^{(1)}}^{\mathrm{Ft}}|^2 + (n-m)\, |\hat f_{U^{(2)}}^{\mathrm{Ft}}|^2 + r$, with $r > 0$; see Delaigle, Hall and Meister (2008). More general settings are considered by Li and Vuong (1998), Schennach (2004a,b) and Hu and Schennach (2008).
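A sketch of these two estimates and of the ridged denominator follows; the evaluation grid and the ridge value r are tuning choices of ours.

```python
import numpy as np

def cf_hat(t, d):
    """|mean of cos{t (W_{j1} - W_{j2})}|^{1/2} on a grid t, computed from the
    within-subject differences d; valid for symmetric error distributions."""
    return np.sqrt(np.abs(np.mean(np.cos(np.outer(t, d)), axis=1)))

def ridged_denominator(t, W1, W2, m, r=0.1):
    """Plug-in version of m |f_{U^(1)}^{Ft}|^2 + (n - m) |f_{U^(2)}^{Ft}|^2 + r,
    used in place of the exact denominator of f_{q,1}, f_{q,2}, K_{q,1}, K_{q,2};
    the ridge r > 0 guards against division by (near) zero."""
    n = len(W1)
    d = W1 - W2
    f1, f2 = cf_hat(t, d[:m]), cf_hat(t, d[m:])
    return m * f1 ** 2 + (n - m) * f2 ** 2 + r
```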

Convergence rates in the unknown error case, and in the setting of classical errors-in-variables problems, have been given by Delaigle, Hall and Meister (2008). The results there state that, if the characteristic function of the unknown error distribution is estimated using a difference-based method, then the convergence rate is the same as in the setting of a known error distribution, provided the density of X is sufficiently smooth relative to the error density. This is also true in the prediction problem treated in the present paper.

Remark 3.1

When m is large relative to n − m, and the error densities cannot be well estimated, a classical Nadaraya-Watson estimator of μ, based on $(W_1, Y_1), \ldots, (W_m, Y_m)$, is likely to perform better than our estimator. For example, this could happen if the error densities had to be estimated nonparametrically from a small number of individuals for which there were replicated observations. In such cases, a conservative approach would be to use the Nadaraya-Watson estimator.

3.2 A procedure for the general model

Before we show how to construct a consistent estimator in the general context of the model (1.1), it is important to realize that, whatever approach we take, for the function μ to be identifiable we need to be able to consistently estimate the error density $f_{U_F}$ of the future observations. See Section 3.1 for a discussion of how to estimate an error density. We assume that sufficient effort has been made by the experimenters to collect data permitting the construction of a consistent estimator $\hat f_{U_F}$ of $f_{U_F}$. If $f_{U_F}$ cannot be estimated, then the prediction problem is not identifiable.

As in Section 3.1, for simplicity we address the case where there are just two replicated measurements of $X_i$ for each $i$; that is, we have data of the form $(W_{ij}, Y_i)$, for $i = 1, \ldots, n$ and $j = 1, 2$, generated by the model (3.1), where $U_{ij} \sim f_{U_i}$ and the $U_{ij}$'s are all independent, and independent of every $X_i$ (the case of a larger number of replicates can be treated similarly). Of course, it is not possible to estimate each error density $f_{U_i}$ from such data. Nevertheless, if the characteristic functions of the error densities are positive and symmetric, and if we modify our estimators at (2.6) and (2.7) appropriately, it is possible to construct a consistent estimator of μ, as we show below.

Let $\bar W_i = (W_{i1} + W_{i2})/2$, and note that $f_X^{\mathrm{Ft}}(t) = f_{\bar W_j}^{\mathrm{Ft}}(t) \big/ \{f_{U_j}^{\mathrm{Ft}}(t/2)\}^2 = \Phi(t) \sum_j f_{\bar W_j}^{\mathrm{Ft}}(t)$, where we used the notation $\Phi(t) = 1 \big/ \sum_k \{f_{U_k}^{\mathrm{Ft}}(t/2)\}^2$. Replacing the unknown $\Phi(t)^{-1}$ by the estimator $\hat\Phi(t)^{-1} = \sum_k \exp\{it (W_{k1} - W_{k2})/2\}$, and proceeding as in Section 2, we obtain the following versions of (2.6) and (2.7):

$$\hat\mu(x) = \frac{\sum_{j=1}^n Y_j\, f_T(x - \bar W_j)}{\sum_{j=1}^n f_T(x - \bar W_j)}, \qquad \tilde\mu(x) = \frac{\sum_{j=1}^n Y_j\, K_T\{(x - \bar W_j)/h\}}{\sum_{j=1}^n K_T\{(x - \bar W_j)/h\}},$$

with $f_T(x) = (2\pi)^{-1} \int e^{-itx}\, \hat f_{U_F}^{\mathrm{Ft}}(t)\, \{\hat\Phi^{-1}(t) + r\}^{-1}\, dt$, and with

$$K_T(x) = (2\pi)^{-1} \int e^{-itx}\, K^{\mathrm{Ft}}(t)\, \hat f_{U_F}^{\mathrm{Ft}}(t/h)\, \{\hat\Phi^{-1}(t/h) + r\}^{-1}\, dt,$$

where, as before, r > 0 is introduced to avoid division by zero.
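Putting the pieces together, here is a sketch of the regularized estimator for two replicates, with the same compactly supported kernel as before; cf_UF stands for the assumed-available consistent estimate $\hat f_{U_F}^{\mathrm{Ft}}$ (for example, built as in the cf_hat sketch of Section 3.1), and h and r are tuning parameters.

```python
import numpy as np

K_ft = lambda t: (1.0 - t ** 2) ** 3          # K^{Ft}, supported on [-1, 1]

def mu_tilde_general(x, W1, W2, Y, cf_UF, h, r=1.0, ngrid=2001):
    Wbar = 0.5 * (W1 + W2)                    # averaged replicates
    t = np.linspace(0.0, 1.0, ngrid)
    dt = t[1] - t[0]
    s = t / h
    # hat{Phi}^{-1}(t/h) = sum_k exp{i t (W_{k1} - W_{k2}) / 2}; the real part
    # suffices for symmetric error distributions. (In practice one may also
    # truncate frequencies where phi_inv + r is small.)
    phi_inv = np.sum(np.cos(np.outer(s, 0.5 * (W1 - W2))), axis=1)
    weight = K_ft(t) * cf_UF(s) / (phi_inv + r)
    KT = (np.cos(np.outer((x - Wbar) / h, t)) * weight).sum(axis=1) * dt / np.pi
    return np.sum(Y * KT) / np.sum(KT)
```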

4 Theoretical properties

The properties of the estimator $\hat\mu$ at (2.6) are clear. In particular, it is easy to check that the numerator and the denominator are unbiased estimators of $\mu(t) f_T(t)$ and $f_T(t)$, respectively, and that, under conditions similar to those discussed below Theorem 4.1, $\hat\mu$ converges at the fast parametric $n^{-1/2}$ rate. An intuitive explanation of why this fast rate can occur in a nonparametric context is given in Remark 4.2. Properties of the estimator $\tilde\mu$ are more complicated and, in what follows, we derive them in the general case. Then we obtain more detailed results for the two-error problem. Proofs of the results presented in this section can be found in the supplemental material of this paper, available at http://www.amstat.org/publications/jasa/supplemental_materials.

4.1 Asymptotic results for $\tilde\mu$ in the general case

To study asymptotic properties of our estimator, we assume that:

$\epsilon_i$, $i = 1, \ldots, n$, have zero means and uniformly bounded variances;
(4.1)

$K$ is symmetric, $|K(x)| \le C_3 (1 + |x|)^{-k-1-C_4}$ for an integer $k \ge 2$ and for constants $C_3, C_4 > 0$; $\int u^j K(u)\, du = 0$ for $1 \le j \le k - 1$; $\sup |K^{\mathrm{Ft}}| < \infty$; $K^{\mathrm{Ft}}(0) > 0$; and, for some $C_5 > 0$, $K^{\mathrm{Ft}}(t) = 0$ for all $|t| > C_5$;
(4.2)

$f_X, f_{U_1}, \ldots, f_{U_n}, f_{U_F}$ and $g$ are bounded, and $f_X$ and $g$ have $k$ bounded derivatives;
(4.3)

$\sup_k |f_{U_k}^{\mathrm{Ft}}|^2$ and $f_{U_F}^{\mathrm{Ft}}$ are bounded and, for all $t$, $\sum_k |f_{U_k}^{\mathrm{Ft}}(t)|^2 > 0$;
(4.4)

$h \to 0$ as $n \to \infty$ and $n / v(h) \to \infty$ as $n \to \infty$,
(4.5)

where we defined

$$v(h) = n h^{-1} \int \frac{|K^{\mathrm{Ft}}(t)|^2\, |f_{U_F}^{\mathrm{Ft}}(t/h)|^2}{\sum_{k=1}^n |f_{U_k}^{\mathrm{Ft}}(t/h)|^2}\, dt.$$
(4.6)

Assumptions such as these are fairly standard in the nonparametric regression literature. Condition (4.1) is mild, and the smoothness of the various curves in (4.3) is imposed only to determine the order of the bias of the estimator, which depends on k. In particular, k is generally not a tuning parameter, and in empirical examples, where the smoothness of the curves is usually unknown, it is common to set k = 2. It is well known that, in practice, larger values of k increase the variability of estimators and usually make them unattractive; see, for example, Marron and Wand (1992). Of course, as in the standard error-free case, our results can be extended to cases where $f_X$ and g have discontinuity points, with the obvious modifications to the bias. Condition (4.2) concerns only the kernel (which we can choose) and is satisfied by the kernels used in deconvolution problems. Condition (4.4) is a weaker version of the standard conditions usually imposed in deconvolution problems (see, e.g., Fan, 1991), since, in our case, the characteristic functions of the errors are permitted to vanish. Condition (4.5) is a generalization of the standard condition nh → ∞ imposed in the error-free case, but it looks more complicated here because the variance of the estimator is of order v(h)/n rather than 1/(nh).

The asymptotic behavior of the estimator is described in the next theorem.

Theorem 4.1

Assume (4.1)–(4.5). Then, for each $t$ such that $f_T(t) > 0$,

$$\tilde\mu(t) = \mu(t) + O_P\bigl( \{v(h)/n\}^{1/2} + h^k \bigr).$$
(4.7)

Precise rates of convergence of the estimator depend on the behavior of the ratio $Q(t) \equiv n\, |f_{U_F}^{\mathrm{Ft}}(t)|^2 \big/ \sum_k |f_{U_k}^{\mathrm{Ft}}(t)|^2$ in the tails. It is not possible here to consider every possible combination of error types. To get some insight into these results, consider the situation where, for all $t$, $\sum_k |f_{U_k}^{\mathrm{Ft}}(t)|^2 > n\, \xi(t)$, where $\xi$ is a continuous, strictly positive function. Then, if $Q(t) = o(|t|^{-1})$ as $|t| \to \infty$, by taking $h = O(n^{-1/k})$, the estimator converges at the fast parametric $n^{-1/2}$ rate. When $Q(t) = O(1)$ and $|t|\, Q(t) \to \infty$ as $|t| \to \infty$, the estimator converges at a rate that lies between $n^{-1/2}$ and the classical nonparametric rate $n^{-k/(2k+1)}$, and in other cases it converges more slowly than $n^{-k/(2k+1)}$. An intuitive explanation of the occurrence of fast parametric rates will be given in Remark 4.2.

4.2 Detailed results in the two-error case

In the two-error problem described in Section 2.2, it is possible to provide a more detailed study of the convergence rates of the estimator $\tilde\mu$, which, in this case, reduces to (2.9). Precise rates of convergence depend on the values of m and n, and on the behavior of $q^{\mathrm{Ft}}$ in the tails, which itself is dictated by the behavior of the characteristic functions $f_{U^{(1)}}^{\mathrm{Ft}}$ and $f_{U^{(2)}}^{\mathrm{Ft}}$ of the errors. In the measurement error literature it is well known that convergence rates of nonparametric estimators depend heavily on the tail behavior of the characteristic function of the error. This tail behavior is usually referred to as the smoothness of the error, and it is standard to divide error distributions into two quite different categories, called ordinary smooth and supersmooth in the terminology of Fan (1991). The errors $f_{U^{(1)}}$ and $f_{U^{(2)}}$ are ordinary smooth of orders $\beta$ and $\alpha$, respectively, if they satisfy, for positive constants $C_1 < C_1'$ and $C_2 < C_2'$ and for all $t$,

$$C_2 (1 + |t|)^{-\beta} \le |f_{U^{(1)}}^{\mathrm{Ft}}(t)| \le C_2' (1 + |t|)^{-\beta}, \qquad C_1 (1 + |t|)^{-\alpha} \le |f_{U^{(2)}}^{\mathrm{Ft}}(t)| \le C_1' (1 + |t|)^{-\alpha}.$$
(4.8)

An error density $f_U$ is supersmooth of order $\beta > 0$ if it satisfies, for positive constants $\gamma$ and $D_1 < D_2$, and for all $t$,

$$D_1 \exp(-|t|^\beta / \gamma) \le |f_U^{\mathrm{Ft}}(t)| \le D_2 \exp(-|t|^\beta / \gamma).$$
(4.9)

For simplicity of presentation we give our main results under the assumption that, for constants $\alpha, \beta, C_1, C_2 > 0$ and all real $t$,

$$C_1 (1 + |t|)^{-\alpha} \le |f_{U^{(2)}}^{\mathrm{Ft}}(t)|, \qquad |f_{U^{(1)}}^{\mathrm{Ft}}(t)| \le C_2 (1 + |t|)^{-\beta}.$$
(4.10)

Obviously, ordinary smooth errors satisfy both inequalities, but supersmooth errors also satisfy the second inequality for any β > 0.

Define $\delta = \alpha - \beta + \frac{1}{2}$ and, denoting the indicator function by $\mathbf{1}\{\cdot\}$, let

$$v_1(h) = h^{-2\delta}\, \mathbf{1}\{\delta > 0\} + \log(1/h)\, \mathbf{1}\{\delta = 0\} + \mathbf{1}\{\delta < 0\}.$$
(4.11)

Assume that, as n → ∞,

$h \to 0$, $n / v_1(h) \to \infty$ and $mh \to \infty$.
(4.12)

The asymptotic behavior of $\tilde\mu$ is described in the next theorem.

Theorem 4.2

Assume (4.1)–(4.4) and (4.10)–(4.12). Then, for each $t$ such that $f_T(t) > 0$,

$$\tilde\mu(t) = \mu(t) + O_P\Bigl( h^k + \min\bigl[ (mh)^{-1/2},\ (n - m)^{-1/2} \{v_1(h)\}^{1/2} \bigr] \Bigr).$$
(4.13)

This result shows that our estimator $\tilde\mu$ converges at a rate at least as fast as that of the Nadaraya-Watson estimator of μ based on direct data from (T, Y), i.e. on the first m observations $(W_1, Y_1), \ldots, (W_m, Y_m)$. Indeed, the Nadaraya-Watson estimator is optimized by taking $h \sim \mathrm{const.}\, m^{-1/(2k+1)}$, for which it converges at the rate $m^{-k/(2k+1)}$. Of course, the cases where our estimator converges faster than the Nadaraya-Watson estimator depend on the relative sizes of m and n, but also on the relative smoothness of the two errors, as we show below.

The most favorable situation is clearly the one where $\delta < 0$. This includes the case where $f_{U^{(1)}}$ and $f_{U^{(2)}}$ are ordinary smooth of orders $\beta$ and $\alpha$, respectively, with $\beta > \alpha + \frac{1}{2}$, but it also includes the case where, simultaneously, $f_{U^{(1)}}$ is supersmooth of any order and $f_{U^{(2)}}$ is ordinary smooth of any order. When $m = o\{(n-m)^{(2k+1)/(2k)}\}$, we obtain the very fast parametric rate $(n-m)^{-1/2}$ by taking $h = O\{(n-m)^{-1/(2k)}\}$. In particular, when $m$ and $n - m$ are of the same order, or $m = o(n)$, the estimator converges at the rate $n^{-1/2}$. When $m \neq o\{(n-m)^{(2k+1)/(2k)}\}$, i.e. when there are many more data contaminated by $f_{U^{(1)}}$ than by $f_{U^{(2)}}$, then, logically, the estimator converges at the same rate as the Nadaraya-Watson estimator based on the first m data points. More precisely, when $h \sim \mathrm{const.}\, m^{-1/(2k+1)}$ the estimator converges at the rate $m^{-k/(2k+1)}$.

The case where $\delta > 0$ is more involved since, there, the term $(n-m)^{-1/2} \{v_1(h)\}^{1/2}$ in (4.13) is only an upper bound on the contribution of the data $(W_i, Y_i)$ for $i = m+1, \ldots, n$, and precise characterization of convergence rates can be obtained only at the expense of a more precise characterization of $q^{\mathrm{Ft}}$. We shall assume that the errors are ordinary smooth of orders $\beta$ and $\alpha$, as defined at (4.8). Under this assumption, the convergence rate of the estimator is exactly of the order given in the theorem. The optimal bandwidth is thus of order $h \sim \mathrm{const.}\, m^{-1/(2k+1)}$ when $(n-m)^{(2k+1)/(2k+2\delta)} = o(m)$, and the estimator then converges at the rate $m^{-k/(2k+1)}$. When $m = o\{(n-m)^{(2k+1)/(2k+2\delta)}\}$, the optimal bandwidth is of order $h \sim \mathrm{const.}\, (n-m)^{-1/(2k+2\delta)}$ and the estimator converges at the rate $(n-m)^{-k/(2k+2\delta)}$.
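As a worked illustration of these formulas (a numerical example of our own, not from the paper): take $k = 2$ and suppose $f_{U^{(1)}}$ is Laplace, so that $\beta = 2$, while $f_{U^{(2)}}$ is ordinary smooth of order $\alpha = 3$. Then

$$\delta = \alpha - \beta + \tfrac{1}{2} = \tfrac{3}{2}, \qquad 2k + 2\delta = 7, \qquad \frac{2k+1}{2k+2\delta} = \frac{5}{7},$$

so when $m = o\{(n-m)^{5/7}\}$ the optimal bandwidth is $h \sim \mathrm{const.}\, (n-m)^{-1/7}$ and the estimator converges at the rate $(n-m)^{-2/7}$, whereas when $(n-m)^{5/7} = o(m)$ the first $m$ observations dominate and the rate is the usual $m^{-2/5}$.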

Remark 4.1. (Other types of errors)

Calculation of the rates for $\tilde\mu$ can be extended to cases more general than (4.10). For example, it can be shown that the convergence rates are of order $\min\{m^{-k/(2k+1)}, (n-m)^{-1/2}\}$ whenever $|t|^{1/2} / |q^{\mathrm{Ft}}(t)| \le \mathrm{const.}$ as $|t| \to \infty$; they are of order $\min\{m^{-k/(2k+1)}, B(n)\}$, where $B(n) \to 0$ as $n \to \infty$ at a speed similar to typical deconvolution rates (see, e.g., Fan and Truong, 1993), when $1/|q^{\mathrm{Ft}}(t)| \to \infty$ as $|t| \to \infty$; and they are of an order between $\min\{m^{-k/(2k+1)}, B(n)\}$ and $\min\{m^{-k/(2k+1)}, (n-m)^{-1/2}\}$ in all other cases.

Remark 4.2. (On the parametric rates — I)

The very fast parametric rate noted above may appear counter-intuitive. It can be understood from the fact that, since $f_T = f_X * f_{U^{(1)}}$, we have some valuable information about the structure of $f_T$: we know that it is the convolution of an estimable density $f_X$ and a known density $f_{U^{(1)}}$. If we had a direct sample from $f_X$, this would be the so-called Berkson problem, for which it has been established that the rates of convergence are parametric; see Delaigle (2007). Our situation is more complicated since we can estimate $f_X$ only indirectly, and it is only when $f_{U^{(1)}}$ is smooth enough that we can obtain fast parametric convergence rates. Note, too, that in the Berkson problem, unlike in ours, the parametric rate occurs only in the density estimation context, and not in the regression setting.

Remark 4.3. (On the parametric rates — II)

When the covariate X is observed without error, for past and future observations, instead of applying standard nonparametric estimators of $E(Y \mid X)$, which converge only at the rate $n^{-k/(2k+1)}$, it may seem to be a better idea to artificially add random noise $U \sim f_U$ to the future observed value of X, with $f_U$ such that $f_U^{\mathrm{Ft}} \in L^1$, and to predict Y by an estimator of $E(Y \mid T)$, where $T = X + U$. Indeed, this would correspond to our model in the situation where $1/q^{\mathrm{Ft}}(t) = f_U^{\mathrm{Ft}}(t) \in L^1$, in which case we can use $\hat\mu$, which converges at an $n^{-1/2}$ rate. However, it is not clear that, despite the convergence rate, this would lead to better prediction of Y than the error-free estimator of $E(Y \mid X)$: Y is more dispersed around $E(Y \mid T)$ than around $E(Y \mid X)$, because T is a noisier measurement than X.

4.3 Optimal convergence rates

Here we indicate that in the ordinary smooth case, the convergence rates given by Theorem 4.2 are optimal when δ > 0. A simpler argument can be used to demonstrate optimality when δ < 0, and similar methods can be employed to verify optimality in supersmooth settings. We prove results only in the two-error case, but similar techniques can be used to show that, under regularity conditions, our estimator is also optimal in the general setting of model (1.1).

Let fU(2) and fU(1) denote symmetric densities for which

$$\limsup_{|t| \to \infty}\ \max_{j=0,1,2}\ (1 + |t|)^{\alpha + j}\, \Bigl| \frac{d^j}{dt^j} f_{U^{(2)}}^{\mathrm{Ft}}(t) \Bigr| < \infty,$$
(4.14)

$$\liminf_{|t| \to \infty}\ \min_{j=0,1,2}\ (1 + |t|)^{\beta + j}\, \Bigl| \frac{d^j}{dt^j} f_{U^{(1)}}^{\mathrm{Ft}}(t) \Bigr| > 0.$$
(4.15)

Given $-\infty < a < b < \infty$, $C > 0$ and an integer $k \ge 1$, write $\mathcal{F}(a, b, C, k)$ for the class of densities $f_{XY}$ of (X, Y) such that (a) $\mu = E(Y \mid T = \cdot)$ and $f_X$ are both k times differentiable, with each of these derivatives uniformly bounded, in absolute value, by C; (b) $E(Y^2 \mid X = x) \le C$ for all x; and (c) $f_X(x) \ge C^{-1}$ for $x \in [a, b]$. Let $\mathcal{C}$ denote the class of all estimators $\breve\mu$ of μ.

Theorem 4.3

Assume that (4.14) and (4.15) hold, and δ > 0. Then, for each real number w, there exists a constant c > 0 such that

$$\liminf_{n \to \infty}\ \inf_{\breve\mu \in \mathcal{C}}\ \sup_{f_{XY} \in \mathcal{F}(a,b,C,k)} P\Bigl[ |\breve\mu(w) - \mu(w)| > c\, \min\bigl\{ (n-m)^{-k/(2k+2\delta)},\ m^{-k/(2k+1)} \bigr\} \Bigr] > 0.$$
(4.16)

These rates correspond exactly to those in Theorem 4.2, showing that the rate at (4.16) is achieved by the estimator $\tilde\mu$. Although our upper bound giving this rate was derived only for a particular fixed distribution, that bound is readily established uniformly over a function class for which (4.16) holds.

5 Numerical results

We applied our estimators in the particular case where the observations are contaminated by only two types of errors, which is the setting of our empirical example. Note that we have defined two estimators, at (2.8) and (2.9). The first exists only when $1/q^{\mathrm{Ft}}$ is integrable, and is simpler to calculate. In particular, it requires neither a bandwidth nor a kernel, and our numerical work showed that it systematically outperformed $\tilde\mu$; we therefore do not present $\tilde\mu$ in the cases where $\hat\mu$ exists. We use the notations $\hat\mu^*$ and $\tilde\mu^*$ for the versions of the estimators $\hat\mu$ and $\tilde\mu$, respectively, with the error variances estimated from replicated data as in Section 3. We use the notation $\hat\mu_{\mathrm{NW}}$ for the classical Nadaraya-Watson estimator calculated from the data $(W_i, Y_i)$, $i = 1, \ldots, m$. Note that $\hat\mu_{\mathrm{NW}}$ is exactly equal to $\tilde\mu$ when m = n.

5.1 Simulations

We applied the various estimators introduced above to some simulated examples, corresponding to the following models, where we took the $\epsilon_i$'s identically distributed as $\epsilon$ (we use $\mathrm{Be}(p)$ to denote the Bernoulli($p$) distribution):

  • (i)
    $g(x) = 3x + 20\, \exp\{-100 (x - 0.5)^2\} / \sqrt{2\pi}$, $X \sim N(0.5, 1/3.92^2)$, $\epsilon \sim N(0, 0.673)$;
  • (ii)
    $Y \mid X = x \sim \mathrm{Be}\{g(x)\}$, $g(x) = 0.45\, \sin(2\pi x) + 0.5$ and $X \sim U[0, 1]$;
  • (iii)
    $g(x) = \sin(\pi x / 2) \big/ \{1 + 2 x^2 (\mathrm{sgn}\, x + 1)\}$, $X \sim N(0, 1)$, $\epsilon \sim N(0, 0.09)$.

In each case we took $U_F \sim f_{U^{(1)}}$, and $f_{U^{(1)}}$ and $f_{U^{(2)}}$ to be either Laplace or normal with zero mean. We generated 200 samples of various sizes and, for each calculated estimator, say $\mu_{\mathrm{est}}$, we computed the integrated squared error $\mathrm{ISE} = \int (\mu_{\mathrm{est}} - \mu)^2$. In the graphs, to illustrate the performance of an estimator, we show the estimated curves corresponding to the first ($q_1$), second ($q_2$) and third ($q_3$) quartiles of these calculated ISEs. In each case the target curve is represented by a solid curve. In the tables we provide the average values, denoted by MISE, of the 200 calculated ISEs.

In addition to the bandwidth h, necessary to calculate $\tilde\mu$ and the classical Nadaraya-Watson estimator $\hat\mu_{\mathrm{NW}}$, all methods, including $\hat\mu$, required the choice of a ridge parameter ρ, used in their denominators to avoid division by a number close to zero. For each method, at points x where the denominator of the estimator was smaller than ρ, we replaced the denominator by ρ. For a given estimator $\mu_{\mathrm{est}}$, we selected (ρ, h), or ρ alone for $\hat\mu$, by minimizing the following cross-validation (CV) criterion:

$$\mathrm{CV} = \sum_{j=1}^m \bigl\{ Y_j - \mu_{\mathrm{est}}^{(-j)}(W_j) \bigr\}^2,$$
(5.1)

where the superscript (−j) means that the estimator is constructed without using the jth observation.
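A sketch of this selection step with a plain grid search follows; the estimator interface is hypothetical, and note that deleting an observation $j \le m$ also removes one point from the first error group.

```python
import numpy as np
from itertools import product

def select_rho_h(W, Y, m, estimator, rhos, hs):
    """Grid-search minimizer of the CV criterion (5.1); `estimator(x, W, Y,
    m, rho, h)` stands for any of the estimators above, with a ridged
    denominator: wherever the denominator falls below rho, rho is used."""
    best, best_cv = None, np.inf
    for rho, h in product(rhos, hs):
        cv = 0.0
        for j in range(m):                      # (5.1) sums over j = 1, ..., m
            keep = np.arange(len(W)) != j
            pred = estimator(W[j], W[keep], Y[keep], m - 1, rho, h)
            cv += (Y[j] - pred) ** 2
        if cv < best_cv:
            best, best_cv = (rho, h), cv
    return best
```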

Figure 1 and Table 1 compare, for various sample sizes, the results obtained for estimating curve (i) when $f_{U^{(1)}}$ was smoother than $f_{U^{(2)}}$, with either both errors normal, or $f_{U^{(1)}}$ normal and $f_{U^{(2)}}$ Laplace. We compare $\hat\mu_{\mathrm{NW}}$, $\hat\mu$ and $\hat\mu^*$ (recall that the * version of an estimator is the one used when the error variances are estimated from replicated observations). Our results show that the estimator $\hat\mu$ outperforms $\hat\mu_{\mathrm{NW}}$. The $\hat\mu^*$ version worked almost as well as $\hat\mu$, showing the limited loss incurred by estimating the error variances from replicated data. We also show the results obtained when $f_{U^{(2)}}$ was smoother than $f_{U^{(1)}}$, with both errors normal, where we compare $\hat\mu_{\mathrm{NW}}$, $\tilde\mu$ and $\tilde\mu^*$. Although the new estimator still outperforms the Nadaraya-Watson estimator, here the gain is less impressive, as predicted by the theory.

Figure 1
Quartile curves for the estimation of curve (i) when $f_{U^{(1)}} \sim$ Normal, $f_{U^{(2)}} \sim$ Laplace, $\sigma^2_{U^{(1)}} = \sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$, $m = n/2 = 125$, using $\hat\mu_{\mathrm{NW}}$ (left) or $\tilde\mu$ (right).
Table 1
MISE for estimation of curve (i) when $f_{U^{(1)}} \sim$ Normal (N) and $f_{U^{(2)}} \sim$ Laplace (L), with $\sigma^2_{U^{(1)}} = \sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$; $f_{U^{(1)}}$ and $f_{U^{(2)}} \sim$ N, with $\sigma^2_{U^{(1)}} = 2\sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$; and $f_{U^{(1)}}$ and $f_{U^{(2)}} \sim$ N, with $2\sigma^2_{U^{(1)}}$ ...

Figure 2 and Table 2 show the results obtained for estimating curve (ii) when $f_{U^{(1)}}$ was smoother than $f_{U^{(2)}}$, with both errors normal. We compare $\hat\mu_{\mathrm{NW}}$, $\hat\mu$ and $\hat\mu^*$ for different combinations of sample sizes. Of course, the situation where we can expect the largest gain from using the new estimator, compared to the classical Nadaraya-Watson estimator, is that where the size, m, of the sample of data contaminated by $f_{U^{(1)}}$ is as small as possible relative to the total sample size, n. The results, however, indicate that, even when m = 5n/6, the gain can already be quite significant. In this example, $\hat\mu^*$ performed so well that it even bettered its known-error version, $\hat\mu$, in the majority of cases.

Figure 2
Quartile curves for the estimation of curve (ii) when $f_{U^{(1)}} \sim$ N, $f_{U^{(2)}} \sim$ N, $\sigma^2_{U^{(1)}} = 2\sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$ and $m = n/2 = 250$, using $\hat\mu_{\mathrm{NW}}$ (left) or $\hat\mu$ (right).
Table 2
MISE for estimation of curve (ii) when $f_{U^{(1)}}$ and $f_{U^{(2)}} \sim$ Normal with $\sigma^2_{U^{(1)}} = 2\sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$.

Finally, Figure 3 compares the results obtained for estimating curve (iii) when both errors are normal with $\sigma^2_{U^{(1)}} = 2\sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$ and m = n/2 = 250. We show the quartile curves obtained for $\hat\mu_{\mathrm{NW}}$ and $\hat\mu$. Again, we see the important gain that can be obtained by using the new estimator rather than the classical Nadaraya-Watson estimator, which uses only $(W_1, Y_1), \ldots, (W_m, Y_m)$.

Figure 3
Quartile curves for the estimation of curve (iii) when $f_{U^{(1)}} \sim$ N, $f_{U^{(2)}} \sim$ N, $\sigma^2_{U^{(1)}} = 2\sigma^2_{U^{(2)}} = 0.2\, \mathrm{var}(X)$, $m = n/2 = 250$, using $\hat\mu_{\mathrm{NW}}$ (left) or $\hat\mu$ (right).

In summary, our simulations showed that when $f_{U^{(1)}}$ was smoother than $f_{U^{(2)}}$, the new estimator substantially outperformed the Nadaraya-Watson estimator. When $f_{U^{(2)}}$ was smoother than $f_{U^{(1)}}$, the gain from using the new estimator was usually less impressive, unless m was relatively small, as predicted by the theoretical results in Section 4. The empirical applications we had in mind when developing the new estimators fall into the category where $f_{U^{(1)}}$ is smoother than $f_{U^{(2)}}$; see Section 5.2.

5.2 Data Illustration

In part, this paper arises from the following considerations. In nutritional epidemiology, the standard method for correcting for the effects of measurement error when evaluating diet-disease relationships is regression calibration (Carroll et al., 2006). Using our notation, the method works as follows. Let N be the unobserved true long-term nutrient intake. The goal is to regress a response, say disease status D, on N. In the main study, nutrient intakes are measured by a single food frequency questionnaire (FFQ), which is what we call W. Here X is the long-term average intake as measured by the FFQ.

Because N is not observed, most nutritional epidemiology studies take a random calibration sub-sample of the main study population, generally much smaller than the main sample, in which they typically measure repeated versions of W, in an effort to understand the measurement error properties of W in the sampled population. In addition, in the sub-sample, they observe an unbiased estimate Y of N. In regression calibration, instead of regressing D on the unobserved N, one regresses D on $E(Y \mid W)$, where $E(Y \mid W)$ is estimated from the observations in the sub-sample. Of course, one way to estimate $E(Y \mid W)$ would be to use the classical Nadaraya-Watson estimator based on the direct observations of (W, Y), but our new approach can be used with the averaged replicated data to obtain a more efficient estimator of $E(Y \mid W)$, as we illustrate below on a calibration sub-study from the American Cancer Society Cancer Prevention Study II Nutrition Survey (ACS; Flagg et al., 2000).

The main study had approximately 185,000 adults, while the calibration sub-study was of size 598. In the calibration sub-study, several variables were measured, including Y, an average of protein intake from four food records, which is taken to be unbiased for usual intake N, and W, a log-transformed version of protein intake from the FFQ, which was measured twice, each time with error approximately distributed as $N(0, \sigma_U^2)$. As above, X is the unobserved long-term average intake as measured by the FFQ. The data we considered were a sample of size n = 598 from $(W_{i1}, W_{i2}, Y_i)$, for $i = 1, \ldots, n$, where, for each i, $W_{i1} = X_i + U_{i1}$ and $W_{i2} = X_i + U_{i2}$, with $U_{i1}$ and $U_{i2}$ independent and identically distributed as $N(0, \sigma_U^2)$. Our target is $E(Y \mid W)$. A point to note here is that we have transformed W to make the measurement errors normally distributed, but we have not transformed Y, the idea being that disease risk models are interested in the effects of nutrient intakes and not of transformed intakes; see Ferrari et al. (2004, 2008) for examples of this.

This example is convenient for illustrating the various approaches to regression estimation, since the fact that we have direct data on the quantity of interest allows us to consider three different estimators: the classical Nadaraya-Watson estimator $\hat\mu_{\mathrm{NW}}$ of $E(Y \mid W)$ based on the data $(W_{i1}, Y_i)$, for $i = 1, \ldots, n$; the Nadaraya-Watson estimator $\hat\mu_{\mathrm{NW,dep}}$ of $E(Y \mid W)$ based on the dependent data $(W_{i1}, Y_i), (W_{i2}, Y_i)$, for $i = 1, \ldots, n$; and our new methodology $\hat\mu^*$, based on the averaged data $(\bar W_i, Y_i)$, for $i = 1, \ldots, n$, where $\bar W_i = (W_{i1} + W_{i2})/2$ and $\sigma_U^2$ is estimated by half the empirical variance of the differenced replicates. In the notation of the previous sections, the last approach corresponds to m = 0 (since we do not use the direct data), $f_{U^{(1)}} \sim N(0, \sigma_U^2)$ and $f_{U^{(2)}} \sim N(0, \sigma_U^2/2)$. Clearly, $f_{U^{(1)}}$ is smoother than $f_{U^{(2)}}$, so we can use our estimator $\hat\mu^*$.
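Under these normal-error assumptions the estimator is remarkably simple: with m = 0, the Fourier inversion in (2.8) can be done in closed form, because $f_{q,2}$ is proportional to a $N(0, \sigma_U^2/2)$ density, so $\hat\mu^*$ reduces to a Nadaraya-Watson ratio whose normal kernel has its scale fixed by the estimated error variance, with no bandwidth to choose. The sketch below is our own derivation from (2.8), not code from the paper, and it omits the ridging used in the numerical work.

```python
import numpy as np

def mu_hat_star(x, W1, W2, Y):
    """m = 0 case of (2.8) with f_{U^(1)} = N(0, s2), f_{U^(2)} = N(0, s2/2):
    the weight f_{q,2}(x - Wbar_j) is proportional to exp{-(x - Wbar_j)^2 / s2}."""
    s2 = 0.5 * np.var(W1 - W2, ddof=1)   # half the variance of the differences
    Wbar = 0.5 * (W1 + W2)
    w = np.exp(-((x - Wbar) ** 2) / s2)  # N(0, s2/2) kernel; constants cancel
    return np.sum(Y * w) / np.sum(w)
```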

Of course, here we do not know the true curve $E(Y \mid W)$, so we cannot say which method gives the best estimator. However, the sample size is large, so one way to illustrate the performance of the procedures, in a way that is similar to a simulation study, is to create a large number (we took 500) of subsamples of smaller size (we took 30, 50, 75 and 100) and examine the variability of each method for each subsample size. It is not hard to show that, for our method, the squared bias is of smaller order than the variance, and since we do not know the true target, it thus seems appropriate to focus on the variance. In Figure 4 we show the estimated curves for 15 subsamples of size 30 (respectively, 100) randomly selected from the 500 randomly created subsamples of size 30 (respectively, 100). We see that, although all methods indicate the same trend for the relationship between W and Y, both versions of the Nadaraya-Watson estimator experience some difficulty, as some of the estimated curves are quite wiggly. To illustrate this further, in Table 3 we show, for each subsample size, the integrated variance of each method on the interval [1, 4] (calculated via the variance of the 500 replications in each case). The main message is the same: our method is less variable than both Nadaraya-Watson estimators, as predicted by the theory.

Figure 4
Estimation of μ for the American Cancer Society data, using the new estimator $\hat\mu^*$ (left), the classical NW estimator $\hat\mu_{\mathrm{NW,dep}}$ based on the dependent data (middle), or the classical NW estimator $\hat\mu_{\mathrm{NW}}$ (right), for subsamples of size ...
Table 3
Integrated variances (IVAR) for estimation of E(Y|W) in the empirical example, for various subsample sizes n.

6 Conclusion

We have shown how to carry out prediction in errors-in-variables regression when information from different sources (e.g. different laboratories) is combined and the errors have different distributions. The methods that we suggest enjoy optimal accuracy, in the sense that their rates of convergence are the best possible. However, those rates can vary widely, from root-n rates when the problem is in effect semiparametric, to much slower rates that are characteristic of a genuinely nonparametric problem.

The problem turns out to be quite complex in other ways, too, and has a number of subtle features and apparent contradictions. For example, the results superficially suggest that on occasion it might even be beneficial to artificially add noise to some of the data. However, as explained in Remark 4.3, such a conclusion is unwarranted because it does not take account of the way in which adding noise would affect the conditioning step.

Our methods can also be applied in settings where the error distributions are not known and are instead estimated, for example from repeated measurements; see Section 3. The methodology developed there can be taken further. This, and other practically important variants of the problem, offer interesting avenues for further research.


References

  • Berry S, Carroll RJ, Ruppert D. Bayesian Smoothing and Regression Splines for Measurement Error Problems. J. Amer. Statist. Assoc. 2002;97:160–169.
  • Carroll RJ, Hall P. Optimal rates of convergence for deconvolving a density. J. Amer. Statist. Assoc. 1988;83:1184–1186.
  • Carroll RJ, Maca JD, Ruppert D. Nonparametric regression in the presence of measurement error. Biometrika. 1999;86:541–554.
  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. Second Edition. Chapman and Hall/CRC Press; 2006.
  • Cheng C-L, Riu J. On estimating linear relationships when both variables are subject to heteroscedastic measurement errors. Technometrics. 2006;48:511–519.
  • Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. J. Amer. Statist. Assoc. 1994;89:1314–1328.
  • Delaigle A. Nonparametric density estimation from data with a mixture of Berkson and classical errors. Canad. J. Statist. 2007;35:89–104.
  • Delaigle A, Fan J, Carroll RJ. Local polynomial estimator for the errors-in-variables problem. 2008. Submitted for publication.
  • Delaigle A, Hall P, Meister A. On deconvolution with repeated measurements. Ann. Statist. 2008;36:665–685.
  • Delaigle A, Meister A. Nonparametric regression estimation in the heteroscedastic errors-in-variables problem. J. Amer. Statist. Assoc. 2007;102:1416–1426.
  • Delaigle A, Meister A. Density estimation with heteroscedastic error. Bernoulli. 2008;14:562–579.
  • Devanarayan V, Stefanski LA. Empirical simulation extrapolation for measurement error models with replicate measurements. Statist. Probab. Lett. 2002;59:219–225.
  • Fan J. On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist. 1991;19:1257–1272.
  • Fan J, Truong YK. Nonparametric regression with errors in variables. Ann. Statist. 1993;21:1900–1925.
  • Ferrari P, Kaaks R, Fahey M, Slimani N, Day NE, Pera G, Boshuizen HC, Roddam A, Boeing H, Nagel G, Thiebaut A, Orfanos P, Krogh P, Braaten T, Riboli E. Within- and between-cohort variation in measured macronutrient intakes, taking account of measurement errors, in the European Prospective Investigation into Cancer and Nutrition Study. American Journal of Epidemiology. 2004;160:814–822.
  • Ferrari P, Day NE, Boshuizen HC, Roddam A, Hoffmann K, Thiebaut A, Pera G, Overvad K, Lund E, Trichopoulou A, Tumino R, Gullberg A, Norat T, Slimani N, Kaaks R, Riboli E. The evaluation of the diet/disease relation in the EPIC study: considerations for the calibration and the disease models. International Journal of Epidemiology. 2008. Advance Access published January 6, 2008.
  • Flagg EW, Coates RJ, Calle EE, Potischman N, Thun M. Validation of the American Cancer Society Cancer Prevention Study II Nutrition Survey Cohort Food Frequency Questionnaire. Epidemiology. 2000;11:462–468.
  • Ganse RA, Amemiya Y, Fuller WA. Prediction when both variables are subject to error, with application to earthquake magnitudes. J. Amer. Statist. Assoc. 1983;78:761–765.
  • Hall P, Meister A. A ridge-parameter approach to deconvolution. Ann. Statist. 2007;35:1535–1558.
  • Hu Y, Schennach SM. Identification and estimation of nonclassical nonlinear errors-in-variables models with continuous distributions. Econometrica. 2008;76:195–216.
  • Huang XZ, Stefanski LA, Davidian M. Latent-model robustness in structural measurement error models. Biometrika. 2006;93:53–64.
  • Kim J, Gleser LJ. SIMEX approaches to measurement error in ROC studies. Comm. Statist. Theory Meth. 2000;29:2473–2491.
  • Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano R, Bingham S, Schoeller DA, Schatzkin A, Carroll RJ. The structure of dietary measurement error: results of the OPEN biomarker study. Amer. J. Epidemiology. 2003;158:14–21.
  • Kulathinal SB, Kuulasmaa K, Gasbarra D. Estimation of an errors-in-variables regression model when the variances of the measurement errors vary between the observations. Statist. Medicine. 2002;21:1089–1101.
  • Li T, Vuong Q. Nonparametric estimation of the measurement error model using multiple indicators. J. Multivariate Anal. 1998;65:139–165.
  • Linton O, Whang YJ. Nonparametric estimation with aggregated data. Econometric Theory. 2002;18:420–468.
  • Marron JS, Wand MP. Exact mean integrated squared error. Ann. Statist. 1992;20:712–736.
  • National Research Council, Committee on Pesticides in the Diets of Infants and Children. Pesticides in the Diets of Infants and Children. National Academies Press; 1993.
  • Schennach SM. Estimation of nonlinear models with measurement error. Econometrica. 2004a;72:33–75.
  • Schennach SM. Nonparametric regression in the presence of measurement error. Econometric Theory. 2004b;20:1046–1093.
  • Simonoff JS. Smoothing Methods in Statistics. Springer; New York: 1996.
  • Staudenmayer J, Ruppert D. Local polynomial regression and simulation-extrapolation. J. Roy. Statist. Soc., Ser. B. 2004;66:17–30.
  • Staudenmayer J, Ruppert D, Buonaccorsi J. Density estimation in the presence of heteroskedastic measurement error. J. Amer. Statist. Assoc. 2008 to appear.
  • Stefanski LA. Measurement error models. J. Amer. Statist. Assoc. 2000;95:1353–1358.
  • Stefanski LA, Carroll RJ. Deconvoluting kernel density estimators. Statistics. 1990;21:165–184.
  • Taupin ML. Semi-parametric estimation in the nonlinear structural errors-in-variables model. Ann. Statist. 2001;29:66–93.
  • Thamerus M. Fitting a mixture distribution to a variable subject to heteroscedastic measurement errors. Comput. Statist. 2003;18:1–17.
  • Wand MP, Jones MC. Kernel Smoothing. Chapman and Hall; London: 1995.