J Multivar Anal. Author manuscript; available in PMC 2010 April 1.
Published in final edited form as:
J Multivar Anal. 2009 April 1; 100(4): 726–741.
doi:  10.1016/j.jmva.2008.08.003
PMCID: PMC2663964

Improved Estimation in Multiple Linear Regression Models with Measurement Error and General Constraint


In this paper, we define two restricted estimators for the regression parameters in a multiple linear regression model with measurement errors when prior information for the parameters is available. We then construct two sets of improved estimators, which include the preliminary test estimator, the Stein-type estimator and the positive rule Stein-type estimator for both slope and intercept, and examine their statistical properties, such as the asymptotic distributional quadratic biases and the asymptotic distributional quadratic risks. We remove the distributional assumption on the error term, which was generally imposed in the literature, and provide a more general comparison of the quadratic risks of these estimators. Simulation studies illustrate the finite-sample performance of the proposed estimators, which are then used to analyze a dataset from the Nurses Health Study.

Key words and phrases: Asymptotic distributional quadratic bias, asymptotic distributional quadratic risk, attenuation-correction estimator, James-Stein type estimator, positive rule Stein type estimator, preliminary test estimator, risk function

1. Introduction

Improving estimation for regression models is a fundamental and interesting topic. In certain cases, one may have some prior information about the parameters of interest without being sure that it is valid. By incorporating this information into the estimation procedure, one may obtain more efficient estimators than those obtained when the prior information is ignored. Statistical approaches for developing more efficient estimators can be roughly classified into two categories. The first focuses on developing a proper test procedure to check the validity of the uncertain prior information. If the prior information is confirmed, then the commonly used estimators are modified to accommodate it. The second develops a procedure in which testing and estimation are conducted simultaneously. The first procedure is very natural and commonly used in both theory and applications. For example, consider the multiple linear regression model:


$$Y = X\beta + \varepsilon, \tag{1.1}$$

where $X$ is the $n \times p$ design matrix with rank $p$, $\beta$ is the $p \times 1$ regression parameter, and $\varepsilon$ is the $n \times 1$ random error vector. Suppose that we have prior information for $\beta$, which can be described as $R\beta = r$, where $R$ is a $q \times p$ known matrix of rank $q$ and $r$ is a $q \times 1$ known vector, $q \le p$. A proper test statistic would be based on a distance between $R\hat\beta_n$ and $r$, where $\hat\beta_n$ is a “good” estimator of $\beta$, for instance, the least squares estimator, $\hat\beta_{LS} = (X'X)^{-1}X'Y$, or the maximum likelihood estimator. If the prior information is rejected, one should keep using these “good” estimators; otherwise, restricted least squares estimators should be used.

The estimators for the regression parameters, which fall into the second category, include the preliminary test estimator (PTE), the James-Stein type estimator (JSTE), and the positive rule Stein type estimator (PRSE). See Judge and Bock (1978) and Saleh (2006) for a detailed discussion on these estimators. Bancroft (1944) was among the first to consider PTE. Saleh and Sen (1978) extended his idea to a nonparametric setup. JSTE was introduced by Stein (1956) and James and Stein (1961), and expanded by Saleh and Sen (1978, 1986) and Sen and Saleh (1987) to nonparametric areas.

The aforementioned estimation techniques have recently received much attention in linear regression models in which the covariates are measured with error. Stanley (1986, 1988) revealed that JSTE can eliminate the inconsistency of the classical least squares estimators. Shalabh (1998) studied properties of JSTE when the covariance matrix of the measurement errors is known. For the simple linear regression model with measurement error, when the slope parameter may be the null scalar and all the random components are normally distributed, Kim and Saleh (2003) compared these estimators in the sense of asymptotic distributional quadratic bias, mean square error matrix, and asymptotic distributional quadratic risk. Their comparisons show that PTE behaves better than the attenuation-correction (AC) estimator if the slope is close to 0, but is not uniformly better than the AC estimator over the whole range of the regression parameter. Kim and Saleh (2005) further investigated the same question for multiple linear models under the same assumptions and setting. Various risk functions based on the asymptotic distribution of the estimators under certain local alternatives were calculated and compared. They also showed that JSTE and PRSE dominate the AC estimator.

This paper mainly focuses on the estimation problem in multiple linear regression models with measurement errors. The contributions we made to the existing literature in this work contain four parts:

  1. to remove the normality assumption. The normality assumption greatly simplifies the theoretical argument, but it is often violated in a practical study. The removal of the normality assumption will make the theoretical results more general and applicable;
  2. to improve estimation under the general constraint $R\beta = r$, which contains the case investigated by Kim and Saleh (2003, 2005), $\beta = 0$, as a special one. The theoretical difficulty lies in how the asymptotic distribution of the AC estimator under the local hypothesis depends on the unknown parameter $\beta$. This has a substantial impact on constructing the estimators subject to the constraint, or the restricted estimator (RE), which is the building block for constructing PTE, JSTE and PRSE;
  3. to calculate the risk functions for linear combinations of intercept and slope parameters to help estimating the mean response;
  4. to explicitly compare risk functions for various proposed estimators under a certain circumstance.

The outline of the paper is as follows. In Section 2, we define two REs, based on which two sets of PTE, JSTE and PRSE are constructed. The risk functions of various estimators for the slope parameters under the null hypothesis and local alternatives are presented in Section 3. Also the risks are compared among proposed estimators in some special cases. Simulation studies are presented in Section 4. The proposed estimators are used to analyze a dataset from the Nurses Health Study in Section 5. Our results provide more appropriate estimates and information for nutritional study. The proofs of the main results are shifted to the Appendix.

2. Improved Estimator in Multiple Linear Model

Suppose that (Yi, Xi), i = 1,…,n, constitute an independent and identically distributed sample from the linear regression model:


$$Y_i = \beta_0 + \beta' X_i + \varepsilon_i, \tag{2.1}$$

where $X_i$ is a $p \times 1$ vector-valued covariate. We are interested in the estimation of the unknown parameter $\beta$ when the covariates $X_i$ are measured with error. Instead of observing $X_i$, we observe $W_i = X_i + u_i$, where the measurement errors $u_i$ are independent and identically distributed, independent of $(Y_i, X_i)$, with mean zero and covariance matrix $\Sigma_{uu}$, which is assumed to be known throughout this paper.

Section 4.4.3 of Saleh (2006) provides a general road map to construct improved estimators if one has some uncertain information about the unknown parameter, say $\theta \in \Theta$, in a statistical model. First, we obtain an optimal unrestricted estimator, say $\tilde\theta_n$, and an optimal RE, say $\hat\theta_n$, by the likelihood method if the likelihood function is available, or by the least squares method if the likelihood function is unavailable. Second, we construct an optimum test statistic, say $L_n$, for testing the “uncertain prior information”, say $\theta \in \Theta_0$, where $\Theta_0$ is a subset of the parameter space $\Theta$. Third, we construct the PTE of $\theta$ as $\theta_{n,PTE} = \tilde\theta_n - (\tilde\theta_n - \hat\theta_n)I(L_n < L_{n,\alpha})$, where $L_{n,\alpha}$ is the $\alpha$-level critical value of $L_n$ from its distribution under $H_0: \theta \in \Theta_0$. Finally, we replace the indicator function $I(L_n < L_{n,\alpha})$ by a smooth decreasing function $cL_n^{-1}$, where $c$ is a suitable constant derived by using empirical Bayesian theory. Then JSTE is defined by $\theta_{n,JSTE} = \tilde\theta_n - c(\tilde\theta_n - \hat\theta_n)L_n^{-1}$, and PRSE by $\theta_{n,PRSE} = \hat\theta_n + (1 - cL_n^{-1})I(L_n > c)(\tilde\theta_n - \hat\theta_n)$.

To adopt the above general rule in our setting, we need to find an “optimal” test statistic to check $R\beta = r$. The test statistic we use in this paper is $L_n = n(R\hat\beta_{AC} - r)'(R\hat G_n R')^{-1}(R\hat\beta_{AC} - r)$, where $\hat\beta_{AC}$ is the AC estimator, which is defined by


$$\hat\beta_{AC} = (S_{WW} - \Sigma_{uu})^{-1} S_{WY}, \tag{2.2}$$

where $S_{WW}$ and $S_{WY}$ are the sample covariance matrix of the $W_i$’s and the sample covariance between the $W_i$’s and the $Y_i$’s, respectively, and $\hat G_n$ is a consistent estimator of the asymptotic covariance matrix of $\hat\beta_{AC}$, denoted by $G$, which is defined by (2.5) in the next section. Under general conditions, $L_n$ has an asymptotic $\chi^2$-distribution with $q$ degrees of freedom. In fact, $L_n$ is the likelihood ratio statistic under the normality assumptions. The next step is to find an RE for $\beta$ under the general constraint.
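As a rough illustration of the quadratic form in the test statistic, the following Python sketch evaluates $L_n$ from a slope estimate, the constraint $(R, r)$ and a covariance estimate. The function names (`ln_statistic`, `solve`, `mat_mul` and so on) and the toy inputs are ours, not the paper's, and the construction of $\hat G_n$ itself is not shown; it is passed in as a plain matrix.

```python
def mat_vec(A, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mat_mul(A, B):
    """Matrix-matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    p = len(M)
    aug = [list(M[i]) + [b[i]] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda row: abs(aug[row][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for row in range(c + 1, p):
            f = aug[row][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[row][k] -= f * aug[c][k]
    x = [0.0] * p
    for row in reversed(range(p)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, p))
        x[row] = (aug[row][p] - tail) / aug[row][row]
    return x

def ln_statistic(beta_ac, R, r, G_hat, n):
    """L_n = n (R b - r)' (R G_hat R')^{-1} (R b - r), with b the AC estimate."""
    d = [v - ri for v, ri in zip(mat_vec(R, beta_ac), r)]
    RGRt = mat_mul(mat_mul(R, G_hat), transpose(R))
    return n * sum(di * si for di, si in zip(d, solve(RGRt, d)))

# Toy inputs (hypothetical): p = 3 slopes, q = 2 constraints.
R = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
G_hat = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
L_n = ln_statistic([1.0, 2.0, 3.0], R, [2.0, 2.0], G_hat, n=100)
```

The value of `L_n` would then be compared with a $\chi^2_q$ critical value taken from a table or a statistics library.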

2.1. Construction of estimator subject to the constraint

Here is the intuition behind the methods of construction of RE. If (X, ε, U) is normally distributed, then we can directly calculate the conditional expectation E(Y|W), which is a linear function of W and can be used to derive the maximum likelihood estimators of β0 and β, and the associated RE. If (X, ε, U) is not normally distributed, E(Y|W) may not easily be calculated, or may be a nonlinear function of W. We give another version of RE as follows.

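Assume that $(X_i', \varepsilon_i, u_i')'$ follow $N_{2p+1}[(\mu_x', 0, 0')', \mathrm{blockdiag}(\Sigma_{xx}, \sigma^2, \Sigma_{uu})]$. It is easy to see that $E(Y_i|W_i) = \nu_0 + \nu'W_i$ and $\mathrm{Var}(Y_i|W_i) = \beta'\Sigma_{xx}(I - K_{xx})\beta + \sigma^2 \equiv \sigma^2_{yw}$, where $\nu_0 = \beta_0 + \beta'(I - K_{xx}')\mu_x$, $\nu = K_{xx}\beta$, and $K_{xx} = \Sigma_{WW}^{-1}\Sigma_{xx}$ is the reliability matrix. Gleser (1992) and Kim and Saleh (2005) showed that the maximum likelihood estimators of $\nu_0$, $\nu$ and $\sigma^2_{yw}$ are just the naive least squares estimators, namely $\hat\nu_{0n} = \bar Y - \hat\nu_n'\bar W$, $\hat\nu_n = S_{WW}^{-1}S_{WY}$ and $\hat\sigma^2_{yw} = n^{-1}\sum_{i=1}^n (Y_i - \hat\nu_{0n} - \hat\nu_n'W_i)^2$, provided $\hat\sigma^2_{yw} - \hat\nu_n'(\hat K_{xx}^{-1})'\Sigma_{uu}\hat\nu_n$ is nonnegative, where $\hat K_{xx} = S_{WW}^{-1}\hat\Sigma_{xx} = S_{WW}^{-1}(S_{WW} - \Sigma_{uu})$, and $\bar Y$ and $\bar W$ are the sample means of the $Y_i$’s and $W_i$’s, respectively. Hence an estimator for the slope $\beta$ can be obtained from $\nu = K_{xx}\beta$ with $\nu$ replaced by $\hat\nu_n$ and $K_{xx}$ by $\hat K_{xx}$. This leads to the AC estimator $\hat\beta_{AC}$ given by (2.2), and $\hat\beta_0 = \bar Y - \bar W'\hat\beta_{AC}$.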
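The attenuation correction described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper names are hypothetical, the sample covariances use the divisor n - 1 (the text does not say which divisor is intended), and the slope is computed as $(S_{WW} - \Sigma_{uu})^{-1}S_{WY}$, which is what $\hat K_{xx}^{-1}\hat\nu_n$ reduces to.

```python
def sample_cov(A, B):
    """Sample covariance between the columns of A (n x a) and B (n x b), divisor n - 1."""
    n = len(A)
    a_mean = [sum(col) / n for col in zip(*A)]
    b_mean = [sum(col) / n for col in zip(*B)]
    return [[sum((A[i][j] - a_mean[j]) * (B[i][k] - b_mean[k]) for i in range(n)) / (n - 1)
             for k in range(len(b_mean))] for j in range(len(a_mean))]

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    p = len(M)
    aug = [list(M[i]) + [b[i]] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda row: abs(aug[row][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for row in range(c + 1, p):
            f = aug[row][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[row][k] -= f * aug[c][k]
    x = [0.0] * p
    for row in reversed(range(p)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, p))
        x[row] = (aug[row][p] - tail) / aug[row][row]
    return x

def ac_estimator(W, Y, sigma_uu):
    """Return (intercept, slopes) of the attenuation-corrected fit.

    W: n x p observed covariates, Y: length-n responses,
    sigma_uu: known p x p measurement-error covariance matrix.
    """
    n, p = len(W), len(W[0])
    S_ww = sample_cov(W, W)
    S_wy = [row[0] for row in sample_cov(W, [[y] for y in Y])]
    corrected = [[S_ww[j][k] - sigma_uu[j][k] for k in range(p)] for j in range(p)]
    beta = solve(corrected, S_wy)                      # (S_WW - Sigma_uu)^{-1} S_WY
    w_bar = [sum(col) / n for col in zip(*W)]
    beta0 = sum(Y) / n - sum(wb * b for wb, b in zip(w_bar, beta))
    return beta0, beta

# Toy data (hypothetical): one covariate, exactly linear in the observed W.
b0, b = ac_estimator([[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 2.0, 3.0], [[0.5]])
```

In this toy case the naive least squares slope is 1, and subtracting the known error variance from $S_{WW}$ inflates it, which is the direction of correction one expects when the covariate is measured with error.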

To construct REs for $\beta_0$ and $\beta$, we first give an RE for $\nu$. Note that the general restriction can be written as $RK_{xx}^{-1}\nu = r$. If $K_{xx}$ is known, then using the Lagrangian multiplier, one can show that the restricted maximum likelihood estimators of $\nu$, $\nu_0$ are given by


If the reliability matrix $K_{xx}$ is unknown, replace it with $\hat K_{xx}$. Then the REs for $\beta$, $\beta_0$ are defined as


where $\hat H_n = \hat K_{xx}^{-1}S_{WW}^{-1}(\hat K_{xx}^{-1})'$, which can easily be shown to be a consistent estimator of $H = K_{xx}^{-1}\Sigma_{WW}^{-1}(K_{xx}^{-1})'$.

If the random components are not normal, the conditional expectation $E(Y|W) = \beta_0 + \beta'E(X|W)$ may not be linear in $W$, and the conditional variance $\mathrm{Var}(Y|W)$ may vary with $W$. Hence, a linear model is inappropriate. In fact, linearity of $E(X|W)$ in $W$ and homoscedasticity of $\mathrm{Var}(X|W)$ in $W$ imply that $(X', U')$ must be multivariate normal; see Goel and DeGroot (1980) and Rao (1976). Although the estimators given in (2.3) are obtained based on the normal assumption, it is worth investigating whether the estimators $\hat\beta_{RE}$, $\hat\beta_{0,RE}$ have good properties in non-normal settings. See Gleser (1992) for the details.

Another way to construct the REs in the measurement error regression models is to mimic the procedure for the RE in model (1.1). That is, $\tilde\beta_{RE} = \hat\beta_{LS} - SR'(RSR')^{-1}(R\hat\beta_{LS} - r)$, where $S$ is the asymptotic covariance matrix of $\hat\beta_{LS}$. Then an RE for $\beta$ in the measurement error setting can be defined as


where Ĝn is a consistent estimator of the asymptotic covariance matrix G of [beta]AC, and


See Fuller (1987), Carroll et al. (2006) and Kim and Saleh (2005) for how to derive $\hat G_n$ under the normal and non-normal setups. The RE for $\beta_0$ can be defined accordingly. Clearly, the difference between the REs given in (2.3) and (2.4) comes from the difference between $\hat H_n$ and $\hat G_n$.
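The remark that the two REs differ only through $\hat H_n$ versus $\hat G_n$ suggests a single routine with the matrix passed as an argument. The Python sketch below assumes both REs take the projection form $\hat\beta_{AC} - CR'(RCR')^{-1}(R\hat\beta_{AC} - r)$, with $C$ standing for either matrix; the function names and the toy numbers are hypothetical.

```python
def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    p = len(M)
    aug = [list(M[i]) + [b[i]] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda row: abs(aug[row][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for row in range(c + 1, p):
            f = aug[row][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[row][k] -= f * aug[c][k]
    x = [0.0] * p
    for row in reversed(range(p)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, p))
        x[row] = (aug[row][p] - tail) / aug[row][row]
    return x

def restricted_estimator(beta_ac, R, r, C):
    """beta_ac - C R' (R C R')^{-1} (R beta_ac - r); C plays the role of H_hat or G_hat."""
    d = [v - ri for v, ri in zip(mat_vec(R, beta_ac), r)]   # R b - r, length q
    CRt = mat_mul(C, transpose(R))                          # p x q
    lam = solve(mat_mul(R, CRt), d)                         # (R C R')^{-1} (R b - r)
    adjust = mat_vec(CRt, lam)
    return [b - a for b, a in zip(beta_ac, adjust)]

# Toy inputs (hypothetical).
R = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
r = [2.0, 1.0]
C = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 3.0]]
beta_re = restricted_estimator([1.0, 2.0, 3.0], R, r, C)
```

Whatever positive definite $C$ is supplied, the output satisfies the constraint exactly; the choice of $C$ only changes how the adjustment is spread over the coordinates.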

2.2. Construction of Improved Estimators

Now we are ready to give PTE, JSTE and PRSE for the regression coefficient $\beta$ based on Saleh’s (2006) general rule and the REs defined by (2.3) and (2.4). For the sake of clarity, we shall put a ^ over those estimators based on $\hat\beta_{RE}$, and a ~ over those based on $\tilde\beta_{RE}$.

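  • PTE: $\hat\beta_{PTE} = \hat\beta_{AC} - (\hat\beta_{AC} - \hat\beta_{RE})I(L_n < \chi^2_\alpha)$, where $\chi^2_\alpha$ is the upper $\alpha$-percentile of the $\chi^2$-distribution with $q$ degrees of freedom;
  • If $q \ge 3$, then one can define
  • JSTE: $\hat\beta_{JSTE} = \hat\beta_{AC} - (q - 2)(\hat\beta_{AC} - \hat\beta_{RE})L_n^{-1}$,
  • PRSE: $\hat\beta_{PRSE} = \hat\beta_{RE} + [1 - (q - 2)L_n^{-1}]I(L_n > q - 2)(\hat\beta_{AC} - \hat\beta_{RE})$.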
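The three definitions above combine the AC estimate and an RE in slightly different ways, which a short Python sketch may make concrete. The function name `improved_estimators` and the numbers are ours; the critical value is passed in rather than computed, and $L_n > 0$ is assumed.

```python
def improved_estimators(beta_ac, beta_re, L_n, q, chi2_crit):
    """Combine the AC estimate and a restricted estimate into PTE, JSTE and PRSE.

    chi2_crit is the upper alpha-percentile of the chi-square distribution
    with q degrees of freedom, supplied by the caller.
    """
    diff = [a - r for a, r in zip(beta_ac, beta_re)]
    out = {"PTE": list(beta_re) if L_n < chi2_crit else list(beta_ac)}
    if q >= 3:  # the Stein-type estimators are defined for q >= 3 only
        shrink = (q - 2) / L_n
        out["JSTE"] = [a - shrink * d for a, d in zip(beta_ac, diff)]
        keep = (1.0 - shrink) if L_n > q - 2 else 0.0
        out["PRSE"] = [r + keep * d for r, d in zip(beta_re, diff)]
    return out

# Toy inputs (hypothetical); 6.251 is approximately the upper 10% point of chi-square(3).
moderate = improved_estimators([2.0, 1.0], [1.0, 1.0], L_n=4.0, q=3, chi2_crit=6.251)
small = improved_estimators([2.0, 1.0], [1.0, 1.0], L_n=0.5, q=3, chi2_crit=6.251)
large = improved_estimators([2.0, 1.0], [1.0, 1.0], L_n=10.0, q=3, chi2_crit=6.251)
```

With `L_n = 0.5` the JSTE in this toy example lands on the far side of the restricted estimate, while the PRSE stops at the restricted estimate.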

In a similar way, we can define $\tilde\beta_{PTE}$, $\tilde\beta_{JSTE}$, and $\tilde\beta_{PRSE}$ by replacing $\hat\beta_{RE}$ with $\tilde\beta_{RE}$ in the expressions of $\hat\beta_{PTE}$, $\hat\beta_{JSTE}$ and $\hat\beta_{PRSE}$, respectively. It would be interesting to see what differences these estimators based on the two REs may have with respect to the risk comparisons.

From the above definitions, one can see that if the data yield $L_n < \chi^2_\alpha$, then $\hat\beta_{PTE} = \hat\beta_{RE}$; otherwise, $\hat\beta_{PTE} = \hat\beta_{AC}$. So PTE is indeed a simple mixture of the AC estimator and the RE. In the ordinary two-step procedure, one would test the hypothesis $R\beta = r$ first, and then decide, based on the test result, which estimator to adopt. PTE simply combines these two steps into a single one. That is, testing and estimation are done simultaneously, while JSTE replaces the indicator function $I(L_n < \chi^2_\alpha)$ with a continuous function, $(q - 2)L_n^{-1}$, of $L_n$. In the normal case, one can actually obtain JSTE by the empirical Bayesian estimation approach. The constant appearing in JSTE in the classical multiple regression model should be $(q - 2)(n - p)/(n - p + 2)$ instead of $q - 2$ (Saleh, 2006). Since we consider the asymptotic risk function, changing the constant to $q - 2$ makes no difference in the large-sample sense, although this change may have some impact on the small-sample behavior of the estimators. The derivation of PRSE in the current setting is similar to its counterpart in the classical regression case. In fact, if $L_n$ tends to 0, then JSTE may go “past” the estimator $\hat\beta_{RE}$, or JSTE may have a different sign from the RE. As a partial remedy, one may restrict $L_n > q - 2$, which results in PRSE. The corresponding PTE, JSTE and PRSE for the intercept can be defined accordingly.

For the sake of convenience, in what follows, we shall call the estimators $\hat\beta_{RE}$, $\hat\beta_{PTE}$, $\hat\beta_{JSTE}$, and $\hat\beta_{PRSE}$ the “hat” estimators, and the estimators $\tilde\beta_{RE}$, $\tilde\beta_{PTE}$, $\tilde\beta_{JSTE}$, and $\tilde\beta_{PRSE}$ the “tilde” estimators.

2.3. Asymptotic Distribution of the New Estimators

To begin with, we state regularity conditions associated with model (2.1), which will be used throughout the current and next sections. Some conditions are already mentioned in the previous section. They are listed again for the sake of completeness.

  • C1: The regression errors $\varepsilon_i$, $i = 1,\dots,n$, are i.i.d. with mean 0 and finite positive variance $\sigma^2$; the measurement error vectors $u_i$, $i = 1,\dots,n$, are i.i.d. with mean vector 0 and covariance matrix $\Sigma_{uu}$, which is known. The fourth moment of the Euclidean norm of $u$ exists, that is, $E\|u\|^4 < \infty$.
  • C2: εi and ui are independent, i = 1,…,n.
  • C3: Xi’s are i.i.d. with mean vector μx and finite positive covariance matrix Σxx, and are independent of εi and ui.

These conditions are quite common in the literature on measurement error models. The existence of the fourth moment of $u$ is needed to ensure the asymptotic normality of the AC estimator, and was also assumed in Schneeweiss (1976).

We begin by considering the asymptotic behavior of the improved estimators under the fixed alternative $H_a: R\beta - r = \delta \ne 0$, where $\delta$ is a vector of length $q$. We claim that $L_n \to \infty$ in probability as $n \to \infty$. This claim is based on the following expression.


The first term is of order $O_p(1)$, the second of order $O_p(\sqrt{n})$, and the third of order $O_p(n)$ and positive.

Let $\beta_n^*$ be a generic notation for the hat estimators and tilde estimators. Under the fixed alternative, it is readily seen that $\sqrt{n}(\beta_n^* - \beta) = \sqrt{n}(\hat\beta_{AC} - \beta) + o_p(1)$ if Conditions C1–C3 hold. That is, all the estimators defined above are asymptotically equivalent to $\hat\beta_{AC}$. This implies that the asymptotic risk functions are all the same if $H_a$ is true. Thus we cannot tell any difference among these estimators.

To obtain meaningful risk comparisons among these estimators, we consider a sequence of local alternatives, that is,


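$$R\beta_n = r + \frac{\delta}{\sqrt{n}}, \tag{2.6}$$

with fixed $R$, $\delta$ and $r$, $\lim_{n\to\infty}\beta_n = \beta$, $R\beta = r$, where $\delta$ is a $q \times 1$ vector. Write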


The following theorem states the asymptotic distributions of the AC, hat and tilde estimators under the local alternatives (2.6).

Theorem 2.1

Suppose Conditions C1–C3 hold. Then, under the local alternatives (2.6), as n → ∞, we have, in distribution,


where $Z \sim N(0, G)$ and $L = (RZ + \delta)'(RGR')^{-1}(RZ + \delta)$. For the AC estimator $\hat\beta_{AC}$, $\varphi(L) = 0$. For the hat estimators, $C = H$, with $\varphi(L) = 1$ for $\hat\beta_{RE}$; $\varphi(L) = I_{(0,\chi^2_\alpha)}(L)$ for $\hat\beta_{PTE}$; $\varphi(L) = (q - 2)L^{-1}$ for $\hat\beta_{JSTE}$; and $\varphi(L) = 1 - [1 - (q - 2)L^{-1}]I_{(q-2,\infty)}(L)$ for $\hat\beta_{PRSE}$. For the tilde estimators, $C = G$, and the $\varphi(\cdot)$ functions are the same as their counterparts for the hat estimators.

In particular, if $R = I_{p\times p}$, then the asymptotic distributions of the two REs are the same, and so are the asymptotic distributions of the two PTEs, JSTEs, and PRSEs.
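To get a feel for how the $\varphi(\cdot)$ functions in Theorem 2.1 translate into risk, one can simulate the limiting variable directly. The Python sketch below is an illustration under strong simplifying assumptions of ours, not the paper's setting: the limit is taken to have the form $Z - \varphi(L)J_C(RZ + \delta)$ with $J_C = CR'(RCR')^{-1}$, we set $G = C = I_p$, the risk weight is the identity, and $R$ picks the last $q$ coordinates, so that $J_C = R'$ and $L = \|RZ + \delta\|^2$. The names `phi` and `mc_risk` are hypothetical.

```python
import random

def phi(kind, L, q, crit):
    """Shrinkage weight phi(L) listed in Theorem 2.1 for each estimator."""
    if kind == "AC":
        return 0.0
    if kind == "RE":
        return 1.0
    if kind == "PTE":
        return 1.0 if L < crit else 0.0
    if kind == "JSTE":
        return (q - 2) / L
    if kind == "PRSE":
        return 1.0 - (1.0 - (q - 2) / L) * (1.0 if L > q - 2 else 0.0)
    raise ValueError(kind)

def mc_risk(kind, p, q, delta, crit, draws=20000, seed=0):
    """Monte Carlo estimate of E||Z - phi(L) R'(R Z + delta)||^2 with Z ~ N(0, I_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        shifted = [z[p - q + j] + delta[j] for j in range(q)]   # R Z + delta
        L = sum(v * v for v in shifted)
        w = phi(kind, L, q, crit)
        free = sum(v * v for v in z[:p - q])                    # coordinates R does not touch
        constrained = sum((z[p - q + j] - w * shifted[j]) ** 2 for j in range(q))
        total += free + constrained
    return total / draws

# Hypothetical configuration: p = 8, q = 6; 10.645 is approximately the upper 10% point
# of chi-square(6).
null = [0.0] * 6
risks = {k: mc_risk(k, 8, 6, null, 10.645) for k in ("AC", "RE", "PTE", "JSTE", "PRSE")}
```

In this simplified setting the simulated risks at $\delta = 0$ order themselves as RE, then PTE, then the Stein-type estimators, then AC, and moving $\delta$ away from zero inflates the RE risk by $\|\delta\|^2$ while leaving the AC risk unchanged.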

3. Asymptotic Bias and Risk Functions

The quantities commonly used to evaluate the performance of an estimator are its bias, variance and mean squared error. For the estimators proposed in the last section, we will calculate their asymptotic distributional quadratic bias and asymptotic distributional quadratic risk function with different weight matrices, and show that all estimators but AC are asymptotically biased. General comparisons between different estimators based on these asymptotic results turn out to be very difficult, but detailed comparisons may be made under special circumstances. We mainly focus on the asymptotic bias and risk functions of the estimators for the slope parameters and certain linear combinations of the intercept and slope parameters.

The asymptotic distributional bias function of an estimator $\beta_n^*$ is defined as the bias of the asymptotic distribution of $\sqrt{n}(\beta_n^* - \beta_n)$, that is, $b(\beta, \beta_n^*) = E[Z - \varphi(L)J_C(RZ + \delta)]$ according to Theorem 2.1. To make meaningful comparisons among the asymptotic distributional biases, we define the asymptotic distributional quadratic bias as below:


Let M be a known positive definite weight matrix. The asymptotic risk function of the estimator βn of β is defined as


To concisely state the results, the following notation is used.


where $i = 2, 4$, $j = 1, 2$, and $\chi^2_{q+i,\lambda}$ is the noncentral $\chi^2$ random variable with $q + i$ degrees of freedom and noncentrality parameter $\lambda$.

The asymptotic distributional bias functions can be derived by using the fact that $E\varphi(L)Z = G^{1/2}\mu[E\varphi(\chi^2_{q+2,\lambda}) - E\varphi(\chi^2_{q,\lambda})]$ for any measurable function $\varphi(\cdot)$. This can be verified by using (A.8) in the Appendix. The details are omitted for the sake of brevity. The following theorem lists the results for the asymptotic distributional bias. Its proof follows from Theorem 2.1. We omit the details.

Theorem 3.1

Suppose Conditions C1–C3 hold. Then, under the local alternatives, the asymptotic biases and asymptotic distributional quadratic biases for the AC, hat and tilde estimators are given by


respectively. For the AC estimator, $f(\lambda) = 0$; for all hat estimators, $C = H$, and for all tilde estimators, $C = G$. For RE, $f(\lambda) = 1$; for PTE, $f(\lambda) = g_{q+2}(\lambda)$; for JSTE, $f(\lambda) = h_{1,q+2}(\lambda)$; and for PRSE, $f(\lambda) = k_{1,q+2}(\lambda)$.

Within each set of estimators, the asymptotic distributional quadratic biases can be compared simply through the quantities $g_{q+2}(\lambda)$, $h_{1,q+2}(\lambda)$, and $k_{1,q+2}(\lambda)$. It is worth mentioning that within each set of estimators, the asymptotic distributional quadratic biases of PTE, JSTE and PRSE are no larger than that of RE, since none of $g_{q+2}(\lambda)$, $h_{1,q+2}(\lambda)$, and $k_{1,q+2}(\lambda)$ exceeds 1.

The more interesting comparison would be made between the hat estimators and tilde estimators. A natural way to make the comparisons is to compute the difference between the biases or the risk functions. However, the computation is rather complex in general situations. But for some special cases, the comparison can be made easily. In particular, we have the following corollary.

Corollary 3.1

Suppose u ~ Np(0, σuuI), Σxx = σxxI, then for any δ, the asymptotic distributional quadratic biases of the hat estimators are all less than those of the corresponding tilde estimators.

For illustration, let $p = 5$, $q = 3$ (recall that $q > 2$ is required in constructing JSTE and PRSE), $\alpha = 0.15$, $\beta = (3, 3, 1, 2, 3)'$, $\Sigma_{xx} = \Sigma_{uu} = I_{5\times 5}$, $u \sim N(0, \Sigma_{uu})$, $\sigma^2 = 1$, and let $R$ be a $3 \times 5$ matrix with the components in the first two columns all 0 and the last three columns forming an identity matrix. The vector $\delta$ has identical elements; specifically, it equals $\delta\mathbf{1}$, where $\delta$ is now a scalar and $\mathbf{1}$ is a vector of 1’s. Plot (a) in Figure 1, in which the dotted lines represent the asymptotic distributional quadratic biases for the tilde estimators and the solid lines those for the hat estimators, displays the asymptotic distributional quadratic biases. In our case, the asymptotic distributional quadratic bias functions of the hat and tilde estimators for each type of estimator (RE, PTE, JSTE or PRSE) are close to each other, but the asymptotic distributional quadratic biases of the hat estimators are all slightly smaller than those of the corresponding tilde estimators, which agrees with Corollary 3.1. Now, we change $\Sigma_{xx}$ to a positive definite matrix with diagonal elements all equal to 1 and off-diagonal elements all equal to 0.1, while the other quantities stay unchanged. In contrast to plot (a), plot (b) in Figure 1 shows the opposite ordering, that is, the asymptotic distributional quadratic biases of the tilde estimators are all slightly smaller than those of the corresponding hat estimators. This shows that, in the sense of asymptotic distributional quadratic bias, neither the hat estimators nor the tilde estimators dominate the others over all scenarios.

Figure 1
Bias Plots. The dotted and solid lines are the biases for the tilde estimators and the hat estimators and the horizontal line represents the bias of the AC estimator.

The asymptotic weighted risk functions of $\hat\beta_{AC}$ and of the hat and tilde estimators are summarized in the following theorem.

Theorem 3.2

Suppose Conditions C1–C3 hold, then


while the risk functions of $\hat\beta_{JSTE}$, $\hat\beta_{PRSE}$ have similar forms as the following


The risk function of $\hat\beta_{JSTE}$ corresponds to $f = h$, and that of $\hat\beta_{PRSE}$ corresponds to $f = k$.

The risk function comparisons can be done by investigating the matrices $G$ and $H$, but the actual comparison may be complicated, if possible at all, because $G = (\sigma^2 + \beta'\Sigma_{uu}\beta)H + \Sigma_{xx}^{-1}[E(uu'\beta\beta'uu') - \Sigma_{uu}\beta\beta'\Sigma_{uu} - \Sigma_{uu}(\beta'\Sigma_{uu}\beta)]\Sigma_{xx}^{-1}$, and the risk function involves inverse operations such as $(RGR')^{-1}$. This comparison can be done in a straightforward way for some special cases. For example, if $M = H^{-1}$, we have the following corollary.

Corollary 3.2

Suppose Conditions C1–C3 hold. Let $M = H^{-1}$. Then


the risk functions of $\hat\beta_{JSTE}$, $\hat\beta_{PRSE}$ have similar forms as the following


The risk function of $\hat\beta_{JSTE}$ corresponds to $f = h$, and that of $\hat\beta_{PRSE}$ corresponds to $f = k$.

Based on Corollary 3.2, one can make the comparisons among the estimators more specifically. In fact,

  • The pretest estimator $\hat\beta_{PTE}$ performs better than the AC estimator $\hat\beta_{AC}$ if and only if $\delta$ satisfies
    In particular, under $H_0$, that is, $\delta = 0$,
  • For a given $\alpha$, PTE is not uniformly better than the AC estimator. One may determine an $\alpha$ such that PTE has a minimum guaranteed relative efficiency. As in Kim and Saleh (2005) and Saleh (2006), the relative efficiency of $\hat\beta_{PTE}$ to $\hat\beta_{AC}$ is defined as $E(\alpha, \lambda) = \rho(\beta, \hat\beta_{AC})/\rho(\beta, \hat\beta_{PTE})$. For any given $R$, $r$, $G$, $H$, the relative efficiency is a function of $\alpha$ and $\lambda$. Suppose the minimum efficiency required is $E_0$; then we can choose $\alpha$ by solving the equation $\min_\lambda E(\alpha, \lambda) = E_0$. An explicit solution may not be available, but the solution can be found by a numerical search. That is, we compute the corresponding $\min_\lambda E(\alpha, \lambda)$ for several $\alpha$ values, and select one such that $\min_\lambda E(\alpha, \lambda)$ is close to but not less than $E_0$. Now we use the setup for Figure 1 to illustrate the choice of $\alpha$. The weight matrix $M$ is chosen to be $H^{-1}$ and $G^{-1}$, respectively. Table 1 reports, for each $\alpha$, the maximum relative efficiency (denoted by max), the minimal relative efficiency (denoted by min) and also the value of $\delta$ corresponding to the minimal relative efficiency, denoted by $\delta_{\min}$.
    Table 1
    Maximum and Minimum Guaranteed Efficiencies
    So, in our cases, if the PTE of $\beta$ is required to have relative efficiency of at least 0.80 compared with the AC estimator, then in both cases ($M = H^{-1}$ and $M = G^{-1}$) we choose $\alpha = 0.15$ as the level of the test. In practice, $G$ and $H$ are unknown. To implement this procedure, one needs some preliminary estimates of these quantities.
  • The risk function of $\hat\beta_{JSTE}$ can be written as
    which can be shown by using the definition of the $h$-function and the following facts on $\chi^2$-distributions:
    See (2.2.13d) and (2.2.13e) in Saleh (2006).
    A sufficient condition that ensures that $\hat\beta_{JSTE}$ is superior to $\hat\beta_{AC}$ is given by
    where $\Lambda_{\max}(\cdot)$ is the maximal eigenvalue of its matrix argument. If, in addition, we assume that $u \sim N(0, \sigma_{uu}I)$ and $\Sigma_{xx} = \sigma_{xx}I$, then the above sufficient condition can be written as $(c_1 - c_2)q \ge 2c_1$, where $c_1 = \sigma^2 + \sigma_{uu}\beta'\beta$ and $c_2 = \beta'R'(RR')^{-1}R\beta\,\sigma_{uu}^2/(\sigma_{xx} + \sigma_{uu})$. Furthermore, if $c_2 = 0$, this sufficient condition simplifies to $q \ge 2$. This, together with the following fact, implies that PRSE and JSTE dominate the AC estimator when $c_2 = 0$, the case studied by Kim and Saleh (2005).
  • The difference $\rho(\beta, \hat\beta_{JSTE}) - \rho(\beta, \hat\beta_{PRSE})$ is
    which is nonnegative for all $\delta$, and implies that PRSE performs uniformly better than JSTE.
  • The difference $\rho(\beta, \hat\beta_{JSTE}) - \rho(\beta, \hat\beta_{PTE})$ is
    Under the null hypothesis ($\delta = 0$),
    Thus, when $R\beta$ is close to $r$, at the significance level $\alpha$, PTE should be used if $qP(\chi^2_{q+2} < \chi^2_\alpha) \ge q - 2$; otherwise, JSTE is preferable.
  • The difference $\rho(\beta, \hat\beta_{PRSE}) - \rho(\beta, \hat\beta_{PTE})$ is
    Under the null hypothesis ($\delta = 0$), $\rho(\beta, \hat\beta_{PRSE}) - \rho(\beta, \hat\beta_{PTE})$ equals
    Thus, when $R\beta$ is close to $r$, at the significance level $\alpha$, PTE should be used if $qP(\chi^2_{q+2} < \chi^2_\alpha) \ge q - 2 + qE[(1 - (q - 2)\chi^{-2}_{q+2})^2 I(\chi^2_{q+2} < q - 2)]$; otherwise, PRSE is preferable.
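The numerical search for $\alpha$ described in the list above (compute $\min_\lambda E(\alpha, \lambda)$ on a grid and keep the $\alpha$ whose minimum is closest to, but not below, $E_0$) can be sketched as follows. The function `choose_alpha` is our own naming, and `toy_efficiency` is a made-up placeholder with a dip near $\lambda = 2$; in practice it would be replaced by the ratio of risk functions evaluated at preliminary estimates of $G$ and $H$.

```python
def choose_alpha(alphas, lambdas, efficiency, e0):
    """Return (alpha, guaranteed_efficiency) with the guaranteed (minimum over the
    lambda grid) efficiency closest to e0 from above, or None if no alpha qualifies."""
    best = None
    for a in alphas:
        worst = min(efficiency(a, lam) for lam in lambdas)
        if worst >= e0 and (best is None or worst < best[1]):
            best = (a, worst)
    return best

def toy_efficiency(alpha, lam):
    """Placeholder only: a relative efficiency below 1 that is worst at lambda = 2
    and improves as alpha grows. Not the paper's E(alpha, lambda)."""
    return 1.0 - (1.0 - alpha) * 0.5 / (1.0 + (lam - 2.0) ** 2)

lambda_grid = [0.5 * k for k in range(21)]          # 0, 0.5, ..., 10
picked = choose_alpha([0.05, 0.15, 0.3, 0.6, 0.9], lambda_grid, toy_efficiency, e0=0.78)
```

A finer grid in either argument only changes the resolution of the answer; the selection rule itself stays the same.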

By applying the above theorem with H replaced by G, we can obtain the asymptotic weighted risk functions for all tilde estimators. Similar risk comparison analysis can be made as above, which we summarize in the following theorem.

Theorem 3.3

Suppose Conditions C1–C3 hold, then


the risk functions of $\tilde\beta_{JSTE}$, $\tilde\beta_{PRSE}$ have similar forms as the following


The risk functions of $\tilde\beta_{JSTE}$ and $\tilde\beta_{PRSE}$ correspond to $f = h$ and $f = k$, respectively.

In particular, if $M = G^{-1}$, then the above theorem reduces to the following corollary.

Corollary 3.3

Suppose Conditions C1–C3 hold, then


For the purpose of illustration, we use the previous setting, that is, p = 5, q = 3, α = 0.15, β = (3, 3, 1, 2, 3)′, Σxx = Σuu = I5×5, σ2 = 1, and R is a 3 × 5 matrix with the components in the first two columns being 0’s and the last three columns an identity matrix. For simplicity, δ is taken to be a vector with same elements. We plot the risk functions of the estimators in Figure 2, in which plot (a) is for M = H−1, and plot (b) is for M = G−1. In our cases, the risk functions of the hat and tilde estimators for each type of estimators (RE, PTE, JSTE or PRSE) are close to each other.

Figure 2
Risk Plots. The dotted and solid lines are the risks for the tilde estimators and the hat estimators and the horizontal line is the risk of the AC estimator.

Sometimes, we are interested in estimating a linear combination of the intercept and the slope parameters. For example, to predict the response at a specified value of the predictor, say $x_0 \in R^p$, one needs to calculate the value of $\beta_{0n}^* + x_0'\beta_n^*$, where $\beta_{0n}^*$ and $\beta_n^*$ are the generic notation for the AC estimator, hat or tilde estimators of $\beta_0$ and $\beta$. As we mentioned before, the corresponding estimators of the intercept $\beta_0$ are given by $\bar Y - \bar W'\beta_n^*$. For example, the AC estimator of $\beta_0$ is defined as $\hat\beta_{0,AC} = \bar Y - \bar W'\hat\beta_{AC}$, and the tilde RE of $\beta_0$ is given by $\tilde\beta_{0,RE} = \bar Y - \bar W'\tilde\beta_{RE}$. We can define the AC estimator and the hat and tilde estimators of the linear combination $\beta_0 + x_0'\beta$ accordingly. The risk function of $\beta_{0n}^* + x_0'\beta_n^*$ is then defined as the mean square error of the asymptotic distribution of $\sqrt{n}(\beta_{0n}^* + x_0'\beta_n^* - \beta_0 - x_0'\beta_n)$, which is denoted by $\rho(\beta_0 + x_0'\beta, \beta_{0n}^* + x_0'\beta_n^*)$, where $\beta_n$ is the same as in (2.6). The following theorem gives the risk functions of various estimators of the linear combination. Let $\tau = G^{-1}\Sigma_{xx}^{-1}E(u\beta'uu'\beta) - \mu_x$ and $\sigma_0^2 = \sigma^2 + \beta'\Sigma_{uu}\beta + \mu_x'G\mu_x - 2\mu_x'\Sigma_{xx}^{-1}E(u\beta'uu'\beta)$.

Theorem 3.4

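Suppose Conditions C1–C3 hold. Then

$$\begin{aligned}\rho(\beta_0 + x_0'\beta, \beta_{0n}^* + x_0'\beta_n^*) ={}& \sigma_0^2 + x_0'Gx_0 + 2x_0'G\tau \\ &+ (x_0 - \mu_x)'J_C[RGR'E\varphi^2(\chi^2_{q+2,\lambda}) + \delta\delta'E\varphi^2(\chi^2_{q+4,\lambda})]J_C'(x_0 - \mu_x) \\ &- 2(x_0 - \mu_x)'J_C[RG\,E\varphi(\chi^2_{q+2,\lambda}) - \delta\delta'J_G'E\varphi(\chi^2_{q+2,\lambda}) + \delta\delta'J_G'E\varphi(\chi^2_{q+4,\lambda})](x_0 + \tau),\end{aligned}$$

where $C = H$ corresponds to the estimators based on $\hat\beta_{RE}$ and $C = G$ corresponds to the estimators based on $\tilde\beta_{RE}$; $\varphi(x) = 0$ for the AC estimator, $\varphi(x) = 1$ for the RE, $\varphi(x) = I_{(0,\chi^2_\alpha)}(x)$ for PTE, $\varphi(x) = (q - 2)x^{-1}$ for JSTE and $\varphi(x) = 1 - [1 - (q - 2)x^{-1}]I_{(q-2,\infty)}(x)$ for PRSE.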

When the measurement errors are symmetric around 0, $E(u\beta'uu'\beta) = 0$ for all $\beta$. As a consequence, we have the following corollary.

Corollary 3.4

Suppose Conditions C1–C3 hold, and $E(u\beta'uu'\beta) = 0$. Then


where $\sigma_0^2 = \sigma^2 + \beta'\Sigma_{uu}\beta + \mu_x'G\mu_x$, and $C$ and $\varphi(\cdot)$ are as in Theorem 3.4.

4. Simulation Studies

To examine the finite-sample performance of the proposed estimators, we conduct simulation experiments under various scenarios. The data are generated from the following multiple linear regression model with measurement errors


The true regression parameter is chosen to be $\beta_n = \beta + \delta\mathbf{1}/\sqrt{n}$ with $R\beta = r$. We consider two cases, $\delta = 0$ and $\delta \ne 0$. For each case, we calculate the risk function for sample sizes $n$ = 50, 100, 200, and 500, and repeat each simulation 1000 times. The values of the risk reported in the tables below are the averages of 1000 sample risks. In the simulation, $\alpha$ is chosen to be 0.15. In practice, one can choose the value of $\alpha$ based on the maximin rule given in Table 1. The predictors $X$ are generated from the multivariate normal distribution $N(0, I_{8\times 8})$, and the measurement errors $u$ are generated from the multivariate normal distribution $N(0, 0.2^2 I_{8\times 8})$. $X$ and $u$ are independent. The regression error $\varepsilon$ follows $N(0, 1)$.
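The data-generating step of such a simulation can be sketched in Python as follows. We assume the model has the form $Y = \beta_0 + X'\beta_n + \varepsilon$ with $W = X + u$, standard normal predictors and regression errors, and independent normal measurement errors; the function name `simulate`, its signature, and the choice to pass the measurement-error standard deviation as the argument `sigma_u` are ours.

```python
import math
import random

def simulate(n, beta0, beta, delta, sigma_u, seed=0):
    """Generate (W, Y, beta_n) under a local alternative beta_n = beta + delta * 1 / sqrt(n).

    X ~ N(0, I_p), u ~ N(0, sigma_u^2 I_p), eps ~ N(0, 1), all independent.
    Only W = X + u and Y are returned as data; X itself is treated as unobserved.
    """
    rng = random.Random(seed)
    p = len(beta)
    beta_n = [b + delta / math.sqrt(n) for b in beta]
    W, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        u = [rng.gauss(0.0, sigma_u) for _ in range(p)]
        eps = rng.gauss(0.0, 1.0)
        Y.append(beta0 + sum(b * xi for b, xi in zip(beta_n, x)) + eps)
        W.append([xi + ui for xi, ui in zip(x, u)])
    return W, Y, beta_n

# Hypothetical call with p = 8 slopes and scalar delta = 2.
slopes = [0.0, 1.5, 0.75, 2.0, 0.0, -2.0, 0.0, 3.0]
W, Y, beta_n = simulate(100, 1.0, slopes, delta=2.0, sigma_u=0.2, seed=1)
```

Repeating this call with different seeds, computing each estimator on `(W, Y)`, and averaging the weighted squared errors would give Monte Carlo risks of the kind reported in the tables.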

Case 1

When $\delta = 0$, that is, $R\beta_n = r$ holds.

The true values of the regression parameters are chosen to be $\beta_0 = 1$, $\beta_1 = \beta_5 = \beta_7 = 0$, $\beta_2 = 1.5$, $\beta_3 = 0.75$, $\beta_4 = -\beta_6 = 2$, $\beta_8 = 3$. So the $\beta$’s satisfy the constraint $R\beta = r$ with


Table 2 reports the risk values of various slope estimators.

Table 2
Risk for the slope when the null hypothesis is true

In Table 2, “hat” denotes the hat estimators, and “tilde” denotes the tilde estimators. Each cell gives the risk value with weight matrix $G^{-1}$. The risks with weight matrix $H^{-1}$ have a similar pattern and are therefore omitted. In this simulation, the risks are quite stable across sample sizes within each class of estimators, and the risks of the hat estimators and of the tilde estimators are very close. The smallest risk is achieved by RE, followed by PTE, PRSE and JSTE, and the largest risk is obtained by the AC estimator. These results agree with our theory. For illustration, the histograms of the estimates, with the weight matrix $M = H^{-1}$, of the slope parameter for $n = 200$ are given in Figure 3.

Figure 3
Histograms for the estimates of the slope parameters.

Case 2

When δ ≠ 0, or $R\beta_n = r$ does not hold.

In this case, $R\beta_n - r = \delta R\mathbf{1}/\sqrt{n}$, where $\mathbf{1}$ is an 8 × 1 vector of ones. Other quantities in the model stay unchanged. Figure 4 reports the risks of the tilde estimators with weight matrix $G^{-1}$ for different sample sizes. The δ value ranges from 0 to 10. The risks of the hat estimators have a similar pattern regardless of the choice of weight matrix.

Figure 4
Risk Functions. Thin solid-line: risk of the AC estimator; dashed line: risk of RE; dotted line: risk of PTE; dash-dotted line: risk of JSTE; thick solid-line: risk of PRSE.

The risk of the AC estimator for the slope is almost constant across δ. When δ is close to 0, the RE for both slopes and intercept achieves the smallest risk, but its risk increases quickly as δ grows. The PTE for both slopes and intercept has a smaller risk when δ is small. Once δ moves away from 0, the risk of the PTE begins to increase and exceeds the risk of the AC estimator; after reaching a maximum, it decreases and eventually approaches the risk of the AC estimator. For small δ values, the risks of JSTE and PRSE for the slope estimators are higher than that of the RE. However, as δ gets bigger, JSTE and PRSE begin to dominate all other estimators. The patterns of the risks of the AC estimator, RE and PTE for the intercept are similar to those for the slopes, but this is not true for the risks of JSTE and PRSE for the intercept.

At the request of one referee, we also conducted a simulation study in which X and u follow non-normal distributions. Each component of the predictors X is generated from a uniform distribution on [1, 2], and each component of the measurement error u is generated from a uniform distribution on [−0.5, 0.5]; X1, …, X8, u1, …, u8 are independent. All other quantities are the same as in the normal case. The simulation results are similar to those in the normal case and are not reported here.

5. Real Data Example

The assessment of an individual diet is difficult, but fundamental in exploring the relationship between diet and cancer, and in monitoring dietary behavior among individuals and populations. A variety of dietary assessment instruments have been derived, of which three main types are most commonly used in nutritional research. The instrument of choice in large nutritional epidemiology studies is the Food Frequency Questionnaire (FFQ). For proper interpretation of epidemiologic studies that use FFQ’s as the basic dietary instrument, one needs to know the relationship between reported intakes from the FFQ, usual intake, energy, vitamin A, and other variables such as age and body mass index (bmi).

FFQ's are thought to often involve a systematic bias (i.e., under- or over-reporting at the level of the individual). The other records also include measurement errors. To illustrate the proposed method, we analyze a data set from the Nurses Health Study (Rosner, Spiegelman, and Willett, 1990), which has a calibration study of size n = 168 women. All of them completed a single FFQ and four multiple-day food diaries. There are 6 variables in this dataset: age (X1), bmi (X2), energy (X3), vitamin A (X4), usual intake (X5), and the calories from fat, FFQ (Y). Among these 6 variables, energy, usual intake, and vitamin A are measured with error, but for each subject these 3 variables are measured four times. A simple variance analysis suggests that the variance of energy is 3.63, the variance of vitamin A is 381.92, and the variance of usual intake is 10.34. For an initial analysis, the following multiple regression model is used to fit this dataset.


The averages of these four replications in X3, X4, X5 are used in the design matrix. The covariance matrix of the measurement errors is estimated by using the formula in Liang, Härdle, and Carroll (1999).
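The formula of Liang, Härdle, and Carroll (1999) is not reproduced in this section. As a hedged illustration of how replicates can be used, the sketch below assumes the standard pooled within-subject moment estimator of the measurement error covariance; the function name and the array layout are hypothetical. Dividing by the number of replicates gives the error covariance of the per-subject averages that enter the design matrix.

```python
import numpy as np

def replicate_error_cov(W_rep):
    """Pooled within-subject covariance from replicated measurements.

    W_rep has shape (n, m, p): n subjects, m replicates, p error-prone variables.
    Returns (Sigma_uu_hat, Sigma_uu_hat / m); the second matrix is the error
    covariance of the per-subject averages.
    """
    n, m, p = W_rep.shape
    resid = W_rep - W_rep.mean(axis=1, keepdims=True)   # deviations from subject means
    flat = resid.reshape(n * m, p)
    Sigma_uu_hat = flat.T @ flat / (n * (m - 1))        # unbiased pooled estimator
    return Sigma_uu_hat, Sigma_uu_hat / m
```

With n = 168 subjects, m = 4 replicates and p = 3 error-prone variables, `W_rep` would have shape (168, 4, 3).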

With all variables in the model and without constraint, Table 3 lists the estimated values based on the AC estimation for the slope, the estimated standard errors based on the procedure developed by Liang et al. (1999), and the associated p-values calculated from the t-distribution with n − 6 = 168 − 6 = 162 degrees of freedom.

Table 3
Estimates of the slope parameters with standard errors and associated p-values.

Table 3 shows that X3 and X4 are not significant at the 0.1 significance level, while the variables X1, X2, and X5 are significant. Recall that X5 represents usual intake, which is strongly related to intakes from the FFQ. On the other hand, vitamin A should not be a good predictor of food composition. Thus, it is sensible in nutritional research to use statistical methods that give a reasonable estimate while shrinking towards β3 = β4 = 0.

Now we impose the constraint β3 = β4 = 0 on our model. In this case q = 2. Table 4 reports the estimated values of the regression parameters obtained from the estimators under study. To measure the variation of these estimates, the bootstrap standard errors are also reported. For each estimation procedure, 1000 bootstrap samples are drawn, which gives 1000 bootstrap estimates of the slope parameters. To obtain a robust estimate of the standard errors, only the middle 80%, that is, 800 estimates, are used in the calculation. The resulting bootstrap standard errors are shown in brackets.
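The trimmed bootstrap described above can be sketched as follows. The sketch assumes that trimming is applied separately to each coefficient, that 10% is dropped from each tail, and that the sample standard deviation (divisor B − 1 on the retained values) is used; these details and the function names are assumptions, since the text only states that the middle 80% of the estimates are kept.

```python
import numpy as np

def bootstrap_estimates(W, Y, estimator, B=1000, rng=None):
    """Resample (W_i, Y_i) pairs with replacement and re-apply `estimator`."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(Y)
    out = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # indices of one bootstrap sample
        out.append(estimator(W[idx], Y[idx]))
    return np.array(out)

def trimmed_bootstrap_se(boot_estimates, keep=0.80):
    """Standard deviation of the middle `keep` fraction of bootstrap estimates."""
    est = np.sort(np.asarray(boot_estimates, dtype=float), axis=0)
    B = est.shape[0]
    drop = int(round(B * (1.0 - keep) / 2.0))   # number trimmed from each tail
    middle = est[drop:B - drop]
    return middle.std(axis=0, ddof=1)
```

Here `estimator` is any function returning a coefficient vector from a resampled data set, for instance one of the estimators compared in Table 4.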

Table 4
Estimates and Bootstrap Standard Deviations

Note that the bootstrap standard errors of AC in Table 4 are close to the standard errors of AC in Table 3. Also, the standard errors of RE, PTE are all smaller than their counterparts of AC. These results clearly show that the proposed estimators improve upon the usual AC estimator.

6. Discussion

We have introduced two classes of estimators, the hat and tilde estimators, and made a comprehensive comparison of their risk functions in some special cases. The comparison in general cases is more complicated, and it is difficult to say which estimators should be used in practice unless further information is available, such as the values of R, r, G and H. Usually, the practitioner may have some prior information about R and r. Also, once the sample is obtained, one can estimate G and H. Based on our comparison (see Figures 1 and 2), the quadratic bias functions and the risk functions of the hat estimators and the tilde estimators for the slope parameters do not differ substantially, even for different weight matrices. Our analysis indicates that if the prior knowledge about the regression parameters is true, RE, PTE, JSTE and PRSE all have smaller risks than the AC estimator. When the regression parameters deviate from the prior information, RE eventually becomes useless, while PTE behaves quite well, except for moderate departures from the null hypothesis. If the slope parameters are of interest, JSTE and PRSE are highly recommended in that they possess smaller risk than the AC estimator, RE and PTE. In particular, PRSE dominates JSTE in some special cases; see the discussion following Corollary 3.1.

The procedure developed in this paper can easily be extended to the case where Σuu is unknown but we have a consistent estimator of Σuu.

It is also possible to extend the procedure to the partially linear models Y = Xβ + ν(z) + ε with error-prone linear covariate X. The main work can be regarded as a combination of this paper and the work of Liang et al. (1999), because the latter already derived a root-n consistent estimator of the parameter β.

In principle, the method proposed in this paper can also be extended to linear or partially linear models for longitudinal data. The derivation should be straightforward except for more complex notation.


The authors thank two referees for their helpful comments which improved the presentation of this manuscript. They also thank Dr. Raymond J. Carroll for very helpful comments. This research was partially supported by NIH/NIAID grants AI62247 and AI059773.

Appendix. Proofs of the Main Results

Proof of Theorem 2.1

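A direct derivation yields $\hat\beta_{AC} - \beta_n = (S_{WW} - \Sigma_{uu})^{-1}(S_{WY} - S_{WW}\beta_n + \Sigma_{uu}\beta_n)$, and $S_{WY} - S_{WW}\beta_n = \frac{1}{n}\sum_{i=1}^{n}(W_i - \mu_x)(\varepsilon_i - u_i'\beta_n) - (\mu_x - \bar W)\bar u'\beta_n + O_p(1/n)$. Let $\xi_{in} = (X_i - \mu_x)(\varepsilon_i - u_i'\beta_n) + u_i\varepsilon_i - (u_iu_i' - \Sigma_{uu})\beta_n$. It follows that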


Note that the last two terms approach zero and $S_{WW} - \Sigma_{uu} \to \Sigma_{xx}$ in probability. Theorem 2.1 follows for the AC estimator by applying the central limit theorem, together with the fact that $\beta_n \to \beta$.

For all other hat estimators, we have the general form $\hat\beta^* = \hat\beta_{AC} - (\hat\beta_{AC} - \hat\beta_{RE})\varphi(L_n)$. Recall that $\hat\beta_{RE} = \hat\beta_{AC} - \hat H_nR'(R\hat H_nR')^{-1}(R\hat\beta_{AC} - r)$. It follows that $\hat\beta^* = \hat\beta_{AC} - \varphi(L_n)\hat H_nR'(R\hat H_nR')^{-1}(R\hat\beta_{AC} - r)$. Under the local alternative $R\beta_n = r + \delta/\sqrt{n}$, we know that


So the result for the hat estimators follows from the fact that $\hat H_nR'(R\hat H_nR')^{-1} \to J_H$ in probability, and $\sqrt{n}(\hat\beta_{AC} - \beta_n) \to Z$ in distribution. In the same way, we can prove Theorem 2.1 for all tilde estimators.
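The common form used in this proof, an AC estimate moved towards the constraint by the amount φ(Ln)·CR′(RCR′)⁻¹(Rβ̂AC − r), can be sketched as below. The particular φ functions are assumptions of this sketch: they follow the usual preliminary test, James-Stein and positive-rule conventions and are not restated in this section, and the test statistic L and the critical value `crit` are taken as given inputs. All function names are hypothetical.

```python
import numpy as np

def shrink_toward_constraint(beta_ac, C, R, r, phi_value):
    """Return beta_AC - phi * C R'(R C R')^{-1} (R beta_AC - r)."""
    step = C @ R.T @ np.linalg.solve(R @ C @ R.T, R @ beta_ac - r)
    return beta_ac - phi_value * step

def phi_re(L, q):
    return 1.0                       # restricted estimator: full move to the constraint

def phi_pte(L, crit):
    return 1.0 if L <= crit else 0.0  # assumed form: restrict only if the test accepts

def phi_jste(L, q):
    return (q - 2) / L               # assumed James-Stein-type weight

def phi_prse(L, q):
    return min(1.0, (q - 2) / L)     # assumed positive-rule truncation of the weight
```

Under these assumed weights, q = 2 gives φ = 0 for the Stein-type rules, so those estimators would coincide with the AC estimator.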

To prove Corollary 3.1 and compute the risk function of other estimators, we need the following lemmas.

Lemma A.1

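(Saleh, 2006) If the p × 1 vector Y is distributed normally with mean vector μ and covariance matrix $I_{p\times p}$, then for any measurable function φ,

$$E[\varphi(Y'Y)YY'] = E[\varphi(\chi^2_{p+2,\mu'\mu})]\,I_{p\times p} + E[\varphi(\chi^2_{p+4,\mu'\mu})]\,\mu\mu', \qquad (A.1)$$

$$E[\varphi(Y'Y)Y] = E[\varphi(\chi^2_{p+2,\mu'\mu})]\,\mu, \qquad (A.2)$$

provided that the expectations exist.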

Lemma A.2

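If the p × 1 vector Y is distributed normally with mean vector μ and covariance matrix $I_{p\times p}$, and A is an idempotent matrix with rank q ≤ p, then for any measurable function φ, we have

$$E[\varphi(Y'AY)YY'] = h_2A + h_4A\mu\mu'A + h_2[A\mu\mu'(I-A) + (I-A)\mu\mu'A] + h_0[(I-A) + (I-A)\mu\mu'(I-A)], \qquad (A.3)$$

$$E[\varphi(Y'AY)Y] = h_2A\mu + h_0(I-A)\mu, \qquad (A.4)$$

provided the expectations exist, where $h_i = E\varphi(\chi^2_{q+i,\lambda})$ and $\lambda = \mu'A\mu$.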


Let P be the p × p orthogonal matrix such that $PAP' = \mathrm{Blockdiag}(I_{q\times q}, 0_{(p-q)\times(p-q)})$, and write $P = (P_1', P_2')'$, where $P_1$ is q × p and $P_2$ is (p − q) × p. Let Z = PY; then $Y'AY = Z'\,\mathrm{Blockdiag}(I, 0)\,Z$. Partition Z into two blocks, $Z = (Z_1', Z_2')'$. It is easy to see that $Z_1 \sim N(P_1\mu, I_{q\times q})$ and $Z_2 \sim N(P_2\mu, I_{(p-q)\times(p-q)})$, and that $Z_1$ and $Z_2$ are independent. Hence $Y'AY = Z_1'Z_1$, $YY' = P'ZZ'P$, and


By (A.1), $E[\varphi(Z_1'Z_1)Z_1Z_1'] = E[\varphi(\chi^2_{q+2,\mu'P_1'P_1\mu})]\,I_{q\times q} + E[\varphi(\chi^2_{q+4,\mu'P_1'P_1\mu})]\,P_1\mu\mu'P_1'$. By the independence of $Z_1$ and $Z_2$ and (A.2),


Hence $E[\varphi(Y'AY)YY']$ equals


Thus (A.3) is obtained by using the facts that $P_1'P_1 = A$ and $P_2'P_2 = I - A$.

To prove (A.4), one can show that $E\varphi(Y'AY)Y = P_1'P_1\mu h_2 + P_2'P_2\mu h_0$. Again, $P_1'P_1 = A$ and $P_2'P_2 = I - A$ imply the desired result.
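The identity E[φ(Y′AY)Y] = h2·Aμ + h0·(I − A)μ, with h_i = Eφ(χ²(q+i, λ)) and λ = μ′Aμ, can be checked numerically. The following Monte Carlo sketch is only a sanity check under assumptions: A is taken to be a symmetric idempotent matrix, φ is a bounded function chosen for illustration, and the noncentrality parameter is the sum of squared means, which is the convention NumPy's `noncentral_chisquare` uses. The function name and defaults are hypothetical.

```python
import numpy as np

def check_identity(mu, A, phi, n_draws=200_000, seed=0):
    """Monte Carlo estimates of both sides of E[phi(Y'AY) Y] = h2*A mu + h0*(I - A) mu."""
    rng = np.random.default_rng(seed)
    p = len(mu)
    q = int(round(np.trace(A)))                 # rank of an idempotent matrix equals its trace
    lam = float(mu @ A @ mu)                    # noncentrality lambda = mu'A mu
    Y = rng.standard_normal((n_draws, p)) + mu  # Y ~ N(mu, I)
    quad = np.einsum("ij,jk,ik->i", Y, A, Y)    # Y'AY for every draw
    lhs = (phi(quad)[:, None] * Y).mean(axis=0)
    h0 = phi(rng.noncentral_chisquare(q, lam, n_draws)).mean()
    h2 = phi(rng.noncentral_chisquare(q + 2, lam, n_draws)).mean()
    rhs = h2 * (A @ mu) + h0 * (mu - A @ mu)
    return lhs, rhs
```

For instance, with `A = np.diag([1.0, 1.0, 0.0, 0.0])` and `phi = lambda x: 1.0 / (1.0 + x)`, the two returned vectors should agree up to simulation error.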

Proof of Corollary 3.1

It is sufficient to show that $J_H'J_H \le J_G'J_G$. From the normality of u and Lemma A.2, one can show that


Let $c_1 = (\sigma^2 + \sigma_{uu}\beta'\beta)(\sigma_{xx} + \sigma_{uu})/\sigma_{xx}^2$ and $c_2 = \sigma_{uu}^2/\sigma_{xx}^2$. The diagonal form of the covariance matrices of x and u implies $G = c_1I + c_2\beta\beta'$. Then


Since $H = (\sigma_{xx} + \sigma_{uu})I/\sigma_{xx}^2$, one obtains $(RHR')^{-1}RHHR'(RHR')^{-1} = (RR')^{-1}$.

It follows from a direct calculation that


Since $R'(RR')^{-1}R$ is an idempotent matrix with rank q ≤ p, $I - R'(RR')^{-1}R$ is nonnegative definite, and hence $\beta'R'(RR')^{-1}R\beta \le \beta'\beta$. Then (A.6) and (A.7) imply




Note that the left-hand side is $J_H'J_H$, and the right-hand side is $J_G'J_G$. This completes the proof.

Proof of Theorems 3.2 and 3.3

From Theorem 2.1, one can see that the asymptotic distributions of all estimators are the same as that of $Z - \varphi(L)CR'(RCR')^{-1}(RZ + \delta)$, where C = H or G, and φ(·) is defined in Theorem 2.1. We now compute the risk function for this general form. The results in Theorems 3.2 and 3.3 can be obtained by replacing C with H and G, respectively.

Let $Y = G^{-1/2}Z + \mu$, $\mu = G^{1/2}R'(RGR')^{-1}\delta$, and $A = G^{1/2}R'(RGR')^{-1}RG^{1/2}$. Then $Y \sim N_p(\mu, I)$, and $L = Y'AY$. We first show that




In fact, $E\varphi(L)Z$ can be written as $G^{1/2}E\varphi(L)Y - G^{1/2}\mu E\varphi(L)$. By (A.4) and $E\varphi(L) = E\varphi(\chi^2_{q,\lambda})$, we obtain that


Then (A.8) is obtained by noticing that $A\mu = \mu$.

Note that $E\varphi(L)ZZ' = G^{1/2}E[\varphi(L)(Y - \mu)(Y - \mu)']G^{1/2}$, and $E[\varphi(L)(Y - \mu)(Y - \mu)'] = E[\varphi(L)YY'] - E[\varphi(L)Y]\mu' - \mu E[\varphi(L)Y'] + \mu\mu'E[\varphi(L)]$. Using (A.3) and (A.4), after some algebra, we prove (A.9).

Now we are ready to compute the risk functions. For any positive definite matrix M, we have $\rho(\beta^*, \beta_n) = E[Z - \varphi(L)J_C(RZ + \delta)]'M[Z - \varphi(L)J_C(RZ + \delta)]$, which equals


The first term equals tr(MG). To compute the second and third terms, note that




by (A.8) and (A.9). Correspondingly,


This completes the proof.

Proof of Theorem 3.4

Recall that the AC estimator for β is given by $\hat\beta_{AC} = (S_{WW} - \Sigma_{uu})^{-1}S_{WY}$, and the estimator of β0 is given by $\hat\beta_0 = \bar Y - \bar W'\hat\beta_{AC}$. Note that


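Let $Z_n = \sqrt{n}(\hat\beta_{AC} - \beta_n)$ and $Z_{0n} = \sqrt{n}(\hat\beta_{0n} - \beta_0)$. The multivariate central limit theorem and the fact that $\lim_{n\to\infty}\beta_n = \beta$ indicate that $(Z_{0n}, Z_n')'$ converges in distribution to $(Z_0, Z')'$, a normal vector with mean $(0, 0')'$ and covariance matrix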


where G is given in (2.5) and β here is subject to $R\beta = r$.

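Note that all the proposed estimators for β have the common form $\beta^*_n = \hat\beta_{AC} - \varphi(L_n)J_{C_n}(R\hat\beta_{AC} - r)$ with $C_n = \hat H_n$ or $\hat G_n$. So $\sqrt{n}(\beta^*_n - \beta_n) = Z_n - \varphi(L_n)J_{C_n}(RZ_n + \delta)$. From (A.12), we know that $\sqrt{n}(\beta^*_n - \beta_n) \to Z - \varphi(L)J_C(RZ + \delta)$ in distribution. Therefore, for any real vector x0,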


The estimators for the intercept β0, $\beta^*_{0n} = \bar Y - \bar W'\beta^*_n$, can be expressed as


which can be written as


by recalling the notation $Z_{0n}$ and $Z_n$. From (A.12), we have $\sqrt{n}(\beta^*_{0n} - \beta_0) \to Z_0 + \varphi(L)\mu_x'J_C(RZ + \delta)$ in distribution. Therefore,


To deal with the last term, denote $\tau = G^{-1}\Sigma_{xx}^{-1}E(u\beta'uu'\beta) - \mu_x$; then $E(Z_0|Z) = \tau'Z$, $Z_0 - E(Z_0|Z)$ and Z are independent, and L depends on Z only. We have


Therefore, $nE(\beta^*_{0n} - \beta_0)^2$ tends to


Finally, we have


From $E(Z_0|Z) = \tau'Z$, we have $E(Z_0Z) = E(ZZ')\tau = G\tau$. Then by (A.14),


A combination of (A.15), (A.13) and (A.16) yields


The theorem is proved by this expression, (A.10) and (A.11).


AMS 2000 subject classification: 62J05; 62F30; secondary 62J99.


Contributor Information

Hua Liang, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA.

Weixing Song, Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA.


1. Bancroft TA. On biases in estimation due to the use of preliminary tests of significance. Ann Math Statist. 1944;15:190–204.
2. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models. 2nd ed. New York: Chapman and Hall; 2006.
3. Fuller WA. Measurement Error Models. New York: Wiley & Sons; 1987.
4. Gleser LJ. The importance of assessing measurement reliability in multivariate regression. J Am Statist Assoc. 1992;87:696–707.
5. Goel PK, DeGroot MH. Only normal distributions have linear posterior expectations in linear regression. J Am Statist Assoc. 1980;75:895–900.
6. James W, Stein C. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. 1961;1:361–379.
7. Judge GG, Bock ME. The Statistical Implications of Pre-test and Stein-Rule Estimators in Econometrics. New York: North Holland Publication; 1978.
8. Kim HM, Saleh AKMdE. Preliminary test estimators of the parameters of simple linear model with measurement error. Metrika. 2003;57:223–251.
9. Kim HM, Saleh AKMdE. Improved estimation of regression parameters in measurement error models. J Mult Anal. 2005;95:273–300.
10. Liang H, Härdle W, Carroll RJ. Estimation in a semiparametric partially linear errors-in-variables model. Ann Statist. 1999;27:1519–1535.
11. Rao CR. Characterization of prior distributions and solutions to a compound decision problem. Ann Statist. 1976;4:823–835.
12. Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error. Am J Epidemiol. 1990;132:734–735.
13. Saleh AKMdE. Theory of Preliminary Test and Stein-Type Estimation with Application. New York: Wiley & Sons; 2006.
14. Saleh AKMdE, Sen PK. On shrinkage least-squares estimation in a parallelism problem. Ann Statist. 1978;6:154–168.
15. Saleh AKMdE, Sen PK. Non-parametric estimation of location parameter after a preliminary test regression. Commun Statist. 1986;15:1451–1466.
16. Schneeweiss H. Consistent estimation of a regression with errors in the variables. Metrika. 1976;23:101–115.
17. Sen PK, Saleh AKMdE. On preliminary test and shrinkage M-estimation in linear models. Ann Statist. 1987;15:1580–1592.
18. Shalabh Improved estimation in measurement error models through Stein rule procedure. J Mult Anal. 1998;67:35–48.
19. Stanley TD. Stein rule least squares estimation, a heuristic for fallible data. Econ Lett. 1986;20:147–150.
20. Stanley TD. Improved estimators in some linear errors-in-variables models in finite samples. J Forecasting. 1988;7:103–113.
21. Stein C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. 1956;1:197–206.