The parameter of interest is *β*_{1} from the generalized linear model

*g*[*E*(*Y* | *x*, **U**)] = *β*_{0} + *β*_{1}*x* + ***β***_{2}′**U**, (1)

where *Y* is the outcome of interest, *g*[*A*] is a link function which linearizes the conditional mean function in the covariates, and **U** is a vector of covariates measured without error. Substituting the covariate measured with error, *X*, for *x*, the uncorrected point and interval estimates of effect are adjusted for measurement error in a one-step procedure. When *g*[*A*] = *E*(*Y* | *X*, **U**), regression calibration is applied to a linear regression model (Fuller 1987). When *g*[*A*] = *logit*[*E*(*Y* | *X*, **U**)], regression calibration is applied to a logistic regression model (Rosner, Spiegelman et al. 1990; Rosner, Spiegelman et al. 1992). When *g*[*A*] = *log*[*I*(*t*)], where *I*(*t*) is the incidence rate at time *t*, then, when the disease is rare, regression calibration can be applied to a Cox proportional hazards regression model (Prentice 1982; Spiegelman, McDermott et al. 1997; Wang, Hsu et al. 1997; Hu, Tsiatis et al. 1998). The application of regression calibration to these three basic models, all of which are used widely in epidemiology, was unified with a special focus on interval estimation and computing in SAS (Spiegelman, McDermott et al. 1997). Application of measurement error methods to data requires that the main study, which contains data (*Y*_{i}, *X*_{i}, **U**_{i}), *i* = 1,…, *n*_{1}, is supplemented by an external validation study, which contains data (*X*_{i}, *x*_{i}, **U**_{i}), *i* = *n*_{1} + 1,…, *n*_{1} + *n*_{2}. Because it is difficult and expensive to validate *X*_{i} with *x*_{i}, the validation study is typically much smaller than the main study, i.e. *n*_{1} >> *n*_{2}.
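The main study/validation study design and the basic one-step correction can be sketched in a few lines of Python. This is a hypothetical simulation for the linear (identity link) case, where the univariate correction divides the uncorrected slope by the calibration slope *γ̂*_{1}; all parameter values and variable names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 20000, 500          # main study is much larger than the validation study

# True exposure x and its error-prone surrogate X (illustrative additive error)
x_main = rng.normal(0.0, 1.0, n1)
X_main = x_main + rng.normal(0.0, 0.8, n1)
Y_main = 1.0 + 0.5 * x_main + rng.normal(0.0, 0.3, n1)   # linear model, beta1 = 0.5

# External validation study: (X_i, x_i) pairs, no outcome Y
x_val = rng.normal(0.0, 1.0, n2)
X_val = x_val + rng.normal(0.0, 0.8, n2)

def slope(x, y):
    """OLS slope of y on x."""
    x = x - x.mean()
    y = y - y.mean()
    return (x @ y) / (x @ x)

beta1_star = slope(X_main, Y_main)   # uncorrected (attenuated) estimate
gamma1 = slope(X_val, x_val)         # calibration slope from the validation study
beta1_rc = beta1_star / gamma1       # regression calibration correction
```

The uncorrected slope is attenuated toward zero; dividing by *γ̂*_{1} approximately restores the true *β*_{1} = 0.5.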

The point and interval estimates of effect can be corrected for measurement error using Rosner *et al.*'s formulas (Rosner, Spiegelman et al. 1990), in the univariate case

*β̂*_{RC,1} = *β̂**_{1}/*γ̂*_{1}, (2)

where *β̂**_{1} and *Vâr*(*β̂**_{1}) are estimated by fitting (1) to the main study data, *(Y, X, **U**)*. The first row of the validation-study coefficient matrix, denoted **γ̂**, and its estimated covariance matrix are obtained from fitting the linear regression model to the validation data

*x*_{i} = *α*′ + *γ*_{1}*X*_{i} + ***γ***_{2}′**U**_{i} + *ε*_{i} (3)

under the assumption that

*Var*(*ε*_{i} | *X*_{i}, **U**_{i}) = *σ*^{2}. (4)

Appendix 1 of Rosner, Spiegelman et al. (1990) gives the construction of the corrected estimator from these quantities, and equation (A7) of the same paper gives its variance. Assumption (4) is the homoscedasticity assumption, the relaxation of which is the focus of this manuscript. Regression calibration has been presented by others for use in a variety of settings (Prentice 1982; Fuller 1987; Armstrong, Whittemore et al. 1989; Rosner, Spiegelman et al. 1990; Rosner, Spiegelman et al. 1992; Spiegelman, McDermott et al. 1997; Wang, Hsu et al. 1997; Carroll, Ruppert et al. 2006). All versions of the regression calibration method assume that measurement error is non-differential with respect to the response variable, *Y*, i.e.

*f*(*X* | *x*, **U**, *Y*) = *f*(*X* | *x*, **U**),

where *f*(·) is a density function. In addition to the assumptions specified by (1), (3) and (4), in the case of univariate *x*, the key additional requirement for approximate unbiasedness of the regression calibration estimator when *g*[*E*(*Y* | *x*, **U**)] in (1) is logistic is that either *β*_{1}^{2}*σ*^{2} is small, or *Pr*(*Y* = 1 | *x*, **U**) is small and *f*(*x* | *X*, **U**) is normal (Kuha 1994).

It is sometimes the case with biological variables that heteroscedasticity is evident over the observed range of the data; when this occurs, it typically takes the form that the spread increases with the level. Sometimes, but by no means always, the variance can be stabilized by log-transforming the data, but this solution can be undesirable when the variable in question is to be used as a predictor variable in a regression model and the scientific hypothesis focuses on measuring the relationship of the variable, in its originally observed scale, to the outcome variable of interest. In this paper, we assume that an empirically verifiable linear model for the mean, (3), holds, but that assumption (4), constant variance of the measurement error model for *x* | *X*, **U**, is untenable for the observed data. Instead, it may appear from the available validation data that

*Var*(*ε*_{i} | *X*_{i}, **U**_{i}) = *h*(*X*_{i}, **U**_{i})*σ*^{2}

is a more reasonable model for the variance, where *h*(*X*_{i}, **U**_{i}) is some function of the covariates which induces heteroscedasticity in the measurement error model. In what follows, we derive an extension to the standard regression calibration estimator given above in (2), and to its multivariate counterpart, in which the constant residual variance requirement given by (4) is eliminated.
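The heteroscedastic variance model can be made concrete with a small simulation. In the sketch below the validation residual variance grows linearly with the surrogate, *h*(*X*) = *X*, and a weighted linear regression with weights 1/*h*(*X*) recovers *γ*_{1} and *σ*^{2}. This is an illustrative sketch (simulated data, no covariates **U**, made-up parameter values), not the paper's software:

```python
import numpy as np

rng = np.random.default_rng(1)
n2 = 4000
sigma2, alpha, gamma1 = 0.25, 0.2, 0.9

X = rng.uniform(1.0, 5.0, n2)                  # surrogate, kept positive so h(X) = X > 0
eps = rng.normal(0.0, np.sqrt(sigma2 * X))     # Var(eps | X) = h(X) * sigma2 with h(X) = X
x = alpha + gamma1 * X + eps                   # linear calibration model for the mean, as in (3)

# Weighted least squares with weights 1/h(X), i.e. 1/X
W = 1.0 / X
D = np.column_stack([np.ones(n2), X])          # design matrix [1, X]
coef = np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (W * x))
resid = x - D @ coef
gamma1_hat = coef[1]
sigma2_hat = np.sum(W * resid**2) / (n2 - 2)   # weighted residual mean square estimates sigma2
```

Because the weights are the reciprocal of *h*(*X*), each weighted squared residual has expectation *σ*^{2}, so the weighted residual mean square is a natural estimate of *σ*^{2}.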

Standard regression calibration, which assumes homoscedastic measurement error, can be derived by a first order Taylor series expansion around the naïve likelihood in which the mis-measured exposure is treated as if there were no error. In what follows, the second order expansion is developed. By adding the additional term, measurement error model heteroscedasticity can be accommodated, albeit with an unavoidably more complex estimator. In another approach to deriving the regression calibration estimator, the approximate logistic likelihood is derived under the assumptions of normal residual measurement error and rare disease. We develop these approaches in more detail in what follows below, to provide the formal derivation for the new estimator.

We assume that the mean of **x** given (**X**, **U**) follows the linear model (3), and that

*Var*(**x** | **X**, **U**) = *h*(**X**, **U**)*σ*^{2},

where *dim*(**x**) = *q* and *h*(**X**, **U**) is a *q* × *q* function. Then, the distribution of *Y* | **X**, **U** for logistic regression with rare disease and multivariate normality is

*Pr*(*Y* = 1 | **X**, **U**) ≈ *exp*[*β*_{0}* + ***β***_{1}′*E*(**x** | **X**, **U**) + ½***β***_{1}′*Var*(**x** | **X**, **U**)***β***_{1} + ***β***_{2}′**U**], (6)

where *β*_{0}* absorbs the constant terms (Kuha 1994). A similar result for general mean-variance models, using a second-order Taylor series expansion about *E*(*x* | *X*, **U**) (i.e. a small measurement error approximation), was obtained for scalar *x* by Carroll *et al.* (Carroll, Ruppert et al. 2006). For *dim*(*x*) > 1, Carroll and Stefanski (Carroll and Stefanski 1990) presented an analogous result, but omitted the detailed derivation. Similar approximations have been given for the relative risk function in **X**(*t*), a vector-valued, possibly time-varying covariate, in survival data analysis (Prentice 1982), and for linear regression (Spiegelman, McDermott et al. 1997), where (6) is exact.

Insight can be gained by inspecting the form of (6) with no covariates, **U**, and for scalar *x*. Plotting (6) with the measurement error model parameters and logistic regression parameters estimated from the Nurses' Health Study data on breast cancer incidence in relation to alcohol consumption, discussed in Section 3.2, shows the theoretical relationship between *log*[*E*(*Y*)] under the rare disease assumption and *x* given by (1) (solid line), and the approximate induced relationships between *log*[*E*(*Y*)] and *X* for *h*(*X*) = *X* (dotted line), *h*(*X*) = *X*^{2} (short dashed line), and *h*(*X*) = 1, the homoscedastic case (long dashed line). It is evident that when *h*(*X*) = *X*, the uncorrected estimator is likely to under-estimate *β*_{1}, but when *h*(*X*) = *X*^{2}, over-estimation could occur. If *h*(*X*) = *X*, (6) simplifies to a linear model in *X* as a function of the measurement error model and logistic regression model parameters (see Section 2.2), and it can be seen that the standard regression calibration estimator is also likely to over-estimate *β*_{1} (long dashed line).

To simplify notation, *β*_{0} + *β*_{1}*α*′ was rewritten as *β*_{0}*. Then, solving for *β*_{1} in the two terms multiplying *X* and *h*(*X*) in (6), two estimators for *β*_{1},

*β̂*_{11} = *β̂**_{1}/*γ̂*_{1} and *β̂*_{12} = *sign*(*β̂**_{1}/*γ̂*_{1})√(2*β̂**_{2}/*σ̂*^{2}), (7)

where *β̂**_{1} and *β̂**_{2} denote the uncorrected coefficients of *X* and *h*(*X*), and *sign*(*t*) = 1 if *t* is positive and −1 if *t* is negative, are available from the uncorrected regression of *Y* on *X* and *h*(*X*). Of course, the approximate likelihood (6) can be directly fit to the data using an iterative approach, jointly estimating all parameters simultaneously. Alternatively, in what follows we suggest a procedure for obtaining *β̂*_{RCH,1} which can be constructed from standard software tools and used in routine analysis. Both approaches provide consistent estimators, but the maximum likelihood estimator will, of course, be more efficient asymptotically. Later, we will compare the behavior of these estimators in a simulation study.

The new method is as follows:

- A logistic regression model of *Y* on *X*, *h*(*X*, **U**) and **U** is run in the main study to obtain *β̂**_{1} and *β̂**_{2} and their estimated variances,
- A weighted linear regression is run in the validation study, with weights 1/*h*(*X*, **U**), to obtain *σ̂*^{2} and *γ̂*_{1},
- *β̂*_{11} and *β̂*_{12} are formed from the formulas above and efficiently combined to produce a single estimate, *β̂*_{RCH,1}.
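The combination step above can be sketched as an inverse-variance-weighted average of the two estimates. The simplified weights below ignore the covariance between the two estimators; the asymptotically optimal weights, which account for that covariance, are the ones derived in Appendix 1:

```python
def combine(b11, v11, b12, v12):
    """Inverse-variance-weighted combination of two estimates of beta1.

    Simplified sketch: weights proportional to 1/variance, ignoring the
    covariance between the two estimators (the optimal weights are in
    Appendix 1 of the paper).
    """
    w11, w12 = 1.0 / v11, 1.0 / v12
    b_rch1 = (w11 * b11 + w12 * b12) / (w11 + w12)
    var_rch1 = 1.0 / (w11 + w12)   # variance of the combination under independence
    return b_rch1, var_rch1

# Example: two hypothetical estimates of beta1 with their variances
b, v = combine(0.50, 0.04, 0.62, 0.12)
```

The combined estimate always lies between the two inputs and is pulled toward the more precise one.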

The asymptotically minimum variance weights and their derivation, as well as the derivation of the formula for the variance of *β̂*_{RCH,1}, are given in Appendix 1.

*β̂*_{RCH,2}, the measurement-error-corrected estimator of the coefficients corresponding to **U**, has the form

*β̂*_{RCH,2} = *β̂**_{3} − *β̂*_{RCH,1}***γ̂***_{2}, (8)

where *β̂**_{3} denotes the uncorrected coefficients of **U** from the main study logistic regression. Its asymptotic variance is derived in Appendix 2, along with *Cov*(*β̂*_{RCH,1}, *β̂*_{RCH,2}). As is evident from (6), this estimator can only be used for models with scalar *x*. For multivariate **x** with heteroscedastic covariance for **x** | **X**, **U**, a term is added to the model for *E*(*Y* | **X**, **U**) equal to ½***β***_{1}′*h*(**X**, **U**)***β***_{1}*σ*^{2}, which is a complicated function of the *q* elements of ***β***_{1} that cannot be used to uniquely solve for the *q* elements of ***β***_{1}.

A result similar in form to (6) was given by Prentice (Prentice 1982) for the proportional hazards regression model on a vector **X**(*t*), in which one or more of the elements of **X**(*t*) are measured with error, and where the conditional distribution of **x**(*t*) given **X**(*t*) is multivariate normal with a linear mean in **X**(*t*) and variance *Σ*(**X**(*t*)). Hence, under the conditions specified by Prentice, *β̂*_{RCH} can be applied to Cox regression models with an arbitrary number of perfectly measured covariates and a single covariate measured with error, just as discussed above for logistic regression.

Under either small measurement error or a rare disease with normal errors for *x* | *X*, **U**, it is evident that if either *β*_{1} or *σ*^{2} is small, for scalar *x*, the standard regression calibration estimator will be approximately valid, even in the presence of heteroscedasticity of the variance of the measurement error model, since the third term in the exponent of (6) vanishes.

2.1. Special case when an 'alloyed' gold standard is available

Standard regression calibration is valid even if, instead of *x*, the gold standard, an unbiased imperfect gold standard,

*x̃*_{i} = *x*_{i} + *δ*_{i},

is observed in the validation study, where *δ*_{i} is a random, mean-zero error term. For example, doubly labeled water is considered an unbiased biomarker for total energy intake (Subar, Kipnis et al. 2003). However, doubly labeled water is measured with some random within-person, unbiased error (Preis, Spiegelman et al. 2010). Here as well, for scalar *x*, as long as *E*(*δ*_{i} | *X*_{i}, **U**_{i}) = 0 and *δ*_{i} is independent of *ε*_{i}, *γ* and *α*′ can be consistently estimated from fitting the model

*x̃*_{i} = *α*′ + *γ*_{1}*X*_{i} + ***γ***_{2}′**U**_{i} + *ε**_{i}

to the data. However, *Var*(*ε**_{i} | *X*_{i}, **U**_{i}) = *h*(*X*_{i}, **U**_{i})*σ*^{2} + *Var*(*δ*_{i}), where *ε**_{i} = *ε*_{i} + *δ*_{i}, so Step 2 in the procedure for obtaining *β̂*_{RCH,1} given above will not provide a valid estimate of *h*(*X*_{i}, **U**_{i})*σ*^{2}, the quantity needed for *β̂*_{RCH,1}. Without replicate data within subjects, followed by additional calculations not described here, *β̂*_{RCH} is not applicable when an alloyed gold standard is observed.

2.2. Special case of *β̂*_{RCH} when *h*(*X*) = *X*

Another important special case emerges when *h*(*X*) = *X*. Then, from equation (6) we obtain the corrected point estimate for this special case, where *β̂**_{1} is the regression coefficient for *X* obtained from fitting a logistic regression model of *Y* on *X* and **U**, *β̂*_{11} is as above in (7), and *β̂*_{RCH,2} is as above in (8). It is evident from (6) that the function *h*(*X*) = *X* + *b*, where *b* is a positive constant, can also be considered here. In this case, *b* is absorbed by the intercept and does not need to be considered any further in the measurement error correction procedure. The variance of *β̂*_{RCH,1} in this special case was again derived using the multivariate delta method (Bishop, Fienberg et al. 1975). It should be noted that when *β*_{1}*γ*_{1} < 0, *β̂*_{RCH,1} will provide a consistent estimate of *β*_{1} only under an additional condition given in Appendix 3. However, under (3), as long as *Corr*(*x*, *X*) is positive, as would usually be the case when *X* is a surrogate for *x*, *γ*_{1} will be positive, and whenever it is anticipated that *β*_{1} might be negative, *x* and *X* can be recoded to avoid this.
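To see why *β*_{1} remains identified when the *X* and *h*(*X*) terms merge, note that with *h*(*X*) = *X* the scalar exponent of (6) has slope *β*_{1}*γ*_{1} + ½*β*_{1}^{2}*σ*^{2} in *X*. The sketch below solves this quadratic for *β*_{1} given the uncorrected slope; it is a reconstruction consistent with the scalar form of (6), not the paper's exact equation, and the parameter values are made up:

```python
import math

def beta1_from_uncorrected(b_star, gamma1, sigma2):
    """Solve b_star = beta1*gamma1 + 0.5*beta1**2*sigma2 for beta1.

    Takes the root that reduces to the standard regression calibration
    estimator b_star/gamma1 as sigma2 -> 0 (assumes gamma1 > 0 and a
    non-negative discriminant, the consistency condition).
    """
    disc = gamma1**2 + 2.0 * sigma2 * b_star
    if disc < 0:
        raise ValueError("no real solution: consistency condition fails")
    return (-gamma1 + math.sqrt(disc)) / sigma2

# Round trip: with beta1 = 0.5, gamma1 = 0.8, sigma2 = 0.4, the induced
# uncorrected slope is 0.5*0.8 + 0.5*0.25*0.4 = 0.45
b_star = 0.5 * 0.8 + 0.5 * 0.5**2 * 0.4
beta1 = beta1_from_uncorrected(b_star, 0.8, 0.4)
```

In this example the standard regression calibration estimate, *b**/γ*_{1} = 0.5625, over-estimates the true *β*_{1} = 0.5, matching the behavior described above for *h*(*X*) = *X*.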

2.3. Application to the classical measurement error model

There are an important set of problems where *x* is unobservable, i.e. no "gold standard" exists. In some of these cases, the measurement error model

*X*_{ij} = *x*_{i} + *ε*_{ij}, *E*(*ε*_{ij}) = 0, *ε*_{ij} ~ *N*(0, *σ*^{2}), *j* = 1,…, *n*_{2i}, (10)

is considered reasonable, where *n*_{2i} is the number of replicates for subject *i*. Examples of data which are believed to follow this model include blood pressure, serum biomarkers such as cholesterol and its subfractions, hormones, and vitamin concentrations. When model (10) holds, Rosner *et al.*'s (1992) version of the regression calibration method applies, in a procedure similar to that described earlier. In the univariate case, estimates of the reliability coefficient, *Var*(*x*)/*Var*(*X*), and its variance are substituted for *γ̂*_{1} and its variance in equation (3). Multivariate generalizations have been given (Rosner, Spiegelman et al. 1992). When the third component of (10) does not hold, i.e. when *Var*(*ε*_{ij} | *x*_{i}) = *h*(*x*_{i})*σ*^{2}, an extension of regression calibration to accommodate this expansion of the model is needed. We rederived the likelihood for scalar *x* and no additional covariates **U** to attempt to identify an appropriate estimator in this simple case, now assuming that *X*_{ij} | *x*_{i} ~ *N*(*x*_{i}, *h*(*x*_{i})*σ*^{2}), *i* = 1,…, *n*_{2}, *j* = 1,…, *n*_{R}, where *n*_{R} is the number of replicates for each subject *i*, and obtained expressions for *f*(*x*_{i}, *X*_{ij}) and *f*(*x*_{i} | *X*_{ij}). Neither *f*(*x*_{i}, *X*_{ij}) nor *f*(*x*_{i} | *X*_{ij}) has a Gaussian structure, and the method of completing the square to obtain a closed-form solution for *f*(*Y*_{i} | *X*_{i}) used to derive (6) is therefore not applicable. It is unlikely that a closed form for *f*(*Y*_{i} | *X*_{i}) exists when *h*(*x*_{i}) is non-constant and *ε* is Gaussian. If functions *h*(*x*_{i}) are found which fit the data at hand, it is unlikely that the resulting expression for *f*(*Y*_{i} | *X*_{i}) will be of a form such that the link function, *g*[*E*(*Y*_{i} | *X*_{i})], can be found with *g* equal to the logistic or log transformations. Without these features, even *β̂*_{RCH,1}, the scalar version of *β̂*_{RCH}, does not exist.

2.4. Iterative methods

Iterative methods can also be applied to the problem of heteroscedastic measurement error in regression covariates. These methods will typically have the advantage of relaxing some of the assumptions required by closed-form methods, such as (1), (3) and (4), but will have the disadvantage of computational complexity, which is a barrier to use in applications. In a main study/external validation study design for a binomial outcome and a Gaussian measurement error model, the log likelihood of the data is equal to

*l*(*β*, *α*′, **γ**, *σ*^{2}) = Σ_{i=1}^{n_{1}} *log* *f*_{3}(*Y*_{i} | *X*_{i}, **U**_{i}; *β*, *α*′, **γ**, *σ*^{2}) + Σ_{i=n_{1}+1}^{n_{1}+n_{2}} *log* *f*_{2}(*x*_{i} | *X*_{i}, **U**_{i}; *α*′, **γ**, *σ*^{2}).

We assume here that *f*_{2}(*x* | *X*, **U**; *α*′, **γ**, *σ*^{2}) is the normal density with mean given by (3) and variance given by (4), and the probability that *Y*_{i} = 1 given (*X*_{i}, **U**_{i}) is

*f*_{3}(1 | *X*_{i}, **U**_{i}; *β*, *α*′, **γ**, *σ*^{2}) = ∫ *Pr*(*Y*_{i} = 1 | *x*, **U**_{i}; *β*) *f*_{2}(*x* | *X*_{i}, **U**_{i}; *α*′, **γ**, *σ*^{2}) *dx*.

The numerical method given by Crouch and Spiegelman (1990) can be used to evaluate *f*_{3}(*Y*_{i} | *X*_{i}, **U**_{i}; *β*, *α*′, **γ**, *σ*^{2}), or code could be developed which makes use of standard statistical software through the non-linear optimizer that is provided. Generalizing Kuha's derivation (Kuha 1994) to the heteroscedastic measurement error variance case, by using a second-order Taylor series expansion of *logit*[*Pr*(*Y* = 1)] with respect to *β*_{1} around *β*_{1} = 0, under the rare disease assumption and with *x* | *X*, **U** normally distributed with linear mean and variance, *f*_{3}(*Y*_{i} | *X*_{i}, **U**_{i}; *β*, *α*′, **γ**, *σ*^{2}) can be approximated as in (6). Likelihoods based on both the exact and approximate expressions for *f*_{3}(*Y*_{i} | *X*_{i}, **U**_{i}; *β*, *α*′, **γ**, *σ*^{2}) given above, yielding *β̂*_{ML} and *β̂*_{aML} respectively, were fit to the data in Section 4 and studied by simulation in Section 5.
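The integral defining *f*_{3} (a logistic probability averaged over the normal density of *x* | *X*, **U**) has no closed form, but it is cheap to evaluate numerically; the Crouch–Spiegelman algorithm is purpose-built for it. As a simpler illustrative alternative, the sketch below uses Gauss–Hermite quadrature; all parameter values are made up, and *mu* and *var* stand in for the fitted mean (3) and a heteroscedastic variance such as *h*(*X*, **U**)*σ*^{2}:

```python
import numpy as np

def f3(y, mu, var, beta0, beta1, n_nodes=40):
    """Approximate f3(y | X, U) = E[Pr(Y = y | x)] over x ~ N(mu, var),
    with Pr(Y = 1 | x) logistic in x, via Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    x = mu + np.sqrt(2.0 * var) * t              # change of variables for N(mu, var)
    p1 = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    p = p1 if y == 1 else 1.0 - p1
    return float(np.sum(w * p) / np.sqrt(np.pi))

# A rare-disease setting (large negative intercept)
p1 = f3(1, mu=0.3, var=0.5, beta0=-4.0, beta1=0.8)
p0 = f3(0, mu=0.3, var=0.5, beta0=-4.0, beta1=0.8)
```

Wrapping this evaluation in a non-linear optimizer over (*β*, *α*′, **γ**, *σ*^{2}) gives one route to the exact maximum likelihood fit described above.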

A consistent, semi-parametric efficient estimator was proposed by Robins, Hsieh et al. (1995), where Begun, Hall et al. (1983) defined the semi-parametric efficiency bound as the smallest possible variance obtained by any estimator which is consistent for *β* over all possible measurement error models for *x* | *X*, **U**. Because the model for *x* | *X*, **U** is unknown *a priori* and validation study sample sizes are often small, methods that are robust to mis-specification of the model for *x* | *X*, **U** are desirable. The semi-parametric fully efficient estimator of this class requires a computationally cumbersome non-parametric fit of the density of *x* | *X*, **U**; instead, we investigated the semi-parametric locally efficient version, which is consistent even when the density of *x* | *X*, **U** is mis-specified, and uses a parametric fit for *x* | *X*, **U**. As this parametric density, fit and empirically verified in the validation study, approaches the true density for *x* | *X*, **U**, full semi-parametric efficiency is approached. Details on this estimator can be found in Appendix 1 of Spiegelman and Casella (1997), where here *f*_{2}(*x* | *X*, **U**; *α*′, **γ**, *σ*^{2}) was taken to be the normal density with mean given by (3) and variance given by (4), and *α*′, **γ**, and *σ*^{2} are estimated in the validation study, as usual. This estimator, *β̂*_{SPLE}, was fit to the data in Section 4 and studied by simulation in Section 5 along with the others.