Int J Biostat. 2009 January 1; 5(1): 12.
Published online 2009 April 2. doi:  10.2202/1557-4679.1143
PMCID: PMC2743435

Regression Calibration for Dichotomized Mismeasured Predictors*


Epidemiologic research focuses on estimating exposure-disease associations. In some applications the exposure may be dichotomized, for instance when threshold levels of the exposure are of primary public health interest (e.g., consuming 5 or more fruits and vegetables per day may reduce cancer risk). Errors in exposure variables are known to yield biased regression coefficients in exposure-disease models. Methods for bias-correction with continuous mismeasured exposures have been extensively discussed, and are often based on validation substudies, where the “true” and imprecise exposures are observed on a small subsample. In this paper, we focus on biases associated with dichotomization of a mismeasured continuous exposure. The amount of bias, in relation to measurement error in the imprecise continuous predictor and the choice of dichotomization cutpoint, is discussed. Measurement error correction via regression calibration is developed for this scenario, and compared to naïvely using the dichotomized mismeasured predictor in linear exposure-disease models. Properties of the measurement error correction method (i.e., bias, mean-squared error) are assessed via simulations.

1. Introduction

It is common in epidemiologic studies of exposure-disease associations, to transform a continuous exposure to a categorical one using cutpoints. Dichotomizing a continuous exposure variable can simplify interpretation of exposure-disease risks and inform public health decisions regarding associations between modifiable lifestyle factors, such as diet/body weight, and disease outcomes. For instance, it is believed that consuming ≥ 5 fruits and vegetables per day confers protection against developing cancer (American Cancer Society, 2007). Similarly, being obese with body mass index (BMI) > 30 is associated with increased risk of diabetes and cardiovascular disease (American Heart Association, 2005). The merits and disadvantages of categorizing a continuous exposure variable have been a subject of vigorous debate in the statistical literature. Much has been written about the dangers of dichotomization, including loss of information and power, bias due to data-driven choices for optimal cutpoints, and model misspecification, particularly model selection, since the model with dichotomization may be better specified than the model with a continuous effect or vice versa (Altman, 1998; Royston et al., 2004). In this article we do not expound on this topic, but instead adopt the pragmatic view that categorizing continuous variables using cutpoints is a fact of life in public health research. We focus on the issue of measurement error in the exposure and its impact on dichotomization.

Using a mismeasured continuous predictor can result in biased estimates of exposure-disease risks (Fuller, 1987; Carroll et al., 2006). This issue is well-documented in nutrition studies when dietary intake is measured using self-report instruments such as food frequency questionnaires (FFQ). Self-report dietary intake assessment tools, although simple and inexpensive to administer, are subject to recall bias and hence often lead to attenuated estimates of diet-disease associations (Carroll et al., 2006; Kipnis et al., 2003). More accurate methods for assessing dietary intake, such as biomarkers, may be available, but these are usually expensive to obtain in large epidemiologic studies. For instance, protein intake can be accurately quantified by measuring urinary nitrogen (Carroll et al., 2006; Kipnis et al., 2003). A standard approach for adjusting for exposure mismeasurement is to conduct a validation substudy in which a “gold standard” (e.g., urinary nitrogen) and surrogate (e.g., protein intake reported on a FFQ) are obtained on a subsample of participants (Kipnis et al., 2003). The “gold standard” is used to calibrate the error-prone surrogate, and this “calibrated” version of the surrogate is then used to assess exposure-disease associations in the larger sample. This method, referred to as “regression calibration”, has been extensively studied and is known to perform very well. In fact, assuming models are correctly specified, regression calibration completely corrects for bias in the linear regression setting.

When a continuous surrogate predictor is transformed to a categorical predictor, biased estimates of exposure-disease risk still obtain. Gustafson and Le (2002) discuss the magnitude of bias when W is an unbiased surrogate for X (i.e., EW = EX), and show that it depends on the amount of measurement error and on the choice of threshold of the dichotomization. In this article, we extend Gustafson and Le’s work in three ways. First, we allow W to be a biased estimate of X and compute the resulting bias in exposure-disease estimates when W is naïvely utilized in the model instead of X. Second, we extend regression calibration to the setting of dichotomized continuous predictors, and demonstrate analytically scenarios when using this “calibrated and dichotomized” predictor reduces bias compared to the “naïve, dichotomized” predictor. We also assess the performance of the regression calibration correction for dichotomized predictors via simulations. Third, we discuss bias for estimated regression coefficients of an additional covariate, Z, which is measured without error.

2. Background and Notation

In this article, X will denote the “true” exposure, W will denote a surrogate for X, Z will be a continuous covariate measured without error, and Y will be a continuous outcome variable. We will further assume that

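\[
(X, Z) \sim N(0, \Sigma), \qquad
W = \alpha_0 + \alpha_1 X + u, \quad u \sim N(0, \sigma_u^2), \qquad
Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon, \quad \varepsilon \sim N(0, \sigma_\varepsilon^2)
\tag{2.1}
\]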

In the above, Σ is the variance-covariance matrix of (X, Z) but we assume without loss of generality that EX = EZ = 0, var(X) = var(Z) = 1 and cov(X, Z) = cor(X, Z) = ρ (with var, cov, and cor denoting variance, covariance, and correlation respectively). In addition, true exposure X is assumed independent of errors u and ε, and the errors are assumed independent of each other giving “non-differential” measurement error (Carroll et al., 2006). Further, α0 and α1 denote the bias in W as a surrogate for X.

The main objective is to quantify the association between Y and Xb, where Xb = I(X > c) represents a binary transformation of X for a given cutpoint c. Here I(·) is the indicator function. If Y is linearly associated with X and Z as in Equation (2.1) above, then Xb, Z, and Y are also linearly related via (Gustafson and Le, 2002)



Using model (2.1), it is easy to see that cov(X, Xb) = E(X Xb) = ∫ x I(x > c) φ(x) dx = φ(c). Similarly, cov(Xb, Z) = E(Xb Z) = ρφ(c) and var(Xb) = Φ(c)[1 − Φ(c)], where Φ(s) and φ(s) are the cumulative distribution function and density for a standard normal random variable S. Thus, we get:


where R(c) = Φ(c)(1 − Φ(c))/φ(c). The primary research objective is to estimate the parameter β1b, but we will also investigate the impact of measurement error in W on β2b, the regression coefficient for Z.
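For reference, the relations used in this section can be written out explicitly. The following is a sketch derived from Model (2.1) and the covariances quoted above, using the standard two-predictor least-squares projection with var(Z) = 1, cov(Y, Xb) = (β1 + β2ρ)φ(c) and cov(Y, Z) = β1ρ + β2. The grouping into three displays is an assumption and may not match the original layout of Equations (2.2)–(2.4) exactly.

```latex
\begin{align*}
E(Y \mid X_b, Z) &= \beta_{0b} + \beta_{1b} X_b + \beta_{2b} Z, \\[4pt]
\beta_{1b} &= \frac{\operatorname{cov}(Y, X_b) - \operatorname{cov}(Y, Z)\,\operatorname{cov}(X_b, Z)}
                   {\operatorname{var}(X_b) - \operatorname{cov}(X_b, Z)^2}
            = \frac{\beta_1 (1 - \rho^2)}{R(c) - \rho^2 \varphi(c)}, \\[4pt]
\beta_{2b} &= \frac{\operatorname{cov}(Y, Z)\,\operatorname{var}(X_b) - \operatorname{cov}(Y, X_b)\,\operatorname{cov}(X_b, Z)}
                   {\operatorname{var}(X_b) - \operatorname{cov}(X_b, Z)^2}
            = \frac{(\beta_1 \rho + \beta_2)\, R(c) - (\beta_1 + \beta_2 \rho)\, \rho\, \varphi(c)}
                   {R(c) - \rho^2 \varphi(c)},
\end{align*}
% with R(c) = \Phi(c)\{1 - \Phi(c)\} / \varphi(c).
```

As a quick check, setting ρ = 0 gives β2b = β2, so the coefficient of Z is unaffected by dichotomizing X when X and Z are uncorrelated.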

3. Naïve versus Calibrated Surrogate

3.1. Bias from using the surrogate W

The true exposure X may be expensive or difficult to ascertain in practice and hence, one may instead measure the more easily obtained surrogate W. We could dichotomize the surrogate W as Wb = I(W > c), and assess the association between Y and Wb. Misclassification probabilities associated with using Wb instead of Xb can be calculated analytically under the set-up of Model (2.1). Gustafson and Le (2002) give expressions for the sensitivity and specificity of the dichotomized surrogate W when W is an unbiased estimate of X. We generalize to allow for biased surrogates W. In particular, the sensitivity under model (2.1) for Wb is given by


Furthermore, similar to Equations (2.3), the association between Y and (Wb, Z) can be derived as follows.



Again using Gaussian theory based on model (2.1), it is easy to see that cov(X, Wb) = E(X Wb) = E_W[E(X Wb | W)] = E_W[Wb E(X | W)] = α1λφ(λ(c − α0)), where λ = 1/√(α1² + σu²) is the reciprocal of the standard deviation of W. Similarly, cov(Wb, Z) = E(Wb Z) = ρα1λφ(λ(c − α0)) and var(Wb) = Φ(λ(c − α0))[1 − Φ(λ(c − α0))]. Thus, β1b* and β2b* from equations (3.2) can be written as


Using equations (2.4) and (3.3), the “attenuation factor” (i.e., multiplicative bias) when Wb is used instead of Xb is


We note that if we set α0 = 0 and α1 = 1, the second term in the denominator of expression (3.4) differs from the corresponding term in Gustafson and Le (2002), which has ρ²λφ(λc). We believe this is a typographical error in their article as this term stems from the square of the covariance between Wb and Z. Also, although estimating β2b is not of primary interest, comparing equations (2.4) and (3.3) we see that using Wb instead of Xb could result in biased estimates of the association between Y and Z, despite the fact that Z is measured without error.
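The naïve coefficients and the attenuation factor can be written out from the covariances stated above. This is a sketch obtained by applying the two-predictor least-squares projection to (Wb, Z) under Model (2.1), with t = λ(c − α0) introduced here as shorthand; it agrees with the limits discussed in the text but is a reconstruction rather than a verbatim copy of Equations (3.3) and (3.4).

```latex
\begin{align*}
\beta_{1b}^{*} &= \frac{\beta_1 (1 - \rho^2)\, \alpha_1 \lambda}
                      {R(t) - \rho^2 \alpha_1^2 \lambda^2 \varphi(t)}, \qquad
\beta_{2b}^{*} = \frac{(\beta_1 \rho + \beta_2)\, R(t) - (\beta_1 + \beta_2 \rho)\, \rho\, \alpha_1^2 \lambda^2 \varphi(t)}
                      {R(t) - \rho^2 \alpha_1^2 \lambda^2 \varphi(t)}, \\[4pt]
AF &= \frac{\beta_{1b}^{*}}{\beta_{1b}}
    = \frac{\alpha_1 \lambda \,\{ R(c) - \rho^2 \varphi(c) \}}
           {R(t) - \rho^2 \alpha_1^2 \lambda^2 \varphi(t)},
\qquad t = \lambda (c - \alpha_0), \quad \lambda = \frac{1}{\sqrt{\alpha_1^2 + \sigma_u^2}} .
\end{align*}
```

With α0 = 0 and α1 = 1 the second denominator term becomes ρ²λ²φ(λc), the square of cov(Wb, Z) divided by φ(λc), which is the form the paragraph above argues for.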

We explore the behavior of the multiplicative bias AF as each of the parameters σu², α0, α1, ρ and c is varied. Note that R(x) is symmetric and decreasing in |x| (Gustafson and Le, 2002). First, as expected, when the measurement error variance σu² increases, λ decreases to 0, and therefore the multiplicative bias, AF, approaches 0 (i.e., is worst) as shown in Figure 1. Similar behavior of AF is evident when |α1| decreases, i.e., when W is a poor surrogate for X (Figure 2). On the other hand, as |α1| increases to infinity, AF converges to ±(R(c) − ρ²φ(c))/(R(0) − ρ²φ(0)), so that multiplicative bias from the naïve estimator converges to a constant (Figure 2).

Figure 1:
Multiplicative bias of each estimator as measurement error SD, σu, varies; y-axis represents β1b*/β1b, the ratio of naïve (dashed black curve) or RC based (solid blue curve) regression coefficient to true coefficient; W = ...
Figure 2:
Multiplicative bias of each estimator as a1 varies; y-axis represents β1b*/β1b, the ratio of naïve (dashed black curve) or RC based (solid blue curve) regression coefficient to true coefficient; W = a0 + a1X + u with SD of measurement ...

Furthermore, because R(x) is decreasing in |x|, AF approaches infinity as the additive bias in W, |α0|, increases. Thus, additive bias in W can lead to an inflated estimate of β1b, rather than attenuation of effects, an arguably more pernicious effect of a poor surrogate W (Figures 1 and 2).

Finally, since for large |z|, R(z) ≈ 1/|z|, as the cutpoint |c| goes to infinity, AF converges to α1λ². In particular, when α1 = 1, AF converges to λ² as the cutpoint c moves away from 0, leading to more multiplicative bias when the cutpoint is far from the center of the true exposure distribution (Figures 3 and 4). The behavior of AF is more erratic as ρ increases from 0 to 1 (Figure 3) and also when W is a biased estimate of X, i.e., α0 ≠ 0 or α1 ≠ 1 (data not shown).
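The limiting behavior described above can be checked numerically. The sketch below assumes the attenuation factor takes the form AF = α1λ[R(c) − ρ²φ(c)] / [R(t) − ρ²α1²λ²φ(t)] with t = λ(c − α0), which is what the covariances in the text imply; the function names `R` and `naive_af` are illustrative and not from the paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi


def R(x):
    """R(x) = Phi(x) * (1 - Phi(x)) / phi(x), as defined in the text."""
    p = STD_NORMAL.cdf(x)
    return p * (1.0 - p) / STD_NORMAL.pdf(x)


def naive_af(alpha0, alpha1, sigma_u, rho, c):
    """Assumed form of the attenuation factor beta1b*/beta1b for the naive W_b."""
    lam = 1.0 / sqrt(alpha1 ** 2 + sigma_u ** 2)  # reciprocal of SD(W)
    t = lam * (c - alpha0)                        # standardized cutpoint for W
    num = alpha1 * lam * (R(c) - rho ** 2 * STD_NORMAL.pdf(c))
    den = R(t) - (rho * alpha1 * lam) ** 2 * STD_NORMAL.pdf(t)
    return num / den


if __name__ == "__main__":
    # AF shrinks toward 0 as the measurement error SD grows (alpha0=0, alpha1=1).
    for sigma_u in (0.5, 1.0, 2.0, 10.0):
        print("sigma_u", sigma_u, "AF", round(naive_af(0.0, 1.0, sigma_u, 0.3, 1.0), 3))
    # Additive bias in W can push AF above 1 (inflation rather than attenuation).
    print("alpha0=3:", round(naive_af(3.0, 1.0, 1.0, 0.0, 0.0), 3))
```

For very extreme arguments φ underflows to zero and `R` divides by zero, so the sketch is only meant for moderate parameter values.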

Figure 3:
Multiplicative bias of each estimator as cor(X, Z) = ρ varies; y-axis represents β1b*/β1b, the ratio of naïve (thin black curve) or RC based (thick blue curve) regression coefficient to true coefficient; W = a0 ...
Figure 4:
Multiplicative bias of each estimator as cutpoint c varies; y-axis represents β1b*/β1b, the ratio of naïve (thin black curve) or RC based (thick blue curve) regression coefficient to true coefficient; W = a0 + a1X + u with SD of ...

3.2. Regression Calibration and Bias from using Calibrated W

Regression Calibration for continuous mismeasured predictors

One of the most widely used methods to adjust for biased estimation due to mismeasured covariates is regression calibration (Carroll et al., 2006). We briefly outline this method below. As before, assume that true continuous X, its surrogate W, and a continuous, accurately measured covariate Z are jointly normally distributed, and that interest focuses on estimating the association between true X and outcome Y (continuous or discrete): E(Y |X, Z) = g(X, Z, β), for some hypothesized model g. It is also assumed that a validation substudy is conducted in which X and W are measured on a small subsample of participants. If the validation substudy is conducted on a subsample of the main study, we have an internal validation design and X, W, and Y are observed for participants in the validation substudy; if the substudy is not part of the main study, we have an external validation design, and Y is not observed for participants in the validation substudy. Note that Z is assumed to be measured on all participants, and hence is available for main study as well as validation substudy participants.

The regression calibration algorithm proceeds as follows: (i) fit a calibration model E(X|W, Z) for X, W, and Z in the validation substudy; (ii) using regression coefficients from the calibration model, obtain a predicted value Xi = Ê (X|Wi, Zi) for subject i who is not in the validation study. Replace Xi by Xi in the model E(Y |Xi, Zi) = g(Xi, Zi, β); (iii) estimate β by fitting the model g using standard methods; (iv) calculate standard errors of β by bootstrap or a sandwich estimator. With an internal validation design, where X is measured on a subsample of the main study population, the imputation step of substituting Xi for Xi would only be carried out for subjects not in the validation substudy (for the others we would just use X in the disease model).

The regression calibration algorithm is simple to implement but does require validation or replication data in order to model E(X|W, Z). It is easy to see that regression calibration gives the correct mean function in linear models E(Y |X, Z) = β0 + β1X + β2Z in the non-differential error setting (Carroll et al., 2006). Of course, in this article we are interested in a dichotomized version of X, Xb, so in the following sections we will describe an implementation of regression calibration for this scenario, and examine whether this method is less biased than naïvely substituting Wb for Xb in model (2.2).

Regression Calibration for dichotomized mismeasured predictors

In this article, our primary focus is estimating the association between dichotomized X (i.e., Xb) and Y in the linear regression setting. As shown in equation (3.4), dichotomizing the noisy surrogate W, denoted Wb, and using Wb instead of Xb leads to substantial bias. We propose the following as a possible extension of regression calibration to the case of dichotomized mismeasured continuous predictors.

  • Fit a calibration model for X given (W, Z), i.e., E(X|W, Z) = γ0 + γ1W + γ2Z. Let Xrc = γ̂0 + γ̂1W + γ̂2Z be the predicted value of X given W and Z using estimated regression coefficients γ̂0, γ̂1, and γ̂2 from the calibration model.
  • Define Xrcb = I(Xrc > c), a dichotomized version of Xrc.
  • For each subject i who is missing X, use Xrci = γ̂0 + γ̂1Wi + γ̂2Zi and replace Xbi with Xrcbi = I(Xrci > c).
  • Use least-squares regression to fit E(Y |Xrcb, Z).
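The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration for an external validation sample, not the authors' implementation: the helper names (`ols2`, `rc_dichotomized_fit`, `simulate`) and the demo parameter values are assumptions made here, and the least-squares fit uses the closed-form two-predictor solution in terms of sample covariances.

```python
import random


def _mean(v):
    return sum(v) / len(v)


def _cov(a, b):
    ma, mb = _mean(a), _mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def ols2(y, x1, x2):
    """Least-squares fit of y on an intercept, x1 and x2; returns (b0, b1, b2)."""
    s11, s22, s12 = _cov(x1, x1), _cov(x2, x2), _cov(x1, x2)
    sy1, sy2 = _cov(y, x1), _cov(y, x2)
    det = s11 * s22 - s12 ** 2  # zero if a predictor is constant or collinear
    b1 = (sy1 * s22 - sy2 * s12) / det
    b2 = (sy2 * s11 - sy1 * s12) / det
    return _mean(y) - b1 * _mean(x1) - b2 * _mean(x2), b1, b2


def rc_dichotomized_fit(val_x, val_w, val_z, w, z, y, c):
    """Calibrate on (val_x, val_w, val_z), then regress y on I(Xrc > c) and z."""
    g0, g1, g2 = ols2(val_x, val_w, val_z)             # calibration model
    x_rc = [g0 + g1 * wi + g2 * zi for wi, zi in zip(w, z)]
    x_rcb = [1.0 if v > c else 0.0 for v in x_rc]      # dichotomize the prediction
    return ols2(y, x_rcb, z)                           # outcome model


def simulate(n, alpha0, alpha1, sigma_u, rho, rng):
    """Draw (X, Z, W, Y) with Y = X + Z + eps, sd(eps) = 0.1 (demo values)."""
    xs, zs, ws, ys = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        xs.append(x)
        zs.append(z)
        ws.append(alpha0 + alpha1 * x + rng.gauss(0, sigma_u))
        ys.append(x + z + rng.gauss(0, 0.1))
    return xs, zs, ws, ys


if __name__ == "__main__":
    rng = random.Random(1)
    c = 1.0
    vx, vz, vw, _ = simulate(250, 2.0, 1.0, 1.0, 0.3, rng)  # external validation
    x, z, w, y = simulate(500, 2.0, 1.0, 1.0, 0.3, rng)     # main study
    xb = [1.0 if v > c else 0.0 for v in x]
    wb = [1.0 if v > c else 0.0 for v in w]
    print("gold standard beta1b:", ols2(y, xb, z)[1])
    print("naive beta1b:        ", ols2(y, wb, z)[1])
    print("calibrated beta1b:   ", rc_dichotomized_fit(vx, vw, vz, w, z, y, c)[1])
```

If the cutpoint is so extreme that every subject falls on one side of it, the dichotomized predictor is constant and `ols2` fails with a division by zero; a real analysis would need to guard against that case.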

Clearly, Xrcb is an approximation of Xb, hence will still yield biased estimates of regression coefficients. The bias associated with using Xrcb instead of Xb in model (2.2) can be derived in a manner similar to equations (3.3); additional details are provided in the Appendix. In particular, suppose $E(Y \mid X_{rcb}, Z) = \beta_{0b}^{rc} + \beta_{1b}^{rc} X_{rcb} + \beta_{2b}^{rc} Z$, then


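where R(x) = Φ(x)(1 − Φ(x))/φ(x) and σrc is the standard deviation of Xrc, with

\[
\sigma_{rc} = \sqrt{\frac{\alpha_1^2 (1 - \rho^2) + \rho^2 \sigma_u^2}{\alpha_1^2 (1 - \rho^2) + \sigma_u^2}} .
\]

We will refer to AFrc(α1, σu², ρ, c) as AFrc and not include the parameters explicitly unless necessary to make a point. Finally, for the sake of completeness we also provide the regression coefficient for Z when Xrcb is used instead of Xb, namely,

\[
\beta_{2b}^{rc} = \frac{(\beta_1 \rho + \beta_2)\, R(c/\sigma_{rc}) - \left(\beta_1 \sigma_{rc} + \beta_2 \rho/\sigma_{rc}\right) (\rho/\sigma_{rc})\, \varphi(c/\sigma_{rc})}
                       {R(c/\sigma_{rc}) - (\rho^2/\sigma_{rc}^2)\, \varphi(c/\sigma_{rc})} .
\]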
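The coefficient of Xrcb and the corresponding attenuation factor can be worked out along the same lines. The sketch below assumes the linear calibration model of Section 3.2, so that cov(X, Xrc) = var(Xrc) = σrc² and cov(Z, Xrc) = ρ, and applies the Gaussian identity cov(A, I(B > c)) = cov(A, B) φ(c/σ_B)/σ_B for mean-zero B. It is a reconstruction from these definitions, consistent with the limits quoted in Section 3.3, and is not copied from Equation (3.5).

```latex
\begin{align*}
\operatorname{cov}(X, X_{rcb}) &= \sigma_{rc}\, \varphi(c/\sigma_{rc}), \qquad
\operatorname{cov}(Z, X_{rcb}) = \frac{\rho}{\sigma_{rc}}\, \varphi(c/\sigma_{rc}), \qquad
\operatorname{var}(X_{rcb}) = \Phi(c/\sigma_{rc}) \{ 1 - \Phi(c/\sigma_{rc}) \}, \\[4pt]
\beta_{1b}^{rc} &= \frac{\beta_1 \left( \sigma_{rc}^2 - \rho^2 \right)}
                        {\sigma_{rc} \left\{ R(c/\sigma_{rc}) - (\rho^2/\sigma_{rc}^2)\, \varphi(c/\sigma_{rc}) \right\}}, \\[4pt]
AF_{rc} &= \frac{\beta_{1b}^{rc}}{\beta_{1b}}
         = \frac{\left( \sigma_{rc}^2 - \rho^2 \right) \{ R(c) - \rho^2 \varphi(c) \}}
                {\sigma_{rc}\, (1 - \rho^2) \left\{ R(c/\sigma_{rc}) - (\rho^2/\sigma_{rc}^2)\, \varphi(c/\sigma_{rc}) \right\}} .
\end{align*}
```

Two checks: σrc → 1 gives AFrc → 1, and σrc → |ρ| makes the factor σrc² − ρ² vanish so that AFrc → 0. Neither expression involves α0.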

3.3. Comparing the Naïve versus Calibrated W

In this section, we compare the multiplicative bias when either the naïve or the regression calibration estimator is used to estimate β1b. Figures 1–4 depict the regression calibration bias, AFrc, for different values of σu², α1, ρ and c. First, it is easy to see that when either σu increases to infinity or α1 decreases to 0, σrc converges to |ρ|, and hence AFrc converges to 0. Thus, when W has large measurement error (i.e., σu is large), or is a poor surrogate of X (i.e., α1 is small), the regression calibration estimator has large multiplicative bias, although as seen in Figures 1 and 2, this bias is often less than that for the naïve estimator. In addition, AFrc converges to 1 as |α1| → ∞, so that as |α1| increases, the regression calibration estimator becomes unbiased, unlike the naïve estimator, which is still biased (Figure 2). Finally, a further advantage of regression calibration is that AFrc is independent of α0, hence using Xrcb rather than Wb will mitigate the effects of additive bias in W.

The behavior of AFrc as the cutpoint or the correlation between X and Z varies is more complicated. As ρ² approaches 1, σrc converges to |ρ|, and therefore AFrc converges to 0, so that the regression calibration estimator is severely biased when true exposure and an accurately measured confounder are highly correlated (Figure 3). The multiplicative bias for the naïve estimator, on the other hand, converges to a non-zero constant for α1 ≠ 0. However, for values of ρ not too large, AFrc is larger than AF, suggesting that the regression calibration estimator is less biased than the naïve estimator (Figures 3 and 4). Further, for these “not-too-large” values of ρ, the regression calibration estimator performs better as the cutpoint moves away from 0 (the center of the true exposure distribution), whereas the naïve estimator performs worse. Interestingly, AFrc ≤ 1, so the multiplicative bias with regression calibration is always in the direction of attenuation rather than inflation of effects.
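The limits quoted in this section can be explored numerically. The sketch below assumes AFrc = (σrc² − ρ²)[R(c) − ρ²φ(c)] / {σrc(1 − ρ²)[R(c/σrc) − (ρ²/σrc²)φ(c/σrc)]}, a form derived here from the definitions of σrc and R rather than taken from the paper's display; the names `sigma_rc` and `rc_af` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def R(x):
    """R(x) = Phi(x) * (1 - Phi(x)) / phi(x)."""
    p = STD_NORMAL.cdf(x)
    return p * (1.0 - p) / STD_NORMAL.pdf(x)


def sigma_rc(alpha1, sigma_u, rho):
    """Standard deviation of the calibrated exposure Xrc = E(X | W, Z)."""
    a = alpha1 ** 2 * (1.0 - rho ** 2)
    return sqrt((a + rho ** 2 * sigma_u ** 2) / (a + sigma_u ** 2))


def rc_af(alpha1, sigma_u, rho, c):
    """Assumed form of the attenuation factor for the calibrated, dichotomized X."""
    s = sigma_rc(alpha1, sigma_u, rho)
    t = c / s                                    # standardized cutpoint for Xrc
    num = (s ** 2 - rho ** 2) / s * (R(c) - rho ** 2 * STD_NORMAL.pdf(c))
    den = (1.0 - rho ** 2) * (R(t) - (rho ** 2 / s ** 2) * STD_NORMAL.pdf(t))
    return num / den


if __name__ == "__main__":
    # No alpha0 argument is needed: additive bias in W does not enter AFrc.
    for alpha1 in (0.5, 1.0, 10.0):
        for c in (0.0, 1.0, 2.0):
            print("alpha1", alpha1, "c", c, "AFrc", round(rc_af(alpha1, 1.0, 0.3, c), 3))
```

The function is undefined when σrc = 0 (for example ρ = 0 with α1 = 0), since the calibrated exposure is then constant and cannot be dichotomized meaningfully.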

4. Simulations

4.1. Simulation Characteristics

To further examine the behavior of the naïve and regression calibration approaches in finite sampling situations, we conducted a simulation study. We generated 1000 datasets each of size 500 conforming to Model (2.1). The true exposure X and potentially confounding covariate Z were simulated from a bivariate Gaussian distribution with mean (0, 0), var(X) = var(Z) = 1, and cov(X, Z) = cor(X, Z) = ρ. Simulated datasets were generated for three values of ρ, namely, ρ = 0, 0.3, or 0.7. For each X and Z, the outcome was generated as Y = X + Z + ε where ε was drawn from a mean-zero Gaussian distribution with standard deviation equal to 0.1.

In the previous sections we have already noted (see Equations (3.4) and (3.5) and Figures 1 and 2) that the multiplicative bias AF increases as the additive bias in W, α0, increases, whereas the regression calibration estimator is independent of α0. Hence, for the simulations, we assumed that α0 = 0, thus removing this source of error from W. Accordingly, the surrogate W was generated via Model (2.1) as W = α1X + u. We compared the naïve and regression calibration estimators over a range of values for the measurement error parameters (α1, σu). In particular, we varied the multiplicative bias in W by considering α1 = 0.5, 1, or 10. The measurement error u was assumed independent of X with standard deviation σu equal to 1 or 2 in the simulations, corresponding to correlations between W and X ranging from 0.24 to 0.99 across the above choices of α1 and σu.
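The quoted correlations follow from Model (2.1) with var(X) = 1 and α0 = 0, where cor(W, X) = α1/√(α1² + σu²). A short sketch (the function name `corr_w_x` is illustrative) tabulates this over the simulation grid:

```python
from math import sqrt


def corr_w_x(alpha1, sigma_u):
    """cor(W, X) for W = alpha1 * X + u, var(X) = 1, sd(u) = sigma_u."""
    return alpha1 / sqrt(alpha1 ** 2 + sigma_u ** 2)


# Correlation for every (alpha1, sigma_u) pair used in the simulations.
grid = {(a1, su): corr_w_x(a1, su) for a1 in (0.5, 1.0, 10.0) for su in (1.0, 2.0)}

if __name__ == "__main__":
    for (a1, su), r in sorted(grid.items()):
        print(f"alpha1={a1:<5} sigma_u={su:<4} cor(W, X)={r:.3f}")
```

The smallest value arises at α1 = 0.5, σu = 2 and the largest at α1 = 10, σu = 1.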

Values chosen for α1 and σu were guided by nutrition studies, where food frequency questionnaires (FFQ) are often used to estimate usual intake of various food components. FFQs are inexpensive and easy to administer, but are known to be subject to large measurement errors (Kipnis et al., 2003; Day et al., 2001). In some instances less error-prone measures using biomarkers, such as doubly labeled water for energy intake or urinary nitrogen for protein intake, are used to calibrate the surrogate FFQ in validation substudies. Scaling biases in W (i.e., α1 in our models) have been reported to range from 0.24 to 1.2 when FFQs are calibrated against a better measure such as a biomarker or a more accurate self-report assessment method (Kipnis et al., 2003; Day et al., 2001). Also, correlations between FFQ-based dietary intake and “true” intake modeled using biomarkers have been reported to range between 0.1 and 0.45 (Kipnis et al., 2003; Natarajan et al., 2006). Thus, measurement error parameter values considered in our simulations include levels commonly encountered in nutrition studies.

Finally, cutpoint values of 0, 1, and 2 were used. For each cutpoint, the regression coefficients β1b and β2b (Equation (2.4)) derived from the regression of Y on Xb = I(X > c) and Z, respectively, were computed. These are the true values of the parameters of interest.
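As an illustration, the true values can be computed directly once a closed form for β1b and β2b is fixed. The sketch below assumes β1b = β1(1 − ρ²)/[R(c) − ρ²φ(c)] and β2b = [(β1ρ + β2)R(c) − (β1 + β2ρ)ρφ(c)]/[R(c) − ρ²φ(c)], which is what the covariances of Section 2 imply, and uses β1 = β2 = 1 as in the simulated outcome model; the function name `true_coefficients` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def R(x):
    """R(x) = Phi(x) * (1 - Phi(x)) / phi(x)."""
    p = STD_NORMAL.cdf(x)
    return p * (1.0 - p) / STD_NORMAL.pdf(x)


def true_coefficients(beta1, beta2, rho, c):
    """Population coefficients of X_b = I(X > c) and Z under the assumed closed form."""
    phi_c = STD_NORMAL.pdf(c)
    den = R(c) - rho ** 2 * phi_c
    beta1b = beta1 * (1.0 - rho ** 2) / den
    beta2b = ((beta1 * rho + beta2) * R(c) - (beta1 + beta2 * rho) * rho * phi_c) / den
    return beta1b, beta2b


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.7):
        for c in (0.0, 1.0, 2.0):
            b1b, b2b = true_coefficients(1.0, 1.0, rho, c)
            print(f"rho={rho} c={c} beta1b={b1b:.3f} beta2b={b2b:.3f}")
```

With ρ = 0 and c = 0 the first coefficient equals E(X | X > 0) − E(X | X < 0) = 4φ(0), the difference in mean exposure between the two halves of the distribution.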

For each of the 54 combinations of α1, σu, ρ and cutpoint, 1000 datasets each of size 500 were generated according to Model (2.1). For every dataset, 4 methods were used to estimate β1b and β2b: (i) a “gold standard” estimate corresponding to fitting a least-squares regression of Y on Xb and Z; (ii) a naïve estimate corresponding to fitting a least-squares regression of Y on Wb and Z; (iii) an external design regression calibration estimate; and (iv) an internal design regression calibration estimate. The gold standard estimate would clearly not be available in practice, but is provided here for assessing the performance of the other three methods.

For the internal design, a 50% validation subset of 250 was randomly chosen from the dataset of 500. This subsample of 250 was the designated internal validation subsample used to calibrate W. A least-squares regression of X on W and Z was fitted in the validation subsample to obtain Xrc = E(X|W, Z). In the internal design, Xi was assumed to be missing for the 250 observations not in the validation substudy. For this set of 250 “missing” Xi’s, Xrci = E(X|Wi, Zi) was imputed by using the estimated regression coefficients for X given W and Z based on the validation subsample. For the external design, a separate external validation sample of size 250 was generated. Coefficients from the least-squares regression of X on W, Z in this validation sample were used to predict Xrci = E(Xi|Wi, Zi) for each subject i, thus imputing X for all 500 subjects (i.e., in this design it was assumed that X is missing for the entire sample). Finally, external and internal regression calibration estimates were obtained by fitting a least-squares regression of Y on Xrcb, where Xrcb = I(Xrc > c).
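The difference between the two designs lies only in which subjects receive an imputed exposure. A minimal sketch of the internal design follows, in which validation subjects keep their observed X and everyone else receives the calibrated prediction before dichotomizing. The function names and the boolean `in_validation` mask are assumptions introduced for illustration.

```python
def _mean(v):
    return sum(v) / len(v)


def _cov(a, b):
    ma, mb = _mean(a), _mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def ols2(y, x1, x2):
    """Least-squares fit of y on an intercept, x1 and x2; returns (b0, b1, b2)."""
    s11, s22, s12 = _cov(x1, x1), _cov(x2, x2), _cov(x1, x2)
    sy1, sy2 = _cov(y, x1), _cov(y, x2)
    det = s11 * s22 - s12 ** 2
    b1 = (sy1 * s22 - sy2 * s12) / det
    b2 = (sy2 * s11 - sy1 * s12) / det
    return _mean(y) - b1 * _mean(x1) - b2 * _mean(x2), b1, b2


def internal_design_exposure(x, w, z, in_validation, c):
    """Binary exposure: true X_b inside the validation subset, X_rcb elsewhere.

    x may hold placeholder values for subjects outside the validation subset;
    those entries are never read.
    """
    vx = [xi for xi, v in zip(x, in_validation) if v]
    vw = [wi for wi, v in zip(w, in_validation) if v]
    vz = [zi for zi, v in zip(z, in_validation) if v]
    g0, g1, g2 = ols2(vx, vw, vz)  # calibration model fitted on the subset only
    exposure = []
    for xi, wi, zi, v in zip(x, w, z, in_validation):
        value = xi if v else g0 + g1 * wi + g2 * zi
        exposure.append(1.0 if value > c else 0.0)
    return exposure
```

The returned indicator would then be used together with Z as predictors in the least-squares outcome model. For the external design the same calibration step is fitted on a separate sample and the prediction is applied to every main-study subject.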

For each of the 54 simulated scenarios, mean bias, standard deviation (SD), and root-mean-squared-error (RMSE) of the estimated β1b and β2b were calculated using the 1000 datasets.
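These three summaries are simple functions of the replicate estimates. A small sketch, assuming the SD is the sample standard deviation across replicates and the RMSE is taken around the true parameter value (the function name `summarize` is illustrative):

```python
from math import sqrt
from statistics import mean, stdev


def summarize(estimates, truth):
    """Return (mean bias, SD, RMSE) of replicate estimates of one parameter."""
    bias = mean(estimates) - truth
    sd = stdev(estimates)  # sample SD across replicates (n - 1 denominator)
    rmse = sqrt(mean([(e - truth) ** 2 for e in estimates]))
    return bias, sd, rmse
```

For a large number of replicates, RMSE² is approximately bias² + SD², which is why a method with somewhat larger SD can still have smaller RMSE when its bias is much lower.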

4.2. Simulation Results

Estimates of β1b

Bias, standard deviation (SD), and root mean-squared errors (RMSE) of estimates of β1b for each of the naïve, regression calibration external design, and regression calibration internal design methods are presented in Tables 1–3. The “gold standard” estimate where the true X is available on the entire dataset is also included in the tables to provide the best estimate possible given the sample size. The simulations were based on 1000 datasets of size 500 for each of 54 simulated scenarios corresponding to α1 = 0.5, 1, 10, measurement error SD σu = 1, 2, c = 0, 1, 2, and cor(X, Z) = 0, 0.3, 0.7. The results of the simulations for σu = 1 are summarized in three tables (Table 1 corresponds to α1 = 0.5, Table 2 corresponds to α1 = 1, Table 3 corresponds to α1 = 10). The results for σu = 2 were qualitatively similar to those for σu = 1, and hence are not presented.

Table 1:
Simulation Results
Table 2:
Simulation Results
Table 3:
Simulation Results

Comparing across the three tables, as expected for the naïve and external design regression calibration method, absolute value of bias and SD of the estimates of β1b increased as α1 decreased, with most bias when α1 = 0.5. The regression calibration method in the internal design also followed this pattern except when the cutpoint was 2, whence the scenario with α1 = 1 had worse bias than the α1 = 0.5 case. However, these differences were negligible and in most cases the regression calibration method for the internal design had much less bias and RMSE than the other methods.

For the naïve method, absolute value of bias increased as the cutpoint increased, except for ρ = 0.7, where no clear pattern emerged. Regression calibration for the external and internal design displayed the opposite behavior, with bias decreasing as the cutpoint increased except for ρ = 0.7, where the external design regression calibration estimator displayed slightly increasing bias as the cutpoint increased. Also for α1 = 0.5 and ρ = 0.3, the bias when using the external design increased as the cutpoint increased. For all the methods, the standard deviation of estimates increased as the cutpoint increased.

When comparing the performance of the methods to each other, regression calibration in the external design had less absolute bias than the naïve method in all simulated scenarios, except when ρ = 0.7, where regression calibration in the external design had 10 – 30% more bias than the naïve method. In all other scenarios, and particularly as the cutpoint and/or α1 increased, regression calibration estimates for the external design had markedly less bias than the naïve approach. However, the reduction in bias when using regression calibration was accompanied by an increase in variability of estimates when c = 1, 2, a known feature of regression calibration (Carroll et al., 2006). Nevertheless, the regression calibration estimator still displayed lower RMSE than the naïve approach in all scenarios except when ρ = 0.7.

The regression calibration internal design had less absolute bias and SD compared to the external design in all situations, but there were striking improvements when α1 = 0.5, 1. This was to be expected, since in the internal design the true X is available on 50% of the sample. Needless to say, regression calibration in the internal design displayed marked reductions in absolute bias compared to the naïve estimator. Furthermore, the SD of estimates using the regression calibration internal method were also smaller or comparable to the naïve approach except when the (i) cutpoint was non-zero and α1 = 10 or α1 = 1 or, (ii) when ρ = 0.7 and α1 = 0.5. However, the substantial bias reduction of regression calibration for the internal designs more than compensated for the corresponding small increase in variability of estimated parameters, as is evident from a comparison of the root mean-squared errors of the 3 methods: regression calibration internal design RMSEs were often smaller than RMSEs of the other methods by a factor of 50%.

Estimates of β2b

The primary parameter of interest is β1b, which measures the strength of association between the outcome Y and dichotomized predictor Xb = I(X > c). However, for the sake of completeness we have included bias and SD for the estimates of β2b, the regression coefficient for the association between Y and Z (Tables 1–3). We estimated β2b by naïvely substituting Wb for Xb in Model (2.2) and also by using regression calibration. First note that within each method, absolute bias increased as ρ increased. Also, within each method, as α1 increased, bias, SD, and RMSE decreased except for the naïve case where α1 = 1 had less bias than α1 = 10 for ρ = 0.3, 0.7, c ≠ 0.

As seen in Tables 1–3, the regression calibration estimates of β2b were less biased than those derived from using the surrogate W in most cases. The exceptions were: (i) regression calibration in the external design had more bias than the naïve estimator in the scenarios where the cutpoint was 2 and α1 = 0.5, 1 and ρ = 0.3, 0.7; (ii) regression calibration in the internal design had more bias than the naïve approach when the cutpoint was 2 and α1 = 1 and ρ = 0.3, 0.7. These results emphasize the point that even if Z is measured without error, using the surrogate Wb for Xb can seriously bias estimates of β2b. In most situations, regression calibration was able to reduce this bias as well.


The simulations suggest that regression calibration can substantially correct biases introduced by dichotomizing a mismeasured predictor. Particular gains were evident as the cutpoint increased. The internal design performed best, as would be expected. Although not presented here, the naïve method did not work at all when α0 = ±10 and α1 = 0.5, 1, whereas regression calibration was still able to estimate β1b and β2b in these situations. Thus regression calibration offers the most advantage when W has additive bias (i.e., α0 ≠ 0), when X and Z are not highly correlated, or when the cutpoint threshold, c, is far from the center of the X distribution.

5. Example

We applied the methods described here to a subsample of breast cancer survivors who participated in the Women’s Healthy Eating and Living (WHEL) Study, a dietary intervention trial (Pierce et al., 2002) aimed at reducing breast cancer recurrence. For the current analysis, we focused on a dataset of 1673 women who had blood pressure recordings and carotenoid data at study entry. Carotenoids are bioactive compounds provided mainly by vegetables and fruit in the diet (Rock, 1997), hypothesized to reduce cancer and cardiovascular disease risk. In the WHEL study, carotenoid intake was assessed in two ways: (i) based on self-reported fruit and vegetable intake obtained during multiple 24-hr recall telephone interviews of WHEL participants by trained nutritionists, (ii) by measuring plasma levels of circulating carotenoids in blood samples obtained from participants at clinic visits (Pierce et al., 2002). The plasma measure is a biomarker known to be well correlated with fruit and vegetable intake (Rock, 1997), and for this analysis serves as the “gold standard”. Self-reported carotenoid intake represents the surrogate measure for the true plasma value. The outcome of interest is systolic blood pressure.

Mean (SD) systolic blood pressure (SBP) in this population was 117.3 (16) mm Hg, and displayed a reasonably Gaussian distribution. We log-transformed and then centered and scaled the carotenoid measures to have zero mean and unit variance. Applying the notation of Model (2.1) to this example, Y represents SBP, X is the standardized (log) plasma carotenoid concentration level, and W is (log) self-reported carotenoid intake. We used a cutpoint of c = 1, so that Xb = I(X > 1). We also included body mass index (BMI = weight [in kg]/height² [in m²]), a factor known to be associated with blood pressure and dietary intake. BMI is usually measured accurately and plays the role of Z (Model (2.1)) in the analyses below. This example is meant to illustrate the analytic methods proposed and should not be interpreted as having particular public health import.
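The preprocessing described above (log-transform, center, scale, then dichotomize at c = 1) can be sketched as follows. The use of the sample standard deviation and the input values in the usage note are assumptions for illustration only; they are not WHEL data.

```python
from math import log
from statistics import mean, stdev


def standardize_log(values):
    """Log-transform positive values, then center and scale to unit sample SD."""
    logs = [log(v) for v in values]
    m, s = mean(logs), stdev(logs)
    return [(v - m) / s for v in logs]


def dichotomize(values, c):
    """Indicator I(value > c) for each entry."""
    return [1 if v > c else 0 for v in values]


def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by squared height in m."""
    return weight_kg / height_m ** 2
```

For instance, `dichotomize(standardize_log([1.0, 2.0, 4.0, 8.0]), 1.0)` flags only the largest of the four made-up values, since it is the only one more than one standard deviation above the mean on the log scale.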

We hypothesized a linear model and estimated regression coefficients for the association between Y (dependent variable) and each of (i) Xb, the gold standard, (ii) Wb, the naïve surrogate, (iii) Xrcbint, a regression calibration estimator for the internal design, and (iv) Xrcbext, a regression calibration estimator for the external design. The regression calibration estimators were based on a validation sample size of 418, randomly drawn from the full sample of 1673 (25% validation sample size). The estimated regression coefficients for the association between SBP (Y) and carotenoid intake and BMI in each model are presented in Table 4. The results show clearly that using the self-reported carotenoid intake (naïve Wb) instead of the plasma marker (Xb) leads to substantial bias in the regression coefficient (−3.08 versus −1.75). Both regression calibration estimators substantially corrected this bias. As expected, regression calibration methods have larger standard errors than the naïve method; yet the substantial reduction in bias when using regression calibration leads to a more accurate estimate of the association between carotenoid intake and SBP. Regression calibration estimates of the BMI regression coefficients were also less biased than those in the naïve model. This example further demonstrates the dangers of naïvely dichotomizing mismeasured predictors, and the important bias-correction that can be achieved by applying regression calibration to this scenario.

Table 4:
Illustration: Linear regression of systolic blood pressure on carotenoid intake and BMI
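
The comparison made in this illustration can be mimicked on synthetic data. The Python sketch below is not the analysis code used for the WHEL data. It assumes that Model (2.1) has the form Y = β0 + β1X + β2Z + ε and W = α0 + α1X + u, with X and Z standardized and cor(X, Z) = ρ, and it uses hypothetical parameter values (ρ = 0.3, α0 = 0.5, α1 = 0.8, σu = 0.6, β1 = −2). The calibration model for X given W and Z is fit on a random validation subsample, the predicted exposure is dichotomized at c = 1, and the resulting coefficient is compared with the gold-standard and naïve coefficients. For simplicity the predicted exposure is used for every subject, so the sketch does not distinguish the internal and external designs. The helper names ols and compare_estimators are illustrative only.

```python
import numpy as np


def ols(y, *cols):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def compare_estimators(n=50_000, n_val=12_500, c=1.0, seed=0):
    """Gold-standard, naive and regression-calibration coefficients on simulated data."""
    rng = np.random.default_rng(seed)
    # Hypothetical parameter values, chosen only for illustration.
    rho, alpha0, alpha1, sigma_u = 0.3, 0.5, 0.8, 0.6
    beta0, beta1, beta2 = 1.0, -2.0, 0.5

    z = rng.standard_normal(n)
    x = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)  # true exposure
    w = alpha0 + alpha1 * x + sigma_u * rng.standard_normal(n)    # biased surrogate
    y = beta0 + beta1 * x + beta2 * z + rng.standard_normal(n)    # outcome

    # Calibration model X ~ W + Z, fit on the validation subsample only.
    val = rng.choice(n, size=n_val, replace=False)
    g = ols(x[val], w[val], z[val])
    x_rc = g[0] + g[1] * w + g[2] * z  # predicted exposure for all subjects

    # Coefficient of the dichotomized exposure in a linear model adjusting for Z.
    gold = ols(y, (x > c).astype(float), z)[1]
    naive = ols(y, (w > c).astype(float), z)[1]
    rc = ols(y, (x_rc > c).astype(float), z)[1]
    return {"gold": gold, "naive": naive, "rc": rc}


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name:>5}: {value:.3f}")
```

Under these assumed values one would expect the regression calibration coefficient to fall between the naïve and gold-standard coefficients, which agrees with the observation that regression calibration reduces the bias without removing it entirely.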

6. Generalizations

Our development so far has focused on the multivariate normal situation where Y, X, W, Z are jointly normally distributed. We could relax this condition and instead assume that Y, X, W, Z jointly have finite first and second moments (Gustafson and Le, 2002). If we assume, as before, non-differential measurement error so that E(Y|X, Z, W) = E(Y|X, Z), and a linear outcome model E(Y|X, Z) = β0 + β1X + β2Z, then the parameters of interest, β1b and β1b*, can be obtained from equations (2.3) and (3.2). Hence the multiplicative bias when Wb is naïvely substituted for Xb can be calculated using these equations. Similarly, if we postulate a mean model for X given W and Z, we can obtain a predicted X from this model, and hence a regression calibration estimate of β1b can also be computed. Clearly, the ability of the regression calibration estimator to correct for the biases of the naïve approach depends on how good the calibration model is. If not modeled appropriately, non-Gaussian distributions for X or for the measurement error u could lead to poor regression calibration estimators. We do not pursue this further here, except to note that in most applications, skewness or heteroscedasticity in the X or W|X distributions can be corrected either with appropriate transformations of the variables (e.g., logarithm or square-root) or by using weighted least-squares. Of course, regression calibration will not perform well if the calibration model is grossly misspecified, and we leave to future work a further exploration of these issues.


Epidemiological studies often seek to estimate disease risk associated with categorized levels of a continuous exposure. When a mismeasured exposure is split into categories using pre-defined cutpoints, biased estimates of exposure-disease associations are obtained. Gustafson and Le (2002) discuss the bias in exposure-disease associations resulting from dichotomization of a mismeasured predictor, when the latter is an unbiased estimate of true exposure. We extend their results and derive formulae for the multiplicative bias in linear exposure-disease associations when the mismeasured exposure is a biased estimate of true exposure.

In nutritional epidemiology, surrogates such as food frequency questionnaires and food diaries are known to provide biased estimates of true dietary intake (Kipnis et al., 2003; Day et al., 2001). Furthermore, in nutrition studies, exposures are often dichotomized: for instance, it is believed that consuming ≥ 5 fruits/vegetables per day will confer protection against cancer (American Cancer Society, 2007). Thus, quantifying the bias associated with dichotomizing a mismeasured and likely biased estimate of true dietary intake is important for designing future diet-cancer studies.

Regression calibration is a well-known method (Carroll et al., 2006), used to adjust for measurement error in continuous exposures. In this article, we describe a form of regression calibration for dichotomized mismeasured predictors in the linear regression setting. We rigorously derive analytic formulae for the multiplicative bias for this regression calibration approach, and compare these to naïvely dichotomizing the mismeasured surrogate without calibration. We also compare the performance of the methods using simulations to mimic many real-world applications when true exposure may be available on a subsample of the study population.

Our findings suggest that in the linear regression setting, when additive bias (i.e., α0) in the surrogate is large, the naïve method fails and cannot be used to estimate exposure-disease associations. Regression calibration, though biased, does work in these situations. Further, regression calibration usually outperforms the naïve method when both methods can be applied. Scenarios when regression calibration has the most advantage are discussed.

Several caveats must be noted when applying the methods described here. We have considered a particular application of regression calibration which seems natural for the analyses we are interested in undertaking. There are other ways in which we could have implemented regression calibration, such as fitting a model for Xb given W and Z. However, since Xb is a 0–1 variable, imputing its value would require first fitting a binary regression model P(W, Z) = P(Xb = 1|W, Z) = E(Xb|W, Z). Then for a subject i, one could calculate P(Wi, Zi), and impute Xbi as follows: impute a value of 1 for Xbi if this probability is “large” (i.e., above some threshold p); else impute a value of 0 for Xbi. However, specifying the threshold value, p, is arbitrary and could lead to additional error. One way around this difficulty would be to impute Xbi according to a Bernoulli random variable with probability P(Wi, Zi). Further, for the measurement error model (2.1), the binary regression model would have the form E(Xb|W, Z) = P(X > c|W = w, Z) = Φ((w − α0 − α1c)/σu) (assuming α1 > 0 and omitting Z from the calculations to simplify notation), a probit model which is not easily fit in many standard software packages. Thus, though simple in principle, imputing the dichotomized variable Xb directly could actually compound errors and require specialized software, and hence we did not pursue this approach.

Dichotomizing a mismeasured predictor leads to differential error even when the error in the continuous predictor is non-differential (Gustafson and Le, 2002). The impact of this differential error is already evident in the erratic behavior of the multiplicative bias as ρ = cor(X, Z) varies. In the case of a continuous or binary predictor with non-differential error, as ρ increases to 1, AF decreases to 0, leading to substantial bias when X and Z are highly correlated (Gustafson and Le, 2002); whereas for the dichotomized mismeasured predictor considered in this article, the multiplicative bias as ρ increases (Equation (3.4)) depends also on the cutpoint c. Additionally, bias due to dichotomizing a mismeasured predictor depends on the distribution function of true exposure X, while a salient feature of non-differential error is that the bias only depends on the variance of X. We conjecture that it is due to this differential error that our regression calibration implementation still yields biased estimates of β1b. For instance, assuming Xb and Wb are binary variables, if Y and Wb were conditionally independent given Xb, then ignoring Z to simplify exposition, E(Y|Wb) = E(E(Y|Xb)|Wb) = β0b + β1b E(Xb|Wb). Thus, replacing Xb with E(Xb|Wb) would give the correct regression coefficient β1b. In the dichotomized situation, this approach is not appropriate as Y and Wb are not conditionally independent given Xb. It is possible that other types of regression calibration (e.g., imputing E(Xb|W, Z)) would lead to better estimators than the approach adopted by us. However, as mentioned in the previous paragraph, such methods would necessitate imputing binary data and would likely require specialized software programs. Our goal in this article was to describe the performance of a simple-to-implement and arguably natural application of regression calibration, in the setting of dichotomized mismeasured predictors. Comparing among competing estimators for this set-up is left to future studies.
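
The claim that dichotomization induces differential error can be illustrated numerically. The following Python sketch is an illustration under stated assumptions rather than part of the original analysis: it takes X ~ N(0, 1), an unbiased surrogate W = X + u (that is, α0 = 0 and α1 = 1) with u independent of Y given X, a hypothetical σu = 0.5, an outcome Y = X + ε, and a cutpoint c = 0. If Y and Wb were conditionally independent given Xb, the mean of Y among subjects with Xb = 1 would not depend on Wb. The function name conditional_means is illustrative.

```python
import numpy as np


def conditional_means(n=200_000, sigma_u=0.5, c=0.0, seed=2):
    """Mean of Y among subjects with Xb = 1, split by the value of Wb."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                # true exposure
    w = x + sigma_u * rng.standard_normal(n)  # surrogate; error independent of Y given X
    y = x + rng.standard_normal(n)            # outcome with beta0 = 0, beta1 = 1

    xb = x > c  # dichotomized true exposure
    wb = w > c  # dichotomized surrogate

    means = {}
    for wb_value in (False, True):
        mask = xb & (wb == wb_value)
        means[wb_value] = float(y[mask].mean())
    return means


if __name__ == "__main__":
    m = conditional_means()
    print(f"E(Y | Xb=1, Wb=0) ~ {m[False]:.3f}")
    print(f"E(Y | Xb=1, Wb=1) ~ {m[True]:.3f}")
```

In this set-up the two conditional means differ markedly, because subjects with Xb = 1 but Wb = 0 tend to have true exposure just above the cutpoint. Thus Wb carries information about Y beyond Xb, even though W carries none beyond X.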

It is important to note that this article focuses on disease outcomes that can be measured continuously, and such that exposure-disease associations can be modeled using a linear regression model. Clearly, other disease models (logistic, survival) should be considered and we leave this to future work. There are many approaches to measurement error adjustment including multiple imputation (Cole et al., 2006), simulation-extrapolation (Stefanski and Cook, 1995), Bayesian (Richardson and Gilks, 1993; Gustafson, 2003), and maximum likelihood methods (Spiegelman et al., 2000; Messer and Natarajan, 2008). These methods would likely be good competitors to regression calibration in this setting of dichotomization of mismeasured predictors. However, most of these alternative approaches require more extensive programming and are not as easily implemented as regression calibration.

In summary, in fields such as nutritional epidemiology, where mismeasured dietary intake is often dichotomized, we strongly recommend conducting calibration studies and using the method of regression calibration to adjust for measurement error. This approach can substantially reduce biases and lead to more accurate estimates of exposure-disease model coefficients.


Details on the derivation of the bias associated with using Xrcb instead of Xb are provided below. Under Model (2.1)


The multivariate normality of (X, W, Z) implies that (a) cov(X, Xrcb) = σrc φ(c/σrc), (b) cov(Xrcb, Z) = (ρ/σrc) φ(c/σrc), and (c) var(Xrcb) = Φ(c/σrc)(1 − Φ(c/σrc)). Substituting for each term in Equation (7.1), we get


The expression for AFrc (3.5) is then easily derived from Equations (2.4) and (7.2).
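
The three moment identities used in this derivation can be checked by simulation. The Python sketch below is a numerical illustration under assumptions, not part of the original derivation: it takes X and Z to be standardized with cor(X, Z) = ρ, assumes the error model W = α0 + α1X + u with hypothetical parameter values, approximates Xrc = E(X|W, Z) by a least-squares fit on a large sample, and takes σrc to be the standard deviation of Xrc. It then compares Monte Carlo estimates of cov(X, Xrcb), cov(Xrcb, Z) and var(Xrcb) with σrc φ(c/σrc), (ρ/σrc) φ(c/σrc) and Φ(c/σrc)(1 − Φ(c/σrc)). The function names are illustrative.

```python
import math

import numpy as np


def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def check_moments(n=400_000, rho=0.3, alpha0=0.5, alpha1=0.8,
                  sigma_u=0.6, c=1.0, seed=1):
    """Return (empirical, formula) triples for the three moments of Xrcb."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    x = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    w = alpha0 + alpha1 * x + sigma_u * rng.standard_normal(n)

    # Least-squares approximation to the calibration function E(X | W, Z).
    design = np.column_stack([np.ones(n), w, z])
    gamma, *_ = np.linalg.lstsq(design, x, rcond=None)
    x_rc = design @ gamma
    s_rc = float(x_rc.std())          # plays the role of sigma_rc
    x_rcb = (x_rc > c).astype(float)  # dichotomized calibrated exposure

    t = c / s_rc
    empirical = (
        float(np.cov(x, x_rcb)[0, 1]),
        float(np.cov(x_rcb, z)[0, 1]),
        float(x_rcb.var()),
    )
    formula = (
        s_rc * phi(t),
        (rho / s_rc) * phi(t),
        Phi(t) * (1.0 - Phi(t)),
    )
    return empirical, formula


if __name__ == "__main__":
    emp, form = check_moments()
    for label, e, f in zip(("cov(X, Xrcb)", "cov(Xrcb, Z)", "var(Xrcb)"), emp, form):
        print(f"{label:>13}: simulated {e:.4f}, formula {f:.4f}")
```

Agreement up to Monte Carlo error is what one would expect if the identities hold under the stated standardization of X and Z.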


*This research was supported in part by NIH grant 5 R03 CA117292-02. The author thanks Dr. John Pierce and the WHEL study for providing the blood pressure data that was used to illustrate the statistical methods proposed in this article. The author also thanks the referees and editor for their insightful comments which helped greatly to improve the manuscript.


  • American Cancer Society, Cancer Facts & Figures, 2007
  • American Heart Association Statistics, Heart Disease & Stroke Statistics Update
  • Altman DG. ‘Suboptimal analysis using optimal cutpoints’ British Journal of Cancer. 1998;78:556–557.
  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. ‘Measurement Error in Nonlinear Models’ Monographs on Statistics and Applied Probability. Vol. 105. Chapman & Hall/CRC; 2006.
  • Cole SR, Chu H, Greenland S. ‘Multiple-imputation for measurement-error correction’ International Journal of Epidemiology. 2006;35(4):1074–81. doi: 10.1093/ije/dyl097.
  • Day NE, McKeown N, Wong MY, Welch A, Bingham S. ‘Epidemiological assessment of diet: a comparison of a 7-day diary with a food frequency questionnaire using urinary markers of nitrogen, potassium and sodium’ International Journal of Epidemiology. 2001;30:309–317. doi: 10.1093/ije/30.2.309.
  • Fuller WA. ‘Measurement Error Models’ Wiley Series in Probability and Mathematical Statistics. 1987.
  • Gustafson P, Le DN. ‘Comparing the Effects of Continuous and Discrete Covariate Mismeasurement with emphasis on Dichotomization of Mismeasured Predictors’ Biometrics. 2002;58:878–887. doi: 10.1111/j.0006-341X.2002.00878.x.
  • Gustafson P. ‘Measurement Error and Misclassification in Statistics and Epidemiology’ Chapman & Hall. 2003.
  • Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano RP, Bingham S, Schoeller DA, Schatzkin A, Carroll R. ‘Structure of Dietary Measurement Error: Results of the OPEN Biomarker Study’ American Journal of Epidemiology. 2003;158:14–21. doi: 10.1093/aje/kwg091.
  • Messer K, Natarajan L. ‘Maximum likelihood, multiple imputation, and regression calibration for measurement error adjustment’ Statistics in Medicine. 2008;27:6332–6350. doi: 10.1002/sim.3458.
  • Natarajan L, Flatt SW, Sun X, Gamst AC, Major JM, Rock CL, Al-Delaimy W, Thomson CA, Newman VA, Pierce JP; Women’s Healthy Eating and Living Study Group. ‘Validity and systematic error in measuring carotenoid consumption with dietary self-report instruments’ American Journal of Epidemiology. 2006;163(8):770–8. doi: 10.1093/aje/kwj082.
  • Pierce JP, Faerber S, Wright F, Rock CL, Newman V, Flatt SW, Kealey S, Jones VE, Caan BJ, Gold EB, Haan M, Hollenbach KA, Jones L, Marshall JR, Ritenbaugh C, Stefanick M, Thomson C, Wasserman L, Natarajan L, Thomas RG, Gilpin EA. ‘A randomized trial of the effect of a plant based dietary pattern on breast cancer recurrence: The Women’s Healthy Eating and Living (WHEL) Study’ Controlled Clinical Trials. 2002;23:728–756. doi: 10.1016/S0197-2456(02)00241-6.
  • Richardson S, Gilks WR. ‘Conditional independence models for epidemiological studies with covariate measurement error’ Statistics in Medicine. 1993;12:1703–1722. doi: 10.1002/sim.4780121806.
  • Rock CL. ‘Carotenoids: biology and treatment’ Pharmacology & Therapeutics. 1997;75:185–197. doi: 10.1016/S0163-7258(97)00054-5.
  • Royston P, Altman DG, Sauerbrei W. ‘Dichotomizing continuous predictors in multiple regression: a bad idea’ Statistics in Medicine. 2006;25:127–141. doi: 10.1002/sim.2331.
  • Spiegelman D, Rosner B, Logan R. ‘Estimation and inference for logistic regression with covariate misclassification and measurement error, in main study/validation designs’ Journal of the American Statistical Association. 2000;95:51–61. doi: 10.2307/2669522.
  • Stefanski LA, Cook J. ‘Simulation Extrapolation: the measurement error jackknife’ Journal of the American Statistical Association. 1995;90:1247–1256. doi: 10.2307/2291515.

Articles from The International Journal of Biostatistics are provided here courtesy of Berkeley Electronic Press