Int J Biostat. 2009 January 1; 5(1): 12.

Published online 2009 April 2. doi: 10.2202/1557-4679.1143

PMCID: PMC2743435

NIHMSID: NIHMS121098

Loki Natarajan^{*}

Copyright © 2009 The Berkeley Electronic Press. All rights reserved


Epidemiologic research focuses on estimating exposure-disease associations. In some applications the exposure may be dichotomized, for instance when threshold levels of the exposure are of primary public health interest (e.g., consuming 5 or more fruits and vegetables per day may reduce cancer risk). Errors in exposure variables are known to yield biased regression coefficients in exposure-disease models. Methods for bias-correction with continuous mismeasured exposures have been extensively discussed, and are often based on validation substudies, where the “true” and imprecise exposures are observed on a small subsample. In this paper, we focus on biases associated with dichotomization of a mismeasured continuous exposure. The amount of bias is discussed in relation to the measurement error in the imprecise continuous predictor and the choice of dichotomization cutpoint. Measurement error correction via regression calibration is developed for this scenario, and compared to naïvely using the dichotomized mismeasured predictor in linear exposure-disease models. Properties of the measurement error correction method (i.e., bias, mean-squared error) are assessed via simulations.

It is common in epidemiologic studies of exposure-disease associations to transform a continuous exposure into a categorical one using cutpoints. Dichotomizing a continuous exposure variable can simplify interpretation of exposure-disease risks and inform public health decisions regarding associations between modifiable lifestyle factors, such as diet or body weight, and disease outcomes. For instance, it is believed that consuming ≥ 5 fruits and vegetables per day confers protection against developing cancer (American Cancer Society, 2007). Similarly, being obese, with body mass index (BMI) > 30, is associated with increased risk of diabetes and cardiovascular disease (American Heart Association, 2005). The merits and disadvantages of categorizing a continuous exposure variable have been a subject of vigorous debate in the statistical literature. Much has been written about the dangers of dichotomization, including loss of information and power, bias due to data-driven choices of optimal cutpoints, and model misspecification, particularly in model selection, since the model with a dichotomized exposure may be better specified than the model with a continuous effect, or vice versa (Altman, 1998; Royston et al., 2004). In this article we do not expound on this topic, but instead adopt the pragmatic view that categorizing continuous variables using cutpoints is a fact of life in public health research. We focus on the issue of measurement error in the exposure and its impact on dichotomization.

Using a mismeasured continuous predictor can result in biased estimates of exposure-disease risks (Fuller, 1987; Carroll et al., 2006). This issue is well-documented in nutrition studies when dietary intake is measured using self-report instruments such as food frequency questionnaires (FFQ). Self-report dietary intake assessment tools, although simple and inexpensive to administer, are subject to recall bias and hence often lead to attenuated estimates of diet-disease associations (Carroll et al., 2006; Kipnis et al., 2003). More accurate methods for assessing dietary intake, such as biomarkers, may be available, but these are usually expensive to obtain in large epidemiologic studies. For instance, protein intake can be accurately quantified by measuring urinary nitrogen (Carroll et al., 2006; Kipnis et al., 2003). A standard approach for adjusting for exposure mismeasurement is to conduct a validation substudy in which a “gold standard” (e.g., urinary nitrogen) and surrogate (e.g., protein intake reported on a FFQ) are obtained on a subsample of participants (Kipnis et al., 2003). The “gold standard” is used to calibrate the error-prone surrogate, and this “calibrated” version of the surrogate is then used to assess exposure-disease associations in the larger sample. This method, referred to as “regression calibration”, has been extensively studied and is known to perform very well. In fact, assuming models are correctly specified, regression calibration completely corrects for bias in the linear regression setting.

When a continuous surrogate predictor is transformed to a categorical predictor, biased estimates of exposure-disease risk still obtain. Gustafson and Le (2002) discuss the magnitude of bias when *W* is an unbiased surrogate for *X* (i.e., *EW* = *EX*), and show that it depends on the amount of measurement error and on the choice of threshold of the dichotomization. In this article, we extend Gustafson and Le’s work in three ways. First, we allow *W* to be a biased estimate of *X* and compute the resulting bias in exposure-disease estimates when *W* is naïvely utilized in the model instead of *X*. Second, we extend regression calibration to the setting of dichotomized continuous predictors, and demonstrate analytically scenarios when using this “calibrated and dichotomized” predictor reduces bias compared to the “naïve, dichotomized” predictor. We also assess the performance of the regression calibration correction for dichotomized predictors via simulations. Third, we discuss bias for estimated regression coefficients of an additional covariate, *Z*, which is measured without error.

In this article, *X* will denote the “true” exposure, *W* will denote a surrogate for *X*, *Z* will be a continuous covariate measured without error, and *Y* will be a continuous outcome variable. We will further assume that

$$\begin{array}{ll}(X,Z) & \sim\; N(\vec{0},\,\Sigma)\\ W & =\; \alpha_0+\alpha_1 X+u,\quad\text{where } u\sim N(0,\,\sigma_u^2)\\ Y & =\; \beta_0+\beta_1 X+\beta_2 Z+\varepsilon,\quad\text{where } \varepsilon\sim N(0,\,\sigma_\varepsilon^2)\end{array}$$

(2.1)

In the above, Σ is the variance-covariance matrix of (*X, Z*); we assume without loss of generality that *EX* = *EZ* = 0, var(*X*) = var(*Z*) = 1, and cov(*X, Z*) = cor(*X, Z*) = *ρ* (with var, cov, and cor denoting variance, covariance, and correlation, respectively). In addition, the true exposure *X* is assumed independent of the errors *u* and *ε*, and the errors are assumed independent of each other, giving “non-differential” measurement error (Carroll et al., 2006). Further, *α*_{0} and *α*_{1} denote, respectively, the additive and multiplicative bias in *W* as a surrogate for *X*.
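For concreteness, model (2.1) is straightforward to simulate. A minimal sketch in Python; the function name is ours, and the default *β* and *σ*_{ε} values are the ones used in the simulation section:

```python
import numpy as np

def simulate_model(n, rho, alpha0, alpha1, sigma_u, beta=(0.0, 1.0, 1.0),
                   sigma_eps=0.1, rng=None):
    """Draw one dataset of size n from model (2.1):
    (X, Z) bivariate normal, unit variances, correlation rho;
    W = alpha0 + alpha1*X + u;  Y = beta0 + beta1*X + beta2*Z + eps."""
    rng = np.random.default_rng(rng)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X, Z = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    W = alpha0 + alpha1 * X + rng.normal(0.0, sigma_u, size=n)
    b0, b1, b2 = beta
    Y = b0 + b1 * X + b2 * Z + rng.normal(0.0, sigma_eps, size=n)
    return X, Z, W, Y
```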

The main objective is to quantify the association between *Y* and the dichotomized exposure *X*_{b} = *I*(*X* > *c*), for a pre-specified cutpoint *c*, where

$$E(Y\,|\,X_b,Z)=\beta_{0b}+\beta_{1b}X_b+\beta_{2b}Z\qquad\text{with}$$

(2.2)

$$\begin{array}{l}\beta_{1b}=\dfrac{\beta_1[\operatorname{cov}(X,X_b)-\operatorname{cov}(X_b,Z)\operatorname{cov}(Z,X)]}{\operatorname{var}(X_b)-\operatorname{cov}^2(X_b,Z)}\\[2ex] \beta_{2b}=\dfrac{\operatorname{var}(X_b)[\beta_1\operatorname{cov}(Z,X)+\beta_2]-\operatorname{cov}(X_b,Z)[\beta_1\operatorname{cov}(X_b,X)+\beta_2\operatorname{cov}(X_b,Z)]}{\operatorname{var}(X_b)-\operatorname{cov}^2(X_b,Z)}.\end{array}$$

(2.3)

Using model (2.1), it is easy to see that cov(*X*, *X*_{b}) = *φ*(*c*), cov(*X*_{b}, *Z*) = *ρφ*(*c*), and var(*X*_{b}) = Φ(*c*)(1 − Φ(*c*)), where *φ* and Φ denote the standard normal density and distribution functions. Substituting these into Equations (2.3) gives

$$\begin{array}{ll}{\beta}_{1b}& =\frac{{\beta}_{1}\phi (c)(1-{\rho}^{2})}{\Phi (c)(1-\Phi (c))-{\rho}^{2}{\phi}^{2}(c)}\\ & =\frac{{\beta}_{1}(1-{\rho}^{2})}{R(c)-{\rho}^{2}\phi (c)}\\ {\beta}_{2b}& =\frac{({\beta}_{1}\rho +{\beta}_{2})[\Phi (c)(1-\Phi (c))]-({\beta}_{1}+{\beta}_{2}\rho )\rho {\phi}^{2}(c)}{\Phi (c)(1-\Phi (c))-{\rho}^{2}{\phi}^{2}(c)}\\ & =\frac{({\beta}_{1}\rho +{\beta}_{2})R(c)-({\beta}_{1}+{\beta}_{2}\rho )\rho \phi (c)}{R(c)-{\rho}^{2}\phi (c)}\end{array}$$

(2.4)

where
$R(c)=\frac{\Phi (c)(1-\Phi (c))}{\phi (c)}$. The primary research objective is to estimate the parameter *β*_{1b}, but we will also investigate the impact of measurement error in *W* on estimates of *β*_{2b}, the coefficient of the precisely measured covariate *Z*.
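As a numerical sanity check, the simplified forms in (2.4) can be coded directly and compared against the unsimplified ratios from (2.3). A sketch; the helper names are ours:

```python
from math import erf, exp, pi, sqrt

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def R(x):
    """R(x) = Phi(x)(1 - Phi(x)) / phi(x), as defined in the text."""
    return Phi(x) * (1 - Phi(x)) / phi(x)

def beta_b(beta1, beta2, rho, c):
    """beta_1b and beta_2b from the simplified forms in Equations (2.4)."""
    den = R(c) - rho ** 2 * phi(c)
    b1b = beta1 * (1 - rho ** 2) / den
    b2b = ((beta1 * rho + beta2) * R(c)
           - (beta1 + beta2 * rho) * rho * phi(c)) / den
    return b1b, b2b
```

With *ρ* = 0, the formula reduces to *β*_{2b} = *β*_{2}, as expected when *X*_{b} and *Z* are uncorrelated.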

The true exposure *X* may be expensive or difficult to ascertain in practice, and hence one may instead measure the more easily obtained surrogate *W*. We could dichotomize the surrogate as *W*_{b} = *I*(*W* > *c*) and use *W*_{b} in place of *X*_{b} in model (2.2). Note that *W*_{b} is a misclassified version of *X*_{b}; for example,

$$P({W}_{b}=1|{X}_{b}=1)=P(W>c|X>c)=\frac{{\int}_{c}^{\infty}[1-\Phi (\frac{c-{\alpha}_{0}-{\alpha}_{1}s}{{\sigma}_{u}})]\phi (s)ds}{1-\Phi (c)}.$$
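This conditional probability has no closed form in general, but it can be evaluated by numerical integration. A sketch; the function name and quadrature settings are arbitrary choices of ours:

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def sensitivity(c, alpha0, alpha1, sigma_u, upper=10.0, grid=20001):
    """P(W_b = 1 | X_b = 1): numerically integrate the expression above
    over s in (c, upper), truncating the negligible upper tail."""
    s = np.linspace(c, upper, grid)
    Phi_vec = np.vectorize(Phi)
    integrand = (1.0 - Phi_vec((c - alpha0 - alpha1 * s) / sigma_u)) \
        * np.exp(-s ** 2 / 2) / sqrt(2 * pi)
    # trapezoidal rule
    integral = float(np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(s)))
    return integral / (1 - Phi(c))
```

As the measurement error *σ*_{u} shrinks to zero (with *α*_{0} = 0, *α*_{1} = 1), *W* ≈ *X* and this probability approaches 1.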

Furthermore, similar to Equations (2.3), the association between *Y* and (*W*_{b}, *Z*) can be derived as follows.

$$E(Y\,|\,W_b,Z)=\beta_{0b}^{*}+\beta_{1b}^{*}W_b+\beta_{2b}^{*}Z\qquad\text{with}$$

(3.1)

$$\begin{array}{l}\beta_{1b}^{*}=\dfrac{\beta_1[\operatorname{cov}(X,W_b)-\operatorname{cov}(W_b,Z)\operatorname{cov}(Z,X)]}{\operatorname{var}(W_b)-\operatorname{cov}^2(W_b,Z)}\\[2ex] \beta_{2b}^{*}=\dfrac{\operatorname{var}(W_b)[\beta_1\operatorname{cov}(Z,X)+\beta_2]-\operatorname{cov}(W_b,Z)[\beta_1\operatorname{cov}(W_b,X)+\beta_2\operatorname{cov}(W_b,Z)]}{\operatorname{var}(W_b)-\operatorname{cov}^2(W_b,Z)}.\end{array}$$

(3.2)

Again using Gaussian theory based on model (2.1), it is easy to see that cov(*X*, *W*_{b}) = *α*_{1}*λφ*(*λ*(*c* − *α*_{0})), cov(*W*_{b}, *Z*) = *α*_{1}*λρφ*(*λ*(*c* − *α*_{0})), and var(*W*_{b}) = Φ(*λ*(*c* − *α*_{0}))(1 − Φ(*λ*(*c* − *α*_{0}))), where *λ* = (*α*_{1}^{2} + *σ*_{u}^{2})^{−1/2} is the reciprocal of the standard deviation of *W*. Substituting these into Equations (3.2) gives

$$\begin{array}{ll}{\beta}_{1b}^{*}& =\frac{{\beta}_{1}{\alpha}_{1}\lambda \phi (\lambda (c-{\alpha}_{0}))(1-{\rho}^{2})}{\Phi (\lambda (c-{\alpha}_{0}))(1-\Phi (\lambda (c-{\alpha}_{0})))-{\alpha}_{1}^{2}{\lambda}^{2}{\rho}^{2}{\phi}^{2}(\lambda (c-{\alpha}_{0}))}\\ & =\frac{{\beta}_{1}{\alpha}_{1}\lambda (1-{\rho}^{2})}{R(\lambda (c-{\alpha}_{0}))-{\alpha}_{1}^{2}{\lambda}^{2}{\rho}^{2}\phi (\lambda (c-{\alpha}_{0}))}\\ {\beta}_{2b}^{*}& =\frac{({\beta}_{1}\rho +{\beta}_{2})[\Phi (\lambda (c-{\alpha}_{0}))(1-\Phi (\lambda (c-{\alpha}_{0})))]-({\beta}_{1}+{\beta}_{2}\rho ){\alpha}_{1}^{2}{\lambda}^{2}\rho {\phi}^{2}(\lambda (c-{\alpha}_{0}))}{\Phi (\lambda (c-{\alpha}_{0}))(1-\Phi (\lambda (c-{\alpha}_{0})))-{\alpha}_{1}^{2}{\lambda}^{2}{\rho}^{2}{\phi}^{2}(\lambda (c-{\alpha}_{0}))}\\ & =\frac{({\beta}_{1}\rho +{\beta}_{2})R(\lambda (c-{\alpha}_{0}))-({\beta}_{1}+{\beta}_{2}\rho ){\alpha}_{1}^{2}{\lambda}^{2}\rho \phi (\lambda (c-{\alpha}_{0}))}{R(\lambda (c-{\alpha}_{0}))-{\alpha}_{1}^{2}{\lambda}^{2}{\rho}^{2}\phi (\lambda (c-{\alpha}_{0}))}\end{array}$$

(3.3)

Using equations (2.4) and (3.3), the “attenuation factor” (i.e., multiplicative bias) when *W*_{b} is used instead of *X*_{b} is

$$AF=\frac{{\beta}_{1b}^{*}}{{\beta}_{1b}}=\frac{{\alpha}_{1}\lambda (R(c)-{\rho}^{2}\phi (c))}{R(\lambda (c-{\alpha}_{0}))-{\alpha}_{1}^{2}{\lambda}^{2}{\rho}^{2}\phi (\lambda (c-{\alpha}_{0}))}.$$

(3.4)

We note that if we set *α*_{0} = 0 and *α*_{1} = 1, the second term in the denominator of expression (3.4) differs from the corresponding term in Gustafson and Le (2002), which has *ρ*^{2}*λφ*(*λc*) in place of *ρ*^{2}*λ*^{2}*φ*(*λc*). We believe this is a typographical error in their article, as this term stems from the *square* of the covariance between *W*_{b} and *Z*, and hence should carry a factor of *λ*^{2}.

We explore the behavior of the multiplicative bias *AF* as each of the parameters *σ*_{u}^{2}, *α*_{0}, *α*_{1}, *ρ*, and *c* is varied. Note that *R*(*x*) is symmetric and decreasing in |*x*| (Gustafson and Le, 2002). First, as expected, when the measurement error variance *σ*_{u}^{2} increases, *λ* decreases to 0, and therefore the multiplicative bias *AF* approaches 0 (i.e., is worst), as shown in Figure 1. Similar behavior of *AF* is evident when |*α*_{1}| decreases, i.e., when *W* is a poor surrogate for *X* (Figure 2). On the other hand, as |*α*_{1}| increases to infinity, *AF* converges to $\frac{\pm (R(c)-{\rho}^{2}\phi (c))}{R(0)-{\rho}^{2}\phi (0)}$, so that the multiplicative bias of the naïve estimator converges to a constant (Figure 2).
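These limits are easy to verify numerically. A sketch of (3.4) in Python, where the parameterization *λ* = (*α*_{1}^{2} + *σ*_{u}^{2})^{−1/2} is our reading of the surrounding text:

```python
from math import erf, exp, pi, sqrt

def phi(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def R(x):
    return Phi(x) * (1 - Phi(x)) / phi(x)

def AF_naive(alpha0, alpha1, sigma_u, rho, c):
    """Multiplicative bias (3.4) of the naive dichotomized surrogate W_b."""
    lam = 1.0 / sqrt(alpha1 ** 2 + sigma_u ** 2)   # lambda = 1 / SD(W)
    a = lam * (c - alpha0)
    return (alpha1 * lam * (R(c) - rho ** 2 * phi(c))
            / (R(a) - alpha1 ** 2 * lam ** 2 * rho ** 2 * phi(a)))
```

With no measurement error (*α*_{0} = 0, *α*_{1} = 1, *σ*_{u} = 0) the function returns exactly 1, and for a large cutpoint it approaches *λ*^{2}, matching the limiting behavior discussed below.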

Figure 1. Multiplicative bias of each estimator as the measurement error SD *σ*_{u} varies; the y-axis represents $\frac{{\beta}_{1b}^{*}}{{\beta}_{1b}}$, the ratio of the naïve (dashed black curve) or RC-based (solid blue curve) regression coefficient to the true coefficient.

Figure 2. Multiplicative bias of each estimator as *α*_{1} varies; the y-axis represents $\frac{{\beta}_{1b}^{*}}{{\beta}_{1b}}$, the ratio of the naïve (dashed black curve) or RC-based (solid blue curve) regression coefficient to the true coefficient; *W* = *α*_{0} + *α*_{1}*X* + *u*.

Furthermore, because *R*(*x*) is decreasing in |*x*|, *AF* approaches infinity as the additive bias in *W*, |*α*_{0}|, increases. Thus, additive bias in *W* can lead to an inflated estimate of *β*_{1b} rather than attenuation of effects, an arguably more pernicious consequence of a poor surrogate.

Finally, since *R*(*z*) ≈ 1/|*z*| for large |*z*|, *AF* converges to *α*_{1}*λ*^{2} as the cutpoint |*c*| goes to infinity. In particular, when *α*_{1} = 1, *AF* converges to *λ*^{2} as the cutpoint *c* moves away from 0, leading to more multiplicative bias when the cutpoint is far from the center of the true exposure distribution (Figures 3 and 4). The behavior of *AF* is more erratic as *ρ* increases from 0 to 1 (Figure 3) and also when *W* is a biased estimate of *X*, i.e., *α*_{0} ≠ 0 or *α*_{1} ≠ 1 (data not shown).

Figure 3. Multiplicative bias of each estimator as cor(*X*, *Z*) = *ρ* varies; the y-axis represents $\frac{{\beta}_{1b}^{*}}{{\beta}_{1b}}$, the ratio of the naïve (thin black curve) or RC-based (thick blue curve) regression coefficient to the true coefficient; *W* = *α*_{0} + *α*_{1}*X* + *u*.

One of the most widely used methods to adjust for biased estimation due to mismeasured covariates is *regression calibration* (Carroll et al., 2006). We briefly outline this method below. As before, assume that true continuous *X*, its surrogate *W*, and a continuous, accurately measured covariate *Z* are jointly normally distributed, and that interest focuses on estimating the association between true *X* and outcome *Y* (continuous or discrete): *E*(*Y* |*X, Z*) = *g*(*X, Z, β*), for some hypothesized model *g*. It is also assumed that a *validation substudy* is conducted in which *X* and *W* are measured on a small subsample of participants. If the validation substudy is conducted on a subsample of the main study, we have an *internal validation design* and *X*, *W*, and *Y* are observed for participants in the validation substudy; if the substudy is not part of the main study, we have an *external validation design*, and *Y* is not observed for participants in the validation substudy. Note that *Z* is assumed to be measured on all participants, and hence is available for main study as well as validation substudy participants.

The regression calibration algorithm proceeds as follows: (i) fit a *calibration* model *E*(*X*|*W, Z*) using data on *X*, *W*, and *Z* from the validation substudy; (ii) using the regression coefficients from the calibration model, obtain a predicted value $\hat{X}_i$ for each subject *i* in the main study; (iii) fit the hypothesized outcome model *g* with $\hat{X}_i$ substituted for the unobserved *X*_{i}.

The regression calibration algorithm is simple to implement but does require validation or replication data in order to model *E*(*X*|*W, Z*). It is easy to see that regression calibration gives the correct mean function in linear models *E*(*Y* |*X, Z*) = *β*_{0} + *β*_{1}*X* + *β*_{2}*Z* in the non-differential error setting (Carroll et al., 2006). Of course, in this article we are interested in a dichotomized version of *X*, *X*_{b}, so in the following sections we will describe an implementation of regression calibration for this scenario, and examine whether this method is less biased than naïvely substituting *W*_{b} for *X*_{b}.
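The exactness of regression calibration in the linear model is easy to illustrate by simulation. A sketch under model (2.1) with *α*_{0} = 0, *α*_{1} = 1, and *ρ* = 0 (so *Z* can be omitted); for simplicity the calibration slope is estimated on the full simulated sample, whereas in practice it would come from the validation substudy:

```python
import numpy as np

# Model (2.1) with alpha0 = 0, alpha1 = 1, rho = 0 (Z omitted).
rng = np.random.default_rng(0)
n, beta1, sigma_u = 200_000, 1.0, 0.5
X = rng.normal(size=n)
W = X + rng.normal(0.0, sigma_u, size=n)
Y = beta1 * X + rng.normal(0.0, 0.1, size=n)

def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    return float(np.cov(x, y, ddof=0)[0, 1] / np.var(x))

naive = slope(W, Y)          # attenuated: roughly beta1 / (1 + sigma_u**2)
# Calibration step: slope of X on W (all means are zero, so no intercept).
lam_hat = float(np.cov(W, X, ddof=0)[0, 1] / np.var(W))
X_rc = lam_hat * W           # predicted X given W
calibrated = slope(X_rc, Y)  # approximately beta1
```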

In this article, our primary focus is estimating the association between dichotomized *X* (i.e., *X*_{b}) and the outcome *Y*, adjusting for *Z*. Regression calibration for this setting proceeds as follows:

- Fit a calibration model for *X* given (*W*, *Z*), i.e., *E*(*X*|*W, Z*) = *γ*_{0} + *γ*_{1}*W* + *γ*_{2}*Z*. Let *X*_{rc} = $\hat{\gamma}_0+\hat{\gamma}_1 W+\hat{\gamma}_2 Z$ be the predicted value of *X* given *W* and *Z*, using the estimated regression coefficients $\hat{\gamma}_0$, $\hat{\gamma}_1$, and $\hat{\gamma}_2$ from the calibration model.
- Define *X*_{rcb} = *I*(*X*_{rc} > *c*), a dichotomized version of *X*_{rc}.
- For each subject *i* who is missing *X*, use *X*_{rci} = $\hat{\gamma}_0+\hat{\gamma}_1 W_i+\hat{\gamma}_2 Z_i$ and replace *X*_{bi} with *X*_{rcbi} = *I*(*X*_{rci} > *c*).
- Use least-squares regression to fit *E*(*Y*|*X*_{rcb}*, Z*).
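The steps above can be sketched in code for the external validation design; function and variable names are ours:

```python
import numpy as np

def regression_calibration_dichotomized(W, Z, Y, X_val, W_val, Z_val, c):
    """Four-step algorithm above (external validation design): calibrate X
    on (W, Z) in the validation data, predict X in the main study,
    dichotomize the prediction at c, and fit the outcome model."""
    # Step 1: calibration model E(X | W, Z) in the validation substudy.
    D_val = np.column_stack([np.ones_like(W_val), W_val, Z_val])
    gamma, *_ = np.linalg.lstsq(D_val, X_val, rcond=None)
    # Steps 2-3: predicted X for main-study subjects, then dichotomize.
    X_rc = gamma[0] + gamma[1] * W + gamma[2] * Z
    X_rcb = (X_rc > c).astype(float)
    # Step 4: least-squares fit of E(Y | X_rcb, Z).
    D = np.column_stack([np.ones_like(Y), X_rcb, Z])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return beta  # (beta_0b, beta_1b, beta_2b) estimates
```

In the internal design the same function applies, except that the validation subjects' true *X*_{b} would be kept rather than replaced.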

Clearly, *X*_{rcb} is an approximation of *X*_{b}. Calculations analogous to those leading to Equations (2.4), with details given in the final section, yield

$$\begin{array}{l}\beta_{1b}^{rc}=\dfrac{\beta_1\left(\sigma_{rc}-\frac{\rho^2}{\sigma_{rc}}\right)}{R(c/\sigma_{rc})-\frac{\rho^2}{\sigma_{rc}^2}\phi(c/\sigma_{rc})}\\[2.5ex] AF_{rc}=AF_{rc}(\alpha_1,\sigma_u^2,\rho,c)=\dfrac{\beta_{1b}^{rc}}{\beta_{1b}}=\dfrac{\left(\sigma_{rc}-\frac{\rho^2}{\sigma_{rc}}\right)\left(R(c)-\rho^2\phi(c)\right)}{(1-\rho^2)\left(R(c/\sigma_{rc})-\frac{\rho^2}{\sigma_{rc}^2}\phi(c/\sigma_{rc})\right)}\end{array}$$

(3.5)

where
$R(x)=\frac{\Phi (x)(1-\Phi (x))}{\phi (x)}$ as before, and *σ*_{rc} is the standard deviation of *X*_{rc}.
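Given the model parameters, *σ*_{rc} can be computed by standard multivariate normal conditioning (the closed form below is our derivation, not a formula stated in the text), after which *AF*_{rc} follows from (3.5):

```python
from math import erf, exp, pi, sqrt
import numpy as np

def phi(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def R(x):
    return Phi(x) * (1 - Phi(x)) / phi(x)

def sigma_rc(alpha1, sigma_u2, rho):
    """SD of X_rc = E(X | W, Z) under model (2.1): normal-theory conditioning
    with cov((W, Z), X) = (alpha1, rho) -- our derivation."""
    b = np.array([alpha1, rho])
    S = np.array([[alpha1 ** 2 + sigma_u2, alpha1 * rho],
                  [alpha1 * rho, 1.0]])
    return sqrt(float(b @ np.linalg.solve(S, b)))

def AF_rc(alpha1, sigma_u2, rho, c):
    """Multiplicative bias (3.5) of the calibrated, dichotomized predictor."""
    s = sigma_rc(alpha1, sigma_u2, rho)
    num = (s - rho ** 2 / s) * (R(c) - rho ** 2 * phi(c))
    den = (1 - rho ** 2) * (R(c / s) - rho ** 2 / s ** 2 * phi(c / s))
    return num / den
```

With no measurement error (*α*_{1} = 1, *σ*_{u}^{2} = 0), *σ*_{rc} = 1 and *AF*_{rc} = 1, i.e., no bias.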

In this section, we compare the multiplicative bias when either the naïve or the regression calibration estimator is used to estimate *β*_{1b}. Figures 1 – 4 depict the regression calibration bias, *AF*_{rc}, for different values of *σ*_{u}^{2}, *α*_{1}, *ρ*, and *c*.

The behavior of *AF*_{rc} as the cutpoint or the correlation between *X* and *Z* varies is shown in Figures 3 and 4.

To further examine the behavior of the naïve and regression calibration approaches in finite sampling situations, we conducted a simulation study. We generated 1000 datasets each of size 500 conforming to Model (2.1). The true exposure *X* and potentially confounding covariate *Z* were simulated from a bivariate Gaussian distribution with mean (0*,* 0), var(*X*) = var(*Z*) = 1*,* and cov(*X, Z*) = cor(*X, Z*) = *ρ*. Simulated datasets were generated for three values of *ρ*, namely, *ρ* = 0, 0.3, or 0.7. For each *X* and *Z*, the outcome was generated as *Y* = *X* + *Z* + *ε* where *ε* was drawn from a mean-zero Gaussian distribution with standard deviation equal to 0.1.

In the previous sections we have already noted (see Equations (3.4) and (3.5) and Figures 1 and 2) that the multiplicative bias *AF* increases as the additive bias in *W*, *α*_{0}, increases, whereas the regression calibration estimator is independent of *α*_{0}. Hence, for the simulations, we assumed that *α*_{0} = 0, thus removing this source of error from *W*. Accordingly, the surrogate *W* was generated via Model (2.1) as *W* = *α*_{1}*X* + *u*. We compared the naïve and regression calibration estimators over a range of values for the measurement error parameters (*α*_{1}, *σ*_{u}). In particular, we varied the multiplicative bias in *W* by setting *α*_{1} = 0.5, 1, or 10, and considered two values of the measurement error standard deviation *σ*_{u}.

Values chosen for *α*_{1} and *σ*_{u} were guided by nutrition studies, where food frequency questionnaires (FFQ) are often used to estimate usual intake of various food components. FFQs are inexpensive and easy to administer, but are known to be subject to large measurement errors (Kipnis et al., 2003; Day et al., 2001). In some instances less error-prone measures using biomarkers, such as doubly labeled water for energy intake or urinary nitrogen for protein intake, are used to calibrate the surrogate FFQ in validation substudies. Scaling biases in FFQ-based measures relative to such biomarkers, reported in these calibration studies, motivated the values of *α*_{1} considered here.

Finally, cutpoint values of 0, 1, and 2 were used. For each cutpoint, the true regression coefficients *β*_{1b} and *β*_{2b} were computed from Equations (2.4).

For each of the 54 combinations of *α*_{1}, *σ*_{u}, *ρ*, and *c*, we estimated *β*_{1b} and *β*_{2b} using the naïve method and using regression calibration under both external and internal validation designs.

For the internal design, a 50% validation subset of 250 was randomly chosen from the dataset of 500. This subsample of 250 was the designated *internal validation subsample* used to calibrate *W*. A least-squares regression of *X* on *W* and *Z* was fitted in the validation subsample to obtain *X*_{rc} = $\hat{\gamma}_0+\hat{\gamma}_1 W+\hat{\gamma}_2 Z$; for subjects outside the validation subsample, *X*_{b} was then replaced with *X*_{rcb} = *I*(*X*_{rc} > *c*).

For each of the 54 simulated scenarios, mean bias, standard deviation (SD), and root-mean-squared error (RMSE) of the estimated *β*_{1b} and *β*_{2b} were calculated using the 1000 datasets.
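These summary statistics can be computed as follows; we use the population SD so that RMSE² = bias² + SD² holds exactly (the article does not state which SD convention it uses):

```python
import numpy as np

def summarize(estimates, truth):
    """Mean bias, SD, and RMSE of a vector of estimates of a scalar truth,
    as computed over the simulated datasets."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    sd = est.std(ddof=0)   # population SD: rmse**2 == bias**2 + sd**2
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    return bias, sd, rmse
```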

Bias, standard deviation (SD), and root mean-squared errors (RMSE) of estimates of *β*_{1b} for each of the naïve, regression calibration external design, and regression calibration internal design methods are presented in Tables 1 – 3. The “gold standard” estimate, obtained by fitting the model with the true *X*_{b}, is also presented for comparison.

Comparing across the three tables, as expected for the naïve and external design regression calibration methods, the absolute value of the bias and the SD of the estimates of *β*_{1b} increased as the measurement error in *W* increased.

For the naïve method, absolute value of bias increased as the cutpoint increased, except for *ρ* = 0.7, where no clear pattern emerged. Regression calibration for the external and internal design displayed the opposite behavior, with bias decreasing as the cutpoint increased except for *ρ* = 0.7, where the external design regression calibration estimator displayed slightly increasing bias as the cutpoint increased. Also for *α*_{1} = 0.5 and *ρ* = 0.3, the bias when using the external design increased as the cutpoint increased. For all the methods, the standard deviation of estimates increased as the cutpoint increased.

When comparing the performance of the methods to each other, regression calibration in the external design had less absolute bias than the naïve method in all simulated scenarios, except when *ρ* = 0.7*,* where regression calibration in the external design had 10 – 30% more bias than the naïve method. In all other scenarios, and particularly as the cutpoint and/or *α*_{1} increased, regression calibration estimates for the external design had markedly less bias than the naïve approach. However, the reduction in bias when using regression calibration was accompanied by an increase in variability of estimates when *c* = 1*,* 2, a known feature of regression calibration (Carroll et al., 2006). Nevertheless, the regression calibration estimator still displayed lower RMSE than the naïve approach in all scenarios except when *ρ* = 0.7.

The regression calibration internal design had less absolute bias and smaller SD than the external design in all situations, with striking improvements when *α*_{1} = 0.5, 1. This was to be expected, since in the internal design the true *X* is available on 50% of the sample. Needless to say, regression calibration in the internal design displayed marked reductions in absolute bias compared to the naïve estimator. Furthermore, the SDs of estimates using the internal regression calibration method were also smaller than or comparable to those of the naïve approach, except when (i) the cutpoint was non-zero and *α*_{1} = 10 or *α*_{1} = 1, or (ii) *ρ* = 0.7 and *α*_{1} = 0.5. However, the substantial bias reduction of regression calibration for the internal design more than compensated for the corresponding small increase in variability of estimated parameters, as is evident from a comparison of the root mean-squared errors of the three methods: internal design RMSEs were often about half those of the other methods.

The primary parameter of interest is *β*_{1b}, which measures the strength of association between the outcome *Y* and the dichotomized exposure *X*_{b}. However, the impact of measurement error on estimates of *β*_{2b}, the coefficient of the precisely measured covariate *Z*, is also of interest.

As seen in Tables 1 – 3, the regression calibration estimates of *β*_{2b} were less biased than those derived from naïvely using the surrogate *W*_{b}.

The simulations suggest that regression calibration can substantially correct biases introduced by dichotomizing a mismeasured predictor. Particular gains were evident as the cutpoint increased. The internal design performed best, as would be expected. Although not presented here, the naïve method did not work at all when *α*_{0} = ±10 and *α*_{1} = 0.5, 1, whereas regression calibration was still able to estimate *β*_{1b} and *β*_{2b}.

We applied the methods described here to a subsample of breast cancer survivors who participated in the Women’s Healthy Eating and Living (WHEL) Study, a dietary intervention trial (Pierce et al., 2002) aimed at reducing breast cancer recurrence. For the current analysis, we focused on a dataset of 1673 women who had blood pressure recordings and carotenoid data at study entry. Carotenoids are bioactive compounds provided mainly by vegetables and fruit in the diet (Rock, 1997), hypothesized to reduce cancer and cardiovascular disease risk. In the WHEL study, carotenoid intake was assessed in two ways: (i) based on self-reported fruit and vegetable intake obtained during multiple 24-hr recall telephone interviews of WHEL participants by trained nutritionists, (ii) by measuring plasma levels of circulating carotenoids in blood samples obtained from participants at clinic visits (Pierce et al., 2002). The plasma measure is a biomarker known to be well correlated with fruit and vegetable intake (Rock, 1997), and for this analysis serves as the “gold standard”. Self-reported carotenoid intake represents the surrogate measure for the true plasma value. The outcome of interest is systolic blood pressure.

Mean (SD) systolic blood pressure (SBP) in this population was 117.3 (16) mm Hg, and displayed a reasonably Gaussian distribution. We log-transformed and then centered and scaled the carotenoid measures to have zero mean and unit variance. Applying the notation of Model (2.1) to this example, *Y* represents SBP, *X* is the standardized (log) plasma carotenoid concentration, and *W* is the standardized (log) self-reported carotenoid intake. We used a cutpoint of *c* = 1, so that *X*_{b} = *I*(*X* > 1) indicates a plasma carotenoid level more than one standard deviation above the mean.
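The preprocessing can be sketched as follows; since the WHEL data are not available here, synthetic lognormal values (an assumption of ours) stand in for the plasma measurements:

```python
import numpy as np

def standardize_log(v):
    """Log-transform, then center and scale to zero mean and unit variance,
    as done for the carotenoid measures."""
    x = np.log(np.asarray(v, dtype=float))
    return (x - x.mean()) / x.std()

# Synthetic positive data standing in for plasma carotenoid concentrations.
rng = np.random.default_rng(0)
plasma = rng.lognormal(mean=0.0, sigma=0.5, size=1673)
X = standardize_log(plasma)
X_b = (X > 1).astype(int)   # cutpoint c = 1, as in the example
```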

We hypothesized a linear model and estimated regression coefficients for the association between *Y* (dependent variable) and each of (i) *X*_{b}, the gold standard, (ii) *W*_{b}, the naïvely dichotomized surrogate, and (iii) *X*_{rcb}, the calibrated and dichotomized surrogate.

Our development so far has focused on the multivariate normal situation where *Y, X, W, Z* are jointly normally distributed. We could relax this condition and instead assume that *Y, X, W, Z* jointly have finite first and second moments (Gustafson and Le, 2002). If we assume as before non-differential measurement error, so that *E*(*Y* |*X, Z, W*) = *E*(*Y* |*X, Z*), and a linear outcome model *E*(*Y*|*X, Z*) = *β*_{0} + *β*_{1}*X* + *β*_{2}*Z*, then the parameters of interest, *β*_{1b} and ${\beta}_{1b}^{*}$, can be obtained from equations (2.3) and (3.2). Hence the multiplicative bias when *W*_{b} is substituted for *X*_{b} can still be derived under these weaker moment conditions, although the specific closed forms (2.4) and (3.3) rely on normality.

Epidemiological studies often seek to estimate disease risk associated with categorized levels of a continuous exposure. When a mismeasured exposure is split into categories using pre-defined cutpoints, biased estimates of exposure-disease associations are obtained. Gustafson and Le (2002) discuss the bias in exposure-disease associations resulting from dichotomization of a mismeasured predictor, when the latter is an unbiased estimate of true exposure. We extend their results and derive formulae for the multiplicative bias in linear exposure-disease associations when the mismeasured exposure is a biased estimate of true exposure.

In nutritional epidemiology, surrogates such as food frequency questionnaires and food diaries are known to provide biased estimates of true dietary intake (Kipnis et al., 2003; Day et al., 2001). Furthermore, in nutrition studies, exposures are often dichotomized: for instance it is believed that (American Cancer Society, 2007) consuming ≥ 5 fruits/vegetables per day will confer protection against cancer. Thus, quantifying the bias associated with dichotomizing a mismeasured and likely biased estimate of true dietary intake is important for designing future diet-cancer studies.

Regression calibration is a well-known method (Carroll et al., 2006), used to adjust for measurement error in continuous exposures. In this article, we describe a form of regression calibration for dichotomized mismeasured predictors in the linear regression setting. We rigorously derive analytic formulae for the multiplicative bias for this regression calibration approach, and compare these to naïvely dichotomizing the mismeasured surrogate without calibration. We also compare the performance of the methods using simulations to mimic many real-world applications when true exposure may be available on a subsample of the study population.

Our findings suggest that in the linear regression setting, when additive bias (i.e., *α*_{0}) in the surrogate is large, the naïve method fails and cannot be used to estimate exposure-disease associations. Regression calibration, though biased, does work in these situations. Further, regression calibration usually outperforms the naïve method when both methods can be applied. Scenarios when regression calibration has the most advantage are discussed.

Several caveats must be noted when applying the methods described here. We have considered a particular application of regression calibration which seems natural for the analyses we are interested in undertaking. There are other ways in which we could have implemented regression calibration, such as fitting a calibration model for *X*_{b} given (*W*, *Z*) directly on the binary scale.

Dichotomizing a mismeasured predictor leads to differential error even when the error in the continuous predictor is non-differential (Gustafson and Le, 2002). The impact of this differential error is already evident in the erratic behavior of the multiplicative bias as *ρ* = cor(*X, Z*) varies. In the case of a continuous or binary predictor with non-differential error, as *ρ* increases to 1, *AF* decreases to 0, leading to substantial bias when *X* and *Z* are highly correlated (Gustafson and Le, 2002); whereas for the dichotomized mismeasured predictor considered in this article, the multiplicative bias as *ρ* increases (Equation (3.4)) depends also on the cutpoint *c*. Additionally, bias due to dichotomizing a mismeasured predictor depends on the distribution function of the true exposure *X*, while a salient feature of non-differential error is that the bias depends only on the variance of *X*. We conjecture that it is due to this differential error that our regression calibration implementation still yields biased estimates of *β*_{1b}.

It is important to note that this article focuses on disease outcomes that can be measured continuously, and such that exposure-disease associations can be modeled using a linear regression model. Clearly, other disease models (logistic, survival) should be considered and we leave this to future work. There are many approaches to measurement error adjustment including multiple imputation (Cole et al., 2006), simulation-extrapolation (Stefanski and Cook, 1995), Bayesian (Richardson and Gilks, 1993; Gustafson, 2003), and maximum likelihood methods (Spiegelman et al., 2003; Messer and Natarajan, 2008). These methods would likely be good competitors to regression calibration in this setting of dichotomization of mismeasured predictors. However, most of these alternate approaches require more extensive programming and are not as easily implemented as regression calibration.

In summary, in fields such as nutritional epidemiology, where mismeasured dietary intake is often dichotomized, we strongly recommend conducting calibration studies and using the method of regression calibration to adjust for measurement error. This approach can substantially reduce biases and lead to more accurate estimates of exposure-disease model coefficients.
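A minimal sketch of the recommended workflow, under a main study/validation design: fit the calibration model E[*X* | *W*, *Z*] in the validation substudy, predict the calibrated exposure in the main study, dichotomize it, and fit the outcome model. All data, coefficients, and the cutpoint below are simulated stand-ins for illustration (not the WHEL data or the paper's exact model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical main study / validation substudy design; the generative
# model and all parameter values are illustrative assumptions.
n_main, n_val, c = 100_000, 500, 0.5

def gen(n):
    z = rng.normal(size=n)                                   # covariate
    x = 0.5 * z + rng.normal(scale=np.sqrt(0.75), size=n)    # true exposure
    w = x + rng.normal(scale=0.6, size=n)                    # mismeasured surrogate
    y = 1.0 + 2.0 * (x > c) + 0.5 * z + rng.normal(size=n)   # continuous outcome
    return x, w, z, y

x_v, w_v, z_v, _ = gen(n_val)    # validation substudy: true X observed
_, w_m, z_m, y_m = gen(n_main)   # main study: only W, Z, Y observed

def ols(design, resp):
    return np.linalg.lstsq(design, resp, rcond=None)[0]

# Step 1: calibration model E[X | W, Z], fit in the validation substudy.
gamma = ols(np.column_stack([np.ones(n_val), w_v, z_v]), x_v)

# Step 2: calibrated exposure in the main study, then dichotomized at c.
x_rc = np.column_stack([np.ones(n_main), w_m, z_m]) @ gamma
x_rcb = (x_rc > c).astype(float)

# Step 3: outcome models with naive vs calibrated dichotomized exposure.
w_b = (w_m > c).astype(float)
naive = ols(np.column_stack([np.ones(n_main), w_b, z_m]), y_m)[1]
calib = ols(np.column_stack([np.ones(n_main), x_rcb, z_m]), y_m)[1]
print(f"true slope 2.0 | naive {naive:.2f} | calibrated {calib:.2f}")
```

In this sketch the calibrated estimate moves toward the true coefficient but, consistent with the discussion above, does not remove the bias entirely.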

Details on the derivation of the bias associated with using *X*_{rcb} instead of

$$\begin{array}{ll}{\beta}_{1rcb}&=\frac{\text{cov}(Y,\,{X}_{rcb})-\text{cov}({X}_{rcb},\,Z)\,\text{cov}(Z,\,Y)}{\text{var}({X}_{rcb})-{\text{cov}}^{2}({X}_{rcb},\,Z)}\\ &=\frac{{\beta}_{1}[\text{cov}(X,\,{X}_{rcb})-\text{cov}({X}_{rcb},\,Z)\,\text{cov}(Z,\,X)]}{\text{var}({X}_{rcb})-{\text{cov}}^{2}({X}_{rcb},\,Z)}\end{array}$$

(7.1)

The multivariate normality of (*X, W, Z*) implies that (a) $\text{cov}(X,\,{X}_{rcb})={\sigma}_{rc}\,\phi(c/{\sigma}_{rc})$, (b) $\text{cov}({X}_{rcb},\,Z)=\frac{\rho}{{\sigma}_{rc}}\,\phi(c/{\sigma}_{rc})$, and (c) $\text{var}({X}_{rcb})=\Phi(c/{\sigma}_{rc})(1-\Phi(c/{\sigma}_{rc}))$. Substituting for each term in Equation (7.1), we get
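Identities (a)-(c) can be verified by Monte Carlo under an assumed construction consistent with the set-up: take the calibrated predictor *X*_{rc} to be N(0, σ_{rc}²) with cov(*X*, *X*_{rc}) = σ_{rc}² and cov(*Z*, *X*_{rc}) = ρ (the properties a regression-calibrated predictor would have here). The numerical values of σ_{rc}, ρ, and *c* below are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 1_000_000

# Assumed construction: X_rc ~ N(0, s^2); X = X_rc + e gives
# cov(X, X_rc) = s^2 and var(X) = 1; Z is standard normal with
# cov(Z, X_rc) = rho. Parameter values are illustrative.
s, rho, c = 0.8, 0.5, 0.3
x_rc = rng.normal(0.0, s, n)
x = x_rc + rng.normal(0.0, np.sqrt(1 - s**2), n)
z = (rho / s**2) * x_rc + rng.normal(0.0, np.sqrt(1 - rho**2 / s**2), n)
x_rcb = (x_rc > c).astype(float)   # dichotomized calibrated predictor

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

u = c / s
print(cov(x, x_rcb), s * norm.pdf(u))                   # identity (a)
print(cov(x_rcb, z), (rho / s) * norm.pdf(u))           # identity (b)
print(x_rcb.var(), norm.cdf(u) * (1 - norm.cdf(u)))     # identity (c)
```

Each empirical moment matches its closed-form counterpart to Monte Carlo accuracy.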

$$\begin{array}{ll}{\beta}_{1rcb}& =\frac{{\beta}_{1}[{\sigma}_{rc}\phi (c/{\sigma}_{rc})-\frac{{\rho}^{2}}{{\sigma}_{rc}}\phi (c/{\sigma}_{rc})]}{\Phi (c/{\sigma}_{rc})(1-\Phi (c/{\sigma}_{rc}))-\frac{{\rho}^{2}}{{\sigma}_{rc}^{2}}{\phi}^{2}(c/{\sigma}_{rc})}\\ & =\frac{{\beta}_{1}({\sigma}_{rc}-\frac{{\rho}^{2}}{{\sigma}_{rc}})}{R(c/{\sigma}_{rc})-\frac{{\rho}^{2}}{{\sigma}_{rc}^{2}}\phi (c/{\sigma}_{rc})}\end{array}$$

(7.2)

The expression for *AF*_{rc} (3.5) is then easily derived from Equations (2.4) and (7.2).
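The ratio β_{1rcb}/β₁ from Equation (7.2) is straightforward to evaluate numerically using the first-line form, which is written only in terms of Φ and φ. A quick sketch (the values of σ_{rc} and ρ are illustrative) showing how the ratio varies with the cutpoint *c*:

```python
from scipy.stats import norm

def beta_ratio(s_rc, rho, c):
    """beta_1rcb / beta_1 from the first line of Equation (7.2)."""
    u = c / s_rc
    num = (s_rc - rho**2 / s_rc) * norm.pdf(u)
    den = norm.cdf(u) * (1 - norm.cdf(u)) - (rho**2 / s_rc**2) * norm.pdf(u)**2
    return num / den

# Illustrative values: sigma_rc = 0.8, rho = 0.3, several cutpoints.
for c in (0.0, 0.5, 1.0):
    print(c, round(beta_ratio(0.8, 0.3, c), 3))
```

At these parameter values the ratio grows as the cutpoint moves away from the mean, illustrating the cutpoint dependence of the bias noted in the Comments section. (Note the ratio is relative to the slope on continuous *X*, so values above 1 simply reflect the smaller variance of the binary predictor.)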

^{*}This research was supported in part by NIH grant 5 R03 CA117292-02. The author thanks Dr. John Pierce and the WHEL study for providing the blood pressure data that was used to illustrate the statistical methods proposed in this article. The author also thanks the referees and editor for their insightful comments which helped greatly to improve the manuscript.

- American Cancer Society, Cancer Facts & Figures , http://www.cancer.org/, 2007
- American Heart Association, Heart Disease & Stroke Statistics Update , http://www.americanheart.org/, 2005
- Altman DG. ‘Suboptimal analysis using optimal cutpoints’ British Journal of Cancer. 1998;78:556–557. [PMC free article] [PubMed]
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. ‘Measurement Error in Nonlinear Models’ Monographs on Statistics and Applied Probability, Vol. 105. Chapman & Hall/CRC; 2006.
- Cole SR, Chu H, Greenland S. ‘Multiple-imputation for measurement-error correction’ International Journal of Epidemiology. 2006;35(4):1074–81. doi: 10.1093/ije/dyl097. [PubMed] [Cross Ref]
- Day NE, McKeown N, Wong MY, Welch A, Bingham S. ‘Epidemiological assessment of diet: a comparison of a 7-day diary with a food frequency questionnaire using urinary markers of nitrogen, potassium and sodium’ International Journal of Epidemiology. 2001;30:309–317. doi: 10.1093/ije/30.2.309. [PubMed] [Cross Ref]
- Fuller WA. ‘Measurement Error Models’ Wiley Series in Probability and Mathematical Statistics. Wiley; 1987.
- Gustafson P, Le DN. ‘Comparing the effects of continuous and discrete covariate mismeasurement, with emphasis on the dichotomization of mismeasured predictors’ Biometrics. 2002;58:878–887. doi: 10.1111/j.0006-341X.2002.00878.x. [PubMed] [Cross Ref]
- Gustafson P. ‘Measurement Error and Misclassification in Statistics and Epidemiology’ Chapman & Hall/CRC; 2003.
- Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano RP, Bingham S, Schoeller DA, Schatzkin A, Carroll R. ‘Structure of Dietary Measurement Error: Results of the OPEN Biomarker Study’ American Journal of Epidemiology. 2003;158:14–21. doi: 10.1093/aje/kwg091. [PubMed] [Cross Ref]
- Messer K, Natarajan L. ‘Maximum likelihood, multiple imputation, and regression calibration for measurement error adjustment’ Statistics in Medicine. 2008;27:6332–6350. doi: 10.1002/sim.3458. [PMC free article] [PubMed] [Cross Ref]
- Natarajan L, Flatt SW, Sun X, Gamst AC, Major JM, Rock CL, Al-Delaimy W, Thomson CA, Newman VA, Pierce JP. Women’s Healthy Eating and Living Study Group, ‘Validity and systematic error in measuring carotenoid consumption with dietary self-report instruments’ American Journal of Epidemiology. 2006;163(8):770–8. doi: 10.1093/aje/kwj082. [PubMed] [Cross Ref]
- Pierce JP, Faerber S, Wright F, Rock CL, Newman V, Flatt SW, Kealey S, Jones VE, Caan BJ, Gold EB, Haan M, Hollenbach KA, Jones L, Marshall JR, Ritenbaugh C, Stefanick M, Thomson C, Wasserman L, Natarajan L, Thomas RG, Gilpin EA. ‘A randomized trial of the effect of a plant based dietary pattern on breast cancer recurrence: The Women’s Healthy Eating and Living (WHEL) Study’ Controlled Clinical Trials. 2002;23:728–756. doi: 10.1016/S0197-2456(02)00241-6. [PubMed] [Cross Ref]
- Richardson S, Gilks WR. ‘Conditional independence models for epidemiological studies with covariate measurement error’ Statistics in Medicine. 1993;12:1703–1722. doi: 10.1002/sim.4780121806. [PubMed] [Cross Ref]
- Rock CL. ‘Carotenoids: biology and treatment’ Pharmacology & Therapeutics. 1997;75:185–197. doi: 10.1016/S0163-7258(97)00054-5. [PubMed] [Cross Ref]
- Royston P, Altman DG, Sauerbrei W. ‘Dichotomizing continuous predictors in multiple regression: a bad idea’ Statistics in Medicine. 2006;25:127–141. doi: 10.1002/sim.2331. [PubMed] [Cross Ref]
- Spiegelman D, Rosner B, Logan R. ‘Estimation and inference for logistic regression with covariate misclassification and measurement error, in main study/validation designs’ Journal of the American Statistical Association. 2000;95:51–61. doi: 10.2307/2669522. [Cross Ref]
- Stefanski LA, Cook J. ‘Simulation extrapolation: the measurement error jackknife’ Journal of the American Statistical Association. 1995;90:1247–1256. doi: 10.2307/2291515. [Cross Ref]

Articles from The International Journal of Biostatistics are provided here courtesy of **Berkeley Electronic Press**
