

Lifetime Data Anal. Author manuscript; available in PMC 2011 January 1.

Published in final edited form as:

Published online 2009 September 16. doi: 10.1007/s10985-009-9124-6

PMCID: PMC2809827

NIHMSID: NIHMS145713

Channing Laboratory, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA

Bernard Rosner: stbar@channing.harvard.edu

The publisher's final edited version of this article is available at Lifetime Data Anal


The use of the cumulative average model to investigate the association between disease incidence and repeated measurements of exposures in medical follow-up studies can be dated back to the 1960s (Kahn and Dawber, J Chron Dis 19:611–620, 1966). This model takes advantage of all prior data and thus should provide a statistically more powerful test of disease-exposure associations. Measurement error in covariates is common for medical follow-up studies. Many methods have been proposed to correct for measurement error. To the best of our knowledge, no methods have been proposed yet to correct for measurement error in the cumulative average model. In this article, we propose a regression calibration approach to correct relative risk estimates for measurement error. The approach is illustrated with data from the Nurses’ Health Study relating incident breast cancer between 1980 and 2002 to time-dependent measures of calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.

Kahn and Dawber (1966) used a cumulative average model to make use of all known cholesterol values prior to the occurrence of an event to investigate the development of coronary heart disease over a short time period in relation to sequential biennial measures of cholesterol in the Framingham Heart Study. The cumulative average model has also been studied by several researchers (e.g., Wu and Ware 1979; Cupples et al. 1988; D’Agostino et al. 1990) using data from the Framingham Heart Study. Several authors (e.g., Hu et al. 1997; Kim et al. 2006) applied the cumulative average model to the Nurses’ Health Study.

The cumulative average model takes advantage of all prior data and thus should provide a statistically more powerful test of an association of cumulative exposure (Willett 1998, p. 333). However, measurement error in covariates is not unusual for medical follow-up studies. For example, nutritional intake measured with instruments such as a food frequency questionnaire (FFQ) or a 24-h recall often has substantial systematic measurement error, in addition to random measurement error. Unlike random measurement error, systematic measurement error cannot be reduced by obtaining replicate measures of diet at different points in time and averaging the responses. Ignoring systematic measurement error could cause biased inference. Hence correction for systematic measurement error is needed for the cumulative average model.

Many methods have been proposed to correct for systematic measurement error (see Carroll et al. 2006, for an overview). However, most existing methods (e.g., Rosner et al. 1989, 1990; Spiegelman et al. 1997) considered measurement error correction based on covariates measured at only one time point. Recently, several authors, such as Rosner et al. (2008) and Yi (2008), proposed methods for longitudinal data. To the best of our knowledge, no measurement error correction methods have been proposed for the cumulative average model. In this article, we propose a regression calibration approach based on the method described in Rosner et al. (1990) to correct the cumulative average relative risk estimates for measurement error. The approach is illustrated with data from the Nurses’ Health Study relating incident breast cancer between 1980 and 2002 to calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.

The remainder of the article is structured as follows. In Sect. 2, we briefly describe the Nurses’ Health Study data sets to motivate the cumulative average model. In Sect. 3, we introduce the cumulative average model. In Sect. 4, we propose the measurement error correction method for the cumulative average model for the case where both a main study data set and a validation study data set are available at each time point. We conduct a simulation study to evaluate the performance of the measurement error correction in this setting. In Sect. 5, we describe how to correct for measurement error based on the cumulative average model when validation study data are not available at some time points. The analysis results for the Nurses’ Health Study (NHS) data based on the proposed measurement error correction method are presented in Sect. 6. A discussion is given in Sect. 7. Technical details are given in the Web Appendices.

The NHS is a continuing epidemiologic cohort study established at the Channing Laboratory, Brigham and Women’s Hospital in 1976. Initially, the NHS investigated the potential long-term consequences of the use of oral contraceptives. Around 120,000 married registered nurses (aged 30–54 in 1976) were selected to be followed prospectively. In 1980, diet information was collected via a food frequency questionnaire (FFQ), which measures how often an individual ate particular types of food in the previous year. Subsequent diet information was collected in 1984, 1986, and every four years thereafter.

It is well known that nutrient intakes measured by FFQ have substantial measurement error. To correct for measurement error, two validation/calibration studies were conducted in 1980 and 1986 for two subsets of participants (Willett et al. 1985, 1988). During 1980, 173 participants were asked to complete 4 weeks of diet records (DR), spaced approximately 3 months apart over 1 year. At the end of the year, a second FFQ was administered which could be directly compared with the average of the 28 days of DR. The 1986 validation study consisted of 190 participants, 92 of whom were also in the 1980 validation study.

We are interested in assessing the possible association between the incidence of breast cancer and calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.

Calorie-adjusted saturated fat intake is defined as the percent of energy due to saturated fat (= 100 × 9 × saturated fat intake (g) / total caloric intake (kcal)). Age in 1980 is grouped into five categories: 34–39, 40–44, 45–49, 50–54, and 55+. Calorie-adjusted saturated fat, total caloric intake, and alcohol intake are measured with error, while age is measured without error. FFQs were collected from the main NHS study in 1980, 1984, 1986, 1990, 1994, and 1998.
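As a small illustration of the definition above, the following sketch computes the percent of energy from saturated fat. The function name and the example intake values are hypothetical and are not taken from the NHS data; the factor 9 is the kcal per gram of fat used in the formula.

```python
def percent_energy_from_saturated_fat(sat_fat_g: float, total_kcal: float) -> float:
    """Percent of total energy supplied by saturated fat (9 kcal per gram of fat)."""
    if total_kcal <= 0:
        raise ValueError("total_kcal must be positive")
    return 100.0 * 9.0 * sat_fat_g / total_kcal


# Hypothetical intake: 25 g of saturated fat in a 1,800 kcal diet.
example = percent_energy_from_saturated_fat(25.0, 1800.0)
print(example)
```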

Some descriptive statistics concerning the main and validation study populations are given in Table 1.

For the main studies, the average % calories due to saturated fat decreased over time. The total caloric intake increased from 1980 to 1984, and then remained about the same from 1984 to 1998. This is probably because the 1980 questionnaire had 61 food items while questionnaires from 1984 on had over 120 food items. Mean alcohol intake decreased from 1980 to 1998, except for a slight increase from 1980 to 1984.

Pearson correlations between FFQ intake at repeated surveys over 18 years are presented in Table 2.

As expected, the correlations between intakes of the same nutrient over time decrease as the difference between time points increases, although the rate of decrease is small. There is a low correlation (0.05–0.16) between calorie-adjusted saturated fat intake and total caloric intake when measured at the same time. The cross correlation between calorie-adjusted saturated fat intake at time *t*_{1} and total caloric intake at time *t*_{2} ranges from −0.01 to 0.08. The correlation between calorie-adjusted saturated fat intake and alcohol intake is slightly negative and the correlation between total caloric intake and alcohol intake is slightly positive. The correlation between the same nutrient measured in 1980 and that measured in 1984 is smaller than the correlation between the same nutrient measured over any other two consecutive surveys. This is partly because the FFQ included around 60 more food items starting from 1984, resulting in more stable estimates of nutrient intake over time.

We used four different approaches to assess the possible association between the incidence of breast cancer and calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age. All four approaches make use of the Cox proportional hazards regression model.

With the first approach, we related exposures in 1998 to breast cancer incidence from 1998 to 2002 (1181 cases). With the second approach, we related the average exposure from 1980 to 1998 to breast cancer incidence from 1998 to 2002 (1351 cases). With the third approach, we related exposure in 1980 to breast cancer incidence from 1980 to 2002 (5672 cases). With the fourth approach, we first related the average exposure over each of six cumulative time periods (1980, 1980–1984, 1980–1986, 1980–1990, 1980–1994, 1980–1998) to breast cancer incidence over the corresponding succeeding time period (1980–1984, 1984–1986, 1986–1990, 1990–1994, 1994–1998, 1998–2002), with a total of 5672 cases over 1980–2002, and then pooled the six estimates.
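The cumulative averaging used in the fourth approach can be sketched as follows. This is a minimal illustration that assumes every subject returned every questionnaire (no missing surveys); the function name and the intake values are hypothetical.

```python
import numpy as np


def cumulative_average(exposure: np.ndarray) -> np.ndarray:
    """Row i, column j holds the mean of subject i's exposure over surveys 1..j+1."""
    n_surveys = exposure.shape[1]
    counts = np.arange(1, n_surveys + 1)  # number of surveys averaged so far
    return np.cumsum(exposure, axis=1) / counts


# Hypothetical % energy from saturated fat for 2 subjects at 3 successive surveys.
z = np.array([[16.0, 14.0, 12.0],
              [10.0, 12.0, 11.0]])
z_bar = cumulative_average(z)
print(z_bar)
```

Column j of `z_bar` would then be the exposure related to disease incidence in the follow-up interval that starts at survey j+1.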

Compared to the first three approaches, the fourth approach makes full use of all available information and hence can dampen the effects of random errors more effectively. The fourth approach was mentioned in the literature as early as the 1960s (e.g., Kahn and Dawber 1966) and was referred to as the cumulative average model. In this article, we consider the measurement error correction problem for the cumulative average model. We briefly describe this model in the next section.

The idea of the cumulative average model described in Kahn and Dawber (1966) is to first treat each time interval as a mini follow-up study, then to pool observations over all intervals to examine the short-term development of disease (D’Agostino et al. 1990). Wu and Ware (1979) implemented the cumulative average model using pooled logistic regression. Cupples et al. (1988) and Kim et al. (2006) implemented the cumulative average model using Cox proportional hazards regression. In this article, we also use the Cox proportional hazards regression model.

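Let $\mathit{X}_{it}$ be the vector of $k_1$ true exposures and $\mathit{U}_{it}$ the vector of $k_2$ covariates measured without error for the $i$-th subject at time point $t$, and let ${\overline{\mathit{X}}}_{it}$ and ${\overline{\mathit{U}}}_{it}$ be the corresponding cumulative averages over time points $1,\dots,t$. The cumulative average model for the $t$-th time interval is

$${h}_{c}(t\mid{\overline{\mathit{X}}}_{it},{\overline{\mathit{U}}}_{it})={h}_{c0}(t)\exp[{\mathit{\beta}}_{c1}^{*}{\overline{\mathit{X}}}_{it}+{\mathit{\beta}}_{c2}^{*}{\overline{\mathit{U}}}_{it}],\phantom{\rule{0.38889em}{0ex}}i=1,\dots ,N,$$

(1)

where ${h}_{c}(t\mid{\overline{\mathit{X}}}_{it},{\overline{\mathit{U}}}_{it})$ is the hazard function given the cumulative average exposures and ${h}_{c0}(t)$ is the baseline hazard function.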

Denote ${\widehat{\mathit{\beta}}}_{c}^{*(t)}$ as the parameter estimates for Model (1). A *pooled* estimate of the $\ell$-th regression coefficient up to time *T* for the cumulative average model is

$${\widehat{\beta}}_{c,\ell,T}^{*}=\sum _{t=1}^{T}{w}_{\ell}^{(t)}{\widehat{\beta}}_{c,\ell}^{*(t)}={({\mathit{w}}_{\ell,T})}^{\prime}{\widehat{\mathit{\gamma}}}_{\ell,T},$$

(2)

where ${\widehat{\beta}}_{c,\ell}^{*(t)}$ is the $\ell$-th element of the 1 × (*k*_{1} + *k*_{2}) vector ${\widehat{\mathit{\beta}}}_{c}^{*(t)}$, $\ell$ = 1, …, *k*, *k* = *k*_{1} + *k*_{2}, ${\widehat{\mathit{\gamma}}}_{\ell,T}={({\widehat{\beta}}_{c,\ell}^{*(1)},\dots ,{\widehat{\beta}}_{c,\ell}^{*(T)})}^{\prime}$ has covariance matrix ${\mathbf{\Omega}}_{\ell,T}$, and the *T* × 1 weight vector is ${\mathit{w}}_{\ell,T}={({w}_{\ell}^{(1)},{w}_{\ell}^{(2)},\dots ,{w}_{\ell}^{(T)})}^{\prime}={\mathbf{\Omega}}_{\ell,T}^{-1}{\mathbf{1}}_{T}/({\mathbf{1}}_{T}^{\prime}{\mathbf{\Omega}}_{\ell,T}^{-1}{\mathbf{1}}_{T})$, where ${\mathbf{1}}_{T}$ is the *T* × 1 vector of ones.

The Lagrange multiplier method can be used to derive the weight vector ${\mathit{w}}_{\ell,T}$ which minimizes the variance of the overall estimator ${\sum}_{t=1}^{T}{v}^{(t)}{\widehat{\beta}}_{c,\ell}^{*(t)}$, subject to the conditions ${\sum}_{t=1}^{T}{v}^{(t)}=1$ and ${v}^{(t)}>0$, *t* = 1, …, *T* (see Web Appendix A). The variance of ${\widehat{\beta}}_{c,\ell,T}^{*}$ given *W*^{*} is

$$\text{Var}({\widehat{\beta}}_{c,\ell,T}^{*}\mid{\mathit{W}}^{*})=1/[{\mathbf{1}}_{T}^{\prime}{({\mathbf{\Omega}}_{\ell,T})}^{-1}{\mathbf{1}}_{T}].$$

(3)
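The pooling step can be illustrated numerically. The sketch below uses the standard Lagrange-multiplier solution for minimizing $w'\Omega w$ subject to $\mathbf{1}'w = 1$, namely $w = \Omega^{-1}\mathbf{1}/(\mathbf{1}'\Omega^{-1}\mathbf{1})$, whose variance $1/(\mathbf{1}'\Omega^{-1}\mathbf{1})$ has the form of Eq. 3. It does not enforce the positivity condition on the weights, and the interval-specific estimates and their covariance matrix are hypothetical values chosen only for illustration.

```python
import numpy as np


def min_variance_weights(omega):
    """Weights minimizing w' Omega w subject to sum(w) == 1, and the resulting variance."""
    omega = np.asarray(omega, dtype=float)
    ones = np.ones(omega.shape[0])
    omega_inv_ones = np.linalg.solve(omega, ones)  # Omega^{-1} 1 without forming the inverse
    denom = ones @ omega_inv_ones                  # 1' Omega^{-1} 1
    return omega_inv_ones / denom, 1.0 / denom


# Hypothetical interval-specific estimates of one coefficient and their covariance.
gamma_hat = np.array([0.10, 0.20])
omega = np.diag([0.04, 0.01])

w, var_pooled = min_variance_weights(omega)
beta_pooled = w @ gamma_hat
print(w, beta_pooled, var_pooled)
```

With a diagonal covariance matrix this reduces to ordinary inverse-variance weighting, so the more precise interval receives the larger weight.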

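It is usually impossible or expensive to directly measure $\mathit{X}_{it}$ on a large number of subjects. Instead, we usually observe surrogate exposures $\mathit{Z}_{it}$, which are measured with error. The cumulative average model based on the surrogate exposures is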

$${h}_{c}(t{\overline{\mathit{Z}}}_{it},{\overline{\mathit{U}}}_{it})={h}_{c0}(t)exp[{\mathit{\beta}}_{c1}{\overline{\mathit{Z}}}_{it}+{\mathit{\beta}}_{c2}{\overline{\mathit{U}}}_{it}],\phantom{\rule{0.38889em}{0ex}}i=1,\dots ,N,$$

(4)

where ${\overline{\mathit{Z}}}_{it}={\sum}_{m=1}^{t}{\mathit{Z}}_{im}/{n}_{t}$. Denote ${\widehat{\mathit{\beta}}}_{c}^{(t)}=\left({\widehat{\mathit{\beta}}}_{c1}^{(t)},{\widehat{\mathit{\beta}}}_{c2}^{(t)}\right)$ as the parameter estimates for Model (4).

The parameters ${\mathit{\beta}}_{c}^{(t)}=\left({\mathit{\beta}}_{c1}^{(t)},{\mathit{\beta}}_{c2}^{(t)}\right)$ for the surrogate regression model (4) might be attenuated relative to the true regression parameters ${\mathit{\beta}}_{c}^{*(t)}=\left({\mathit{\beta}}_{c1}^{*(t)},{\mathit{\beta}}_{c2}^{*(t)}\right)$ of Model (1) due to measurement error (see the simulation study below). Many methods (e.g. Prentice 1982; Spiegelman et al. 1997; Li and Lin 2003; Yi and He 2006; Carroll et al. 2006; Yi and Lawless 2007) have been proposed to correct relative risk estimates for measurement error in a survival data analysis setting. However, to the best of our knowledge, no methods have been proposed for measurement error correction of cumulative average exposures.

We apply the regression calibration method described in Rosner et al. (1990) to handle the cumulative average exposure case. This method first models the relation between true exposure and surrogate exposure, usually based on a validation study. It then replaces the true exposures by the conditional expectation of the true exposures given the surrogate exposures in the model relating the true exposures to disease. Next, the regression coefficients of the revised model are compared with those of the model relating the surrogate exposures to disease. The comparison finally leads to the corrected regression coefficients. The regression calibration method assumes that a nondifferential error mechanism holds (i.e. $Pr({\overline{\mathit{Z}}}\mid{\overline{\mathit{X}}},D)=Pr({\overline{\mathit{Z}}}\mid{\overline{\mathit{X}}})$, where *D* = 1 means diseased and *D* = 0 means not diseased). In our example, it is reasonable to assume a nondifferential error mechanism since both *Z* values and *X* values are observed before any breast cancer occurs.

We assume a multivariate regression relating ${\overline{\mathit{X}}}_{it}$ to ${\overline{\mathit{Z}}}_{it}$ and ${\overline{\mathit{U}}}_{it}$:

$${\overline{\mathbf{X}}}_{it}={\mathit{\alpha}}_{c}+{\mathbf{\Lambda}}_{c1}{\overline{\mathit{Z}}}_{it}+{\mathbf{\Lambda}}_{c2}{\overline{\mathit{U}}}_{it}+{\mathit{e}}_{c,it},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\dots ,{n}_{t},$$

(5)

where ${\mathbf{\Lambda}}_{c1}$ and ${\mathbf{\Lambda}}_{c2}$ are *k*_{1} × *k*_{1} and *k*_{1} × *k*_{2} matrices, and ${\mathit{e}}_{c,it}$ is multivariate normally distributed with a zero mean vector and covariance matrix ${\mathbf{\Sigma}}_{ce}$.

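By substituting the expected values $E[{\overline{\mathit{X}}}_{it}\mid{\overline{\mathit{Z}}}_{it},{\overline{\mathit{U}}}_{it}]$ from (5) for ${\overline{\mathit{X}}}_{it}$ in Model (1) and comparing the result with Model (4), we obtain

$${\mathit{\beta}}_{c}^{*}\approx {\mathit{\beta}}_{c}{\mathbf{\Lambda}}_{c}^{-1},$$

(6)

where ${\mathit{\beta}}_{c}^{*}=({\mathit{\beta}}_{c1}^{*},{\mathit{\beta}}_{c2}^{*})$, ${\mathit{\beta}}_{c}=({\mathit{\beta}}_{c1},{\mathit{\beta}}_{c2})$,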

$${\mathbf{\Lambda}}_{c}=\left(\begin{array}{cc}{\mathbf{\Lambda}}_{c1}& {\mathbf{\Lambda}}_{c2}\\ {\mathbf{0}}_{{k}_{2}\times {k}_{1}}& {\mathit{I}}_{{k}_{2}\times {k}_{2}}\end{array}\right)$$

(7)

and *k* = *k*_{1} + *k*_{2}.
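To make the correction step concrete: substituting (5) into the true-exposure model gives ${\mathit{\beta}}_{c1}={\mathit{\beta}}_{c1}^{*}{\mathbf{\Lambda}}_{c1}$ and ${\mathit{\beta}}_{c2}={\mathit{\beta}}_{c1}^{*}{\mathbf{\Lambda}}_{c2}+{\mathit{\beta}}_{c2}^{*}$, so the corrected coefficients follow from the surrogate coefficients by right-multiplying with the inverse of the block matrix in (7). The sketch below assumes this reading of Eq. 6; the function names and all numerical values are hypothetical.

```python
import numpy as np


def build_lambda_c(lambda_c1, lambda_c2):
    """Assemble the k x k block matrix of Eq. 7 from Lambda_c1 (k1 x k1) and Lambda_c2 (k1 x k2)."""
    lambda_c1 = np.atleast_2d(np.asarray(lambda_c1, dtype=float))
    lambda_c2 = np.atleast_2d(np.asarray(lambda_c2, dtype=float))
    k1, k2 = lambda_c2.shape
    return np.block([[lambda_c1, lambda_c2],
                     [np.zeros((k2, k1)), np.eye(k2)]])


def correct_coefficients(beta_c, lambda_c):
    """Return beta_star with beta_star @ lambda_c == beta_c (row-vector convention)."""
    # Solving Lambda' x = beta' avoids forming the explicit inverse.
    return np.linalg.solve(lambda_c.T, np.asarray(beta_c, dtype=float))


# Hypothetical case: one exposure measured with error (k1 = 1), one error-free covariate (k2 = 1).
lam = build_lambda_c([[0.5]], [[0.1]])
beta_surrogate = np.array([0.20, 0.05])
beta_corrected = correct_coefficients(beta_surrogate, lam)
print(beta_corrected)
```

In this toy case the surrogate coefficient 0.20 is de-attenuated to 0.40, and the coefficient of the error-free covariate is also adjusted, which matches the paper's observation that correction affects covariates measured without error as well.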

Using the results from Web Appendix B, we have

$$\widehat{\text{Cov}}({\widehat{\beta}}_{c,{j}_{1}}^{*},{\widehat{\beta}}_{c,{j}_{2}}^{*})={\left({\mathit{A}}_{c}^{\prime}{\widehat{\mathbf{\Sigma}}}_{{\widehat{\mathit{\beta}}}_{c}}{\mathit{A}}_{c}\right)}_{{j}_{1},{j}_{2}}+{\widehat{\mathit{\beta}}}_{c}{\widehat{\mathbf{\Sigma}}}_{{\mathit{A}}_{c,{j}_{1},{j}_{2}}}{\widehat{\mathit{\beta}}}_{c}^{\prime},$$

(8)

where ${\mathit{A}}_{c}={\mathbf{\Lambda}}_{c}^{-1}$,

$${\widehat{\mathbf{\Sigma}}}_{{\mathit{A}}_{c,{j}_{1},{j}_{2}}}=w\cdot [{\mathit{A}}_{c1}{\widehat{\mathbf{\Sigma}}}_{ce}{\mathit{A}}_{c1}^{\prime}]$$

(9)

is the *k* × *k* variance-covariance matrix relating the elements in the *j*_{1}-th and *j*_{2}-th columns of ** A_{c}** (denoted by

$${\overline{\mathit{W}}}_{t}=\left(\begin{array}{ccc}1& {\overline{\mathit{Z}}}_{1t}^{\prime}& {\overline{\mathit{U}}}_{1t}^{\prime}\\ \vdots & \vdots & \vdots \\ 1& {\overline{\mathit{Z}}}_{{n}_{t}t}^{\prime}& {\overline{\mathit{U}}}_{{n}_{t}t}^{\prime}\end{array}\right),$$

where ${\overline{\mathit{W}}}_{t}$ is obtained from the validation study of size ${n}_{t}$.

To evaluate the validity of the approximation in Eq. 6, we conducted a simulation study based on the 1980 and 1986 surveys. In the simulation study, we consider three exposures: calorie-adjusted saturated fat intake (denoted as *Z*_{1} measured by FFQ, *X*_{1} measured by DR), total caloric intake (denoted as *Z*_{2} measured by FFQ, *X*_{2} measured by DR), and baseline age (regarded as a continuous variable and denoted by *U*). We generated 1,000 simulated data sets. Each data set consists of a main study and a corresponding validation study at 2 time points. Each main study contains 10,000 subjects. Each validation study contains 200 subjects.

To take into account (a) the correlations among different exposures at the same time point, (b) the correlations of the same exposure measured at different times, and (c) the cross correlations among different exposures measured at different times, we assume that the random vector ** W** = (

For each data set, we first generated random multivariate normal vectors *W*_{1}, …, *W*_{10,000} in the main study. We then computed the average nutrient intakes _{1,}* _{k}* = (

We next generated survival times based on the Weibull distribution with hazard function *h*(*t*|*δ*, *η*(**, ***U*)) = *δt ^{δ}*

Once we had generated the survival times, we chose the 10% quantile of the survival times as the censoring time so that the specified survival rate of 90% was achieved.

Finally, we applied the proposed measurement error correction method for the cumulative average model to each simulated data set.
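The generation and censoring of survival times can be sketched as follows. The exact Weibull parametrization used in the simulation study is not recoverable here, so this sketch assumes the common proportional hazards form $h(t)=\delta t^{\delta-1}\exp(\eta)$, for which $S(t)=\exp(-t^{\delta}e^{\eta})$; the shape parameter, the distribution of the linear predictor `eta`, and the seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                           # main-study size used in the simulation study
delta = 1.5                          # hypothetical Weibull shape parameter
eta = rng.normal(0.0, 0.3, size=n)   # hypothetical linear predictor (log relative hazard)

# Inverse-transform sampling under the assumed hazard delta * t**(delta - 1) * exp(eta):
# S(t) = exp(-t**delta * exp(eta))  =>  t = (-log(S) / exp(eta)) ** (1 / delta).
u = rng.uniform(size=n)              # values in [0, 1), so 1 - u lies in (0, 1]
t_event = (-np.log(1.0 - u) / np.exp(eta)) ** (1.0 / delta)

# Administrative censoring at the 10% quantile: about 10% of subjects have an event
# and about 90% are censored (still event-free at the end of follow-up).
censor_time = np.quantile(t_event, 0.10)
observed_time = np.minimum(t_event, censor_time)
event = t_event <= censor_time

print(event.mean())
```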

The percent biases and coverage probabilities of un-corrected and corrected estimates are shown in Table 3. The full simulation results are given in Web Appendix D.

The simulation results show that (i) the corrected estimates of the HRs have small percent bias and appropriate coverage both for HR = 1.0 and HR ≠ 1.0; (ii) if the true HR is 1.0, the un-corrected estimates have small percent bias and appropriate coverage; (iii) if the true HR is not equal to 1.0, the un-corrected estimates for the exposure (*Z*) subject to measurement error have a large percent bias towards the null and low coverage; (iv) if the true HR is not equal to 1.0, the un-corrected estimates for the exposure (*U*) without measurement error have small percent bias but low coverage; and (v) for the exposure (*U*) without measurement error, the corrected estimates have smaller percent bias than the un-corrected estimates. In summary, the corrected estimates in Eq. 6 have small bias and appropriate coverage, while the un-corrected estimates tend to be attenuated towards 1.0, resulting in large bias and low coverage.
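The two summary measures can be computed along the following lines. The definitions are assumptions, since the precise formulas are not given in the text: percent bias is taken as the relative deviation of the mean estimate from a nonzero true value, and coverage as the fraction of nominal 95% Wald intervals containing the true value. The estimates and standard errors below are hypothetical.

```python
import numpy as np


def percent_bias(estimates, truth):
    """100 * (mean estimate - truth) / truth; requires truth != 0."""
    return 100.0 * (np.mean(estimates) - truth) / truth


def coverage(estimates, std_errors, truth, z=1.96):
    """Fraction of intervals estimate +/- z * se that contain the true value."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return np.mean((lower <= truth) & (truth <= upper))


# Hypothetical attenuated estimates from three simulated data sets, true value 1.5.
est = np.array([1.2, 1.3, 1.4])
se = np.array([0.1, 0.1, 0.1])
pb = percent_bias(est, 1.5)
cov = coverage(est, se, 1.5)
print(pb, cov)
```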

We could not apply the proposed method directly to the NHS data sets because validation studies are not available at some time points. Hence, we cannot estimate ${\mathbf{\Lambda}}_{c}$ directly.

We propose an indirect method to overcome this difficulty. From (5), we have

$$\begin{array}{l}\text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathbf{X}}}_{it},{\overline{\mathit{Z}}}_{it})={\mathbf{\Lambda}}_{c1}\text{Var}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{Z}}}_{it})+{\mathbf{\Lambda}}_{c2}\text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{U}}}_{it},{\overline{\mathit{Z}}}_{it})\\ \text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathbf{X}}}_{it},{\overline{\mathit{U}}}_{it})={\mathbf{\Lambda}}_{c1}\text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{Z}}}_{it},{\overline{\mathit{U}}}_{it})+{\mathbf{\Lambda}}_{c2}\text{Var}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{U}}}_{it}).\end{array}$$

(10)

Thus, from (10)

$$({\mathbf{\Lambda}}_{c1},{\mathbf{\Lambda}}_{c2})={\mathit{B}}_{1}{\mathit{B}}_{2}^{-1},$$

(11)

where *B*_{1} and *B*_{2} are *k*_{1} × *k* and *k* × *k* matrices given by

$$\begin{array}{l}{\mathit{B}}_{1}=[\text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathbf{X}}}_{it},{\overline{\mathit{Z}}}_{it}),\text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathbf{X}}}_{it},{\overline{\mathit{U}}}_{it})]\\ {\mathit{B}}_{2}=\left[\begin{array}{cc}\text{Var}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{Z}}}_{it})& \text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{Z}}}_{it},{\overline{\mathit{U}}}_{it})\\ \text{Cov}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{U}}}_{it},{\overline{\mathit{Z}}}_{it})& \text{Var}\phantom{\rule{0.16667em}{0ex}}({\overline{\mathit{U}}}_{it})\end{array}\right].\end{array}$$

We can estimate *B*_{2} from the main study where
$\widehat{\text{Var}}({\overline{\mathit{Z}}}_{it})={\scriptstyle \frac{1}{{t}^{2}}}{\sum}_{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathit{Z}}_{i{t}_{1}},{\mathit{Z}}_{i{t}_{2}}),\widehat{\text{Var}}({\overline{\mathit{U}}}_{it})={\scriptstyle \frac{1}{{t}^{2}}}{\sum}_{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathit{U}}_{i{t}_{1}},{\mathit{U}}_{i{t}_{2}}),\widehat{\text{Cov}}({\overline{\mathit{U}}}_{it},{\overline{\mathit{Z}}}_{it})={\scriptstyle \frac{1}{{t}^{2}}}{\sum}_{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathit{U}}_{i{t}_{1}},{\mathit{Z}}_{i{t}_{2}})$.
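As an illustration of Eq. 11, the sketch below recovers $({\mathbf{\Lambda}}_{c1},{\mathbf{\Lambda}}_{c2})$ from the two moment matrices. The function name and the numerical values of $B_1$ and $B_2$ are hypothetical; in practice they would come from the estimated covariances described above.

```python
import numpy as np


def lambda_from_moments(b1, b2):
    """Return (Lambda_c1, Lambda_c2) = B1 @ inv(B2) as a k1 x k matrix."""
    b1 = np.atleast_2d(np.asarray(b1, dtype=float))
    b2 = np.atleast_2d(np.asarray(b2, dtype=float))
    # X @ B2 = B1  <=>  B2' @ X' = B1'; solving is more stable than inverting B2.
    return np.linalg.solve(b2.T, b1.T).T


# Hypothetical moments for k1 = 1 exposure with error and k2 = 1 error-free covariate.
b2 = np.array([[4.0, 1.0],    # Var(Z_bar),        Cov(Z_bar, U_bar)
               [1.0, 2.0]])   # Cov(U_bar, Z_bar), Var(U_bar)
b1 = np.array([[2.1, 0.7]])   # Cov(X_bar, Z_bar), Cov(X_bar, U_bar)

lam_12 = lambda_from_moments(b1, b2)
print(lam_12)
```

The values of `b1` were constructed so that the result is (0.5, 0.1), which makes the check `lam_12 @ b2 == b1` easy to verify by hand.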

To estimate *B*_{1}, we assume a stationary covariance structure, whereby the covariance between the true exposures ${\mathbf{X}}_{{t}_{1}}$ at time point *t*_{1} and the surrogate exposures ${\mathit{Z}}_{{t}_{2}}$ at time point *t*_{2} (or the covariates measured without error ${\mathit{U}}_{{t}_{2}}$ at time point *t*_{2}) does not depend on the starting time point, but only on the duration |*t*_{2} − *t*_{1}|, i.e., $\text{Cov}({\mathbf{X}}_{i{t}_{1}},{\mathit{Z}}_{i{t}_{2}})=\text{Cov}({\mathbf{X}}_{i1},{\mathit{Z}}_{i,|{t}_{1}-{t}_{2}|+1})$ and $\text{Cov}({\mathbf{X}}_{i{t}_{1}},{\mathit{U}}_{i{t}_{2}})=\text{Cov}({\mathbf{X}}_{i1},{\mathit{U}}_{i,|{t}_{1}-{t}_{2}|+1})$. The assumption is reasonable based on the NHS data (see Web Appendix E). Hence,

$$\begin{array}{l}\widehat{\text{Cov}}({\overline{\mathbf{X}}}_{it},{\overline{\mathit{Z}}}_{it})=\frac{1}{{t}^{2}}\sum _{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathbf{X}}_{i{t}_{1}},{\mathit{Z}}_{i{t}_{2}})=\frac{1}{{t}^{2}}\sum _{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathbf{X}}_{i1},{\mathit{Z}}_{i,|{t}_{1}-{t}_{2}|+1}),\\ \widehat{\text{Cov}}({\overline{\mathbf{X}}}_{it},{\overline{\mathit{U}}}_{it})=\frac{1}{{t}^{2}}\sum _{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathbf{X}}_{i{t}_{1}},{\mathit{U}}_{i{t}_{2}})=\frac{1}{{t}^{2}}\sum _{{t}_{1},{t}_{2}=1}^{t}\widehat{\text{Cov}}({\mathbf{X}}_{i1},{\mathit{U}}_{i,|{t}_{1}-{t}_{2}|+1}).\end{array}$$

(12)
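The double sum in Eq. 12 only involves covariances indexed by the lag $|t_1-t_2|$, so it can be evaluated from a short table of lag-specific covariances. The sketch below does this for a single pair of scalar variables; the function name and the covariance values are hypothetical, and lags are counted in survey indices as in the stationarity assumption above.

```python
import numpy as np


def averaged_cov_under_stationarity(lag_cov, t):
    """(1 / t^2) * sum over t1, t2 = 1..t of lag_cov[|t1 - t2|].

    lag_cov[d] stands for Cov(X_1, Z_{d+1}), the covariance at lag d.
    """
    lag_cov = np.asarray(lag_cov, dtype=float)
    if t > lag_cov.shape[0]:
        raise ValueError("need a covariance for every lag 0..t-1")
    idx = np.arange(t)
    lags = np.abs(idx[:, None] - idx[None, :])  # t x t matrix of |t1 - t2|
    return lag_cov[lags].sum() / t**2


# Hypothetical covariances between true and surrogate intake at lags 0, 1, 2.
lag_cov = [1.0, 0.8, 0.6]
cov_bar_3 = averaged_cov_under_stationarity(lag_cov, t=3)
print(cov_bar_3)
```

For `t = 3` the lag matrix contains three zeros, four ones and two twos, so the result is (3·1.0 + 4·0.8 + 2·0.6)/9.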

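Thus, we can estimate ${\mathit{\beta}}_{c}$ from (4) and ${\mathbf{\Lambda}}_{c}$ from (11) and (12), and obtain the corrected estimate

$${\widehat{\mathit{\beta}}}_{c}^{*}={\widehat{\mathit{\beta}}}_{c}{\widehat{\mathbf{\Lambda}}}_{c}^{-1}.$$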

(13)

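Equations 8 and 9 indicate that the calculation of the covariance matrix $\text{Cov}\phantom{\rule{0.16667em}{0ex}}({\widehat{\mathit{\beta}}}_{c}^{*})$ involves the estimation of ${\mathbf{\Sigma}}_{ce}$, i.e., the conditional covariance of ${\overline{\mathit{X}}}_{t}$ given ${\overline{\mathit{W}}}_{t}$, whose (*r*, *s*)-th element is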

$$\text{Cov}({\overline{X}}_{t,r},{\overline{X}}_{t,s}\mid{\overline{\mathit{W}}}_{t})=\frac{1}{{t}^{2}}\sum _{{t}_{1},{t}_{2}=1}^{t}sd({X}_{{t}_{1},r}\mid{\mathit{W}}_{t})\,sd({X}_{{t}_{2},s}\mid{\mathit{W}}_{t})\,\mathit{corr}({X}_{{t}_{1},r},{X}_{{t}_{2},s}\mid{\mathit{W}}_{t}),$$

(14)

where *sd* (*X _{t}*

We assume the availability of longitudinal validation study data where *X** _{t}*,

Note that *sd* (*X _{r}* |

$$\begin{array}{l}\widehat{\text{Cov}}({\overline{X}}_{t,r},{\overline{X}}_{t,s}\mid{\overline{\mathit{W}}}_{t})=\frac{1}{{t}^{2}}\sum _{{t}_{1},{t}_{2}=1}^{t}\widehat{sd}({X}_{{t}_{1}r}\mid{\mathit{Z}}_{{t}_{1}},{\mathit{U}}_{{t}_{1}})\,\widehat{sd}({X}_{{t}_{2}s}\mid{\mathit{Z}}_{{t}_{2}},{\mathit{U}}_{{t}_{2}})\,({\widehat{\rho}}_{rs}\mid{\mathit{Z}}_{t},{\mathit{U}}_{t})\\ \phantom{\rule{2em}{0ex}}=\frac{1}{t}\widehat{sd}({X}_{r}\mid\tilde{\mathit{W}})\,\widehat{sd}({X}_{s}\mid\tilde{\mathit{W}})\,[{\widehat{\rho}}_{rs}^{(1)}+(t-1)({\widehat{\rho}}_{rs}\mid\tilde{\mathit{W}})],\end{array}$$

(15)

where
${\widehat{\rho}}_{rs}^{(1)}=\mathit{corr}({X}_{t,r},{X}_{t,s}{\mathit{W}}_{t})$ and * _{rs}* =

$${\widehat{\mathbf{\Sigma}}}_{ce}=\frac{1}{t}\left(\frac{{\widehat{\mathbf{\Sigma}}}^{(1)}+{\widehat{\mathbf{\Sigma}}}^{(3)}}{2}\right)+\frac{t-1}{t}\left(\frac{{\widehat{\mathbf{\Sigma}}}^{(1,3)}+{\widehat{\mathbf{\Sigma}}}^{(3,1)}}{2}\right),$$

(16)

where **Σ**^{(1)} and **Σ**^{(3)} are the covariance matrices for the *k*_{1} variables measured with error at time point 1 (1980) and 3 (1986), respectively, the (*r*, *s*)-th cell of the matrix **Σ**^{(1,3)} is the covariance between the *r*-th variable measured with error at time point 1 and the *s*-th variable measured with error at time point 3, and the (*r*, *s*)-th cell of the matrix **Σ**^{(3,1)} is the covariance between the *r*-th variable measured with error at time point 3 and the *s*-th variable measured with error at time point 1. **Σ**^{(1)} can be estimated based on the 1980 validation study with size *n*_{1} = 173, **Σ**^{(3)} can be estimated based on the 1986 validation study with size *n*_{3} = 190, and **Σ**^{(1,3)} and **Σ**^{(3,1)} can be estimated based on the longitudinal validation study consisting of *n*_{1,3} = 92 subjects in both 1980 and 1986 validation studies.
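Equation 16 combines a within-time-point term that shrinks as 1/*t* with a between-time-point term weighted by (*t* − 1)/*t*. A minimal numerical sketch, with a hypothetical function name and made-up 1 × 1 covariance matrices standing in for the validation-study estimates:

```python
import numpy as np


def sigma_ce(sigma_1, sigma_3, sigma_13, sigma_31, t):
    """Residual covariance of the cumulative average over t time points (form of Eq. 16)."""
    sigma_1, sigma_3, sigma_13, sigma_31 = (
        np.asarray(m, dtype=float) for m in (sigma_1, sigma_3, sigma_13, sigma_31)
    )
    within = (sigma_1 + sigma_3) / 2.0      # average same-time covariance
    between = (sigma_13 + sigma_31) / 2.0   # average cross-time covariance
    return within / t + (t - 1) / t * between


# Hypothetical estimates for a single variable measured with error (k1 = 1).
s1, s3 = [[4.0]], [[6.0]]
s13, s31 = [[2.0]], [[2.0]]

sigma_t1 = sigma_ce(s1, s3, s13, s31, t=1)
sigma_t5 = sigma_ce(s1, s3, s13, s31, t=5)
print(sigma_t1, sigma_t5)
```

With these values the residual variance falls from 5.0 at *t* = 1 to 2.6 at *t* = 5; as *t* grows it approaches the cross-time covariance, which is the part that averaging cannot remove.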

Thus, from (9) and (16) we can estimate ${\mathbf{\Sigma}}_{{\mathit{A}}_{c,{j}_{1},{j}_{2}}}$.

We apply the proposed measurement error correction method described in (6), (8), (9), (15), and (16) to the Nurses’ Health Study data to assess the possible association between the incidence of breast cancer and calorie-adjusted saturated fat intake (time-dependent), controlling for total caloric intake (time-dependent), alcohol intake (time-dependent), and baseline age. We present four different models.

In *Model 1*, we relate exposures in 1998 to breast cancer incidence from 1998 to 2002 (1181 cases) using the measurement error correction method in Rosner et al. (1990). In *Model 2*, we relate cumulative average exposures from 1980 to 1998 to breast cancer incidence from 1998 to 2002 (1351 cases) using Eq. 13 without any pooling and assessing outcome over only one time period (1998 to 2002). In *Model 3*, we relate baseline exposure in 1980 (without updating) to breast cancer incidence from 1980 to 2002 (5672 cases) using the measurement error correction method in Rosner et al. (1990). Finally, in *Model 4*, we relate cumulative average exposure from 1980 to 1998 to breast cancer incidence from 1980 to 2002 (5672 cases) using the pooled estimate in Eq. 2.

The point and interval estimates of hazard ratios (both un-corrected and corrected estimates) for *Model 1*–*Model 4* are shown in Table 4.

Since 1 g and 1 kcal are small increments for nutrient intake, we followed Kim et al. (2006) and have expressed the hazard ratios in increments of 5% energy for calorie-adjusted saturated fat intake, 800 kcal for total caloric intake, and 25 g for alcohol intake, which approximately correspond to increments between the 10th and 90th percentiles of the distribution of FFQ intake.
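Rescaling a hazard ratio to a larger increment amounts to multiplying the log hazard ratio and its confidence limits by the increment before exponentiating. The sketch below shows this for a hypothetical coefficient and standard error; it assumes a positive increment and a Wald-type 95% interval, and none of the numbers come from Table 4.

```python
import math


def hazard_ratio_per_increment(beta, se, increment, z=1.96):
    """Return (HR, lower, upper) for an `increment`-unit change in exposure.

    beta and se are the log hazard ratio per 1 unit and its standard error;
    increment is assumed to be positive.
    """
    hr = math.exp(beta * increment)
    lower = math.exp((beta - z * se) * increment)
    upper = math.exp((beta + z * se) * increment)
    return hr, lower, upper


# Hypothetical log hazard ratio per 1% of energy from saturated fat, rescaled to 5%.
hr, lower, upper = hazard_ratio_per_increment(beta=-0.02, se=0.01, increment=5.0)
print(hr, lower, upper)
```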

For all four models, we observe that (1) the corrected point estimates of the hazard ratios are further away from 1 than the uncorrected estimates (in other words, the point estimates are attenuated if we do not correct for measurement error), (2) the corrected 95% CIs are wider than the uncorrected 95% CIs, (3) total caloric intake has no significant effect on the development of breast cancer, and (4) alcohol intake and age in 1980 have a positive association with the incidence of breast cancer. These four observations are consistent with previous results (e.g. Rosner et al. 1989, 1990; Spiegelman et al. 1997). The observation that calorie-adjusted saturated fat intake has a slightly protective, but not significant, effect on the development of breast cancer for *Model 1* and *Model 3* is also consistent with previous results. However, for both the corrected and uncorrected estimates in *Model 2* and *Model 4* (the cumulative average model), calorie-adjusted saturated fat intake has a significant protective effect on the incidence of breast cancer.

Confidence interval widths for saturated fat intake tend to be smaller based on cumulative average intake vs. baseline intake. However, this is not true for alcohol or total caloric intake. Finally, as expected, confidence interval widths are narrower for *Models 3* and *4* based on 5672 events over 22 years than *Models 1* and *2* based on 1181 and 1351 events over 4 years. The number of events and person-years are larger for *Model 2* than *Model 1* because some subjects (*n* = 20507, 170 events) did not return the 1998 questionnaire, but did return at least one previous questionnaire.

The proposed regression calibration model for cumulative average intake is based on a continuous representation of nutrient intake. However, nutrient intake is often represented in terms of quantiles to avoid making assumptions of a linear relationship between outcome and exposure. We have previously considered measurement error models where nutrient intake is represented as an ordinal variable (Rosner 1996). More work is needed to extend this model to the cumulative average setting such as in Eqs. 1, 4, and 5. Similar issues apply to extending misclassification models for nominal categorical exposure variables (e.g., Greenland 1980) to the cumulative average setting.

A possible extension for the cumulative average model is to use a more general weighted average
${\sum}_{j=1}^{K}{w}_{j}{\mathit{X}}_{{t}_{j}}$ of a vector of covariates ** X** measured over time with more weight given to reported intake on the most recent surveys. A possible weight

Another assumption is that the diet record provides an unbiased estimate of true intake with no correlated error between DR and FFQ measurements. To account for correlated error in a cumulative average setting one would also need simultaneous biomarker measurements as part of the validation study data obtained on at least two points in time (Rosner et al. 2008).

It would be conceptually straightforward to work out the likelihood method. However, the likelihood method involves multiple integrations that integrate out the unobserved true exposures and hence is computationally intensive (Crouch and Spiegelman 1990). Hence, it would be prohibitively expensive to use the likelihood method in a data set of the size of the Nurses’ Health Study. The simulation study in Sect. 4 indicates that the regression calibration method produces approximately unbiased estimates and approximately correct coverage under conditions similar to those seen in the Nurses’ Health Study data.

In conclusion, we have presented a method for correcting for measurement error with nutrient intake measured on a continuous scale over multiple surveys. To use this methodology, one needs at least two validation studies that are concurrent with the main study at two different points in time with some subjects participating in both validation studies. It of course would be desirable to have validation study data at additional time points to further validate some of the assumptions mentioned above.

We acknowledge the support of NHS program project CA87969 and also CA50597 in performing this work. We thank the editors and referees for their invaluable comments and suggestions. We acknowledge programming support of Rong Chen.

Electronic supplementary material The online version of this article (doi:10.1007/s10985-009-9124-6) contains supplementary material, which is available to authorized users.

- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: a modern perspective. 2nd ed. Chapman & Hall; Boca Raton, FL: 2006.
- Crouch EA, Spiegelman D. The evaluation of integrals of the form $\int_{-\infty}^{\infty} f(t)\exp(-t^{2})\,dt$: application to logistic-normal models. J Am Stat Assoc. 1990;85:464–469.
- Cupples LA, D’Agostino RB, Anderson K, Kannel WB. Comparison of baseline and repeated measure covariate techniques in the Framingham Heart Study. Stat Med. 1988;7:205–218.
- D’Agostino RB, Lee M-LT, Belanger AJ, Cupples LA, Anderson K, Kannel W. Relation of pooled logistic regression to time dependent Cox regression analysis: the Framingham Heart Study. Stat Med. 1990;9:1501–1515.
- Greenland S. The effect of misclassification in the presence of covariates. Am J Epidemiol. 1980;112:564–569.
- Hu FB, Stampfer MJ, Manson JE, Rimm E, Colditz GA, Rosner BA, Hennekens CH, Willett WC. Dietary fat intake and the risk of coronary heart disease in women. N Engl J Med. 1997;337:1491–1499.
- Kahn HA, Dawber TR. The development of coronary heart disease in relation to sequential biennial measures of cholesterol in the Framingham Study. J Chron Dis. 1966;19:611–620.
- Kim EHJ, Willett WC, Colditz GA, Hankinson SE, Stampfer MJ, Hunter DJ, Rosner B, Holmes MD. Dietary fat and risk of postmenopausal breast cancer in a 20-year follow-up. Am J Epidemiol. 2006;164:990–997.
- Li Y, Lin X. Functional inference in frailty measurement error models for clustered survival data using the simex approach. J Am Stat Assoc. 2003;98:191–203.
- Prentice RL. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
- Rosner BA. Measurement error models for ordinal exposure variables measured with error. Stat Med. 1996;15:293–303.
- Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med. 1989;8:1051–1069.
- Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error. Am J Epidemiol. 1990;132:734–745.
- Rosner B, Michels KB, Chen Y-H, Day NE. Measurement error correction for nutritional exposures with correlated measurement error: use of the method of triads in a longitudinal setting. Stat Med. 2008;27:3466–3489.
- Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. Am J Clin Nutr. 1997;65(suppl):1179S–1186S.
- Willett W. Nutritional epidemiology. 2nd ed. Oxford University Press; New York: 1998.
- Willett WC, Sampson L, Stampfer MJ, Rosner B, Bain C, Witschi J, Hennekens CH, Speizer FE. Reproducibility and validity of a semiquantitative food frequency questionnaire. Am J Epidemiol. 1985;122:51–65.
- Willett WC, Sampson L, Browne ML, Stampfer MJ, Rosner B, Hennekens CH, Speizer FE. The use of a self-administered questionnaire to assess diet four years in the past. Am J Epidemiol. 1988;127:188–199.
- Wu M, Ware JH. On the use of repeated measurements in regression analysis with dichotomous responses. Biometrics. 1979;35:513–521.
- Yi GY. A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates. Biostatistics. 2008;9:501–512.
- Yi GY, He W. Methods for bivariate survival data with mismeasured covariates under an accelerated failure time model. Commun Stat A Theory Methods. 2006;35:1539–1554.
- Yi GY, Lawless JF. A corrected likelihood method for the proportional hazards model with covariates subject to measurement error. J Stat Plan Infer. 2007;137:1816–1828.
