PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Lifetime Data Anal. Author manuscript; available in PMC 2011 January 1.
Published in final edited form as:
PMCID: PMC2809827
NIHMSID: NIHMS145713

Measurement error correction for the cumulative average model in the survival analysis of nutritional data: application to Nurses’ Health Study

Abstract

The use of the cumulative average model to investigate the association between disease incidence and repeated measurements of exposures in medical follow-up studies can be dated back to the 1960s (Kahn and Dawber, J Chron Dis 19:611–620, 1966). This model takes advantage of all prior data and thus should provide a statistically more powerful test of disease-exposure associations. Measurement error in covariates is common for medical follow-up studies. Many methods have been proposed to correct for measurement error. To the best of our knowledge, no methods have been proposed yet to correct for measurement error in the cumulative average model. In this article, we propose a regression calibration approach to correct relative risk estimates for measurement error. The approach is illustrated with data from the Nurses’ Health Study relating incident breast cancer between 1980 and 2002 to time-dependent measures of calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.

Keywords: Measurement error, Regression calibration, Nutritional data

1 Introduction

Kahn and Dawber (1966) used a cumulative average model to make use of all known cholesterol values prior to the occurrence of an event to investigate the development of coronary heart disease over a short time period in relation to sequential biennial measures of cholesterol in the Framingham Heart Study. The cumulative average model has been also studied by several researchers (e.g., Wu and Ware 1979; Cupples et al. 1988; D’Agostino et al. 1990) using data from the Framingham Heart Study. Several authors (e.g., Hu et al. 1997; Kim et al. 2006) applied the cumulative average model to the Nurses’ Health Study.

The cumulative average model takes advantage of all prior data and thus should provide a statistically more powerful test of an association of cumulative exposure (Willett 1998, p. 333). However, measurement error in covariates is not unusual for medical follow-up studies. For example, nutritional intake measured with instruments such as a food frequency questionnaire (FFQ) or a 24-h recall often have substantial systematic measurement error, in addition to random measurement error. Unlike random measurement error, systematic measurement error cannot be reduced by obtaining replicate measures of diet at different points in time and averaging the responses. Ignoring systematic measurement error could cause biased inference. Hence correction for systematic measurement error is needed for the cumulative average model.

Many methods have been proposed to correct for systematic measurement error (see Carroll et al. 2006, for an overview). However, most existing methods (e.g., Rosner et al. 1989, 1990; Spiegelman et al. 1997) considered measurement error correction based on covariates measured at only one time point. Recently, several authors, such as, Rosner et al. (2008) and Yi (2008) proposed methods for longitudinal data. To the best of our knowledge, no measurement correction methods have been proposed for the cumulative average model. In this article, we propose a regression calibration approach based on the method described in Rosner et al. (1990) to correct measurement error for the cumulative average relative risk estimates. The approach is illustrated with data from the Nurses’ Health Study relating incident breast cancer between 1980 and 2002 to calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.

The remainder of the article is structured as follows. In Sect. 2, we briefly describe the Nurses’ Health Study data sets to motivate the cumulative average model. In Sect. 3, we then introduce the cumulative average model. In Sect. 4, we propose the measurement error correction method for the cumulative average model for the case where both main study data set and validation study data set are available at each time point. We conduct a simulation study to evaluate the performance of the measurement error correction in this setting. In Sect. 5, we describe how to correct measurement error based on the cumulative average model when validation study data are not available at some time points. The analysis results for the Nurses’ Health Study (NHS) data based on the proposed measurement error correction method are presented in Sect. 6. A discussion is given in Sect. 7. Technique details are given in Web Appendices.

2 Nurses’ Health Study data sets

The NHS is a continuing epidemiologic cohort study established at the Channing Lab- oratory, Brigham and Women’s Hospital in 1976. Initially, the NHS investigated the potential long term consequences of the use of oral contraceptives. Around 120,000 married registered nurses (aged 30–54 in 1976) were selected to be followed prospectively. In 1980, diet information were collected via food frequency questionnaire (FFQ), which measures how often an individual eats particular types of food in the previous year. Subsequent diet information were collected in 1984, 1986, and every four years since.

It is well-known that nutrient intakes measured by FFQ have substantial measurement error. To correct for measurement error, two valiation/calibration studies were conducted in 1980 and 1986 for two subsets of participants (Willett et al. 1985, 1988). During 1980, 173 participants were asked to fill out 4 weeks of diet record (DR) approximately 3months apart over 1 year. At the end of the year, a second FFQ was administered which could be directly compared with the average of 28 days of DR. The 1986 validation study consisted of 190 participants, 92 of whom were also in the 1980 validation study.

We are interested in assessing the possible association between the incidence of breast cancer and calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age.

Calorie-adjusted saturated fat intake is defined as percent of energy due to saturated fat (=100 × 9× saturated fat intake (g)/total caloric intake (kcal)). Age in 1980 is categorized in five categories: 34–39, 40–44, 45–49, 50–54, and 55+. Calorie-adjusted saturated fat, total caloric intake, and alcohol intake are measured with error while age is measured without error. FFQ questionnaires were collected from the main NHS studies in 1980, 1984, 1986, 1990, 1994, and 1998.

Some descriptive statistics concerning the main and validation study populations are given in Table 1.

Table 1
Descriptive statistics for the main and validation study population

For the main studies, the average % calories due to saturated fat decreased over time. The total caloric intake increased from 1980 to 1984, and then remained about the same from 1984 to 1998. This is probably because the 1980 questionnaire had 61 food items while questionnaires from 1984 on had over 120 food items. Mean alcohol intake decreased from 1980 to 1998, except for a slight increase from 1980 to 1984.

Pearson correlations between FFQ intake at repeated surveys over 18 years are presented in Table 2.

Table 2
Correlation between FFQ nutrient intakes over time in main studies

As expected, the correlations between intake of the same nutrient over time decreases as the difference between time points increases, although the rate of decrease is small. There is a low correlation (0.05–0.16) between calorie-adjusted saturated fat intake and total caloric intake when measured at the same time. The cross correlation between calorie-adjusted saturated fat intake at time t1 and total caloric intake at time t2 ranges from −0.01 to 0.08. The correlation between calorie-adjusted saturated fat intake and alcohol intake is slightly negative and the correlation between total caloric intake and alcohol intake is slightly positive. The correlation between the same nutrient measured in 1980 and that measured in 1984 is smaller than the correlation between the same nutrient measured over any other two consecutive surveys. This is partly because the FFQ included around 60 more food items starting from 1984, resulting in more stable estimates of nutrient intake over time.

We used four different approaches to assess the possible association between the incidence of breast cancer and calorie-adjusted saturated fat intake, controlling for total caloric intake, alcohol intake, and baseline age. All four approaches make use of the Cox proportional hazards regression model.

With the first approach, we related exposures in 1998 to breast cancer incidence from 1998 to 2002 (1181 cases). With the second approach, we related the average exposure from 1980 to 1998 to breast cancer incidence from 1998 to 2002 (1351 cases). With the third approach, we related exposure in 1980 to breast cancer incidence from 1980 to 2002 (5672 cases). With the fourth approach, we first related average exposure during each of 6 time periods (1980, 1980–1984–1980–1986–1980–1990–1980–1994–1980–1998) to breast cancer incidence over the corresponding succeeding time period (1980–1984–1984–1986–1986–1990–1990–1994–1994–1998–1998–2002) with a total of 5672 cases over 1980–2002, and then pooled the six estimates.

Compared to the first three approaches, the fourth approach make full use of all available information and hence can dampen the effects of random errors more effectively. The fourth approach was mentioned in the literature as early as 1960s (e.g., Kahn and Dawber 1966) and was referred to as the cumulative average model. In this article, we consider the measurement error correction problem for the cumulative average model. We briefly describe this model in the next section.

3 Cumulative average model

The idea of the cumulative average model described in Kahn and Dawber (1966) is to first treat each time interval as a mini follow-up study, then to pool observations over all intervals to examine the short-term development of disease (D’Agostino et al. 1990). Wu and Ware (1979) implemented the cumulative average model using pooled logistic regression. In addition, Cupples et al. (1988), and Kim et al. (2006) implemented the cumulative average model using Cox proportional hazards regression. In this article, we will also use the Cox proportional hazards regression model.

Let Xit be k1 × 1 vector corresponding to k1 exposure variables (e.g. calorie-adjusted saturated fat intake, total caloric intake, and alcohol intake measured via DR) which represent the true levels of exposures for the i-th subject at time t. Let Uit be a k2 × 1 vector of variables measured without error (e.g. baseline age) for the i-th subject at time t. The period between exams t and t +1 will be called the t-th observation period. For the t-th observation period, a Cox proportional hazards regression model can be fitted to relate disease hazard to cumulative average exposures measured up to time t:

hc(t[mid ]X¯it,U¯it)=hc0(t)exp[βc1*X¯it+βc2*U¯it],i=1,,N,t=1,,T,
(1)

where hc (t|Xit, Ūit) is the hazard function during observation period t, hc0(t) is the baseline hazard function at time t, X¯it=[ell]=1tXi[ell]/nt,U¯it=[ell]=1tUi[ell]/nt, nt is the number of subjects free of disease up to time t, and T is the end point of the study (e.g. T = 6 for NHS). Most standard statistical software can provide parameter estimates for the Cox proportional hazards regression model.

Denote β^c*(t)=(β^c1*(t),β^c2*(t)) as the parameter estimates for Model (1). A pooled estimate of regression coefficients up to time T for the cumulative average model is

β^c,[ell],T*=t=1Tw[ell]*(t)β^c,[ell]*(t)=(w[ell],T*)γ^[ell],T*,
(2)

where β^c,[ell]*(t) is the [ell]-th element of the 1 × (k1 + k2) vector β^c*(t), [ell] = 1, …, k [equivalent] k1 + k2, and the T × 1 vector w[ell],T*=(w[ell]*(1),w[ell]*(2),,w[ell]*(T))=(Ω[ell],T*)11T/[1T(Ω[ell],T*)11T], where 1T is the T × 1 vector of ones, Ω[ell],T* is the conditional variance-covariance matrix of the T × 1 vector γ^[ell],T*=(β^c,[ell]*(1),,β^c,[ell]*(T)) given W*, W* is the matrix of exposures X(t) and Ū(t) up to time t, t = 1, …, T, and the symbol ′ represents the matrix/vector transpose operation.

The Lagrange multiplier method can be used to derive the weight vector w[ell],T* which minimizes the variance of the overall estimator t=1Tv*(t)β^[ell]*(t), subject to the conditions t=1Tv*(t)=1, and υ*(t) > 0, t = 1, …, T (see Web Appendix A). The variance of β^c,[ell],T* given W* is

Var(β^c,[ell],T*[mid ]W*)=1/[1T(Ω[ell],T*)11T].
(3)

4 Measurement error correction method

It is usually impossible or expensive to directly measure Xit on a large number of subjects. Instead, we usually observe surrogate exposure variables which are subject to measurement error. Let Zit be a k1 × 1 vector corresponding to the k1 surrogate exposure variables (e.g. calorie-adjusted saturated fat intake, total caloric intake, and alcohol intake measured via FFQ) for the i-th subject at time t. The Cox proportional hazards regression for the surrogate variables is

hc(t[mid ]Z¯it,U¯it)=hc0(t)exp[βc1Z¯it+βc2U¯it],i=1,,N,
(4)

where Z¯it=m=1tZim/nt, Denote β^c(t)=(β^c1(t),β^c2(t)) as the parameter estimates for Model (4).

The parameters βc(t)=(βc1(t),βc2(t)) for the surrogate regression model (4) might be attenuated relative to the true regression parameters βc*(t)=(βc1*(t),βc2*(t)) due to measurement error (see the simulation study below). Many methods (e.g. Prentice 1982; Spiegelman et al. 1997; Li and Lin 2003; Yi and He 2006; Carroll et al. 2006; Yi and Lawless 2007) have been proposed to correct relative risk estimates for measurement error in a survival data analysis setting. However, to the best of our knowledge, no methods have been proposed for measurement error correction of cumulative average exposures.

We apply the regression calibration method described in Rosner et al. (1990) to handle the cumulative average exposure case. The regression calibration method described in Rosner et al. (1990) first models the relation between true exposure and surrogate exposure, usually based on a validation study. It then replaces the true exposures by the conditional expectation of the true exposures conditioned on the surrogate exposures in the model relating the true exposures to disease. Next the regression coefficients of the revised model are compared with those of the model relating the surrogate exposures to disease. The comparison finally leads to the corrected regression coefficients. The regression calibration method assumes that a nondifferential error mechanism holds (i.e. Pr (Z|X, D) = Pr (Z|X), where D = 1 means diseased, D = 0 means not-diseased). In our example, it is reasonable to assume a nondifferential error mechanism since both Z values and X values are observed before any breast cancer occurs.

We assume a multivariate regression relating Xit, Zit, and Ūit in the validation data:

X¯it=αc+Λc1Z¯it+Λc2U¯it+ec,it,i=1,,nt,
(5)

where Λc1 and Λc2 are k1 × k1 and k1 × k2 matrices, ec,it is multivariate normally distributed with a k1 × 1 mean vector 0 and a k1 × k1 covariance matrix Σce.

By substituting expected values E [Xit |Zit, Ūit] obtained from Model (5) for Xit in Model (1) and then comparing it with Model (4), we can show that to an excellent degree of approximation (see the simulation study below):

βc*=βcΛc1,
(6)

where βc*=(βc1*,βc2*) and βc = (βc1, βc2) are 1 × k vectors, Λc is a k × k matrix given by

Λc=(Λc1Λc20k2×k1Ik2×k2)
(7)

and k = k1 + k2.

Using the results from Web Appendix B, we have

Cov^(β^c,j1*,β^c,j2*)=(Ac^β^cAc)j1,j2+β^c^Ac,j1,j2β^c,
(8)

where Ac=Λc1,

^Ac,j1,j2=w·[Ac1^ceAc1]
(9)

is the k × k variance-covariance matrix relating the elements in the j1-th and j2-th columns of Ac (denoted by acj1 and acj2, respectively), w=[acj1D(W¯tW¯t)1Dacj2] is a scalar, the k× k1 matrix Ac1 is the first k1 columns of Ac, Σce = (Cov(Xt,r, Xt,s)|wt)r,s=1,…,k1 = Cov(ect,r, ect,s)r,s=1,…,k1, D = (0k × 1, Ik × k), Xt,r is the average true value of the r-th variable measured with error, and

W¯t=(1Z¯1tU¯1t[vertical ellipsis][vertical ellipsis][vertical ellipsis]1Z¯nttU¯ntt),

where wt is obtained from the validation study of size nt at time t. Using the results from Web Appendix C, we can obtain the variance β^c,[ell],T* in Eq. 3 of the pooled estimate of regression coefficients in Eq. 2 up to time T.

To evaluate the validity of the approximation in Eq. 6, we conducted a simulation study based on the 1980 and 1986 surveys. In the simulation study, we consider three exposures: calorie-adjusted saturated fat intake (denoted as Z1 measured by FFQ, X1 measured by DR), total caloric intake (denoted as Z2 measured by FFQ, X2 measured by DR), and baseline age (regarded as a continuous variable and denoted by U). We generated 1,000 simulated data sets. Each data set consists of a main study and a corresponding validation study at 2 time points. Each main study contains 10,000 subjects. Each validation study contains 200 subjects.

To take into account of (a) the correlations among different exposures at the same time point, (b) the correlations of the same exposure measured at different times, and (c) the cross correlations among different exposures measured at different times, we assume that the random vector W = (Z1,t1, Z2,t1, Z1,t2, Z2,t2, U)′ is normally distributed with mean vector μc and covariance matrix Σc. We regard the moment estimates of μc and Σc obtained based on the 1980 and 1986 main studies as the true values of μc and Σc in the simulations study.

For each data set, we first generated random multivariate normal vectors W1, …, W10,000 in the main study. We then computed the average nutrient intakes Z1,k = (Z1,t1,k + Z1,t2,k)/2 and Z2,k = (Z2,t1k + Z2,t2,k)/2, k = 1, …, 10,000. We next generated random error vectors ec,k, k = 1, …, 10,000, from the multivariate normal distribution N (0, Σce), where Σce is the sample covariance matrix of the residuals of the multivariate linear model (5) based on the 1980 and 1986 validation studies. We next computed Xk = (X1,k, X2,k)′ based in Eq. 5, where the parameters αc, Λ1, and Λ2 are the estimated regression coefficients based on the 1980 and 1986 validation studies. In the same way, we generated data for the validation study (k = 1, …, 200) using the same μc and Σce as in the main study.

We next generated survival times based on the Weibull distribution with hazard function h(t|δ, η(X, U)) = δtδ − 1 ηδ, where δ is the shape parameter and η (X, U) is the scale parameter. We assume that the logarithm of the scale parameter η is a linear function of X and U: log(η)=α0+θ1X¯+θ2U. Then, the Weibull distribution has proportional hazards. The log hazard ratios for X and U are βc1*=δθ1 and βc2*=δθ2, respectively. In the simulation study, the true values of the parameters α0 and δ are the estimates obtained by fitting the Weibull regression to the FFQ data (Z, U) based on 1980 and 1986 main studies. We considered several values for exp(βc1,1*)=0.7,1.0,1.5 corresponding to a protective, null and deleterious effect of saturated fat. We set exp(βc1,2*)=1 since there usually is no association between total calories and breast cancer, and set exp(βc2,1*)=1.1 allowing for a 10% increase in breast cancer incidence per year, where βc1,1* = % calories due to saturated fat, βc1,2* = total caloric intake, and βc2,1* = age at baseline.

Once we generated the survival times, we choose the 10% quantile of the survival times as the censoring time so that the specified survival rate 90% is achieved.

Finally, we apply the measurement error correction method we proposed for the cumulative average model to each simulated data set.

The percent biases and coverage probabilities of un-corrected and corrected estimates are shown in Table 3. The full simulation results are given in Web Appendix D.

Table 3
Percent bias and coverage based on 1,000 simulated data sets

The simulation results show that (i) corrected estimates of HRs have small percent bias and have appropriate coverage both for H R = 1.0 and H R ≠ 1.0; (ii) If the true HR is 1.0, the uncorrected estimates have small percent bias and have appropriate coverage; (iii) If the true HR is not equal to 1.0, the un-corrected estimates for the exposure (Z) subject to measurement error have a large percent bias towards the null and low coverage. (iv) If the true HR is not equal to 1.0, the un-corrected estimates for the exposure (U) without measurement error have small percent bias, but low coverage; (v) For the exposure (U) without measurement error, corrected estimates have smaller percent bias than un-corrected estimates. In summary, the corrected estimates in Eq. 6 have small bias and appropriate coverage, while the un-corrected estimates tend to be attenuated towards 1.0, resulting in large bias and low coverage.

5 Analysis of NHS data

We could not apply the proposed method directly to NHS data sets because validation studies are not available at some time points. Hence, we cannot estimate Λc directly.

We propose an indirect method to overcome this difficulty. From (5), we have

Cov(X¯it,Z¯it)=Λc1Var(Z¯it)+Λc2Cov(U¯it,Z¯it)Cov(X¯it,U¯it)=Λc1Cov(Z¯it,U¯it)+Λc2Var(U¯it).
(10)

Thus, from (10)

(Λc1,Λc2)=B1B21,
(11)

where B1 and B2 are k1 × k and k × k matrices given by

B1=[Cov(X¯it,Z¯it),Cov(X¯it,U¯it)]B2=[Var(Z¯it)Cov(Z¯it,U¯it)Cov(U¯it,Z¯it)Var(U¯it)].

We can estimate B2 from the main study where Var^(Z¯it)=1t2t1,t2=1tCov^(Zit1,Zit2),Var^(U¯it)=1t2t1,t2=1tCov^(Uit1,Uit2),Cov^(U¯it,Z¯it)=1t2t1,t2=1tCov^(Uit1,Zit2).

To estimate B1, we assume a stationary covariance structure, whereby the covariance between the true exposures Xt1 at time point t1 and the surrogate exposures Zt2 at time point t2 (or the covariates measured without error Ut2 at time point t2) do not depend on the starting time point, but on the duration |t2t1|, i.e., Cov (Xit1, Zit2) = Cov (Xi1, Zi,|t1t2,|+1), Cov (Xit1, Uit2)= Cov (Xi1, Ui,|t1t2|+1). The assumption is reasonable based on the NHS data (see Web Appendix E). Hence,

Cov^(X¯it,Z¯it)=1t2t1,t2=1tCov^(Xit1,Zit2)=1t2t1,t2=1tCov^(Xi1,Zi,[mid ]t1t2[mid ]+1),Cov^(X¯it,U¯it)=1t2t1,t2=1tCov^(Xit1,Uit2)=1t2t1,t2=1tCov^(Xi1,Ui,[mid ]t1t2[mid ]+1).
(12)

Thus, we can estimate βc from (4), Λc from (10) to (12), and βc* from (6) whereby

β^c*=β^cΛ^c1.
(13)

Equations 8 and 9 indicate that the calculation of the covariance matrix Cov(β^c*) involves the estimation of Σce, i.e., Cov (Xt,r, Xt,s|wt). We have

Cov(X¯t,r,X¯t,s[mid ]W¯t)=1t2t1,t2=1tsd(Xt1,r[mid ]Wt)sd(Xt2,s[mid ]Wt)corr(Xt1,r,Xt2,s[mid ]Wt),
(14)

where sd (Xt,r |Wt) is the standard deviation of the r-th true exposure at time point t given the surrogate exposures Zt and variables measured without error Ut at time point t (c.f. Model (5)).

We assume the availability of longitudinal validation study data where Xt, Zt and Ut are available for at least a subset of subjects at 2 points in time. We also assume stationarity, whereby sd(Xt,r |Wt) is the same for all t and is denoted by sd (Xr |W). The estimates of the conditional standard deviations for calorie-adjusted saturated fat intake, total caloric intake, and alcohol intake are 0.33, 0.31, and 0.18, respectively, based on the 1980 validation study, and are 0.32, 0.41, 0.22, respectively, based on the 1986 validation study. Hence the stationarity assumption is reasonable. Finally, we also assume a compound symmetric correlation structure whereby corr (Xt1,r, Xt2,s|Wt) is the same for all t1t2 and is denoted by (ρrs|W). Since we have only two validation study data sets, we could not check this assumption. However, the correlations and cross-correlations of surrogate intake (Zit ) in Table 2 decreases only slightly over time. This gives some indication that the compound symmetric correlation assumption is reasonable for the NHS data sets.

Note that sd (Xr |W) can be estimated from the cross-sectional validation study database of size nt based on the residual mean square from the regression of Xt on Zt, Ut. Also (ρrs|W) can be estimated from the longitudinal validation study database of nt1t2 subjects seen at both times t1 and t2 obtained by correlating the residual (Xt1,r|Zt1, Ut1) and the residual (Xt2,s|Zt2, Ut2). Thus, we estimate Cov (Xt,r, Xt,s|wit) from

Cov^(X¯t,r,X¯t,s[mid ]W¯t)=1t2t1,t2=1tsd^(Xt1r[mid ]Zt1,Ut1)sd^(Xt2s[mid ]Zt2,Ut2)(ρ^rs[mid ]Zt,Ut)=1tsd^(Xr[mid ]W~)sd^(Xs[mid ]W~)[ρ^rs(1)+(t1)(ρ^rs[mid ]W~)],
(15)

where ρ^rs(1)=corr(Xt,r,Xt,s[mid ]Wt) and [rho with circumflex]rs = corr (Xt1,r, Xt2,s|Wt1, Wt2), t1t2. Hence

^ce=1t(^(1)+^(3)2)+t1t(^(1,3)+^(3,1)2),
(16)

where Σ(1) and Σ(3) are the covariance matrices for the k1 variables measured with error at time point 1 (1980) and 3 (1986), respectively, the (r, s)-th cell of the matrix Σ(1,3) is the covariance between the r-th variable measured with error at time point 1 and the s-th variable measured with error at time point 3, and the (r, s)-th cell of the matrix Σ(3,1) is the covariance between the r-th variable measured with error at time point 3 and the s-th variable measured with error at time point 1. Σ(1) can be estimated based on the 1980 validation study with size n1 = 173, Σ(3) can be estimated based on the 1986 validation study with size n3 = 190, and Σ(1,3) and Σ(3,1) can be estimated based on the longitudinal validation study consisting of n1,3 = 92 subjects in both 1980 and 1986 validation studies.

Thus, from (9) and (16) we can estimate ΣAc, j1, j2 based on the cross-sectional validation study databases of sizes n1 and n3 and the longitudinal validation study database of size n1,3, respectively. We can also estimate Σ[beta]c from the main study database of size N. Therefore, using (8), we can estimate the variance-covariance matrix of β^c* given by Cov(β^c,j1*,β^c,j2*), j1, j2 = 1, …, k, and can estimate 95% confidence limits for exp(βc,[ell]*) based on the Wald method given by exp[β^c,[ell]*±1.96Var(β^c,[ell]*)].

6 Results

We apply the proposed measurement error correction method described in (6), (8), (9), (15), and (16) to Nurses’ Health Study to assess the possible association between the incidence of breast cancer and calorie-adjusted saturated fat intake (time-dependent), controlling for total caloric intake (time-dependent), alcohol intake (time-dependent), and baseline age. We present four different models.

In Model 1, we relate exposures in 1998 to breast cancer incidence from 1998 to 2002 (1181 cases) using the measurement error correction method in Rosner et al. (1990). In Model 2, we relate cumulative average exposures from 1980 to 1998 to breast cancer incidence from 1998 to 2002 (1351 cases) using Eq. 13 without any pooling and assessing outcome over only one time period (1998 to 2002). In Model 3, we relate baseline exposure in 1980 (without updating) to breast cancer incidence from 1980 to 2002 (5672 cases) using the measurement error correction method in Rosner et al. (1990). Finally, in Model 4, we relate cumulative average exposure from 1980 to 1998 to breast cancer incidence from 1980 to 2002 (5672 cases) using the pooled estimate in Eq. 2.

The point and interval estimates of hazard ratios (both un-corrected and corrected estimates) for Model 1Model 4 are shown in Table 4.

Table 4
Comparison of the measurement error correction models based on Nurses’ Health Studya

Since 1 g and 1 kcal are small increments for nutrient intake, we followed Kim et al. (2006) and have expressed the hazard ratios in increments of 5% energy for calorie-adjusted saturated fat intake, 800 kcal for total caloric intake, and 25 g for alcohol intake, which approximately correspond to increments between the 10th and 90th percentiles of the distribution of FFQ intake.

For all four models, we observe that (1) the corrected point estimates of the regression coefficients are further away from 1 than the uncorrected estimates (In other words, the point estimates are attenuated if we do not correct for measurement error), (2) the corrected 95% CIs are wider than the uncorrected 95% CIs, (3) total caloric intake has no significant effect on the development of breast cancer, and (4) alcohol intake and age in 1980 have a positive association with the incidence of breast cancer. These four observations are consistent with previous results (e.g. Rosner et al. 1989, 1990; Spiegelman et al. 1997). The observation that calorie-adjusted saturated fat intake has a slightly protective, but not significant, effect on the development of breast cancer for Model 1 and Model 3 is also consistent with previous results. However, for both the corrected and uncorrected estimates in Model 2 and Model 4 (the cumulative average model), calorie-adjusted saturated fat intake has a significant protective effect on the incidence of breast cancer.

Confidence interval widths for saturated fat intake tend to be smaller based on cumulative average intake vs. baseline intake. However, this is not true for alcohol or total caloric intake. Finally, as expected, confidence interval widths are narrower for Models 3 and 4 based on 5672 events over 22 years than Models 1 and 2 based on 1181 and 1351 events over 4 years. The number of events and person-years are larger for Model 2 than Model 1 because some subjects (n = 20507, 170 events) did not return the 1998 questionnaire, but did return at least one previous questionnaire.

7 Discussion

The proposed regression calibration model for cumulative average intake is based on a continuous representation of nutrient intake. However, nutrient intake is often represented in terms of quantiles to avoid making assumptions of a linear relationship between outcome and exposure. We have previously considered measurement error models where nutrient intake is represented as an ordinal variable (Rosner 1996). More work is needed to extend this model to the cumulative average setting such as in Eqs. 1, 4, and 5. Similar issues apply to extending misclassification models for nominal categorical exposure variables (e.g., Greenland 1980) to the cumulative average setting.

A possible extension for the cumulative average model is to use a more general weighted average j=1KwjXtj of a vector of covariates X measured over time with more weight given to reported intake on the most recent surveys. A possible weight wj at time point tj is exp[λ(tKtj)]/[ell]=1Kexp[λ(tKt[ell])], where tK is the current time point, tj, j < K are previous time points when the FFQ was administered. When λ = 0, the weighted average gives equal weight to all previous surveys and is equivalent to a cumulative average. When λ [dbl greater-than sign] 0, the weighted average gives most weight to the most recent survey. It is also possible to allow λ < 0 in which case reported intake at distant time points is given more weight which might be a reasonable way to introduce a lag effect. Indeed, for the “baseline” model (Model 3) relating disease incidence to baseline exposures (e.g., nutritional intakes obtained from the 1980 FFQ questionnaire in NHS), λ [double less-than sign] 0. This generalized cumulative average model is also an extension of Wu and Ware’s (1979, Sect. 3, Eq. 3) hierarchical model. We can generalize the measurement error correction method proposed in this article to this generalized cumulative average model.

Another assumption is that the diet record provides an unbiased estimate of true intake with no correlated error between DR and FFQ measurements. To account for correlated error in a cumulative average setting one would also need simultaneous biomarker measurements as part of the validation study data obtained on at least two points in time (Rosner et al. 2008).

It would be conceptually straightforward to work out the likelihood method. However, the likelihood method involves multiple integrations that integrate out the unobserved true exposures and hence is computationally intensive (Crouch and Spiegelman 1990). Hence, it would be prohibitively expensive to use likelihood method in a data set of the size of the Nurses’ Health Study. The simulation study in Sect. 4 indicates that the regression calibration method produces approximately unbiased estimates and approximately correct coverage under conditions similar to those seen in Nurses’ Health Study data.

In conclusion, we have presented a method for correcting for measurement error with nutrient intake measured on a continuous scale over multiple surveys. To use this methodology, one needs at least two validation studies that are concurrent with the main study at two different points in time with some subjects participating in both validation studies. It of course would be desirable to have validation study data at additional time points to further validate some of the assumptions mentioned above.

Supplementary Material

tables and equations

Acknowledgments

We acknowledge the support of NHS program project CA87969 and also CA50597 in performing this work. We thank the editors and referees for their invaluable comments and suggestions. We acknowledge programming support of Rong Chen.

Footnotes

Electronic supplementary material The online version of this article (doi:10.1007/s10985-009-9124-6) contains supplementary material, which is available to authorized users.

References

  • Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: a modern perspective. 2. Chapman & Hall; Boca Raton, FL: 2006.
  • Crouch EA, Spiegelman D. The evaluation of integrals of the form f(t)exp(t2)dt: application to logistic-normal models. J Am Stat Assoc. 1990;85:464–469.
  • Cupples LA, D’Agostino RB, Anderson K, Kannel WB. Comparison of baseline and repeated measure covariate techniques in the Framingham Heart Study. Stat Med. 1988;7:205–218. [PubMed]
  • D’Agostino RB, Lee M-LT, Belanger AJ, Cupples LA, Anderson K, Kannel W. Relation of pooled logistic regression to time dependent Cox regression analysis: the Framingham Heart Study. Stat Med. 1990;9:1501–1515. [PubMed]
  • Greenland S. The effect of misclassification in the presence of covariates. Am J Epidemiol. 1980;112:564–569. [PubMed]
  • Hu FB, Stampfer MJ, Manson JE, Rimm E, Colditz GA, Rosner BA, Hennekens CH, Willett WC. Dietary fat intake and the risk of coronary heart disease in women. N Engl J Med. 1997;337:1491–1499. [PubMed]
  • Kahn HA, Dawber TR. The development of coronary heart disease in relation to sequential biennial measures of cholesterol in the Framingham Study. J Chron Dis. 1966;19:611–620. [PubMed]
  • Kim EHJ, Willett WC, Colditz GA, Hankinson SE, Stampfer MJ, Hunter DJ, Rosner B, Holmes MD. Dietary fat and risk of postmenopausal breast cancer in a 20-year follow-up. Am J Epidemiol. 2006;164:990–997. [PubMed]
  • Li Y, Lin X. Functional inference in frailty measurement error models for clustered survival data using the simex approach. J Am Stat Assoc. 2003;98:191–203.
  • Prentice RL. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
  • Rosner BA. Measurement error models for ordinal exposure variables measured with error. Stat Med. 1996;15:293–303. [PubMed]
  • Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med. 1989;8:1051–1069. [PubMed]
  • Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error. Am J Epidemiol. 1990;132:734–745. [PubMed]
  • Rosner B, Michels KB, Chen Y-H, Day NE. Measurement error correction for nutritional exposures with correlated measurement error: use of the method of triads in a longitudinal setting. Stat Med. 2008;27:3466–3489. [PMC free article] [PubMed]
  • Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. Am J Clin Nutr. 1997;65(suppl):1179S–1186S. [PubMed]
  • Willett W. Nutritional epidemiology. 2. Oxford University Press; New York: 1998.
  • Willett WC, Sampson L, Stampfer MJ, Rosner B, Bain C, Witschi J, Hennekens CH, Speizer FE. Reproducibility and validity of a semiquantitative food frequency questionnaire. Am J Epidemiol. 1985;122:51–65. [PubMed]
  • Willett WC, Sampson L, Browne ML, Stampfer MJ, Rosner B, Hennekens CH, Speizer FE. The use of a self-administered questionnaire to assess diet four years in the past. Am J Epidemiol. 1988;127:188–199. [PubMed]
  • Wu M, Ware JH. On the use of repeated measurements in regression analysis with dichotomous responses. Biometrics. 1979;35:513–521. [PubMed]
  • Yi GY. A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates. Biostatistics. 2008;9:501–512. [PMC free article] [PubMed]
  • Yi GY, He W. Methods for bivariate survival data with mismeasured covariates under an accelerated failure time model. Commun Stat A Theory Methods. 2006;35:1539–1554.
  • Yi GY, Lawless JF. A corrected likelihood method for the proportional hazards model with covariates subject to measurement error. J Stat Plan Infer. 2007;137:1816–1828.