We have presented a method of predicting an individual’s usual intake of an episodically consumed food and relating it to a health outcome. The method is based on regression calibration prediction applied to short-term repeat observations of intake that contain measurement error and excess zeros, under two important assumptions. First, the fact of short-term consumption is assumed to be correctly classified. Second, the reported intake on consumption days is assumed unbiased for true intake. In our method, information from the main dietary instrument may be combined with that from another longer-term, presumably less precise and even biased, report using an auxiliary instrument. We have demonstrated, through real data and simulations, that the gain from combining two instruments may be substantial, with increases in the precision of the predicted usual intake and of the estimated diet-health outcome relationship.
In our applications, the main instrument was a 24HR and the auxiliary instrument a FFQ. Unfortunately, the assumption of unbiasedness of the main instrument does not strictly apply to the 24HR. Recent biomarker studies (Kipnis et al., 2003) have shown that, for total energy, the 24HR also involves systematic error related to true usual intake. Such biases in reporting energy intake indicate bias also in the reporting of at least some energy-contributing foods. On the other hand, these same studies confirmed that the bias in 24HR reports is considerably less than that in FFQs. Thus, in the absence of any accurate biomarker for most foods and nutrients, using the 24HR in our proposed method may provide the best available approximation.
Our method appears to fill a gap in the analytic tools of nutritional epidemiologists estimating food and health outcome associations. Use of 24HRs alone is known to be problematic when there is a large number of zero values, whereas use of the FFQ alone is marred by the large reporting biases of this instrument. Our examples have demonstrated that the proposed method is feasible to implement and produces nearly unbiased estimates of associations of intakes of episodically consumed foods with health outcomes. The method outperformed the “naïve” approach even without the FFQ in the calibration model, giving an estimate with a much reduced MSE. However, use of the FFQ greatly increased the precision of the estimate.
As shown in Section 3, use of the FFQ will not have a large impact for all foods. Probably the most important factor that determines the impact of the FFQ is the overall probability of consuming the food on a given day. For foods with a relatively low probability of consumption (e.g., fish and dark green vegetables in ), the FFQ will most likely provide a larger increase in efficiency. However, a larger sample size (or, alternatively, more repeat 24HRs) is required to obtain reliable estimates of the model parameters when the consumption probability is very low. This is because a substantial number of individuals with at least two consumption days are needed to estimate properly the within-person variance in the second part of the model. In our NHANES example, there were 57 women (out of 1605) who consumed fish on both days. We would not expect reliable fits for very rarely consumed foods (e.g., organ meats or yogurt in NHANES) with considerably fewer than 50 individuals with two positive intakes, and indeed we have encountered some convergence problems in simulations of such cases.
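The dependence of the number of usable individuals on the consumption probability and on the number of repeat 24HRs can be explored with a small simulation. The following is only a minimal sketch, not the model fitted in this article: it assumes a logistic probability of consumption with a normally distributed person-specific random effect and no covariates, and the function name and parameter values are hypothetical.

```python
import math
import random


def count_two_plus_consumers(n_people, n_recalls, intercept, sd_random_effect, seed=0):
    """Count simulated individuals with at least two consumption days.

    Assumption (illustrative only): the daily consumption probability is
    logistic in (intercept + u), where u ~ N(0, sd_random_effect^2) is a
    person-specific random effect; recalls are independent given u.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_people):
        u = rng.gauss(0.0, sd_random_effect)
        p = 1.0 / (1.0 + math.exp(-(intercept + u)))
        # Number of recalls on which the food was consumed
        days = sum(rng.random() < p for _ in range(n_recalls))
        if days >= 2:
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical settings: a rarely consumed food, 1605 individuals
    for k in (2, 4, 6):
        print(k, count_two_plus_consumers(1605, k, intercept=-2.0, sd_random_effect=1.0))
```

Under these assumed settings, only the individuals counted here contribute information on the within-person variance of the amount consumed, which is why either a larger sample or more recalls is needed for rarely consumed foods.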
In our two-part model, the first part specifies the probability of the point mass at zero, and the second part conditionally models the continuous variable given that it is positive. Another potential approach to modeling semicontinuous data with measurement error was proposed by Li, Shao, and Palta (2005). It is based on the sample selection model that posits an underlying continuous variable censored by a random mechanism. Using our notation, true long-term and reported intakes are specified as $T_i = \max(0, V_i)$ and $R_{ij} = \max(0, V_i + \varepsilon_{ij})$, respectively, with the underlying variable $V_i$ specified as a linear function of covariates plus a person-specific random effect. The use of the same linear function of covariates and the same random effect to specify the censoring mechanism and the positive observations makes this model less flexible than ours. Its advantage is formal modeling of never-consumers.
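The structural difference between the two specifications can be made concrete by generating data from each. The sketch below is illustrative only: it assumes a single covariate `x`, a log scale for the positive amounts in the two-part model (rather than any particular transformation used in the analyses), and hypothetical coefficient values and function names.

```python
import math
import random


def simulate_two_part(x, n_recalls, rng, b1=(-1.0, 0.5), b2=(1.0, 0.3),
                      sd_u=(1.0, 0.5), rho=0.3, sd_eps=0.6):
    """Two-part model: separate linear predictors and correlated random effects.

    Part 1: logistic probability of consumption with random effect u1.
    Part 2: amount on consumption days (log scale, an assumption of this
    sketch) with random effect u2, where corr(u1, u2) = rho.
    """
    u1 = rng.gauss(0.0, sd_u[0])
    z = rng.gauss(0.0, 1.0)
    u2 = sd_u[1] * (rho * u1 / sd_u[0] + math.sqrt(1.0 - rho ** 2) * z)
    p = 1.0 / (1.0 + math.exp(-(b1[0] + b1[1] * x + u1)))
    reports = []
    for _ in range(n_recalls):
        if rng.random() < p:
            eps = rng.gauss(0.0, sd_eps)
            reports.append(math.exp(b2[0] + b2[1] * x + u2 + eps))
        else:
            reports.append(0.0)
    return reports


def simulate_censored(x, n_recalls, rng, b=(0.0, 0.5), sd_u=1.0, sd_eps=0.6):
    """Sample selection model: R_ij = max(0, V_i + eps_ij).

    One linear predictor and one random effect drive both the zeros
    and the positive amounts.
    """
    v = b[0] + b[1] * x + rng.gauss(0.0, sd_u)
    return [max(0.0, v + rng.gauss(0.0, sd_eps)) for _ in range(n_recalls)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate_two_part(0.2, 4, rng))
    print(simulate_censored(0.2, 4, rng))
```

In the censored version, a person with a strongly negative underlying value has true intake of exactly zero, which is how never-consumers arise; in the two-part version, the probability and amount parts may depend on covariates in different ways.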
Our two-part model assumes that each food is ultimately consumed by all individuals, so that $T_i > 0$. This derives from specifying the random effect in the probability part as a continuous variable. In a similar situation, Olsen and Schafer (2001) suggested a two-part mixture for the distribution of this random effect, where the status of a “teetotaler” is specified by a latent class classification variable, but did not provide any details of fitting such a model.
We considered adding a third part to our model, which specifies for each person the probability of being a never-consumer by using fixed-effect logistic regression on a vector of covariates X3i. We have fitted this model to the data on fish intake in EATS among 515 women, including 30 who reported zero intakes on the FFQ. An indicator variable of whether fish consumption was reported on the FFQ was used as a covariate in X3i. In a simulation study similar to the one described in Section 5 (but this time simulating never-consumers), we investigated cases where the number of 24HRs was 2, 4, or 6. With only two 24HRs, the model fit was unstable in 64 out of 250 simulated data sets, although the problem disappeared when we increased the number of 24HRs to four or more. Modeling never-consumers is an area for further research, but, with only two 24HRs, the two-part model seems the most feasible approach.
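One plausible reason why few recalls make never-consumers hard to model is that, with two 24HRs, many true consumers also report zero on every recall and so look the same as never-consumers in the recall data. The sketch below illustrates this under simplifying assumptions that differ from the fitted model: a constant never-consumer probability (`p_never`, hypothetical) instead of a logistic regression on X3i, and a logistic-normal consumption probability without covariates.

```python
import math
import random


def all_zero_breakdown(n_people, n_recalls, p_never, intercept, sd_u, seed=0):
    """Split individuals with all-zero recalls into never-consumers and consumers.

    Returns (never_zero, consumer_zero): the number of simulated
    never-consumers, and the number of true consumers who happen to
    report no consumption on any of the n_recalls days.
    """
    rng = random.Random(seed)
    never_zero = 0
    consumer_zero = 0
    for _ in range(n_people):
        if rng.random() < p_never:
            never_zero += 1  # never-consumers always report zero
            continue
        u = rng.gauss(0.0, sd_u)
        p = 1.0 / (1.0 + math.exp(-(intercept + u)))
        if all(rng.random() >= p for _ in range(n_recalls)):
            consumer_zero += 1
    return never_zero, consumer_zero


if __name__ == "__main__":
    # Hypothetical settings loosely sized like the example above (515 individuals)
    for k in (2, 4, 6):
        print(k, all_zero_breakdown(515, k, p_never=0.06, intercept=-1.5, sd_u=1.0))
```

As the number of recalls grows, the second count shrinks while the first does not, so the two groups become easier to separate.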
Our methodology is suitable for analysis of a particular food and its relationship with a health outcome that involves no other dietary factors. An extension to a multivariate case with several foods and nutrients requires conditioning in formula (20) on potentially correlated random effects for all considered dietary factors simultaneously and is another area for future research.
Although we concentrated on dietary surveys, the proposed method can also be applied to cohort studies of associations between episodically consumed foods and disease. Currently, most such studies use a FFQ as the main dietary-assessment instrument, while a more precise short-term reference instrument is available only in a calibration substudy. In such cases, the regression calibration is based on estimating the conditional expectation of true usual intake, which involves conditioning on the FFQ and other covariates, but not on the 24HR (and therefore random effects) as in formulas (6). This simplifies the method and, more importantly, allows its application to a multivariate case with several foods and nutrients by considering regression calibration of each dietary factor, one at a time.
In the future, as automated 24HRs become available, our methodology could combine multiple administrations of this instrument with the FFQ to achieve more precise results.