Home | About | Journals | Submit | Contact Us | Français |

**|**Int J Biostat**|**PMC3406506

Formats

Article sections

- Abstract
- INTRODUCTION
- Data and Model
- Empirical Analysis: Methods
- Results
- Discussion
- Supplementary Materials
- References

Authors

Related links

Int J Biostat. 2011 January 1; 7(1): Article 1.

Published online 2011 January 6. doi: 10.2202/1557-4679.1267

PMCID: PMC3406506

Saijuan Zhang, Texas A&M University;

Copyright © 2011 The Berkeley Electronic Press. All rights reserved

This article has been cited by other articles in PMC.

There has been great public health interest in estimating usual, i.e., long-term average, intake of episodically consumed dietary components that are not consumed daily by everyone, e.g., fish, red meat and whole grains. Short-term measurements of episodically consumed dietary components have zero-inflated skewed distributions. So-called two-part models have been developed for such data in order to correct for measurement error due to within-person variation and to estimate the distribution of usual intake of the dietary component in the univariate case. However, there is arguably much greater public health interest in the usual intake of an episodically consumed dietary component adjusted for energy (caloric) intake, e.g., ounces of whole grains per 1000 kilo-calories, which reflects usual dietary composition and adjusts for different total amounts of caloric intake. Because of this public health interest, it is important to have models to fit such data, and it is important that the model-fitting methods can be applied to all episodically consumed dietary components.

We have recently developed a nonlinear mixed effects model (Kipnis, et al., 2010), and have fit it by maximum likelihood using nonlinear mixed effects programs and methodology (the SAS NLMIXED procedure). Maximum likelihood fitting of such a nonlinear mixed model is generally slow because of 3-dimensional adaptive Gaussian quadrature, and there are times when the programs either fail to converge or converge to models with a singular covariance matrix. For these reasons, we develop a Monte-Carlo (MCMC) computation of fitting this model, which allows for both frequentist and Bayesian inference. There are technical challenges to developing this solution because one of the covariance matrices in the model is patterned. Our main application is to the National Institutes of Health (NIH)-AARP Diet and Health Study, where we illustrate our methods for modeling the energy-adjusted usual intake of fish and whole grains. We demonstrate numerically that our methods lead to increased speed of computation, converge to reasonable solutions, and have the flexibility to be used in either a frequentist or a Bayesian manner.

This paper is about the important public health problem of understanding the distribution of episodically consumed dietary component intakes in terms of their energy-adjusted amounts, and relating this to diet-disease relationships. Before commenting in more detail, we first discuss the literature for simpler problems that are also of interest.

In nutritional surveillance and nutritional epidemiology, there is considerable interest in understanding the distribution of usual dietary intake, which is defined as long-term daily average intake. In addition, of interest is the regression of this intake on measured covariates, which is needed to correct diet-disease relationships for measurement error in assessing diet. If the dietary component of interest is ubiquitously consumed, as most nutrients are, the data are continuously distributed and methods are well-established for solving both problems. See for example Nusser, et al. (1997) for surveillance and Carroll, et al. (2006) for measurement error modeling.

Another class of dietary components is those which are episodically consumed, as is true of most foods, e.g., fish, red meat, dark green vegetables, whole grains. When consumption is measured by a short-term instrument such as a 24 hour recall, hereafter denoted by 24hr, the episodic nature of these dietary components means that their reported intake may either equal zero on a non-consumption day, or is positive on a day the component is consumed. In many studies, non-consumption days predominate for several episodically consumed foods of interest. For example, in our data example, for fish and whole grains, 65% and 12% reported no consumption on both of two administrations of the 24hr, respectively. Thus, data on episodically consumed dietary components are zero-inflated data with measurement error. Recently, Tooze, et al. (2006) for nutritional surveillance and Kipnis, et al. (2009) for nutritional epidemiology have reported so-called two-part methods, which are actually nonlinear mixed effects models, for analyzing episodically consumed dietary components in the univariate case. These methods are known commonly as the “NCI method” because many of the co-authors of these papers are members of the National Cancer Institute (NCI), and because SAS routines based upon the NLMIXED procedure are available at http://riskfactor.cancer.gov/diet/usualintakes/, an NCI web site. Other two-part models in different contexts are described for example in Olsen and Schafer (2001), Tooze, et al. (2002) and Li, et al. (2005).

We are interested in the more complex public health problem of understanding the usual intake of an episodically consumed dietary component adjusted for energy intake (caloric intake), along with the distribution of usual intake of energy. This is critical because it addresses the issue of dietary component composition, and makes comparable diets of individuals whose usual intakes of energy are very different. As an example, the U.S. Department of Agriculture’s Healthy Eating Index-2005 (www.cnpp.usda.gov/HealthyEatingIndex.htm) is a measure of diet quality that assesses conformance to Federal dietary guidance. One component of that index is the number of ounces of whole grains consumed per 1000 kilocalories: there are other items in the HEI-2005 that deal with episodically consumed dietary components, and all of them are adjusted for energy intake. The data needed to compute such variables are thus the usual intake of the dietary component consumed and the usual amount of calories consumed, and (possibly normalized) ratios of them.

Recently, Kipnis, et al. (2010) have developed a model for an episodically consumed dietary component and energy, see Section 2. They fit this model using nonlinear mixed effects models with likelihoods computed by adaptive Gaussian quadrature using the SAS procedure NLMIXED. However, as described in Section 2 and documented in Section 4, this form of computation can be slow, and can have serious convergence issues. This is extremely problematic, because of the importance of the problem and the fact that solutions will find wide use in the nutrition community, but only if they are numerically stable.

In this paper, we take an alternative Markov Chain Monte Carlo (MCMC) approach to computation, which is faster and numerically more stable. There are many good introductory papers reviewing MCMC, such as Casella, et al. (1992), Chib, et al. (1995) and Kass, et al. (1998). Effectively, we exploit the well-known fact (Lehmann and Casella, 1998, Chapter 6.8) that in fully parametric regular models of the type we study, Bayesian posterior means of parameters are asymptotically equivalent to their corresponding maximum likelihood estimators. To implement an MCMC approach in our problem, there are technical issues that have to be overcome, including the fact that one of the covariance matrices in the model of Kipnis, et al. (2010) is patterned. Besides fitting the model, our main focus in this paper is to discuss how to use the parameter estimates to then estimate the distributions of the usual intake of energy and energy-adjusted usual intake of dietary components.

In Section 2, we describe the model of Kipnis, et al. (2010). In Section 2, we also briefly outline some of the details of our implementation, although the technical details are given in the Appendix. In Sections 3 and 4, we take up the analysis of the NIH-AARP Study of Diet and Health (http://dietandhealth.cancer.gov/) as an illustration of our model and method. Section 5 gives concluding remarks.

In practice, the response data often come from repeated 24hr. Necessarily, due to cost and logistical reasons, the number of recalls is limited, and is rarely greater than 2. In a 24hr, what is observed is whether a dietary component is consumed, and if it is consumed, the reported amount. In addition, the amount of energy reported to be consumed is also available. Thus, for person *i* = 1, ..., *n*, and for the *k* = 1, ..., *m** _{i}* repeats of the 24hr, the data are

*Y*_{i}_{1}= Indicator of whether the episodically consumed dietary component is consumed._{k}*Y*_{i}_{2}= Amount of the dietary component consumed as reported by the 24hr, which equals zero if the dietary component is not consumed._{k}*Y*_{i}_{3}= Amount of energy consumed as reported by the 24hr._{k}

There are also generally covariates such as age category, ethnic status and in many cases the results of reported intakes from a food frequency questionnaire. We will generically call these covariates **X**.

Here we describe the nonlinear mixed effects latent variable model of Kipnis, et al. (2010). There are *i* = 1, ..., *n* individuals and *k* = 1, ...,*m** _{i}* repeats of the 24hr. Also, the observed data have three parts, relating to whether the episodically consumed dietary component is consumed, the amount if it is consumed, and the amount of energy. Also with the observed data, we will have covariates for the individual, generically called

To be more precise, for the *i ^{th}* individual there are covariates (

The model of Kipnis, et al. (2010) uses a latent variable approach. Let (*W _{i}*

(1)

where (*U _{i}*

The observed data are related to the latent variables as follows:

(2)

(3)

(4)

where *I*(·) is the indicator function and *g ^{−}*

(5)

where Φ(·) is the standard normal distribution function. The probit model is commonly used to model a relationship between a binary dependent variable and one or more independent variables. The probit link was used in Kipnis, et al. (2010) to allow the day-to-day variation in whether a food is consumed to be correlated with the amount of energy consumed, and in such a way that the day-to-day variation random variables (*ε _{i}*

Under the assumption that the 24hr is unbiased for usual (mean) intake, the usual intake of the dietary component and energy are given as *T _{Fi}* =

(6)

(7)

where the (*j*, *k*) element of **Σ*** _{ε}* is denoted as

The Taylor series approximation to computing expectations of inverses of the Box-Cox transformation is used here because it was used by Kipnis, et al. (2009, 2010). More precise quadrature formulae can be used, and we have done so, finding almost no numerical changes. The computational convenience of the approximation makes it attractive.

There are two restrictions necessary in the specification of **Σ*** _{ε}*. First, following Kipnis, et al. (2009, 2010), we set

(8)

The difficulty with parameterizations such as (8) is that (*s*_{13}, *s*_{23}, *s*_{22}, *s*_{33}) cannot be left unconstrained, or else (8) need not be a covariance matrix. Define
and *s*_{23} = *ρ*_{23}(*s*_{22}*s*_{33})^{1}^{/}^{2}. Then the determinant
. Since **Σ*** _{ε}* is a covariance matrix, its determinant must be non-negative, and hence we cannot allow the correlations (

The zero entries in (8) are not required, although they are implicit in the two part model used in the original papers involving only the episodically consumed dietary component and not energy (Tooze, et al., 2006; Kipnis, et al., 2009) and they make intuitive sense in our context. We have chosen to use this restriction for these reasons and especially so that the marginal model for the episodically consumed dietary component is the same as that in the literature.

Kipnis, et al. (2010) explore a sample selection model (Heckman, 1976, 1979; Leung and Yu, 1996; Kyriazidou, 1997; Min and Agresti, 2002) that does not have this restriction. They found that such a sample selection model can be very unstable in our context, with the components of **Σ _{u}** and

It is very important to allow for **Σ*** _{ε}* being non-diagonal. The term

It is possible in principle to fit model (1)–(8) using nonlinear mixed effects software. Kipnis, et al. (2010) use the SAS procedure PROC NLMIXED. However, we have found that such implementation is slow and not very stable, with many issues of convergence. NLMIXED uses adaptive Gaussian quadrature to integrate the likelihood over the distribution of random effects. NLMIXED can have convergence problems, especially when there are too many, or too few, zeros. What typically happens is that corr(*U _{i}*

Hence, for stability and speed, we have turned to a Bayesian approach for fitting the model described by equations (1)–(8). We emphasize that the Markov Chain Monte Carlo computation can either be thought of as a strictly Bayesian computation with ordinary Bayesian inference, or as a means of developing *frequentist* estimators of the crucial parameters, based on the well-known fact that in parametric models such as ours, the posterior mean of the parameters is a consistent and asymptotically normally distributed frequentist estimator, see for example Lehmann and Casella (1998, Chapter 6.8).

Our computational algorithm, described in detail in the appendix, uses Gibbs sampling with some Metropolis-Hastings steps. We have implemented this approach in both Matlab and R, and it is fast enough for practical use. In the NIH-AARP Diet and Health Study described in Section 3, with a sample size of 899, for a burn-in of 1, 000 steps followed by 10, 000 MCMC iterations, our Matlab and R programs take approximately 2 minutes and 11.7 minutes on an Intel(R) Xeon(TM) CPU with 3.73GHz and 7.8GB of RAM in a Linux system, respectively. For a burn-in of 5, 000 steps followed by 15, 000 MCMC iterations, our Matlab and R programs take approximately 3 minutes and 17.5 minutes, respectively. Both programs are available from the first author.

We have also developed an implementation in WinBUGS with a BUGS model called from R by using the package R2WinBUGS. Details are available from the third author. As to be expected, the WinBUGS code is much slower than the custom programs, taking approximately 5 hours (Pentium computer with 3.5GHz CPU and 1.99GB of RAM in a Windows system) for a burn-in of 1, 000 steps followed by 10, 000 MCMC samples. We are also currently developing a SAS macro for use by the nutritional community. On various test data sets, the WinBUGS, R, SAS and Matlab code gave very similar answers. In our empirical work, we use the Matlab code.

There are important data conventions that we use. These are described in detail in the Appendix. For example, in Section A.1, we mention that covariates are always standardized to have sample mean zero and sample variance one. The reason is a matter of scaling: energy intake is in terms of calories, which are typically in the 1,000’s, so that the corresponding regression parameters, without standardization, with the FFQ energy as a covariate, would necessarily be tiny, making it hard to develop a plausible prior distribution. As described in Section A.1, we also standardize the responses for numerical stability and weaken dependence upon the prior distributions, and in Section A.2 we describe why this standardization makes sense. We have fit our method with various different prior distributions, and there is very little sensitivity to prior specification.

Covariates are important for estimating the distribution of usual intakes, for at least three reasons.

- First, as a matter of model specification. Consider abstractly the simple linear regression model
*Y*=*β*_{0}+*β*_{1}*X*+*ε*: given*X*,*ε*might be normally distributed, but if*X*is not simultaneously normally distributed, then removing it from the model would give a model*Y*=*κ*_{0}+*ξ*, and*ξ*would not be normally distributed, and our model assumptions would be violated. - Subar, et al. (2006) studied using food frequency questionnaire (FFQ) data as covariates to estimate the distributions of individual usual intakes of episodically consumed dietary components. They found strong and consistent relationships between FFQ and 24hr. This supports the postulate that FFQ data may provide important covariate information in supplementing 24hr for estimating usual intake of dietary components. Besides FFQ, there are some other clinical covariates such as gender, age, body mass index (BMI), etc. that may be associated with usual intake. Thus, our covariates included an intercept, age, BMI, the FFQ for energy intake and the FFQ for the dietary component of interest. They are used to reduce the error with which the usual intake is estimated, and to make more plausible our distributional assumptions.
- Kipnis, et al. (2009) state in their abstract “
*One feature of the proposed method is that additional covariates potentially related to usual intake may be used to increase the precision of estimates of usual intake and of diet-health outcome associations*”. In their introduction they state “*In Section 3*,*using data from the Eating at Americas Table Study (EATS)*,*we quantify the increased precision obtained from including a FFQ report as a covariate*”.

A referee has asked whether the *β*-coefficients for the covariates are interpretable, and whether it would be of interest to make inferences about whether the covariates are associated with usual intake. Because energy adjusted usual intakes involve three *β*-coefficients for each covariates, interpretation of any one of them is difficult. Whether a particular covariate is associated with usual intake is a mildly interesting question, but if far less important than estimating distributions of energy-adjusted usual intakes.

We performed a simulation study that was based upon our empirical study given in Section 3, in order to ascertain whether the methodology results in reasonably unbiased estimates of (*β*_{1}, *β*_{2}, *β*_{3}, **Σ _{u}**,

The mean of the posterior means of (*β*_{1}, *β*_{2}, *β*_{3}) was unbiased overall and are not reported here. The mean of the posterior means of (**Σ _{u}**,

Crucially, for the main purposes of estimating the distribution of usual intakes, the posterior means were essentially unbiased for estimating **Σ*** _{u}*. As seen in the Appendix,

We give here only the results of a single simulation because what we have shown above are representative of other simulations we have done. For example, we have simulated cases where the off-diagonal elements of **Σ*** _{u}* were zero and cases where some of them were negative. We have also simulated cases that the diagonal elements of

We have not displayed the simulation results for the Proc NLMIXED procedure because in those cases that it converges, it is very nearly unbiased, just like our method.

The NIH-AARP Diet and Health Study, see http://dietandhealth.cancer.gov/ and Schatzkin, et al. (2001), has two components, the main study with diet assessed by a Food Frequency Questionnaire (FFQ) and a calibration substudy with additional diet assessment by two 24hr. We considered a part of the main study that consists of *n _{p}* = 142, 364 women, who contributed an FFQ as well as relevant demographic characteristics. The data used were the same as in Sinha, et al. (2010). The covariates

In addition to the primary sample, there was a subsample of *n _{v}* = 899 women in the calibration sub-study who completed an FFQ and demographic characteristics, so that there are

We illustrate our computational algorithm using data from both the two 24hr and the FFQ for whole grains, fish and energy intake, along with covariates. Following Kipnis, et al. (2009, 2010), the FFQ values for fish, whole grain and energy intake were transformed using *λ* = 0.25, *λ* = 0.33 and *λ* = 0.00, respectively. The 24hr used *λ* = 0.50, *λ* = 0.33 and *λ* = 0.33, respectively.

The MCMC calculations result in samples from the posterior distribution of
, **Σ*** _{u}*,

- Usual dietary component intake is
*T*_{Fi} - Usual energy intake is
*T*= 𝓖_{Ei}_{2}{**X**_{i3},*β*_{3},*U*_{i}_{3}, ∑_{ε}(3, 3)}, see (7).

For both usual dietary component intake and usual energy intake, 24hr samples are available for *i* = *n _{p}* + 1, ...,

We are going to write the variable of interest as (*T _{Fi}*,

for *i* = 1, ..., *n _{p}* +

Estimation of the distribution of 𝓠 across the population is easily accomplished by a Monte-Carlo computation. This is a different Monte-Carlo computation than the MCMC, and is performed after the MCMC has been done. Specifically, for a large *B*, where we took *B* = 5, 000, and for *b* = 1, ..., *B* generate (*U _{bi}*

taken across *i* = 1, ..., *n _{v}* +

Standard errors and confidence intervals for the distribution of usual intake can be formed easily by bootstrapping. We used 400 bootstrap samples in our numerical work.

For bootstrap confidence intervals, it is often recommended to use at least 399 bootstrap samples, as we have done, see for example Davidson and MacKinnon (1999). We have experimented with using up to 1, 000 bootstrap samples, but this significantly increases computing time without changing the basic results in any material way.

As described below, Bayesian inference on the distribution of usual intake depends on estimating the distribution of the covariates. The distribution of usual intake (*T _{F}*,

Then the distribution of usual intake is

We suggest approximating this using Monte-Carlo integration, as follows. Again, let *B* be large where we took *B* = 1, 000, and for *b* = 1, ...*B*, let **u*** _{b}* = Normal(0,

The posterior distribution of *F*(*v*|𝓑, **Σ*** _{u}*,

In the NIH-AARP Diet and Health Study, with a sample size of *n _{p}* +

The posterior distribution of *F*(*v*|𝓑, **Σ*** _{u}*,

Along with illustrating the distributions of usual intakes of the dietary components adjusted for energy, we also compared our results with NLMIXED.

We used a burn-in of 5,000 steps followed by 15,000 MCMC samples. We saved every 10* ^{th}* sample to reduce autocorrelation.

In Table 1 we present summary statistics (mean, standard deviation and selected percentiles) of the usual intakes as well as the usual intakes adjusted for energy. Figures 1 and and22 give density estimates for usual intake and energy adjusted intake of fish and whole grains, respectively: a similar plot for usual energy intake was also produced but not displayed here. The evident skewness of the usual intakes of fish and whole grains is expected, as are the somewhat less skewed nature of the energy adjusted intakes.

Density estimates for fish. The solid line is the density estimate for usual intake in the unit of oz. The dashed line is the density estimate for usual intake per 1000 kilo-calories.

Density estimates for whole grains. The solid line is the density estimate for usual intake in the unit of cups. The dashed line is the density estimate for usual intake per 1000 kilo-calories.

Estimated distributions of the usual intake for Whole Grains, Fish and Energy and the estimated distributions of energy-adjusted usual intake for Whole Grains and Fish, for women. The 5^{th} percentile of the distribution is labeled as 5^{th}, etc. For energy-adjusted **...**

We bootstrapped the validation and primary data sets separately 400 times, see Remark 6, reran the analysis, and formed bootstrap confidence intervals. Since the distribution of the covariates *X* is essentially known because of the size of the primary study, this bootstrap simply reflects the uncertainty in the parameter estimates as they propagate through to usual intakes. To give a graphical summary including uncertainty, in Figure 3 we plot the actual estimated percentiles of the distribution of adjusted fish intake against the percentile number, as well as the 95% pointwise bootstrap confidence interval for these percentiles.

In Table 1 we also give the Bayesian analysis for energy-adjusted fish intake. As seen there, the Bayesian analysis posterior means of the distribution of energy-adjusted fish intake is nearly identical to the frequentist analysis. The same thing was found for all the columns in Table 1.

In addition, posterior credible interval lengths were almost equivalent to those of the frequentist method and are not displayed here.

We described in Section 2.4 some of the motivation for our computational approach. In this section, we show documentation of those claims.

First, in Table 2, we describe aspects of the analysis for women of whole grains, fish and dark-green vegetables, using the AARP data set. The first line in the table is the number of minutes of computation for the nonlinear mixed effects program and our MCMC approach. It can be seen that the MCMC approach is considerably faster. While not displayed here, for Milk for men, which had only 12% reported non-consumption on the 24hr, the nonlinear mixed effects program took 200 minutes, while ours took only 4 minutes. This illustrates our claim concerning speed of computation.

Comparison between two computational methods, “NLMIXED” and “MCMC”, to fit the bivariate nonlinear mixed effects model, for whole grains, fish and dark-green vegetables. Displayed are the estimates of correlations among **...**

A second aspect is that we claimed that sometimes the nonlinear mixed effects analysis of Kipnis, et al. (2010) suffered from convergence to a singular covariance matrix estimate for **Σ*** _{u}*. This occurred for dark-green vegetables, see Table 2, where it was estimated that the correlation between (

A third aspect of the comparison is that we claimed that when the method of Kipnis, et al. (2010) converged to a reasonable answer, our results were in general agreement with theirs. This is borne out in Table 2, where we have listed the standard errors of the estimates using the Hessian for the nonlinear mixed effects analysis, and using the MCMC samples for our method. The estimates are quite similar with the exception of corr(*U _{i}*

While it may seem obvious, it is useful to clarify what we mean by the term “*convergence*”. We are not meaning asymptotic rates of convergence, because these are the standard *n*^{1}^{/}^{2}-type one sees in parametric models. We are also not talking about theoretical rates of numerical convergence, e.g., how fast is convergence of the Proc NLMIXED procedure in terms of number of iterations. Instead, for us the term convergence has the meaning that Proc NLMIXED announces that it has converged to a solution with a nonsingular Hessian. Of course, our method, being based on proper priors, converges in the usual MCMC sense.

Understanding the distribution of energy-adjusted usual intake of episodically consumed dietary components is of considerable public health importance, having implications for basic understanding of both dietary component composition and policy. Being able to correct for measurement error due to within-person variation in short-term assessment of intake, when investigating diet-disease relationships in cohort studies, is equally important. Because of the importance of these problems, models and fitting methods for addressing them will find wide use in the nutrition community. Thus, it is not only important that the models are reasonable, but that the fitting methods be reasonably fast, that they converge, and that the answers from the fitting methods usually make sense. The main point of this paper has been to show that an MCMC approach satisfies these criteria, and has the potential to be used widely in the nutrition community. The fact that the MCMC approach can be used in a frequentist sense is a new insight for nutritional epidemiology, which is decidedly frequentist in orientation, although the MCMC model fitting can also allow Bayesian inference.

There is an enormous literature on measurement error models, both parametric and nonparametric, for estimating distributions (e.g., Fan, 1991; Wand, 1998; Johnson, et al., 2007; Staudenmeyer, et al., 2008; Delaigle, et al, 2008 among many others) and in regression (Ferrari, et al., 2004, e.g., Liang and Wang, 2005 among many others). Many more references are given in Carroll, et al. (2006). However, none of these papers deal with our topic of episodically consumed and hence zero-inflated dietary components along with continuous components that involve skewness, a structured covariance matrix, correlations of random effects, and usual intakes on the original data scale.

An issue of practically much less importance is that the model of Kipnis, et al. (2010) in equation (6) assumes that each food is consumed by all individuals. Kipnis, et al. (2009) address this issue, by adding a fixed effect regression so as to model never-consumers. They show that even without energy in the model, and with only two 24hr as is standard for such data, their method was numerically very unstable. Our method easily handles such an extension, but its practical implications are not particularly clear when, for example, in other studies, less than 0.5% of subjects claimed on the FFQ never to eat fish or whole grains.

User-friendly SAS macros are being written for distribution to the nutrition community. These programs will also allow sampling weights, so that they can be used in population-based survey samples, and will thus be of interest both nationally and internationally. We are presently working on extending the methods to analyze multiple foods and nutrients simultaneously, with allowance for survey weights, so that analysis of dietary patterns and dietary composite scores can be undertaken.

Click here to view.^{(393K, doc)}

Click here to view.^{(61K, pdf)}

This preprint was requested by reviewer #3

Fig. 1 for the requested preprint.

Standardization is important in MCMC applications both for numerical stability and to allow fairly off-the-shelf prior distributions to make sense. Prior to analysis, we standardized the covariates to have mean 0.0 and variance 1.0. The observed, transformed non-zero 24hr were standardized to have mean 0.0 and variance 2.0. More precisely, we first transformed the non-zero dietary component data as *Z _{i}*

We will first describe the algorithm used in terms of the *Q _{ijk}*, and then in Section A.11, we describe the back-transformation method that we used to obtain estimation and inference for usual intake.

Having the total variability of the non-zero transformed responses equal to 2.0 is extraordinarily convenient. Effectively, this means that var(*U _{ij}*) + var(

Because the data were standardized, following the discussion of Remark 8, we used the following conventions.

- The priors for all
*β*were normal with mean zero and variance 100._{j} - The prior for
**Σ**was exchangeable with diagonal entries all equal to 1.0 and correlations 0.50. There was 5 degrees of freedom in the inverse Wishart prior, i.e.,_{u}*m*= 5. Thus, the prior is IW{(_{u}*m*− 3 − 1)_{u}**Ω**,_{u}*m*}._{u} - The priors for
*s*_{22}and*s*_{33}were Uniform[0,3]. This range is reasonable because of the standardization. - The priors for (
*γ*,*θ*) were uniform on their range.

We experimented with different priors for **Σ*** _{u}*, e.g., setting the correlations equal to 0.0, setting the diagonal elements equal to 0.5, etc. The results were essentially unchanged when these were done.

While we observe * _{ik}*, in the MCMC we need to generate the latent variables

- For energy,
*Q*_{i}_{3}=_{k}*W*_{i}_{3}, no data need to be generated._{k} - For the amounts,
*Q*_{i}_{2}, we just simply set_{k}*W*_{i}_{2}=_{k}*Q*_{i}_{2}._{k} - For consumption, we generate
= (_{i}*U*_{i}_{1},*U*_{i}_{2},*U*_{i}_{3})^{T}as normally distribution with mean zero and covariance matrix given as the prior covariance matrix for**Σ**. We then also compute , where Z_{u}= Normal(0, 1) are generated independently. We then set_{ik}*W*_{i}_{1}=_{k}*z*_{ik}Q_{i}_{1}−_{k}*z*(1 −_{ik}*Q*_{i}_{1})._{k} - We then updated
by a single application of the updates given in Section A.9._{ik}

The loglikelihood of the complete data is

The complete conditionals for (*γ*, *θ*, *s*_{22}, *s*_{33}) do not have an explicit form, so we use a Metropolis-Hastings within Gibbs sampler to generate them in turn. Since **Σ*** _{ε}* is determined by

For convenience, we set *γ* to be discrete with 41 equally-spaced values on its range. Let *γ _{t}* be the current value. The candidate value

where {•} means that the term before *f*(·) is transposed and substituted. If the candidate *y* is accepted, then *γ _{t}*

This is done exactly as for *γ*, except now

If the candidate *y* is accepted, then *θ _{t}*

Suppose the current value of *s*_{22} is *s*_{22,}
* _{t}*. A candidate value y is generated from the Uniform distribution of length 0.4 with mean

If the candidate is accepted, then *s*_{22,}
_{t}_{+1} = *y*. Otherwise, *s*_{22,}
_{t}_{+1} = *s*_{22,}
* _{t}*.

This is the same as that for *s*_{22}, except now

If the candidate is accepted, then *s*_{33,}
_{t}_{+1} = *y*. Otherwise, *s*_{33,}
_{t}_{+1} = *s*_{33,}
* _{t}*.

By “rest”, we mean all the observable data, latent variables and parameters other than the one in question. By inspection, the complete conditional for **Σ*** _{u}* is

Let the elements of
be
. For any *j*, except for irrelevant constants,

which implies [*β** _{j}* |rest] ~ Normal(𝓒

Except for irrelevant constants, and remembering that *j* = 1, ..., 3,

which implies [* _{i}*|rest] ~ Normal(𝓒

Here we do the complete conditional for *W** _{ik}* with = 1. Except for irrelevant constants,

where

If we use the notation TN_{+}(*μ*, *σ*, *c*) for a normal random variable with mean *μ*, standard deviation *σ* is truncated from the left at *c*, and TN_{−}(*μ*, *σ*, *c*) is truncated from the right at *c*, then it follows that with *μ* = 𝓒_{2}𝓒_{1} and
,

Generating TN_{+}(0, 1, *c*) is easy: if *c* < 0, simply do rejection sampling of a Normal(0, 1) until you get one that is > *c*. If *c* > 0, there is an adaptive rejection scheme (Robert, 1995). The “truncated normal” was used because the latent variable *W _{i}*

For *p* = 2, the variable *W _{ipk}* is not observed when

where

Therefore,

Here we show how to go from the transformed and standardized data to usual intakes. We first consider energy, where we used the transformation

When *λ _{E}* = 0, the back-transformation is

When *λ _{E}* ≠ 0, the back-transformation is

(A.1)

(A.2)

Define

As in Kipnis, et al. (2009), the usual intake of energy for person *i* is

Similarly, a person’s usual intake of the dietary component on the original scale is defined as

**Author Notes:** This paper forms part of the Ph.D. dissertation of the first author at Texas A&M University. The research of Zhang, Perez and Carroll was supported by a grant from the National Cancer Institute (R37-CA057030). This publication is based in part on work supported by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).

Saijuan Zhang, Texas A&M University.

Susan M. Krebs-Smith, National Cancer Institute.

Douglas Midthune, National Cancer Institute.

Adriana Perez, University of Texas School of Public Health.

Dennis W. Buckman, Information Management Services, Inc.

Victor Kipnis, National Cancer Institute.

Laurence S. Freedman, Gertner Institute for Epidemiology and Public Health Research.

Kevin W. Dodd, National Cancer Institute.

Raymond J Carroll, Texas A&M University.

- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. Second Edition. Chapman and Hall; CRC Press; 2006.
- Casella G, George E. Explaining the Gibbs Sampler. The American Statistician. 1992;46:167–174. doi: 10.2307/2685208. [Cross Ref]
- Chib S, Greenberg E. Understanding the Metropolis-Hastings Algorithm. The American Statistician. 1995;49:327–335. doi: 10.2307/2684568. [Cross Ref]
- Davidson R, MacKinnon JG. Bootstrap testing in nonlinear models. International Economic Review. 1999;40:487–508. doi: 10.1111/1468-2354.00026. [Cross Ref]
- Delaigle A, Hall P, Meister A. On Deconvolution with repeated measurements. Annals of Statistics. 2008;36:665–685. doi: 10.1214/009053607000000884. [Cross Ref]
- Fan J. On the optimal rates of convergence for nonparametric deconvolution problems. Annals of Statistics. 1991;19:1257–1272. doi: 10.1214/aos/1176348248. [Cross Ref]
- Ferrari P, Kaaks R, Fahey MT, Slimani N, Day NE, Pera G, Boshuizen HC, Roddam A, Boeing H, Nagel G, Thiebaut A, Orfanos P, Krogh P, Braaten T, Riboli E. Within- and between-cohort variation in measured macronutrient intakes, taking account of measurement errors, in the European Prospective Investigation Into Cancer and Nutrition Study. American Journal of Epidemiology. 2004;160:814–822. doi: 10.1093/aje/kwh280. [PubMed] [Cross Ref]
- Heckman J. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. Annals of Economics and Social Management. 1976;5:475–592.
- Heckman J. Sample selection bias as a specification error. Econometrica. 1979;47:153–161. doi: 10.2307/1912352. [Cross Ref]
- Johnson BA, Herring AH, Ibrahim JG, Siega-Riz AM. Structured measurement error in nutritional epidemiology: Applications in the pregnancy, infection, and nutrition (PIN) study. Journal of the American Statistical Association. 2007;102:856–866. doi: 10.1198/016214506000000771. [PMC free article] [PubMed] [Cross Ref]
- Kass RE, Carlin BP, Gelman A, Neal RM. Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Statistical Science. 1998;52(2):93–100.
- Kipnis V, Midthune D, Buckman DW, Dodd KW, Guenther PM, Krebs-Smith SM, Subar AF, Tooze JA, Carroll RJ, Freedman LS. Modeling data with excess zeros and measurement error: application to evaluating relationships between episodically consumed foods and health outcomes. Biometrics. 2009;65:1003–1010. doi: 10.1111/j.1541-0420.2009.01223.x. [PMC free article] [PubMed] [Cross Ref]
- Kipnis V, Freedman LS, Carroll RJ, Midthune D. A bivariate measurement error model for an episodically consumed dietary component and energy: application to epidemiology. 2010. Preprint.
- Kyriazidou E. Estimation of a panel data sample selection model. Econometrica. 1997;65:1335–1364. doi: 10.2307/2171739. [Cross Ref]
- Lehmann EL, Casella G. Theory of Point Estimation. Springer; New York: 1998.
- Leung SF, Yu S. On the choice between sample selection and two-part models. Journal of Econometrics. 1996;72:197–229. doi: 10.1016/0304-4076(94)01720-4. [Cross Ref]
- Li L, Shao J, Palta M. A longitudinal measurement error model with a semicontinuous covariate. Biometrics. 2005;61:824–830. doi: 10.1111/j.1541-0420.2005.00342.x. [PubMed] [Cross Ref]
- Liang H, Wang NS. Partially linear single-index measurement error models. Statistica Sinica. 2005;15:99–116.
- Min Y, Agresti A. Modeling nonnegative data with clumping at zero: a survey. Journal of the Iranian Statistical Society. 2002;1–2:7–33.
- Nusser SM, Fuller WA, Guenther PM. Estimating usual dietary intake distributions: Adjusting for measurement error and nonnormality in 24-hour food intake data. In: Lyberg L, Biemer P, Collins M, Deleeuw E, Dippo C, Schwartz N, Trewin D, editors. Survey Measurement and Process Quality. New York: Wiley; 1997. 1997. pp. 670–689.
- Olsen MK, Schafer JL. A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association. 2001;96:730–745. doi: 10.1198/016214501753168389. [Cross Ref]
- Robert CP. Simulation of truncated normal variables. Statistics and Computing. 1995;5:121–125. doi: 10.1007/BF00143942. [Cross Ref]
- Schatzkin A, Subar AF, Thompson FE, Harlan LC, Tangrea J, Hollenbeck AR, Hurwitz PE, Coyle L, Schussler N, Michaud DS, Freedman LS, Brown CC, Midthune D, Kipnis V. Design and serendipity in establishing a large cohort with wide dietary intake distributions: the National Institutes of Health-AARP Diet and Health Study. American Journal of Epidemiology. 2001;154:1119–25. doi: 10.1093/aje/154.12.1119. [PubMed] [Cross Ref]
- Sinha S, Mallick BK, Kipnis V, Carroll RJ. Semiparametric Bayesian analysis of nutritional epidemiology data in the presence of measurement error. Biometrics. 2010. to appear. [PMC free article] [PubMed]
- Staudenmayer J, Ruppert D, Buonaccorsi J. Density estimation in the presence of heteroskedastic measurement error. Journal of the American Statistical Association. 2008;103:726–736. doi: 10.1198/016214508000000328. [Cross Ref]
- Subar AF, Dodd KW, Guenther PM, Kipnis V, Midthune D, McDowell M, Tooze JA, Freedman LS, Krebs-Smith SM. The food propensity questionnaire: concept, development, and validation for use as a covariate in a model to estimate usual food intake. Journal of the American Dietetic Association. 2006;106(10):1556–63. doi: 10.1016/j.jada.2006.07.002. [PubMed] [Cross Ref]
- Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data clumping at zero. Statistical Methods in Medical Research. 2002;11:341–355. doi: 10.1191/0962280202sm291ra. [PubMed] [Cross Ref]
- Tooze JA, Midthune D, Dodd KW, Freedman LS, Krebs-Smith SM, Subar AF, Guenther PM, Carroll RJ, Kipnis V. A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution. Journal of the American Dietetic Association. 2006;106:1575–1587. doi: 10.1016/j.jada.2006.07.003. [PMC free article] [PubMed] [Cross Ref]
- Wand MP. Finite sample performance of deconvolving kernel density estimators. Statistics and Probability Letters. 1998;37:131–139. doi: 10.1016/S0167-7152(97)00110-7. [Cross Ref]

Articles from The International Journal of Biostatistics are provided here courtesy of **Berkeley Electronic Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |