The Adventist Health Study (AHS)–2 is a national cohort study.9
The data reported here pertain to the first 19,611 subjects enrolled, mainly from the western United States. Subjects are Seventh-day Adventists older than 29 years of age. About 50% of Adventists are vegetarian and many do not drink coffee. A long questionnaire (48 pages) was designed for this study9–11 and takes 1.5–3 hours to complete (30–60 min for the FFQ). The FFQ section (13 pages) from which items for these analyses were selected contains 130 specified foods, each item having between 7 and 9 possible frequency responses. The first response category, labeled “never or rarely,” is specifically identified in methods described later. A standard portion size is given for each item and the subject chooses “standard,” “one-half or less than standard,” or “one and one-half or more than standard.”
We identified about 80 “key foods.” Based on previous pilot work,12 these included all sets of 4–5 foods that contributed most strongly to validity correlation coefficients for each of 18 FFQ indices of nutrients, vitamins, and minerals.
A random 20% of all subjects with 1 or more of these key items missing (the “sample–missing population”) were contacted by telephone for the purpose of guiding multiple imputation.8
In this sample, missing data for particular foods are also a random 20% sample of missing data for each food. Averaging across key foods, it was possible to fill in 92% of these missing data. The average time between the original questionnaire and the follow-up phone call was about 1 year. Subjects were asked to recall their diet at the time of the original questionnaire. We assume that data reported by telephone are not systematically different from those which would have originally been reported on the questionnaire if not omitted.
Data from the successfully contacted subjects in the 20% sample–missing population are replicated 4 times and combined with data from subjects with “initially complete” information (n = 9165) to provide what we reasonably assume are approximately unbiased estimates of completed data for the total population (the “estimated complete population”).
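The pooling step can be sketched in a few lines of Python. This is a minimal illustration, not the study's own code: the function names and the toy responses are hypothetical, and it assumes that "replicated 4 times" means four additional copies, so that each contacted subject from the 20% sample counts five times in total.

```python
from collections import Counter


def estimated_complete_counts(initially_complete, filled_in, replicates=4):
    """Pool response categories for one food.

    Each filled-in response is counted 1 + replicates times (an assumed
    reading of "replicated 4 times"), so that the 20% sample stands in
    for all subjects who initially left the item blank.
    """
    weight = 1 + replicates
    counts = Counter(initially_complete)
    for response in filled_in:
        counts[response] += weight
    return counts


def proportions(counts):
    """Convert category counts to proportions of the pooled total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical responses for a single food item.
    complete = ["never or rarely", "never or rarely", "weekly"]
    phoned = ["weekly"]
    print(proportions(estimated_complete_counts(complete, phoned)))
```

Carrying an explicit weight instead of physically copying rows gives the same pooled distribution and keeps the replication factor visible as a parameter.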
For each food, 2 statistical tests were performed to evaluate the distribution of filled-in data that were initially missing. The first test evaluates the hypothesis that this distribution does not differ from that of estimated complete data for the total population. In practice, the test compared 2 independent data sets for each food: the initially complete data for that food, and the initially missing data for that food. The second test excluded subjects who responded “never or rarely,” thus evaluating whether the distributions among the remaining response categories differ. The dietary variables in this table were chosen a priori to include a representative range of foods eaten less or more commonly by this population.
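The two comparisons can be illustrated as follows. The text does not name the test statistic, so the Pearson chi-square test of homogeneity used here is an assumption, and the function names and counts are hypothetical. The second comparison simply drops the “never or rarely” category before repeating the first.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two groups.

    counts_a, counts_b: dicts mapping response category -> count, e.g. the
    initially complete and the filled-in (initially missing) data for a food.
    """
    categories = [c for c in sorted(set(counts_a) | set(counts_b))
                  if counts_a.get(c, 0) + counts_b.get(c, 0) > 0]
    total_a = sum(counts_a.get(c, 0) for c in categories)
    total_b = sum(counts_b.get(c, 0) for c in categories)
    grand = total_a + total_b
    stat = 0.0
    for c in categories:
        column = counts_a.get(c, 0) + counts_b.get(c, 0)
        for observed, row_total in ((counts_a.get(c, 0), total_a),
                                    (counts_b.get(c, 0), total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(categories) - 1


def without_category(counts, category="never or rarely"):
    """Return the counts with one response category removed."""
    return {c: n for c, n in counts.items() if c != category}


if __name__ == "__main__":
    # Hypothetical counts for one food.
    complete = {"never or rarely": 10, "monthly": 20, "weekly": 30}
    filled = {"never or rarely": 20, "monthly": 20, "weekly": 20}
    print(chi_square_homogeneity(complete, filled))
    print(chi_square_homogeneity(without_category(complete),
                                 without_category(filled)))
```

The statistic would be referred to a chi-square distribution with the returned degrees of freedom; the p-value calculation is omitted to keep the sketch free of third-party dependencies.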
To evaluate whether covariates could predict the proportion of filled-in, initially missing data that were actually zero, we used a logistic regression with a binomial rather than Bernoulli distribution function, among only those in the sample–missing population. The $i$th subject contributed 1 vector observation of length $N_i$, of zeroes/nonzeroes indicating the final disposition of each data point, where $N_i$ is the number of initially missed key variables. The link function was logistic and the error distribution overdispersed “binomial” with a dispersion coefficient ($\sigma^2$) of 1.92, which was incorporated into statistical tests.
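How a dispersion coefficient enters the tests can be sketched as below. The text does not say how the coefficient was estimated; the Pearson estimator (chi-square divided by residual degrees of freedom) is a common choice and is assumed here. The function names, the toy data, and the coefficient and standard error in the demo are hypothetical; only the value 1.92 comes from the text.

```python
import math


def pearson_dispersion(zero_counts, n_missing, fitted_probs, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    zero_counts[i]  : number of subject i's filled-in items that were zero
    n_missing[i]    : N_i, the number of initially missed key variables
    fitted_probs[i] : fitted probability of a zero for subject i
    n_params        : number of regression parameters estimated
    """
    chi_square = 0.0
    for y, n, p in zip(zero_counts, n_missing, fitted_probs):
        expected = n * p
        variance = n * p * (1.0 - p)
        chi_square += (y - expected) ** 2 / variance
    return chi_square / (len(zero_counts) - n_params)


def overdispersed_wald_z(coefficient, binomial_se, dispersion):
    """Wald z statistic after inflating the binomial SE by sqrt(dispersion)."""
    return coefficient / (binomial_se * math.sqrt(dispersion))


if __name__ == "__main__":
    # Toy data: four subjects, each with 4 initially missing items.
    print(pearson_dispersion([0, 2, 4, 1], [4, 4, 4, 4], [0.5] * 4, n_params=1))
    # With a dispersion of 1.92, standard errors grow by sqrt(1.92).
    print(overdispersed_wald_z(0.5, 0.1, 1.92))
```

A dispersion above 1 widens confidence intervals and shrinks test statistics relative to the plain binomial model, which is the practical consequence of incorporating it into the tests.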
A log-linear analysis was used to identify demographic factors that predict the number of missing variables ($N_i$) among subjects in the total population as initially observed, excluding any filled-in data. All 2-way interactions and 3-way interactions that do not include the variable $N$, and age*education*$N$, are included, and these provided a satisfactory model fit with deviance
We now derive a formula to describe the effects of zero imputation of missing data. Let missing data among nonzero values of dietary variables, $X_j$, be completely at random, as is approximately the case in our data. Note that the pattern of missing data among zero-valued data does not affect the calculations, as they are automatically again replaced by zeroes in the imputation. Let $P_{nzmj}$ be the proportion of missing data among nonzero values of dietary variable $X_j$, let $\mu_j$ be the overall mean of $X_j$, and let $\sigma_j^2$ be its variance. After imputing zeroes for missing data we relabel the $X_j$ as $X'_j$. It can be shown that the regression of some dependent variable $Y$ on the sum of several $X'_j$ ($\sum_j X'_j$) has a beta coefficient $\beta'$, the slope of this regression, given by Equation 1.
Summing dietary variables, $X_j$, is a common procedure when estimating intake from a food group or of a particular nutrient.
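The slope after zero imputation can be worked out from the stated assumptions. The following is a sketch of that calculation, not a reproduction of the printed equation, whose exact form may differ. It writes $\mu_j$ and $\sigma_j^2$ for the overall mean and variance of $X_j$, and introduces two symbols that are assumptions of this sketch: $R_j$, an indicator that a value of $X_j$ is retained (independent of $X_j$, of $Y$, and across foods, with $\Pr(R_j = 0) = P_{nzmj}$), and $\sigma_{jk}$, the covariance of $X_j$ and $X_k$. Zero imputation then gives $X'_j = R_j X_j$.

```latex
\begin{align*}
E(X'_j) &= (1 - P_{nzmj})\,\mu_j \\
\operatorname{Var}(X'_j) &= (1 - P_{nzmj})\,\sigma_j^2
    + P_{nzmj}(1 - P_{nzmj})\,\mu_j^2 \\
\operatorname{Cov}(X'_j, X'_k) &= (1 - P_{nzmj})(1 - P_{nzmk})\,\sigma_{jk}
    \qquad (j \neq k) \\
\operatorname{Cov}(Y, X'_j) &= (1 - P_{nzmj})\operatorname{Cov}(Y, X_j) \\[4pt]
\beta' &= \frac{\operatorname{Cov}\bigl(Y, \sum_j X'_j\bigr)}
               {\operatorname{Var}\bigl(\sum_j X'_j\bigr)}
  = \frac{\sum_j (1 - P_{nzmj})\operatorname{Cov}(Y, X_j)}
         {\sum_j \bigl[(1 - P_{nzmj})\,\sigma_j^2
            + P_{nzmj}(1 - P_{nzmj})\,\mu_j^2\bigr]
          + \sum_{j \neq k} (1 - P_{nzmj})(1 - P_{nzmk})\,\sigma_{jk}}
\end{align*}
```

Setting every $P_{nzmj}$ to zero recovers the ordinary slope of $Y$ on $\sum_j X_j$. The extra $P_{nzmj}(1 - P_{nzmj})\,\mu_j^2$ terms in the denominator are what the imputed zeroes contribute to the variance of each food.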
A ratio of interest is $\beta'/\beta$, where $\beta$ is the “true” coefficient obtained when $P_{nzmj} = 0$. Equation 1 assumes that missing data are assorted independently between the $X_j$. While this is unlikely, it is a conservative assumption because a positive correlation between missing data from different $X$ variables will further decrease the ratio $\beta'/\beta$. However, in practical situations the covariance terms are generally much less likely than the squared-mean ($\mu_j^2$) terms to create bias.
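A numerical check can make the attenuation concrete. The sketch below is illustrative only: it assumes missingness that is independent of the food values, of the outcome, and across foods, and it uses hypothetical function names, exponential intake distributions, and a made-up outcome model. `attenuation_ratio` evaluates the ratio of the slope after zero imputation to the true slope from means, variances, and covariances under those assumptions; `simulated_ratio` estimates the same ratio by Monte Carlo.

```python
import random


def attenuation_ratio(p_miss, means, variances, cov_xy, cov_xx=None):
    """Ratio beta'/beta for Y regressed on a sum of zero-imputed foods.

    p_miss[j]    : proportion of nonzero values of food j that are missing
    means[j]     : overall mean of food j
    variances[j] : overall variance of food j
    cov_xy[j]    : covariance of food j with the outcome Y
    cov_xx[j][k] : covariance between foods j and k (defaults to 0)
    """
    n = len(p_miss)
    if cov_xx is None:
        cov_xx = [[0.0] * n for _ in range(n)]
    keep = [1.0 - p for p in p_miss]
    pairs = [(j, k) for j in range(n) for k in range(n) if j != k]

    # True slope: Cov(Y, sum X) / Var(sum X).
    beta = sum(cov_xy) / (sum(variances) + sum(cov_xx[j][k] for j, k in pairs))

    # Slope after zero imputation, missingness independent across foods.
    numerator = sum(keep[j] * cov_xy[j] for j in range(n))
    denominator = sum(keep[j] * variances[j]
                      + p_miss[j] * keep[j] * means[j] ** 2 for j in range(n))
    denominator += sum(keep[j] * keep[k] * cov_xx[j][k] for j, k in pairs)
    return (numerator / denominator) / beta


def slope(x, y):
    """Least-squares slope of y on x: Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulated_ratio(p_miss, rates, n=100_000, seed=1):
    """Monte Carlo beta'/beta for independent exponential foods."""
    rng = random.Random(seed)
    total, total_imputed, y = [], [], []
    for _ in range(n):
        foods = [rng.expovariate(r) for r in rates]
        # Each value is lost, then replaced by zero, with probability p_miss[j].
        imputed = [f if rng.random() >= p else 0.0
                   for f, p in zip(foods, p_miss)]
        total.append(sum(foods))
        total_imputed.append(sum(imputed))
        y.append(2.0 * sum(foods) + rng.gauss(0.0, 1.0))
    return slope(total_imputed, y) / slope(total, y)


if __name__ == "__main__":
    p = [0.3, 0.1]
    rates = [1.0, 0.5]                     # exponential: mean 1/r, variance 1/r**2
    means = [1 / r for r in rates]
    variances = [1 / r ** 2 for r in rates]
    cov_xy = [2.0 * v for v in variances]  # because Y = 2 * sum(X) + noise
    print(attenuation_ratio(p, means, variances, cov_xy))
    print(simulated_ratio(p, rates))
```

The two printed values should agree closely. Raising a food's mean relative to its standard deviation, or raising its proportion of missing nonzero values, pulls the ratio further below 1.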
Where there is only 1 $X$ in the independent variable (the interest is, say, in 1 food), then $\beta'/\beta = \sigma_j^2/(\sigma_j^2 + P_{nzmj}\,\mu_j^2)$. This demonstrates the importance of the squared mean, $\mu_j^2$, in relation to the variance, $\sigma_j^2$, when predicting bias after zero imputation. A similar equation for the ratio of biased to true squared correlation coefficients ($\rho^2$) between $Y$