Missing data are a common problem in nutritional epidemiology. Little is known of the characteristics of these missing data, which makes it difficult to conduct appropriate imputation.
We telephoned, at random, 20% of subjects (n = 2091) from the Adventist Health Study–2 cohort who had any of 80 key variables missing from a dietary questionnaire. We were able to obtain responses for 92% of the missing variables.
We found a consistent excess of “zero” intakes in the filled-in data that were initially missing. However, for frequently consumed foods, most missing data were not zero, and these were usually not distinguishable from a random sample of nonzero data. Older, black, and less-well-educated subjects had more missing data. Missing data are more likely to be true zeroes in older subjects and those with more missing data. Zero imputation for missing data may create little bias except for more frequently consumed foods, in which case, zero imputation will be suboptimal if there is more than 5%–10% missing.
Although some missing data represent true zeroes, many do not, and data are usually not missing at random. Automatic imputation of zeroes for missing data will usually be incorrect, although it creates little bias unless the foods are frequently consumed. Certain identifiable subgroups have greater amounts of missing data and require greater care in making imputations.
Missing data are a problem in food frequency questionnaires (FFQ), especially when the questionnaires are long. Yet the factors underlying missing dietary data, and the nature of those missing data, are poorly characterized. The methods used to deal with missing data in nutritional epidemiology are frequently not reported, but a common approach is to assume that participants who did not answer a question left it blank because they did not eat that food. Therefore a value of zero is imputed.1,2 Other approaches are to impute the means of nonmissing values,3 or to use the indicator variable method,4,5 which is unsatisfactory in multivariate analyses.6 Multiple imputation is a more attractive solution,7 particularly if steps are taken to ensure that data are approximately missing at random.8 We explore the characteristics of missing data and of subjects who omit these data.
The Adventist Health Study (AHS)–2 is a national cohort study.9 The data reported here pertain to the first 19,611 subjects enrolled, mainly from the western United States. Subjects are Seventh-day Adventists older than 29 years of age. About 50% of Adventists are vegetarian and many do not drink coffee. A long questionnaire (48 pages) was designed for this study9–11 and takes 1.5–3 hours to complete (30–60 min for the FFQ). The FFQ section (13 pages) from which items for these analyses were selected contains 130 specified foods, each item having between 7 and 9 possible frequency responses. The first response category, labeled “never or rarely” is specifically identified in methods described later. A standard portion size is given for each item and the subject chooses “standard,” “one-half or less than standard,” or “one-half or more than standard.”
We identified about 80 “key foods.” Based on previous pilot work,12 these included all sets of 4–5 foods that contributed most strongly to validity correlation coefficients for each of 18 FFQ indices of nutrients, vitamins, and minerals.
A random 20% of all subjects with 1 or more of these key items missing (the “sample missing population”) were contacted by telephone for the purpose of guiding multiple imputation.8 In this sample, missing data for particular foods are also a random 20% sample of missing data for each food. Averaging across key foods, it was possible to fill in 92% of these missing data. The average time between the original questionnaire and the follow-up phone call was about 1 year. Subjects were asked to recall their diet at the time of the original questionnaire. We assume that data reported by telephone are not systematically different from those which would have originally been reported on the questionnaire if not omitted.
Data from those successfully contacted among the 20% “sample-missing” subjects were replicated 4 times and combined with data from subjects with “initially complete” information (n = 9165) to provide what we reasonably assume are approximately unbiased estimates of completed data for the total population (the “estimated complete population”).
For each food, 2 statistical tests were performed to evaluate the distribution of filled-in data that were initially missing. The first tested the hypothesis that this distribution does not differ from that of estimated complete data for the total population. The test actually compared 2 independent data sets for each food: the initially complete data for that food, and the initially missing data for that food. The second test excluded subjects who responded never or rarely, thus evaluating whether the distributions among the remaining response categories differ. The dietary variables in Table 2 were chosen a priori to include a representative range of foods eaten less or more commonly by this population.
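The pair of comparisons can be sketched as follows. This is only an illustration: it assumes a Pearson chi-square test of homogeneity on a 2 x k table, which the text does not name, and both the function `chi_square_homogeneity` and the counts are hypothetical.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    Assumption: the two lists hold counts per frequency category for two
    independent samples (e.g. initially complete vs. initially missing).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples need the same response categories")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    used = 0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue  # category unused in both samples; contributes nothing
        used += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat, used - 1


# Hypothetical counts; the first entry is the "never or rarely" category.
initially_complete = [300, 400, 200, 100]
initially_missing = [60, 25, 10, 5]

# Test 1: all response categories.
stat_all, df_all = chi_square_homogeneity(initially_complete, initially_missing)
# Test 2: drop "never or rarely" and compare the remaining categories.
stat_nonzero, df_nonzero = chi_square_homogeneity(
    initially_complete[1:], initially_missing[1:]
)
```

In this made-up example the first statistic is large and the second is small, which is the pattern expected when the only difference between the two samples is an excess of zeroes.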
To evaluate whether covariates could predict the proportion of filled-in, initially missing data that were actually zero, we used a logistic regression with a binomial rather than Bernoulli distribution function, among only those in the sample-missing population. The ith subject contributed 1 vector observation of length Ni, of zeroes/nonzeroes indicating final disposition of each data point, where Ni is the number of initially missed key variables. The link function was logistic and the error distribution overdispersed “binomial” with a dispersion coefficient (σ²) of 1.92, which was incorporated into statistical tests.
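The text does not say how the dispersion coefficient was estimated. One common choice, shown here purely as an assumption, is the Pearson chi-square divided by the residual degrees of freedom. The function name `pearson_dispersion`, the counts, and the fitted probabilities below are all hypothetical.

```python
def pearson_dispersion(zero_counts, n_missing, fitted_probs, n_params):
    """Pearson chi-square / residual df for grouped binomial data.

    zero_counts[i]  : number of true zeroes among subject i's filled-in items
    n_missing[i]    : N_i, the number of initially missing key variables
    fitted_probs[i] : fitted probability of a zero from the logistic model
    n_params        : number of regression coefficients in that model
    """
    chi2 = 0.0
    for y, n, p in zip(zero_counts, n_missing, fitted_probs):
        expected = n * p
        variance = n * p * (1.0 - p)  # binomial variance without overdispersion
        chi2 += (y - expected) ** 2 / variance
    residual_df = len(zero_counts) - n_params
    return chi2 / residual_df


# Hypothetical data for four subjects and a two-parameter model.
zero_counts = [1, 9, 2, 6]
n_missing = [4, 10, 8, 12]
fitted_probs = [0.5, 0.6, 0.5, 0.6]

sigma2 = pearson_dispersion(zero_counts, n_missing, fitted_probs, n_params=2)
```

Under the usual quasi-likelihood adjustment, standard errors are multiplied by the square root of the dispersion, so a value of 1.92 would inflate them by a factor of about 1.39.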
A log-linear analysis was used to identify demographic factors that predict the number of missing variables (Ni) among subjects in the total population as initially observed, excluding any filled-in data. The model included all 2-way interactions, all 3-way interactions that do not involve the variable N, and the age*education*N interaction; these provided a satisfactory model fit as judged by the deviance.
We now derive a formula to describe the effects of zero imputation of missing data. Let missing data among nonzero values of the dietary variables, Xj, be missing completely at random, as is approximately the case in our data. The pattern of missing data among zero-valued data does not affect the calculations, because those values are simply replaced by zeroes again in the imputation. Let Pnzmj be the proportion of missing data among nonzero values of dietary variable Xj, µj be the overall mean of Xj, and σj² its variance. After imputing zeroes for missing data we relabel the Xj as X′j. It can be shown that the regression of some dependent variable Y on the sum of several X′j (j = 1,…,J) has a beta coefficient β′, measuring the slope of this regression, that is given by equation 1.
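As a sketch of where such a coefficient comes from, the stated assumptions (missingness completely at random among nonzero values, and independent between foods) lead to the expression below. The missingness indicator M_j and the covariance symbol σ_jk are notation introduced here for illustration, and the authors' equation 1 may be arranged differently.

```latex
% Zero imputation: X'_j = X_j (1 - M_j), with M_j ~ Bernoulli(P_{nzmj})
% independent of (X_1, ..., X_J, Y) and of M_k for k \neq j.
% (For zero-valued X_j the value of M_j is irrelevant.)
\begin{align*}
\operatorname{Var}(X'_j) &= (1 - P_{nzmj})\left(\sigma_j^2 + P_{nzmj}\,\mu_j^2\right) \\
\operatorname{Cov}(X'_j, Y) &= (1 - P_{nzmj})\,\operatorname{Cov}(X_j, Y) \\
\operatorname{Cov}(X'_j, X'_k) &= (1 - P_{nzmj})(1 - P_{nzmk})\,\sigma_{jk}, \qquad j \neq k \\[4pt]
\beta' &= \frac{\sum_{j} (1 - P_{nzmj})\,\operatorname{Cov}(X_j, Y)}
{\sum_{j} (1 - P_{nzmj})\left(\sigma_j^2 + P_{nzmj}\,\mu_j^2\right)
 + \sum_{j \neq k} (1 - P_{nzmj})(1 - P_{nzmk})\,\sigma_{jk}}
\end{align*}
```

With a single food (J = 1) the factor (1 − Pnzm) cancels and the expression reduces to β′/β = σ²/(σ² + Pnzm µ²).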
To sum dietary variables, Xj, is a common procedure when estimating intake from a food group or a particular nutrient.
A ratio of interest is β′/β, where β is the “true” coefficient obtained when Pnzmj = 0. Equation 1 assumes that missing data are assorted independently between the Xj. While this is unlikely, it is a conservative assumption because a positive correlation between missing data from different X variables will further decrease the ratio β′/β. However, in practical situations the covariance terms are generally much less likely than the µj² terms to create bias.
Where there is only 1 X in the independent variable (the interest is, say, in 1 food), then β′/β = 1/(1 + Pnzm µ²/σ²). This demonstrates the importance of µ² in relation to σ² when predicting bias after zero imputation. A similar equation for the ratio of biased to true squared correlation coefficients (ρ²) between Y and X is ρ′²/ρ² = (1 − Pnzm)/(1 + Pnzm µ²/σ²).
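The single-food case can be checked numerically. The sketch below assumes the ratio takes the form β′/β = 1/(1 + Pnzm µ²/σ²), which follows from missingness completely at random among nonzero values, and compares it with a small Monte Carlo simulation. All function names, the lognormal parameters, and the outcome model Y = 2X + noise are hypothetical choices made for illustration.

```python
import random


def beta_ratio(p_nzm, mean, variance):
    """beta'/beta for one food after zero imputation (assumed closed form)."""
    return 1.0 / (1.0 + p_nzm * mean ** 2 / variance)


def rho2_ratio(p_nzm, mean, variance):
    """rho'^2/rho^2 for one food after zero imputation (assumed closed form)."""
    return (1.0 - p_nzm) * beta_ratio(p_nzm, mean, variance)


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate_beta_ratio(p_nzm, n=50_000, seed=1):
    """Return (simulated beta'/beta, closed-form beta'/beta)."""
    rng = random.Random(seed)
    xs, xs_imputed, ys = [], [], []
    for _ in range(n):
        # Hypothetical intake: 10% true zeroes, otherwise lognormal.
        x = 0.0 if rng.random() < 0.10 else rng.lognormvariate(1.0, 0.5)
        y = 2.0 * x + rng.gauss(0.0, 1.0)
        # Only nonzero values can be "missing"; they are imputed as zero.
        missing = x > 0 and rng.random() < p_nzm
        xs.append(x)
        xs_imputed.append(0.0 if missing else x)
        ys.append(y)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    simulated = slope(xs_imputed, ys) / slope(xs, ys)
    return simulated, beta_ratio(p_nzm, mean, variance)


simulated, predicted = simulate_beta_ratio(0.10)
```

The simulated and closed-form ratios should agree to within sampling error, and both fall further below 1 as µ²/σ² grows.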
From the total cohort of 19,611 subjects there were 2091 in the sample missing population, and we were able to contact 1928 of these (92%). The demographic characteristics of the total and sample missing populations are shown in Table 1. As expected, when those with no missing key variables in the total population are excluded (note values in parentheses), the proportions in categories of missing key variables approximately agree between the 2 populations.
Table 2 shows that true values of initially missing data for frequently eaten foods are mostly nonzero. Thus an “automatic” zero imputation here may be problematic. The third and fourth columns of Table 2 are the estimated proportions of zeroes in the initially missing and estimated complete data, respectively. For every food tested, the distribution of values in the initially missing data differs from that for estimated complete data (see the test statistic in the fifth column of Table 2). When the zero intake category is excluded, however, the distribution of initially missing data among other categories is not clearly different from expected values for most foods. Nevertheless, the evidence of excess zeroes among the initially missing data is clear for every food (tested by using the difference between the 2 test statistics, with and without the zero category, as a statistic).
Those participants with the largest numbers of missing items (Table 3) were more likely to be older, black, and less well educated. BMI (not shown) had only a trivial association with the number of missing items. On average, older subjects have a higher proportion of zeroes when missing values are filled in (Table 4). The greater the number of initially missing foods, the higher the proportion of these that are true zeroes, although beyond 25 initially missing key variables, the probability that a completed value is a zero declines. The statistical evidence for this nonlinearity is strong (t = 7.62).
Whether a zero imputation for missing data will create biases of practical importance depends on the variable. As is clear from equation 1 and the results described previously, if a food is eaten infrequently (or more precisely, does not have a mean intake that when squared is much greater than its variance), then a zero imputation will have relatively less influence on a regression coefficient.
The foods selected for Table 2 have an average missing frequency of 4% in the estimated complete nonzero data (Pnzm). As we have previously shown,8 when missing proportions are very small then the form of the imputation for missing data, within reason, has very little effect on any outcome. However, using the observed µ²/σ² for these same variables but setting Pnzm = 0.10, the values of β′/β for apples, bananas, tomatoes, and broccoli (fairly commonly eaten foods) would be 0.84, 0.78, 0.81, and 0.81. Corresponding values of ρ′²/ρ² are 0.76, 0.70, 0.73, and 0.73. These clearly are nontrivial changes resulting from zero imputation. On the other hand, coffee (an uncommonly consumed food in this population) has a small value of µ²/σ²; and even with Pnzm of 0.10, the value of β′/β = 0.985 and ρ′²/ρ² = 0.887.
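For a single food, missingness completely at random among nonzero values implies that the two ratios are linked by ρ′²/ρ² = (1 − Pnzm) × β′/β, and that µ²/σ² can be recovered from β′/β. These relations are derived from that assumption rather than quoted from the text. The sketch below applies them to the rounded values reported above; the helper names are hypothetical.

```python
def rho2_ratio_from_beta_ratio(beta_ratio, p_nzm):
    """rho'^2/rho^2 implied by beta'/beta for a single food."""
    return (1.0 - p_nzm) * beta_ratio


def implied_mean_sq_over_var(beta_ratio, p_nzm):
    """mu^2/sigma^2 implied by beta'/beta = 1 / (1 + p_nzm * mu^2/sigma^2)."""
    return (1.0 / beta_ratio - 1.0) / p_nzm


P_NZM = 0.10

# (beta'/beta, rho'^2/rho^2) as reported in the text, already rounded.
reported = {
    "apples": (0.84, 0.76),
    "bananas": (0.78, 0.70),
    "tomatoes": (0.81, 0.73),
    "broccoli": (0.81, 0.73),
    "coffee": (0.985, 0.887),
}

for food, (b_ratio, r_ratio) in reported.items():
    implied_r = rho2_ratio_from_beta_ratio(b_ratio, P_NZM)
    implied_ms = implied_mean_sq_over_var(b_ratio, P_NZM)
    print(f"{food}: implied rho2 ratio {implied_r:.3f} "
          f"(reported {r_ratio}), implied mu^2/sigma^2 {implied_ms:.2f}")
```

Because the inputs are rounded, the implied µ²/σ² values are only approximate: roughly 1.9 for apples and 0.15 for coffee, which illustrates why coffee is so little affected.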
In another example, we simulated a situation in which the independent variable is the sum of 2 dietary variables (X1 and X2), and the nonzero values of these variables (before missing data are added) were generated using a lognormal distribution such that correlation (X1, X2) = 0.75. X1, with the addition of 10% true zeroes, had a mean frequency of 2.77 servings eaten per week and variance 3.14, with 99th percentile of 8.48. The second variable, with the addition of the 10% true zeroes, had mean = 7.24, variance = 15.63, and 99th percentile = 18.4. The final correlation (X1, X2) was 0.50. The second food was thus more commonly consumed and had a higher variance of intake. With 10% of the nonzero data for each of these variables independently missing, β′/β is 0.81 and ρ′²/ρ² is 0.70.
There is an excess proportion of zeroes among missing data (Table 2). Even so, averaging across all key foods, zero would be an accurate imputation only about 60% of the time. For foods eaten more frequently, a zero imputation will usually be incorrect, as most initially missing values are nonzero and typically have a distribution that is similar to nonzero data that were not initially missing. Whether this will affect estimates of relative risk also depends on the proportion of data initially missing,8 how commonly this food is eaten, and the importance of the missing variable to that nutrient or food group.
Our data are consistent with those of others12,13 who note that a zero imputation will often (although not always) produce data that have a correlation with original (no missing) data exceeding 0.90. However, as we have shown, the influence on a β coefficient measuring the slope of a regression or the squared correlation coefficient of that regression may be somewhat greater. We have used a linear disease regression as the example around which to explore the effects of zero imputation. Theoretically the situation should be at least qualitatively similar for a logistic disease regression,14–16 although more work remains to be done.
Particular situations may make zero imputation an especially bad choice. Our recent need to identify vegans in this study was based on data from more than 20 variables. A zero imputation for missing data would have exaggerated the proportion of vegans, who consume no meat, dairy, or eggs. Further, if in equation 1 T (true intake of X) is substituted for Y, this regression is that related to the validity correlation for X. A further attenuation of even 10% because of zero imputation in this validity correlation coefficient would usually be considered undesirable, as it is usually already markedly attenuated by measurement error.17
Although there is a substantial literature about the characteristics of missing subjects18 in surveys, there are surprisingly few reports about the characteristics of missing data from those enrolled in nutritional epidemiologic studies. One report13 found that 76% had initially left at least 1 item blank. On follow-up (89% complete), 55% of such foods were consumed never or less than once per month—similar to our observations.
In a study from Sweden19 using a 56-item food frequency questionnaire, investigators were able to contact by telephone more than half of subjects with missing data, thus leaving smaller percentages of missing data. They also demonstrated that, for commonly consumed foods, initially missing data were infrequently true zeroes. Whether they imputed a zero or the observed median usually made little difference to estimates, though in some circumstances differences were larger.
Caan et al12 found that the probability of a perfect questionnaire response was inversely related to age, and differed by race. The correlation between nutrients calculated from questionnaires with and without missing data was usually high but, as expected, was more adversely affected when questionnaires with higher numbers of missing items were included. We previously found8 that zero imputation of missing data in a real data set often resulted in point estimates of regression coefficients that differed from the best estimator by 12%–18%.
Using guided multiple imputation8 to handle missing data will usually be a good choice if it is practical to contact a subsample to fill in missing data. This allows the missing-at-random assumption to be approximately satisfied, thus practically eliminating further biases and attenuation due to imputation.
It is clear that different subgroups of the population have different amounts of missing data. Indeed, in less-well-educated black subjects older than 80 years of age the model predicts that about 5 times as many would be placed in the “more than 20 missing” category as compared with young college-educated white subjects. Many known barriers interfere with the participation of black subjects in research.20
Missing data in older subjects are a little more likely to represent true zeroes. Perhaps their greater susceptibility to fatigue leads to skipping foods not eaten. Those with about 25 missing items had the greatest proportion that are true zeroes, perhaps suggesting that those who systematically skip items not eaten, on average do not eat about 25 of these key foods. When there are more than 25 missing items the proportion of true zeroes decreases again, suggesting that these tend to be participants who are less committed or less able to complete the questionnaire carefully.
In summary, a systematic zero imputation will produce biased results for commonly consumed foods, although the magnitude of that bias will depend on the particular situation. Deleterious effects on bias and validity will often be quite small, but may be more severe. Given the challenges to validity before considering missing data, even modest additional error due to inaccurate imputation should be avoided if possible. The elderly, less-well-educated, and black subjects tend to have more missing data, and subsamples used to guide multiple imputation8 should ensure good representation from these groups.
Supported by NIH grant R01 CA094594.