Missing data is a common problem in epidemiological studies, and the statistical implications of ignoring missing data are well known, including loss of statistical power and potentially biased estimates of association. The multiple imputation technique [1] is an approach whereby the investigator replaces each missing value with several plausible values sampled from a probability distribution, conducts a separate analysis on each of the resulting replicate datasets, then combines the multiple results in a way that accounts for the fact that the replacement data were imputed. Multiple imputation has been widely accepted and has been used to account for missing data in large national surveys and studies, including NHANES III [2], the National Assessment of Educational Progress [3], the Children’s Mental Health Initiative [4], and the Framingham Heart Study [5]; however, detailed accounts of the application of multiple imputation, and particularly of the evaluation and validation of the methods, are not often published. This paper demonstrates a practical implementation of multiple imputation that is vital for investigators of the Agricultural Health Study (AHS).
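The combining step described above can be illustrated with the standard pooling rules for multiple imputation, often called Rubin's rules. The following is a minimal sketch, assuming the imputation and per-dataset analyses have already been run; the function name `pool_rubin` and the numbers are hypothetical and are not AHS results, and the text above does not state which combining rules were used in this study.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and their variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and one variance per estimate")
    pooled_estimate = statistics.fmean(estimates)   # mean of the m estimates
    within_var = statistics.fmean(variances)        # average sampling variance
    between_var = statistics.variance(estimates)    # spread across imputations (m - 1 denominator)
    # Total variance adds the extra uncertainty due to imputing the missing values.
    total_var = within_var + (1 + 1 / m) * between_var
    return pooled_estimate, total_var


# Illustrative values only (hypothetical log relative risks from five imputed datasets).
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]
pooled_estimate, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled_estimate:.3f}, total variance = {total_var:.4f}")
```

The between-imputation term is what distinguishes this from simply averaging the results: it inflates the variance to reflect that the replaced values were imputed rather than observed.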
The AHS is a prospective cohort study designed to evaluate the effect of agriculturally related exposures on health outcomes. The study includes 57,310 licensed pesticide applicators from Iowa and North Carolina, as well as 32,345 spouses of licensed applicators, who are not included in this imputation. In Iowa, both private applicators, who are primarily farmers, and commercial applicators were included. In North Carolina, only private applicators were enrolled. Cancer incidence and mortality are obtained by annual linkage to state cancer and mortality registries and to the National Death Index. Exposure information is collected by questionnaire. In the Phase 1 enrollment period (1993–97), applicators provided information on the use of 50 specific pesticides through completion of two self-administered questionnaires that included information on demographics, health history, and lifetime farming and pesticide use practices [6–8]. The study was approved by the Institutional Review Boards of the National Institutes of Health (Bethesda, Maryland) and its contractors. From the enrollment data, two exposure metrics were developed. The first was lifetime days of pesticide use, calculated as the product of the years of use of each specific pesticide and the average number of days used per year. The second metric, intensity-weighted lifetime days of use, incorporated information about factors that might affect exposure, such as the use of personal protective equipment, whether the applicator mixed pesticides or repaired equipment, and the method of application [9].

Five years later, in Phase 2 (1999–2005), we administered a computer-assisted telephone interview questionnaire that asked about pesticide use since enrollment. Specifically, participants were asked about the last year in which they applied pesticides, which was denoted the Phase 2 reference year, and about the type and frequency of use of specific pesticides. A total of 36,342 of the original participants (63%) completed the questionnaire; 8% had died between enrollment and the administration of Phase 2, 15% refused, and 14% could not be reached [10]. For epidemiological analyses, pesticide use information collected in Phase 2 was cumulatively added to information collected in Phase 1 for both aforementioned exposure metrics, using details of specific pesticide use.
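The arithmetic behind the two exposure metrics can be sketched as follows. This is a minimal illustration, assuming the intensity weighting enters as a single multiplicative score; that form, the parameter name `intensity_score`, and the example values are assumptions made for illustration, since the text above describes only which factors the weighting incorporates.

```python
def lifetime_days(years_of_use, days_per_year):
    """Lifetime days of use for one pesticide: years of use times average days per year."""
    return years_of_use * days_per_year


def intensity_weighted_days(years_of_use, days_per_year, intensity_score):
    """Lifetime days scaled by an intensity score (multiplicative form is an assumption)."""
    return lifetime_days(years_of_use, days_per_year) * intensity_score


def cumulative_days(phase1_days, phase2_days):
    """Add exposure accrued since enrollment (Phase 2) to the Phase 1 total."""
    return phase1_days + phase2_days


# Hypothetical applicator: 10 years at 20 days/year before enrollment,
# then 5 more years at 20 days/year reported in Phase 2.
phase1 = lifetime_days(10, 20)
phase2 = lifetime_days(5, 20)
total = cumulative_days(phase1, phase2)
weighted = intensity_weighted_days(10, 20, intensity_score=2.5)
print(phase1, phase2, total, weighted)
```

If the Phase 2 value is missing and treated as zero, `total` stays at the Phase 1 value, which is how ignoring non-response understates cumulative exposure for applicators who kept applying pesticides.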
When pesticide exposure is used in an analysis, there are several ways to handle missing Phase 2 information, including omitting those subjects, simple imputation (e.g., mean value substitution), or ignoring non-response in Phase 2 and thereby implicitly assuming zero pesticide exposure after Phase 1, an assumption that would be erroneous for most participants who did not complete the Phase 2 questionnaire. To correct for the bias that such approaches could introduce, we employed a data-driven multiple imputation for the 20,968 applicators (37%) who did not complete the Phase 2 questionnaire. This paper describes the complex, multi-step process used to impute missing information on pesticide use from Phase 2 and an evaluation of the imputation procedure based on a holdout subset of participants with complete data (i.e., individuals who completed both Phase 1 and Phase 2). We also discuss the assumptions and advantages of multiple imputation.
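The idea of a holdout evaluation can be sketched in a few lines: hide the known Phase 2 values for a subset of complete responders, impute them, and compare the imputed summary with the truth. The sketch below is illustrative only. It uses a simple random hot-deck draw from the remaining responders as a stand-in imputation method, which is not the multi-step procedure described in this paper, and the function name `holdout_check`, its parameters, and the data are hypothetical.

```python
import random
import statistics


def holdout_check(observed, holdout_fraction=0.2, n_imputations=5, seed=0):
    """Mask a holdout subset, impute it by hot-deck draws, and compare the means."""
    rng = random.Random(seed)
    values = list(observed)
    indices = list(range(len(values)))
    rng.shuffle(indices)
    n_holdout = max(1, int(len(values) * holdout_fraction))
    holdout = indices[:n_holdout]                       # values treated as missing
    donors = [values[i] for i in indices[n_holdout:]]   # values left observed
    if not donors:
        raise ValueError("need at least one donor outside the holdout subset")
    true_mean = statistics.fmean([values[i] for i in holdout])
    imputed_means = []
    for _ in range(n_imputations):
        # Stand-in imputation: each masked value is replaced by a random donor value.
        draws = [rng.choice(donors) for _ in holdout]
        imputed_means.append(statistics.fmean(draws))
    # Average the per-imputation means, as in the pooling step of multiple imputation.
    return true_mean, statistics.fmean(imputed_means)


# Hypothetical Phase 2 days of use for ten complete responders.
days_of_use = [0, 5, 10, 20, 40, 60, 80, 100, 150, 200]
true_mean, imputed_mean = holdout_check(days_of_use, holdout_fraction=0.3, seed=1)
print(f"holdout mean = {true_mean:.1f}, mean of imputed values = {imputed_mean:.1f}")
```

Because the true values of the masked subset are known, the gap between the two means gives a direct, if crude, measure of how well an imputation procedure recovers the hidden data.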