HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
 
Circ Cardiovasc Qual Outcomes. Author manuscript; available in PMC 2011 January 1.
Published in final edited form as:
PMCID: PMC2818781
NIHMSID: NIHMS166887

Missing Data Analysis Using Multiple Imputation: Getting to the Heart of the Matter

Abstract

Missing data are a pervasive problem in health investigations. We describe some background of missing data analysis and criticize ad-hoc methods which are prone to serious problems. We then focus on multiple imputation, in which missing cases are first filled in by several sets of plausible values to create multiple completed datasets, then standard complete-data procedures are applied to each completed dataset, and finally the multiple sets of results are combined to yield a single inference. We introduce the basic concepts and general methodology, and provide some guidance for application. For illustration, we use a study assessing the effect of cardiovascular diseases on hospice discussion for late stage lung cancer patients.

Keywords: Efficiency, Likelihood-based approach, Missingness mechanism, Nonresponse bias, Weighting

1 Introduction

The empirical basis of health services and outcomes research largely rests on statistical analysis of data collected in studies. However, it is typical that not all planned observations are made. The reasons for missing data are numerous. Subjects may have missed a visit for a practical or administrative reason, or data may not have been collected at a particular time because of equipment failure. Subjects may drop out of studies because side effects associated with the treatment prevented them from continuing to participate, or they may not report their health outcome because they are too sick. Missing data can also arise from study designs. For example, different survey forms may be used in one study, and therefore some variables in one survey are collected for some but not all patients.

Missing data can be a serious impediment for data analysis. For example, Huskamp et al.1 investigated patterns of hospice discussion with providers by late stage lung cancer patients. They used data collected from a multi-site cohort study of care for patients with lung or colorectal cancer by the Cancer Care Outcomes Research and Surveillance (CanCORS) Consortium2. As is typical in large health or social studies, there is a substantial amount of missing data in the CanCORS database, and the missing cases display no systematic pattern. In our illustrative example (Section 4), in which the hospice study data are used and the analytic goal is to use a logistic regression to assess the effect of the patients’ cardiovascular disease status on their tendency toward hospice discussion, the fractions of missing observations range from 0.04% to 19.48% for the variables, including both the outcome and predictors. Simply removing the patients with missing values from the analysis would result in a loss of around 30% of the sample, raising serious concerns about the validity of the results.

In this paper, we review the multiple imputation3 approach to missing data problems in the context of cross-sectional data analysis. Section 2 introduces some background. Section 3 reviews multiple imputation. Section 4 uses the hospice study example to illustrate the methods. Finally, Section 5 concludes with a discussion.

2 Background

2.1 Missing data mechanism

Table 1 shows a few lines of the dataset used in the hospice study example. Typically a dataset in analytic form can be characterized as a rectangular matrix (row=subjects, column=variables), and the missing data are the elements that we do not observe in this matrix, marked by “?” in Table 1.

Table 1
Missing data matrix

The missingness pattern of a dataset can be represented by a missing indicator matrix of 1’s and 0’s shaped like the data matrix, with 1’s for missing values and 0’s for observed ones. We think of missingness as the consequence of a random process that can be characterized by missingness models. For example, a model relating missingness of myocardial infarction to other variables in the dataset may suggest that older patients with stroke are more likely to have nonresponse. Three broad types of missingness mechanisms4, moving from the simplest to the most general, are:

  1. Missing completely at random (MCAR): A variable is MCAR if the probability of missingness is independent of any characteristics of the subjects. For example, each survey respondent decides whether to answer the “age” question by rolling a die and refusing to answer it if a “1” shows up (i.e., with a probability of 1/6). But most missingness is not completely at random. In the hospice study, for example, older patients are more likely than younger ones to have nonresponse on either income or insurance questions.
  2. Missing at random (MAR): A more general assumption, MAR, is that the probability a variable is missing depends only on observed variables. For instance, older patients might be more likely to miss “insurance” than younger patients, and then “insurance” is MAR if the study has collected information on age for all patients in the survey.
  3. Not missing at random (NMAR): Missingness is no longer “at random” if its probability depends on variables that are incomplete. A common example is that people with higher income are less likely to reveal it; that is, the nonresponse probability for the income variable depends on its values, which can be missing.

Understanding which class the missing data mechanism falls into is key to making correct statistical inferences. MAR can never be proved or falsified using the data alone, because the NMAR alternative asserts a dependence on values that are not available to the researchers from the observed data. It is possible, however, to test whether data are MCAR in many situations: if meaningful differences exist between those with and without missing data for some variables, this provides evidence against MCAR.
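To make the three mechanisms concrete, the following minimal simulation sketch deletes values of a simulated income variable under MCAR, MAR, and NMAR and compares the mean of the remaining values with the full-data mean. The population model, the missingness probabilities, and all names are hypothetical choices made for illustration; they are not taken from the hospice study.

```python
import random
import statistics


def simulate_mechanisms(n=20000, seed=0):
    """Delete simulated income values under MCAR, MAR, and NMAR and
    return the mean of the values that remain observed in each case."""
    rng = random.Random(seed)
    # Hypothetical population: age is fully observed; income falls with age.
    ages = [rng.uniform(40, 90) for _ in range(n)]
    incomes = [80 - 0.5 * age + rng.gauss(0, 10) for age in ages]

    def observed_mean(prob_missing):
        # Keep a value when a uniform draw is at least its missingness probability.
        kept = [inc for age, inc in zip(ages, incomes)
                if rng.random() >= prob_missing(age, inc)]
        return statistics.fmean(kept)

    return {
        "full": statistics.fmean(incomes),
        # MCAR: the die-rolling rule, unrelated to any characteristic.
        "MCAR": observed_mean(lambda age, inc: 1 / 6),
        # MAR: missingness depends only on the fully observed age.
        "MAR": observed_mean(lambda age, inc: 0.6 if age > 65 else 0.1),
        # NMAR: missingness depends on the income value itself.
        "NMAR": observed_mean(lambda age, inc: 0.6 if inc > 50 else 0.1),
    }


means = simulate_mechanisms()
for label, value in means.items():
    print(f"{label:>4}: {value:.2f}")
```

In this sketch the observed mean stays close to the full-data mean under MCAR but drifts away from it under MAR and NMAR. Under MAR the drift could be corrected by conditioning on age, which is observed for everyone; under NMAR no observed variable carries that information.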

Under MAR (including MCAR as a special case), we can ignore the missingness models and focus on the missing-data models, which describe the predictive relationship between the incomplete variables and the observed ones (e.g., a model relating missing myocardial infarction to other variables in the dataset may indicate that older patients are more likely to have myocardial infarction). Under NMAR, however, missingness models generally have to be specified in order to obtain correct inferences3,5, although using MAR models that include more variables may achieve close results6–8.

2.2 Ad-hoc missing data methods

2.2.1 Complete-case analysis

A common missing data approach is complete-case analysis (CC), which uses only subjects who have all variables observed and is also the default option in many statistical software packages. When data are MCAR, CC analysis results are unbiased. When data are MAR but not MCAR, it is permissible to exclude the missing observations provided that a regression model controls for all the variables that affect the probability of missingness9. But CC analysis generally has major deficiencies5,10. The results can be biased when data are not MCAR. In addition, the reduction of statistical power by discarding cases is a major drawback. For example, suppose data are MCAR across 20 variables, missingness is independent across variables, and the missingness fraction is 5% for each variable. Using CC analysis will lose close to two thirds of the subjects because the fully observed subjects account for only (1 − 0.05)^20 ≈ 36% of the original sample.
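The arithmetic in this example can be checked with a short sketch. The function name is hypothetical, and the calculation assumes that missingness is independent across variables and has the same fraction for each.

```python
def complete_case_fraction(missing_fraction, n_variables):
    """Expected share of fully observed subjects when each variable is
    missing independently with the same missingness fraction."""
    return (1.0 - missing_fraction) ** n_variables


fraction = complete_case_fraction(0.05, 20)
print(f"Fully observed: {fraction:.1%}")
print(f"Discarded by CC analysis: {1 - fraction:.1%}")
```

The same function shows how quickly the loss grows with the number of variables, even when each variable has only a small fraction of missing values.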

2.2.2 Ad-hoc imputation

Imputation methods fill in missing values to maintain the full sample so that standard software can be easily used to analyze completed data. In addition, the researcher using the imputed data can concentrate on substantive questions of interest rather than incomplete-data problems11.

However, many ad-hoc imputation methods (e.g., mean imputation and treating missing data as a separate category) are based on missing data models with implausible assumptions. Furthermore, these methods impute the missing data only once and then proceed to the completed data analysis. These single imputation strategies generally underestimate the standard errors of estimates because choosing a single imputation pretends that we know the unobserved value with certainty, when actually it is unknown but estimated by the imputation method.
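The understatement of standard errors by single imputation can be seen in a small simulation. The sketch below, with hypothetical names and simulated data, fills in MCAR-deleted values with the observed mean and then computes the standard error of the mean as if the completed data were fully observed.

```python
import math
import random
import statistics


def mean_imputation_demo(n=1000, missing_fraction=0.4, seed=1):
    """Compare the standard error of the mean from observed cases with the
    naive standard error computed after single mean imputation."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    # MCAR deletion: each value is dropped with the same probability.
    observed = [v for v in y if rng.random() >= missing_fraction]
    n_missing = n - len(observed)

    # Single imputation: every missing value is replaced by the observed mean.
    completed = observed + [statistics.fmean(observed)] * n_missing

    se_observed = statistics.stdev(observed) / math.sqrt(len(observed))
    # Treating imputed values as real data shrinks the spread and inflates n.
    se_naive = statistics.stdev(completed) / math.sqrt(n)
    return se_observed, se_naive


se_observed, se_naive = mean_imputation_demo()
print(f"SE from observed cases:     {se_observed:.4f}")
print(f"Naive SE after mean filling: {se_naive:.4f}")
```

The naive standard error is smaller than the one from the observed cases alone, although no information has been added by the imputation.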

2.3 Some principled approaches

2.3.1 Nonresponse weighting

Nonresponse weighting12 is a principled approach for making the subjects included in the analysis representative of the original sample. For example, suppose that 100% of whites and 50% of African Americans responded in a survey. If there are large differences between whites and African Americans in the variable of interest, then the sample mean from the observed cases would be biased relative to the average of the complete data. Assuming MCAR for African Americans, weighting their observations by 2 (= 1/0.5), so that each respondent represents two cases from the original sample, and then calculating the weighted average of the observed cases would yield a more accurate estimate. In more general scenarios, the weights are the inverse of the predicted probabilities of response estimated from the missingness models of incomplete variables.
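The weighting example can be sketched numerically as follows. The group sizes and outcome values are hypothetical numbers chosen so that the full-sample mean is known, and the function name is an assumption of this sketch.

```python
def weighted_mean(values, response_probs):
    """Mean of the observed values, each weighted by the inverse of its
    probability of response."""
    weights = [1.0 / p for p in response_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical original sample: 100 subjects per group, outcome 10 in the
# first group and 20 in the second, so the full-sample mean is 15.
# All 100 subjects of the first group respond; only 50 of the second do.
values = [10.0] * 100 + [20.0] * 50
response_probs = [1.0] * 100 + [0.5] * 50

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, response_probs)
print(f"Unweighted respondent mean: {unweighted:.2f}")
print(f"Weighted respondent mean:   {weighted:.2f}")
```

Each respondent in the second group receives weight 2, which restores that group's share of the original sample in the weighted average.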

Weighting might be best suited for unit nonresponse in a survey (i.e., cases sampled for the survey but not participating in an interview, such as noncontacts and refusers). On the other hand, by including only subjects with complete data, it ignores partial information from subjects with incomplete data and can thus reduce efficiency. In addition, weighting becomes considerably less tractable with multiple missing variables when there is no regular pattern for missing data13,14. Furthermore, since weights are estimated from the proposed models, this extra level of prediction will introduce more uncertainty to the inference. Sometimes extreme estimates of weights (if the predicted probabilities are close to 0 or 1) can lead to erratic variance estimates.

2.3.2 Likelihood-based methods

Another principled approach is to maximize the likelihood (ML) function of incomplete data, with the missing data values removed from the complete-data likelihood by a process of summation or integration (Appendix). The resulting parameter estimates are most efficient because all observed data are used. Principles and examples for applying ML to incomplete-data problems can be found in5.
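In generic notation (the symbols here are assumptions of this sketch and may differ from those in the Appendix), write Y_obs and Y_mis for the observed and missing parts of the data and θ for the model parameters. When the missingness mechanism can be ignored, the observed-data likelihood is obtained by integrating the missing values out of the complete-data density:

```latex
L(\theta \mid Y_{\mathrm{obs}}) \propto f(Y_{\mathrm{obs}} \mid \theta)
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta) \, dY_{\mathrm{mis}}
```

The integral is replaced by a sum when the missing values are discrete.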

Incomplete-data likelihood functions typically have a complicated form; special computational techniques such as the EM algorithm15 may be needed to maximize them. Computational aspects of ML with missing data were reviewed by16. Typically special software needs to be developed for a particular problem, because ML is usually problem-specific. Thus, the technical difficulties involved in constructing a likelihood model and carrying out the computation make ML less appealing for most practitioners.

3 Multiple Imputation

3.1 Concept

Under MAR, the multiple imputation3 approach seeks to retain the advantages of ML estimates while also allowing the uncertainty due to imputation, which is ignored in single imputation, to be incorporated into the completed-data analysis. It involves creating more than one set of replacements for the missing values based on plausible models for data, therefore generating multiple completed datasets for analysis (Figure 1). The statistical reasoning behind multiple imputation is that the observed-data likelihood can be approximated by the average of the completed-data likelihood over unknown missing values (Appendix). That is, multiple imputation analysis that combines the likelihood-based analysis from each completed dataset is approximately equivalent to the analysis based on the observed-data likelihood, while the imputation uncertainty is reflected by the variation across the multiple completed datasets.
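One common way to write this reasoning is in Bayesian posterior form. The notation is an assumption of this sketch and may differ from the Appendix: Y_obs and Y_mis are the observed and missing data, θ is the parameter of interest, and Y_mis^(m), m = 1, ..., M, are imputations drawn from the predictive distribution of the missing values given the observed data.

```latex
p(\theta \mid Y_{\mathrm{obs}})
  = \int p(\theta \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \,
         p(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}) \, dY_{\mathrm{mis}}
  \approx \frac{1}{M} \sum_{m=1}^{M}
         p(\theta \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(m)})
```

Each term in the sum is a completed-data analysis, and the spread of these analyses across the M imputations carries the imputation uncertainty.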

Figure 1
The scheme of multiple imputation, where ? indicates missing data

3.2 Analysis procedure

The analysis of multiply imputed data proceeds as follows:

  1. Analyze each completed dataset separately using a suitable software package designed for complete data (e.g., SAS, STATA, or R).
  2. Extract the point estimate and standard error from each analysis.
  3. Combine the multiple sets of point estimates and standard errors to obtain a single point estimate, standard error, and the associated confidence interval or p-value.

The combining rules in Step 3 contain some formulas for calculating the average of the estimates across multiple imputations and the variances of the estimates, both within and between imputations (Appendix). They have been incorporated into imputation packages (Section 3.4) for automatic calculations.
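The combining step can be sketched with the standard rules due to Rubin3, applied to a single scalar estimate. The function name and the input numbers below are hypothetical, and the formulas are the widely used textbook form rather than a transcription of the Appendix.

```python
import math
import statistics


def combine_estimates(estimates, standard_errors):
    """Combine M completed-data point estimates and standard errors into one
    estimate, one standard error, and the degrees of freedom of the
    reference t distribution."""
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need at least two estimates with matching standard errors")

    q_bar = statistics.fmean(estimates)                        # combined estimate
    within = statistics.fmean([se ** 2 for se in standard_errors])
    between = statistics.variance(estimates)                   # divisor M - 1
    total = within + (1 + 1 / m) * between                     # total variance

    if between == 0:
        df = float("inf")
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, math.sqrt(total), df


# Hypothetical regression coefficients from M = 5 completed datasets.
estimate, std_error, df = combine_estimates(
    [0.50, 0.60, 0.55, 0.45, 0.65],
    [0.10, 0.10, 0.10, 0.10, 0.10],
)
print(f"estimate={estimate:.3f}  se={std_error:.3f}  df={df:.1f}")
```

The combined standard error exceeds the within-imputation value of 0.10 because the between-imputation variance is added; a confidence interval would then use a t distribution with the returned degrees of freedom.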

3.3 Imputation models

Plausible imputations should give reasonable predictions for the missing data, and the variability among them must reflect an appropriate degree of uncertainty. Rubin3 recommends that imputations be created through Bayesian arguments: specify a parametric model for the complete data under MAR, assume a prior distribution for the unknown model parameters, and simulate multiple independent draws from the conditional distribution of missing values given observed data by Bayes’ theorem. A simple example for a univariate missing outcome is given in the Appendix.
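As an illustration of this recipe, the sketch below imputes a univariate normal outcome. It assumes a normal model with the usual noninformative prior on the mean and variance, which may differ from the example in the Appendix; the function name and the data are hypothetical.

```python
import random
import statistics


def draw_imputations(observed, n_missing, n_imputations=5, seed=0):
    """Create several sets of imputations for a normal outcome by first
    drawing the parameters from their posterior, then drawing the data."""
    rng = random.Random(seed)
    n = len(observed)
    y_bar = statistics.fmean(observed)
    s2 = statistics.variance(observed)

    imputation_sets = []
    for _ in range(n_imputations):
        # sigma^2 | data is (n - 1) * s^2 divided by a chi-square(n - 1) draw;
        # a chi-square(k) variable is Gamma(shape=k/2, scale=2).
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, data is normal around the observed mean.
        mu = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
        # Missing values are drawn from the model at the drawn parameters.
        imputation_sets.append(
            [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_missing)]
        )
    return imputation_sets


observed = [4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 4.6, 5.9]  # hypothetical values
imputation_sets = draw_imputations(observed, n_missing=3)
for values in imputation_sets:
    print([round(v, 2) for v in values])
```

Drawing new parameter values for every imputation, rather than reusing the point estimates, is what lets the variation across the completed datasets reflect uncertainty about the model as well as about the missing values.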

Various imputation models have been developed within more general and complicated contexts. See17 for a summary and references. In general, the strategy of building imputation models falls into two categories:

  1. Joint modeling. The joint modeling approach partitions the observations into groups of identical missing data patterns and imputes the missing entries with each pattern according to a joint model for the variables that is common to all observations. Some classic examples include multivariate normal models for continuous variables, log-linear models for categorical variables, general location models for a mixture of continuous and categorical variables16, and mixed-effects models for repeated measurements or multilevel data18,19. These methods start by specifying a parametric multivariate density for the data given model parameters. Under an appropriate prior distribution for the parameters, it is possible to derive the appropriate submodel for each missing data pattern, from which imputations are drawn.
    The joint modeling approach is theoretically sound, but may lack the flexibility needed to represent complex data structures arising in many studies. For example, the CanCORS data consist of a large number of variables having a variety of distributional forms, subject to certain logical or consistency bounds imposed by study questionnaires, and displaying unsystematic missingness patterns. In such a case, the joint modeling strategy is difficult to implement because the typical specifications of multivariate distributions are not sufficiently flexible to accommodate these features.
  2. Sequential regression multiple imputation (SRMI)20,21 (also referred to as multiple imputation by chained equations). In SRMI, multivariate data are characterized by separate conditional models for each incomplete variable. That is, the imputation model is specified separately for each variable, with other variables as predictors. At each step of the SRMI algorithm, imputations are generated for the missing values of one variable, these imputed values are then used in the imputation of the next variable, and this process repeats until it reaches convergence.

Compared to the joint modeling approach, an appealing feature of SRMI is that it is relatively easy to accommodate complex data features in univariate regression models. Constructing these regression models can follow common guidelines of regression modeling applied to the data at hand. For continuous variables, the model may involve a linear regression model or its robust extensions22. Dichotomous variables may be modeled by logistic regression, and categorical variables with more than two categories by polytomous models. Poisson models can be used for incomplete count data, and two-part models for a variable with a mixture of point mass and continuous values. Detailed information can be found in the manuals for the related software (Section 3.4).
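The cycling logic of SRMI can be sketched for the simplest case of two continuous variables, each imputed from a normal linear regression on the other. This is a toy illustration under stated assumptions (two variables, linear normal models, a noninformative prior, a fixed number of cycles), not the algorithm as implemented in IVEware or MICE; all names are hypothetical.

```python
import random
import statistics


def impute_by_regression(target, predictor, missing, rng):
    """Redraw target[i] for i in `missing` from a normal linear regression of
    target on predictor, fitted to the cases where target was observed."""
    obs = [i for i in range(len(target)) if i not in missing]
    n = len(obs)
    x_bar = statistics.fmean(predictor[i] for i in obs)
    y_bar = statistics.fmean(target[i] for i in obs)
    sxx = sum((predictor[i] - x_bar) ** 2 for i in obs)
    sxy = sum((predictor[i] - x_bar) * (target[i] - y_bar) for i in obs)
    slope_hat = sxy / sxx
    rss = sum((target[i] - y_bar - slope_hat * (predictor[i] - x_bar)) ** 2
              for i in obs)
    # Draw parameters from their posterior under a noninformative prior
    # (an assumption of this sketch). Centering the predictor makes the
    # intercept and slope draws independent given sigma^2.
    sigma2 = rss / rng.gammavariate((n - 2) / 2, 2.0)  # rss / chi-square(n - 2)
    intercept = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
    slope = rng.gauss(slope_hat, (sigma2 / sxx) ** 0.5)
    for i in missing:
        mean_i = intercept + slope * (predictor[i] - x_bar)
        target[i] = rng.gauss(mean_i, sigma2 ** 0.5)


def srmi_two_variables(x, y, n_iterations=10, seed=0):
    """Return one completed copy of (x, y); None marks a missing value."""
    rng = random.Random(seed)
    miss_x = {i for i, v in enumerate(x) if v is None}
    miss_y = {i for i, v in enumerate(y) if v is None}
    # Start from observed means, then cycle through the conditional models.
    x_start = statistics.fmean(v for v in x if v is not None)
    y_start = statistics.fmean(v for v in y if v is not None)
    x = [x_start if v is None else v for v in x]
    y = [y_start if v is None else v for v in y]
    for _ in range(n_iterations):
        impute_by_regression(x, y, miss_x, rng)  # impute x given current y
        impute_by_regression(y, x, miss_y, rng)  # impute y given updated x
    return x, y


# Simulated data with about 20% of each variable deleted at random.
data_rng = random.Random(42)
x_full = [data_rng.gauss(0, 1) for _ in range(300)]
y_full = [1 + 2 * v + data_rng.gauss(0, 1) for v in x_full]
x_obs = [None if data_rng.random() < 0.2 else v for v in x_full]
y_obs = [None if data_rng.random() < 0.2 else v for v in y_full]

# Five completed datasets from five independent runs of the sampler.
completed = [srmi_two_variables(x_obs, y_obs, seed=m) for m in range(5)]
```

In practice each conditional model would be chosen to suit the variable type, for example logistic regression for a dichotomous variable, and convergence would be monitored rather than assumed after a fixed number of cycles.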

3.4 Software

Popular imputation software includes:

  1. SAS: PROC MI uses regression methods and propensity scores for imputation. PROC MIANALYZE combines estimates output from various complete-data procedures.
  2. S-plus: The missing data library supports different models for multivariate normal (“impGauss”), categorical variables (“impLoglin”), and the conditional Gaussian (“impCgm”) for imputation involving both continuous and categorical variables.
  3. R: It supports libraries such as “norm”, “cat”, “mix”, and “pan” for imputing data under multivariate normal models, loglinear models, general location models, and linear mixed models, respectively. In addition, libraries including “mi” and “Hmisc” impute data in more complex scenarios and provide tools for diagnostics.
  4. IVEware: Imputation and Variance Estimation software for SRMI, callable by SAS (http://www.isr.umich.edu/src/smp/ive).
  5. MICE: Multiple Imputation by Chained Equations, library available in both S-plus and R (http://web.inter.nl.net/users/S.van.Buuren/mi/html/mice.htm).
  6. ICE: SRMI library available in STATA.

Descriptions of other imputation software and more comprehensive reviews appear in23–25.

3.5 Diagnostics

Checking of imputation models is important because it can identify model defects and facilitate model improvement. As in complete-data analysis, one possible strategy is to check regression modeling assumptions such as normality and homoscedasticity of the regression residuals on the incomplete data. Graphical diagnostics can be used 26,27 (see also R library “mi”). More advanced Bayesian strategies assess the similarity between observed data and their replicates drawn from the imputation model 28. Sensitivity analysis under different imputation models is also helpful.

3.6 Practical guidance

This section summarizes some of the key steps involved in a typical multiple imputation project for practitioners.

  1. Understand the analytic objective and identify the data structure and study design.
  2. Make appropriate assumptions about the missing data mechanism.
  3. Identify variables to be included in imputation. The general strategy is to include at least all variables involved in the planned analysis. For example, when imputing missing predictors, the outcome variables should be included in imputation to retain the association between the outcome and predictors. In addition, variables not used in the analysis yet having strong correlation with incomplete variables might be included.
  4. Construct the imputation model. It is important to seek a balance between sophistication and feasibility of models. For most empirical analyses, we recommend using existing models in the literature or those provided by available software.
  5. Use the appropriate imputation package for implementation.
  6. Carry out imputation diagnostics and sensitivity analysis.
  7. Post-imputation data processing. For example, imputed values might be outside the range of observed data, making rounding and truncation necessary 29,30.
  8. Combine completed-data estimates from multiple datasets and report the results.
  9. Flag the imputations in the completed data for better reference.
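Steps 7 and 9 above can be sketched together: build an indicator matrix that flags which entries were imputed, then round and truncate only those entries to the allowed range. The function name, the list-of-rows data layout, and the bounds are hypothetical, and rounding is shown only as one possible post-processing choice, given the potential bias noted in 29,30.

```python
def flag_and_postprocess(original, completed, lower, upper):
    """Return (flags, processed): flags has 1 where the original value was
    missing (None) and 0 elsewhere; processed rounds imputed values to
    integers and truncates them to [lower, upper], leaving observed values
    untouched."""
    flags = [[1 if value is None else 0 for value in row] for row in original]
    processed = []
    for completed_row, flag_row in zip(completed, flags):
        new_row = []
        for value, flag in zip(completed_row, flag_row):
            if flag:
                # Python's round() sends exact halves to the nearest even integer.
                value = min(max(round(value), lower), upper)
            new_row.append(value)
        processed.append(new_row)
    return flags, processed


# Hypothetical ordinal items coded 1 to 4; None marks a missing entry.
original = [[1, None], [None, 3]]
completed = [[1, 4.7], [0.2, 3]]
flags, processed = flag_and_postprocess(original, completed, lower=1, upper=4)
print(flags)
print(processed)
```

Storing the flag matrix alongside the completed data lets later users separate observed from imputed values when checking or reporting results.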

4 Example

4.1 Study background: CanCORS

The CanCORS consortium is funded by the National Cancer Institute and the Veteran’s Administration to examine services and outcomes of care delivered to population-based cohorts of patients diagnosed with lung or colorectal cancer from 2003 to 2005 in multiple regions of the country. It consists of seven study sites. Each site identified appropriate samples to obtain combined cohorts of approximately 5000 patients diagnosed with each cancer. CanCORS collected data from multiple sources including patient surveys and medical records. The database contains information about the care received during different stages of illness, including diagnosis, treatment, surveillance for recurrent disease, and palliation, as well as data on various clinical and patient-reported outcomes and patient preferences and behaviors.

4.2 Missing data problem in hospice study

Huskamp et al.1 examined patterns of cancer hospice care, which includes a broad array of palliative and support services for individuals with terminal illness. The study identified patient characteristics and preferences that are associated with patient reports, in the baseline survey and medical records, that they had discussed hospice with a care provider. The outcome variable is patients’ hospice discussion, and predictors include patients’ clinical and sociodemographic characteristics. In particular, patients’ comorbidity scale variable, which summarizes the severity of their co-existing ailments, was included as a predictor.

We use this study to construct a simplified illustrative example concerning the association between patients’ cardiovascular disease variables and hospice discussion. These predictors, including myocardial infarction, heart failure, stroke, and diabetes, were obtained from the baseline survey but were not used in the original analysis in1. The study subsample (n = 2474) consists of all advanced lung cancer patients (stage IIIB or IV). Table 2 describes the variables from the analytic subsample; some of them have a substantial amount of missing data, and the missing items exhibit no systematic pattern.

Table 2
Variables for hospice care analysis

4.3 Logistic regression analysis with missing data

The substantive analysis is a logistic regression for hospice discussion and predictors include all other variables in the subsample. We carry out a multiple imputation analysis, using the SRMI strategy implemented in IVEware. In this dataset, all the variables are categorical and some of them are ordinal (e.g., income, education, and age). In IVEware we classify all of the variables involved in the imputation as categorical, and thus binary or general logit models are used to fit each conditional regression model for imputation. We choose to present the results from imputed data after running the program for 5 iterations to achieve convergence. We also apply the CC analysis and the missing data indicator method as the ad-hoc approaches for comparative purposes.

Table 3 shows the results from each method. The regression estimates from CC and SRMI are somewhat different, and the latter produces smaller standard errors than the former for all regressors, illustrating the superior efficiency of the multiple imputation analysis. At the 5% level, predictors associated with Hispanic ethnicity, divorced/separated marital status, and the age 81+ group are non-significant under CC but significant under SRMI, while the predictor associated with a history of myocardial infarction (significant under CC) becomes non-significant under SRMI. In this case, CC discards close to 30% of the subjects. When the assumption of MCAR is violated, as in our example, CC removes cases in a non-random fashion and could distort the joint distribution among the variables. As a result, it could both bias point estimates and inflate standard errors, and thus misidentify significant predictors. The results from the missing data indicator method are overall similar to those from SRMI, although the former also discards around 4% of the subjects with the missing outcome variable (i.e., hospice discussion).

Table 3
Hospice care analysis results

From the substantive point of view, the multiple imputation analysis results do not appear to suggest a significant association between late stage lung cancer patients’ cardiovascular disease status and their tendency to talk about hospice. This is consistent with Huskamp et al.1 in which the original analysis did not identify patients’ comorbidity as a significant predictor of hospice discussion.

To assess the fit of the SRMI models used, we perform posterior predictive checking 28 to examine the deviation of analysis results of interest (i.e., logistic regression coefficients and their standard errors), computed from the completed data with imputations, from the same quantities calculated from simulated copies of the completed data under the model. Large deviations would indicate model inadequacy for the target analysis. Our model assessment shows that the deviation is small (results not shown), suggesting that the SRMI models are adequate for the logistic regression analysis.

In addition, we carry out some sensitivity analysis using alternative modeling strategies. When using the SRMI, another modeling option is to treat income, education, and age as continuous to capture the underlying ordering of these variables. Their corresponding conditional regression models are thus linear normal models. After rounding the continuous imputations to the nearest allowed integer values, the logistic regression analysis results (not shown) are similar to those from the option treating all variables as categorical. We also apply the joint modeling strategy using a general location model. Specifically, we treat race, marital status, and insurance as nominal variables and assume that they follow a loglinear model with conditional independence. We treat other variables (binary or ordinal) as continuous with multivariate normal distributions conditional on the categorical variables, and round the imputations prior to the completed-data analysis. This approximation to the joint distribution 16 is implemented using the library “mix” in R. Estimates for the logistic model (not shown) are also rather similar to those obtained using the SRMI strategy. The sensitivity analysis results increase confidence in our missing data inferences.

5 Discussion

In this paper, we focus on missing data problems and multiple imputation for cross-sectional regression analysis assuming MAR. Methods have been developed for more complicated designs (e.g., longitudinal or spatial studies) or missingness mechanisms (i.e., NMAR). The relevant discussion is beyond the scope of this paper, and some of the topics can be found in31,32. In addition, many other analytic problems can be viewed and solved from the perspective of incomplete data and multiple imputation. Examples include causal inferences on potential outcomes, measurement error problems, and confidential use of public databases33,34.

In our opinion, multiple imputation is a principled and practical approach to missing data problems. This approach involves an initial investment in multiply imputing the missing values. Once the data are multiply imputed, complete-data software can be used to repeatedly analyze the completed datasets, extract the point estimates and their standard errors, and combine them using simple rules. Though this method requires additional storage and extra steps of repeated analysis and combining estimates, in the grand scheme of health services and outcomes investigations, it is a minor step, especially owing to the availability of software for creating multiple imputations and performing analysis.

The multiple imputation approach can be used by a single researcher analyzing a particular incomplete dataset for a unique goal. It also fits well in a setting involving a large dataset with multiple researchers using different portions of the dataset for various aims 35. In such a scenario, imputation by the data producer allows the incorporation of specialized knowledge about the reasons for missing data in the imputation procedure, including confidential information that cannot be released to the public or other variables in the imputation process that may not be used in substantive analysis by a particular researcher. Moreover, the nonresponse problem is solved in the same way for all users so that analyses will be consistent across users. Related examples include imputation projects for the Fatal Accident Reporting System 36, census industry and occupation codes 37, the National Health and Nutrition Examination Survey 16, the National Health Interview Survey 38, and CanCORS 39.

We encourage investigators and practitioners to keep themselves updated about current developments in multiple imputation methods and software. However, it is important that missing data be considered not solely a data analysis problem, but also a study design and implementation issue. That is, we should strive to prevent missing data in the first place!

Supplementary Material

Supp1

Acknowledgments

Sources of Funding

The work was supported by the grant U01-CA93344 from the National Cancer Institute.

The author thanks Alan M. Zaslavsky and Sharon-Lise T. Normand for their helpful suggestions.

Footnotes

Conflict of Interest Disclosure

None

References

1. Huskamp HA, Keating NL, Malin JL, Zaslavsky AM, Weeks JC, Earle CC. Discussions with physicians about hospice among patients with metastatic lung cancer. Archives of Internal Medicine. 2009;169:954–962. [PMC free article] [PubMed]
2. Ayanian JZ, Chrischilles EA, Fletcher RH, Fouad MN, Harrington DP. Understanding cancer treatment and outcomes: the Cancer Care Outcomes Research and Surveillance Consortium. Journal of Clinical Oncology. 2003;22:2292–2296. [PubMed]
3. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
4. Rubin DB. Inference and missing data (with discussion) Biometrika. 1976;63:581–592.
5. Little RJA, Rubin DB. Statistical Analysis of Missing Data. 2. New York: Wiley; 2002.
6. Rubin DB. Multiple imputation after 18+ years (with discussion) Journal of the American Statistical Association. 1996;91:473–489.
7. Collins LM, Schafer JL, Kam CM. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods. 2001;6:330–351. [PubMed]
8. Schafer JL. Multiple imputation with multivariate problems when the imputation and analysis models differ. Statistica Neerlandica. 2003;57:19–35.
9. Little RJA. Regression with missing X’s: a review. Journal of the American Statistical Association. 1992;87:1227–1237.
10. Belin TR, Hu MY, Young AS, Grusky O. Using multiple imputation to incorporate cases with missing items in a mental health study. Health Services and Outcomes Research Methodology. 2000;1:7–22.
11. Schafer JL. Multiple imputation: a primer. Statistical Methods in Medical Research. 1999;8:3–15. [PubMed]
12. Cochran WG. Sampling Techniques. New York: Wiley; 1977.
13. Ibrahim JG, Chen MH, Lipsitz SR, Herring AH. Missing data methods for generalized linear models: a comparative review. Journal of the American Statistical Association. 2005;100:332–346.
14. Carpenter JR, Kenward MG, Vansteelandt S. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. Journal of the Royal Statistical Society, Series A (Statistics in Society) 2006;169:571–584.
15. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion) Journal of the Royal Statistical Society, Series B (Statistical Methodology) 1977;39:1–38.
16. Schafer JL. Analysis of Incomplete Multivariate Data. London: Chapman and Hall; 1997.
17. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research. 2007;16:219–242. [PubMed]
18. Liu M, Taylor JMG, Belin TR. Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics. 2000;56:1157–1163. [PubMed]
19. Schafer JL, Yucel R. Computational strategies for multivariate linear-mixed models with missing values. Journal of Computational and Graphical Statistics. 2002;11:437–457.
20. van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999;18:681–694. [PubMed]
21. Raghunathan TE, Lepkowski JM, VanHoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology. 2001;27:85–95.
22. Schenker N, Taylor JMG. Partially parametric techniques for multiple imputation. Computational Statistics and Data Analysis. 1996;22:425–446.
23. Harel O, Zhou XH. Multiple imputation: review of theory, implementation, and software. Statistics in Medicine. 2007;26:3057–3077. [PubMed]
24. Horton NJ, Kleinman KP. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. The American Statistician. 2007;61:79–90. [PMC free article] [PubMed]
25. Yu LM, Burton A, Rivero-Arias O. Evaluation of software for multiple imputation of semicontinuous data. Statistical Methods in Medical Research. 2007;16:243–258. [PubMed]
26. Gelman AE, Mechelen IV, Verbeke G, Heitjan DF, Meulders M. Multiple imputation for model checking: completed-data plots with missing and latent data. Biometrics. 2005;61:74–85. [PubMed]
27. Abayomi K, Gelman AE, Levy M. Diagnostics for multivariate imputations. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2008;57:273–291.
28. Gelman AE, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2. London: Chapman and Hall; 2004.
29. Horton NJ, Lipsitz SR, Parzen M. A potential bias when rounding in multiple imputation. The American Statistician. 2003;57:229–232.
30. Yucel RM, He Y, Zaslavsky AM. Using calibration to improve rounding in imputation. The American Statistician. 2008;62:125–129.
31. Molenberghs G, Kenward MG. Missing Data in Clinical Studies. West Sussex: Wiley; 2007.
32. Daniels MJ, Hogan JW. Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Boca Raton: Chapman and Hall; 2008.
33. Gelman AE, Meng XL. Applied Bayesian Modeling and Causal Inference from Incomplete Data Perspective. New York: Wiley; 2004.
34. Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. Journal of the American Statistical Association. 2007;102:1462–1471.
35. Rubin DB. Multiple imputations in sample surveys - a phenomenological Bayesian approach to nonresponse. Proceedings of the Survey Research Methods Section of the American Statistical Association. 1978;1:20–34.
36. Heitjan DF, Little RJA. Multiple imputation for the Fatal Accident Reporting System. Journal of the Royal Statistical Society: Series C (Applied Statistics) 1991;40:13–29.
37. Schenker N, Treiman DJ, Weidman L. Analyses of public use decennial census data with multiply imputed industry and occupation codes. Journal of the Royal Statistical Society: Series C (Applied Statistics) 1993;42:545–556. [PubMed]
38. Schenker N, Raghunathan TE, Chiu PL, Makuc DM, Zhang G, Cohen AJ. Multiple imputation for missing income data in the National Health Interview Survey. Journal of the American Statistical Association. 2006;101:924–933.
39. He Y, Zaslavsky AM, Harrington DP, Catalano P, Landrum MB. Multiple imputation in a large-scale complex survey: a practical guide. Statistical Methods in Medical Research. 2009 In Press. [PMC free article] [PubMed]