Missing data are a pervasive problem in health investigations. We describe some background of missing data analysis and criticize ad-hoc methods which are prone to serious problems. We then focus on multiple imputation, in which missing cases are first filled in by several sets of plausible values to create multiple completed datasets, then standard complete-data procedures are applied to each completed dataset, and finally the multiple sets of results are combined to yield a single inference. We introduce the basic concepts and general methodology, and provide some guidance for application. For illustration, we use a study assessing the effect of cardiovascular diseases on hospice discussion for late stage lung cancer patients.
The empirical basis of health services and outcomes research largely rests on statistical analysis of data collected in studies. However, it is typical that not all planned observations are made. The reasons for missing data are numerous. Subjects may have missed a visit for a practical or administrative reason, or data may not have been collected at a particular time because of equipment failure. Subjects may drop out of studies because side effects associated with the treatment prevented them from continuing to participate, or they may not report their health outcome because they are too sick. Missing data can also arise from study designs. For example, different survey forms may be used in one study, so that some variables are collected for some but not all patients.
Missing data can be a serious impediment for data analysis. For example, Huskamp et al.1 investigated patterns of hospice discussion with providers by late stage lung cancer patients. They used data collected from a multi-site cohort study of care for patients with lung or colorectal cancer by the Cancer Care Outcomes Research and Surveillance (CanCORS) Consortium2. As is typical in large health or social studies, there is a substantial amount of missing data in the CanCORS database and the missing cases display no systematic pattern. In our illustrative example (Section 4), in which the hospice study data are used and the analytic goal is to use a logistic regression to assess the effect of the patients’ cardiovascular disease status on their tendency of hospice discussion, the fractions of missing observations range from 0.04% to 19.48% for the variables including both the outcome and predictors. Simply removing the patients with missing cases from the analysis would result in a loss of around 30% of the sample, raising serious concerns about the validity of the results.
In this paper, we review the multiple imputation3 approach to missing data problems in the context of cross-sectional data analysis. Section 2 introduces some background. Section 3 reviews multiple imputation. Section 4 uses the hospice study example to illustrate the methods. Finally, Section 5 concludes with a discussion.
Table 1 shows a few lines of the dataset used in the hospice study example. Typically a dataset in analytic form can be characterized as a rectangular matrix (row=subjects, column=variables), and the missing data are the elements that we do not observe in this matrix, marked by “?” in Table 1.
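The missingness pattern of a dataset can be represented by a missing indicator matrix of 1’s and 0’s shaped like the data matrix, with 1’s for missing values and 0’s for observed ones. We think of missingness as the consequence of a random process that can be characterized by missingness models. For example, a model relating missingness of myocardial infarction to other variables in the dataset may suggest that older patients with stroke are more likely to have nonresponse. Three broad types of missingness mechanisms4, moving from the simplest to the most general, are:
Missing completely at random (MCAR): the probability of missingness does not depend on any data values, observed or missing.
Missing at random (MAR): the probability of missingness may depend on observed values but, given these, not on the missing values.
Not missing at random (NMAR): the probability of missingness depends on the missing values themselves, even after conditioning on the observed data.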
Understanding which class the missing data mechanism falls into is key to making correct statistical inferences. MAR can never be proved or falsified using data alone, as the NMAR assumption asserts something that is not available to the researchers from the observed data. It is possible, however, to test whether data are MCAR in many situations: if meaningful differences exist between those with and without missing data for some variables, this provides evidence against MCAR.
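The missing indicator matrix and the informal check against MCAR described above can be sketched in a few lines. This is only an illustration: the rows, the variable names (`age`, `stroke`, `mi`), and the helper functions are hypothetical and are not taken from the CanCORS data. A large gap between the two group means would be evidence against MCAR, not a formal test.

```python
# Hypothetical toy rows; None marks an unobserved value ("?" in Table 1).
rows = [
    {"age": 82, "stroke": 1, "mi": None},
    {"age": 79, "stroke": 1, "mi": None},
    {"age": 61, "stroke": 0, "mi": 0},
    {"age": 58, "stroke": None, "mi": 1},
    {"age": 66, "stroke": 0, "mi": 0},
    {"age": 70, "stroke": 0, "mi": 1},
]
columns = ["age", "stroke", "mi"]


def missing_indicator(rows, columns):
    """Return a matrix with 1 where a value is missing and 0 where it is observed."""
    return [[1 if row[col] is None else 0 for col in columns] for row in rows]


def mean_by_missingness(rows, target, covariate):
    """Mean of `covariate` among subjects missing `target` and among those observed on it."""
    if_missing = [r[covariate] for r in rows
                  if r[target] is None and r[covariate] is not None]
    if_observed = [r[covariate] for r in rows
                   if r[target] is not None and r[covariate] is not None]
    return sum(if_missing) / len(if_missing), sum(if_observed) / len(if_observed)


indicator = missing_indicator(rows, columns)
age_if_mi_missing, age_if_mi_observed = mean_by_missingness(rows, "mi", "age")
print(indicator)
print(age_if_mi_missing, age_if_mi_observed)
```

In this toy example the subjects with missing myocardial infarction are markedly older than those with observed values, which is the kind of difference that argues against MCAR.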
Under MAR (including MCAR as a special case), we can ignore the missingness models and focus on the missing-data models, which describe the predictive relationship between the incomplete variables and observed ones (e.g., a model relating missing myocardial infarction to other variables in the dataset may indicate that older patients are more likely to have myocardial infarction). Under NMAR, however, missingness models generally have to be specified in order to obtain the correct inferences3,5, although using MAR models including more variables may achieve close results6–8.
A common missing data approach is complete-case analysis (CC), which uses only subjects who have all variables observed and is also the default option in many statistical software packages. When data are MCAR, CC analysis results are unbiased. When data are MAR but not MCAR, it is permissible to exclude the missing observations provided that a regression model controls for all the variables that affect the probability of missingness9. But CC analysis generally has major deficiencies5,10. The results can be biased when data are not MCAR. In addition, the reduction of statistical power by discarding cases is a major drawback. For example, suppose data are MCAR across 20 variables and the missingness fraction is 5% for each variable. Using CC analysis will lose close to two thirds of the subjects because the fully observed subjects only account for (1 − 5%)^20 ≈ 36% of the original sample.
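The arithmetic in this example is easy to reproduce. The short sketch below assumes, as the example implicitly does, that missingness is independent across the 20 variables; the function name is ours.

```python
def complete_case_fraction(missing_fraction, n_variables):
    """Expected share of fully observed subjects when each variable is
    missing independently with the same probability."""
    return (1.0 - missing_fraction) ** n_variables


retained = complete_case_fraction(0.05, 20)
print(f"retained: {retained:.1%}, lost: {1 - retained:.1%}")
# prints "retained: 35.8%, lost: 64.2%"
```

Changing the arguments shows how quickly the loss grows: even a small per-variable missingness fraction removes most of the sample once many variables are involved.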
Imputation methods fill in missing values to maintain the full sample so that standard software can be easily used to analyze completed data. In addition, the researcher using the imputed data can concentrate on substantive questions of interest rather than incomplete-data problems11.
However, many ad-hoc imputation methods (e.g., mean imputation and treating missing data as a separate category) are based on missing data models with implausible assumptions. Furthermore, these methods impute the missing data only once and then proceed to the completed data analysis. These single imputation strategies generally underestimate the standard errors of estimates because choosing a single imputation pretends that we know the unobserved value with certainty, when actually it is unknown but estimated by the imputation method.
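A small numerical sketch shows how single imputation understates standard errors. The values are hypothetical, and the naive standard error of the mean (sample standard deviation divided by the square root of the sample size) is used only for illustration.

```python
import math
import statistics

observed = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical observed values
n_missing = 5                          # hypothetical number of missing values


def naive_se(values):
    """Standard error of the mean, treating every value as truly observed."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Mean imputation: every missing value is replaced by the observed mean.
mean_imputed = observed + [statistics.fmean(observed)] * n_missing

se_observed_only = naive_se(observed)
se_mean_imputed = naive_se(mean_imputed)
print(round(se_observed_only, 3), round(se_mean_imputed, 3))
```

The point estimate is unchanged, but the standard error computed from the mean-imputed data is less than half of that from the observed cases alone, even though no new information has been added.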
Nonresponse weighting12 is a principled approach for making the subjects included in the analysis representative of the original sample. For example, suppose that 100% of whites and 50% of African Americans responded in a survey. If there are large differences between whites and African Americans in the variable of interest, then the sample mean of the observed cases would be a biased estimate of the complete-data mean. Assuming MCAR among African Americans, we can weight their observations by 2 (= 1/0.5), so that each respondent represents two cases from the original sample; the weighted average of the observed cases then gives a more accurate estimate. In more general scenarios, the weights are the inverse of the predicted probabilities of response estimated from the missingness models of incomplete variables.
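The weighting example can be worked through numerically. The sketch below uses made-up group sizes and outcome values (60 sampled subjects in a fully responding group A, 40 in a group B with a 50% response rate); only the response rates of 100% and 50% come from the example above.

```python
# Hypothetical respondents: (group, outcome value, response probability).
respondents = (
    [("A", 10.0, 1.0)] * 60    # group A: all 60 sampled subjects responded
    + [("B", 20.0, 0.5)] * 20  # group B: 20 of 40 sampled subjects responded
)


def unweighted_mean(rows):
    """Plain average over the respondents."""
    return sum(y for _, y, _ in rows) / len(rows)


def weighted_mean(rows):
    """Average with each respondent weighted by 1 / response probability."""
    weights = [1.0 / p for _, _, p in rows]
    total = sum(w * y for w, (_, y, _) in zip(weights, rows))
    return total / sum(weights)


naive = unweighted_mean(respondents)
adjusted = weighted_mean(respondents)
print(naive, adjusted)  # prints "12.5 14.0"
```

With these numbers the full-sample mean is (60 × 10 + 40 × 20) / 100 = 14, which the weighted estimate recovers and the unweighted one misses.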
Weighting might be best suited for unit nonresponse in a survey (i.e., cases sampled for the survey but not participating in an interview, such as noncontacts and refusers). On the other hand, by including only subjects with complete data, it ignores partial information from subjects with incomplete data and can thus lead to a reduction in efficiency. In addition, weighting becomes considerably less tractable with multiple missing variables when there is no regular pattern for missing data13,14. Furthermore, since weights are estimated from the proposed models, this extra level of prediction will introduce more uncertainty to the inference. Sometimes extreme estimates of weights (if the predicted probabilities are close to 0 or 1) can lead to erratic variance estimates.
Another principled approach is to maximize the likelihood (ML) function of incomplete data, with the missing data values removed from the complete-data likelihood by a process of summation or integration (Appendix). The resulting parameter estimates are most efficient because all observed data are used. Principles and examples for applying ML to incomplete-data problems can be found in reference 5.
Incomplete-data likelihood functions typically have a complicated form; special computational techniques such as the EM algorithm15 may be needed to maximize them. Computational aspects of ML with missing data are reviewed in reference 16. Because ML is usually problem-specific, special software typically needs to be developed for a particular problem. Thus, the technical difficulties involved in constructing a likelihood model and carrying out the computation make ML less appealing for most practitioners.
Under MAR, the multiple imputation3 approach seeks to retain the advantages of ML estimates while also allowing the uncertainty due to imputation, which is ignored in single imputation, to be incorporated into the completed-data analysis. It involves creating more than one set of replacements for the missing values based on plausible models for data, therefore generating multiple completed datasets for analysis (Figure 1). The statistical reasoning behind multiple imputation is that the observed-data likelihood can be approximated by the average of the completed-data likelihood over unknown missing values (Appendix). That is, multiple imputation analysis that combines the likelihood-based analysis from each completed dataset is approximately equivalent to the analysis based on the observed-data likelihood, while the imputation uncertainty is reflected by the variation across the multiple completed datasets.
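The analysis of multiply imputed data proceeds as follows:
Step 1: Fill in the missing values with several sets of plausible values to create multiple completed datasets.
Step 2: Apply the standard complete-data procedure to each completed dataset.
Step 3: Combine the multiple sets of results to yield a single inference.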
The combining rules in Step 3 contain some formulas for calculating the average of the estimates across multiple imputations and the variances of the estimates, both within and between imputations (Appendix). They have been incorporated into imputation packages (Section 3.4) for automatic calculations.
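For readers who want to see the combining step spelled out, the sketch below implements the standard form of Rubin's rules for a scalar estimate. The Appendix formulas are not reproduced in this section, so the code assumes the usual definitions: the pooled estimate is the average across imputations, and the total variance is the within-imputation variance plus the between-imputation variance inflated by (1 + 1/M). The function name and the input numbers are hypothetical.

```python
import math
import statistics


def combine_rubin(estimates, variances):
    """Combine M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    if between > 0:
        # Degrees of freedom of the reference t distribution.
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        df = math.inf
    return {"estimate": q_bar, "se": math.sqrt(total), "total_var": total, "df": df}


# Hypothetical regression coefficients and variances from M = 5 completed datasets.
pooled = combine_rubin([0.50, 0.55, 0.45, 0.60, 0.40], [0.04] * 5)
print(pooled)
```

The pooled standard error here is larger than the square root of any single completed-data variance, which is how the imputation uncertainty enters the final inference.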
Plausible imputations should give reasonable predictions for the missing data, and the variability among them must reflect an appropriate degree of uncertainty. Rubin3 recommends that imputations be created through Bayesian arguments: specify a parametric model for the complete data under MAR, assume a prior distribution for the unknown model parameters, and simulate multiple independent draws from the conditional distribution of missing values given observed data by Bayes’ theorem. A simple example for a univariate missing outcome is given in the Appendix.
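The Bayesian recipe can be made concrete for the simplest case. The sketch below is an assumption on our part and may differ from the Appendix example: it takes a single normally distributed variable with values missing completely at random and the standard noninformative prior on the mean and variance. Each imputation first draws the parameters from their posterior and then draws the missing values given those parameters, so that parameter uncertainty carries over into the imputations. The function name and data are hypothetical.

```python
import random
import statistics


def impute_normal(observed, n_missing, n_imputations, seed=0):
    """Draw sets of imputations for a univariate normal variable.

    Assumes a normal model with a noninformative prior, under which
    (n - 1) * s^2 / sigma^2 follows a chi-square distribution with n - 1
    degrees of freedom and mu given sigma^2 is normal around the sample mean.
    """
    rng = random.Random(seed)
    n = len(observed)
    y_bar = statistics.fmean(observed)
    s2 = statistics.variance(observed)
    imputations = []
    for _ in range(n_imputations):
        # Chi-square(n - 1) draw via Gamma(shape=(n - 1) / 2, scale=2).
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        mu = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
        imputations.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_missing)])
    return imputations


# Hypothetical observed values, with 3 missing values imputed 5 times.
observed = [4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 4.4, 5.9]
imputed_sets = impute_normal(observed, n_missing=3, n_imputations=5)
```

Plugging in the sample mean and variance directly, instead of drawing them, would produce imputations that are too similar to one another and would understate the between-imputation variance.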
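Various imputation models have been developed within more general and complicated contexts. See reference 17 for a summary and references. In general, the strategy of building imputation models falls into two categories:
Joint modeling, which specifies a joint distribution for all of the variables and draws the imputations from it.
Sequential regression multivariate imputation (SRMI), which specifies a univariate conditional regression model for each incomplete variable given the other variables and imputes the variables in turn.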
Compared to the joint modeling approach, an appealing feature of SRMI is that it is relatively easy to accommodate complex data features in univariate regression models. Constructing these regression models can follow common guidelines of regression modeling applied to the data at hand. For continuous variables, the model may involve a linear regression model or its robust extensions22. Dichotomous variables may be modeled by logistic regression, and categorical variables with more than two categories by polytomous models. Poisson models can be used for incomplete count data, and two-part models for a variable with a mixture of point mass and continuous values. Detailed information can be found in the manuals for the related software (Section 3.4).
Some popular imputation software packages include:
Checking of imputation models is important because it can identify model defects and facilitate model improvement. As in complete-data analysis, one possible strategy is to check regression modeling assumptions such as normality and homoscedasticity of the regression residuals on the incomplete data. Graphical diagnostics can be used26,27 (see also the R library “mi”). More advanced Bayesian strategies assess the similarity between observed data and their replicates drawn from the imputation model28. Sensitivity analysis under different imputation models is also helpful.
This section summarizes some of the key steps involved in a typical multiple imputation project for practitioners.
The CanCORS consortium is funded by the National Cancer Institute and the Veteran’s Administration to examine services and outcomes of care delivered to population-based cohorts of patients diagnosed with lung or colorectal cancer from 2003 to 2005 in multiple regions of the country. It consists of seven study sites. Each site identified appropriate samples to obtain combined cohorts of approximately 5000 patients diagnosed with each cancer. CanCORS collected data from multiple sources including patient surveys and medical records. The database contains information about the care received during different stages of illness, including diagnosis, treatment, surveillance for recurrent disease, and palliation, as well as data on various clinical and patient-reported outcomes and patient preferences and behaviors.
Huskamp et al.1 examined patterns of cancer hospice care, which includes a broad array of palliative and support services for individuals with terminal illness. The study identified patient characteristics and preferences associated with patients’ reports, in the baseline survey and medical records, that they had discussed hospice with a care provider. The outcome variable is patients’ hospice discussion, and predictors include patients’ clinical and sociodemographic characteristics. In particular, patients’ comorbidity scale variable, which summarizes the severity of their co-existing ailments, was included as a predictor.
We use this study to construct a simplified illustrative example concerning the association between patients’ cardiovascular disease variables and hospice discussion. These predictors, including myocardial infarction, heart failure, stroke, and diabetes, were obtained from the baseline survey but were not used in the original analysis1. The study subsample (n = 2474) consists of all advanced lung cancer patients (stage IIIB or IV). Table 2 describes the variables from the analytic subsample; some of them have a substantial amount of missing data, and the missing items exhibit no systematic pattern.
The substantive analysis is a logistic regression for hospice discussion and predictors include all other variables in the subsample. We carry out a multiple imputation analysis, using the SRMI strategy implemented in IVEware. In this dataset, all the variables are categorical and some of them are ordinal (e.g., income, education, and age). In IVEware we classify all of the variables involved in the imputation as categorical, and thus binary or general logit models are used to fit each conditional regression model for imputation. We choose to present the results from imputed data after running the program for 5 iterations to achieve convergence. We also apply the CC analysis and the missing data indicator method as the ad-hoc approaches for comparative purposes.
Table 3 shows the results from each method. The regression estimates from CC and SRMI are somewhat different, and the latter produces smaller standard errors than the former for all regressors, illustrating the superior efficiency of the multiple imputation analysis. At the 5% level, predictors associated with Hispanic ethnicity, divorced/separated marital status, and the age 81+ group are non-significant under CC but significant under SRMI, while the predictor associated with a history of myocardial infarction (significant under CC) becomes non-significant under SRMI. In this case, CC discards close to 30% of the subjects. When the assumption of MCAR is violated, as in our example, CC removes cases in a non-random fashion and could distort the joint distribution among the variables. As a result, it could both bias point estimates and inflate standard errors, and thus misidentify significant predictors. The results from the missing data indicator method are overall similar to those from SRMI, although the former also discards around 4% of the subjects with the missing outcome variable (i.e., hospice discussion).
From the substantive point of view, the multiple imputation analysis results do not appear to suggest a significant association between late stage lung cancer patients’ cardiovascular disease status and their tendency to talk about hospice. This is consistent with Huskamp et al.1 in which the original analysis did not identify patients’ comorbidity as a significant predictor of hospice discussion.
To assess the fit of the SRMI models used, we perform posterior predictive checking28 to examine the deviation of the analysis results of interest (i.e., logistic regression coefficients and their standard errors), computed from the completed data with imputations, from the same quantities calculated from simulated copies of the completed data under the model. Large deviations would indicate model inadequacy for the target analysis. Our model assessment shows that the deviation is small (results not shown), suggesting that the SRMI models are adequate for the logistic regression analysis.
In addition, we carry out some sensitivity analysis using alternative modeling strategies. When using the SRMI, another modeling option is to treat income, education, and age as continuous to capture the underlying ordering of these variables. Their corresponding conditional regression models are thus linear normal models. After rounding the continuous imputations to the nearest allowed integer values, the logistic regression analysis results (not shown) are similar to those from the option treating all variables as categorical. We also apply the joint modeling strategy using a general location model. Specifically, we treat race, marital status, and insurance as nominal variables and assume that they follow a loglinear model with conditional independence. We treat other variables (binary or ordinal) as continuous with multivariate normal distributions conditional on the categorical variables, and round the imputations prior to the completed-data analysis. This approximation to the joint distribution16 is implemented using the library “mix” in R. Estimates for the logistic model (not shown) are also rather similar to those obtained using the SRMI strategy. The sensitivity analysis results increase confidence in our missing data inferences.
In this paper, we focus on missing data problems and multiple imputation for cross-sectional regression analysis assuming MAR. Methods have been developed for more complicated designs (e.g., longitudinal or spatial studies) or missingness mechanisms (i.e., NMAR). The relevant discussion is beyond the scope of this paper, and some of the topics can be found in references 31 and 32. In addition, many other analytic problems can be viewed and solved from the perspective of incomplete data and multiple imputation. Examples include causal inferences on potential outcomes, measurement error problems, and confidential use of public databases33,34.
In our opinion, multiple imputation is a principled and practical approach to missing data problems. This approach involves an initial investment in multiply imputing the missing values. Once the data are multiply imputed, complete-data software can be used to repeatedly analyze the completed datasets, extract the point estimates and their standard errors, and combine them using simple rules. Though this method requires additional storage and extra steps of repeated analysis and combining estimates, in the grand scheme of health services and outcomes investigations it is a minor step, especially given the availability of software for creating multiple imputations and performing analysis.
The multiple imputation approach can be used for a single researcher analyzing a particular incomplete dataset for a unique goal. It also fits well for a setting involving a large dataset with multiple researchers using different portions of the dataset for various aims 35. In such a scenario, imputation by the data producer allows the incorporation of specialized knowledge about the reasons for missing data in the imputation procedure, including confidential information that cannot be released to the public or other variables in the imputation process that may not be used in substantive analysis by a particular researcher. Moreover, the nonresponse problem is solved in the same way for all users so that analyses will be consistent across users. Related examples include imputation projects for the Fatal Accident Reporting System 36, census industry and occupation codes 37, the National Health and Nutrition Examination Survey 16, the National Health Interview Survey 38, and CanCORS 39.
We encourage investigators and practitioners to keep themselves updated on current developments in multiple imputation methods and software. However, it is important that missing data be considered not solely a data analysis problem, but also a study design and implementation issue. That is, we should strive to prevent missing data in the first place!
Sources of Funding
The work was supported by the grant U01-CA93344 from the National Cancer Institute.
The author thanks Alan M. Zaslavsky and Sharon-Lise T. Normand for their helpful suggestions.
Conflict of Interest Disclosure