Multiple imputation has emerged as an appropriate and flexible way of handling missing data. Complete-case methods, which simply discard observations with any missing data, rest on the usually unrealistic assumption that the data are MCAR, or at least MAR within categories defined by the variables included in the analysis model (16). Some researchers avoid imputation approaches because of fears of “making up data.” In fact, complete-case analyses require stronger assumptions than imputation does.
Multiple imputation methods work by imputing (or filling in) the missing values with reasonable predictions multiple times. This step creates a set of “complete” data sets with no missing values. The analysis is then run separately on each data set, and the results are combined across data sets by using the multiple imputation combining rules (5). The resulting estimates account for both within- and between-imputation uncertainty, reflecting the fact that the imputed values are not the known true values. Doing so yields correct standard error estimates and coverage rates, in contrast to single imputation methods or simply including a missing data indicator for each variable in the model (17).
The original approaches to creating multiple imputations generally assumed a large, joint model for all of the variables, for example, multivariate normality (6). More recently, a more flexible method called multiple imputation by chained equations (MICE) has been developed (19). MICE cycles through the variables, modeling each conditional on the others. The imputations themselves are predicted values from these regression models, with the appropriate random error included. The procedure is as follows: first, the variable with the least missingness (variable 1) is imputed conditional on all variables with no missingness. The variable with the second least missingness is then imputed conditional on the variables with no missing values and variable 1, and so on. After all of the variables have been cycled through in this way (one “iteration”), there are no longer any missing values in the data. This process is then repeated using this data set with no missing values.
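The cycling scheme just described can be sketched in a few lines of Python. This is a hypothetical, simplified illustration using linear regressions with numpy; the CMHI imputations themselves were produced with dedicated software, and real MICE implementations additionally draw the regression parameters, support non-Gaussian models, and perform model selection.

```python
import numpy as np

rng = np.random.default_rng(0)

def mice_cycle(X, obs, rng):
    """One MICE iteration: re-impute each variable with missing values,
    ordered from least to most missingness, using a linear regression on
    the currently complete variables plus random error."""
    X = X.copy()
    n_miss = (~obs).sum(axis=0)
    order = [j for j in np.argsort(n_miss) if n_miss[j] > 0]
    for j in order:
        rows = obs[:, j]                     # rows where variable j was observed
        preds = [k for k in range(X.shape[1])
                 if k != j and not np.isnan(X[:, k]).any()]
        Z = np.column_stack([np.ones(len(X))] + [X[:, k] for k in preds])
        beta, *_ = np.linalg.lstsq(Z[rows], X[rows, j], rcond=None)
        resid = X[rows, j] - Z[rows] @ beta
        # predicted value plus appropriate random error
        X[~rows, j] = Z[~rows] @ beta + rng.normal(0.0, resid.std(), (~rows).sum())
    return X

def mice_impute(X, n_iter=10):
    """Run n_iter MICE iterations; originally observed values are never changed."""
    obs = ~np.isnan(X)
    filled = X
    for _ in range(n_iter):
        filled = mice_cycle(filled, obs, rng)
    return filled

# Toy data (hypothetical): three correlated variables, missingness in two of them.
n = 200
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(scale=0.5, size=n)
x2 = 0.5 * x0 - 0.5 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x0, x1, x2])
X[rng.random(n) < 0.2, 1] = np.nan
X[rng.random(n) < 0.3, 2] = np.nan
imputed = mice_impute(X)
```

In the first iteration, a variable is imputed conditional only on the variables that are already complete, exactly as in the ordering described above; from the second iteration onward every other variable is available as a predictor.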
Raghunathan et al. (20) recommend 10 iterations for each imputation. The idea is that, at the end of 10 iterations, the imputations should have stabilized such that the order in which variables were imputed no longer matters. The imputed values at the end of the 10th iteration, combined with the observed data, constitute one imputed data set. This entire process is then repeated to create multiple imputed data sets, such that, to create 10 complete data sets, a total of 10 × 10 iterations are performed.
A strength of MICE is that each variable can be modeled by using a model tailored to its distribution, such as Poisson, logistic, or Gaussian. MICE can also incorporate additional data challenges, such as bounds or variables defined for only a subset of the sample. Generally, 5–10 imputations are created, resulting in 5–10 “complete” data sets, although recent work has indicated that more imputations may be beneficial (21). For the CMHI, we created 10 imputed data sets. We implemented MICE by using the IVEWare package for SAS software (20); IVEWare is also available as a stand-alone package, and MICE packages also exist for Stata (22) and S-Plus (23).
Although MICE is very useful in practice, it lacks the theoretical justification of some other imputation approaches. In particular, fitting the series of conditional models does not necessarily imply a proper joint distribution, which could lead to inconsistencies across models: for example, the model for variable 2 given variable 1 may not be consistent with the model for variable 1 given variable 2. Initial research has indicated that this drawback is not generally an issue in applied problems (3), but this is an area of ongoing statistical research. Another drawback, discussed further below, is the need to include many interactions to preserve associations in the data; for example, to preserve all 3-way interactions, all of the 2-way interactions must be included in all regression models, which is often not feasible.
Complications in implementing MICE
A number of complications are encountered when actually implementing MICE, particularly with large data sets. These complications include model selection and computing limitations. Ideally, the model for each variable to be imputed should fit the data well and be as general as possible, in the sense of including as many predictors and interactions as possible, as discussed above. In practice, this step is sometimes difficult to accomplish.
One strategy is to use stepwise selection to choose the model for each variable at each iteration (20). This process includes in the regression models those variables most predictive of the variable being imputed, for example, a certain number of variables (those most predictive) or those leading to some minimum additional R-squared value (20). The exact model for each variable may change across iterations but should stabilize as the imputations themselves stabilize.
An important consideration in implementing MICE procedures is that the model used to create the imputations should be more general than the analysis model in terms of including all interactions that will be examined in the analyses (1). This step will prevent the analyses from missing associations that actually exist. For example, if there is particular interest in the relation between gender and internalizing symptoms, then that relation should be included in the imputation model. In contrast, if the variables are assumed to be independent (i.e., gender is not used to impute internalizing symptoms), then the analysis may find a lack of relation simply because the imputations were generated by assuming there was none. This is not a large issue for bivariate associations because variables that have associations in the observed data will be selected by the stepwise selection procedures. However, for crucial 3-way interactions (e.g., between race, gender, and internalizing symptoms in the CMHI study), it is important to include the three 2-way interactions between those 3 variables as possible predictors.
Because of computational limitations, it was not possible to include a large number of interactions in the CMHI imputation models. There is particular interest in disparities in care and in the interactions of race and gender with mental health needs and services. We thus included interactions between race, age, sex, income, and referral source as potential predictors. The stepwise selection models used a relatively liberal inclusion criterion, incorporating variables or interactions that added at least 0.01 to the R-squared value, which resulted in approximately 6–10 variables in each regression model. Because interactions are included as if they were separate variables (i.e., IVEWare does not know that they are interactions), the models are not necessarily hierarchical: the individual variables may not be selected even if their interaction is.
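The way interactions enter as stand-alone predictors can be shown with a small sketch (hypothetical Python code, not the software used here): product columns are simply appended to the candidate list, so a selection procedure can pick a product without also picking its components, which is why the resulting models need not be hierarchical.

```python
import numpy as np

def with_interactions(X, pairs):
    """Append a product column for each (i, j) pair of column indices.
    Downstream model selection treats these columns as ordinary variables,
    so a product can be chosen even when its components are not."""
    products = [X[:, i] * X[:, j] for i, j in pairs]
    return np.column_stack([X] + products)

# Toy data with three columns (hypothetical stand-ins for, e.g., race, age, sex).
X = np.arange(12.0).reshape(4, 3)
X_aug = with_interactions(X, [(0, 1), (0, 2)])   # add two 2-way interactions
```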
An additional issue with the CMHI data was how to handle sites. At one extreme, we could perform the imputations by completely ignoring the sites. Doing so assumes that the associations between variables are the same across sites. The other extreme would be to impute separately for each site, assuming completely different models within each site. We chose a middle route, recently recommended by Graham (1). This method treated the site indicators just as any other variable in that each site indicator could be selected by a stepwise model, if it was an important predictor of the variable under consideration. The imputation process did not otherwise account for the clustering within sites; however, any analysis using the multiply imputed data can (and should) account for that clustering.
Analyzing multiply imputed data
After the imputations are created, users have a set of complete data sets. To analyze the resulting data, the analysis is run separately within each complete data set, and the results are then combined by using the multiple imputation combining rules (5). A strength of multiple imputation is that researchers can run any model in the complete data sets, for example, a hierarchical linear model to account for clustering within sites in the CMHI data (20). The overall estimate is the average of the estimates from each of the complete data sets. The variance of that overall estimate is a function of the variance within each complete data set and the variance across the data sets. Many multiple imputation analysis functions available in common statistical software packages perform this combining for the user, making the analysis not much more difficult than analyses on a single, complete data set (11).