Clear and complete documentation of the imputation procedures, including data processing and model building, is crucial in the preparation of large databases. It helps analysts understand the mechanisms behind the imputation and conduct their analyses, allows the imputer to modify or update the imputation program when necessary, and maintains continuity over the lifetime of a complex project.
4.1 Initial data processing
The released database had to be deidentified before being accessed and used by the analysts, as in many other studies. Prior to imputation, all potential identifiers, such as the contact information of patients, surrogates, and care providers, were removed from the datasets. The subject’s birth date was replaced by age groups. Dates of post-diagnosis medical procedures were transformed to the relative time (in days) to the patient’s diagnosis date and the latter was removed from the datasets as well. Furthermore, prior to model-based imputation, we deterministically imputed values known from their logical relationship with other responses.
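The deidentification steps above can be illustrated with a minimal Python sketch. The field names, the ten-year age bands, and the use of age at diagnosis are assumptions made for illustration only; the actual variables and groupings used in the study are not listed here.

```python
from datetime import date

# Hypothetical identifier fields; the study's actual variable names are not given.
IDENTIFIER_FIELDS = {"name", "phone", "address"}
DATE_FIELDS = {"birth_date", "diagnosis_date", "surgery_date"}


def age_group(age, width=10):
    """Map an age in years to a coarse band such as '50-59' (band width is assumed)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record):
    """Return a copy with identifiers dropped and dates made relative to diagnosis."""
    dx = record["diagnosis_date"]
    birth = record["birth_date"]
    out = {k: v for k, v in record.items()
           if k not in IDENTIFIER_FIELDS and k not in DATE_FIELDS}
    # Age at diagnosis in whole years, then coarsened to a band.
    age = dx.year - birth.year - ((dx.month, dx.day) < (birth.month, birth.day))
    out["age_group"] = age_group(age)
    # Procedure date becomes days since diagnosis; a missing date stays missing.
    surgery = record.get("surgery_date")
    out["surgery_days_from_dx"] = None if surgery is None else (surgery - dx).days
    return out


record = {
    "name": "Jane Doe", "phone": "555-0100", "address": "1 Main St",
    "birth_date": date(1950, 6, 15),
    "diagnosis_date": date(2004, 3, 1),
    "surgery_date": date(2004, 3, 20),
    "stage": "II",
}
clean = deidentify(record)
```

In this sketch the diagnosis date itself is dropped once the relative times have been computed, mirroring the order of operations described above.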
4.2 Strategies for block missing and skipped items
Although the survey variables are organized into topic-related sections, imputing sections (or other divisions) of the survey separately may ignore important associations among variables, as the best predictors for a variable can be in both the same and different sections. Imputing the survey as a whole also makes the MAR assumption more plausible. In addition, although the study used several different survey forms, there are a large number of overlapping items and hence it would be inefficient to impute them separately. We concatenated all of the surveys to create a combined rectangular dataset and imputed all included items simultaneously.
Based on input from clinical investigators, concatenation and imputation were carried out separately for patients with lung and colorectal cancer to avoid complex interactions, since the two groups have different disease etiology and care patterns. We did not further stratify the imputation because interactions with other factors did not appear to be equally important.
The concatenated datasets contain 5149 subjects with 793 variables for lung cancer and 4921 subjects with 798 variables for colorectal cancer. The concatenation procedure causes block missingness in the combined dataset for the variables that are not used in all survey forms. Some of these missing data may not be used in any meaningful analysis, such as followup survey items for patients who died before attempted contact. Others may be of potential use, depending on the context of the analysis. For example, questions about patients’ incomes were omitted from the brief survey, but the missing values are meaningful for the brief survey participants and could be imputed and used if the sample for an analysis involving the income variables includes that group. We first imputed all block missing values, and after imputation, we restored missingness for the block missing data that are not eligible for analyses or not used by investigators. The validity of such massive imputations can be checked by comparison to nonresponse weighting26 (Section 5.3).
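As a rough illustration of how concatenation produces block missingness, and of how missingness can be restored after imputation, consider the following Python sketch. The form names, item names, and helper functions are hypothetical, and the imputation step itself is represented only by a placeholder assignment.

```python
MISSING = None


def concatenate(forms):
    """Stack records from several survey forms into one rectangular table.

    `forms` maps a form name to a list of dict records. Items that a form
    never asked become MISSING for all of its respondents (block missingness).
    """
    columns = []
    for records in forms.values():
        for rec in records:
            for key in rec:
                if key not in columns:
                    columns.append(key)
    rows = []
    for form_name, records in forms.items():
        for rec in records:
            row = {col: rec.get(col, MISSING) for col in columns}
            row["form"] = form_name
            rows.append(row)
    return columns, rows


def block_missing_items(forms):
    """For each form, list the items that appear on other forms but not on it."""
    all_items = {k for recs in forms.values() for r in recs for k in r}
    return {name: sorted(all_items - {k for r in recs for k in r})
            for name, recs in forms.items()}


def restore_block_missing(rows, ineligible):
    """After imputation, blank out (form, item) pairs not eligible for analysis."""
    for row in rows:
        for item in ineligible.get(row["form"], ()):
            row[item] = MISSING
    return rows


forms = {
    "full": [{"id": 1, "age_group": "60-69", "income": 3}],
    "brief": [{"id": 2, "age_group": "50-59"}],  # income not asked on this form
}
columns, rows = concatenate(forms)
blocks = block_missing_items(forms)
```

Whether a given block is restored to missing would depend on the analysis, as described above; the `ineligible` mapping stands in for that decision.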
Values for which item nonresponse is due to skip patterns in surveys would be excluded from most analyses, since skip patterns are designed to avoid collecting information that is not meaningful, such as the severity of radiation therapy side effects for a patient who did not have radiation therapy. In contrast, values for which item nonresponse is due to drop-out, or which are coded as “don’t know” or “refused”, could in general be included in analyses and hence should be imputed.
Imputation in the presence of skip patterns presents particular challenges: although the skipped item is MAR (missingness always depends on the observed skip response), skipping is deterministic and therefore the imputation model for the skipped items is inestimable without further assumptions. Such assumptions should be based on a scientifically justifiable understanding of the relationship between the screener item and the skipped item. For example, consider the following five questions: (i) have you had surgery for your cancer? (ii) how many surgeries have you had for your cancer? (iii) how involved were you in the decision about surgery? (iv) have you had any problems with your self-care? (v) how satisfied are you with your communication with your doctor? Questions (ii) and (iii) are skipped for patients who have had no surgery yet, i.e., those answering NO to (i). For these patients, we deterministically impute 0 for question (ii) before the other imputation calculations. In effect, we regarded items (i) and (ii) as together eliciting a response for the number of surgeries received. If we treated these skips as missing data and imputed positive values, ignoring the relationship with (i), it would bias imputed responses to question (iv) for the skipped patients if patients receiving more surgery have more problems with self-care. On the other hand, relationships of patient characteristics to satisfaction with doctor communication might be mediated in part through involvement in decisions about surgery. If we removed this mediated effect by deterministically imputing a single value to item (iii) for patients having no surgery, we would not accurately reproduce the marginal relationship underlying prediction of missing values of item (v) from patient characteristics.
While we could posit that this relationship is different for those who did not have surgery, it would be impractical with limited data and analytic resources to investigate all such interactions. Instead, we treat responses to (iii) that are missing due to skips as missing data and impute them. After imputation, these imputed values are restored to skips in the analytic datasets. While a few imputations were deterministic as for item (ii), most of the skipped items were handled similarly to item (iii).
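The two ways of handling skipped items, as for items (ii) and (iii), can be sketched as follows. This is only an illustration under assumed item names and codes (`had_surgery`, `n_surgeries`, `decision_involvement`, and the `SKIP` marker are hypothetical), not the program used in the study.

```python
SKIP = "SKIP"    # item not asked because of the screener response
MISSING = None   # value to be filled in by the imputation model


def prepare_for_imputation(row):
    """Recode skips before imputation; returns (recoded row, list of skipped items).

    Hypothetical item names: had_surgery (i), n_surgeries (ii),
    decision_involvement (iii).
    """
    row = dict(row)
    skipped = []
    if row["had_surgery"] == "no":
        # Item (ii): a logical zero, imputed deterministically.
        row["n_surgeries"] = 0
        # Item (iii): treated as missing so that the model can impute it.
        if row["decision_involvement"] == SKIP:
            row["decision_involvement"] = MISSING
            skipped.append("decision_involvement")
    return row, skipped


def restore_skips(row, skipped):
    """After imputation, put the skip code back for the analytic dataset."""
    row = dict(row)
    for item in skipped:
        row[item] = SKIP
    return row


raw = {"had_surgery": "no", "n_surgeries": SKIP, "decision_involvement": SKIP}
prepared, skipped = prepare_for_imputation(raw)
# Placeholder for the model-based imputation of item (iii).
imputed = dict(prepared, decision_involvement=3)
final = restore_skips(imputed, skipped)
```

Keeping a record of which values were originally skips is what allows the imputed values to inform other imputations while still being excluded from the released analytic data.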
4.3 Specifying imputation models
The flexibility of SRMI allows imputing variables of various distributional types. We classified each variable as “categorical”, “continuous”, “mixed”, or “transferred”, following IVEware’s syntax rules; examples of these are shown in Table 2. The variables involved in the imputation modeling (i.e., those which are imputed and/or act as predictors) belong to the first three categories (Section 3.2), while the ones excluded from the imputation process are “transferred”. “Categorical” variables are nominal variables whose response levels do not have an obvious ordering. “Continuous” variables include both truly continuous variables (in the original or transformed scale) and ordinal ones. We treated the latter as “continuous” because IVEware has no option for directly modeling ordinal variables. Treating them as “categorical” would lose the ordering of responses, and it can be difficult to fit the general logit model for ordinal variables with more than a few response categories. The distribution of a “mixed” (semi-continuous) variable consists of a point mass at zero and a continuous positive part. For “continuous” variables, we forced the draws of imputations to fall within the ranges of the observed data, using the “bounds” option in IVEware. For ordinal variables, we rounded the fractional imputed numbers to the nearest integer after imputation to make them consistent with the original data formats.
| Table 2. Examples of variable types in IVEware |
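The effect of bounding and rounding imputed values for an ordinal item treated as “continuous” can be illustrated with the Python sketch below. It redraws from a normal distribution until the draw falls within the observed range, which is one simple way to respect bounds; how IVEware enforces its “bounds” option internally is not described here, and the predictive mean and standard deviation are made-up inputs.

```python
import math
import random


def draw_within_bounds(mean, sd, observed, rng, max_tries=1000):
    """Draw from N(mean, sd) restricted to the range of the observed values.

    Redrawing until the value is in range is an assumed mechanism used for
    illustration; it is not taken from the IVEware documentation.
    """
    lo, hi = min(observed), max(observed)
    for _ in range(max_tries):
        x = rng.gauss(mean, sd)
        if lo <= x <= hi:
            return x
    # Fallback if no draw landed in range: clamp the mean itself.
    return min(hi, max(lo, mean))


def round_ordinal(x):
    """Round half up to the nearest integer (Python's round() rounds half to even)."""
    return math.floor(x + 0.5)


rng = random.Random(2024)
observed_scores = [1, 2, 2, 3, 4, 5]   # e.g., a 1-5 ordinal item
draws = [draw_within_bounds(4.6, 1.0, observed_scores, rng) for _ in range(200)]
imputed_scores = [round_ordinal(d) for d in draws]
```

Because every draw lies within the observed range before rounding, the rounded values also stay on the original response scale.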
Due to the large number of variables included in the datasets, it is virtually impossible to include all of them in each prediction equation of SRMI. IVEware has options for automatic model selection. First, users can set thresholds for the minimum marginal R² increment in the stepwise selection, meaning that a variable will only be selected if the increase in R² (ΔR²) is greater than the threshold. With a smaller ΔR² criterion, more predictors will be selected, but the models will be more complex, producing less stable fits and slowing the imputation computations.
The IVEware documentation does not provide guidance for setting the ΔR² criterion. We tested different values (i.e., ΔR² = 0.1, 0.01, and 0.001) and compared the corresponding model selection and fitting results. In our data, setting ΔR² = 0.1 tended to select a very small number (1 or 2) of predictors for most of the missing outcomes, and therefore may underidentify important predictors in the imputation. On the other hand, setting ΔR² = 0.001 selected many more predictors, but many of the regression coefficient estimates were rather extreme (e.g., violating the rule of thumb that logistic regression coefficients should be within the interval (−5, 5)),27 had large standard errors, and were unstable across iterations. Some of the large logistic coefficients indicated likely deterministic relationships between variables, such as speaking Mandarin at home and not being of Latino or Hispanic origin, and thus helped us to identify items that should be imputed deterministically. Apart from these cases, large logistic coefficients could be due to overfitting models to small samples. Model selection with ΔR² = 0.01 appeared to yield more reasonable results than either of the above alternatives and hence was used in our implementation. In addition, we limited the maximum number of predictors for the variables with a small number of observed cases to prevent overfitting models for these variables, using the “maxpred” option, which we set equal to 1 for variables with fewer than 50 observed values.
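To make the role of the ΔR² threshold and the predictor cap concrete, the following Python sketch implements a generic forward selection for a linear regression. It is not IVEware's algorithm, whose internals are not described here; the function names, the toy data, and the use of ordinary least squares are all assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 of an ordinary least squares fit of y on an intercept plus `columns`."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - ss_res / ss_tot


def forward_select(y, candidates, min_delta_r2=0.01, max_pred=None):
    """Add the best remaining predictor while its R^2 gain exceeds the threshold."""
    selected, current = [], 0.0
    while max_pred is None or len(selected) < max_pred:
        best_name, best_r2 = None, current
        for name, column in candidates.items():
            if name in selected:
                continue
            try:
                r2 = r_squared(y, [candidates[s] for s in selected] + [column])
            except ValueError:
                continue  # skip predictors collinear with those already chosen
            if r2 > best_r2:
                best_name, best_r2 = name, r2
        if best_name is None or best_r2 - current <= min_delta_r2:
            break
        selected.append(best_name)
        current = best_r2
    return selected


def max_pred_for(n_observed, cutoff=50):
    """Cap at one predictor when few values are observed (the rule described above)."""
    return 1 if n_observed < cutoff else None


# Toy data: three mutually orthogonal predictors with strong, modest, and weak effects.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
x3 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
y = [a + 0.5 * b + 0.1 * c for a, b, c in zip(x1, x2, x3)]
candidates = {"x1": x1, "x2": x2, "x3": x3}
by_threshold = {t: forward_select(y, candidates, min_delta_r2=t) for t in (0.1, 0.01, 0.001)}
```

In this toy example, tightening the threshold from 0.1 to 0.001 admits progressively weaker predictors, which mirrors the trade-off between underidentifying predictors and overfitting described above.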
We examined the selected predictors for missing outcomes and the associated regression coefficients at different iterations (i.e., 3, 5, 10, 15, and 20) of the Gibbs-like chain and found that most of them remained similar after running the program for several (3 to 5) iterations. We also applied some simple analyses using multiply imputed datasets obtained from those iterations, and the results were close to each other. In addition, we applied SRMI to only those variables that were involved in these simple analyses and collected imputations after a long run of the Gibbs-like chain (e.g., several thousand iterations). The corresponding analysis results were similar to those obtained from the imputations of the whole dataset after a few iterations. These results provided some evidence for the plausibility of the imputed values of the latter. The multiple imputations released to the consortium investigators were obtained from running the program with different initial seeds for 5 iterations.
4.4 Simple diagnostics
There are few diagnostic tools for imputation modeling, especially for large-scale datasets. However, we performed some simple ad-hoc diagnostics to obtain an overall summary of the performance of multiple imputation. An example of model assessment targeted to specific analyses appears in Section 5.2. We estimated the marginal means and pairwise correlation coefficients of all continuous and binary variables after applying SRMI, and compared them with those obtained using AC. In general, the differences in point estimates can go in either direction. On the other hand, standard errors for the means from multiple imputation are not expected to be significantly larger than those from AC, while multiple-imputation standard errors for the correlation coefficients are most likely to be smaller than those from AC because using AC removes cases with either variable missing. Table 3 summarizes the distribution of the standard error differences across all means and correlations for the two cancer types. For means, SRMI reduced the standard error for about 50% of the variables compared to AC, whereas for correlations, SRMI reduced the standard error for about 75% of the pairs of variables. Overall, the summary results met our expectations.
| Table 3. Relative differences (%) in standard error estimates |
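A comparison of this kind can be sketched in Python using the standard multiple-imputation combining rules (Rubin's rules), in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The definition of the relative difference as 100 × (SE_MI − SE_AC) / SE_AC is an assumption made for illustration, and the numbers below are invented.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


def relative_se_difference(se_mi, se_ac):
    """Percent difference of the MI standard error relative to the AC one (assumed definition)."""
    return 100.0 * (se_mi - se_ac) / se_ac


# Invented example: five imputations of a mean, and an available-case SE of 1.25.
estimates = [10, 11, 12, 11, 11]
variances = [0.4, 0.4, 0.4, 0.4, 0.4]
pooled, se_mi = rubin_combine(estimates, variances)
diff_pct = relative_se_difference(se_mi, se_ac=1.25)
```

A negative value indicates that multiple imputation reduced the standard error relative to AC; a large between-imputation variance pushes the value upward, which is the pattern seen for the variables discussed next.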
The comparison also helped us to identify a handful of variables for which using SRMI led to substantial increases in variance estimates of either the mean or correlation. Further examination showed that they generally fall into two classes.
- Categorical variables with a very unbalanced distribution of classes (e.g., a 1/0 binary variable of which over 95% of the cases are 0). For these variables, the data are often too sparse to support conventional maximum likelihood estimation of the logistic regression model used in IVEware. As a result, the regression estimates are often unstable, introducing large between-imputation variability, and hence greatly increase the total variance. A solution might be to use a fully Bayesian logistic imputation approach which uses an informative prior distribution for the regression parameters and draws from the exact posterior distribution.27,28
- Variables that need to be massively imputed because they are not used in all survey forms but need to be used in analyses involving data from multiple surveys. These imputations tend to be more sensitive to model misspecifications.
There appears to be no easy one-step solution for improving the imputations for all these variables following the generic SRMI scheme. Instead, our strategy is to document these “troubling” variables and to handle them in an analysis-specific manner. That is, if such variables are involved in a specific analysis and the use of the generic SRMI causes any severe problem, as noticed and reported by the analysts, we would use a more tailored imputation model for them within the context of that analysis. This requires communication and collaboration between the imputer and analysts. We believe such a strategy is especially important for imputing large-scale data, since practitioners often need to balance model validity against feasibility to meet the timeline of the whole project.
4.5 Processing and releasing multiply imputed datasets
The raw imputations were further processed before release, including creation of some composite variables (e.g., SF-12 scales) and merging of imputations with data from other sources. The released datasets include five completed sets and the original incomplete one, as well as the imputation documentation for easy checking and comparison.
The imputations are updated periodically as raw data collection and processing are updated, as is typical in multi-site cohort studies. In addition, the analysts send the imputer the questions and input that arise in performing specific imputation analyses, and these are considered in updates of the imputation program.