Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
Despite implementation of standardized data collection forms, missing data are ubiquitous in clinical epidemiological research. Missing data occur in various data sources (databases, medical records, and patient reported data), study designs, data collection methods (paper-based and online registration forms), registration time (eg, pretreatment and posttreatment), and registration frequency (eg, one postoperative outcome measurement and several follow-up measurements). Missing data can occur for multiple reasons – loss to follow-up, failure to attend medical appointments, lack of measurements, failure to send or retrieve questionnaires, and inaccurate transfer of data from paper registration to an electronic database.1
Individuals with missing data may differ from those with complete data in terms of the outcome of interest and prognosis in general. For example, those who are healthier may be less likely to visit their doctor and hence less likely to have blood pressure recorded. Studies on self-reported data show that individuals who have missing data on one variable are often likely also to have missing data on other variables. Our previous research demonstrated that patients with missing data on smoking often have missing data on other lifestyle variables.2 Missing data can constitute considerable challenges in the analyses and interpretation of results and potentially weaken the validity of results and conclusions.3 Missing data are problematic because of the risk of bias, which depends on the type of missing data, the extent of the data that are missing, and the way of dealing with missing data in the analyses.4
The overall aim of this paper is to provide clinical epidemiological researchers with insights into missing data. The specific aims of this paper are to: 1) describe methods often used for dealing with missing data in the analytic phase and highlight their shortfalls; 2) introduce multiple imputation as an alternative method, highlighting its advantages over “traditional” methods; and 3) discuss reporting of the results from multiple imputation analyses.
When individuals with missing data are a random subset of the study population, the probability of being missing is the same for all cases; missing data are denoted as MCAR.7 An example of MCAR is when a glass slide with biopsy material from a patient is accidentally broken such that pathology and histology tests cannot be performed, or when individuals had no blood pressure measured as the equipment was broken. Thus, under MCAR, missing data do not depend on either observed data or unobserved data.
In contrast to MCAR, the term MAR is counterintuitive. MAR occurs when the missingness depends on information we have already observed.7 For example, data in a depression survey can be said to be MAR given gender if men are less likely than women to fill out the survey: once gender is accounted for, the missingness does not depend on the level of their depression. Another example of MAR is when, in a study of weight, data on weight are less likely to be recorded for younger individuals, because they do not attend health care facilities as often as older individuals.
When the probability that data are missing depends on the unobserved data, such as the value of the observation itself, then the missing data are denoted as MNAR.7 For example, overweight or underweight individuals may be more likely to have their weight measured than individuals with normal weight, even after age is accounted for. Thus, the reason for missingness is related to unobserved characteristics of the individual, and thereby, data are MNAR. Another example is when individuals with severe depression, or adverse effects from antidepressant medication, are more or less likely to complete a survey on depression. A third example is when data on income are missing, and the probability of missingness is related to the level of income, eg, those with very low or high income refuse to report their income.
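The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the population (age and weight), the missingness probabilities, and the function name `simulate_missingness` are all assumptions chosen for the example, not values from any study. It compares the mean weight among individuals with weight observed against the mean in the full population.

```python
import random
import statistics

def simulate_missingness(n=20000, seed=1):
    """Mean weight among observed values under MCAR, MAR, and MNAR."""
    rng = random.Random(seed)
    # Hypothetical population: weight (kg) rises with age (years).
    ages = [rng.uniform(20, 80) for _ in range(n)]
    weights = [60 + 0.3 * (a - 20) + rng.gauss(0, 10) for a in ages]

    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    for age, weight in zip(ages, weights):
        # MCAR: every value has the same 30% chance of being missing.
        if rng.random() >= 0.30:
            observed["MCAR"].append(weight)
        # MAR: missingness depends only on observed age (younger, more missing).
        p_mar = 0.50 if age < 50 else 0.10
        if rng.random() >= p_mar:
            observed["MAR"].append(weight)
        # MNAR: missingness depends on the (unobserved) weight itself.
        p_mnar = 0.50 if weight > 75 else 0.10
        if rng.random() >= p_mnar:
            observed["MNAR"].append(weight)

    result = {"full": statistics.fmean(weights)}
    for name, values in observed.items():
        result[name] = statistics.fmean(values)
    return result

means = simulate_missingness()
for name, value in means.items():
    print(f"{name}: {value:.2f}")
```

In this setup, the MCAR mean should sit close to the full-data mean, whereas the unadjusted MAR and MNAR means drift away from it. The difference between the latter two is that the MAR shift can be accounted for using the observed age, while the MNAR shift cannot be corrected from the observed data alone.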
For the most part, in clinical epidemiological research, missing values are neither MCAR nor MNAR but MAR.7 Observed data can give us some indication of whether missing data are MCAR,8 but we are not able, from these data alone or from a simple test, to evaluate whether missing data are MAR or MNAR.7 By tabulating the characteristics of individuals with missing data against those without, we can evaluate whether missingness is likely to depend on these characteristics. We illustrate this in an example where a number of individuals are lacking body mass index (BMI) measurements (Table 1). In this example, we can see that the proportion of individuals with BMI observed is higher among smokers than among nonsmokers. Similarly, the proportion of individuals with BMI observed is higher among patients with known comorbidity prior to surgery than among those without known comorbidity. Thus, we can conclude that the data are not MCAR.
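A tabulation of this kind is simple to produce. The following is a minimal sketch using made-up records (not the data behind Table 1); the helper name `proportion_observed` and the field names are hypothetical.

```python
from collections import defaultdict

def proportion_observed(records, group_key, target_key):
    """Proportion of records with target_key observed (not None), per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [observed, total]
    for rec in records:
        counts[rec[group_key]][1] += 1
        if rec[target_key] is not None:
            counts[rec[group_key]][0] += 1
    return {group: obs / total for group, (obs, total) in counts.items()}

# Hypothetical toy records; None marks a missing BMI measurement.
records = [
    {"smoker": True, "bmi": 24.1},
    {"smoker": True, "bmi": 31.0},
    {"smoker": True, "bmi": 27.5},
    {"smoker": True, "bmi": None},
    {"smoker": False, "bmi": 22.3},
    {"smoker": False, "bmi": None},
    {"smoker": False, "bmi": None},
    {"smoker": False, "bmi": 29.8},
]
by_smoking = proportion_observed(records, "smoker", "bmi")
print(by_smoking)
```

A clear difference in the proportion observed between groups, as in these toy records, speaks against MCAR; similar proportions do not by themselves prove MCAR.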
Graph theories have been helpful in a number of disciplines in the fields of mathematics, engineering, computer science, and biology to determine or evaluate the mechanisms of missingness.9 In epidemiological research, causal graphical models, such as directed acyclic graphs (DAGs), can be used to determine whether data are MAR, MNAR or MCAR, thereby informing the most appropriate analytic method to deal with missing values.
There are many ways to minimize the extent of missing data. It may be helpful to incorporate standardized rules to optimize data collection, such as training staff to collect and coordinate data collection, using well-defined data definitions, and incorporating logic and range checks for each data element. Pilot studies can help to identify variables particularly susceptible to missing values, and steps can be taken to improve completeness.10 Regular monitoring of data quality and completeness provides essential feedback to clinicians and researchers on the extent of missing data.11 Furthermore, when collecting information about the quality of life or other sensitive issues, patients may be asked to provide reasons for refusing to participate, such as a lack of time, problems understanding language, or lengthy or too intimate questionnaires. This information can be used in the analyses of data and interpretation of the results.
Several statistical approaches have been developed for dealing with missing data (Table 2). The most common methods can be classified into one of the following groups: 1) complete-case analyses, 2) missing indicator method, 3) single value imputation, and 4) sensitivity analyses incorporating worst-case and best-case scenarios. An alternative method of dealing with missing data in the analytic phase is multiple imputation.12,13 Alternatives to multiple imputation include likelihood-based approaches and probability weighting;3 however, they are not the focus of this paper.
Complete-case analysis is the most widely used method to deal with missing data.13 This method, also known as “list-wise deletion”, involves excluding individuals with missing data from the analyses. It is popular because it is easy to implement and it is the default option in most statistical packages. However, such analyses may yield biased estimates of associations, because complete cases are assumed to be a random sample of the whole population, ie, data are MCAR. That is not always the case, as individuals with complete data are often different from those with missing data, and missingness can depend on either observed data or unobserved data. By comparing data in a UK Primary Care Database with a population survey, Marston et al1 showed that the distributions of alcohol consumption and smoking were different in the two data sources. This may suggest that data on these two variables are not MCAR. Complete-case analyses in this case may have serious consequences if the aim of a future study is to investigate an association between alcohol and postoperative complications. Another issue with complete-case analysis is that a large proportion of valuable research data are discarded, which affects the statistical power and precision of the estimates. In some cases, it may be reasonable to use complete-case analyses, such as when working with large datasets with few missing observations, because the risk of bias is minimal and the precision is still good.14
Under the missing indicator method, missing values are not imputed. Instead, for incomplete categorical variable(s), missing data are grouped into an additional “missing” category; in the aforementioned example, BMI could be categorized as underweight (BMI <18.5 kg/m²), normal (BMI 18.5–24.9 kg/m²), overweight (BMI 25–29.9 kg/m²), obese (BMI ≥30 kg/m²), and missing. For incomplete continuous variables, missing values are set to a fixed value (usually zero), and an extra indicator or dummy (1/0) variable is added to the main analytic model to indicate whether the value for that variable is missing. The method is popular because it retains the full dataset, with no observations excluded. However, even under the MCAR assumption and with very few missing observations, this method is still subject to bias.12 If the method is used for missing data on potential confounder variables, the estimates will be biased due to residual confounding. Figure 1 illustrates an example of a linear relationship between BMI categories and the outcome in a full dataset (on the left) and how the inclusion of a missing BMI data category biases the relationship between BMI and the outcome (on the right).
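To make the construction concrete, the recoding used by the missing indicator method can be sketched as follows. The function names are hypothetical, and the sketch only shows how the variables are built; it does not endorse the method, which remains subject to the bias described above.

```python
def bmi_category(bmi):
    """Categorical version: missing BMI becomes its own category."""
    if bmi is None:
        return "missing"
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def missing_indicator(value, fill=0.0):
    """Continuous version: return (filled value, 1/0 missingness dummy)."""
    return (fill, 1) if value is None else (value, 0)

bmi_values = [17.9, 23.4, None, 27.0, 33.2]
categories = [bmi_category(b) for b in bmi_values]
filled = [missing_indicator(b) for b in bmi_values]
print(categories)
print(filled)
```

Both the filled value and the dummy would then enter the analytic model as covariates.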
Under single value imputation, missing data are replaced by a single value, such as the mean score of the complete cases in the study sample (ie, mean imputation).13 For example, missing BMI values can be replaced with the sample mean BMI value calculated from individuals with observed BMI (Figures 2 and 3). Figures 2 and 3 illustrate normally distributed BMI values in a full dataset and how normally distributed data can be distorted in a dataset where 35% missing BMI values are replaced with the observed mean BMI value. In longitudinal studies where some variables are measured repeatedly, for example, yearly controls of glycated hemoglobin (HbA1c), the “last observation carried forward” approach can be used where missing values are replaced with the most recently observed value for a given variable. Another single imputation approach is regression-based single imputation of missing values (also known as predicted mean imputation), in which values of the missing observations are predicted using a regression model based on the complete cases.
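As an illustration of last observation carried forward, a minimal sketch for a single patient's repeated measurements might look like this. The function name `locf` and the HbA1c values are hypothetical, and the choice to leave a missing first measurement unfilled is an assumption of this sketch.

```python
def locf(series):
    """Replace each None with the most recent observed value.

    A missing value before the first observation stays None, because
    there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical yearly HbA1c measurements (%); None marks a missed visit.
hba1c = [7.2, None, 6.9, None, None, 7.5]
print(locf(hba1c))  # [7.2, 7.2, 6.9, 6.9, 6.9, 7.5]
```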
In general, single imputation methods do not account for the uncertainty of missing data, and as a result, standard errors of the estimates are likely to be too small (thereby overestimating the precision of the results). This can potentially lead to a Type 1 error (ie, identifying an association when none exists).12 Mean imputation also does not preserve the relationships between variables; it only preserves the mean of the observed data. Therefore, if the data are MCAR, the estimate of the mean remains unbiased.4,12 Under MCAR, if our aim is to estimate means (which is rarely the main focus of research studies), mean imputation will not bias the estimates; it will only bias the standard errors as mentioned previously. Since most research studies are interested in the relationship between variables and not just the mean, mean imputation should be avoided in general. It has previously been pointed out that the last observation carried forward method can produce biased estimates in either direction even under MCAR, and its use as the first or only choice for handling missing data has been warned against.3
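The effect of mean imputation on the spread of the data can be checked with a short simulation. This is a sketch under stated assumptions: BMI is drawn from a normal distribution with an arbitrarily chosen mean and standard deviation, and about 35% of values are removed completely at random.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical full data: normally distributed BMI values.
full_bmi = [rng.gauss(26, 4) for _ in range(5000)]

# Remove about 35% of values completely at random (MCAR).
bmi = [None if rng.random() < 0.35 else b for b in full_bmi]

observed = [b for b in bmi if b is not None]
observed_mean = statistics.fmean(observed)

# Mean imputation: every missing value gets the observed mean.
imputed = [observed_mean if b is None else b for b in bmi]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)
print(f"mean observed {observed_mean:.2f}, mean after imputation "
      f"{statistics.fmean(imputed):.2f}")
print(f"SD observed {sd_observed:.2f}, SD after imputation {sd_imputed:.2f}")
```

The mean is unchanged by construction, but the standard deviation of the imputed variable is noticeably smaller than that of the observed values, which is why standard errors computed from the imputed data are too small.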
Sensitivity analysis with worst-case and best-case scenarios involves the replacement of missing values with the worst or best value in the observed data.15 For example, analyses can be performed by replacing missing data with the highest or lowest observed value and running regression models afterward in order to examine the association of interest. The results of these two regression analyses can then be compared. When both analyses produce similar estimates of an association, it is rather straightforward to draw conclusions about the effect of missing data. However, analyses yielding opposing results can be difficult to interpret. If we have information on exposure but lack outcome data on some patients, we can replace missing data with the worst case (eg, death at the end of follow-up) or best case (patient is alive at the end of follow-up) and compare the results afterward. The usual procedure in smoking cessation studies is to assume that nonrespondents (missing smoking data) have resumed smoking.16 Thus, the data are analyzed as if all nonrespondents have returned to active smoking, which might not be a correct assumption. Barnes et al16 showed in a simulation study that this method yields biased estimates.
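For a binary outcome such as death at the end of follow-up, the two scenarios bound the estimated risk. A minimal sketch, with a hypothetical function name and made-up outcome data:

```python
def best_worst_case(outcomes):
    """Risk of the event under best-case and worst-case replacement.

    outcomes: list of 1 (event, eg, death), 0 (no event), or None (missing).
    Returns (best_case_risk, worst_case_risk).
    """
    n = len(outcomes)
    events = sum(1 for o in outcomes if o == 1)
    missing = sum(1 for o in outcomes if o is None)
    best = events / n                # all missing assumed event-free
    worst = (events + missing) / n   # all missing assumed to have the event
    return best, worst

# Hypothetical follow-up data for ten patients.
outcomes = [1, 0, 0, None, 1, None, 0, 0, 0, 0]
best, worst = best_worst_case(outcomes)
print(best, worst)  # 0.2 0.4
```

A narrow interval between the two values suggests that missing outcomes cannot change the conclusion much; a wide interval signals that the result is sensitive to what happened to the patients with missing data.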
Multiple imputation4,5,17 solves the problem of “too small or too large” standard errors obtained using traditional methods of dealing with missing data presented in Table 2. The aim of multiple imputation is to provide unbiased and valid estimates of associations based on information from the available data, ie, yielding estimates similar to those calculated from full data.3 Missing data, and hence multiple imputation, may affect not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
Multiple imputation is widely recognized as the standard method to deal with missing data in many areas of research, and the method has become more popular with the increasing availability of software. A full description of multiple imputation is beyond the scope of this paper, but we provide a brief overview of its assumptions, implementation, and methodologies. More detailed description of the statistical theory of multiple imputation is provided by Rubin,18 Carpenter and Kenward,19 and Buuren.3
The multiple imputation procedure in most statistical software builds on the MAR assumption,20 but the method can handle both MCAR and MNAR.3 Although we cannot prove whether data are MAR, it is likely that in many situations, the MAR assumption is more plausible when more variables are included in the multiple imputation model.21,22
A statistical analysis using multiple imputation typically comprises three major stages.
In the first stage, we select independent variables that may help to impute variables with missing data (Figure 4). This should include all variables that are in the subsequent analysis model (exposures, covariates, and outcome). In addition, we may want to include variables that help make the MAR assumption plausible; the so-called auxiliary variables. Including these variables may reduce bias and improve the precision of the estimates.
Then, we create multiple imputed datasets where the individual data may vary between datasets (Figure 5). Missing values in each dataset are drawn from the distribution of the missing data given in the observed data.18 As an example, the imputed values generated in the five imputed datasets for BMI are listed in Table 3. The table shows a variation of imputed values between imputed datasets and also between patients, reflecting the fact that we will never know what the “true” value was.
In the second stage, the association of interest is estimated in each of the imputed datasets using the chosen statistical method (eg, logistic regression) (Figure 5). Thus, coefficient estimates with corresponding standard errors can be calculated as a measure of association in each imputed dataset. There is variability both within and between the imputed datasets because of the uncertainty related to missing values.18
In the third stage, measures of association from each imputed dataset are combined by Rubin’s rules, with the corresponding standard errors (and hence the confidence intervals [CIs]) accounting for both the between- and within-imputation variations (Figure 5).19,23
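The pooling step can be written in a few lines. The sketch below applies Rubin's rules to per-dataset estimates and standard errors: the pooled estimate is the mean of the estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the example numbers are hypothetical; for odds ratios, pooling is done on the log scale, as assumed here.

```python
import math
import statistics

def pool_rubin(estimates, std_errors):
    """Pool m estimates and their standard errors using Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching SEs")
    pooled = statistics.fmean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.fmean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical log odds ratios and standard errors from m = 5 imputed datasets.
log_or = [-0.022, -0.020, -0.025, -0.019, -0.023]
se = [0.0098, 0.0097, 0.0099, 0.0098, 0.0096]
pooled_log_or, pooled_se = pool_rubin(log_or, se)

odds_ratio = math.exp(pooled_log_or)
ci_low = math.exp(pooled_log_or - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_or + 1.96 * pooled_se)
print(f"OR {odds_ratio:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

The normal-approximation interval using 1.96 is a simplification in this sketch; statistical software typically uses a t reference distribution with degrees of freedom that depend on m and on the ratio of between- to within-imputation variance.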
Multiple imputation algorithms are implemented in all major statistical software (eg, SPSS, Stata, SAS, and R), which contain many detailed examples and step-by-step tutorials on both univariate and multivariate multiple imputations.3,24,25
Which variables should be included in the multiple imputation model?
As emphasized earlier, all variables used in the subsequent analytic model need to be included in the imputation model (Figure 4). In addition, we can increase the precision and minimize the bias by including auxiliary variables in the imputation model. For auxiliary variables to have an impact, they need to fulfill one of the following criteria: 1) the auxiliary variable should be associated with the values of the incomplete variables, or 2) the auxiliary variable should be associated with both the values of the incomplete variables and the likelihood of the data being missing. Auxiliary variables that are strongly associated with both the value and the missingness are more likely to have an impact on the results of multiple imputation and reduce bias.19 Based on our knowledge of the data, research question, or literature, we may know a priori which variables are likely to make good auxiliary variables. If we are not sure, these relationships can be identified by setting up 1) a logistic regression model with the missingness (as 0 or 1) being the outcome and auxiliary variables being the explanatory variables, or 2) a regression model with the incomplete variable as the outcome and auxiliary variables again as explanatory variables. In situations with many variables, multiple outcomes of interest, or large datasets, White et al23 suggested running a small number of imputations (or even a single imputation) and then exploring the associations within that dataset to select variables. In some cases, multiple imputation may provide similar results to complete-case analysis, but we will not know this beforehand. The similarity can occur due to the lack of predictive covariates in the imputation model.
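A quick first screen of a candidate auxiliary variable can be sketched as below. Note the substitution: instead of the regression models described above, this sketch uses plain Pearson correlations as a simpler stand-in, which is an assumption of the example and not the procedure recommended in the text. The function names, the choice of waist circumference as the auxiliary variable, and the data are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_auxiliary(incomplete, auxiliary):
    """Return (r_missing, r_value) for one fully observed auxiliary variable.

    r_missing: correlation of the auxiliary variable with the 0/1
               missingness indicator of the incomplete variable.
    r_value:   correlation of the auxiliary variable with the incomplete
               variable among individuals where it is observed.
    """
    missing_flag = [1 if v is None else 0 for v in incomplete]
    pairs = [(v, a) for v, a in zip(incomplete, auxiliary) if v is not None]
    r_missing = pearson(missing_flag, auxiliary)
    r_value = pearson([v for v, _ in pairs], [a for _, a in pairs])
    return r_missing, r_value

# Hypothetical toy data: incomplete BMI and complete waist circumference (cm).
bmi = [22.0, None, 27.5, 31.0, None, 24.0, 29.0, None]
waist = [80, 78, 95, 108, 75, 85, 101, 79]
r_missing, r_value = screen_auxiliary(bmi, waist)
print(f"r with missingness {r_missing:.2f}, r with observed BMI {r_value:.2f}")
```

A variable that shows a clear association on both measures would be a natural candidate for the imputation model; the regression models described in the text give a more complete picture because they can adjust for other variables.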
Traditionally, it has been suggested that three to five imputed datasets are sufficient.3,26 The argument was that even with 50% missing information, five imputed datasets would produce point estimates that are 91% as efficient as those based on an infinite number of imputations.26 However, Graham et al27 showed that the statistical power and precision of estimates can be improved by creating many more imputed datasets, depending on the amount of missing information and the tolerance for the loss of power. Later, Bodner28 and White et al23 suggested a rule of thumb to improve the reproducibility of the results in practice: the number of imputations should be similar to the percentage of incomplete cases. Buuren3 suggested a compromise solution, using five imputations for model building in the initial phase and increasing the number of imputations to the average percentage of missing data in the final phase of the analyses.
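The 91% figure follows from Rubin's relative efficiency formula, 1 / (1 + γ/m), where γ is the fraction of missing information and m is the number of imputations. A short check (the function name is hypothetical):

```python
def relative_efficiency(fraction_missing_info, m):
    """Efficiency of m imputations relative to infinitely many imputations."""
    return 1 / (1 + fraction_missing_info / m)

for m in (3, 5, 20, 50):
    print(m, round(relative_efficiency(0.5, m), 3))
```

With γ = 0.5 and m = 5 this gives 1/1.1, or about 0.91. The formula concerns the efficiency of the point estimate only, which is why later authors, concerned with power and reproducibility, argued for more imputations.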
After reviewing 59 papers that used multiple imputation, published in general medical journals from 2002 to 2007, Sterne et al4 suggested guidelines for reporting such analyses, extending the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.29 The guidelines suggest reporting the results of both complete-case and multiple imputation methods, if possible, and particularly where there are differences in the results. Furthermore, the guidelines suggest reporting the extent of missing data, the reasons for missingness, the assumptions for the multiple imputation model, and the number of imputed datasets, and specifying the variables included in the multiple imputation model.
In this example, we evaluated the performances of complete-case analysis and multiple imputation and presented results in Table 4. This example, which resembles the association between the risk of blood transfusion within 7 days of hip fracture surgery in elderly patients and their BMI level at admission to the hospital, uses a dataset of 3,500 patients with no missing data. The model of interest is a logistic regression model of the odds of having blood transfusion (binary outcome – no/yes) conditional on patients’ BMI level (continuous exposure), adjusted for patients’ gender (binary variable – female/male), and age (binary variable – <75 or ≥ 75 years).
First, the model of interest is fitted to this dataset, referred to as “full data”, and parameter estimates (odds ratios) and associated 95% CIs and standard errors are recorded. Second, data in BMI are made MAR conditional on the outcome, gender, and age, using a missingness mechanism, which results in 767 patients (22%) with missing BMI values. Missing data in BMI are then handled using complete-case analysis and multiple imputation, and parameter estimates and associated standard errors are also recorded and compared with the full data results. Multiple imputation is performed using m=5 and m=30 imputed datasets, and the imputation model for BMI includes all variables in the model of interest (outcome, age, and gender). The odds ratio estimate for BMI under complete-case analysis is similar to the corresponding value in the full data (0.978 and 0.980, respectively), with comparable standard errors (0.0098 and 0.0085, respectively). Multiple imputation using 5 and 30 imputations produced similar results for BMI. Parameter estimates for other variables under complete-case analysis are biased in comparison to the fully observed data, with generally higher standard errors. While the significance of gender is detected in the fully observed data and under multiple imputation, the effect of gender is apparently disguised by the missing data in complete cases due to the large bias in the point estimate, which leads to a Type 2 error. Overall, multiple imputation produces unbiased estimates and correct standard errors under the MAR assumption for BMI.
This paper provides insights into the types of missing data, traditional methods for dealing with them, and multiple imputation as an alternative method, including their shortfalls and advantages.
All authors contributed to the conception of the study, study design, and the discussion and interpretation of the results. ABP drafted and revised the article. All authors contributed to the manuscript for intellectual content and to drafting and critically revising the paper, gave final approval of the version to be published, and agree to be accountable for all aspects of the work.
The authors report no conflicts of interest in this work.