We are concerned with multiple imputation of the ratio of two variables, which is to be used as a covariate in a regression analysis. If the numerator and denominator are not missing simultaneously, it seems sensible to make use of the observed variable in the imputation model. One such strategy is to impute missing values for the numerator and denominator, or the log-transformed numerator and denominator, and then calculate the ratio of interest; we call this ‘passive’ imputation. Alternatively, missing ratio values might be imputed directly, with or without the numerator and/or the denominator in the imputation model; we call this ‘active’ imputation. In two motivating datasets, one involving body mass index as a covariate and the other involving the ratio of total to high-density lipoprotein cholesterol, we assess the sensitivity of results to the choice of imputation model and, as an alternative, explore fully Bayesian joint models for the outcome and incomplete ratio. Fully Bayesian approaches using Winbugs were unusable in both datasets because of computational problems. In our first dataset, multiple imputation results are similar regardless of the imputation model; in the second, results are sensitive to the choice of imputation model. Sensitivity depends strongly on the coefficient of variation of the ratio's denominator. A simulation study demonstrates that passive imputation without transformation is risky because it can lead to downward bias when the coefficient of variation of the ratio's denominator is larger than about 0.1. Active imputation or passive imputation after log-transformation is preferable. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons, Ltd.
missing data; multiple imputation; ratios; compatibility
Audits are often performed to assess the quality of clinical trial data, but beyond detecting fraud or sloppiness, the audit data is generally ignored. In earlier work using data from a non-randomized study, Shepherd and Yu (2011) developed statistical methods to incorporate audit results into study estimates, and demonstrated that audit data could be used to eliminate bias.
In this manuscript we examine the usefulness of audit-based error-correction methods in clinical trial settings where a continuous outcome is of primary interest.
We demonstrate the bias of multiple linear regression estimates in general settings with an outcome that may have errors and a set of covariates for which some may have errors and others, including treatment assignment, are recorded correctly for all subjects. We study this bias under different assumptions including independence between treatment assignment, covariates, and data errors (conceivable in a double-blinded randomized trial) and independence between treatment assignment and covariates but not data errors (possible in an unblinded randomized trial). We review moment-based estimators to incorporate the audit data and propose new multiple imputation estimators. The performance of estimators is studied in simulations.
When treatment is randomized and unrelated to data errors, estimates of the treatment effect using the original error-prone data (i.e., ignoring the audit results) are unbiased. In this setting, both moment and multiple imputation estimators incorporating audit data are more variable than standard analyses using the original data. In contrast, in settings where treatment is randomized but correlated with data errors and in settings where treatment is not randomized, standard treatment effect estimates will be biased. And in all settings, parameter estimates for the original, error-prone covariates will be biased. Treatment and covariate effect estimates can be corrected by incorporating audit data using either the multiple imputation or moment-based approaches. Bias, precision, and coverage of confidence intervals improve as the audit size increases.
The extent of bias and the performance of methods depend on the extent and nature of the error as well as the size of the audit. This work only considers methods for the linear model. Settings much different than those considered here need further study.
In randomized trials with continuous outcomes and treatment assignment independent of data errors, standard analyses of treatment effects will be unbiased and are recommended. However, if treatment assignment is correlated with data errors or other covariates, naive analyses may be biased. In these settings, and when covariate effects are of interest, approaches for incorporating audit results should be considered.
audit; bias; clinical trials; measurement error; multiple imputation
Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.
Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly-skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses.
For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate’s effect when skewness was not adequately addressed. For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure.
Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. Further work is needed to provide clearer guidelines on effective application of this method.
Missing data; Multiple imputation; Fully conditional specification; Multivariate normal imputation; Non-normal data
Multiple imputation (MI) is becoming increasingly popular as a strategy for handling missing data, but there is a scarcity of tools for checking the adequacy of imputation models. The Kolmogorov-Smirnov (KS) test has been identified as a potential diagnostic method for assessing whether the distribution of imputed data deviates substantially from that of the observed data. The aim of this study was to evaluate the performance of the KS test as an imputation diagnostic.
Using simulation, we examined whether the KS test could reliably identify departures from assumptions made in the imputation model. To do this we examined how the p-values from the KS test behaved when skewed and heavy-tailed data were imputed using a normal imputation model. We varied the amount of missing data, the missing data models and the amount of skewness, and evaluated the performance of KS test in diagnosing issues with the imputation models under these different scenarios.
The KS test was able to flag differences between the observations and imputed values; however, these differences did not always correspond to problems with MI inference for the regression parameter of interest. When there was a strong missing at random dependency, the KS p-values were very small, regardless of whether or not the MI estimates were biased; so that the KS test was not able to discriminate between imputed variables that required further investigation, and those that did not. The p-values were also sensitive to sample size and the proportion of missing data, adding to the challenge of interpreting the results from the KS test.
Given our study results, it is difficult to establish guidelines or recommendations for using the KS test as a diagnostic tool for MI. The investigation of other imputation diagnostics and their incorporation into statistical software are important areas for future research.
Missing data; Multiple imputation; Model checking; Kolmogorov-Smirnov test; Diagnostics; Simulations
Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration.
missing data; multiple imputation; meta-analysis; individual participant data; Rubin's rules
Multiple imputation is a commonly used method for handling incomplete covariates as it can provide valid inference when data are missing at random. This depends on being able to correctly specify the parametric model used to impute missing values, which may be difficult in many realistic settings. Imputation by predictive mean matching (PMM) borrows an observed value from a donor with a similar predictive mean; imputation by local residual draws (LRD) instead borrows the donor’s residual. Both methods relax some assumptions of parametric imputation, promising greater robustness when the imputation model is misspecified.
We review development of PMM and LRD and outline the various forms available, and aim to clarify some choices about how and when they should be used. We compare performance to fully parametric imputation in simulation studies, first when the imputation model is correctly specified and then when it is misspecified.
In using PMM or LRD we strongly caution against using a single donor, the default value in some implementations, and instead advocate sampling from a pool of around 10 donors. We also clarify which matching metric is best. Among the current MI software there are several poor implementations.
PMM and LRD may have a role for imputing covariates (i) which are not strongly associated with outcome, and (ii) when the imputation model is thought to be slightly but not grossly misspecified. Researchers should spend efforts on specifying the imputation model correctly, rather than expecting predictive mean matching or local residual draws to do the work.
Multiple imputation; Imputation model; Predictive mean matching; Local residual draws; Missing data
Nowadays, more and more clinical scales consisting in responses given by the patients to some items (Patient Reported Outcomes - PRO), are validated with models based on Item Response Theory, and more specifically, with a Rasch model. In the validation sample, presence of missing data is frequent. The aim of this paper is to compare sixteen methods for handling the missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared.
A simulation study was performed allowing to consider several cases, notably the possibility for the missing values to be informative or not and the rate of missing data.
Several imputations methods produce bias on psychometrical indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS) which is the most commonly used imputation method in practice.
Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his/her score), and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process allows reducing the bias. Last, the analysis realized without imputation of the missing data (available case analyses) is an interesting alternative to the simple imputation in this context.
Environmental epidemiology, when focused on the life course of exposure to a specific pollutant, requires historical exposure estimates that are difficult to obtain for the full time period due to gaps in the historical record, especially in earlier years. We show that these gaps can be filled by applying multiple imputation methods to a formal risk equation that incorporates lifetime exposure. We also address challenges that arise, including choice of imputation method, potential bias in regression coefficients, and uncertainty in age-at-exposure sensitivities.
During time periods when parameters needed in the risk equation are missing for an individual, the parameters are filled by an imputation model using group level information or interpolation. A random component is added to match the variance found in the estimates for study subjects not needing imputation. The process is repeated to obtain multiple data sets, whose regressions against health data can be combined statistically to develop confidence limits using Rubin’s rules to account for the uncertainty introduced by the imputations. To test for possible recall bias between cases and controls, which can occur when historical residence location is obtained by interview, and which can lead to misclassification of imputed exposure by disease status, we introduce an “incompleteness index,” equal to the percentage of dose imputed (PDI) for a subject. “Effective doses” can be computed using different functional dependencies of relative risk on age of exposure, allowing intercomparison of different risk models. To illustrate our approach, we quantify lifetime exposure (dose) from traffic air pollution in an established case–control study on Long Island, New York, where considerable in-migration occurred over a period of many decades.
The major result is the described approach to imputation. The illustrative example revealed potential recall bias, suggesting that regressions against health data should be done as a function of PDI to check for consistency of results. The 1% of study subjects who lived for long durations near heavily trafficked intersections, had very high cumulative exposures. Thus, imputation methods must be designed to reproduce non-standard distributions.
Our approach meets a number of methodological challenges to extending historical exposure reconstruction over a lifetime and shows promise for environmental epidemiology. Application to assessment of breast cancer risks will be reported in a subsequent manuscript.
Exposure; Air pollution; Traffic; Benzo(a)pyrene; PAH; Multiple imputation; Epidemiology; In-migration; Dose
Intermediate outcome variables can often be used as auxiliary variables for the true outcome of interest in randomized clinical trials. For many cancers, time to recurrence is an informative marker in predicting a patient’s overall survival outcome, and could provide auxiliary information for the analysis of survival times.
To investigate whether models linking recurrence and death combined with a multiple imputation procedure for censored observations can result in efficiency gains in the estimation of treatment effects, and be used to shorten trial lengths.
Recurrence and death times are modeled using data from 12 trials in colorectal cancer. Multiple imputation is used as a strategy for handling missing values arising from censoring. The imputation procedure uses a cure model for time to recurrence and a time-dependent Weibull proportional hazards model for time to death. Recurrence times are imputed, and then death times are imputed conditionally on recurrence times. To illustrate these methods, trials are artificially censored 2-years after the last accrual, the imputation procedure is implemented, and a log-rank test and Cox model are used to analyze and compare these new data with the original data.
The results show modest, but consistent gains in efficiency in the analysis by using the auxiliary information in recurrence times. Comparison of analyses show the treatment effect estimates and log rank test results from the 2-year censored imputed data to be in between the estimates from the original data and the artificially censored data, indicating that the procedure was able to recover some of the lost information due to censoring.
The models used are all fully parametric, requiring distributional assumptions of the data.
The proposed models may be useful to improve the efficiency in estimation of treatment effects in cancer trials and shortening trial length.
Auxiliary Variables; Colon Cancer; Cure Models; Multiple Imputation; Surrogate Endpoints
Censoring that is dependent on covariates associated with survival can arise in randomized trials due to changes in recruitment and eligibility criteria to minimize withdrawals, potentially leading to biased treatment effect estimates. Imputation approaches have been proposed to address censoring in survival analysis; and while these approaches may provide unbiased estimates of treatment effects, imputation of a large number of outcomes may over- or underestimate the associated variance based on the imputation pool selected.
We propose an improved method, risk-stratified imputation, as an alternative to address withdrawal related to the risk of events in the context of time-to-event analyses.
Our algorithm performs imputation from a pool of replacement subjects with similar values of both treatment and covariate(s) of interest, that is, from a risk-stratified sample. This stratification prior to imputation addresses the requirement of time-to-event analysis that censored observations are representative of all other observations in the risk group with similar exposure variables. We compared our risk-stratified imputation to case deletion and bootstrap imputation in a simulated dataset in which the covariate of interest (study withdrawal) was related to treatment. A motivating example from a recent clinical trial is also presented to demonstrate the utility of our method.
In our simulations, risk-stratified imputation gives estimates of treatment effect comparable to bootstrap and auxiliary variable imputation while avoiding inaccuracies of the latter two in estimating the associated variance. Similar results were obtained in analysis of clinical trial data.
Risk-stratified imputation has little advantage over other imputation methods when covariates of interest are not related to treatment, although its performance is superior when covariates are related to treatment. Risk-stratified imputation is intended for categorical covariates, and may be sensitive to the width of the matching window if continuous covariates are used.
The use of the risk-stratified imputation should facilitate the analysis of many clinical trials, in which one group has a higher withdrawal rate that is related to treatment.
Censoring; Survival; Imputation; Randomized Trials; CREST; Time to Event
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
Large association studies have proven to be effective tools for identifying parts of the genome that influence disease risk and other heritable traits. So-called “genotype imputation” methods form a cornerstone of modern association studies: by extrapolating genetic correlations from a densely characterized reference panel to a sparsely typed study sample, such methods can estimate unobserved genotypes with high accuracy, thereby increasing the chances of finding true associations. To date, most genome-wide imputation analyses have used reference data from the International HapMap Project. While this strategy has been successful, association studies in the near future will also have access to additional reference information, such as control sets genotyped on multiple SNP chips and dense genome-wide haplotypes from the 1,000 Genomes Project. These new reference panels should improve the quality and scope of imputation, but they also present new methodological challenges. We describe a genotype imputation method, IMPUTE version 2, that is designed to address these challenges in next-generation association studies. We show that our method can use a reference panel containing thousands of chromosomes to attain higher accuracy than is possible with the HapMap alone, and that our approach is more accurate than competing methods on both current and next-generation datasets. We also highlight the modeling issues that arise in imputation datasets.
Genotype imputation is an important tool for whole-genome prediction as it allows cost reduction of individual genotyping. However, benefits of genotype imputation have been evaluated mostly for linear additive genetic models. In this study we investigated the impact of employing imputed genotypes when using more elaborated models of phenotype prediction. Our hypothesis was that such models would be able to track genetic signals using the observed genotypes only, with no additional information to be gained from imputed genotypes.
For the present study, an outbred mice population containing 1,904 individuals and genotypes for 1,809 pre-selected markers was used. The effect of imputation was evaluated for a linear model (the Bayesian LASSO - BL) and for semi and non-parametric models (Reproducing Kernel Hilbert spaces regressions – RKHS, and Bayesian Regularized Artificial Neural Networks – BRANN, respectively). The RKHS method had the best predictive accuracy. Genotype imputation had a similar impact on the effectiveness of BL and RKHS. BRANN predictions were, apparently, more sensitive to imputation errors. In scenarios where the masking rates were 75% and 50%, the genotype imputation was not beneficial. However, genotype imputation incorporated information about important markers and improved predictive ability, especially for body mass index (BMI), when genotype information was sparse (90% masking), and for body weight (BW) when the reference sample for imputation was weakly related to the target population.
In conclusion, genotype imputation is not always helpful for phenotype prediction, and so it should be considered in a case-by-case basis. In summary, factors that can affect the usefulness of genotype imputation for prediction of yet-to-be observed traits are: the imputation accuracy itself, the structure of the population, the genetic architecture of the target trait and also the model used for phenotype prediction.
Genotype imputation; Genome-enabled prediction; Complex traits; Non-linear models
Multiple imputation (MI) provides an effective approach to handle missing covariate data within prognostic modelling studies, as it can properly account for the missing data uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling techniques to obtain the estimates of interest. The estimates from each imputed dataset are then combined into one overall estimate and variance, incorporating both the within and between imputation variability. Rubin's rules for combining these multiply imputed estimates are based on asymptotic theory. The resulting combined estimates may be more accurate if the posterior distribution of the population parameter of interest is better approximated by the normal distribution. However, the normality assumption may not be appropriate for all the parameters of interest when analysing prognostic modelling studies, such as predicted survival probabilities and model performance measures.
Guidelines for combining the estimates of interest when analysing prognostic modelling studies are provided. A literature review is performed to identify current practice for combining such estimates in prognostic modelling studies.
Methods for combining all reported estimates after MI were not well reported in the current literature. Rubin's rules without applying any transformations were the standard approach used, when any method was stated.
The proposed simple guidelines for combining estimates after MI may lead to a wider and more appropriate use of MI in future prognostic modelling studies.
Missing data often occur in cross-sectional surveys and longitudinal and experimental studies. The purpose of this study was to compare the prediction of self-rated health (SRH), a robust predictor of morbidity and mortality among diverse populations, before and after imputation of the missing variable “yearly household income.” We reviewed data from 4,162 participants of Mexican origin recruited from July 1, 2002, through December 31, 2005, and who were enrolled in a population-based cohort study. Missing yearly income data were imputed using three different single imputation methods and one multiple imputation under a Bayesian approach. Of 4,162 participants, 3,121 were randomly assigned to a training set (to derive the yearly income imputation methods and develop the health-outcome prediction models) and 1,041 to a testing set (to compare the areas under the curve (AUC) of the receiver-operating characteristic of the resulting health-outcome prediction models). The discriminatory powers of the SRH prediction models were good (range, 69–72%) and compared to the prediction model obtained after no imputation of missing yearly income, all other imputation methods improved the prediction of SRH (P<0.05 for all comparisons) with the AUC for the model after multiple imputation being the highest (AUC = 0.731). Furthermore, given that yearly income was imputed using multiple imputation, the odds of SRH as good or better increased by 11% for each $5,000 increment in yearly income. This study showed that although imputation of missing data for a key predictor variable can improve a risk health-outcome prediction model, further work is needed to illuminate the risk factors associated with SRH.
Self-rated health; Missing income data; Data imputation techniques; Mean substitution; Multiple imputation; Minority health
Multiple imputation (MI) was developed as a method to enable valid inferences to be obtained in the presence of missing data rather than to re-create the missing values. Within the applied setting, it remains unclear how important it is that imputed values should be plausible for individual observations. One variable type for which MI may lead to implausible values is a limited-range variable, where imputed values may fall outside the observable range. The aim of this work was to compare methods for imputing limited-range variables, with a focus on those that restrict the range of the imputed values.
Using data from a study of adolescent health, we consider three variables based on responses to the General Health Questionnaire (GHQ), a tool for detecting minor psychiatric illness. These variables, based on different scoring methods for the GHQ, resulted in three continuous distributions with mild, moderate and severe positive skewness. In an otherwise complete dataset, we set 33% of the GHQ observations to missing completely at random or missing at random; repeating this process to create 1000 datasets with incomplete data for each scenario.
For each dataset, we imputed values on the raw scale and following a zero-skewness log transformation using: univariate regression with no rounding; post-imputation rounding; truncated normal regression; and predictive mean matching. We estimated the marginal mean of the GHQ and the association between the GHQ and a fully observed binary outcome, comparing the results with complete data statistics.
Imputation with no rounding performed well when applied to data on the raw scale. Post-imputation rounding and imputation using truncated normal regression produced higher marginal means than the complete data estimate when data had a moderate or severe skew, and this was associated with under-coverage of the complete data estimate. Predictive mean matching also produced under-coverage of the complete data estimate. For the estimate of association, all methods produced similar estimates to the complete data.
For data with a limited range, multiple imputation using techniques that restrict the range of imputed values can result in biased estimates for the marginal mean when data are highly skewed.
Multiple imputation; Limited-range; Skewed data; Missing data; Rounding; Truncated regression
In prognostic studies, the lasso technique is attractive since it improves the quality of predictions by shrinking regression coefficients, compared to predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is unclear how resampling should be performed in the presence of multiply imputed data.
The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing 4 approaches of handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first 3 approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI.
The discriminative model performance of the lasso was optimistic. There was suboptimal calibration due to over-shrinkage. The estimate of optimism was sensitive to the choice of handling imputed data in the bootstrap resampling procedure. Resampling the completed data sets underestimates optimism, especially if, within a bootstrap step, selected individuals differ over the imputed data sets. Incorporating the MI procedure in the validation yields estimates of optimism that are closer to the true value, albeit slightly too larger.
Performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2288-14-116) contains supplementary material, which is available to authorized users.
Clinical prediction models; Model validation; Multiple imputation; Quality of life; Shrinkage
The purpose of this study was to assess an alternative statistical approach—multiple imputation—to risk factor redistribution in the national human immunodeficiency virus (HIV)/acquired immunodeficiency syndrome (AIDS) surveillance system as a way to adjust for missing risk factor information.
We used an approximate model incorporating random variation to impute values for missing risk factors for HIV and AIDS cases diagnosed from 2000 to 2004. The process was repeated M times to generate M datasets. We combined results from the datasets to compute an overall multiple imputation estimate and standard error (SE), and then compared results from multiple imputation and from risk factor redistribution. Variables in the imputation models were age at diagnosis, race/ethnicity, type of facility where diagnosis was made, region of residence, national origin, CD-4 T-lymphocyte cell count within six months of diagnosis, and reporting year.
In HIV data, male-to-male sexual contact accounted for 67.3% of cases by risk factor redistribution and 70.4% (SE=0.45) by multiple imputation. Also among males, injection drug use (IDU) accounted for 11.6% and 10.8% (SE=0.34), and high-risk heterosexual contact for 15.1% and 13.0% (SE=0.34) by risk factor redistribution and multiple imputation, respectively. Among females, IDU accounted for 18.2% and 17.9% (SE=0.61), and high-risk heterosexual contact for 80.8% and 80.9% (SE=0.63) by risk factor redistribution and multiple imputation, respectively.
Because multiple imputation produces less biased subgroup estimates and offers objectivity and a semiautomated approach, we suggest consideration of its use in adjusting for missing risk factor information.
Due to early colonoscopy for some participants, interval-censored observations can be introduced into the data of a colorectal polyp prevention trial. The censoring could be dependent of risk of recurrence if the reasons of having early colonoscopy are associated with recurrence. This can complicate estimation of the recurrence rate.
We propose to use midpoint imputation to convert interval-censored data problems to right censored data problems. To adjust for potential dependent censoring, we use information from auxiliary variables to define risk groups to perform the weighted Kaplan-Meier estimation to the midpoint imputed data. The risk groups are defined using two risk scores derived from two working proportional hazards models with the auxiliary variables as the covariates. One is for the recurrence time and the other is for the censoring time. The method described here is explored by simulation and illustrated with an example from a colorectal polyp prevention trial.
We first show that midpoint imputation under an assumption of independent censoring will produce an unbiased estimate of recurrence rate at the end of the trial, which is often the main interest of a colorectal polyp prevention trial, and then show in simulations that the weighted Kaplan-Meier method using the information from auxiliary variables based on the midpoint imputed data can improve efficiency in a situation with independent censoring and reduce bias in a situation with dependent censoring compared to the conventional methods, while estimating the recurrence rate at the end of the trial.
The research in this paper uses midpoint imputation to handle interval-censored observations and then uses the information from auxiliary variables to adjust for dependent censoring by incorporating them into the weighted Kaplan-Meier estimation. This approach can handle a situation with multiple auxiliary variables by deriving two risk scores from two working PH models. Although the idea of this approach might appear simple, the results do show that the weighted Kaplan-Meier approach can gain efficiency and reduce bias due to dependent censoring.
Methods to handle missing data have been an area of statistical research for many years. Little has been done within the context of pedigree analysis. In this paper we present two methods for imputing missing data for polygenic models using family data. The imputation schemes take into account familial relationships and use the observed familial information for the imputation. A traditional multiple imputation approach and multiple imputation or data augmentation approach within a Gibbs sampler for the handling of missing data for a polygenic model are presented.
We used both the Genetic Analysis Workshop 13 simulated missing phenotype and the complete phenotype data sets as the means to illustrate the two methods. We looked at the phenotypic trait systolic blood pressure and the covariate gender at time point 11 (1970) for Cohort 1 and time point 1 (1971) for Cohort 2. Comparing the results for three replicates of complete and missing data incorporating multiple imputation, we find that multiple imputation via a Gibbs sampler produces more accurate results. Thus, we recommend the Gibbs sampler for imputation purposes because of the ease with which it can be extended to more complicated models, the consistency of the results, and the accountability of the variation due to imputation.
Multiple imputation fills in missing data with posterior predictive draws
from imputation models. To assess the adequacy of imputation models, we can
compare completed data with their replicates simulated under the imputation
model. We apply analyses of substantive interest to both datasets and use
posterior predictive checks of the differences of these estimates to quantify
the evidence of model inadequacy. We can further integrate out the imputed
missing data and their replicates over the completed-data analyses to reduce
variance in the comparison. In many cases, the checking procedure can be easily
implemented using standard imputation software by treating re-imputations under
the model as posterior predictive replicates. Thus, it can be applied for
non-Bayesian imputation methods. We also sketch several strategies for applying
the method in the context of practical imputation analyses. We illustrate the
method using two real data applications and study its property using a
completed data; conservatism; imputation software; missing data; model uncongeniality; normal approximation; posterior predictive p-value; re-imputation
Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS).
1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20 question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Additionally, a missing at random and missing not at random simulation were completed. Six imputation methods were then considered; 1) multiple imputation, 2) single regression, 3) individual mean, 4) overall mean, 5) participant's preceding response, and 6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated.
When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80 indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range (0.76 and 0.74 respectively).
Multiple imputation is the most accurate method for dealing with missing data in most of the missind data scenarios we assessed for the SDS. Imputing the individual's mean is also an appropriate and simple method for dealing with missing data that may be more interpretable to the majority of medical readers. Researchers should consider conducting methodological assessments such as this one when confronted with missing data. The optimal method should balance validity, ease of interpretability for readers, and analysis expertise of the research team.
Whole brain fMRI analyses rarely include the entire brain because of missing data that result from data acquisition limits and susceptibility artifact, in particular. This missing data problem is typically addressed by omitting voxels from analysis, which may exclude brain regions that are of theoretical interest and increase the potential for Type II error at cortical boundaries or Type I error when spatial thresholds are used to establish significance. Imputation could significantly expand statistical map coverage, increase power, and enhance interpretations of fMRI results. We examined multiple imputation for group level analyses of missing fMRI data using methods that leverage the spatial information in fMRI datasets for both real and simulated data. Available case analysis, neighbor replacement, and regression based imputation approaches were compared in a general linear model framework to determine the extent to which these methods quantitatively (effect size) and qualitatively (spatial coverage) increased the sensitivity of group analyses. In both real and simulated data analysis, multiple imputation provided 1) variance that was most similar to estimates for voxels with no missing data, 2) fewer false positive errors in comparison to mean replacement, and 3) fewer false negative errors in comparison to available case analysis. Compared to the standard analysis approach of omitting voxels with missing data, imputation methods increased brain coverage in this study by 35% (from 33,323 to 45,071 voxels). In addition, multiple imputation increased the size of significant clusters by 58% and number of significant clusters across statistical thresholds, compared to the standard voxel omission approach. While neighbor replacement produced similar results, we recommend multiple imputation because it uses an informed sampling distribution to deal with missing data across subjects that can include neighbor values and other predictors. Multiple imputation is anticipated to be particularly useful for 1) large fMRI data sets with inconsistent missing voxels across subjects and 2) addressing the problem of increased artifact at ultra-high field, which significantly limit the extent of whole brain coverage and interpretations of results.
missing data; fMRI; group analysis; multiple imputation; replacement; neuroimaging methods
Multiple rare variants either within or across genes have been hypothesised to collectively influence complex human traits. The increasing availability of high throughput sequencing technologies offers the opportunity to study the effect of rare variants on these traits. However, appropriate and computationally efficient analytical methods are required to account for collections of rare variants that display a combination of protective, deleterious and null effects on the trait. We have developed a novel method for the analysis of rare genetic variation in a gene, region or pathway that, by simply aggregating summary statistics at each variant, can: (i) test for the presence of a mixture of effects on a trait; (ii) be applied to both binary and quantitative traits in population-based and family-based data; (iii) adjust for covariates to allow for non-genetic risk factors and; (iv) incorporate imputed genetic variation. In addition, for preliminary identification of promising genes, the method can be applied to association summary statistics, available from meta-analysis of published data, for example, without the need for individual level genotype data. Through simulation, we show that our method is immune to the presence of bi-directional effects, with no apparent loss in power across a range of different mixtures, and can achieve greater power than existing approaches as long as summary statistics at each variant are robust. We apply our method to investigate association of type-1 diabetes with imputed rare variants within genes in the major histocompatibility complex using genotype data from the Wellcome Trust Case Control Consortium.
Rapid advances in sequencing technology mean that it is now possible to directly assay rare genetic variation. In addition, the availability of almost fully sequenced human genomes by the 1000 Genomes Project allows genotyping at rare variants that are not present on arrays commonly used in genome-wide association studies. Rare variants within a gene or region may act to collectively influence a complex trait. Methods for testing these rare variants should be able to account for a combination of those that serve to either increase, decrease or have no effect on the trait of interest. Here, we introduce a method for the analysis of a collection of rare genetic variants, within a gene or region, which assesses evidence for a mixture of effects. Our method simply aggregates summary statistics at each variant and, as such, can be applied to both population and family-based data, to binary or quantitative traits and to either directly genotyped or imputed data. In addition, it does not require individual level genotype or phenotype data, and can be adjusted for non-genetic risk factors. We illustrate our approach by examining imputed rare variants in the major histocompatibility complex for association with type-1 diabetes using genotype data from the Wellcome Trust case Control Consortium.
To preserve patient anonymity, health register data may be provided as binned data only. Here we consider as example, how to estimate mean survival time after a diagnosis of metastatic colorectal cancer from Norwegian register data on time to death or censoring binned into 30 day intervals. All events occurring in the first three months (90 days) after diagnosis were removed to achieve comparability with a clinical trial. The aim of the paper is to develop and implement a simple, and yet flexible method for analyzing such interval censored and truncated data.
Considering interval censoring a missing data problem, we implement a simple multiple imputation strategy that allows flexible sensitivity analyses with respect to the shape of the censoring distribution. To allow identification of appropriate parametric models, a χ2-goodness-of-fit test--also imputation based--is derived and supplemented with diagnostic plots. Uncertainty estimates for mean survival times are obtained via a simulation strategy. The validity and statistical efficiency of the proposed method for varying interval lengths is investigated in a simulation study and compared with simpler alternatives.
Mean survival times estimated from the register data ranged from 1.2 (SE = 0.09) to 3.2 (0.31) years depending on period of diagnosis and choice of parametric model. The shape of the censoring distribution within intervals did generally not influence results, whereas the choice of parametric model did, even when different models fit the data equally well. In simulation studies both simple midpoint imputation and multiple imputation yielded nearly unbiased analyses (relative biases of -0.6% to 9.4%) and confidence intervals with near-nominal coverage probabilities (93.4% to 95.7%) for censoring intervals shorter than six months. For 12 month censoring intervals, multiple imputation provided better protection against bias, and coverage probabilities closer to nominal values than simple midpoint imputation.
Binning of event and censoring times should be considered a viable strategy for anonymizing register data on survival times, as they may be readily analyzed with methods based on multiple imputation.
To determine whether the imputation procedure used to replace missing data by the U.S. Census Bureau produces bias in the estimates of health insurance coverage in the Current Population Survey's (CPS) Annual Social and Economic Supplement (ASEC).
Eleven percent of the respondents to the monthly CPS do not take the ASEC supplement and the entire supplement for these respondents is imputed by the Census Bureau. We compare the health insurance coverage of these “full-supplement imputations” with those respondents answering the ASEC supplement. We then compare demographic characteristics of the two groups and model the likelihood of having insurance coverage given the data are imputed controlling for demographic characteristics. Finally, in order to gauge the impact of imputation on the uninsurance rate we remove the full-supplement imputations and reweight the data, and we also use the multivariate regression model to simulate what the uninsurance rate would be under the counter-factual simulation that no cases had the full-supplement imputation.
The noninstitutionalized U.S. population under 65 years of age in 2004.
Data Extraction Methods
The CPS-ASEC survey was extracted from the U.S. Census Bureau's FTP web page in September of 2004 (http://www.bls.census.gov/ferretftp.htm).
In the 2004 CPS-ASEC, 59.3 percent of the full-supplement imputations under age 65 years had private health insurance coverage as compared with 69.1 percent of the nonfull-supplement imputations. Furthermore, full-supplement imputations have a 26.4 percent uninsurance rate while all others have an uninsurance rate of 16.6 percent. Having imputed data remains a significant predictor of health insurance coverage in multivariate models with demographic controls. Both our reweighting strategy and our counterfactual modeling show that the uninsured rate is approximately one percentage point higher than it should be for people under 65 (i.e., approximately 2.5 million more people are counted as uninsured due to this imputation bias).
The imputed ASEC data are coding too many people to be uninsured. The situation is complicated by the current survey items in the ASEC instrument allowing all members of a household to be assigned coverage with the single press of a button. The Census Bureau should consider altering its imputation specifications and, more importantly, altering how it collects survey data from those who respond to the supplement.
Implications for Policy Delivery or Practice
The bias affects many different policy simulations, policy evaluations and federal funding allocations that rely on the CPS-ASEC data.
Primary Funding Source
The Robert Wood Johnson Foundation.
Health insurance coverage; current population survey; annual social and economic supplement; hotdeck imputation; item nonresponse; missing data