# Related Articles

Background

Conventional multiple-trait quantitative trait locus (QTL) mapping methods must discard cases (individuals) with incomplete phenotypic data, thereby sacrificing other phenotypic and genotypic information contained in the discarded cases. Under standard assumptions about the missing-data mechanism, it is possible to exploit these cases.

Results

We present an expectation-maximization (EM) algorithm, derived for recombinant inbred and F2 genetic models but extensible to any mating design, that supports conventional hypothesis tests for QTL main effect, pleiotropy, and QTL-by-environment interaction in multiple-trait analyses with missing phenotypic data. We evaluate its performance by simulations and illustrate with a real-data example.

Conclusion

The EM method affords improved QTL detection power and precision of QTL location and effect estimation in comparison with case deletion or imputation methods. It may be incorporated into any least-squares or likelihood-maximization QTL-mapping approach.

doi:10.1186/1471-2156-9-82

PMCID: PMC2639387
PMID: 19061502

Missing data are common in medical and social science studies and often pose a serious challenge in data analysis. Multiple imputation methods are popular and natural tools for handling missing data, replacing each missing value with a set of plausible values that represent the uncertainty about the underlying values. We consider a case of missing at random (MAR) and investigate the estimation of the marginal mean of an outcome variable in the presence of missing values when a set of fully observed covariates is available. We propose a new nonparametric multiple imputation (MI) approach that uses two working models to achieve dimension reduction and define the imputing sets for the missing observations. Compared with existing nonparametric imputation procedures, our approach can better handle covariates of high dimension, and is doubly robust in the sense that the resulting estimator remains consistent if either of the working models is correctly specified. Compared with existing doubly robust methods, our nonparametric MI approach is more robust to the misspecification of both working models; it also avoids the use of inverse-weighting and hence is less sensitive to missing probabilities that are close to 1. We propose a sensitivity analysis for evaluating the validity of the working models, allowing investigators to choose the optimal weights so that the resulting estimator relies either completely or more heavily on the working model that is likely to be correctly specified and achieves improved efficiency. We investigate the asymptotic properties of the proposed estimator, and perform simulation studies to show that the proposed method compares favorably with some existing methods in finite samples. The proposed method is further illustrated using data from a colorectal adenoma study.

PMCID: PMC3280694
PMID: 22347786

Doubly robust; Missing at random; Multiple imputation; Nearest neighbor; Nonparametric imputation; Sensitivity analysis

Methods to handle missing data have been an area of statistical research for many years. Little has been done within the context of pedigree analysis. In this paper we present two methods for imputing missing data for polygenic models using family data. The imputation schemes take into account familial relationships and use the observed familial information for the imputation. A traditional multiple imputation approach and multiple imputation or data augmentation approach within a Gibbs sampler for the handling of missing data for a polygenic model are presented.

We used both the Genetic Analysis Workshop 13 simulated missing phenotype and the complete phenotype data sets as the means to illustrate the two methods. We looked at the phenotypic trait systolic blood pressure and the covariate gender at time point 11 (1970) for Cohort 1 and time point 1 (1971) for Cohort 2. Comparing the results for three replicates of complete and missing data incorporating multiple imputation, we find that multiple imputation via a Gibbs sampler produces more accurate results. Thus, we recommend the Gibbs sampler for imputation purposes because of the ease with which it can be extended to more complicated models, the consistency of the results, and the accountability of the variation due to imputation.

doi:10.1186/1471-2156-4-S1-S42

PMCID: PMC1866478
PMID: 14975110

Many biological traits are discretely distributed in phenotype but continuously distributed in genetics because they are controlled by multiple genes and environmental variants. Due to the quantitative nature of the genetic background, these multiple genes are called quantitative trait loci (QTL). When the QTL effects are treated as random, they can be estimated in a single generalized linear mixed model (GLMM), even if the number of QTL may be larger than the sample size. The GLMM in its original form cannot be applied to QTL mapping for discrete traits if there are missing genotypes. We examined two alternative missing genotype-handling methods: the expectation method and the overdispersion method. Simulation studies show that the two methods are efficient for multiple QTL mapping (MQM) under the GLMM framework. The overdispersion method showed slight advantages over the expectation method in terms of smaller mean-squared errors of the estimated QTL effects. The two methods of GLMM were applied to MQM for the female fertility trait of wheat. Multiple QTL were detected to control the variation of the number of seeded spikelets.

doi:10.1038/hdy.2012.10

PMCID: PMC3375403
PMID: 22415425

binary trait; binomial trait; mixed model; overdispersion; QTL

Summary

Multiple imputation is a practically useful approach to handling incompletely observed data in statistical analysis. Parameter estimation and inference based on imputed full data have been made easy by Rubin's rule for result combination. However, creating proper imputation that accommodates flexible models for statistical analysis in practice can be very challenging. We propose an imputation framework that uses conditional semiparametric odds ratio models to impute the missing values. The proposed imputation framework is more flexible and robust than the imputation approach based on the normal model. It is a compatible framework in comparison to the approach based on fully conditionally specified models. The proposed algorithms for multiple imputation through the Monte Carlo Markov Chain sampling approach can be straightforwardly carried out. Simulation studies demonstrate that the proposed approach performs better than existing, commonly used imputation approaches. The proposed approach is applied to imputing missing values in bone fracture data.

doi:10.1111/j.1541-0420.2010.01538.x

PMCID: PMC3135790
PMID: 21210771

Acceptance-rejection sampling; Dirichlet process prior; Gibbs sampler; Hybrid MCMC; Molecular dynamics algorithm; Nonparametric Bayesian inference; Rejection control

Background

There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods

Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms; missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results

Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI, underestimated the variability; resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with a MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion

The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.

doi:10.1186/1471-2288-10-7

PMCID: PMC2824146
PMID: 20085642

Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration.

doi:10.1002/sim.5844

PMCID: PMC3963448
PMID: 23703895

missing data; multiple imputation; meta-analysis; individual participant data; Rubin's rules

Background

Multiple imputation is a commonly used method for handling incomplete covariates as it can provide valid inference when data are missing at random. This depends on being able to correctly specify the parametric model used to impute missing values, which may be difficult in many realistic settings. Imputation by predictive mean matching (PMM) borrows an observed value from a donor with a similar predictive mean; imputation by local residual draws (LRD) instead borrows the donor’s residual. Both methods relax some assumptions of parametric imputation, promising greater robustness when the imputation model is misspecified.

Methods

We review development of PMM and LRD and outline the various forms available, and aim to clarify some choices about how and when they should be used. We compare performance to fully parametric imputation in simulation studies, first when the imputation model is correctly specified and then when it is misspecified.

Results

In using PMM or LRD we strongly caution against using a single donor, the default value in some implementations, and instead advocate sampling from a pool of around 10 donors. We also clarify which matching metric is best. Among the current MI software there are several poor implementations.

Conclusions

PMM and LRD may have a role for imputing covariates (i) which are not strongly associated with outcome, and (ii) when the imputation model is thought to be slightly but not grossly misspecified. Researchers should spend efforts on specifying the imputation model correctly, rather than expecting predictive mean matching or local residual draws to do the work.

doi:10.1186/1471-2288-14-75

PMCID: PMC4051964
PMID: 24903709

Multiple imputation; Imputation model; Predictive mean matching; Local residual draws; Missing data

Motivation: R/qtl is free and powerful software for mapping and exploring quantitative trait loci (QTL). R/qtl provides a fully comprehensive range of methods for a wide range of experimental cross types. We recently added multiple QTL mapping (MQM) to R/qtl. MQM adds higher statistical power to detect and disentangle the effects of multiple linked and unlinked QTL compared with many other methods. MQM for R/qtl adds many new features including improved handling of missing data, analysis of 10 000 s of molecular traits, permutation for determining significance thresholds for QTL and QTL hot spots, and visualizations for cis–trans and QTL interaction effects. MQM for R/qtl is the first free and open source implementation of MQM that is multi-platform, scalable and suitable for automated procedures and large genetical genomics datasets.

Availability: R/qtl is free and open source multi-platform software for the statistical language R, and is made available under the GPLv3 license. R/qtl can be installed from http://www.rqtl.org/. R/qtl queries should be directed at the mailing list, see http://www.rqtl.org/list/.

Contact: kbroman@biostat.wisc.edu

doi:10.1093/bioinformatics/btq565

PMCID: PMC2982156
PMID: 20966004

Genotyping by sequencing (GBS) recently has emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, concerns are not lacking about the uniquely large unbalance in GBS genotype data. Although some genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data, with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single-nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, and missing data with up to 90% of missing observations were randomly generated and then imputed for missing genotypes with three map-independent imputation methods. Estimating heterozygosity and inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias from assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of interested groups generally were reduced with increased levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than those analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of the genetic diversity analysis with respect to large missing data and genotype imputation but also are instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data.

doi:10.1534/g3.114.010942

PMCID: PMC4025488
PMID: 24626289

Genetic diversity; genotyping-by-sequencing; imputation; missing data; unordered marker data

The presence of missing data in association studies is an important problem, particularly with high-density single-nucleotide polymorphism (SNP) maps, because the probability that at least one genotype is missing dramatically increases with the number of markers. A possible strategy is to simply ignore the missing data and only use the complete observations, and, consequently, to accept a significant decrease of the sample size. Using Genetic Analysis Workshop 15 simulated data on which we removed some genotypes to generate different levels of missing data, we show that this strategy might lead to an important loss in power to detect association, but may also result in false conclusions regarding the most likely susceptibility site if another marker is in linkage disequilibrium with the disease susceptibility site. We propose a multiple imputation approach to deal with missing data on case-parent trios and evaluated the performance of this approach on the same simulated data. We found that our multiple imputation approach has high power to detect association with the susceptibility site even with a large amount of missing data, and can identify the susceptibility sites among a set of sites in linkage disequilibrium.

PMCID: PMC2367517
PMID: 18466521

Background

Multiple imputation (MI) provides an effective approach to handle missing covariate data within prognostic modelling studies, as it can properly account for the missing data uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling techniques to obtain the estimates of interest. The estimates from each imputed dataset are then combined into one overall estimate and variance, incorporating both the within and between imputation variability. Rubin's rules for combining these multiply imputed estimates are based on asymptotic theory. The resulting combined estimates may be more accurate if the posterior distribution of the population parameter of interest is better approximated by the normal distribution. However, the normality assumption may not be appropriate for all the parameters of interest when analysing prognostic modelling studies, such as predicted survival probabilities and model performance measures.

Methods

Guidelines for combining the estimates of interest when analysing prognostic modelling studies are provided. A literature review is performed to identify current practice for combining such estimates in prognostic modelling studies.

Results

Methods for combining all reported estimates after MI were not well reported in the current literature. Rubin's rules without applying any transformations were the standard approach used, when any method was stated.

Conclusion

The proposed simple guidelines for combining estimates after MI may lead to a wider and more appropriate use of MI in future prognostic modelling studies.

doi:10.1186/1471-2288-9-57

PMCID: PMC2727536
PMID: 19638200

Summary

Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this paper, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism where the missingness mechansim is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to non-identifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset.

doi:10.1111/biom.12159

PMCID: PMC4061254
PMID: 24571224

Dimension reduction; Estimating equations; Missing at random; Non-ignorable missing data; Robust method

Due to the growing need to combine data across multiple studies and to impute untyped markers based on a reference sample, several analytical tools for imputation and analysis of missing genotypes have been developed. Current imputation methods rely on single imputation, which ignores the variation in estimation due to imputation. An alternative to single imputation is multiple imputation. In this paper, we assess the variation in imputation by completing both single and multiple imputations of genotypic data using MACH, a commonly used hidden Markov model imputation method. Using data from the North American Rheumatoid Arthritis Consortium genome-wide study, the use of single and multiple imputation was assessed in four regions of chromosome 1 with varying levels of linkage disequilibrium and association signals. Two scenarios for missing genotypic data were assessed: imputation of untyped markers and combination of genotypic data from two studies. This limited study involving four regions indicates that, contrary to expectations, multiple imputations may not be necessary.

PMCID: PMC2795971
PMID: 20018064

Missing covariate data present a challenge to tree-structured methodology due to the fact that a single tree model, as opposed to an estimated parameter value, may be desired for use in a clinical setting. To address this problem, we suggest a multiple imputation algorithm that adds draws of stochastic error to a tree-based single imputation method presented by Conversano and Siciliano (Technical Report, University of Naples, 2003). Unlike previously proposed techniques for accommodating missing covariate data in tree-structured analyses, our methodology allows the modeling of complex and nonlinear covariate structures while still resulting in a single tree model. We perform a simulation study to evaluate our stochastic multiple imputation algorithm when covariate data are missing at random and compare it to other currently used methods. Our algorithm is advantageous for identifying the true underlying covariate structure when complex data and larger percentages of missing covariate observations are present. It is competitive with other current methods with respect to prediction accuracy. To illustrate our algorithm, we create a tree-structured survival model for predicting time to treatment response in older, depressed adults.

doi:10.1002/sim.4079

PMCID: PMC3021888
PMID: 20963751

regression trees; classification trees; survival trees; survival analysis; missing data; imputation

Background

Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate to impute missing data from CRTs since they assume independent data. In this paper, under the assumption of missing completely at random and covariate dependent missing, we compared six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and complete case analysis approach using a simulation study.

Method

We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are logistic regression method, propensity score method, and Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are propensity score method, random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT) which has complete data, we designed a simulation study to investigate the performance of above MI strategies.

Results

The estimated treatment effect and its 95% confidence interval (CI) from generalized estimating equations (GEE) model based on the CHAT complete dataset are 1.14 (0.76 1.70). When 30% of binary outcome are missing completely at random, a simulation study shows that the estimated treatment effects and the corresponding 95% CIs from GEE model are 1.15 (0.76 1.75) if complete case analysis is used, 1.12 (0.72 1.73) if within-cluster MCMC method is used, 1.21 (0.80 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82 1.64) if standard logistic regression which does not account for clustering is used.

Conclusion

When the percentage of missing data is low or intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate to handle the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from GEE and RE logistic regression models are similar.

doi:10.1186/1471-2288-11-18

PMCID: PMC3055218
PMID: 21324148

Several approaches exist for handling missing covariates in the Cox proportional hazards model. The multiple imputation (MI) is relatively easy to implement with various software available and results in consistent estimates if the imputation model is correct. On the other hand, the fully augmented weighted estimators (FAWEs) recover a substantial proportion of the efficiency and have the doubly robust property. In this paper, we compare the FAWEs and the MI through a comprehensive simulation study. For the MI, we consider the multiple imputation by chained equation (MICE) and focus on two imputation methods: Bayesian linear regression imputation and predictive mean matching. Simulation results show that the imputation methods can be rather sensitive to model misspecification and may have large bias when the censoring time depends on the missing covariates. In contrast, the FAWEs allow the censoring time to depend on the missing covariates and are remarkably robust as long as getting either the conditional expectations or the selection probability correct due to the doubly robust property. The comparison suggests that the FAWEs show the potential for being a competitive and attractive tool for tackling the analysis of survival data with missing covariates.

doi:10.1002/sim.4016

PMCID: PMC4022355
PMID: 20806403

accelerated failure time model; augmented inverse probability weighted estimators; doubly robust property; missing data; proportional hazards model; survival analysis

Background

Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.

Methods

Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly-skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses.

Results

For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate’s effect when skewness was not adequately addressed. For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure.

Conclusions

Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. Further work is needed to provide clearer guidelines on effective application of this method.

doi:10.1186/1742-7622-9-3

PMCID: PMC3544721
PMID: 22695083

Missing data; Multiple imputation; Fully conditional specification; Multivariate normal imputation; Non-normal data

We present a framework for generating multiple imputations for continuous data when the missing data mechanism is unknown. Imputations are generated from more than one imputation model in order to incorporate uncertainty regarding the missing data mechanism. Parameter estimates based on the different imputation models are combined using rules for nested multiple imputation. Through the use of simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal clinical trial of low-income women with depression where nonignorably missing data were a concern. We show that different assumptions regarding the missing data mechanism can have a substantial impact on inferences. Our method provides a simple approach for formalizing subjective notions regarding nonresponse so that they can be easily stated, communicated, and compared.

doi:10.1214/12-AOAS555

PMCID: PMC3596844
PMID: 23503984

nonignorable; NMAR; MNAR; not missing at random; missing not at random

Multiple imputation (MI) is an approach widely used in statistical analysis of incomplete data. However, its application to missing data problems in nonlinear mixed-effects modelling is limited. The objective was to implement a four-step MI method for handling missing covariate data in NONMEM and to evaluate the method’s sensitivity to η-shrinkage. Four steps were needed; (1) estimation of empirical Bayes estimates (EBEs) using a base model without the partly missing covariate, (2) a regression model for the covariate values given the EBEs from subjects with covariate information, (3) imputation of covariates using the regression model and (4) estimation of the population model. Steps (3) and (4) were repeated several times. The procedure was automated in PsN and is now available as the mimp functionality (http://psn.sourceforge.net/). The method’s sensitivity to shrinkage in EBEs was evaluated in a simulation study where the covariate was missing according to a missing at random type of missing data mechanism. The η-shrinkage was increased in steps from 4.5 to 54%. Two hundred datasets were simulated and analysed for each scenario. When shrinkage was low the MI method gave unbiased and precise estimates of all population parameters. With increased shrinkage the estimates became less precise but remained unbiased.

doi:10.1208/s12248-013-9508-0

PMCID: PMC3787209
PMID: 23868748

covariates; missing data; multiple imputation; NONMEM

Background

Multiple imputation (MI) is becoming increasingly popular as a strategy for handling missing data, but there is a scarcity of tools for checking the adequacy of imputation models. The Kolmogorov-Smirnov (KS) test has been identified as a potential diagnostic method for assessing whether the distribution of imputed data deviates substantially from that of the observed data. The aim of this study was to evaluate the performance of the KS test as an imputation diagnostic.

Methods

Using simulation, we examined whether the KS test could reliably identify departures from assumptions made in the imputation model. To do this we examined how the p-values from the KS test behaved when skewed and heavy-tailed data were imputed using a normal imputation model. We varied the amount of missing data, the missing data models and the amount of skewness, and evaluated the performance of KS test in diagnosing issues with the imputation models under these different scenarios.

Results

The KS test was able to flag differences between the observations and imputed values; however, these differences did not always correspond to problems with MI inference for the regression parameter of interest. When there was a strong missing at random dependency, the KS p-values were very small, regardless of whether or not the MI estimates were biased; so that the KS test was not able to discriminate between imputed variables that required further investigation, and those that did not. The p-values were also sensitive to sample size and the proportion of missing data, adding to the challenge of interpreting the results from the KS test.

Conclusions

Given our study results, it is difficult to establish guidelines or recommendations for using the KS test as a diagnostic tool for MI. The investigation of other imputation diagnostics and their incorporation into statistical software are important areas for future research.

doi:10.1186/1471-2288-13-144

PMCID: PMC3840572
PMID: 24252653

Missing data; Multiple imputation; Model checking; Kolmogorov-Smirnov test; Diagnostics; Simulations

Summary

Two approaches commonly used to deal with missing data are multiple
imputation (MI) and inverse-probability weighting (IPW). IPW is also used to
adjust for unequal sampling fractions. MI is generally more efficient than
IPW but more complex. Whereas IPW requires only a model for the probability
that an individual has complete data (a univariate outcome), MI needs a
model for the joint distribution of the missing data (a multivariate
outcome) given the observed data. Inadequacies in either model may lead to
important bias if large amounts of data are missing. A third approach
combines MI and IPW to give a doubly robust estimator. A fourth approach
(IPW/MI) combines MI and IPW but, unlike doubly robust methods, imputes only
isolated missing values and uses weights to account for remaining larger
blocks of unimputed missing data, such as would arise, e.g., in a cohort
study subject to sample attrition, and/or unequal sampling fractions. In
this article, we examine the performance, in terms of bias and efficiency,
of IPW/MI relative to MI and IPW alone and investigate whether the
Rubin’s rules variance estimator is valid for IPW/MI. We prove that
the Rubin’s rules variance estimator is valid for IPW/MI for linear
regression with an imputed outcome, we present simulations supporting the
use of this variance estimator in more general settings, and we demonstrate
that IPW/MI can have advantages over alternatives. IPW/MI is applied to data
from the National Child Development Study.

doi:10.1111/j.1541-0420.2011.01666.x

PMCID: PMC3412287
PMID: 22050039

Marginal model; Missing at random; Survey weighting; 1958 British Birth Cohort

An Approximate Bayesian Bootstrap (ABB) offers advantages in incorporating appropriate uncertainty when imputing missing data, but most implementations of the ABB have lacked the ability to handle nonignorable missing data where the probability of missingness depends on unobserved values. This paper outlines a strategy for using an ABB to multiply impute nonignorable missing data. The method allows the user to draw inferences and perform sensitivity analyses when the missing data mechanism cannot automatically be assumed to be ignorable. Results from imputing missing values in a longitudinal depression treatment trial as well as a simulation study are presented to demonstrate the method’s performance. We show that a procedure that uses a different type of ABB for each imputed data set accounts for appropriate uncertainty and provides nominal coverage.

doi:10.1016/j.csda.2008.07.042

PMCID: PMC2678725
PMID: 20016665

Not Missing at Random; NMAR; Multiple Imputation; Hot-Deck

Background

A common feature of microarray experiments is the occurence of missing gene expression data. These missing values occur for a variety of reasons, in particular, because of the filtering of poor quality spots and the removal of undefined values when a logarithmic transformation is applied to negative background-corrected intensities. The efficiency and power of an analysis performed can be substantially reduced by having an incomplete matrix of gene intensities. Additionally, most statistical methods require a complete intensity matrix. Furthermore, biases may be introduced into analyses through missing information on some genes. Thus methods for appropriately replacing (imputing) missing data and/or weighting poor quality spots are required.

Results

We present a likelihood-based method for imputing missing data or weighting poor quality spots that requires a number of biological or technical replicates. This likelihood-based approach assumes that the data for a given spot arising from each channel of a two-dye (two-channel) cDNA microarray comparison experiment independently come from a three-component mixture distribution – the parameters of which are estimated through use of a constrained E-M algorithm. Posterior probabilities of belonging to each component of the mixture distributions are calculated and used to decide whether imputation is required. These posterior probabilities may also be used to construct quality weights that can down-weight poor quality spots in any analysis performed afterwards. The approach is illustrated using data obtained from an experiment to observe gene expression changes with 24 hr paclitaxel (Taxol ®) treatment on a human cervical cancer derived cell line (HeLa).

Conclusion

As the quality of microarray experiments affect downstream processes, it is important to have a reliable and automatic method of identifying poor quality spots and arrays. We propose a method of identifying poor quality spots, and suggest a method of repairing the arrays by either imputation or assigning quality weights to the spots. This repaired data set would be less biased and can be analysed using any of the appropriate statistical methods found in the microarray literature.

doi:10.1186/1471-2105-6-234

PMCID: PMC1262693
PMID: 16185360

Background

The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model.

Methods

Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied; a) complete case analysis (CC) b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM) and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained.

Results

CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness.

Conclusions

Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.

doi:10.1186/1471-2288-10-112

PMCID: PMC3019210
PMID: 21194416