Related Articles
Background
Multiple imputation (MI) provides an effective approach to handle missing covariate data within prognostic modelling studies, as it can properly account for the missing data uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling techniques to obtain the estimates of interest. The estimates from each imputed dataset are then combined into one overall estimate and variance, incorporating both the within and between imputation variability. Rubin's rules for combining these multiply imputed estimates are based on asymptotic theory. The resulting combined estimates may be more accurate if the posterior distribution of the population parameter of interest is better approximated by the normal distribution. However, the normality assumption may not be appropriate for all the parameters of interest when analysing prognostic modelling studies, such as predicted survival probabilities and model performance measures.
Methods
Guidelines for combining the estimates of interest when analysing prognostic modelling studies are provided. A literature review is performed to identify current practice for combining such estimates in prognostic modelling studies.
Results
Methods for combining all reported estimates after MI were not well reported in the current literature. Rubin's rules without applying any transformations were the standard approach used, when any method was stated.
Conclusion
The proposed simple guidelines for combining estimates after MI may lead to a wider and more appropriate use of MI in future prognostic modelling studies.
doi:10.1186/1471-2288-9-57
PMCID: PMC2727536
PMID: 19638200
Background
Intermediate outcome variables can often be used as auxiliary variables for the true outcome of interest in randomized clinical trials. For many cancers, time to recurrence is an informative marker in predicting a patient’s overall survival outcome, and could provide auxiliary information for the analysis of survival times.
Purpose
To investigate whether models linking recurrence and death combined with a multiple imputation procedure for censored observations can result in efficiency gains in the estimation of treatment effects, and be used to shorten trial lengths.
Methods
Recurrence and death times are modeled using data from 12 trials in colorectal cancer. Multiple imputation is used as a strategy for handling missing values arising from censoring. The imputation procedure uses a cure model for time to recurrence and a time-dependent Weibull proportional hazards model for time to death. Recurrence times are imputed, and then death times are imputed conditionally on recurrence times. To illustrate these methods, trials are artificially censored 2-years after the last accrual, the imputation procedure is implemented, and a log-rank test and Cox model are used to analyze and compare these new data with the original data.
Results
The results show modest, but consistent gains in efficiency in the analysis by using the auxiliary information in recurrence times. Comparison of analyses show the treatment effect estimates and log rank test results from the 2-year censored imputed data to be in between the estimates from the original data and the artificially censored data, indicating that the procedure was able to recover some of the lost information due to censoring.
Limitations
The models used are all fully parametric, requiring distributional assumptions of the data.
Conclusions
The proposed models may be useful to improve the efficiency in estimation of treatment effects in cancer trials and shortening trial length.
doi:10.1177/1740774511414741
PMCID: PMC3197975
PMID: 21921063
Auxiliary Variables; Colon Cancer; Cure Models; Multiple Imputation; Surrogate Endpoints
Due to the growing need to combine data across multiple studies and to impute untyped markers based on a reference sample, several analytical tools for imputation and analysis of missing genotypes have been developed. Current imputation methods rely on single imputation, which ignores the variation in estimation due to imputation. An alternative to single imputation is multiple imputation. In this paper, we assess the variation in imputation by completing both single and multiple imputations of genotypic data using MACH, a commonly used hidden Markov model imputation method. Using data from the North American Rheumatoid Arthritis Consortium genome-wide study, the use of single and multiple imputation was assessed in four regions of chromosome 1 with varying levels of linkage disequilibrium and association signals. Two scenarios for missing genotypic data were assessed: imputation of untyped markers and combination of genotypic data from two studies. This limited study involving four regions indicates that, contrary to expectations, multiple imputations may not be necessary.
PMCID: PMC2795971
PMID: 20018064
SYNOPSIS
Objective.
The purpose of this study was to assess an alternative statistical approach—multiple imputation—to risk factor redistribution in the national human immunodeficiency virus (HIV)/acquired immunodeficiency syndrome (AIDS) surveillance system as a way to adjust for missing risk factor information.
Methods.
We used an approximate model incorporating random variation to impute values for missing risk factors for HIV and AIDS cases diagnosed from 2000 to 2004. The process was repeated M times to generate M datasets. We combined results from the datasets to compute an overall multiple imputation estimate and standard error (SE), and then compared results from multiple imputation and from risk factor redistribution. Variables in the imputation models were age at diagnosis, race/ethnicity, type of facility where diagnosis was made, region of residence, national origin, CD-4 T-lymphocyte cell count within six months of diagnosis, and reporting year.
Results.
In HIV data, male-to-male sexual contact accounted for 67.3% of cases by risk factor redistribution and 70.4% (SE=0.45) by multiple imputation. Also among males, injection drug use (IDU) accounted for 11.6% and 10.8% (SE=0.34), and high-risk heterosexual contact for 15.1% and 13.0% (SE=0.34) by risk factor redistribution and multiple imputation, respectively. Among females, IDU accounted for 18.2% and 17.9% (SE=0.61), and high-risk heterosexual contact for 80.8% and 80.9% (SE=0.63) by risk factor redistribution and multiple imputation, respectively.
Conclusions.
Because multiple imputation produces less biased subgroup estimates and offers objectivity and a semiautomated approach, we suggest consideration of its use in adjusting for missing risk factor information.
PMCID: PMC2496935
PMID: 18828417
When the outcome of interest is a quantity whose value may be altered through the use of medications, estimation of associations with this outcome is a challenging statistical problem. For participants taking medication the treated value is observed, but the underlying “untreated” value may be the measure that is truly of interest. Problematically, those with the highest untreated values may have some of the lowest observed measurements due to the effectiveness of medications. In this paper we propose an approach in which we parametrically estimate the underlying untreated variable of interest as a function of the observed treated value, dose and type of medication. Multiple imputation is used to incorporate the variability induced by the estimation. We show that this approach yields more realistic parameter estimates than other more traditional approaches to the problem, and that study conclusions may be altered in a meaningful way by using the imputed values.
doi:10.1002/sim.3341
PMCID: PMC2829271
PMID: 18613245
Multiple imputation is commonly used to impute missing data, and is typically more efficient than complete cases analysis in regression analysis when covariates have missing values. Imputation may be performed using a regression model for the incomplete covariates on other covariates and, importantly, on the outcome. With a survival outcome, it is a common practice to use the event indicator D and the log of the observed event or censoring time T in the imputation model, but the rationale is not clear.
We assume that the survival outcome follows a proportional hazards model given covariates X and Z. We show that a suitable model for imputing binary or Normal X is a logistic or linear regression on the event indicator D, the cumulative baseline hazard H0(T), and the other covariates Z. This result is exact in the case of a single binary covariate; in other cases, it is approximately valid for small covariate effects and/or small cumulative incidence. If we do not know H0(T), we approximate it by the Nelson–Aalen estimator of H(T) or estimate it by Cox regression.
We compare the methods using simulation studies. We find that using log T biases covariate-outcome associations towards the null, while the new methods have lower bias. Overall, we recommend including the event indicator and the Nelson–Aalen estimator of H(T) in the imputation model. Copyright © 2009 John Wiley & Sons, Ltd.
doi:10.1002/sim.3618
PMCID: PMC2998703
PMID: 19452569
missing data; missing covariates; multiple imputation; proportional hazards model
Background
Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.
Methods
Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly-skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses.
Results
For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate’s effect when skewness was not adequately addressed. For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure.
Conclusions
Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. Further work is needed to provide clearer guidelines on effective application of this method.
doi:10.1186/1742-7622-9-3
PMCID: PMC3544721
PMID: 22695083
Missing data; Multiple imputation; Fully conditional specification; Multivariate normal imputation; Non-normal data
Missing data often occur in cross-sectional surveys and longitudinal and experimental studies. The purpose of this study was to compare the prediction of self-rated health (SRH), a robust predictor of morbidity and mortality among diverse populations, before and after imputation of the missing variable “yearly household income.” We reviewed data from 4,162 participants of Mexican origin recruited from July 1, 2002, through December 31, 2005, and who were enrolled in a population-based cohort study. Missing yearly income data were imputed using three different single imputation methods and one multiple imputation under a Bayesian approach. Of 4,162 participants, 3,121 were randomly assigned to a training set (to derive the yearly income imputation methods and develop the health-outcome prediction models) and 1,041 to a testing set (to compare the areas under the curve (AUC) of the receiver-operating characteristic of the resulting health-outcome prediction models). The discriminatory powers of the SRH prediction models were good (range, 69–72%) and compared to the prediction model obtained after no imputation of missing yearly income, all other imputation methods improved the prediction of SRH (P<0.05 for all comparisons) with the AUC for the model after multiple imputation being the highest (AUC = 0.731). Furthermore, given that yearly income was imputed using multiple imputation, the odds of SRH as good or better increased by 11% for each $5,000 increment in yearly income. This study showed that although imputation of missing data for a key predictor variable can improve a risk health-outcome prediction model, further work is needed to illuminate the risk factors associated with SRH.
doi:10.1007/s10903-010-9415-8
PMCID: PMC3205225
PMID: 21103931
Self-rated health; Missing income data; Data imputation techniques; Mean substitution; Multiple imputation; Minority health
Summary
Often a binary variable is generated by dichotomizing an underlying continuous variable measured at a specific time point according to a prespecified threshold value. In the event that the underlying continuous measurements are from a longitudinal study, one can use repeated measures model to impute missing data on responder status as a result of subject drop-out and apply logistic regression model on the observed or otherwise imputed responder status. Standard Bayesian multiple imputation techniques (Rubin, 1987, Multiple Imputation for Nonresponse in Surveys) which draw the parameters for the imputation model from the posterior distribution and construct the variance of parameter estimates for the analysis model as a combination of within- and between-imputation variances are found to be conservative. The frequentist multiple imputation approach which fixes the parameters for the imputation model at the maximum likelihood estimates and construct the variance of parameter estimates for the analysis model using the results of (Robins and Wang, 2000, Biometrika 87, 113–124) is shown to be more efficient. We propose to apply (Kenward and Roger, 1997, Biometrics 53, 983–997) degrees-of-freedom to account for the uncertainty associated with variance-covariance parameter estimates for the repeated measures model.
doi:10.1111/j.1541-0420.2010.01405.x
PMCID: PMC3245577
PMID: 20337628
Logistic regression; Missing data; Multiple imputation; Repeated measures
Missing data are common in medical and social science studies and often pose a serious challenge in data analysis. Multiple imputation methods are popular and natural tools for handling missing data, replacing each missing value with a set of plausible values that represent the uncertainty about the underlying values. We consider a case of missing at random (MAR) and investigate the estimation of the marginal mean of an outcome variable in the presence of missing values when a set of fully observed covariates is available. We propose a new nonparametric multiple imputation (MI) approach that uses two working models to achieve dimension reduction and define the imputing sets for the missing observations. Compared with existing nonparametric imputation procedures, our approach can better handle covariates of high dimension, and is doubly robust in the sense that the resulting estimator remains consistent if either of the working models is correctly specified. Compared with existing doubly robust methods, our nonparametric MI approach is more robust to the misspecification of both working models; it also avoids the use of inverse-weighting and hence is less sensitive to missing probabilities that are close to 1. We propose a sensitivity analysis for evaluating the validity of the working models, allowing investigators to choose the optimal weights so that the resulting estimator relies either completely or more heavily on the working model that is likely to be correctly specified and achieves improved efficiency. We investigate the asymptotic properties of the proposed estimator, and perform simulation studies to show that the proposed method compares favorably with some existing methods in finite samples. The proposed method is further illustrated using data from a colorectal adenoma study.
PMCID: PMC3280694
PMID: 22347786
Doubly robust; Missing at random; Multiple imputation; Nearest neighbor; Nonparametric imputation; Sensitivity analysis
Summary
Multiple imputation is a practically useful approach to handling incompletely observed data in statistical analysis. Parameter estimation and inference based on imputed full data have been made easy by Rubin's rule for result combination. However, creating proper imputation that accommodates flexible models for statistical analysis in practice can be very challenging. We propose an imputation framework that uses conditional semiparametric odds ratio models to impute the missing values. The proposed imputation framework is more flexible and robust than the imputation approach based on the normal model. It is a compatible framework in comparison to the approach based on fully conditionally specified models. The proposed algorithms for multiple imputation through the Monte Carlo Markov Chain sampling approach can be straightforwardly carried out. Simulation studies demonstrate that the proposed approach performs better than existing, commonly used imputation approaches. The proposed approach is applied to imputing missing values in bone fracture data.
doi:10.1111/j.1541-0420.2010.01538.x
PMCID: PMC3135790
PMID: 21210771
Acceptance-rejection sampling; Dirichlet process prior; Gibbs sampler; Hybrid MCMC; Molecular dynamics algorithm; Nonparametric Bayesian inference; Rejection control
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared.
doi:10.1002/bimj.201000140
PMCID: PMC3124925
PMID: 21259309
Cluster randomized; Missing Data; Multiple Imputation
Principled techniques for incomplete-data problems are increasingly part of mainstream statistical practice. Among many proposed techniques so far, inference by multiple imputation (MI) has emerged as one of the most popular. While many strategies leading to inference by MI are available in cross-sectional settings, the same richness does not exist in multilevel applications. The limited methods available for multilevel applications rely on the multivariate adaptations of mixed-effects models. This approach preserves the mean structure across clusters and incorporates distinct variance components into the imputation process. In this paper, I add to these methods by considering a random covariance structure and develop computational algorithms. The attraction of this new imputation modeling strategy is to correctly reflect the mean and variance structure of the joint distribution of the data, and allow the covariances differ across the clusters. Using Markov Chain Monte Carlo techniques, a predictive distribution of missing data given observed data is simulated leading to creation of multiple imputations. To circumvent the large sample size requirement to support independent covariance estimates for the level-1 error term, I consider distributional impositions mimicking random-effects distributions assigned a priori. These techniques are illustrated in an example exploring relationships between victimization and individual and contextual level factors that raise the risk of violent crime.
doi:10.1177/1471082X1001100404
PMCID: PMC3263314
PMID: 22271079
Missing data; multiple imputation; linear mixed-effects models; complex sample surveys; mixed effects; random covariances
SUMMARY
Biomedical research is plagued with problems of missing data, especially in clinical trials of medical and behavioral therapies adopting longitudinal design. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; then, each partially imputed data set is analyzed to deal with dropouts with or without further imputation. Depending on the choice of imputation model or measurement model, there exist various strategies that can be jointly applied to the same set of data to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.
doi:10.1002/sim.3111
PMCID: PMC3032542
PMID: 18205247
multiple partial imputation; selection model; pattern-mixture model; Markov transition model; nonignorable dropout; intermittent missing values
Background
Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty that allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection.
Method
In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables data were missing in the range of 0 and 48.1%. We used four methods to investigate the influence of respectively sampling and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels.
Results
We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined at the range of 0% (full model) to 90% of variable selection, bootstrap corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found.
Conclusion
We recommend to account for both imputation and sampling variation in sets of missing data. The new procedure of combining MI with bootstrapping for variable selection, results in multivariable prognostic models with good performance and is therefore attractive to apply on data sets with missing values.
doi:10.1186/1471-2288-7-33
PMCID: PMC1945032
PMID: 17629912
The authors attempted to catalog the use of procedures to impute missing data in the epidemiologic literature and to determine the degree to which imputed results differed in practice from unimputed results. The full text of articles published in 2005 and 2006 in four leading epidemiologic journals was searched for the text imput. Sixteen articles utilizing multiple imputation, inverse probability weighting, or the expectation-maximization algorithm to impute missing data were found. The small number of relevant manuscripts and diversity of detail provided precluded systematic analysis of the use of imputation procedures. To form a bridge between current and future practice, the authors suggest details that should be included in articles that utilize these procedures.
doi:10.1093/aje/kwn071
PMCID: PMC2561989
PMID: 18591202
expectation; imputation; missing data; probability weighting
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
Author Summary
Large association studies have proven to be effective tools for identifying parts of the genome that influence disease risk and other heritable traits. So-called “genotype imputation” methods form a cornerstone of modern association studies: by extrapolating genetic correlations from a densely characterized reference panel to a sparsely typed study sample, such methods can estimate unobserved genotypes with high accuracy, thereby increasing the chances of finding true associations. To date, most genome-wide imputation analyses have used reference data from the International HapMap Project. While this strategy has been successful, association studies in the near future will also have access to additional reference information, such as control sets genotyped on multiple SNP chips and dense genome-wide haplotypes from the 1,000 Genomes Project. These new reference panels should improve the quality and scope of imputation, but they also present new methodological challenges. We describe a genotype imputation method, IMPUTE version 2, that is designed to address these challenges in next-generation association studies. We show that our method can use a reference panel containing thousands of chromosomes to attain higher accuracy than is possible with the HapMap alone, and that our approach is more accurate than competing methods on both current and next-generation datasets. We also highlight the modeling issues that arise in imputation datasets.
doi:10.1371/journal.pgen.1000529
PMCID: PMC2689936
PMID: 19543373
This is a continuation of and a development of a debate between John Keown and me. The issue discussed is whether, in Britain, an unpaid system of blood donation promotes and is justified by its promotion of altruism. Doubt is cast on the notions that public policies can, and, if they can, that they should, be aimed at the promotion and expression of altruism rather than of self-interest, especially that of a mercenary sort. Reflections upon President Kennedy's proposition, introduced into the debate by Keown, that we should ask not what our country can do for us but what we can do for our country is pivotal to this casting of doubt. A case is made for suggesting that advocacy along the lines which Keown presents of an exclusive reliance on a voluntary, unpaid system of blood donation encourages inappropriate attitudes towards the provision of health care. Perhaps, it is suggested, and the suggestion represents, on my part, a change of mind as a consequence of the debate, a dual system of blood provision might be preferable.
PMCID: PMC479309
PMID: 10635510
About 6000 women in the United Kingdom develop ovarian cancer each year and about two-thirds of the women will die from the disease. Establishing the prognosis of a woman with ovarian cancer is an important part of her evaluation and treatment. Prognostic models and indices in ovarian cancer should be developed using large databases and, ideally, with complete information on both prognostic indicators and long-term outcome. We developed a prognostic model using Cox regression and multiple imputation from 1189 primary cases of epithelial ovarian cancer (with median follow-up of 4.6 years). We found that the significant (P≤ 0.05) prognostic factors for overall survival were age at diagnosis, FIGO stage, grade of tumour, histology (mixed mesodermal, clear cell and endometrioid versus serous papillary), the presence or absence of ascites, albumin, alkaline phosphatase, performance status on the ZUBROD-ECOG-WHO scale, and debulking of the tumour. This model is consistent with other models in the ovarian cancer literature; it has better predictive ability and, after simplification and validation, could be used in clinical practice. http://www.bjcancer.com © 2001 Cancer Research Campaignhttp://www.bjcancer.com
doi:10.1054/bjoc.2001.2030
PMCID: PMC2375096
PMID: 11592763
ovarian cancer; prognostic model; overall survival
Background
The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model.
Methods
Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied; a) complete case analysis (CC) b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM) and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained.
Results
CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness.
Conclusions
Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.
doi:10.1186/1471-2288-10-112
PMCID: PMC3019210
PMID: 21194416
The availability of extensively genotyped reference samples, such as “The HapMap” and 1,000 Genomes Project reference panels, together with advances in statistical methodology, have allowed for the imputation of genotypes at single nucleotide polymorphism (SNP) markers that are untyped in a cohort or case-control study. These imputation procedures facilitate the interpretation and meta-analyses of genome-wide association studies. A natural question when implementing these procedures concerns how best to take into account uncertainty in imputed genotypes. Here we compare the performance of the following three strategies: least-squares regression on the “best-guess” imputed genotype; regression on the expected genotype score or “dosage”; and mixture regression models that more fully incorporate posterior probabilities of genotypes at untyped SNPs. Using simulation, we considered a range of sample sizes, minor allele frequencies, and imputation accuracies to compare the performance of the different methods under various genetic models. The mixture models performed the best in the setting of a large genetic effect and low imputation accuracies. However, for most realistic settings, we find that regressing the phenotype on the estimated allelic or genotypic dosage provides an attractive compromise between accuracy and computational tractability.
doi:10.1002/gepi.20552
PMCID: PMC3143715
PMID: 21254217
GWAS; genotype imputation; mixture models
Methods to handle missing data have been an area of statistical research for many years. Little has been done within the context of pedigree analysis. In this paper we present two methods for imputing missing data for polygenic models using family data. The imputation schemes take into account familial relationships and use the observed familial information for the imputation. A traditional multiple imputation approach and multiple imputation or data augmentation approach within a Gibbs sampler for the handling of missing data for a polygenic model are presented.
We used both the Genetic Analysis Workshop 13 simulated missing phenotype and the complete phenotype data sets as the means to illustrate the two methods. We looked at the phenotypic trait systolic blood pressure and the covariate gender at time point 11 (1970) for Cohort 1 and time point 1 (1971) for Cohort 2. Comparing the results for three replicates of complete and missing data incorporating multiple imputation, we find that multiple imputation via a Gibbs sampler produces more accurate results. Thus, we recommend the Gibbs sampler for imputation purposes because of the ease with which it can be extended to more complicated models, the consistency of the results, and the accountability of the variation due to imputation.
doi:10.1186/1471-2156-4-S1-S42
PMCID: PMC1866478
PMID: 14975110
Background
The weighted estimators generally used for analyzing case-cohort studies are not fully efficient and naive estimates of the predictive ability of a model from case-cohort data depend on the subcohort size. However, case-cohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI).
Methods
We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in case-cohort surveys. As an illustration, we analyzed a case-cohort survey from the Three-City study to estimate the predictive ability of D-dimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks.
Results
When the imputation model of the phase-2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the Three-City case-cohort study, elevated D-dimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.63-1.74). However, D-dimer levels did not improve the predictive ability of the model.
Conclusions
MI is a simple approach for analyzing case-cohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.
doi:10.1186/1471-2288-12-24
PMCID: PMC3378447
PMID: 22405090
Summary
Two approaches commonly used to deal with missing data are multiple
imputation (MI) and inverse-probability weighting (IPW). IPW is also used to
adjust for unequal sampling fractions. MI is generally more efficient than
IPW but more complex. Whereas IPW requires only a model for the probability
that an individual has complete data (a univariate outcome), MI needs a
model for the joint distribution of the missing data (a multivariate
outcome) given the observed data. Inadequacies in either model may lead to
important bias if large amounts of data are missing. A third approach
combines MI and IPW to give a doubly robust estimator. A fourth approach
(IPW/MI) combines MI and IPW but, unlike doubly robust methods, imputes only
isolated missing values and uses weights to account for remaining larger
blocks of unimputed missing data, such as would arise, e.g., in a cohort
study subject to sample attrition, and/or unequal sampling fractions. In
this article, we examine the performance, in terms of bias and efficiency,
of IPW/MI relative to MI and IPW alone and investigate whether the
Rubin’s rules variance estimator is valid for IPW/MI. We prove that
the Rubin’s rules variance estimator is valid for IPW/MI for linear
regression with an imputed outcome, we present simulations supporting the
use of this variance estimator in more general settings, and we demonstrate
that IPW/MI can have advantages over alternatives. IPW/MI is applied to data
from the National Child Development Study.
doi:10.1111/j.1541-0420.2011.01666.x
PMCID: PMC3412287
PMID: 22050039
Marginal model; Missing at random; Survey weighting; 1958 British Birth Cohort
The temporal context model posits that search through episodic memory is driven by associations between the multiattribute representations of items and context. Context, in turn, is a recency weighted sum of previous experiences or memories. Because recently processed items are most similar to the current representation of context, M. Usher, E. J. Davelaar, H. J. Haarmann, and Y. Goshen-Gottstein (2008) have suggested that the temporal context model (TCM-A) embodies a distinction between short-term and long-term memory and that this distinction is central to TCM-A’s success in accounting for the pattern of recency and contiguity observed across short and long timescales. The authors dispute Usher et al’s claim that context in TCM-A has much in common with classic interpretations of short-term memory. The idea that multiple representations interact in the process of memory encoding and retrieval (across timescales), as proposed in TCM-A, is very different from the classic dual-process view that distinct rules govern retrieval of recent and remote memories.
doi:10.1037/a0013724
PMCID: PMC2957902
PMID: 20976034
episodic memory; context; free recall; short-term memory; working memory