Although high-throughput genotyping arrays have made whole-genome association studies (WGAS) feasible, only a small proportion of SNPs in the human genome are actually surveyed in such studies. In addition, various SNP arrays assay different sets of SNPs, which leads to challenges in comparing results and merging data for meta-analyses. Genome-wide imputation of untyped markers allows us to address these issues in a direct fashion.
A total of 384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y) arrays, from which we also derived the subset of genotypes present on the Ilmn317K array. On these data, we compared two imputation methods: MACH and BEAGLE. We imputed 2.5 million HapMap Release 22 SNPs, and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis). In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500K array plus a custom 164K fill-in chip. We then imputed the HapMap SNPs and quantified accuracy by randomly masking observed SNPs.
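The masking-based accuracy check can be sketched as follows. This is a hypothetical Python illustration only; the 0/1/2 genotype coding, the -1 missing code, and the function names are our assumptions, not the study's actual pipeline:

```python
import numpy as np

def mask_genotypes(genos, frac=0.05, rng=None):
    """Randomly hide a fraction of observed genotype calls (coded 0/1/2).
    Returns the masked matrix (hidden entries set to -1) and the
    (row, col) indices of the hidden cells."""
    rng = np.random.default_rng(rng)
    observed = np.argwhere(genos >= 0)
    n_hide = int(frac * len(observed))
    hidden = observed[rng.choice(len(observed), size=n_hide, replace=False)]
    masked = genos.copy()
    masked[hidden[:, 0], hidden[:, 1]] = -1
    return masked, hidden

def concordance(truth, imputed, hidden):
    """Per-genotype accuracy computed over the masked cells only."""
    t = truth[hidden[:, 0], hidden[:, 1]]
    p = imputed[hidden[:, 0], hidden[:, 1]]
    return float(np.mean(t == p))
```

After imputing the masked matrix with MACH or BEAGLE, `concordance` against the withheld true calls gives the accuracy estimate.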
MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y array yields excellent imputation performance, outperforming both the Affx500K and Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers) was less successful. Imputation was more challenging in the African American population, given (1) shorter LD blocks and (2) admixture with Caucasian populations. To address issue (2), we pooled HapMap CEU and YRI data as the imputation reference set, which greatly improved overall performance. The approximately 40,000 phenotypes scored in these populations provide a path to determine empirically how the power to detect associations is affected by the imputation procedures: at a fixed false discovery rate, the number of cis-eQTL discoveries detected by the various methods can be interpreted as their relative statistical power in GWAS. In this study, we find that imputation offers modest additional power (about 4%) on top of either Ilmn317K or Ilmn650Y, much less than the power gain from Ilmn317K to Ilmn650Y (13%).
Current algorithms can accurately impute genotypes for untyped markers, enabling researchers to pool data across studies conducted with different SNP sets. While genotyping itself carries a small error rate (e.g. 0.5%), imputed genotypes are surprisingly accurate. We found that dense marker sets (e.g. Ilmn650Y) outperform sparser ones (e.g. Ilmn317K) in both imputation yield and accuracy. Genotypes were also harder to impute for African American samples, partly because of population admixture, although a pooled reference boosts performance. Interestingly, GWAS carried out with imputed genotypes only slightly increased power over assayed SNPs alone, likely because adding markers via imputation yields only a modest gain in genetic coverage while worsening the multiple-testing penalty. Furthermore, cis-eQTL mapping with the dense imputed SNP set achieves finer resolution, locating association peaks closer to causal variants than the conventional approach.
The objective of the CARRECT software is to make cutting edge statistical methods for reducing bias in epidemiological studies easy to use and useful for both novice and expert users.
Analyses produced by epidemiologists and public health practitioners are susceptible to bias from a number of sources including missing data, confounding variables, and statistical model selection. It often requires a great deal of expertise to understand and apply the multitude of tests, corrections, and selection rules, and these tasks can be time-consuming and burdensome. To address this challenge, Aptima began development of CARRECT, the Collaborative Automation Reliably Remediating Erroneous Conclusion Threats system. When complete, CARRECT will provide an expert system that can be embedded in an analyst’s workflow. CARRECT will support statistical bias reduction and improved analyses and decision making by engaging the user in a collaborative process in which the technology is transparent to the analyst.
Older approaches to imputing missing data, including mean imputation and single imputation regression methods, have steadily given way to a class of methods known as “multiple imputation” (hereafter “MI”; Rubin 1987). Rather than making the restrictive assumption that the data are missing completely at random (MCAR), MI typically assumes the data are missing at random (MAR).
There are two key innovations behind MI. First, the observed values can be useful in predicting the missing cells, and thus specifying a joint distribution of the data is the first step in implementing the models. Second, single imputation methods will likely fail not only because of the inherent uncertainty in the missing values but also because of the estimation uncertainty associated with generating the parameters in the imputation procedure itself. By contrast, drawing the missing values multiple times, thereby generating m complete datasets along with the estimated parameters of the model, properly accounts for both types of uncertainty (Rubin 1987; King et al. 2001). As a result, MI will lead to valid standard errors and confidence intervals along with unbiased point estimates.
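The combining step over the m completed datasets follows Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance adds the between-imputation spread to the average within-imputation variance. A minimal sketch (variable names are ours):

```python
import statistics

def rubin_combine(estimates, variances):
    """Pool m completed-data estimates per Rubin (1987):
    pooled estimate = mean of the m estimates;
    total variance = within-variance + (1 + 1/m) * between-variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m              # pooled point estimate
    w = sum(variances) / m                  # average within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance
    t = w + (1 + 1 / m) * b                 # total variance
    return q_bar, t
```

The square root of the total variance gives the standard error that reflects both sources of uncertainty.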
In order to compute the joint distribution, CARRECT uses a bootstrapping-based algorithm that gives essentially the same answers as the standard Bayesian Markov chain Monte Carlo (MCMC) or expectation maximization (EM) approaches, is usually considerably faster, and can handle many more variables.
Tests were conducted on one of the proposed methods with an epidemiological dataset from the Integrated Health Interview Series (IHIS), producing verifiably unbiased results despite high missingness rates. In addition, mockups (Figure 1) were created of an intuitive data wizard that guides the user through the analysis process by analyzing key features of a given dataset. The mockups also show prompts for the user to provide additional substantive knowledge to improve the handling of imperfect datasets, as well as the selection of the most appropriate algorithms and models.
Our approach and program were designed to make bias mitigation accessible to a much broader audience than the statistical elite alone. We hope it will have a wide impact on reducing bias in epidemiological studies and provide more accurate information to policymakers.
Bias reduction; Missing data; Statistical model selection
Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate for missing data from CRTs, since they assume independent observations. In this paper, under the assumptions of missing completely at random and covariate-dependent missingness, we used a simulation study to compare six MI strategies that account for the intra-cluster correlation of missing binary outcomes in CRTs against standard imputation strategies and the complete case analysis approach.
We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies, which apply standard MI within each cluster, are the logistic regression method, the propensity score method, and the Markov chain Monte Carlo (MCMC) method. The three across-cluster MI strategies are the propensity score method, the random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies.
The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the CHAT complete dataset are 1.14 (0.76, 1.70). When 30% of binary outcomes are missing completely at random, the simulation study shows that the estimated treatment effects and corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) with complete case analysis, 1.12 (0.72, 1.73) with the within-cluster MCMC method, 1.21 (0.80, 1.81) with across-cluster RE logistic regression, and 1.16 (0.82, 1.64) with standard logistic regression, which does not account for clustering.
When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem more appropriate for handling missing outcomes from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.
Methods to handle missing data have been an area of statistical research for many years, but little has been done within the context of pedigree analysis. In this paper we present two methods for imputing missing data for polygenic models using family data. The imputation schemes take familial relationships into account and use the observed familial information for the imputation. We present both a traditional multiple imputation approach and a multiple imputation (data augmentation) approach within a Gibbs sampler for handling missing data in a polygenic model.
We used both the Genetic Analysis Workshop 13 simulated missing phenotype and the complete phenotype data sets as the means to illustrate the two methods. We looked at the phenotypic trait systolic blood pressure and the covariate gender at time point 11 (1970) for Cohort 1 and time point 1 (1971) for Cohort 2. Comparing the results for three replicates of complete and missing data incorporating multiple imputation, we find that multiple imputation via a Gibbs sampler produces more accurate results. Thus, we recommend the Gibbs sampler for imputation purposes because of the ease with which it can be extended to more complicated models, the consistency of its results, and its proper accounting for the variation due to imputation.
The Center for Epidemiologic Studies - Depression scale (CES-D) is a validated tool commonly used to screen depressive symptoms. As with any self-administered questionnaire, missing data are frequently observed and can strongly bias any inference. The objective of this study was to investigate the best approach for handling missing data in the CES-D scale.
Among the 71,412 women from the French E3N prospective cohort (Etude Epidémiologique auprès des femmes de la Mutuelle Générale de l’Education Nationale) who returned the questionnaire comprising the CES-D scale in 2005, 45% had missing values in the scale. The reasons for failure to complete certain items were investigated by semi-directive interviews on a random sample of 204 participants. The prevalence of high depressive symptoms (score ≥16, hDS) was estimated after applying various methods for ignorable missing data including multiple imputation using imputation models with CES-D items with or without covariates. The accuracy of imputation models was investigated. Various scenarios of nonignorable missing data mechanisms were investigated by a sensitivity analysis based on the mixture modelling approach.
The interviews showed that participants were not reluctant to answer the CES-D scale. Possible reasons for nonresponse were identified. The prevalence of hDS among complete responders was 26.1%. After multiple imputation, the prevalence was 28.6%, 29.8% and 31.7% for women presenting up to 4, 10 and 20 missing values, respectively. The estimates were robust to the various imputation models investigated and to the scenarios of nonignorable missing data.
The CES-D scale can easily be used in large cohorts even in the presence of missing data. Based on the results of both a qualitative study and a sensitivity analysis under various missing data mechanism scenarios in a population of women, the missing data mechanism does not appear to be nonignorable, and estimates are robust to departures from ignorability. Multiple imputation is recommended to reliably handle missing data in the CES-D scale.
CES-D; Cohort; Missing data; Multiple imputation; Non ignorable; Sensitivity analysis
Epistatic miniarray profiling (E-MAP) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of analysis possible, as well as the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data.
We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions. Therefore our method potentially expands the number of mapped epistatic interactions. In addition we make implementations of our algorithms available for use by other researchers.
We address the problem of missing value imputation for E-MAPs, and suggest the use of symmetric nearest neighbor based approaches as they offer consistently accurate imputations across multiple datasets in a tractable manner.
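A symmetric nearest-neighbor imputation of the kind recommended here might look like the following. This is our own simplified sketch (plain mean-squared row distance over shared observed entries, small fixed k), not the authors' implementation:

```python
import numpy as np

def knn_impute_symmetric(m, k=2):
    """Fill NaNs in a symmetric E-MAP-style score matrix: for each missing
    (i, j), average the column-j scores of the k rows most similar to row i
    (similarity = mean squared difference over entries both rows observe),
    and mirror the value to keep the matrix symmetric."""
    x = m.copy()
    n = x.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if not np.isnan(x[i, j]):
                continue
            cands = []
            for r in range(n):
                # skip the target genes themselves and rows missing column j
                if r in (i, j) or np.isnan(m[r, j]):
                    continue
                shared = ~np.isnan(m[i]) & ~np.isnan(m[r])
                if shared.any():
                    d = np.mean((m[i, shared] - m[r, shared]) ** 2)
                    cands.append((d, r))
            if cands:
                cands.sort()
                neighbors = [r for _, r in cands[:k]]
                x[i, j] = x[j, i] = np.mean([m[r, j] for r in neighbors])
    return x
```

Both halves of each imputed pair are written at once, so the output stays symmetric by construction.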
The aim of this review was to establish the frequency with which trials take missingness into account, and to discover what methods trialists use for adjustment in randomised controlled trials with longitudinal measurements. Failing to address the problems that can arise from missing outcome data can result in misleading conclusions. Missing data should be addressed as a sensitivity analysis of the complete case analysis results. One hundred publications of randomised controlled trials with longitudinal measurements were selected randomly from trial publications from the years 2005 to 2012. Information was extracted from these trials, including whether reasons for dropout were reported, what methods were used for handling the missing data, whether there was any explanation of the methods for missing data handling, and whether a statistician was involved in the analysis. The main focus of the review was on missing data post dropout rather than missing interim data. Of all the papers in the study, 9 (9%) had no missing data. More than half of the papers failed to make any attempt to explain the reasons for their choice of missing data handling method. Of the papers with clear missing data handling methods, 44 (50%) used adequate methods, whereas 30 (34%) used methods that may not have been appropriate. In the remaining 17 papers (19%), it was difficult to assess the validity of the methods used. An imputation method was used in 18 papers (20%). Multiple imputation methods were introduced in 1987 and are an efficient way of accounting for missing data in general, yet only 4 papers used them. Of the 18 papers that used imputation, only 7 displayed the results as a sensitivity analysis of the complete case analysis results, and 61% explained the reasons for their chosen method.
Just under a third of the papers made no reference to reasons for missing outcome data. There was little consistency in reporting of missing data within longitudinal trials.
Review; Missing data; Handling; Longitudinal; Repeated measures
Environmental epidemiology, when focused on the life course of exposure to a specific pollutant, requires historical exposure estimates that are difficult to obtain for the full time period due to gaps in the historical record, especially in earlier years. We show that these gaps can be filled by applying multiple imputation methods to a formal risk equation that incorporates lifetime exposure. We also address challenges that arise, including choice of imputation method, potential bias in regression coefficients, and uncertainty in age-at-exposure sensitivities.
During time periods when parameters needed in the risk equation are missing for an individual, the parameters are filled by an imputation model using group level information or interpolation. A random component is added to match the variance found in the estimates for study subjects not needing imputation. The process is repeated to obtain multiple data sets, whose regressions against health data can be combined statistically to develop confidence limits using Rubin’s rules to account for the uncertainty introduced by the imputations. To test for possible recall bias between cases and controls, which can occur when historical residence location is obtained by interview, and which can lead to misclassification of imputed exposure by disease status, we introduce an “incompleteness index,” equal to the percentage of dose imputed (PDI) for a subject. “Effective doses” can be computed using different functional dependencies of relative risk on age of exposure, allowing intercomparison of different risk models. To illustrate our approach, we quantify lifetime exposure (dose) from traffic air pollution in an established case–control study on Long Island, New York, where considerable in-migration occurred over a period of many decades.
The major result is the described approach to imputation. The illustrative example revealed potential recall bias, suggesting that regressions against health data should be done as a function of PDI to check for consistency of results. The 1% of study subjects who lived for long durations near heavily trafficked intersections had very high cumulative exposures. Thus, imputation methods must be designed to reproduce non-standard distributions.
Our approach meets a number of methodological challenges to extending historical exposure reconstruction over a lifetime and shows promise for environmental epidemiology. Application to assessment of breast cancer risks will be reported in a subsequent manuscript.
Exposure; Air pollution; Traffic; Benzo(a)pyrene; PAH; Multiple imputation; Epidemiology; In-migration; Dose
Whole brain fMRI analyses rarely include the entire brain because of missing data that result from data acquisition limits and susceptibility artifact, in particular. This missing data problem is typically addressed by omitting voxels from analysis, which may exclude brain regions that are of theoretical interest and increase the potential for Type II error at cortical boundaries or Type I error when spatial thresholds are used to establish significance. Imputation could significantly expand statistical map coverage, increase power, and enhance interpretations of fMRI results. We examined multiple imputation for group level analyses of missing fMRI data using methods that leverage the spatial information in fMRI datasets for both real and simulated data. Available case analysis, neighbor replacement, and regression based imputation approaches were compared in a general linear model framework to determine the extent to which these methods quantitatively (effect size) and qualitatively (spatial coverage) increased the sensitivity of group analyses. In both real and simulated data analysis, multiple imputation provided 1) variance that was most similar to estimates for voxels with no missing data, 2) fewer false positive errors in comparison to mean replacement, and 3) fewer false negative errors in comparison to available case analysis. Compared to the standard analysis approach of omitting voxels with missing data, imputation methods increased brain coverage in this study by 35% (from 33,323 to 45,071 voxels). In addition, multiple imputation increased the size of significant clusters by 58% and number of significant clusters across statistical thresholds, compared to the standard voxel omission approach. While neighbor replacement produced similar results, we recommend multiple imputation because it uses an informed sampling distribution to deal with missing data across subjects that can include neighbor values and other predictors. 
Multiple imputation is anticipated to be particularly useful for 1) large fMRI data sets with inconsistent missing voxels across subjects and 2) addressing the problem of increased artifact at ultra-high field strengths, which significantly limits the extent of whole brain coverage and the interpretation of results.
missing data; fMRI; group analysis; multiple imputation; replacement; neuroimaging methods
Attrition in longitudinal studies can lead to biased results. This study is motivated by the unexpected observation that alcohol consumption decreased despite increased availability, which may be due to sample attrition of heavy drinkers. Several imputation methods have been proposed, but rarely compared in longitudinal studies of alcohol consumption. Imputing consumption-level measurements is computationally challenging because alcohol consumption is a semi-continuous variable (dichotomous drinking status and continuous volume among drinkers) and the data in the continuous part are non-normal. Data come from a longitudinal study in Denmark with four waves (2003-2006) and 1771 individuals at baseline. Five techniques for missing data are compared: last value carried forward (LVCF) was used as a single imputation method; Hotdeck, Heckman modelling, multivariate imputation by chained equations (MICE), and a Bayesian approach were used as multiple imputation methods. Predictive mean matching was used to account for non-normality: instead of imputing regression estimates, "real" observed values from similar cases are imputed. The methods were also compared on a simulated dataset. The simulation showed that the Bayesian approach yielded the least biased estimates. The finding of no increase in consumption levels despite higher availability remained unaltered.
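Predictive mean matching, as used above, can be sketched in a few lines. This is a hypothetical illustration with a simple least-squares working model; the function name and donor-pool size are our choices:

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """Predictive mean matching: regress y on X among complete cases,
    then for each missing y draw an observed 'donor' value from the k
    cases whose predicted means are closest to the target's prediction.
    Imputed values are always real observed values, preserving the
    distribution of the continuous part."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    yhat = X @ beta
    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        d = np.abs(yhat[obs] - yhat[i])       # distance between predictions
        donors = np.argsort(d)[:k]            # k nearest observed cases
        y_imp[i] = y[obs][rng.choice(donors)] # copy a real observed value
    return y_imp
```

Because donors are drawn from observed cases, the imputed values cannot fall outside the range of the data, which is what makes the approach attractive for skewed consumption volumes.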
panel surveys; missing data; multiple imputation; Bayesian models; alcohol consumption
Conventional multiple-trait quantitative trait locus (QTL) mapping methods must discard cases (individuals) with incomplete phenotypic data, thereby sacrificing other phenotypic and genotypic information contained in the discarded cases. Under standard assumptions about the missing-data mechanism, it is possible to exploit these cases.
We present an expectation-maximization (EM) algorithm, derived for recombinant inbred and F2 genetic models but extensible to any mating design, that supports conventional hypothesis tests for QTL main effect, pleiotropy, and QTL-by-environment interaction in multiple-trait analyses with missing phenotypic data. We evaluate its performance by simulations and illustrate with a real-data example.
The EM method affords improved QTL detection power and precision of QTL location and effect estimation in comparison with case deletion or imputation methods. It may be incorporated into any least-squares or likelihood-maximization QTL-mapping approach.
Multiple imputation (MI) provides an effective approach to handle missing covariate data within prognostic modelling studies, as it can properly account for the missing data uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling techniques to obtain the estimates of interest. The estimates from each imputed dataset are then combined into one overall estimate and variance, incorporating both the within and between imputation variability. Rubin's rules for combining these multiply imputed estimates are based on asymptotic theory. The resulting combined estimates may be more accurate if the posterior distribution of the population parameter of interest is better approximated by the normal distribution. However, the normality assumption may not be appropriate for all the parameters of interest when analysing prognostic modelling studies, such as predicted survival probabilities and model performance measures.
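For a bounded quantity such as a predicted survival probability, one commonly suggested remedy is to pool on a transformed scale where normality is more plausible, then back-transform. A hypothetical sketch using the complementary log-log transform (the function name is ours):

```python
import math

def pool_survival_probs(probs):
    """Pool multiply imputed survival probabilities S in (0, 1) by
    averaging on the complementary log-log scale, log(-log(S)),
    where the normal approximation behind Rubin's rules is more
    plausible, then back-transforming the pooled value."""
    z = [math.log(-math.log(p)) for p in probs]
    z_bar = sum(z) / len(z)
    return math.exp(-math.exp(z_bar))
```

The between- and within-imputation variances would likewise be combined on the transformed scale before back-transforming the confidence limits.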
Guidelines for combining the estimates of interest when analysing prognostic modelling studies are provided. A literature review is performed to identify current practice for combining such estimates in prognostic modelling studies.
Methods for combining all reported estimates after MI were not well reported in the current literature. Rubin's rules without applying any transformations were the standard approach used, when any method was stated.
The proposed simple guidelines for combining estimates after MI may lead to a wider and more appropriate use of MI in future prognostic modelling studies.
Missing data are common in medical and social science studies and often pose a serious challenge in data analysis. Multiple imputation methods are popular and natural tools for handling missing data, replacing each missing value with a set of plausible values that represent the uncertainty about the underlying values. We consider a case of missing at random (MAR) and investigate the estimation of the marginal mean of an outcome variable in the presence of missing values when a set of fully observed covariates is available. We propose a new nonparametric multiple imputation (MI) approach that uses two working models to achieve dimension reduction and define the imputing sets for the missing observations. Compared with existing nonparametric imputation procedures, our approach can better handle covariates of high dimension, and is doubly robust in the sense that the resulting estimator remains consistent if either of the working models is correctly specified. Compared with existing doubly robust methods, our nonparametric MI approach is more robust to the misspecification of both working models; it also avoids the use of inverse-weighting and hence is less sensitive to missing probabilities that are close to 1. We propose a sensitivity analysis for evaluating the validity of the working models, allowing investigators to choose the optimal weights so that the resulting estimator relies either completely or more heavily on the working model that is likely to be correctly specified and achieves improved efficiency. We investigate the asymptotic properties of the proposed estimator, and perform simulation studies to show that the proposed method compares favorably with some existing methods in finite samples. The proposed method is further illustrated using data from a colorectal adenoma study.
Doubly robust; Missing at random; Multiple imputation; Nearest neighbor; Nonparametric imputation; Sensitivity analysis
Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS).
1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20-question scale that respondents complete by circling a value from 1 to 4 for each question. The sum of the responses is calculated, and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Additionally, missing at random and missing not at random simulations were completed. Six imputation methods were then considered: 1) multiple imputation, 2) single regression, 3) individual mean, 4) overall mean, 5) participant's preceding response, and 6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated.
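Method 3 (individual mean) is simple to state precisely: each missing item is replaced by the mean of that respondent's answered items before summing. A minimal sketch, with the respondents-by-items array layout as our assumption:

```python
import numpy as np

def individual_mean_impute(responses):
    """Given a respondents-by-items array with NaN for skipped items,
    replace each respondent's missing items with the mean of that
    respondent's answered items, then return the total scale scores."""
    filled = responses.copy()
    row_means = np.nanmean(filled, axis=1)   # each person's mean answered value
    idx = np.where(np.isnan(filled))
    filled[idx] = row_means[idx[0]]          # fill gaps with own-row mean
    return filled.sum(axis=1)
```

Its appeal in this setting is exactly the interpretability noted in the conclusions: the imputed total is the respondent's average response projected over the full 20 items.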
When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80 indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range (0.76 and 0.74 respectively).
Multiple imputation is the most accurate method for dealing with missing data in most of the missing data scenarios we assessed for the SDS. Imputing the individual's mean is also an appropriate and simple method for dealing with missing data that may be more interpretable to the majority of medical readers. Researchers should consider conducting methodological assessments such as this one when confronted with missing data. The optimal method should balance validity, ease of interpretability for readers, and analysis expertise of the research team.
Retaining participants in cohort studies with multiple follow-up waves is difficult. Commonly, researchers are faced with the problem of missing data, which may introduce biased results as well as a loss of statistical power and precision. The STROBE guidelines (von Elm et al., Lancet 370:1453-1457, 2007; Vandenbroucke et al., PLoS Med 4:e297, 2007) and the guidelines proposed by Sterne et al. (BMJ 338:b2393, 2009) recommend that cohort studies report on the amount of missing data, the reasons for non-participation and non-response, and the method used to handle missing data in the analyses. We have conducted a review of publications from cohort studies in order to document the reporting of missing data for exposure measures and to describe the statistical methods used to account for the missing data.
A systematic search of English language papers published from January 2000 to December 2009 was carried out in PubMed. Prospective cohort studies with a sample size greater than 1,000 that analysed data using repeated measures of exposure were included.
Among the 82 papers meeting the inclusion criteria, only 35 (43%) reported the amount of missing data according to the suggested guidelines. Sixty-eight papers (83%) described how they dealt with missing data in the analysis. Most of the papers excluded participants with missing data and performed a complete-case analysis (n = 54, 66%). Other papers used more sophisticated methods including multiple imputation (n = 5) or fully Bayesian modeling (n = 1). Methods known to produce biased results were also used, for example, Last Observation Carried Forward (n = 7), the missing indicator method (n = 1), and mean value substitution (n = 3). For the remaining 14 papers, the method used to handle missing data in the analysis was not stated.
This review highlights the inconsistent reporting of missing data in cohort studies and the continuing use of inappropriate methods to handle missing data in the analysis. Epidemiological journals should invoke the STROBE guidelines as a framework for authors so that the amount of missing data and how this was accounted for in the analysis is transparent in the reporting of cohort studies.
Longitudinal cohort studies; Missing exposure data; Repeated exposure measurement; Missing data methods; Reporting
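The simpler strategies catalogued in this review can be sketched on a toy longitudinal dataset. The data frame, wave names, and values below are invented for illustration; multiple imputation itself requires a dedicated library and is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical exposure measurements over three follow-up waves; NaN = non-response.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "wave1": [2.0, 3.0, 4.0, 5.0],
    "wave2": [2.5, np.nan, 4.5, np.nan],
    "wave3": [3.0, np.nan, np.nan, 6.0],
})

# Complete-case analysis: keep only participants observed at every wave.
complete = df.dropna()

# Last Observation Carried Forward: fill each participant's row forward across waves.
locf = df.set_index("id").ffill(axis=1)

# Mean value substitution: replace missing values with the wave (column) mean.
mean_sub = df.fillna(df.mean(numeric_only=True))
```

As the review notes, the last two approaches are known to produce biased results in many settings; the sketch only shows their mechanics.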
Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.
Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly-skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses.
For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate’s effect when skewness was not adequately addressed. For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure.
Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. Further work is needed to provide clearer guidelines on effective application of this method.
Missing data; Multiple imputation; Fully conditional specification; Multivariate normal imputation; Non-normal data
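The transformation step described above can be illustrated with a minimal sketch: impute a skewed covariate on the log scale under a normality assumption, then back-transform. This is a single stochastic imputation on hypothetical simulated data, not the MVNI procedure itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Highly skewed (log-normal) covariate with ~30% of values set missing at random.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
miss = rng.random(1000) < 0.3
x_obs = np.where(miss, np.nan, x)

# Simple log transformation before normal-based imputation: impute on the
# (approximately normal) log scale, then back-transform to the original scale.
z = np.log(x_obs)                       # NaNs propagate through the transform
mu, sd = np.nanmean(z), np.nanstd(z)    # parameters estimated from observed values
z_imp = np.where(np.isnan(z), rng.normal(mu, sd, size=z.size), z)
x_imp = np.exp(z_imp)
```

Skipping the transformation and imputing on the raw skewed scale is exactly the failure mode the abstract warns about: draws from a normal model can badly misrepresent the tail.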
Gene-gene interaction is believed to play an important role in understanding complex traits. Multifactor dimensionality reduction (MDR) was proposed by Ritchie et al. to identify multiple loci that simultaneously affect disease susceptibility. Although the MDR method has been widely used to detect gene-gene interactions, few studies have been reported on MDR analysis when there are missing data. Currently, four approaches are available in MDR analysis to handle missing data. The first approach uses only complete observations that have no missing data, which can cause a severe loss of data. The second approach treats missing values as an additional genotype category, but interpretation of the results may then be unclear and the conclusions may be misleading; furthermore, it performs poorly when the missing rates are unbalanced between the case and control groups. The third approach is a simple imputation method that imputes missing genotypes as the most frequent genotype, which may also produce biased results. The fourth approach, Available, uses all data available for the given loci to increase power. In any real data analysis, it is not clear which MDR approach one should use when there are missing data. In this paper, we consider a new EM Impute approach to handle missing data more appropriately. Through simulation studies, we compared the performance of the proposed EM Impute approach with the current approaches. Our results showed that the Available and EM Impute approaches perform better than the other three current approaches in terms of power and precision.
Gene-gene interaction; Multifactor Dimensionality Reduction; Missing genotypes; Association study
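The third approach above (most-frequent-genotype imputation) is easy to sketch, alongside a frequency-weighted probabilistic fill-in. The data and coding below are invented, and the probabilistic step is only a crude stand-in for the paper's EM Impute, which it does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(1)

# Genotypes at one locus coded 0/1/2; -1 marks a missing call (toy data).
g = np.array([0, 1, 1, 2, -1, 1, 0, -1, 2, 1])
observed = g[g >= 0]

# Simple imputation: fill every missing call with the most frequent genotype.
counts = np.bincount(observed, minlength=3)
most_frequent = int(counts.argmax())
g_simple = np.where(g < 0, most_frequent, g)

# Frequency-weighted imputation: sample each missing genotype from the observed
# genotype distribution (a rough analogue of an EM-style probabilistic fill-in).
probs = counts / counts.sum()
g_prob = g.copy()
g_prob[g_prob < 0] = rng.choice(3, size=int((g < 0).sum()), p=probs)
```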
Tissue micro-arrays (TMAs) are increasingly used to generate data of the molecular phenotype of tumours in clinical epidemiology studies, such as studies of disease prognosis. However, TMA data are particularly prone to missingness. A variety of methods to deal with missing data are available. However, the validity of the various approaches is dependent on the structure of the missing data and there are few empirical studies dealing with missing data from molecular pathology. The purpose of this study was to investigate the results of four commonly used approaches to handling missing data from a large, multi-centre study of the molecular pathological determinants of prognosis in breast cancer.
Patients and methods:
We pooled data from over 11 000 cases of invasive breast cancer from five studies that collected information on seven prognostic indicators together with survival time data. We compared the results of a multi-variate Cox regression using four approaches to handling missing data: complete case analysis (CCA), mean substitution (MS), multiple imputation without inclusion of the outcome (MI−), and multiple imputation with inclusion of the outcome (MI+). We also performed an analysis in which missing data were simulated under different assumptions and the results of the four methods were compared.
Over half the cases had missing data on at least one of the seven variables, and 11% had missing data on four or more. The multi-variate hazard ratio estimates based on multiple imputation models were very similar to those derived after using MS, with similar standard errors. Hazard ratio estimates based on the CCA were only slightly different, but they were less precise as the standard errors were large. However, in data simulated to be missing completely at random (MCAR) or missing at random (MAR), estimates for MI+ were least biased and most accurate, whereas estimates for CCA were most biased and least accurate.
In this study, empirical results from analyses using CCA, MS, MI− and MI+ were similar, although results from CCA were less precise. The results from simulations suggest that in general MI+ is likely to be the best. Given the ease of implementing MI in standard statistical software, the results of MI+ and CCA should be compared in any multi-variate analysis where missing data are a problem.
missing data; multiple imputation; complete case analysis; missing covariates; tissue micro-arrays
In randomised trials of medical interventions, the most reliable analysis follows the intention-to-treat (ITT) principle. However, an ITT analysis requires that missing outcome data be imputed. Different imputation techniques may give different results, and some may lead to bias. In anti-obesity drug trials, many data are usually missing, and the most commonly used imputation method is last observation carried forward (LOCF). LOCF is generally considered conservative, but more reliable methods such as multiple imputation (MI) exist.
To compare four different methods of handling missing data in a 60-week placebo controlled anti-obesity drug trial on topiramate.
We compared an analysis of complete cases with datasets where missing body weight measurements had been replaced using three different imputation methods: LOCF, baseline carried forward (BOCF) and MI.
561 participants were randomised. Compared to placebo, there was a significantly greater weight loss with topiramate in all analyses: 9.5 kg (SE 1.17) in the complete case analysis (N = 86), 6.8 kg (SE 0.66) using LOCF (N = 561), 6.4 kg (SE 0.90) using MI (N = 561) and 1.5 kg (SE 0.28) using BOCF (N = 561).
The different imputation methods gave very different results. Contrary to widely stated claims, LOCF did not produce a conservative (i.e., lower) efficacy estimate compared to MI. Also, LOCF had a lower SE than MI.
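The LOCF and BOCF rules compared above are mechanically simple. A sketch on hypothetical body-weight trajectories (invented values, not the trial data):

```python
import numpy as np
import pandas as pd

# Hypothetical body-weight measurements (kg); NaN marks dropout before that visit.
w = pd.DataFrame({
    "baseline": [100.0, 95.0, 110.0],
    "week30":   [94.0, np.nan, 104.0],
    "week60":   [np.nan, np.nan, 101.0],
})

# LOCF: carry each participant's last observed measurement forward to week 60.
locf_week60 = w.ffill(axis=1)["week60"]

# BOCF: participants with a missing endpoint revert to their baseline value.
bocf_week60 = w["week60"].fillna(w["baseline"])
```

BOCF assumes all lost weight is regained after dropout, which is why it yields the smallest apparent weight loss in the comparison above.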
Clinical trial participants may be temporarily absent or withdraw from trials, leading to missing data. In intention-to-treat (ITT) analyses, several approaches are used for handling the missing information: complete case (CC) analysis, mixed-effects model (MM) analysis, last observation carried forward (LOCF) and multiple imputation (MI). This report discusses the consequences of applying CC, LOCF and MI for the ITT analysis of published data (analysed using the MM method) from the Fracture Reduction Evaluation (FREE) trial.
The FREE trial was a randomised, non-blinded study comparing balloon kyphoplasty with non-surgical care for the treatment of patients with acute painful vertebral fractures. Patients were randomised to treatment (1:1 ratio), and stratified for gender, fracture aetiology, use of bisphosphonates and use of systemic steroids at the time of enrolment. Six outcome measures - Short-form 36 physical component summary (SF-36 PCS) scale, EuroQol 5-Dimension Questionnaire (EQ-5D), Roland-Morris Disability (RMD) score, back pain, number of days with restricted activity in last 2 weeks, and number of days in bed in last 2 weeks - were analysed using four methods for dealing with missing data: CC, LOCF, MM and MI analyses.
There were no missing data in baseline covariates values, and only a few missing baseline values in outcome variables. The overall missing-response level increased during follow-up (1 month: 14.5%; 24 months: 28%), corresponding to a mean of 19% missing data during the entire period. Overall patterns of missing response across time were similar for each treatment group. Almost half of all randomised patients were not available for a CC analysis, a maximum of 4% were not included in the LOCF analysis, and all randomised patients were included in the MM and MI analyses. Improved estimates of treatment effect were observed with LOCF, MM and MI compared with CC; only MM provided improved estimates across all six outcomes considered.
The FREE trial results are robust as the alternative methods used for substituting missing data produced similar results. The MM method showed the highest statistical precision suggesting it is the most appropriate method to use for analysing the FREE trial data.
This trial is registered with ClinicalTrials.gov (number NCT00211211).
Many QTL studies share two common features: (1) marker information is often missing, and (2) among the many markers involved in the biological process, only a few are causal. In statistics, the second issue falls under the headings “sparsity” and “causal inference”. The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. The outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, a weighted lasso. Sparse phenotype inference is then employed to select a set of predictive markers for the trait of interest.
Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment, containing 69 markers for 165 recombinant inbred lines of a F8 generation. The results confirm previously identified regions, however several new markers are also found. On the basis of the inferred ROC behavior these markers show good potential for being real, especially for the germination trait Gmax.
Our imputation method shows higher accuracy, in terms of sensitivity and specificity, than an alternative imputation method. The proposed weighted lasso also outperforms commonly practiced multiple regression as well as the traditional lasso and the adaptive lasso with three weighting schemes. This means that, under realistic missing data settings, this methodology can be used for QTL identification.
Arabidopsis; Germination traits; QTL mapping; Recombinant inbred line (RIL); Binary genotypes; Likelihood-based genotype imputation; Sparse variable selection; Weighted lasso
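A weighted lasso of the kind used in the second step can be emulated with a standard lasso via feature rescaling: penalising w_j|b_j| is equivalent to fitting a plain lasso on X_j / w_j and mapping the coefficients back as b_j = b̃_j / w_j. The data, weights, and penalty value below are invented for illustration and do not reproduce the paper's imputation-probability weighting scheme.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Simulated binary marker matrix (lines x markers); only marker 0 is causal.
n, p = 200, 10
X = rng.integers(0, 2, size=(n, p)).astype(float)
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=n)

# Hypothetical per-marker penalty weights: markers imputed with lower
# certainty (here, markers 5..9) are penalised more heavily.
w = np.full(p, 1.0)
w[5:] = 2.0

# Weighted lasso via rescaling: a plain lasso on X_j / w_j penalises w_j|b_j|;
# map the fitted coefficients back to the original scale.
model = Lasso(alpha=0.05).fit(X / w, y)
beta = model.coef_ / w
```

The rescaling trick is standard (it is how the adaptive lasso is usually implemented), so any lasso solver can serve as the weighted-lasso back end.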
Missing data are unavoidable in environmental epidemiologic surveys. The aim of this study was to compare methods for handling large amounts of missing values: omission of missing values, single and multiple imputation (through linear regression or partial least squares regression), and a fully Bayesian approach. These methods were applied to the PARIS birth cohort, where indoor domestic pollutant measurements were performed in a random sample of babies' dwellings. A simulation study was conducted to assess the performance of the different approaches with a high proportion of missing values (from 50% to 95%). Different simulation scenarios were carried out, controlling the true value of the association (odds ratios of 1.0, 1.2, and 1.4) and varying the prevalence of the health outcome. When a large amount of data was missing, omitting the missing data reduced statistical power and inflated standard errors, which affected the significance of the association. Single imputation underestimated the variability and considerably increased the risk of type I error. All approaches were conservative except the Bayesian joint model. For a common health outcome, the fully Bayesian approach was the most efficient (low root mean square error, reasonable type I error, and high statistical power); nevertheless, for a less prevalent event, the type I error was increased and the statistical power reduced. The estimated posterior distribution of the OR is useful for refining the conclusion. Among the methods for handling missing values, no approach is uniformly best, but when the usual approaches (e.g. single imputation) are not sufficient, jointly modelling the missingness process and the health association is more efficient when large amounts of data are missing.
The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model.
Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied: (a) complete case analysis (CC); (b) single imputation using regression switching with predictive mean matching (SI); (c) multiple imputation using regression switching imputation; (d) multiple imputation using regression switching with predictive mean matching (MICE-PMM); and (e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset, and estimates for the regression coefficients and model performance measures were obtained.
CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness.
Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.
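Regression-switching (chained-equations) imputation can be sketched with scikit-learn's IterativeImputer, a rough analogue of MICE: each incomplete variable is regressed on the others in turn. Note that it does not implement predictive mean matching, and the simulated data below are purely illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)

# Three correlated covariates with ~25% of entries set missing at random.
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.6], [0.3, 0.6, 1.0]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=500)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.25] = np.nan

# Chained-equations imputation: cycle through the incomplete variables,
# drawing each missing value from a regression on the other covariates.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imp = imputer.fit_transform(X_miss)
```

A full multiple-imputation analysis would repeat this draw several times and pool the fitted-model estimates across the completed datasets.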
Association mapping is a powerful approach for dissecting the genetic architecture of complex quantitative traits using high-density SNP markers in maize. Here, we expanded our association panel from 368 to 513 inbred lines with 0.5 million high-quality SNPs using a two-step data-imputation method that combines identity-by-descent (IBD) based projection with a k-nearest neighbor (KNN) algorithm. Genome-wide association studies (GWAS) were carried out for 17 agronomic traits in the panel of 513 inbred lines, applying both a mixed linear model (MLM) and a new method, the Anderson-Darling (A-D) test. Ten loci for five traits were identified using the MLM method at the Bonferroni-corrected threshold −log10(P) > 5.74 (α = 1). Between one and 34 loci per trait (107 loci for plant height) were identified for the 17 traits using the A-D test at the Bonferroni-corrected threshold −log10(P) > 7.05 (α = 0.05) with 556,809 SNPs. Many known loci and new candidate loci were observed only with the A-D test, a few of which were also detected in independent linkage analysis. This study indicates that combining IBD-based projection with the KNN algorithm is an efficient imputation method for inferring large missing genotype segments. In addition, we showed that the A-D test is a useful complement for GWAS analysis of complex quantitative traits: especially for traits with an abnormal phenotype distribution, controlled by moderate-effect loci or rare variants, the A-D test balances false positives and statistical power. The candidate SNPs and associated genes also provide a rich resource for maize genetics and breeding.
Genotype imputation has been used widely in the analysis of genome-wide association studies (GWAS) to boost power and fine-map associations. We developed a two-step data imputation method to meet the challenge of large proportion missing genotypes. GWAS have uncovered an extensive genetic architecture of complex quantitative traits using high-density SNP markers in maize in the past few years. Here, GWAS were carried out for 17 agronomic traits with a panel of 513 inbred lines applying both mixed linear model and a new method, the Anderson-Darling (A-D) test. We intend to show that the A-D test is a complement to current GWAS methods, especially for complex quantitative traits controlled by moderate effect loci or rare variations and with abnormal phenotype distribution. In addition, the traits associated QTL identified here provide a rich resource for maize genetics and breeding.
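The KNN component of the two-step method can be illustrated with scikit-learn's KNNImputer on a toy genotype matrix: each missing call is filled from the lines most similar at the observed SNPs. The IBD-based projection step is not shown, so this is only a rough analogue on invented data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy genotype matrix (lines x SNPs), coded 0/1/2 with NaN for missing calls.
G = np.array([
    [0.0, 1.0, 2.0, np.nan],
    [0.0, 1.0, 2.0, 2.0],
    [2.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, np.nan, 0.0],
])

# k-nearest-neighbour imputation: with k=1, each missing call is copied
# from the single most similar line (by distance over the observed SNPs).
G_imp = KNNImputer(n_neighbors=1).fit_transform(G)
```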
Two common procedures for the treatment of missing information, listwise deletion and positive urine analysis (UA) imputation (e.g., if the participant fails to provide urine for analysis, then score the UA positive), may result in significant biases during the interpretation of treatment effects. To compare these approaches and to offer a possible alternative, these two procedures were compared to the multiple imputation (MI) procedure with publicly available data from a recent clinical trial. Listwise deletion, single imputation (i.e., positive UA imputation), and MI missing data procedures were used to comparatively examine the effect of two different buprenorphine/naloxone tapering schedules (7- or 28-days) for opioid addiction on the likelihood of a positive UA (Clinical Trial Network 0003; Ling et al., 2009). The listwise deletion of missing data resulted in a nonsignificant effect for the taper while the positive UA imputation procedure resulted in a significant effect, replicating the original findings by Ling et al. (2009). Although the MI procedure also resulted in a significant effect, the effect size was meaningfully smaller and the standard errors meaningfully larger when compared to the positive UA procedure. This study demonstrates that the researcher can obtain markedly different results depending on how the missing data are handled. Missing data theory suggests that listwise deletion and single imputation procedures should not be used to account for missing information, and that MI has advantages with respect to internal and external validity when the assumption of missing at random can be reasonably supported.
substance abuse treatment; missing data; positive drug test imputation; multiple imputation