Longitudinal designs in psychiatric research have many benefits, including the ability to measure the course of a disease over time. However, measuring participants repeatedly over time also creates repeated opportunities for missing data, whether through failure to answer certain items, missed assessments, or permanent withdrawal from the study. To avoid bias and loss of information, one should take missing values into account in the analysis. Several popular ways of handling missing data, such as last observation carried forward (LOCF), often lead to incorrect analyses. We discuss a number of these popular but unprincipled methods and describe modern approaches to classifying and analyzing data with missing values. We illustrate these approaches using data from the WECare study, a longitudinal randomized treatment study of low-income women with depression.
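The bias LOCF can introduce is easy to see in a small sketch. This is a generic illustration with hypothetical data, not the WECare analysis itself; the column names and scores are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal depression scores: one row per visit,
# NaN marks a missed assessment or dropout.
scores = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "visit":   [0, 1, 2, 0, 1, 2],
    "hamd":    [24.0, 18.0, np.nan, 22.0, np.nan, np.nan],
})

# LOCF: carry each subject's last observed value forward.
# A subject who drops out is treated as frozen at their last score,
# which biases estimates of change over time.
scores["hamd_locf"] = scores.groupby("subject")["hamd"].ffill()
print(scores)
```

Subject 2 is imputed as flat at 22.0 for both missed visits, even though the observed trajectories suggest continued change; this is the kind of distortion that motivates the principled alternatives discussed above.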
Longitudinal studies often feature incomplete response and covariate data. Likelihood-based methods such as the expectation–maximization algorithm give consistent estimators of model parameters when data are missing at random (MAR), provided that the response model and the missing covariate model are correctly specified; the missing data mechanism, however, need not be specified. An alternative is the weighted estimating equation, which gives consistent estimators if the missing data and response models are correctly specified; here, the distribution of the covariates with missing values need not be specified. In this article, we develop a doubly robust estimation method for longitudinal data with missing responses and missing covariates when data are MAR. The method is appealing in that it provides consistent estimators if either the missing data model or the missing covariate model is correctly specified. Simulation studies demonstrate that the method performs well in a variety of situations.
Doubly robust; Estimating equation; Missing at random; Missing covariate; Missing response
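The double-robustness property described above can be illustrated with the standard augmented inverse probability weighted (AIPW) estimator of a mean with a MAR response. This is a minimal generic sketch, not the estimator developed in the article; the simulated data, the oracle working models, and all variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: covariate x always observed, response y missing at random given x.
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)      # true mean E[y] = 2.0
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))       # P(observe y | x): a MAR mechanism
r = rng.binomial(1, pi)                     # r = 1 if y is observed
y_obs = np.where(r == 1, y, 0.0)            # missing y never enters the estimator

def aipw_mean(pi_hat, m_hat):
    """Doubly robust (AIPW) estimator of E[y]: consistent if either the
    missingness model pi_hat or the outcome model m_hat is correct."""
    return np.mean(r * y_obs / pi_hat + (1.0 - r / pi_hat) * m_hat)

m_true = 2.0 + 1.5 * x                      # correct outcome regression
mu_both = aipw_mean(pi, m_true)             # both working models correct
mu_bad_m = aipw_mean(pi, np.zeros(n))       # wrong outcome model, correct pi
print(mu_both, mu_bad_m)                    # both close to the true mean 2.0
```

With a correctly specified missingness model, even a badly wrong outcome model still yields a consistent estimate; in practice both working models would of course be fitted, e.g. by logistic and linear regression, rather than known.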
Even in a well-designed and controlled study, missing data occurs in almost all research. It can reduce the statistical power of a study and produce biased estimates, leading to invalid conclusions. This manuscript reviews the types of missing data and the problems they cause, illustrates the mechanisms by which missing data occurs, and discusses the methods for handling it. The paper concludes with recommendations for the handling of missing data.
Expectation-Maximization; Imputation; Missing data; Sensitivity analysis
Using an appropriate method to handle cases with missing data when performing secondary analyses of survey data is important to reduce bias and to reach valid conclusions for the target population. Many published secondary analyses using child health data sets do not discuss the technique employed to treat missing data or simply delete cases with missing data. Missing data may threaten statistical power by reducing the sample size; in more extreme situations, estimates derived after deleting cases with missing values may be biased, particularly if the cases with missing values are systematically different from those with complete data. The aim of this study was to determine which of 4 techniques for handling missing data most closely estimates the true model coefficient when varying proportions of cases are missing data.
We performed a simulation study to compare model coefficients when all cases had complete data and when 4 techniques for handling missing data were employed with 10%, 20%, 30% or 40% of the cases missing data.
When more than 10% of the cases had missing data, the re-weighting and multiple imputation techniques were superior to dropping cases with missing scores or hot deck imputation.
These findings suggest that child health researchers should use caution when analyzing survey data if a large percentage of cases have missing values. In most situations, the technique of dropping cases with missing data should be discouraged. Investigators should consider re-weighting or multiple imputation if a large percentage of cases are missing data.
missing data; non-response bias; secondary analysis; hot deck imputation; weighting; multiple imputation
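Of the four techniques compared above, hot deck imputation is perhaps the least familiar; a minimal random hot deck can be sketched as follows. The data and seed are hypothetical, and real implementations usually draw donors within strata of respondents similar to the recipient rather than from the whole sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical survey scores with missing entries (NaN).
scores = np.array([3.0, np.nan, 5.0, 4.0, np.nan, 2.0, 5.0])

# Simple random hot deck: fill each missing value with a donor value
# drawn with replacement from the observed cases.
observed = scores[~np.isnan(scores)]
missing_idx = np.where(np.isnan(scores))[0]
imputed = scores.copy()
imputed[missing_idx] = rng.choice(observed, size=missing_idx.size, replace=True)
print(imputed)
```

Because each imputed value is a real observed value, hot deck preserves the marginal distribution of the item, but a single hot deck draw understates uncertainty, which is one reason the study above finds multiple imputation preferable when much data is missing.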
Missing data often occur in cross-sectional surveys and in longitudinal and experimental studies. The purpose of this study was to compare the prediction of self-rated health (SRH), a robust predictor of morbidity and mortality among diverse populations, before and after imputation of the missing variable “yearly household income.” We reviewed data from 4,162 participants of Mexican origin recruited from July 1, 2002, through December 31, 2005, who were enrolled in a population-based cohort study. Missing yearly income data were imputed using three different single imputation methods and one multiple imputation under a Bayesian approach. Of the 4,162 participants, 3,121 were randomly assigned to a training set (to derive the yearly income imputation methods and develop the health-outcome prediction models) and 1,041 to a testing set (to compare the areas under the curve (AUC) of the receiver-operating characteristics of the resulting health-outcome prediction models). The discriminatory power of the SRH prediction models was good (range, 69%–72%). Compared with the prediction model obtained with no imputation of missing yearly income, all other imputation methods improved the prediction of SRH (P<0.05 for all comparisons), with the AUC for the model after multiple imputation being the highest (AUC = 0.731). Furthermore, when yearly income was imputed using multiple imputation, the odds of reporting good or better SRH increased by 11% for each $5,000 increment in yearly income. This study showed that although imputation of missing data for a key predictor variable can improve a risk health-outcome prediction model, further work is needed to illuminate the risk factors associated with SRH.
Self-rated health; Missing income data; Data imputation techniques; Mean substitution; Multiple imputation; Minority health
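After multiple imputation of a variable such as income, the per-imputation estimates are combined with Rubin's rules, the standard pooling step. The sketch below uses invented estimates and variances for m = 5 hypothetical imputations, not values from the study above:

```python
import numpy as np

# Hypothetical estimates of a regression coefficient from m = 5 imputed data sets.
est = np.array([0.105, 0.112, 0.098, 0.110, 0.104])       # per-imputation estimates
var = np.array([0.0021, 0.0023, 0.0020, 0.0022, 0.0021])  # per-imputation variances

m = len(est)
q_bar = est.mean()              # pooled point estimate
w_bar = var.mean()              # within-imputation variance
b = est.var(ddof=1)             # between-imputation variance
t = w_bar + (1 + 1 / m) * b     # total variance (Rubin, 1987)
print(q_bar, np.sqrt(t))        # pooled estimate and its standard error
```

The between-imputation component b is what single imputation methods such as mean substitution omit, which is why they understate standard errors.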
Longitudinal data often contain missing observations and error-prone covariates. Extensive attention has been directed to analysis methods to adjust for the bias induced by missing observations. There is relatively little work on investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. It is not clear what the impact of ignoring measurement error is when analyzing longitudinal data with both missing observations and error-prone covariates. In this article, we study the effects of covariate measurement error on estimation of the response parameters for longitudinal studies. We develop an inference method that adjusts for the biases induced by measurement error as well as by missingness. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and variance structures. Furthermore, the proposed method employs the so-called functional modeling strategy to handle the covariate process, with the distribution of covariates left unspecified. These features, plus the simplicity of implementation, make the proposed method very attractive. In this paper, we establish the asymptotic properties for the resulting estimators. With the proposed method, we conduct sensitivity analyses on a cohort data set arising from the Framingham Heart Study. Simulation studies are carried out to evaluate the impact of ignoring covariate measurement error and to assess the performance of the proposed method.
Estimating equations; Longitudinal data; Measurement error; Missing data; Simulation and extrapolation method
Carotid intima-media thickness (CIMT) measurements have been widely used as a primary endpoint in studies of the effects of new interventions, as an alternative to cardiovascular morbidity and mortality. There are no accepted standards on the use of CIMT measurements in intervention studies, and choices in the design and analysis of a CIMT study are generally based on experience and expert opinion. In the present review, we provide an overview of the current evidence on several aspects of the design and analysis of a CIMT study on the early effects of new interventions.
Summary of Issues
We provide a balanced evaluation of the carotid segments, carotid walls, and image views to be used as the CIMT study endpoint; the reading method to be employed (manual or semi-automated, continuous or in batch); the required sample size; and the frequency of ultrasound examinations. We also discuss the preferred methods to analyse longitudinal CIMT data and address the possible impact of missing and biologically implausible CIMT values, along with methods to deal with them.
Linear mixed effects models are the preferred way to analyse CIMT data and appropriately handle missing and biologically implausible CIMT values. Furthermore, we recommend using extensive CIMT designs that measure CIMT at regular intervals at multiple carotid sites, as such an approach is likely to increase the success rates of CIMT intervention studies designed to evaluate the effects of new interventions on atherosclerotic burden.
Carotid intima-media thickness; Trials; Study design; Data analysis; Atherosclerosis
Missing data often cause problems in longitudinal cohort studies with repeated follow-up waves. Research in this area has focussed on analyses with missing data in repeated measures of the outcome, from which participants with missing exposure data are typically excluded. We performed a simulation study to compare complete-case analysis with multiple imputation (MI) for dealing with missing data in an analysis of the association between waist circumference, measured at two waves, and the risk of colorectal cancer (a completely observed outcome).
We generated 1,000 datasets of 41,476 individuals with values of waist circumference at waves 1 and 2 and times to the events of colorectal cancer and death to resemble the distributions of the data from the Melbourne Collaborative Cohort Study. Three proportions of missing data (15%, 30% and 50%) were imposed on waist circumference at wave 2 using three missing data mechanisms: Missing Completely at Random (MCAR), and two covariate-dependent Missing at Random (MAR) scenarios, one realistic and one more extreme. We assessed the impact of missing data on two epidemiological analyses: 1) the association between change in waist circumference between waves 1 and 2 and the risk of colorectal cancer, adjusted for waist circumference at wave 1; and 2) the association between waist circumference at wave 2 and the risk of colorectal cancer, not adjusted for waist circumference at wave 1.
We observed very little bias for complete-case analysis or MI under all missing data scenarios, and the resulting coverage of interval estimates was near the nominal 95% level. MI showed gains in precision when waist circumference was included as a strong auxiliary variable in the imputation model.
This simulation study, based on data from a longitudinal cohort study, demonstrates that there is little gain in performing MI compared to a complete-case analysis in the presence of up to 50% missing data for the exposure of interest when the data are MCAR or missing at random dependent on covariates. MI will result in some gain in precision if a strong auxiliary variable that is not in the analysis model is included in the imputation model.
Simulation study; Missing exposure; Multiple imputation; Complete-case analysis; Repeated exposure measurement
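Imposing covariate-dependent MAR missingness of a chosen proportion, as in the simulation design above, amounts to making the missingness probability a function of an always-observed covariate. The sketch below is a generic stand-in with invented parameter values, not the study's actual data-generating model:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Hypothetical waist circumference (cm) at two waves.
waist_w1 = rng.normal(94, 10, size=n)
waist_w2 = waist_w1 + rng.normal(1, 4, size=n)

# Covariate-dependent MAR: P(waist_w2 missing) depends only on the
# always-observed wave-1 value, so the mechanism is MAR, not MNAR.
# Intercept -0.85 is tuned to give roughly 30% missingness here.
logit = -0.85 + 0.05 * (waist_w1 - 94)
p_miss = 1.0 / (1.0 + np.exp(-logit))
waist_w2[rng.uniform(size=n) < p_miss] = np.nan

print(np.isnan(waist_w2).mean())   # observed missingness proportion
```

Because missingness depends only on waist_w1, a complete-case analysis that conditions on waist_w1 remains valid, which is consistent with the small biases reported above.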
An Approximate Bayesian Bootstrap (ABB) offers advantages in incorporating appropriate uncertainty when imputing missing data, but most implementations of the ABB have lacked the ability to handle nonignorable missing data where the probability of missingness depends on unobserved values. This paper outlines a strategy for using an ABB to multiply impute nonignorable missing data. The method allows the user to draw inferences and perform sensitivity analyses when the missing data mechanism cannot automatically be assumed to be ignorable. Results from imputing missing values in a longitudinal depression treatment trial as well as a simulation study are presented to demonstrate the method’s performance. We show that a procedure that uses a different type of ABB for each imputed data set accounts for appropriate uncertainty and provides nominal coverage.
Not Missing at Random; NMAR; Multiple Imputation; Hot-Deck
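The ignorable version of the Approximate Bayesian Bootstrap is simple to state: resample the observed values to form a donor pool, then draw the imputations from that pool. The sketch below shows that ignorable two-step draw with invented data; it does not reproduce the paper's nonignorable extension:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical item with observed and missing values.
values = np.array([7.0, np.nan, 4.0, 9.0, np.nan, 6.0, 5.0, np.nan])
obs = values[~np.isnan(values)]
n_mis = int(np.isnan(values).sum())

def abb_impute(obs, n_mis, rng):
    """One Approximate Bayesian Bootstrap draw (ignorable version):
    resample the observed values with replacement, then draw the
    imputations from that resampled donor pool. The extra resampling
    step propagates uncertainty about the donor distribution."""
    donor_pool = rng.choice(obs, size=obs.size, replace=True)
    return rng.choice(donor_pool, size=n_mis, replace=True)

# Multiple imputation: repeat the ABB independently per completed data set.
completed = []
for _ in range(5):
    filled = values.copy()
    filled[np.isnan(values)] = abb_impute(obs, n_mis, rng)
    completed.append(filled)
print(completed[0])
```

A nonignorable variant in the spirit of the paper would distort the donor pool (for example, by reweighting or shifting donors) differently in each imputation; the key point retained here is that each completed data set gets its own independent donor-pool draw.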
Randomized clinical trials are the gold standard for evaluating interventions because randomized assignment equalizes known and unknown characteristics between intervention groups. However, when participants miss visits, the ability to conduct an intent-to-treat analysis and draw conclusions about a causal link is compromised. As guidance to those performing clinical trials, this review is a non-technical overview of the consequences of missing data and a prescription for its treatment that extends beyond the typical analytic approaches to the entire research process. Examples of bias from incorrect analysis with missing data and discussion of the advantages and disadvantages of analytic methods are given. As no single analysis is definitive when missing data occurs, strategies for its prevention throughout the course of a trial are presented. We aim to convey an appreciation for how missing data influences results and an understanding of the need for careful consideration of missing data during the design, planning, conduct, and analytic stages.
missing data; clinical trial; intent to treat; MCAR; MAR; MNAR; study design
Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation and discuss their application to GWAS. I describe common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, note some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance.
genotype imputation; HapMap; GWAS
In the last two decades, predictive testing programs have become available for various hereditary diseases, often accompanied by follow-up studies on the psychological effects of test outcomes. The aim of this systematic literature review is to describe and evaluate the statistical methods that were used in these follow-up studies. A literature search revealed 40 longitudinal quantitative studies that met the selection criteria for the review. Fifteen studies (38%) applied adequate statistical methods; the majority, 25 studies, applied less suitable statistical techniques. Nine studies (23%) did not report on dropout rate, and 18 studies provided no characteristics of the dropouts. Thirteen of the 22 studies that should have provided data on missing values actually reported on them. It is concluded that many studies could have yielded more and better results if more appropriate methodology had been used.
Biomedical research is plagued with problems of missing data, especially in clinical trials of medical and behavioral therapies adopting a longitudinal design. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; then, each partially imputed data set is analyzed to deal with dropouts with or without further imputation. Depending on the choice of imputation model or measurement model, various strategies can be jointly applied to the same set of data to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.
multiple partial imputation; selection model; pattern-mixture model; Markov transition model; nonignorable dropout; intermittent missing values
Nonignorable missing data is a common problem in longitudinal studies. Latent class models are attractive for simplifying the modeling of missing data when the data are subject to either a monotone or intermittent missing data pattern. In our study, we propose a new two-latent-class model for categorical data with informative dropouts, dividing the observed data into two latent classes: one in which the outcomes are deterministic and a second in which the outcomes can be modeled using logistic regression. In the model, the latent classes connect the longitudinal responses and the missingness process under the assumption of conditional independence. Parameters are estimated by maximum likelihood based on the above assumptions and the tetrachoric correlation between responses within the same subject. We compare the proposed method with the shared parameter model and the weighted GEE model using the area under the ROC curve, both in simulations and in an application to the smoking cessation data set. The simulation results indicate that the proposed two-latent-class model performs well under different missingness processes. The application results show that our proposed method outperforms the shared parameter model and the weighted GEE model.
Area under ROC curve; Informative dropout; Latent class; Tetrachoric correlation
A significant source of missing data in longitudinal epidemiologic studies of elderly individuals is death. It is generally believed that these data missing by death are non-ignorable for likelihood-based inference, and inference based only on data from surviving participants may lead to biased results. In this paper we model both the probability of disease and the probability of death using shared random effect parameters. We also propose using the Laplace approximation to obtain an approximate likelihood function, so that high-dimensional integration over the distributions of the random effect parameters is not necessary. Parameter estimates can be obtained by maximizing the approximate log-likelihood function. Data from a longitudinal dementia study are used to illustrate the approach. A small simulation is conducted to compare parameter estimates from the proposed method to those from the ‘naive’ method, which treats the missing data as missing at random.
Missing data is a very common problem in medical and social studies, especially when data are collected longitudinally, and it is a challenge to utilize the observed data effectively. Many papers on missing data problems can be found in the statistical literature. It is well known that inverse probability weighted estimation is neither efficient nor robust; the doubly robust (DR) method, on the other hand, can improve both efficiency and robustness. DR estimation requires a missing data model (i.e., a model for the probability that data are observed) and a working regression model (i.e., a model for the outcome variable given covariates and surrogate variables). Because the DR estimating function has mean zero for any parameters in the working regression model when the missing data model is correctly specified, in this paper we derive a formula for the estimator of the parameters of the working regression model that yields the optimally efficient estimator of the marginal mean model (the parameters of interest) when the missing data model is correctly specified. The proposed method also inherits the DR property. Simulation studies demonstrate the greater efficiency of the proposed method compared with the standard DR method. A longitudinal dementia data set is used for illustration.
longitudinal data; missing data; optimal; surrogate outcome