Longitudinal designs in psychiatric research have many benefits, including the ability to measure the course of a disease over time. However, measuring participants repeatedly also creates repeated opportunities for missing data, whether through failure to answer certain items, missed assessments, or permanent withdrawal from the study. To avoid bias and loss of information, missing values should be taken into account in the analysis. Several methods in common use for handling missing data, such as last observation carried forward (LOCF), often lead to incorrect analyses. We discuss a number of these popular but unprincipled methods and describe modern approaches to classifying and analyzing data with missing values. We illustrate these approaches using data from the WECare study, a longitudinal randomized treatment study of low-income women with depression.
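To make concrete why LOCF is unprincipled, here is a minimal Python sketch of the technique (the data frame is invented for illustration and is not the WECare data): once a participant stops providing scores, the last observed value is simply frozen in place, so any change after dropout is assumed away.

```python
import pandas as pd

# Toy longitudinal data: one row per (subject, visit); NaN marks a missed assessment.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "visit":   [0, 1, 2, 0, 1, 2],
    "score":   [22.0, None, None, 18.0, 15.0, None],
})

# LOCF: propagate each subject's last observed score into later missed visits.
# This freezes trajectories at dropout, which typically biases estimates of change.
df = df.sort_values(["subject", "visit"])
df["score_locf"] = df.groupby("subject")["score"].ffill()
print(df)
```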
Using an appropriate method to handle cases with missing data when performing secondary analyses of survey data is important for reducing bias and reaching valid conclusions about the target population. Many published secondary analyses of child health data sets either do not discuss the technique employed to treat missing data or simply delete cases with missing values. Missing data can threaten statistical power by reducing the sample size; in more extreme situations, estimates derived after deleting cases with missing values may be biased, particularly if the cases with missing values differ systematically from those with complete data. The aim of this study was to determine which of 4 techniques for handling missing data most closely estimates the true model coefficient when varying proportions of cases have missing data.
We performed a simulation study comparing the model coefficients obtained when all cases had complete data with those obtained under each of the 4 techniques for handling missing data when 10%, 20%, 30%, or 40% of the cases had missing data.
When more than 10% of the cases had missing data, the re-weighting and multiple imputation techniques were superior to dropping cases with missing values or to hot deck imputation.
These findings suggest that child health researchers should use caution when analyzing survey data in which a large percentage of cases have missing values. In most situations, dropping cases with missing data should be discouraged; investigators should instead consider re-weighting or multiple imputation.
missing data; non-response bias; secondary analysis; hot deck imputation; weighting; multiple imputation
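As an illustration of one of the four techniques under comparison, the sketch below implements a simple within-cell hot deck in Python (NumPy/pandas assumed; the survey variables are hypothetical): each missing value is replaced by a value drawn at random from the observed "donors" in the same adjustment cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey extract with ~30% of scores missing.
df = pd.DataFrame({
    "age_group": rng.choice(["0-5", "6-11", "12-17"], size=500),
    "score": rng.normal(50, 10, size=500),
})
df.loc[rng.random(500) < 0.30, "score"] = np.nan

def hot_deck(group):
    """Fill missing values in one adjustment cell from randomly chosen donors."""
    donors = group.dropna().to_numpy()
    out = group.copy()
    n_missing = out.isna().sum()
    if len(donors) and n_missing:
        out[out.isna()] = rng.choice(donors, size=n_missing)
    return out

df["score_hd"] = df.groupby("age_group")["score"].transform(hot_deck)
```

Hot deck preserves the observed distribution within cells, but as the findings above indicate, it can still fall behind re-weighting and multiple imputation as the missing fraction grows.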
Missing data often occur in cross-sectional surveys and in longitudinal and experimental studies. The purpose of this study was to compare the prediction of self-rated health (SRH), a robust predictor of morbidity and mortality among diverse populations, before and after imputation of the missing variable “yearly household income.” We reviewed data from 4,162 participants of Mexican origin recruited from July 1, 2002, through December 31, 2005, who were enrolled in a population-based cohort study. Missing yearly income data were imputed using three different single imputation methods and one multiple imputation under a Bayesian approach. Of the 4,162 participants, 3,121 were randomly assigned to a training set (to derive the yearly income imputation methods and develop the health-outcome prediction models) and 1,041 to a testing set (to compare the areas under the receiver-operating characteristic curves (AUC) of the resulting health-outcome prediction models). The discriminatory power of the SRH prediction models was good (range, 69–72%), and, compared with the model obtained with no imputation of missing yearly income, all of the imputation methods improved the prediction of SRH (P < 0.05 for all comparisons), with the model based on multiple imputation having the highest AUC (0.731). Furthermore, when yearly income was imputed using multiple imputation, the odds of good or better SRH increased by 11% for each $5,000 increment in yearly income. This study showed that although imputation of missing data for a key predictor variable can improve a risk health-outcome prediction model, further work is needed to illuminate the risk factors associated with SRH.
Self-rated health; Missing income data; Data imputation techniques; Mean substitution; Multiple imputation; Minority health
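A sketch of the general workflow (synthetic data; scikit-learn's IterativeImputer is used here as a chained-equations stand-in for the paper's Bayesian multiple imputation, and a single fitted imputer is shown rather than a full set of pooled imputations):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for the cohort: age, partly missing income, binary good-SRH.
n = 2000
age = rng.normal(45, 12, n)
income = rng.normal(30, 10, n) + 0.2 * (age - 45)          # in $1,000s
srh = (rng.random(n) < 1 / (1 + np.exp(1 - 0.05 * income))).astype(int)
income_obs = income.copy()
income_obs[rng.random(n) < 0.25] = np.nan                   # ~25% missing income

X = np.column_stack([age, income_obs])
X_tr, X_te, y_tr, y_te = train_test_split(X, srh, random_state=0)

# Compare a single-imputation baseline with a model-based (chained) imputer by
# test-set AUC, mirroring the training/testing split used in the study.
for name, imp in [("median", SimpleImputer(strategy="median")),
                  ("chained", IterativeImputer(random_state=0))]:
    model = LogisticRegression().fit(imp.fit_transform(X_tr), y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(imp.transform(X_te))[:, 1])
    print(name, round(auc, 3))
```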
Longitudinal data often contain missing observations and error-prone covariates. Extensive attention has been directed to analysis methods that adjust for the bias induced by missing observations, but there is relatively little work investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. It is not clear what the impact of ignoring measurement error is when analyzing longitudinal data with both missing observations and error-prone covariates. In this article, we study the effects of covariate measurement error on estimation of the response parameters for longitudinal studies. We develop an inference method that adjusts for the biases induced by measurement error as well as by missingness. The proposed method does not require full specification of the distribution of the response vector; it requires only models for its mean and variance structures. Furthermore, the method employs the so-called functional modeling strategy to handle the covariate process, with the distribution of the covariates left unspecified. These features, plus the simplicity of implementation, make the proposed method very attractive. We establish the asymptotic properties of the resulting estimators. With the proposed method, we conduct sensitivity analyses on a cohort data set arising from the Framingham Heart Study. Simulation studies are carried out to evaluate the impact of ignoring covariate measurement error and to assess the performance of the proposed method.
Estimating equations; Longitudinal data; Measurement error; Missing data; Simulation and extrapolation method
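The keywords mention the simulation and extrapolation (SIMEX) method; below is a generic SIMEX sketch for a scalar covariate with known measurement error variance (synthetic data; this is the textbook algorithm, not the authors' estimating-equation procedure): progressively inflate the error, observe how the naive slope attenuates, and extrapolate back to the no-error case.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: true covariate x, error-prone surrogate w = x + u.
n, beta, sigma_u = 2000, 1.0, 0.8
x = rng.normal(0, 1, n)
w = x + rng.normal(0, sigma_u, n)
y = beta * x + rng.normal(0, 1, n)

def ols_slope(w, y):
    return np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# Simulation step: for each lambda, add extra noise with variance lambda*sigma_u^2
# so the total error variance is (1 + lambda)*sigma_u^2, and average over B fits.
lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 200
slopes = [np.mean([ols_slope(w + rng.normal(0, np.sqrt(lam) * sigma_u, n), y)
                   for _ in range(B)]) for lam in lambdas]

# Extrapolation step: fit a quadratic in lambda and evaluate at lambda = -1,
# which corresponds to zero measurement error.
coefs = np.polyfit(lambdas, slopes, deg=2)
print("naive:", round(slopes[0], 3), "SIMEX:", round(np.polyval(coefs, -1.0), 3))
```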
An Approximate Bayesian Bootstrap (ABB) offers advantages in incorporating appropriate uncertainty when imputing missing data, but most implementations of the ABB have lacked the ability to handle nonignorable missing data where the probability of missingness depends on unobserved values. This paper outlines a strategy for using an ABB to multiply impute nonignorable missing data. The method allows the user to draw inferences and perform sensitivity analyses when the missing data mechanism cannot automatically be assumed to be ignorable. Results from imputing missing values in a longitudinal depression treatment trial as well as a simulation study are presented to demonstrate the method’s performance. We show that a procedure that uses a different type of ABB for each imputed data set accounts for appropriate uncertainty and provides nominal coverage.
Not Missing at Random; NMAR; Multiple Imputation; Hot-Deck
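A minimal sketch of the ABB idea (toy data; the multiplier k below is a generic nonignorability/sensitivity device, hypothetical rather than the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(3)

def abb_impute(observed, n_missing, k=1.0):
    """One Approximate Bayesian Bootstrap draw.

    Step 1 resamples the observed values with replacement, propagating
    uncertainty about their distribution; step 2 draws the imputations from
    that resample. k shifts the donor pool multiplicatively as a crude
    nonignorability sensitivity parameter; k = 1 recovers the ignorable ABB.
    """
    pool = rng.choice(observed, size=len(observed), replace=True)  # step 1
    return k * rng.choice(pool, size=n_missing, replace=True)      # step 2

observed = rng.normal(10, 2, size=80)   # toy depression scores for completers
n_mis, M = 20, 20                       # missing values, imputed data sets
completed = [np.concatenate([observed, abb_impute(observed, n_mis, k=1.1)])
             for _ in range(M)]
estimates = [c.mean() for c in completed]
# Rubin's rules would then combine the M point estimates and their variances.
print(np.mean(estimates))
```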
Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation, and discuss their application to GWAS. I discuss common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, as well as some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance.
genotype imputation; HapMap; GWAS
In the last two decades, predictive testing programs have become available for various hereditary diseases, often accompanied by follow-up studies on the psychological effects of test outcomes. The aim of this systematic literature review is to describe and evaluate the statistical methods used in these follow-up studies. A literature search revealed 40 longitudinal quantitative studies that met the selection criteria for the review. Fifteen studies (38%) applied adequate statistical methods; the majority, 25 studies, applied less suitable techniques. Nine studies (23%) did not report the dropout rate, and 18 studies provided no characteristics of the dropouts. Of the 22 studies that should have provided data on missing values, only 13 actually reported on them. It is concluded that many studies could have yielded more and better results if more appropriate methodology had been used.
Biomedical research is plagued by problems of missing data, especially in clinical trials of medical and behavioral therapies adopting longitudinal designs. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; each partially imputed data set is then analyzed to deal with dropouts, with or without further imputation. Depending on the choice of imputation model or measurement model, various strategies can be jointly applied to the same data set to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.
multiple partial imputation; selection model; pattern-mixture model; Markov transition model; nonignorable dropout; intermittent missing values
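A minimal sketch of the multiple-partial-imputation workflow (invented data; the interpolate-plus-noise fill and the mixed model are deliberately simple stand-ins for the paper's imputation and measurement models): intermittent gaps are filled M times, the monotone dropout tail is left missing, and each partially imputed data set is analyzed by a likelihood method that accommodates MAR dropout.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Wide-format toy trial: 60 subjects, 5 visits, with intermittent gaps
# (a skipped middle visit) and monotone dropout (later visits all lost).
Y = pd.DataFrame(rng.normal(20, 4, (60, 5)))
Y.loc[rng.random(60) < 0.2, 2] = np.nan          # intermittent missingness
Y.loc[rng.random(60) < 0.3, [3, 4]] = np.nan     # dropout

interp = Y.interpolate(axis=1, limit_area="inside")   # fills interior gaps only
mask = Y.isna() & interp.notna()                      # the intermittent cells

M, effects = 10, []
for m in range(M):
    # Partial imputation m: perturbed interpolation for intermittent cells
    # only; the monotone dropout tail is deliberately left missing.
    noise = pd.DataFrame(rng.normal(0, 1, Y.shape))
    Ym = Y.where(~mask, interp + noise)
    long = Ym.stack().rename("y").reset_index().dropna()
    long.columns = ["subject", "visit", "y"]
    # Mixed model on the partially imputed data; the remaining monotone
    # dropout is handled through the likelihood under an MAR assumption.
    fit = smf.mixedlm("y ~ visit", long, groups="subject").fit()
    effects.append(fit.params["visit"])

print("pooled visit effect:", np.mean(effects))  # Rubin's rules also pool variances
```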
Nonignorable missing data are a common problem in longitudinal studies. Latent class models are attractive for simplifying the modeling of missing data when the data are subject to either a monotone or an intermittent missing data pattern. We propose a new two-latent-class model for categorical data with informative dropouts, dividing the observed data into two latent classes: one in which the outcomes are deterministic, and one in which the outcomes can be modeled using logistic regression. In the model, the latent classes connect the longitudinal responses and the missingness process under an assumption of conditional independence. Parameters are estimated by maximum likelihood based on these assumptions and on the tetrachoric correlation between responses within the same subject. We compare the proposed method with the shared parameter model and the weighted GEE model using the areas under the ROC curves, both in simulations and in an application to the smoking cessation data set. The simulation results indicate that the proposed two-latent-class model performs well under different missingness mechanisms, and the application results show that it outperforms the shared parameter model and the weighted GEE model.
Area under ROC curve; Informative dropout; Latent class; Tetrachoric correlation
A significant source of missing data in longitudinal epidemiologic studies of elderly individuals is death. These data missing due to death are generally believed to be non-ignorable for likelihood-based inference, so inference based only on data from surviving participants may lead to biased results. In this paper, we model both the probability of disease and the probability of death using shared random effect parameters. We also propose using the Laplace approximation to obtain an approximate likelihood function, so that high-dimensional integration over the distributions of the random effect parameters is not necessary. Parameter estimates can be obtained by maximizing the approximate log-likelihood function. Data from a longitudinal dementia study are used to illustrate the approach. A small simulation is conducted to compare parameter estimates from the proposed method with those from the 'naive' method, in which the missing data are treated as missing at random.
Cronbach Coefficient Alpha (CCA) is a classic measure of the internal consistency of the items of an instrument and is used in a wide range of behavioral, biomedical, psychosocial, and health-care related research. Methods are available for making inference about one CCA or multiple CCAs from correlated outcomes; however, none of the existing approaches effectively addresses missing data. As longitudinal study designs become increasingly popular and complex in modern-day clinical studies, missing data have become a serious issue, and the lack of methods to systematically address the problem has hampered progress in the aforementioned fields. In this paper, we develop a novel approach to tackle the complexities involved in addressing missing data (at the instrument level, due to subject dropout) within a longitudinal data setting. The approach is illustrated with both clinical and simulated data.
Cronbach Coefficient Alpha; Inverse probability weighting; Missing data; Monotone missing data pattern; U-statistics
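As a sketch of the inverse-probability-weighting idea behind the approach (synthetic data; the paper's actual estimator is based on U-statistics, so this weighted-covariance version is only an illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

def cronbach_alpha(items, w=None):
    """Cronbach Coefficient Alpha, optionally with subject weights w."""
    cov = np.cov(items, rowvar=False, aweights=w)
    k = items.shape[1]
    return k / (k - 1) * (1 - np.trace(cov) / cov.sum())

# Toy instrument: 5 items at follow-up; completion depends on baseline severity.
n, k = 400, 5
latent = rng.normal(0, 1, n)
items = latent[:, None] + rng.normal(0, 1, (n, k))
baseline = latent + rng.normal(0, 0.5, n)
completed = rng.random(n) < 1 / (1 + np.exp(baseline))   # sicker -> more dropout

# Estimate P(complete | baseline) and weight each completer by its inverse,
# so completers stand in for comparable subjects who dropped out.
p = LogisticRegression().fit(baseline.reshape(-1, 1), completed)
w = 1 / p.predict_proba(baseline[completed].reshape(-1, 1))[:, 1]

print("unweighted alpha:", round(cronbach_alpha(items[completed]), 3))
print("IPW alpha:", round(cronbach_alpha(items[completed], w=w), 3))
```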
Missing observations are commonplace in longitudinal data. We discuss how to model and analyze such data in a dynamic framework, that is, taking into consideration the time structure of the process and the influence of the past on present and future responses. An autoregressive model is used as a special case of the linear increments model defined by Farewell (2006, Linear models for censored data, PhD thesis, Lancaster University) and Diggle and others (2007, Analysis of longitudinal data with drop-out: objectives, assumptions and a proposal, Journal of the Royal Statistical Society, Series C (Applied Statistics), 56, 499–550). We wish to reconstruct responses for missing data, and we discuss the assumptions required for both monotone and nonmonotone missingness. The computational procedures suggested are very simple and easily applicable. They can also be used to estimate causal effects in the presence of time-dependent confounding. There are also connections to methods from survival analysis: the Aalen–Johansen estimator for the transition matrix of a Markov chain turns out to be a special case. An analysis of quality-of-life data from a cancer clinical trial is presented. Some simulations are given in the supplementary material available at Biostatistics online.
Cancer clinical trial; Dynamic approach; Linear increments model; Longitudinal data; Missing data; Quality of life
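A stripped-down sketch of the reconstruction idea (synthetic data; the full linear increments model lets each increment depend on the observed past, whereas this version carries forward an unconditional mean increment):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Toy quality-of-life scores over 6 visits with monotone dropout.
n, T = 200, 6
Y = pd.DataFrame(np.cumsum(rng.normal(0.5, 1, (n, T)), axis=1) + 50)
first_missing = rng.integers(2, T + 1, n)      # a value of T means no dropout
for i in range(n):
    Y.iloc[i, first_missing[i]:] = np.nan

# Linear increments idea: estimate the mean increment at each transition from
# subjects observed at both visits, then reconstruct missing responses by
# carrying increments forward (rather than observations, as LOCF would).
Yhat = Y.copy()
for t in range(1, T):
    inc = (Y[t] - Y[t - 1]).mean()             # pairs observed at t-1 and t
    Yhat[t] = Yhat[t].fillna(Yhat[t - 1] + inc)

print(Yhat.mean())                             # reconstructed mean trajectory
```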
Methods for identifying meaningful growth patterns of longitudinal trial data with both nonignorable intermittent and drop-out missingness are rare. In this study, a combined approach with statistical and data mining techniques is utilized to address the nonignorable missing data issue in growth pattern recognition. First, a parallel mixture model is proposed to model the nonignorable missing information from a real-world patient-oriented study and concurrently to estimate the growth trajectories of participants. Then, based on individual growth parameter estimates and their auxiliary feature attributes, a fuzzy clustering method is incorporated to identify the growth patterns. This case study demonstrates that the combined multi-step approach can achieve both statistical generality and computational efficiency for growth pattern recognition in longitudinal studies with nonignorable missing data.
Not missing at random; intermittent missing; growth pattern recognition; parallel mixture model; fuzzy clustering
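A compact sketch of the clustering step (the growth-parameter inputs are simulated here; in the combined approach they would come from the parallel mixture model fits): a basic fuzzy c-means on per-subject intercepts, slopes, and an auxiliary attribute.

```python
import numpy as np

rng = np.random.default_rng(7)

def fuzzy_cmeans(X, c=3, m=2.0, iters=100, tol=1e-6):
    """Basic fuzzy c-means: returns soft memberships U (n x c) and centroids."""
    U = rng.dirichlet(np.ones(c), size=len(X))             # random soft start
    for _ in range(iters):
        W = U ** m                                         # fuzzified weights
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2) + 1e-12
        U_new = 1.0 / d ** (2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return U_new, centroids
        U = U_new
    return U, centroids

# Hypothetical inputs: per-subject intercept and slope estimates (which the
# combined approach would take from the parallel mixture model) plus one
# auxiliary feature attribute, stacked as columns.
X = np.column_stack([rng.normal(0, 1, 150),                # intercepts
                     np.repeat([-0.5, 0.0, 0.8], 50) + rng.normal(0, 0.1, 150),
                     rng.normal(0, 1, 150)])               # auxiliary attribute
U, cent = fuzzy_cmeans(X, c=3)
labels = U.argmax(axis=1)                                  # hardened growth patterns
```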
Missing data are a major problem in the behavioral neurosciences, particularly when data collection is costly. Researchers often exclude cases with missing data, which can result in biased estimates and reduced power. Deleting a case because of a single missing data point can be avoided, but implementing a naïve missing data method can produce distorted estimates and incorrect conclusions. New approaches for handling missing data have been developed, but these techniques are not typically included in undergraduate research methods texts. The topic of missing data techniques would be useful for teaching research methods and for helping students with their research projects. This paper aimed to illustrate that estimating missing data is often more efficacious than complete case analysis, otherwise known as listwise deletion. Longitudinal data were obtained from an experiment examining the effects of an anorectic drug on food consumption in a small sample (n=17) of rats. The complete dataset was degraded by removing a percentage of data points (1–5%, 10%). Four missing data techniques (listwise deletion, mean substitution, regression imputation, and expectation-maximization (EM)) were applied to all six datasets, ensuring that each approach was applied to the same missing data points. P-values, effect sizes, and Bayes factors were computed. The results demonstrated that listwise deletion was the least effective method, while EM and regression imputation were the preferred methods when more than 5% of the data were missing. Based on these findings, it is recommended that researchers avoid listwise deletion and consider alternative missing data techniques.
missing data; imputation; expectation maximization; listwise deletion; mean substitution; regression
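A condensed sketch of the comparison (simulated stand-in data; scikit-learn's IterativeImputer serves as a regression-based, EM-style imputer, not the exact routines used in the paper):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(8)

# Toy stand-in for the experiment: 17 subjects, intake at 4 time points.
n = 17
df = pd.DataFrame(rng.normal(25, 3, (n, 4)), columns=["t1", "t2", "t3", "t4"])
df["drug"] = [0] * 8 + [1] * 9
df.loc[df.sample(3, random_state=0).index, "t3"] = np.nan  # degrade the data

def drug_effect_p(d):
    """P-value for a drug-vs-control comparison at the degraded time point."""
    return stats.ttest_ind(d.loc[d.drug == 0, "t3"],
                           d.loc[d.drug == 1, "t3"]).pvalue

cols = ["t1", "t2", "t3", "t4"]
mean_sub = df.copy()
mean_sub[cols] = SimpleImputer(strategy="mean").fit_transform(df[cols])
em_style = df.copy()
em_style[cols] = IterativeImputer(random_state=0).fit_transform(df[cols])

for name, d in [("listwise deletion", df.dropna()),
                ("mean substitution", mean_sub),
                ("regression/EM-style", em_style)]:
    print(name, round(drug_effect_p(d), 3))
```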
Attrition from mortality is common in longitudinal studies of the elderly. Ignoring the resulting non-response or missing data can bias study results.
A total of 1,260 elderly participants underwent biennial follow-up assessments over 10 years; many missed one or more assessments over this period. We compared three statistical models to evaluate the impact of missing data on an analysis of depressive symptoms over time. The first analytic model (a generalized mixed model) treated non-response as data missing at random. The other two models used shared parameter methods; each specified the dropout process differently, but both jointly modeled the outcome and dropout through a common random effect.
The presence of depressive symptoms was associated with being female, having less education, functional impairment, using more prescription drugs, and taking antidepressant drugs. In all three models, the same variables were significantly associated with depression and in the same direction. However, the strength of the associations differed widely between the generalized mixed model and the shared parameter models. Although the two shared parameter models had different assumptions about the dropout process, they yielded similar estimates for the outcome. One model fitted the data better, and the other was computationally faster.
Dropout does not occur randomly in longitudinal studies of the elderly. Thus, simply ignoring it can yield biased results. Shared parameter models are a powerful, flexible, and easily implemented tool for analyzing longitudinal data while minimizing bias due to nonrandom attrition.
discrete failure time model; dropout; non-ignorable nonresponse; shared parameter model; Weibull model
Retaining participants in cohort studies with multiple follow-up waves is difficult. Commonly, researchers are faced with the problem of missing data, which may introduce biased results as well as a loss of statistical power and precision. The STROBE guidelines (von Elm et al., Lancet, 370:1453-1457, 2007; Vandenbroucke et al., PLoS Med, 4:e297, 2007) and the guidelines proposed by Sterne et al. (BMJ, 338:b2393, 2009) recommend that cohort studies report the amount of missing data, the reasons for non-participation and non-response, and the method used to handle missing data in the analyses. We conducted a review of publications from cohort studies to document the reporting of missing data for exposure measures and to describe the statistical methods used to account for the missing data.
A systematic search of English language papers published from January 2000 to December 2009 was carried out in PubMed. Prospective cohort studies with a sample size greater than 1,000 that analysed data using repeated measures of exposure were included.
Among the 82 papers meeting the inclusion criteria, only 35 (43%) reported the amount of missing data according to the suggested guidelines. Sixty-eight papers (83%) described how they dealt with missing data in the analysis. Most of the papers excluded participants with missing data and performed a complete-case analysis (n = 54, 66%). Other papers used more sophisticated methods including multiple imputation (n = 5) or fully Bayesian modeling (n = 1). Methods known to produce biased results were also used, for example, Last Observation Carried Forward (n = 7), the missing indicator method (n = 1), and mean value substitution (n = 3). For the remaining 14 papers, the method used to handle missing data in the analysis was not stated.
This review highlights the inconsistent reporting of missing data in cohort studies and the continuing use of inappropriate methods to handle missing data in the analysis. Epidemiological journals should invoke the STROBE guidelines as a framework for authors, so that the amount of missing data, and how it was accounted for in the analysis, is reported transparently in cohort studies.
Longitudinal cohort studies; Missing exposure data; Repeated exposure measurement; Missing data methods; Reporting