We propose Bayesian parametric and semiparametric partially linear regression methods to analyze outcome-dependent follow-up data, in which the random time of a follow-up measurement of an individual depends on the history of both observed longitudinal outcomes and previous measurement times. We begin by investigating the simplifying assumptions of Lipsitz, Fitzmaurice, Ibrahim, Gelber, and Lipshultz, and present a new model for analyzing such data that allows subject-specific correlations for the longitudinal response and introduces a subject-specific latent variable to accommodate the association between the longitudinal measurements and the follow-up times. An extensive simulation study shows that our Bayesian partially linear regression method yields accurate estimation of the true regression line and the regression parameters. We illustrate the new methodology using data from a longitudinal observational study.
Bayesian cubic smoothing spline; Latent variable; Partially linear model
We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269–276) with covariates missing at random. We investigate the smoothly clipped absolute deviation (SCAD) penalty and the adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed that simultaneously optimizes the penalized likelihood function and the penalty parameters. We also optimize a model selection criterion, the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658), to estimate the penalty parameters, and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite-sample performance of the penalized estimates, and two lung cancer data sets are analyzed to demonstrate the proposed methodology.
ALASSO; Missing data; Partial likelihood; Penalized likelihood; Proportional hazards model; SCAD; Variable selection
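The SCAD penalty named in the keywords has a simple closed form. As a point of reference, here is a minimal sketch of it (not code from the paper; the conventional tuning constant a = 3.7 of Fan and Li is assumed):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty: linear near zero (LASSO-like), quadratic transition,
    then constant, so large coefficients are not over-shrunk."""
    theta = np.abs(theta)
    linear = lam * theta
    quad = (2 * a * lam * theta - theta**2 - lam**2) / (2 * (a - 1))
    const = lam**2 * (a + 1) / 2
    return np.where(theta <= lam, linear,
                    np.where(theta <= a * lam, quad, const))
```

The three pieces meet continuously at theta = lam and theta = a*lam, which is what allows SCAD to combine LASSO-style sparsity near zero with nearly unbiased estimation of large coefficients.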
Missing data are a prevailing problem in any type of data analysis. A participant variable is considered missing if the value of the variable (outcome or covariate) for that participant is not observed. In this article, various issues in analyzing studies with missing data are discussed. In particular, we focus on missing response and/or covariate data for studies with discrete, continuous, or time-to-event end points in which generalized linear models, models for longitudinal data such as generalized linear mixed effects models, or Cox regression models are used. We discuss various classifications of missing data that may arise in a study and demonstrate in several situations that the commonly used method of throwing out all participants with any missing data may lead to incorrect results and conclusions. The methods described are applied to data from an Eastern Cooperative Oncology Group phase II clinical trial of liver cancer and a phase III clinical trial of advanced non–small-cell lung cancer. Although the main area of application discussed here is cancer, the issues and methods we discuss apply to any type of study.
Longitudinal designs in psychiatric research have many benefits, including the ability to measure the course of a disease over time. However, measuring participants repeatedly over time also leads to repeated opportunities for missing data, either through failure to answer certain items, missed assessments, or permanent withdrawal from the study. To avoid bias and loss of information, one should take missing values into account in the analysis. Several popular ways that are now being used to handle missing data, such as the last observation carried forward (LOCF), often lead to incorrect analyses. We discuss a number of these popular but unprincipled methods and describe modern approaches to classifying and analyzing data with missing values. We illustrate these approaches using data from the WECare study, a longitudinal randomized treatment study of low income women with depression.
We consider selecting both fixed and random effects in a general class of mixed effects models using maximum penalized likelihood (MPL) estimation along with the smoothly clipped absolute deviation (SCAD) and adaptive LASSO (ALASSO) penalty functions. The maximum penalized likelihood estimates are shown to possess consistency and sparsity properties and asymptotic normality. A model selection criterion, the ICQ statistic, is proposed for selecting the penalty parameters (Ibrahim, Zhu, and Tang, 2008). The variable selection procedure based on ICQ is shown to consistently select important fixed and random effects. The methodology is very general and can be applied to numerous situations involving random effects, including generalized linear mixed models. Simulation studies and a real data set from a Yale infant growth study are used to illustrate the proposed methodology.
ALASSO; Cholesky decomposition; EM algorithm; ICQ criterion; Mixed Effects selection; Penalized likelihood; SCAD
Objective: To evaluate the efficacy of primary intravitreal bevacizumab (IVB) injection for macular edema in diabetic patients, as assessed by improvement in best corrected visual acuity (BCVA) and central macular thickness (CMT) on optical coherence tomography (OCT).
Methods: This prospective interventional case series was conducted at the Retina Clinic, Al-Ibrahim Eye Hospital, Isra Postgraduate Institute of Ophthalmology, Karachi, between December 2010 and June 2012. BCVA measurement with Early Treatment Diabetic Retinopathy Study (ETDRS) charts and ophthalmic examination, including slit-lamp biomicroscopy, indirect ophthalmoscopy, fundus fluorescein angiography (FFA), and OCT, were done at the baseline examination. All patients were treated with three intravitreal injections of 0.05 ml containing 1.25 mg bevacizumab, given at monthly intervals. Patients were followed up for 6 months, and BCVA and OCT were taken at the final visit at 6 months.
Results: The mean BCVA at baseline was 0.42±0.14 logMAR units. This improved to 0.34±0.13, 0.25±0.12, 0.17±0.12, and 0.16±0.14 logMAR units at 1 month after the 1st, 2nd, and 3rd injections and at the final visit at 6 months, respectively, a difference from baseline that was statistically significant (P<0.0001). The mean 1-mm CMT was 452.9 ± 143.1 µm at baseline, improving to 279.8 ± 65.2 µm (P<0.0001) at the final visit. No serious complications were observed.
Conclusions: Primary IVB at a dose of 1.25 mg at monthly intervals appears to provide stability and improvement in BCVA and CMT in patients with DME.
Best Corrected Visual Acuity (BCVA); Central Macular Thickness (CMT); Diabetic Macular Edema (DME); Intravitreal Bevacizumab (IVB)
Objective: To assess the visual outcome and complications in patients after ab-externo scleral fixation of an intraocular lens in the pediatric age group (15 years or less).
Methods: This quasi-experimental study was conducted at Isra Postgraduate Institute of Ophthalmology, Al-Ibrahim Eye Hospital, Karachi, from January 2012 to December 2012. All cases included were worked up according to the protocol. All patients underwent ab-externo scleral fixation of an IOL under general anesthesia. Patients were followed up at the 1st day, 1st week, 1st month, 2nd month, and 3rd month. A complete eye examination, including best-corrected visual acuity, was performed and complications were noted at each visit.
Results: Thirty patients were included in the study, with a mean age of 8.6 years (±3.94). Most of the patients, 20 (66.7%), had visual acuities of 6/18 or better. No intraoperative complication was seen in 18 (60%) of the patients, while soft eye was observed in 7 (23.3%) and vitreous hemorrhage in 5 (16.7%). The most common postoperative complication was uveitis, followed by astigmatism; lens dislocation and iris abnormalities were each seen in only one patient. Most of the patients showed significant visual improvement after surgery.
Conclusion: Ab-externo scleral fixation of an IOL was found to be safe and showed favorable postoperative results with fewer complications.
Astigmatism; Complication; Scleral fixation
Longitudinal studies often feature incomplete response and covariate data. Likelihood-based methods such as the expectation–maximization algorithm give consistent estimators for model parameters when data are missing at random (MAR) provided that the response model and the missing covariate model are correctly specified; however, we do not need to specify the missing data mechanism. An alternative method is the weighted estimating equation, which gives consistent estimators if the missing data and response models are correctly specified; however, we do not need to specify the distribution of the covariates that have missing values. In this article, we develop a doubly robust estimation method for longitudinal data with missing response and missing covariate when data are MAR. This method is appealing in that it can provide consistent estimators if either the missing data model or the missing covariate model is correctly specified. Simulation studies demonstrate that this method performs well in a variety of situations.
Doubly robust; Estimating equation; Missing at random; Missing covariate; Missing response
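To make the doubly robust idea concrete, here is a toy sketch (not the authors' procedure) of an augmented inverse-probability-weighted estimator of a mean under a MAR response. The propensity model is taken as known while the outcome model is deliberately misspecified as a constant; the doubly robust estimator nevertheless remains approximately unbiased, whereas the complete-case mean does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)            # true mean of y is 2.0
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))       # MAR: P(observe y) depends on x
r = (rng.random(n) < pi).astype(float)      # r = 1 -> y observed

# Complete-case mean is biased because observation probability rises with x
cc_mean = y[r == 1].mean()

# Outcome model deliberately misspecified (constant fit to the observed y);
# the propensity model is correct, so the AIPW estimator stays consistent
m_x = np.full(n, y[r == 1].mean())
aipw = np.mean(r * y / pi + (1 - r / pi) * m_x)
```

Swapping the roles (correct outcome model, wrong propensity) gives the same protection, which is the "doubly robust" property the abstract describes.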
Missing data are a common problem in longitudinal data analysis. Two types of missing-data patterns are generally considered in the statistical literature: monotone and non-monotone. Non-monotone missing data occur when study participants intermittently miss scheduled visits, while monotone missing data can arise from discontinued participation, loss to follow-up, and mortality. Although many novel statistical approaches have been developed to handle missing data in recent years, few methods are available to provide inferences that handle both types of missing data simultaneously. In this article, a latent random effects model is proposed to analyze longitudinal outcomes with both monotone and non-monotone missingness in the context of missing not at random (MNAR). Another significant contribution of this paper is a new computational algorithm for latent random effects models: to reduce the burden of the high-dimensional integration in such models, it uses a new adaptive quadrature approach in conjunction with a Taylor series approximation of the likelihood function to simplify the E-step computation in the EM algorithm. A simulation study is performed, and data from the Scleroderma lung study are used to demonstrate the effectiveness of the method.
Adaptive quadrature; Missing not at random; Joint model; Scleroderma study
Even in a well-designed and controlled study, missing data occur in almost all research. Missing data can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions. This manuscript reviews the problems and types of missing data, along with techniques for handling them. The mechanisms by which missing data arise are illustrated, and methods for handling missing data are discussed. The paper concludes with recommendations for the handling of missing data.
Expectation-Maximization; Imputation; Missing data; Sensitivity analysis
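As a minimal illustration of the Expectation-Maximization approach listed in the keywords (a sketch for this compilation, not code from the review), the following fits a bivariate normal by EM when one variable is partially missing at random. The EM estimate of E[y] is approximately unbiased, while the complete-case mean is not, because missingness depends on the fully observed x:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + 0.6 * rng.normal(size=n)   # true E[y] = 1.0
obs = rng.random(n) < 1 / (1 + np.exp(-x))     # MAR: y seen more when x is large

mu = np.array([0.0, 0.0])   # mean of (x, y)
S = np.eye(2)               # covariance of (x, y)
for _ in range(100):
    # E-step: conditional mean/variance of each missing y given x
    m = mu[1] + S[0, 1] / S[0, 0] * (x - mu[0])
    v = S[1, 1] - S[0, 1] ** 2 / S[0, 0]
    yhat = np.where(obs, y, m)
    y2hat = np.where(obs, y ** 2, m ** 2 + v)
    # M-step: update parameters from expected sufficient statistics
    mu = np.array([x.mean(), yhat.mean()])
    S = np.array([
        [np.mean(x ** 2) - mu[0] ** 2, np.mean(x * yhat) - mu[0] * mu[1]],
        [0.0, np.mean(y2hat) - mu[1] ** 2],
    ])
    S[1, 0] = S[0, 1]

cc_mean = y[obs].mean()   # complete-case mean: biased under this mechanism
em_mean = mu[1]           # EM (maximum likelihood) estimate of E[y]
```

The key step is imputing the conditional second moment m² + v, not just the conditional mean, so the covariance update is not attenuated.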
In this paper, we consider theoretical and computational connections between six popular methods for variable subset selection in generalized linear models (GLMs). Under the conjugate priors developed by Chen and Ibrahim (2003) for the generalized linear model, we obtain closed-form analytic relationships between the Bayes factor (posterior model probability), the Conditional Predictive Ordinate (CPO), the L measure, the Deviance Information Criterion (DIC), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) in the case of the linear model. Moreover, we examine computational relationships in the model space for these Bayesian methods for an arbitrary GLM under conjugate priors, and examine the performance of the conjugate priors of Chen and Ibrahim (2003) in Bayesian variable selection. Specifically, we show that once Markov chain Monte Carlo (MCMC) samples are obtained from the full model, the four Bayesian criteria can be simultaneously computed for all possible subset models in the model space. We illustrate the new methodology with a simulation study and a real dataset.
Bayes factor; Conditional Predictive Ordinate; Conjugate prior; L measure; Poisson regression; Logistic regression
Using an appropriate method to handle cases with missing data when performing secondary analyses of survey data is important to reduce bias and to reach valid conclusions for the target population. Many published secondary analyses using child health data sets do not discuss the technique employed to treat missing data or simply delete cases with missing data. Missing data may threaten statistical power by reducing sample size or, in more extreme situations, estimates derived by deleting cases with missing values may be biased, particularly if the cases with missing values are systematically different from those with complete data. The aim of this study was to determine which of 4 techniques for handling missing data most closely estimates the true model coefficient when varying proportions of cases are missing data.
We performed a simulation study to compare model coefficients when all cases had complete data and when 4 techniques for handling missing data were employed with 10%, 20%, 30% or 40% of the cases missing data.
When more than 10% of the cases had missing data, the re-weight and multiple imputation techniques were superior to dropping cases with missing scores or hot deck imputation.
These findings suggest that child health researchers should use caution when analyzing survey data if a large percentage of cases have missing values. In most situations, the technique of dropping cases with missing data should be discouraged. Investigators should consider re-weighting or multiple imputation, if a large percentage of cases are missing data.
missing data; non-response bias; secondary analysis; hot deck imputation; weighting; multiple imputation
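Of the four techniques compared above, hot deck imputation is the simplest to sketch. The following toy version (illustrative only; a production hot deck would draw donors within adjustment classes rather than from the whole sample) fills each missing value with a random draw from the observed donors:

```python
import numpy as np

def hot_deck_impute(values, rng):
    """Fill each NaN with a random draw (with replacement) from observed donors."""
    values = values.copy()
    missing = np.isnan(values)
    donors = values[~missing]
    values[missing] = rng.choice(donors, size=missing.sum(), replace=True)
    return values

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, size=5000)
y_mcar = y.copy()
y_mcar[rng.random(5000) < 0.3] = np.nan   # 30% missing completely at random
y_imp = hot_deck_impute(y_mcar, rng)
```

Under MCAR this preserves the marginal distribution; the study's point is that under less benign mechanisms, re-weighting or multiple imputation recovers the true coefficients more reliably.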
In many longitudinal studies, evaluating the effect of a binary or continuous predictor variable on the rate of change of the outcome, i.e., the slope, is of primary interest. Sample size determination for these studies, however, is complicated by the expectation that missing data will occur due to missed visits, early dropout, and staggered entry. Despite the availability of methods for assessing power in longitudinal studies with missing data, the impact on power of the magnitude and distribution of missing data in the study population remains poorly understood. As a result, simple but erroneous alterations of the sample size formulae for complete/balanced data are commonly applied. These "naive" approaches include the average sum of squares (ASQ) and average number of subjects (ANS) methods. The goal of this paper is to explore in greater detail the effect of missing data on study power and to compare the performance of the naive sample size methods to a correct maximum likelihood based method, using both mathematical and simulation based approaches. Two different longitudinal aging studies are used to illustrate the methods.
Missing data often occur in cross-sectional surveys and in longitudinal and experimental studies. The purpose of this study was to compare the prediction of self-rated health (SRH), a robust predictor of morbidity and mortality among diverse populations, before and after imputation of the missing variable "yearly household income." We reviewed data from 4,162 participants of Mexican origin recruited from July 1, 2002, through December 31, 2005, who were enrolled in a population-based cohort study. Missing yearly income data were imputed using three different single imputation methods and one multiple imputation under a Bayesian approach. Of the 4,162 participants, 3,121 were randomly assigned to a training set (to derive the yearly income imputation methods and develop the health-outcome prediction models) and 1,041 to a testing set (to compare the areas under the curve (AUC) of the receiver-operating characteristic of the resulting health-outcome prediction models). The discriminatory powers of the SRH prediction models were good (range, 69–72%), and compared with the prediction model obtained with no imputation of missing yearly income, all other imputation methods improved the prediction of SRH (P<0.05 for all comparisons), with the AUC for the model after multiple imputation being the highest (AUC = 0.731). Furthermore, when yearly income was imputed using multiple imputation, the odds of rating SRH as good or better increased by 11% for each $5,000 increment in yearly income. This study showed that although imputation of missing data for a key predictor variable can improve a risk health-outcome prediction model, further work is needed to illuminate the risk factors associated with SRH.
Self-rated health; Missing income data; Data imputation techniques; Mean substitution; Multiple imputation; Minority health
Longitudinal data often contain missing observations and error-prone covariates. Extensive attention has been directed to analysis methods to adjust for the bias induced by missing observations. There is relatively little work on investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. It is not clear what the impact of ignoring measurement error is when analyzing longitudinal data with both missing observations and error-prone covariates. In this article, we study the effects of covariate measurement error on estimation of the response parameters for longitudinal studies. We develop an inference method that adjusts for the biases induced by measurement error as well as by missingness. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and variance structures. Furthermore, the proposed method employs the so-called functional modeling strategy to handle the covariate process, with the distribution of covariates left unspecified. These features, plus the simplicity of implementation, make the proposed method very attractive. In this paper, we establish the asymptotic properties for the resulting estimators. With the proposed method, we conduct sensitivity analyses on a cohort data set arising from the Framingham Heart Study. Simulation studies are carried out to evaluate the impact of ignoring covariate measurement error and to assess the performance of the proposed method.
Estimating equations; Longitudinal data; Measurement error; Missing data; Simulation and extrapolation method
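The "simulation and extrapolation" (SIMEX) keyword refers to a generic recipe that can be sketched in a few lines. The following toy version (not the authors' method; the measurement-error variance is assumed known) adds progressively more noise to the error-prone covariate, tracks the resulting naive slope, and extrapolates back to the no-error case:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
sigma_u = 0.8                              # known measurement-error SD (assumption)
x = rng.normal(size=n)                     # true covariate
y = 1.0 * x + rng.normal(size=n)           # true slope is 1.0
w = x + sigma_u * rng.normal(size=n)       # observed, error-prone covariate

def naive_slope(w_lam, y):
    c = np.cov(w_lam, y)
    return c[0, 1] / c[0, 0]

lams = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
slopes = []
for lam in lams:
    # average over B remeasured data sets with extra noise of variance lam*sigma_u^2
    b_slopes = [naive_slope(w + np.sqrt(lam) * sigma_u * rng.normal(size=n), y)
                for _ in range(20)]
    slopes.append(np.mean(b_slopes))

# quadratic extrapolation back to lam = -1, the (pseudo) error-free data
coef = np.polyfit(lams, slopes, 2)
simex_slope = np.polyval(coef, -1.0)
```

The naive slope at lam = 0 is attenuated toward zero by the measurement error; the quadratic extrapolant recovers much (though, being an approximation, not all) of the attenuation.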
Carotid intima-media thickness (CIMT) measurements have been widely used as primary endpoints in studies of the effects of new interventions, as a surrogate for cardiovascular morbidity and mortality. There are no accepted standards on the use of CIMT measurements in intervention studies, and choices in the design and analysis of a CIMT study are generally based on experience and expert opinion. In the present review, we provide an overview of the current evidence on several aspects of the design and analysis of a CIMT study on the early effects of new interventions.
Summary of Issues
We provide a balanced evaluation of the carotid segments, carotid walls, and image views to be used as the CIMT study endpoint; the reading method (manual or semi-automated, continuous or in batch) to be employed; the required sample size; and the frequency of ultrasound examinations. We also discuss the preferred methods to analyse longitudinal CIMT data and address the possible impact of, and methods to deal with, missing and biologically implausible CIMT values.
Linear mixed effects models are the preferred way to analyse CIMT data, as they appropriately handle missing and biologically implausible CIMT values. Furthermore, we recommend using extensive CIMT designs that measure CIMT at regular time points at multiple carotid sites, as such an approach is likely to increase the success rates of CIMT intervention studies designed to evaluate the effects of new interventions on atherosclerotic burden.
Carotid intima-media thickness; Trials; Study design; Data analysis; Atherosclerosis
Missing data often cause problems in longitudinal cohort studies with repeated follow-up waves. Research in this area has focussed on analyses with missing data in repeated measures of the outcome, from which participants with missing exposure data are typically excluded. We performed a simulation study to compare complete-case analysis with multiple imputation (MI) for dealing with missing data in an analysis of the association of waist circumference, measured at two waves, with the risk of colorectal cancer (a completely observed outcome).
We generated 1,000 datasets of 41,476 individuals with values of waist circumference at waves 1 and 2 and times to the events of colorectal cancer and death to resemble the distributions of the data from the Melbourne Collaborative Cohort Study. Three proportions of missing data (15%, 30%, and 50%) were imposed on waist circumference at wave 2 using three missing data mechanisms: Missing Completely at Random (MCAR), and realistic and more extreme covariate-dependent Missing at Random (MAR) scenarios. We assessed the impact of missing data on two epidemiological analyses: 1) the association between change in waist circumference between waves 1 and 2 and the risk of colorectal cancer, adjusted for waist circumference at wave 1; and 2) the association between waist circumference at wave 2 and the risk of colorectal cancer, not adjusted for waist circumference at wave 1.
We observed very little bias for complete-case analysis or MI under all missing data scenarios, and the resulting coverage of interval estimates was near the nominal 95% level. MI showed gains in precision when waist circumference was included as a strong auxiliary variable in the imputation model.
This simulation study, based on data from a longitudinal cohort study, demonstrates that there is little gain in performing MI compared to a complete-case analysis in the presence of up to 50% missing data for the exposure of interest when the data are MCAR, or missing dependent on covariates. MI will result in some gain in precision if a strong auxiliary variable that is not in the analysis model is included in the imputation model.
Simulation study; Missing exposure; Multiple imputation; Complete-case analysis; Repeated exposure measurement
An Approximate Bayesian Bootstrap (ABB) offers advantages in incorporating appropriate uncertainty when imputing missing data, but most implementations of the ABB have lacked the ability to handle nonignorable missing data where the probability of missingness depends on unobserved values. This paper outlines a strategy for using an ABB to multiply impute nonignorable missing data. The method allows the user to draw inferences and perform sensitivity analyses when the missing data mechanism cannot automatically be assumed to be ignorable. Results from imputing missing values in a longitudinal depression treatment trial as well as a simulation study are presented to demonstrate the method’s performance. We show that a procedure that uses a different type of ABB for each imputed data set accounts for appropriate uncertainty and provides nominal coverage.
Not Missing at Random; NMAR; Multiple Imputation; Hot-Deck
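The core ABB step is short enough to sketch. The toy univariate version below (illustrative only, not the paper's nonignorable extension) resamples the donor pool before drawing imputations, which injects the between-imputation uncertainty that a plain hot deck omits:

```python
import numpy as np

def abb_impute(observed, n_missing, rng):
    """One ABB imputation: resample the donors (an approximate posterior draw
    of the donor distribution), then draw the imputations from that resample."""
    donors = rng.choice(observed, size=len(observed), replace=True)
    return rng.choice(donors, size=n_missing, replace=True)

rng = np.random.default_rng(3)
observed = rng.normal(loc=5.0, size=2000)
imputed_sets = [abb_impute(observed, 500, rng) for _ in range(10)]  # 10 imputations
```

Repeating the call gives multiple imputed data sets whose extra between-set variability is what makes Rubin's combining rules give approximately valid interval estimates.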
Randomized clinical trials are the gold standard for evaluating interventions, as randomized assignment equalizes known and unknown characteristics between intervention groups. However, when participants miss visits, the ability to conduct an intent-to-treat analysis and draw conclusions about a causal link is compromised. As guidance to those performing clinical trials, this review is a non-technical overview of the consequences of missing data and a prescription for its treatment that extends beyond the typical analytic approaches to the entire research process. Examples of bias from incorrect analysis with missing data and discussion of the advantages and disadvantages of analytic methods are given. As no single analysis is definitive when missing data occur, strategies for its prevention throughout the course of a trial are presented. We aim to convey an appreciation for how missing data influence results and an understanding of the need for careful consideration of missing data during the design, planning, conduct, and analytic stages.
missing data; clinical trial; intent to treat; MCAR; MAR; MNAR; study design
Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation, and discuss their application to GWAS. I discuss common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, as well as some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance.
genotype imputation; HapMap; GWAS
In the last two decades, predictive testing programs have become available for various hereditary diseases, often accompanied by follow-up studies on the psychological effects of test outcomes. The aim of this systematic literature review is to describe and evaluate the statistical methods used in these follow-up studies. A literature search revealed 40 longitudinal quantitative studies that met the selection criteria for the review. Fifteen studies (38%) applied adequate statistical methods; the majority, 25 studies, applied less suitable statistical techniques. Nine studies (23%) did not report the dropout rate, and 18 studies provided no characteristics of the dropouts. Thirteen of the 22 studies that should have provided data on missing values actually reported on them. It is concluded that many studies could have yielded more and better results if more appropriate methodology had been used.
Biomedical research is plagued with problems of missing data, especially in clinical trials of medical and behavioral therapies adopting longitudinal design. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; then, each partially imputed data set is analyzed to deal with dropouts with or without further imputation. Depending on the choice of imputation model or measurement model, there exist various strategies that can be jointly applied to the same set of data to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.
multiple partial imputation; selection model; pattern-mixture model; Markov transition model; nonignorable dropout; intermittent missing values
Nonignorable missing data are a common problem in longitudinal studies. Latent class models are attractive for simplifying the modeling of missing data when the data are subject to either a monotone or an intermittent missing data pattern. In this study, we propose a new two-latent-class model for categorical data with informative dropout, dividing the observed data into two latent classes: one in which the outcomes are deterministic, and a second in which the outcomes can be modeled using logistic regression. In the model, the latent classes connect the longitudinal responses and the missingness process under the assumption of conditional independence. Parameters are estimated by maximum likelihood based on the above assumptions and the tetrachoric correlation between responses within the same subject. We compare the proposed method with the shared parameter model and the weighted GEE model using the areas under the ROC curves in simulations and in an application to a smoking cessation data set. The simulation results indicate that the proposed two-latent-class model performs well under different missingness processes, and the application results show that the proposed method outperforms the shared parameter model and the weighted GEE model.
Area under ROC curve; Informative dropout; Latent class; Tetrachoric correlation
A significant source of missing data in longitudinal epidemiologic studies of elderly individuals is death, and it is generally believed that these data missing by death are non-ignorable for likelihood-based inference. Inference based on data only from surviving participants may lead to biased results. In this paper we model both the probability of disease and the probability of death using shared random effect parameters. We also propose to use the Laplace approximation to obtain an approximate likelihood function, so that high-dimensional integration over the distributions of the random effect parameters is not necessary. Parameter estimates can be obtained by maximizing the approximate log-likelihood function. Data from a longitudinal dementia study are used to illustrate the approach. A small simulation compares parameter estimates from the proposed method to those from the 'naive' method, in which the missing data are treated as missing at random.
Missing data are a very common problem in medical and social studies, especially when data are collected longitudinally, and it is challenging to use the observed data effectively. Many papers on missing data problems can be found in the statistical literature. It is well known that inverse probability weighted estimation is neither efficient nor robust, whereas the doubly robust (DR) method can improve both efficiency and robustness. DR estimation requires a missing data model (i.e., a model for the probability that data are observed) and a working regression model (i.e., a model for the outcome variable given covariates and surrogate variables). Because the DR estimating function has mean zero for any parameters in the working regression model when the missing data model is correctly specified, we derive a formula for the estimator of the parameters of the working regression model that yields the optimally efficient estimator of the marginal mean model (the parameters of interest) when the missing data model is correctly specified. The proposed method also inherits the DR property. Simulation studies demonstrate the greater efficiency of the proposed method compared with the standard DR method, and a longitudinal dementia data set is used for illustration.
longitudinal data; missing data; optimal; surrogate outcome