Within the pattern-mixture modeling framework for informative dropout, conditional linear models (CLMs) are a useful approach to deal with dropout that can occur at any point in continuous time (not just at observation times). However, in contrast with selection models, inferences about marginal covariate effects in CLMs are not readily available if nonidentity links are used in the mean structures. In this article, we propose a CLM for long series of longitudinal binary data with marginal covariate effects directly specified. The association between the binary responses and the dropout time is taken into account by modeling the conditional mean of the binary response as well as the dependence between the binary responses given the dropout time. Specifically, parameters in both the conditional mean and dependence models are assumed to be linear or quadratic functions of the dropout time; and the continuous dropout time distribution is left completely unspecified. Inference is fully Bayesian. We illustrate the proposed model using data from a longitudinal study of depression in HIV-infected women, where the strategy of sensitivity analysis based on the extrapolation method is also demonstrated.
Bayesian analysis; HIV/AIDS; Marginal model; Missing data; Sensitivity analysis
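The core idea above — conditional-mean coefficients that are linear functions of the dropout time — can be illustrated with a minimal numerical sketch. This is not the article's fully Bayesian procedure; it is a simplified frequentist analogue on simulated data (all names and values hypothetical), fitting a logistic model whose design includes dropout-time interactions via plain Newton/IRLS:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated illustration: binary responses whose logit-scale coefficients
# are linear functions of the dropout time U, in the spirit of a
# conditional linear model for binary data.
n = 500
U = rng.uniform(1, 5, n)            # continuous dropout time
x = rng.normal(size=n)              # a baseline covariate
# true conditional coefficients: intercept and slope both linear in U
eta = (-1.0 + 0.3 * U) + (0.5 - 0.1 * U) * x
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# design with dropout-time interactions: [1, U, x, x*U]
X = np.column_stack([np.ones(n), U, x, x * U])

def irls_logistic(X, y, iters=25):
    """Plain Newton/IRLS for logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

beta_hat = irls_logistic(X, y)
print(np.round(beta_hat, 2))
```

Marginalizing the fitted conditional means over the dropout-time distribution would then yield marginal effects, which is where the article's machinery goes beyond this sketch.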
Dropout is a common occurrence in longitudinal studies. Building upon the pattern-mixture modeling approach within the Bayesian paradigm, we propose a general framework of varying-coefficient models for longitudinal data with informative dropout, where measurement times can be irregular and dropout can occur at any point in continuous time (not just at observation times) together with administrative censoring. Specifically, we assume that the longitudinal outcome process depends on the dropout process through its model parameters. The unconditional distribution of the repeated measures is a mixture over the dropout (administrative censoring) time distribution, and the continuous dropout time distribution with administrative censoring is left completely unspecified. We use Markov chain Monte Carlo to sample from the posterior distribution of the repeated measures given the dropout (administrative censoring) times; Bayesian bootstrapping on the observed dropout (administrative censoring) times is carried out to obtain marginal covariate effects. We illustrate the proposed framework using data from a longitudinal study of depression in HIV-infected women; the strategy for sensitivity analysis of unverifiable assumptions is also demonstrated.
HIV/AIDS; Missing data; Nonparametric regression; Penalized splines
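The Bayesian bootstrap step mentioned above can be sketched in a few lines: the marginal quantity is a mixture over the empirical dropout-time distribution, with Dirichlet(1,...,1) weights on the observed dropout times standing in for the unspecified dropout distribution. The data and the conditional-mean function below are hypothetical placeholders; in practice the conditional mean comes from the fitted posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed dropout times and an assumed-known conditional mean
# of the response given dropout time (for illustration only).
dropout_times = rng.exponential(2.0, size=200)
cond_mean = lambda u: 1.0 + 0.5 * u

draws = []
for _ in range(2000):
    # Bayesian bootstrap: flat-Dirichlet weights on the observed times
    w = rng.dirichlet(np.ones(len(dropout_times)))
    draws.append(np.sum(w * cond_mean(dropout_times)))
draws = np.array(draws)

print(draws.mean(), np.percentile(draws, [2.5, 97.5]))
```

The spread of the draws reflects uncertainty about the dropout-time distribution itself, which a plug-in empirical average would ignore.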
Dropout is common in longitudinal clinical trials, and when the probability of dropout depends on unobserved outcomes even after conditioning on available data, it is considered missing not at random and therefore nonignorable. To address this problem, mixture models can be used to account for the relationship between a longitudinal outcome and dropout. We propose a Natural Spline Varying-coefficient mixture model (NSV), which is a straightforward extension of the parametric Conditional Linear Model (CLM). We assume that the outcome follows a varying-coefficient model conditional on a continuous dropout distribution. Natural cubic B-splines are used to allow the regression coefficients to depend semiparametrically on dropout, and inference is therefore more robust. Additionally, this method is computationally stable and relatively simple to implement. We conduct simulation studies to evaluate performance and compare methodologies in settings where the longitudinal trajectories are linear and dropout time is observed for all individuals. Performance is assessed under conditions where model assumptions are both met and violated. In addition, we compare the NSV to the CLM and a standard random-effects model using an HIV/AIDS clinical trial with probable nonignorable dropout. The simulation studies suggest that the NSV is an improvement over the CLM when dropout has a nonlinear dependence on the outcome.
Dropout; Nonignorable Missing Data; Longitudinal data; Varying-coefficient model; B-spline; HIV/AIDS
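The varying-coefficient idea above — a regression coefficient that is an unknown smooth function of dropout time — can be sketched as follows. To keep the example self-contained it uses a truncated-power cubic spline basis rather than the NSV's natural cubic B-splines, and all data are simulated (hypothetical setup):

```python
import numpy as np

rng = np.random.default_rng(2)

# The coefficient of x depends nonlinearly on dropout time U.
n = 400
U = rng.uniform(0, 1, n)
x = rng.normal(size=n)
beta_true = lambda u: np.sin(2 * np.pi * u)       # nonlinear in dropout time
y = beta_true(U) * x + rng.normal(scale=0.3, size=n)

def spline_basis(u, knots):
    """Truncated-power cubic spline basis (simplified stand-in)."""
    cols = [np.ones_like(u), u, u**2, u**3]
    cols += [np.clip(u - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

B = spline_basis(U, knots=[0.25, 0.5, 0.75])
X = B * x[:, None]                                # varying-coefficient design
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# recovered coefficient function on a grid of dropout times
grid = np.linspace(0, 1, 5)
print(np.round(spline_basis(grid, [0.25, 0.5, 0.75]) @ coef, 2))
```

A parametric CLM would force the coefficient function to be, say, linear in U, which is exactly the restriction the spline basis relaxes.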
The analysis of longitudinal repeated measures data is frequently complicated by missing data due to informative dropout. We describe a mixture model for the joint distribution of longitudinal repeated measures, where the dropout distribution may be continuous and the dependence between response and dropout is semiparametric. Specifically, we assume that responses follow a varying coefficient random effects model conditional on dropout time, where the regression coefficients depend on dropout time through unspecified nonparametric functions that are estimated using step functions when dropout time is discrete (e.g., for panel data) and using smoothing splines when dropout time is continuous. Inference under the proposed semiparametric model is hence more robust than under the parametric conditional linear model. The unconditional distribution of the repeated measures is a mixture over the dropout distribution. We show that estimation in the semiparametric varying coefficient mixture model can proceed by fitting a parametric mixed effects model and can be carried out on standard software platforms such as SAS. The model is used to analyze data from a recent AIDS clinical trial and its performance is evaluated using simulations.
Clinical trials; Equivalence trial; Linear mixed model; Missing data; Nonignorable dropout; Pattern-mixture model; Pediatric AIDS; Selection bias; Smoothing splines
This paper uses a general latent variable framework to study a series of models for non-ignorable missingness due to dropout. Non-ignorable missing data modeling acknowledges that missingness may depend not only on covariates and observed outcomes at previous time points, as with the standard missing at random (MAR) assumption, but also on latent variables such as values that would have been observed (missing outcomes), developmental trends (growth factors), and qualitatively different types of development (latent trajectory classes). These alternative predictors of missing data can be explored in a general latent variable framework using the Mplus program. A flexible new model uses an extended pattern-mixture approach where missingness is a function of latent dropout classes in combination with growth mixture modeling using latent trajectory classes. A new selection model not only allows the outcomes to influence missingness, but also allows this influence to vary across latent trajectory classes. Recommendations are given for choosing models. The missing data models are applied to longitudinal data from STAR*D, the largest antidepressant clinical trial in the U.S. to date. Despite the importance of this trial, STAR*D growth model analyses using non-ignorable missing data techniques have not been explored until now. The STAR*D data are shown to feature distinct trajectory classes, including a low class corresponding to substantial improvement in depression, a minority class with a U-shaped curve corresponding to transient improvement, and a high class corresponding to no improvement. The analyses provide a new way to assess drug efficacy in the presence of dropout.
Latent trajectory classes; random effects; survival analysis; not missing at random
Dyadic data are common in the social and behavioral sciences, in which members of dyads are correlated due to the interdependence structure within dyads. The analysis of longitudinal dyadic data becomes complex when nonignorable dropouts occur. We propose a fully Bayesian selection-model-based approach to analyze longitudinal dyadic data with nonignorable dropouts. We model repeated measures on subjects by a transition model and account for within-dyad correlations by random effects. In the model, we allow a subject's outcome to depend on his/her own characteristics and measurement history, as well as those of the other member in the dyad. We further account for the nonignorable missing data mechanism using a selection model in which the probability of dropout depends on the missing outcome. We propose a Gibbs sampler algorithm to fit the model. Simulation studies show that the proposed method effectively addresses the problem of nonignorable dropouts. We illustrate our methodology using a longitudinal breast cancer study.
Dyadic Data; Missing Data; Nonignorable Dropout; Selection Model
In this paper, we propose a multivariate growth curve mixture model that groups subjects based on multiple symptoms measured repeatedly over time. Our model synthesizes features of two models. First, we follow Roy and Lin (2000) in relating the multiple symptoms at each time point to a single latent variable. Second, we use the growth mixture model of Muthén and Shedden (1999) to group subjects based on distinctive longitudinal profiles of this latent variable. The mean growth curve for the latent variable in each class defines that class's features. For example, a class of "responders" would have a decline in the latent symptom summary variable over time. A Bayesian approach to estimation is employed, where the methods of Elliott et al. (2005) are extended to simultaneously estimate the posterior distributions of the parameters from the latent variable and growth curve mixture portions of the model. We apply our model to data from a randomized clinical trial evaluating the efficacy of Bacillus Calmette-Guerin (BCG) in treating symptoms of Interstitial Cystitis. In contrast to conventional approaches using a single subjective Global Response Assessment, we use the multivariate symptom data to identify a class of subjects where treatment demonstrates effectiveness. Simulations are used to confirm identifiability results and evaluate the performance of our algorithm. The definitive version of this paper is available at onlinelibrary.wiley.com.
We explore a Bayesian approach to selection of variables that represent fixed and random effects in modeling of longitudinal binary outcomes with missing data caused by dropouts. We show via analytic results for a simple example that nonignorable missing data lead to biased parameter estimates. This bias results in selection of wrong effects asymptotically, which we can confirm via simulations for more complex settings. By jointly modeling the longitudinal binary data with the dropout process that possibly leads to nonignorable missing data, we are able to correct the bias in estimation and selection. Mixture priors with a point mass at zero are used to facilitate variable selection. We illustrate the proposed approach using a clinical trial for acute ischemic stroke.
Bayesian variable selection; Bias; Dropout; Missing data; Model selection
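The "mixture priors with a point mass at zero" mentioned above have a clean closed form in the simplest setting. The sketch below (hypothetical single-coefficient model with known error variance, not the article's joint longitudinal-dropout model) computes the posterior probability that a coefficient is nonzero from the marginal likelihood ratio of the slab versus the spike:

```python
import numpy as np

rng = np.random.default_rng(3)

def inclusion_prob(x, y, sigma2=1.0, tau2=1.0, pi=0.5):
    """Posterior inclusion probability under a spike-and-slab prior:
    b = 0 with probability 1 - pi (spike), b ~ N(0, tau2) otherwise (slab),
    for the model y = x*b + N(0, sigma2) errors."""
    s, z = x @ x, x @ y
    log_bf = (0.5 * np.log(sigma2 / (sigma2 + tau2 * s))
              + tau2 * z**2 / (2 * sigma2 * (sigma2 + tau2 * s)))
    bf = np.exp(log_bf)                    # Bayes factor, slab vs. spike
    return pi * bf / (pi * bf + (1 - pi))

n = 100
x = rng.normal(size=n)
y_null = rng.normal(size=n)                # true b = 0: exclude x
y_sig = 0.8 * x + rng.normal(size=n)       # true b = 0.8: include x
print(inclusion_prob(x, y_null), inclusion_prob(x, y_sig))
```

In the article's setting the same prior structure sits inside a joint model for the binary outcomes and the dropout process, so inclusion probabilities are obtained by MCMC rather than in closed form.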
We propose a marginalized joint-modeling approach for marginal inference on the association between longitudinal responses and covariates when longitudinal measurements are subject to informative dropouts. The proposed model is motivated by the idea of linking longitudinal responses and dropout times by latent variables while focusing on marginal inferences. We develop a simple inference procedure based on a series of estimating equations, and the resulting estimators are consistent and asymptotically normal with a sandwich-type covariance matrix ready to be estimated by the usual plug-in rule. The performance of our approach is evaluated through simulations and illustrated with a renal disease data application.
Biomedical research is plagued with problems of missing data, especially in clinical trials of medical and behavioral therapies adopting a longitudinal design. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; then, each partially imputed data set is analyzed to deal with dropouts with or without further imputation. Depending on the choice of imputation model or measurement model, there exist various strategies that can be jointly applied to the same set of data to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.
multiple partial imputation; selection model; pattern-mixture model; Markov transition model; nonignorable dropout; intermittent missing values
Random effects are often used in generalized linear models to explain the serial dependence for longitudinal categorical data. Marginalized random effects models (MREMs) for the analysis of longitudinal binary data have been proposed to permit likelihood-based estimation of marginal regression parameters. In this paper, we introduce an extension of the MREM to accommodate longitudinal ordinal data. Maximum marginal likelihood estimation is implemented utilizing quasi-Newton algorithms with Monte Carlo integration of the random effects. Our approach is applied to analyze the quality of life data from a recent colorectal cancer clinical trial. Dropout occurs at a high rate and is often due to tumor progression or death. To deal with progression/death, we use a mixture model for the joint distribution of longitudinal measures and progression/death times and principal stratification to draw causal inferences about survivors.
marginalized likelihood-based models; ordinal data models; dropout
Nonignorable missing data is a common problem in longitudinal studies. Latent class models are attractive for simplifying the modeling of missing data when the data are subject to either a monotone or intermittent missing data pattern. In our study, we propose a new two-latent-class model for categorical data with informative dropouts, dividing the observed data into two latent classes: one in which the outcomes are deterministic and a second in which the outcomes can be modeled using logistic regression. In the model, the latent classes connect the longitudinal responses and the missingness process under the assumption of conditional independence. Parameters are estimated by maximum likelihood based on the above assumptions and the tetrachoric correlation between responses within the same subject. We compare the proposed method with the shared parameter model and the weighted GEE model using the areas under the ROC curves in the simulations and in the application to the smoking cessation data set. The simulation results indicate that the proposed two-latent-class model performs well under different missing data mechanisms. The application results show that our proposed method is better than the shared parameter model and the weighted GEE model.
Area under ROC curve; Informative dropout; Latent class; Tetrachoric correlation
Gaussian latent factor models are routinely used for modeling of dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Using a Bayesian approach for computation and inferences, a Markov chain Monte Carlo (MCMC) algorithm is proposed that scales well with increasing dimension, with the number of factors treated as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. Theoretical properties are described, and we evaluate the approach through simulation examples. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features.
Classification; Contingency table; Factor analysis; Latent variable; Nonparametric Bayes; Nonnegative tensor factorization; Mutual information; Polytomous regression
Randomized trials with dropouts or censored data and discrete time-to-event type outcomes are frequently analyzed using the Kaplan–Meier or product limit (PL) estimation method. However, the PL method assumes that the censoring mechanism is noninformative, and when this assumption is violated, the inferences may not be valid. We propose an expanded PL method using a Bayesian framework to incorporate an informative censoring mechanism and perform sensitivity analysis on estimates of the cumulative incidence curves. The expanded method uses a model, which can be viewed as a pattern-mixture model, where the odds of having an event during the follow-up interval (t_{k-1}, t_k], conditional on being at risk at t_{k-1}, differ across the patterns of missing data. The sensitivity parameters relate the odds of an event between subjects from a missing-data pattern and the observed subjects for each interval. The large number of sensitivity parameters is reduced by treating them as random, assumed to follow a log-normal distribution with prespecified mean and variance. We then vary the mean and variance to explore the sensitivity of inferences. The missing at random (MAR) mechanism is a special case of the expanded model, thus allowing exploration of the sensitivity of inferences to departures from the MAR assumption. The proposed approach is applied to data from the TRial Of Preventing HYpertension.
Clinical trials; Hypertension; Ignorability index; Missing data; Pattern-mixture model; TROPHY trial
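The expanded product-limit construction above can be sketched numerically. All numbers below are hypothetical: per-interval event odds for subjects from missing-data patterns equal the observed odds times a log-normal sensitivity ratio, and setting that ratio to 1 in every interval recovers the MAR special case.

```python
import numpy as np

rng = np.random.default_rng(4)

h_obs = np.array([0.05, 0.08, 0.10, 0.12])   # observed discrete hazards
f_miss = np.array([0.0, 0.1, 0.2, 0.3])      # at-risk fraction from missing patterns

def cum_incidence(h_obs, f_miss, log_r_mean=0.0, log_r_sd=0.0, ndraw=1):
    """Cumulative incidence averaged over draws of the sensitivity ratios."""
    out = []
    for _ in range(ndraw):
        r = np.exp(rng.normal(log_r_mean, log_r_sd, size=len(h_obs)))
        odds = h_obs / (1 - h_obs)
        h_miss = (r * odds) / (1 + r * odds)     # shifted odds for missing patterns
        h = (1 - f_miss) * h_obs + f_miss * h_miss
        out.append(1 - np.cumprod(1 - h))
    return np.mean(out, axis=0)

mar = cum_incidence(h_obs, f_miss)                          # r = 1: MAR
mnar = cum_incidence(h_obs, f_miss, np.log(2), 0.2, 500)    # events twice as likely
print(np.round(mar, 3), np.round(mnar, 3))
```

Varying the mean and variance of the log-normal prior, as the abstract describes, traces out how far the incidence curve can move away from its MAR value.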
In this article we develop a latent class model with class probabilities that depend on subject-specific covariates. One of our major goals is to identify important predictors of latent classes. We consider methodology that allows estimation of latent classes while allowing for variable selection uncertainty. We propose a Bayesian variable selection approach and implement a stochastic search Gibbs sampler for posterior computation to obtain model averaged estimates of quantities of interest such as marginal inclusion probabilities of predictors. Our methods are illustrated through simulation studies and application to data on weight gain during pregnancy, where it is of interest to identify important predictors of latent weight gain classes.
Bayesian model averaging; Finite mixture model; Markov chain Monte Carlo; Multinomial logit model; Variable selection
Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.
Factor analysis; Latent variables; Mixture model; Model-based clustering; Nested Dirichlet process; Order restriction; Random probability measure; Stick breaking
Incomplete multi-level data arise commonly in many clinical trials and observational studies. Because of multi-level variations in this type of data, appropriate data analysis should take these variations into account. A random effects model can allow for the multi-level variations by assuming random effects at each level, but the computation is intensive because high-dimensional integrations are often involved in fitting models. Marginal methods such as the inverse probability weighted generalized estimating equations can involve simple estimation computation, but it is hard to specify the working correlation matrix for multi-level data. In this paper, we introduce a latent variable method to deal with incomplete multi-level data when the missing mechanism is missing at random, which fills the gap between the random effects model and marginal models. Latent variable models are built for both the response and missing data processes to incorporate the variations that arise at each level. Simulation studies demonstrate that this method performs well in various situations. We apply the proposed method to an Alzheimer’s disease study.
estimating equation; latent variable; missing at random; missing response; multi-level
Asthma is an important chronic disease of childhood. An intervention programme for managing asthma was designed on principles of self-regulation and was evaluated by a randomized longitudinal study. The study focused on several outcomes, and, typically, missing data remained a pervasive problem. We develop a pattern-mixture model to evaluate the outcome of intervention on the number of hospitalizations with non-ignorable dropouts. Pattern-mixture models are not generally identifiable as no data may be available to estimate a number of model parameters. Sensitivity analyses are performed by imposing structures on the unidentified parameters. We propose a parameterization which permits sensitivity analyses on clustered longitudinal count data that have missing values due to non-ignorable missing data mechanisms. This parameterization is expressed as ratios between event rates across missing data patterns and the observed data pattern and thus measures departures from an ignorable missing data mechanism. Sensitivity analyses are performed within a Bayesian framework by averaging over different prior distributions on the event ratios. This model has the advantage of providing an intuitive and flexible framework for incorporating the uncertainty of the missing data mechanism into the final analysis.
Gibbs sampling; Longitudinal data; Non-linear mixed effects models; Poisson outcomes; Randomized trials; Transition Markov models
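The event-rate-ratio parameterization above can be sketched with hypothetical numbers: each dropout pattern's rate equals the completers' rate times a ratio rho, a ratio of 1 corresponds to an ignorable mechanism, and a prior on rho propagates uncertainty about the missing data mechanism into the marginal rate.

```python
import numpy as np

rng = np.random.default_rng(5)

rate_completers = 0.40                        # events per year among completers
pattern_props = np.array([0.6, 0.25, 0.15])   # completers, two dropout patterns

draws = []
for _ in range(5000):
    # prior on the event-rate ratios for the two dropout patterns
    rho = rng.lognormal(mean=np.log(1.5), sigma=0.25, size=2)
    rates = np.concatenate([[rate_completers], rate_completers * rho])
    draws.append(pattern_props @ rates)       # marginal rate, mixed over patterns
draws = np.array(draws)

print(round(draws.mean(), 3), np.round(np.percentile(draws, [2.5, 97.5]), 3))
```

Averaging over several such priors, as the abstract describes, amounts to repeating this mixture calculation under each prior and comparing the resulting marginal inferences.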
In this article we study a joint model for longitudinal measurements and competing risks survival data. Our joint model provides a flexible approach to handle possible nonignorable missing data in the longitudinal measurements due to dropout. It is also an extension of previous joint models with a single failure type, offering a possible way to model informatively censored events as a competing risk. Our model consists of a linear mixed effects submodel for the longitudinal outcome and a proportional cause-specific hazards frailty submodel (Prentice et al., 1978, Biometrics 34, 541-554) for the competing risks survival data, linked together by some latent random effects. We propose to obtain the maximum likelihood estimates of the parameters by an expectation maximization (EM) algorithm and estimate their standard errors using a profile likelihood method. The developed method works well in our simulation studies and is applied to a clinical trial for the scleroderma lung disease.
Cause-specific hazard; Competing risks; EM algorithm; Joint modeling; Longitudinal data; Mixed effects model
A spatial latent class analysis model that extends the classic latent class analysis model by adding spatial structure to the latent class distribution through the use of the multinomial probit model is introduced. Linear combinations of independent Gaussian spatial processes are used to develop multivariate spatial processes underlying the categorical latent classes. This allows the latent class membership to be correlated across spatially distributed sites and allows correlation between the probabilities of particular types of classes at any one site. The number of latent classes is assumed fixed but is chosen by model comparison via cross-validation. An application of the spatial latent class analysis model is shown using soil pollution samples where 8 heavy metals were measured to be above or below government pollution limits across a 25 square kilometer region. Estimation is performed within a Bayesian framework using MCMC and is implemented using the OpenBUGS software.
Mixture model; multinomial probit; latent variables
Lin et al. (http://www.biostatsresearch.com/upennbiostat/papers/, 2006) proposed a nested Markov compliance class model in the Imbens and Rubin compliance class model framework to account for time-varying subject noncompliance in longitudinal randomized intervention studies. We use superclasses, or latent compliance class principal strata, to describe longitudinal compliance patterns, and time-varying compliance classes are assumed to depend on the history of compliance. In this paper, we search for good subject-level baseline predictors of these superclasses and also examine the relationship between these superclasses and all-cause mortality. Since the superclasses are completely latent in all subjects, we utilize multiple imputation techniques to draw inferences. We apply this approach to a randomized intervention study for elderly primary care patients with depression.
longitudinal compliance class model; noncompliance; principal stratification; latent class model; multiple imputation; geriatric depression
This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples, which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. A distinguishing feature of uBLU is that it constrains the factor loadings to be non-negative and the factor scores to be probability distributions over the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted here to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters.
Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), nonnegative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals were inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real data sets considered here.
The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GB-GMF). The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor.
We propose a family of regression models to adjust for nonrandom dropouts in the analysis of longitudinal outcomes with fully observed covariates. The approach conceptually focuses on generalized linear models with random effects. A novel formulation of a shared random effects model is presented and shown to provide a dropout selection parameter with a meaningful interpretation. The proposed semiparametric and parametric models are made part of a sensitivity analysis to delineate the range of inferences consistent with observed data. Concerns about model identifiability are addressed by fixing some model parameters to construct functional estimators that are used as the basis of a global sensitivity test for parameter contrasts. Our simulation studies demonstrate a large reduction of bias for the semiparametric model relative to the parametric model at times when the dropout rate is high or the dropout model is misspecified. The methodology's practical utility is illustrated in a data analysis.
Exponential family distribution; Functional estimators; Global sensitivity analysis; Informative dropout; Infimum/Supremum statistic; Nonparametric mixture; Uniform convergence; non-identifiable models
Bayesian hierarchical models that characterize the distributions of (transformed) gene profiles have proven very useful and flexible in selecting differentially expressed genes across different types of tissue samples (e.g. Lo and Gottardo, 2007). However, the marginal mean and variance of these models are assumed to be the same for different gene clusters and for different tissue types. Moreover, it is not easy to determine which of the many competing Bayesian hierarchical models provides the best fit for a specific microarray data set. To address these two issues, we propose a marginal mixture model that directly models the marginal distribution of transformed gene profiles. Specifically, we approximate the marginal distributions of transformed gene profiles via a mixture of three-component multivariate Normal distributions, each component of which has the same structure of marginal mean vector and covariance matrix as the Bayesian hierarchical models, although the values can differ. Based on the proposed model, a method is derived to select genes differentially expressed across two types of tissue samples. The derived gene selection method performs well on a real microarray data set and consistently has the best performance (based on class agreement indices) compared with several other gene selection methods on simulated microarray data sets generated from three different mixture models.
This article studies a general joint model for longitudinal measurements and competing risks survival data. The model consists of a linear mixed effects sub-model for the longitudinal outcome, a proportional cause-specific hazards frailty sub-model for the competing risks survival data, and a regression sub-model for the variance–covariance matrix of the multivariate latent random effects based on a modified Cholesky decomposition. The model provides a useful approach to adjust for non-ignorable missing data due to dropout for the longitudinal outcome, enables analysis of the survival outcome with informative censoring and intermittently measured time-dependent covariates, as well as joint analysis of the longitudinal and survival outcomes. Unlike previously studied joint models, our model allows for heterogeneous random covariance matrices. It also offers a framework to assess the homogeneous covariance assumption of existing joint models. A Bayesian MCMC procedure is developed for parameter estimation and inference. Its performances and frequentist properties are investigated using simulations. A real data example is used to illustrate the usefulness of the approach.
Cause-specific hazard; Bayesian analysis; Cholesky decomposition; Mixed effects model; MCMC; Modeling covariance matrices
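The modified Cholesky decomposition underlying the variance-covariance sub-model above has a concrete form worth spelling out: a covariance matrix Sigma is parameterized as T Sigma T' = D, where T is unit lower-triangular with generalized autoregressive parameters below the diagonal and D holds positive innovation variances, so any regression model for those parameters automatically yields a positive-definite Sigma. A minimal sketch with illustrative (hypothetical) values:

```python
import numpy as np

phi = 0.6                       # hypothetical common autoregressive parameter
m = 4                           # number of repeated measures
T = np.eye(m)
for j in range(1, m):
    T[j, j - 1] = -phi          # each measure regressed on its predecessor
D = np.diag([1.0, 0.8, 0.8, 0.8])   # innovation variances (must be positive)

# implied covariance: T Sigma T' = D  =>  Sigma = T^{-1} D T^{-T}
Tinv = np.linalg.inv(T)
Sigma = Tinv @ D @ Tinv.T

print(np.round(Sigma, 3))
print(np.all(np.linalg.eigvalsh(Sigma) > 0))   # positive definite by construction
```

In the article, phi and the log innovation variances are themselves modeled as functions of covariates, which is what allows heterogeneous random covariance matrices across subjects.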