# Related Articles

Within the pattern-mixture modeling framework for informative dropout, conditional linear models (CLMs) are a useful approach to deal with dropout that can occur at any point in continuous time (not just at observation times). However, in contrast with selection models, inferences about marginal covariate effects in CLMs are not readily available if nonidentity links are used in the mean structures. In this article, we propose a CLM for long series of longitudinal binary data with marginal covariate effects directly specified. The association between the binary responses and the dropout time is taken into account by modeling the conditional mean of the binary response as well as the dependence between the binary responses given the dropout time. Specifically, parameters in both the conditional mean and dependence models are assumed to be linear or quadratic functions of the dropout time; and the continuous dropout time distribution is left completely unspecified. Inference is fully Bayesian. We illustrate the proposed model using data from a longitudinal study of depression in HIV-infected women, where the strategy of sensitivity analysis based on the extrapolation method is also demonstrated.

doi:10.1093/biostatistics/kxr041

PMCID: PMC3297830
PMID: 22133756

Bayesian analysis; HIV/AIDS; Marginal model; Missing data; Sensitivity analysis

Dropout is a common occurrence in longitudinal studies. Building upon the pattern-mixture modeling approach within the Bayesian paradigm, we propose a general framework of varying-coefficient models for longitudinal data with informative dropout, where measurement times can be irregular and dropout can occur at any point in continuous time (not just at observation times) together with administrative censoring. Specifically, we assume that the longitudinal outcome process depends on the dropout process through its model parameters. The unconditional distribution of the repeated measures is a mixture over the dropout (administrative censoring) time distribution, and the continuous dropout time distribution with administrative censoring is left completely unspecified. We use Markov chain Monte Carlo to sample from the posterior distribution of the repeated measures given the dropout (administrative censoring) times; Bayesian bootstrapping on the observed dropout (administrative censoring) times is carried out to obtain marginal covariate effects. We illustrate the proposed framework using data from a longitudinal study of depression in HIV-infected women; the strategy for sensitivity analysis on unverifiable assumption is also demonstrated.
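The Bayesian bootstrap step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `bayesian_bootstrap_marginal`, the toy conditional effect, and the dropout times are all assumptions. In the paper the conditional effects would themselves be posterior draws; here they are held fixed for simplicity.

```python
import numpy as np

def bayesian_bootstrap_marginal(cond_effect, dropout_times, n_draws=2000, seed=0):
    """Average a conditional effect over Bayesian-bootstrap draws of the
    dropout-time distribution (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    times = np.asarray(dropout_times, dtype=float)
    # Conditional covariate effect evaluated at each observed dropout time
    effects = np.array([cond_effect(t) for t in times])
    # Each draw places Dirichlet(1, ..., 1) weights on the observed times,
    # so the dropout-time distribution itself is left unspecified
    weights = rng.dirichlet(np.ones(times.size), size=n_draws)
    # A weighted average of conditional effects is one draw of the marginal effect
    return weights @ effects

# Toy inputs (assumed): the effect shrinks linearly with later dropout
times = np.array([0.5, 1.2, 2.0, 3.5, 4.0])
draws = bayesian_bootstrap_marginal(lambda t: 1.0 - 0.2 * t, times)
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))
```

The spread of `draws` reflects only uncertainty about the dropout-time distribution; a full analysis would also propagate posterior uncertainty in the conditional effects.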

doi:10.1093/biostatistics/kxp040

PMCID: PMC2800163
PMID: 19837655

HIV/AIDS; Missing data; Nonparametric regression; Penalized splines

Dropout is common in longitudinal clinical trials and when the probability of dropout depends on unobserved outcomes even after conditioning on available data, it is considered missing not at random and therefore nonignorable. To address this problem, mixture models can be used to account for the relationship between a longitudinal outcome and dropout. We propose a Natural Spline Varying-coefficient mixture model (NSV), which is a straightforward extension of the parametric Conditional Linear Model (CLM). We assume that the outcome follows a varying-coefficient model conditional on a continuous dropout distribution. Natural cubic B-splines are used to allow the regression coefficients to semiparametrically depend on dropout and inference is therefore more robust. Additionally, this method is computationally stable and relatively simple to implement. We conduct simulation studies to evaluate performance and compare methodologies in settings where the longitudinal trajectories are linear and dropout time is observed for all individuals. Performance is assessed under conditions where model assumptions are both met and violated. In addition, we compare the NSV to the CLM and a standard random-effects model using an HIV/AIDS clinical trial with probable nonignorable dropout. The simulation studies suggest that the NSV is an improvement over the CLM when dropout has a nonlinear dependence on the outcome.
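One way to write down a varying-coefficient mixture model of this kind is sketched below. The notation is assumed for illustration and is not taken from the paper: $Y_{ij}$ is the outcome for subject $i$ at time $t_{ij}$, $T_i$ is the dropout time with distribution $F$, $B_l$ are natural cubic B-spline basis functions, and $b_i$, $\varepsilon_{ij}$ are a random intercept and residual error.

```latex
% Conditional model: coefficients vary smoothly with dropout time u
Y_{ij} \mid (T_i = u) = \beta_0(u) + \beta_1(u)\, t_{ij} + b_i + \varepsilon_{ij},
\qquad
\beta_k(u) = \sum_{l=1}^{L} \theta_{kl}\, B_l(u), \quad k = 0, 1.

% Marginal mean: mix the conditional mean over the dropout-time distribution
E[Y_{ij}] = \int \bigl\{ \beta_0(u) + \beta_1(u)\, t_{ij} \bigr\}\, dF(u).
```

Under this sketch, restricting each $\beta_k(u)$ to a linear function of $u$ would give a parametric conditional linear model as a special case.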

doi:10.1016/j.cct.2011.11.009

PMCID: PMC3414213
PMID: 22101223

Dropout; Nonignorable Missing Data; Longitudinal data; Varying-coefficient model; B-spline; HIV/AIDS

SUMMARY

The analysis of longitudinal dyadic data is challenging due to the complicated correlations within and between dyads, as well as possibly non-ignorable dropouts. Based on a mixed-effects hybrid model, we propose an approach to analyze longitudinal dyadic data with non-ignorable dropouts. We factorize the joint distribution of the measurement and dropout processes into three components: the marginal distribution of random effects, the conditional distribution of the dropout process given the random effects, and the conditional distribution of the measurement process given the random effects and missing data patterns. We model the conditional dropout process using a discrete survival model, and the conditional measurement process using a latent-class pattern-mixture model. These models account for the dyadic interdependence using the “actor” and “partner” effects and dyad-specific random effects. We use the latent-dropout-class approach to address the problem of a large number of missing data patterns caused by the dyadic data structure. We evaluate the performance of the proposed method using a simulation study, and apply our method to a longitudinal dyadic data set that arose from a prostate cancer trial.

doi:10.1111/biom.12100

PMCID: PMC3970927
PMID: 24328715

dyadic; non-ignorable missingness; mixed-effect; longitudinal; latent class

SUMMARY

The analysis of longitudinal repeated measures data is frequently complicated by missing data due to informative dropout. We describe a mixture model for joint distribution for longitudinal repeated measures, where the dropout distribution may be continuous and the dependence between response and dropout is semiparametric. Specifically, we assume that responses follow a varying coefficient random effects model conditional on dropout time, where the regression coefficients depend on dropout time through unspecified nonparametric functions that are estimated using step functions when dropout time is discrete (e.g., for panel data) and using smoothing splines when dropout time is continuous. Inference under the proposed semiparametric model is hence more robust than the parametric conditional linear model. The unconditional distribution of the repeated measures is a mixture over the dropout distribution. We show that estimation in the semiparametric varying coefficient mixture model can proceed by fitting a parametric mixed effects model and can be carried out on standard software platforms such as SAS. The model is used to analyze data from a recent AIDS clinical trial and its performance is evaluated using simulations.

doi:10.1111/j.0006-341X.2004.00240.x

PMCID: PMC2677904
PMID: 15606405

Clinical trials; Equivalence trial; Linear mixed model; Missing data; Nonignorable dropout; Pattern-mixture model; Pediatric AIDS; Selection bias; Smoothing splines

Growth mixture models (GMMs) with nonignorable missing data have drawn increasing attention in research communities but have not been fully studied. The goal of this article is to propose and to evaluate a Bayesian method to estimate the GMMs with latent class dependent missing data. An extended GMM is first presented in which class probabilities depend on some observed explanatory variables and data missingness depends on both the explanatory variables and a latent class variable. A full Bayesian method is then proposed to estimate the model. Through the data augmentation method, conditional posterior distributions for all model parameters and missing data are obtained. A Gibbs sampling procedure is then used to generate Markov chains of model parameters for statistical inference. The application of the model and the method is first demonstrated through the analysis of mathematical ability growth data from the National Longitudinal Survey of Youth 1997 (Bureau of Labor Statistics, U.S. Department of Labor, 1997). A simulation study considering 3 main factors (the sample size, the class probability, and the missing data mechanism) is then conducted and the results show that the proposed Bayesian estimation approach performs very well under the studied conditions. Finally, some implications of this study, including the misspecified missingness mechanism, the sample size, the sensitivity of the model, the number of latent classes, the model comparison, and the future directions of the approach, are discussed.

doi:10.1080/00273171.2011.589261

PMCID: PMC4002129
PMID: 24790248

This paper uses a general latent variable framework to study a series of models for non-ignorable missingness due to dropout. Non-ignorable missing data modeling acknowledges that missingness may depend on not only covariates and observed outcomes at previous time points as with the standard missing at random (MAR) assumption, but also on latent variables such as values that would have been observed (missing outcomes), developmental trends (growth factors), and qualitatively different types of development (latent trajectory classes). These alternative predictors of missing data can be explored in a general latent variable framework using the Mplus program. A flexible new model uses an extended pattern-mixture approach where missingness is a function of latent dropout classes in combination with growth mixture modeling using latent trajectory classes. A new selection model allows not only an influence of the outcomes on missingness, but allows this influence to vary across latent trajectory classes. Recommendations are given for choosing models. The missing data models are applied to longitudinal data from STAR*D, the largest antidepressant clinical trial in the U.S. to date. Despite the importance of this trial, STAR*D growth model analyses using non-ignorable missing data techniques have not been explored until now. The STAR*D data are shown to feature distinct trajectory classes, including a low class corresponding to substantial improvement in depression, a minority class with a U-shaped curve corresponding to transient improvement, and a high class corresponding to no improvement. The analyses provide a new way to assess drug efficiency in the presence of dropout.

doi:10.1037/a0022634

PMCID: PMC3060937
PMID: 21381817

Latent trajectory classes; random effects; survival analysis; not missing at random

Dyadic data are common in the social and behavioral sciences, in which members of dyads are correlated due to the interdependence structure within dyads. The analysis of longitudinal dyadic data becomes complex when nonignorable dropouts occur. We propose a fully Bayesian selection-model-based approach to analyze longitudinal dyadic data with nonignorable dropouts. We model repeated measures on subjects by a transition model and account for within-dyad correlations by random effects. In the model, we allow subject’s outcome to depend on his/her own characteristics and measure history, as well as those of the other member in the dyad. We further account for the nonignorable missing data mechanism using a selection model in which the probability of dropout depends on the missing outcome. We propose a Gibbs sampler algorithm to fit the model. Simulation studies show that the proposed method effectively addresses the problem of nonignorable dropouts. We illustrate our methodology using a longitudinal breast cancer study.

doi:10.1214/11-AOAS515

PMCID: PMC3693094
PMID: 23814631

Dyadic Data; Missing Data; Nonignorable Dropout; Selection Model

Summary

In this paper, we propose a multivariate growth curve mixture model that groups subjects based on multiple symptoms measured repeatedly over time. Our model synthesizes features of two models. First, we follow Roy and Lin (2000) in relating the multiple symptoms at each time point to a single latent variable. Second, we use the growth mixture model of Muthén and Shedden (1999) to group subjects based on distinctive longitudinal profiles of this latent variable. The mean growth curve for the latent variable in each class defines that class’s features. For example, a class of “responders” would have a decline in the latent symptom summary variable over time. A Bayesian approach to estimation is employed where the methods of Elliott et al. (2005) are extended to simultaneously estimate the posterior distributions of the parameters from the latent variable and growth curve mixture portions of the model. We apply our model to data from a randomized clinical trial evaluating the efficacy of Bacillus Calmette-Guerin (BCG) in treating symptoms of Interstitial Cystitis. In contrast to conventional approaches using a single subjective Global Response Assessment, we use the multivariate symptom data to identify a class of subjects where treatment demonstrates effectiveness. Simulations are used to confirm identifiability results and evaluate the performance of our algorithm. The definitive version of this paper is available at onlinelibrary.wiley.com.

doi:10.1111/j.1467-9876.2009.00663.x

PMCID: PMC3104279
PMID: 21637724

We explore a Bayesian approach to selection of variables that represent fixed and random effects in modeling of longitudinal binary outcomes with missing data caused by dropouts. We show via analytic results for a simple example that nonignorable missing data lead to biased parameter estimates. This bias results in selection of wrong effects asymptotically, which we can confirm via simulations for more complex settings. By jointly modeling the longitudinal binary data with the dropout process that possibly leads to nonignorable missing data, we are able to correct the bias in estimation and selection. Mixture priors with a point mass at zero are used to facilitate variable selection. We illustrate the proposed approach using a clinical trial for acute ischemic stroke.
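The role of a mixture prior with a point mass at zero can be illustrated with a deliberately simplified calculation. The sketch below assumes a single coefficient with a normal estimate and known standard error, which is far simpler than the joint model of binary outcomes and dropout described in the abstract; `inclusion_probability` and its default values are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def inclusion_probability(estimate, se, slab_sd=1.0, prior_inclusion=0.5):
    """Posterior probability that a coefficient is nonzero under a
    point-mass-at-zero ("spike") plus N(0, slab_sd^2) ("slab") prior,
    assuming estimate ~ N(beta, se^2). Illustrative only."""
    # Marginal density of the estimate when beta is exactly zero
    spike = (1.0 - prior_inclusion) * normal_pdf(estimate, se ** 2)
    # Marginal density when beta is drawn from the slab
    slab = prior_inclusion * normal_pdf(estimate, se ** 2 + slab_sd ** 2)
    return slab / (spike + slab)

print(inclusion_probability(0.0, 0.1))  # estimate at zero: effect likely excluded
print(inclusion_probability(1.0, 0.1))  # estimate far from zero: effect likely included
```

If nonignorable dropout biases the estimate away from zero, this probability is inflated, which is one way to read the abstract's point that estimation bias carries over into selection.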

doi:10.1002/bimj.201100107

PMCID: PMC3855104
PMID: 23124889

Bayesian variable selection; Bias; Dropout; Missing data; Model selection

We propose a marginalized joint-modeling approach for marginal inference on the association between longitudinal responses and covariates when longitudinal measurements are subject to informative dropouts. The proposed model is motivated by the idea of linking longitudinal responses and dropout times by latent variables while focusing on marginal inferences. We develop a simple inference procedure based on a series of estimating equations, and the resulting estimators are consistent and asymptotically normal with a sandwich-type covariance matrix ready to be estimated by the usual plug-in rule. The performance of our approach is evaluated through simulations and illustrated with a renal disease data application.

PMCID: PMC3261622
PMID: 22267962

Random effects models are commonly used to analyze longitudinal categorical data. Marginalized random effects models are a class of models that permit direct estimation of marginal mean parameters and characterize serial correlation for longitudinal categorical data via random effects (Heagerty, 1999. Marginally specified logistic-normal models for longitudinal binary data. Biometrics 55, 688–698; Lee and Daniels, 2008. Marginalized models for longitudinal ordinal data with application to quality of life studies. Statistics in Medicine 27, 4359–4380). In this paper, we propose a Kronecker product (KP) covariance structure to capture the correlation between processes at a given time and the correlation within a process over time (serial correlation) for bivariate longitudinal ordinal data. For the latter, we consider a more general class of models than standard (first-order) autoregressive correlation models, by re-parameterizing the correlation matrix using partial autocorrelations (Daniels and Pourahmadi, 2009. Modeling covariance matrices via partial autocorrelations. Journal of Multivariate Analysis 100, 2352–2363). We assess the reasonableness of the KP structure with a score test. A maximum marginal likelihood estimation method is proposed utilizing a quasi-Newton algorithm with quasi-Monte Carlo integration of the random effects. We examine the effects of demographic factors on metabolic syndrome and C-reactive protein using the proposed models.
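A Kronecker product structure for two processes observed over time can be sketched with NumPy. For simplicity this example uses a first-order autoregressive serial correlation, which is the standard case the abstract generalizes; the correlation values and the helper `ar1_correlation` are assumptions chosen for illustration.

```python
import numpy as np

def ar1_correlation(n_times, rho):
    """AR(1) correlation matrix: corr(s, t) = rho ** |s - t|."""
    idx = np.arange(n_times)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Correlation between the two processes at a given time (assumed value)
between = np.array([[1.0, 0.4],
                    [0.4, 1.0]])
# Serial correlation within a process over 4 time points (assumed value)
serial = ar1_correlation(4, 0.6)

# Rows and columns are ordered time-major: index = time * 2 + process
kp = np.kron(serial, between)

print(kp.shape)   # (8, 8)
print(kp[0, 1])   # both processes, same time
print(kp[0, 2])   # same process, adjacent times
print(kp[0, 3])   # different processes, adjacent times
```

The structure implies that every cross-process, cross-time correlation is the product of a serial term and a between-process term, which is the restriction a score test of the KP structure would examine.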

doi:10.1093/biostatistics/kxs058

PMCID: PMC3677737
PMID: 23365416

Kronecker product; Metabolic syndrome; Partial autocorrelation

SUMMARY

Biomedical research is plagued with problems of missing data, especially in clinical trials of medical and behavioral therapies adopting longitudinal design. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; then, each partially imputed data set is analyzed to deal with dropouts with or without further imputation. Depending on the choice of imputation model or measurement model, there exist various strategies that can be jointly applied to the same set of data to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.

doi:10.1002/sim.3111

PMCID: PMC3032542
PMID: 18205247

multiple partial imputation; selection model; pattern-mixture model; Markov transition model; nonignorable dropout; intermittent missing values

SUMMARY

Random effects are often used in generalized linear models to explain the serial dependence for longitudinal categorical data. Marginalized random effects models (MREMs) for the analysis of longitudinal binary data have been proposed to permit likelihood-based estimation of marginal regression parameters. In this paper, we introduce an extension of the MREM to accommodate longitudinal ordinal data. Maximum marginal likelihood estimation is implemented utilizing quasi-Newton algorithms with Monte Carlo integration of the random effects. Our approach is applied to analyze the quality of life data from a recent colorectal cancer clinical trial. Dropout occurs at a high rate and is often due to tumor progression or death. To deal with progression/death, we use a mixture model for the joint distribution of longitudinal measures and progression/death times and principal stratification to draw causal inferences about survivors.

doi:10.1002/sim.3352

PMCID: PMC2858760
PMID: 18613246

marginalized likelihood-based models; ordinal data models; dropout

Gaussian latent factor models are routinely used for modeling of dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Using a Bayesian approach for computation and inferences, a Markov chain Monte Carlo (MCMC) algorithm is proposed that scales well with increasing dimension, with the number of factors treated as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. Theoretical properties are described, and we evaluate the approach through simulation examples. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features.

doi:10.1080/01621459.2011.646934

PMCID: PMC3728016
PMID: 23908561

Classification; Contingency table; Factor analysis; Latent variable; Nonparametric Bayes; Nonnegative tensor factorization; Mutual information; Polytomous regression

Nonignorable missing data are a common problem in longitudinal studies. Latent class models are attractive for simplifying the modeling of missing data when the data are subject to either a monotone or intermittent missing data pattern. In our study, we propose a new two-latent-class model for categorical data with informative dropouts, dividing the observed data into two latent classes: one class in which the outcomes are deterministic and a second one in which the outcomes can be modeled using logistic regression. In the model, the latent classes connect the longitudinal responses and the missingness process under the assumption of conditional independence. Parameters are estimated by the method of maximum likelihood estimation based on the above assumptions and the tetrachoric correlation between responses within the same subject. We compare the proposed method with the shared parameter model and the weighted GEE model using the areas under the ROC curves in the simulations and the application to the smoking cessation data set. The simulation results indicate that the proposed two-latent-class model performs well under different missing procedures. The application results show that our proposed method is better than the shared parameter model and the weighted GEE model.

doi:10.1080/03610920802585849

PMCID: PMC2879593
PMID: 20523912

Area under ROC curve; Informative dropout; Latent class; Tetrachoric correlation

Randomized trials with dropouts or censored data and discrete time-to-event type outcomes are frequently analyzed using the Kaplan–Meier or product limit (PL) estimation method. However, the PL method assumes that the censoring mechanism is noninformative and when this assumption is violated, the inferences may not be valid. We propose an expanded PL method using a Bayesian framework to incorporate an informative censoring mechanism and perform sensitivity analysis on estimates of the cumulative incidence curves. The expanded method uses a model, which can be viewed as a pattern mixture model, where odds for having an event during the follow-up interval $(t_{k-1}, t_k]$, conditional on being at risk at $t_{k-1}$, differ across the patterns of missing data. The sensitivity parameters relate the odds of an event for subjects in a missing-data pattern to the odds for the observed subjects in each interval. The large number of sensitivity parameters is reduced by treating them as random and assuming that they follow a log-normal distribution with prespecified mean and variance. We then vary the mean and variance to explore the sensitivity of inferences. The missing at random (MAR) mechanism is a special case of the expanded model, thus allowing exploration of the sensitivity of inferences to departures from the MAR assumption. The proposed approach is applied to data from the TRial Of Preventing HYpertension.
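The odds-based sensitivity parameter in this abstract can be illustrated with a small calculation. The sketch below is an assumption-laden simplification: it uses a single fixed odds ratio rather than log-normally distributed ones, it shows only the hazards implied for one missing-data pattern rather than the full mixture, and the interval hazards and helper names are made up for illustration.

```python
def shift_hazard(hazard, odds_ratio):
    """Interval event probability for a missing-data pattern whose event odds
    are odds_ratio times the odds among observed subjects."""
    odds = hazard / (1.0 - hazard) * odds_ratio
    return odds / (1.0 + odds)

def cumulative_incidence(hazards):
    """Product-limit cumulative incidence from per-interval event probabilities."""
    surviving = 1.0
    curve = []
    for h in hazards:
        surviving *= 1.0 - h
        curve.append(1.0 - surviving)
    return curve

# Assumed event probabilities among observed subjects in three intervals
observed = [0.05, 0.08, 0.10]

# Odds ratio 1 reproduces the observed hazards (the MAR-like special case)
mar = cumulative_incidence([shift_hazard(h, 1.0) for h in observed])
# Odds ratio 2 assumes dropouts have twice the event odds in every interval
informative = cumulative_incidence([shift_hazard(h, 2.0) for h in observed])

print(mar[-1], informative[-1])
```

Varying the odds ratio away from 1 traces out how far the cumulative incidence could move under departures from the noninformative censoring assumption.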

doi:10.1093/biostatistics/kxr048

PMCID: PMC3297827
PMID: 22223746

Clinical trials; Hypertension; Ignorability index; Missing data; Pattern-mixture model; TROPHY trial

Summary

In this article we develop a latent class model with class probabilities that depend on subject-specific covariates. One of our major goals is to identify important predictors of latent classes. We consider methodology that allows estimation of latent classes while allowing for variable selection uncertainty. We propose a Bayesian variable selection approach and implement a stochastic search Gibbs sampler for posterior computation to obtain model averaged estimates of quantities of interest such as marginal inclusion probabilities of predictors. Our methods are illustrated through simulation studies and application to data on weight gain during pregnancy, where it is of interest to identify important predictors of latent weight gain classes.

doi:10.1111/j.1541-0420.2010.01502.x

PMCID: PMC3035762
PMID: 21039399

Bayesian model averaging; Finite mixture model; Markov chain Monte Carlo; Multinomial logit model; Variable selection

Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.

doi:10.1198/jasa.2011.ap10058

PMCID: PMC3324040
PMID: 22505787

Factor analysis; Latent variables; Mixture model; Model-based clustering; Nested Dirichlet process; Order restriction; Random probability measure; Stick breaking

Incomplete multi-level data arise commonly in many clinical trials and observational studies. Because of multi-level variations in this type of data, appropriate data analysis should take these variations into account. A random effects model can allow for the multi-level variations by assuming random effects at each level, but the computation is intensive because high-dimensional integrations are often involved in fitting models. Marginal methods such as the inverse probability weighted generalized estimating equations can involve simple estimation computation, but it is hard to specify the working correlation matrix for multi-level data. In this paper, we introduce a latent variable method to deal with incomplete multi-level data when the missing mechanism is missing at random, which fills the gap between the random effects model and marginal models. Latent variable models are built for both the response and missing data processes to incorporate the variations that arise at each level. Simulation studies demonstrate that this method performs well in various situations. We apply the proposed method to an Alzheimer’s disease study.

doi:10.1002/sim.5394

PMCID: PMC3631603
PMID: 22733392

estimating equation; latent variable; missing at random; missing response; multi-level

Summary

Asthma is an important chronic disease of childhood. An intervention programme for managing asthma was designed on principles of self-regulation and was evaluated by a randomized longitudinal study. The study focused on several outcomes, and, typically, missing data remained a pervasive problem. We develop a pattern–mixture model to evaluate the outcome of intervention on the number of hospitalizations with non-ignorable dropouts. Pattern–mixture models are not generally identifiable as no data may be available to estimate a number of model parameters. Sensitivity analyses are performed by imposing structures on the unidentified parameters. We propose a parameterization which permits sensitivity analyses on clustered longitudinal count data that have missing values due to non-ignorable missing data mechanisms. This parameterization is expressed as ratios between event rates across missing data patterns and the observed data pattern and thus measures departures from an ignorable missing data mechanism. Sensitivity analyses are performed within a Bayesian framework by averaging over different prior distributions on the event ratios. This model has the advantage of providing an intuitive and flexible framework for incorporating the uncertainty of the missing data mechanism in the final analysis.

doi:10.1111/j.1467-9876.2008.00628.x

PMCID: PMC2975948
PMID: 21072316

Gibbs sampling; Longitudinal data; Non-linear mixed effects models; Poisson outcomes; Randomized trials; Transition Markov models

A spatial latent class analysis model that extends the classic latent class analysis model by adding spatial structure to the latent class distribution through the use of the multinomial probit model is introduced. Linear combinations of independent Gaussian spatial processes are used to develop multivariate spatial processes that are underlying the categorical latent classes. This allows the latent class membership to be correlated across spatially distributed sites and it allows correlation between the probabilities of particular types of classes at any one site. The number of latent classes is assumed fixed but is chosen by model comparison via cross-validation. An application of the spatial latent class analysis model is shown using soil pollution samples where 8 heavy metals were measured to be above or below government pollution limits across a 25 square kilometer region. Estimation is performed within a Bayesian framework using MCMC and is implemented using the OpenBUGS software.

doi:10.1016/j.csda.2008.07.037

PMCID: PMC2705170
PMID: 20161235

Mixture model; multinomial probit; latent variables

Summary

In this article we study a joint model for longitudinal measurements and competing risks survival data. Our joint model provides a flexible approach to handle possible nonignorable missing data in the longitudinal measurements due to dropout. It is also an extension of previous joint models with a single failure type, offering a possible way to model informatively censored events as a competing risk. Our model consists of a linear mixed effects submodel for the longitudinal outcome and a proportional cause-specific hazards frailty submodel (Prentice et al., 1978, Biometrics 34, 541-554) for the competing risks survival data, linked together by some latent random effects. We propose to obtain the maximum likelihood estimates of the parameters by an expectation maximization (EM) algorithm and estimate their standard errors using a profile likelihood method. The developed method works well in our simulation studies and is applied to a clinical trial for the scleroderma lung disease.

doi:10.1111/j.1541-0420.2007.00952.x

PMCID: PMC2751647
PMID: 18162112

Cause-specific hazard; Competing risks; EM algorithm; Joint modeling; Longitudinal data; Mixed effects model

SUMMARY

Lin et al. (http://www.biostatsresearch.com/upennbiostat/papers/, 2006) proposed a nested Markov compliance class model in the Imbens and Rubin compliance class model framework to account for time-varying subject noncompliance in longitudinal randomized intervention studies. We use superclasses, or latent compliance class principal strata, to describe longitudinal compliance patterns, and time-varying compliance classes are assumed to depend on the history of compliance. In this paper, we search for good subject-level baseline predictors of these superclasses and also examine the relationship between these superclasses and all-cause mortality. Since the superclasses are completely latent in all subjects, we utilize multiple imputation techniques to draw inferences. We apply this approach to a randomized intervention study for elderly primary care patients with depression.

doi:10.1002/sim.2909

PMCID: PMC2810145
PMID: 17477334

longitudinal compliance class model; noncompliance; principal stratification; latent class model; multiple imputation; geriatric depression

Background

This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. The particularity of the proposed method is that uBLU constrains the factor loadings to be non-negative and the factor scores to be probability distributions over the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted here to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters.
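The constrained mixing model described here can be sketched by simulating data from it. This shows only the generative structure (non-negative signatures, scores that sum to one within each sample), not the uBLU Gibbs sampler; the dimensions, the gamma and Dirichlet draws, and the noise level are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_factors, n_samples = 50, 3, 20

# Non-negative gene signatures, one column per factor (gamma draws are an assumption)
signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_factors))

# Factor scores: each sample's scores form a probability vector over the factors
scores = rng.dirichlet(np.ones(n_factors), size=n_samples).T  # (n_factors, n_samples)

# Each sample is an additive mixture of the signatures plus Gaussian noise
noise = rng.normal(scale=0.1, size=(n_genes, n_samples))
data = signatures @ scores + noise

print(data.shape)          # (50, 20)
print(scores.sum(axis=0))  # every column sums to 1
```

An unmixing method would take `data` alone and try to recover `signatures`, `scores`, and the number of factors.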

Results

Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), non-negative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals have been inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real data sets considered here.

Conclusions

The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GB-GMF). The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor.

doi:10.1186/1471-2105-14-99

PMCID: PMC3681645
PMID: 23506672