Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Am Stat Assoc. Author manuscript; available in PMC 2010 June 1.
Published in final edited form as:
J Am Stat Assoc. 2009 June 1; 104(486): 429–438.
doi:  10.1198/016214508000000904
PMCID: PMC2746699

Joint Models for the Association of Longitudinal Binary and Continuous Processes With Application to a Smoking Cessation Trial

Xuefeng Liu, Assistant Professor, Michael J. Daniels, Professor, Chair, and Bess Marcus, Professor


Joint models for the association of a longitudinal binary and a longitudinal continuous process are proposed for situations in which their association is of direct interest. The models are parameterized such that the dependence between the two processes is characterized by unconstrained regression coefficients. Bayesian variable selection techniques are used to parsimoniously model these coefficients. A Markov chain Monte Carlo (MCMC) sampling algorithm is developed for sampling from the posterior distribution, using data augmentation steps to handle missing data. Several technical issues are addressed to implement the MCMC algorithm efficiently. The models are motivated by, and are used for, the analysis of a smoking cessation clinical trial in which an important question of interest was the effect of the (exercise) treatment on the relationship between smoking cessation and weight gain.

Keywords: Calibrated posterior predictive p-value, Data augmentation, Dependence, Joint models, Markov chain Monte Carlo, Parameter expansion, Stochastic search variable selection


In some longitudinal studies, although one time-varying outcome may be of primary interest, several related processes are measured. Examples in smoking cessation studies include smoking status and weight change and smoking status and alcohol use. In such studies, the association between the processes can reveal a great deal about the mechanism of behavior change. For example, a motivation for using exercise as an adjunct therapy for smoking cessation is to reduce the dependence between weight gain and relapse back to smoking and between the fear of weight gain and the inability to make a successful quit attempt. In this study we built models in the setting of two processes: a longitudinal binary process (e.g., smoking cessation) and a longitudinal continuous process (e.g., weight change). Our primary interest was to model the association between the two processes. We apply the models to a recent smoking cessation trial (Marcus et al. 2003), where the investigators were interested in the relation between smoking status and weight change.

Our approach builds on and extends recent research on joint modeling of mixed outcomes and Bayesian variable selection. For joint modeling, a well-known technique of joint modeling of mixed outcomes is based on introducing a partly observed random variable following a bivariate normal distribution, where one component defines the continuous outcome and the second, latent component defines the binary outcome through the common probit transformation. Authors taking this type of approach include Catalano and Ryan (1992), Cox and Wermuth (1992), Dunson (2000), Fitzmaurice and Laird (1995), Gueorguieva and Agresti (2001), Regan and Catalano (1999), Roy and Lin (2000), and Sammel, Ryan, and Legler (1997). In the present work, we extend this approach to longitudinal data with T individual measurements by considering a partly observed random variable following a 2T-variate normal distribution where the first T components define the binary outcomes by applying probit transformations and the last T components are continuous outcomes. Moreover, the main question of interest in previous studies was the effect of some treatment or therapy on the mean of the response vector. The effect of the treatment on the association between two outcomes was not of main interest. We propose similar models in which the association between two processes is of concern. To do this, we use a Bartlett decomposition of the covariance matrix (Bartlett 1933).

The association matrix induced by the Bartlett decomposition is high-dimensional and expected to be sparse, so we borrow ideas from the Bayesian variable selection literature to reduce the number of parameters (George and McCulloch 1993, 1997; Smith and Kohn 2002). Other related work includes that of Carlin and Chib (1995), Chipman (1996), Hoeting, Raftery, and Madigan (1996), and Wakefield and Bennett (1996). George and McCulloch developed stochastic search variable selection (SSVS) to select promising subsets of a set of covariates X1, . . ., Xp for further consideration in regression models. Smith and Kohn (2002) proposed similar techniques for modeling a covariance matrix with high dimension for longitudinal data. In this study we construct a hierarchical prior to parsimoniously model the association between two longitudinal processes by extending the ideas of Smith and Kohn (2002). The hierarchical specification has the advantage that potential 0’s in the association matrix can be identified and estimates of the parameters can be calculated to account for the model uncertainty associated with determining which elements are 0’s.

We develop a Markov chain Monte Carlo (MCMC) algorithm to estimate the posterior distribution of the parameters in the models. To implement the MCMC algorithm more efficiently, we address several technical issues, including efficiently sampling from truncated multivariate normal (TMVN) distributions and efficiently sampling a correlation matrix from its full conditional distribution. Geweke (1991) and Robert (1995) proposed a Gibbs sampling algorithm to sample from a TMVN distribution. This algorithm is somewhat inefficient in that it requires repeatedly evaluating conditional means and variances from univariate conditional normals and can result in high autocorrelations in the chain. Barnard, McCulloch, and Meng (2000) and Chib and Greenberg (1998) suggested using the Griddy Gibbs (GG) sampler and the random-walk Metropolis-Hastings (RW-MH) algorithm to sample a correlation matrix. Although the GG sampler is simple to implement, it is not computationally efficient. The RW-MH algorithm has the problem of potentially slow mixing. We propose better ways to handle these issues.

The article is organized as follows. We introduce a smoking cessation clinical trial that motivates this research in Section 2. In Section 3 we propose joint models and hierarchical priors used to parsimoniously model the association between two processes. In Section 4 we describe MCMC sampling techniques to estimate the posterior distribution of parameters and address several technical issues regarding efficient implementation of the MCMC algorithm. We discuss the deviance information criterion (DIC) for model comparison and the calibrated posterior predictive p value for goodness of fit in Section 5. Finally, we present the results and conclusions from the analysis of the clinical trial in Section 6.


Commit to Quit II (CTQ II) (Marcus et al. 2003, 2005) was a 4-year randomized controlled clinical trial designed to test the efficacy of moderate-intensity physical activity as an aid for smoking cessation among women. This study was a logical progression of previous work (CTQ I) (Marcus, King, Albrecht, Parisi, and Abrams 1997; Marcus et al. 1999) on the efficacy of vigorous-intensity exercise to aid smoking cessation and weight regulation in women smokers, because moderate-intensity exercise is less arduous and can be performed by healthy individuals without medical supervision. In the CTQ II trial, 217 healthy women aged 18-65 who had regularly smoked 5 or more cigarettes per day for at least 1 year and who had routinely participated in moderate or vigorous intensity physical activity for 90 minutes or less each week were recruited and randomized to one of the two conditions (treatments): a moderate-intensity exercise condition or a contact condition. These two treatments are designated exercise and wellness. All recruited women participated in an 8-week cognitive-behavioral group-based smoking cessation program, followed by a 12-month follow-up. Participants in the exercise group were required to attend one supervised exercise session per week, on their smoking cessation treatment night. They were also given written instructions for home exercises. The duration and intensity of the exercise were gradually increased to 165 minutes per week, which could be performed onsite or at home. Participants in the contact group received lectures, films, and handouts on a variety of health and lifestyle issues. All participants were encouraged to attend makeup sessions if they failed to attend any session during the 8 weeks of treatments. Smoking status was determined through self-report and carbon monoxide testing at each session. In addition, participants were weighed on a weekly basis during the 8 weeks of treatment. This design allowed for a comparison of the effect of moderate-intensity physical activity plus standard smoking cessation with the effect of contact plus standard smoking cessation.

The primary outcome was quit status (a longitudinal binary outcome), but another outcome—weight change (a longitudinal continuous outcome)—also was measured. The investigators were interested in two questions:

  1. Does moderate-intensity exercise have significant effects on smoking cessation? In the study of the CTQ I trial, the investigators found significant differences between vigorous activity and contact control group through 12 months of follow-up.
  2. Does exercise effect smoking cessation by weakening the association between smoking status and weight gain?

Question 2 motivates our development of joint models for the association of longitudinal binary and continuous processes to examine the differential patterns of association across treatments in Section 3. But our approach also allows us to answer question 1.


3.1 Joint Models

Several authors have developed joint models for the analysis of multivariate longitudinal data using latent normal variables (Daniels and Normand 2006; Dunson 2003; Gueorguieva and Sanacora 2006). In this section we propose similar joint models for the association of a longitudinal binary and a longitudinal continuous process. In our setting the time-dependent covariance matrix is modeled as a function of predictors, however, and is of primary interest.

Denote the binary outcome (in our example, smoking status) for subject j in treatment i at week t(i = 1, . . . , m;j = 1, . . . , ni; t = 1, . . . , T) by Qij,t, and denote the continuous outcome (in our example, weight change) by Wij,t. Define the vectors of responses for binary and continuous outcomes as Qij = (Qij,1, . . . , Qij,T)’ and Wij = (Wij,1, . . . , Wij,T)’. Also define a vector of latent variables underlying the binary vector Qij as Yij = (Yij,1, . . . , Yij,T)’. Suppose that Vij is a vector of joint processes such that Vij=(Yij,Wij). Then the joint distribution of binary and continuous variables over time can be modeled using the multivariate normal specification (Yij,Wij)~N(Xijβ,Σi),

Vij=(YijWij)=Xijβ+[sm epsilon]i,

where Xij is the design matrix, β is the vector of regression coefficients, and

[sm epsilon]i~N(0,Σi)withΣi=(Σi,11Σi,21Σi,21Σi,22).

Using the probit formulation for the binary process, we have Qij,t = I{Yij,t > 0}. To estimate the association between Qij and Wij, we need models for Σi,12 as a function of treatment (and/or other subject specific covariates that might affect this relationship). But both Σi,12 and the entire covariance matrix Σi are difficult to model, because of positive definiteness constraints (Daniels and Kass 2001; Pourahmadi and Daniels 2002) and because it is high-dimensional for each subject. To address this problem, we factor the joint distribution of Yij and Wij into two components: a marginal model for Yij and a correlated regression model for Wij given Yij, by extending the ideas of Fitzmaurice and Laird (1995) and Gueorguieva and Agresti (2001). Let


then the new models can be expressed as

Yij=X1ijβ1+[sm epsilon]1i


Wij=X2ijβ2+Bi(YijX1ijβ1)+[sm epsilon]2i.

where Bi = Σi,21 Σ-1i,11 is the matrix that reflects association between Yij and Wij, ε1i ~ N(0, Σi,11), and [sm epsilon]2i~N(0,Σi,22*) with Σi,22*=Σi,22Σi,21Σi,111Σi,12. The reparameterization of Σi to (Σi,11,Bi,Σi,22*) is known in the literature as the Bartlett decomposition of a covariance matrix (Bartlett 1933).

It is easy to see that (2) is a correlated probit model and (3) is a standard correlated regression model, conditional on the latent variable Yij. For identifiability, it is common to restrict Σi,11 to be a correlation matrix, Ri,11 (Chib and Greenberg 1998); in the rest of this article, we use Ri,11 instead of Σi,11 as notation for the covariance matrix in (2). The advantage of this factorization is that the components of Bi in (3) are directly related to the variance and correlation terms in Σi,12. In addition, this factorization provides a convenient parameterization for examining the association between Yij (Qij) and Wij, because the components of the Bi matrix are unconstrained.

Beside being unconstrained, the association matrix Bi in model (3) can be easily interpreted. The tth row of Bi reflects the association of the continuous process at week t with the binary process at all weeks (t = 1, . . . , T). In particular, it corresponds to the regression model wij,t[mid ]Yij=x2ij,tβ2+bi,t(YijX1ijβ1)+[sm epsilon]2i,t where bi,t = (bi,t1, . . . , bi,tT)’ is the tth row of Bi. Because the covariates associated with bi,t, Yij - X1ijβ1, are centered with variance 1 (recall that the marginal covariance matrix of Yij is a correlation matrix), the components of Bi are standardized regression coefficients. This property of the components of Bi will facilitate between-component comparisons and motivate ideas for modeling it parsimoniously.

3.2 Priors for Parameters in Joint Models

For Bayesian inference, we need to specify priors for parameters in the models described in Section 3.1. Let bi denote the column vector obtained by stringing the rows of Bi (i = 1, . . . , m); that is, bi = (Bi,11, . . . , Bi,1T, . . . , Bi,TT)’. Let R1 = (R1,11, . . . , Rm,11)’, b=(b1,,bm), and Σ2*=(Σ1,22*,,Σm,22*). We write the joint prior in our models as

π(β,R1,b,Σ2*)=[product]i=1mπ(β)π(Ri,11)π(bi[mid ]β,Σi,22*)π(Σi,22*).

This specification implies that β, bi, and Σi,22* are a priori jointly independent of Ri,11 and, marginally, β and Σi,22* are a priori independent. Because we have little prior information for β, Σi,22* and Ri,11, we specify flat priors for β and Σi,22* and a joint uniform prior (derived in Barnard et al. 2000) for Ri,11,

π(β)[proportional, variant]1,
π(Σi,22*)[proportional, variant]I{Σi,22*[set membership](,)T(T+1)/2andΣi,22*is positive definite},


π(Ri,11)[proportional, variant]I{ri,jk:ri,jk=1(j=k),[mid ]ri,jk[mid ]<1(jk)andRi,11is positive definite},

where ri,jk (jk, j, k = 1, . . . ,T) is the off-diagonal element of the jth row and kth column in Ri,11.

3.2.1 Prior for the Elements of the Association Matrix

We now provide details on the prior π(bi[mid ]β,Σi,22*). Because the components of bi are the regression coefficients of Wij on Yij, we expect many of the components of bi to be 0’s based on conditional independence (Markov-type) arguments for longitudinal data. Consider, for example, the regression for wij,t. Once we condition on {Yij,k : t - 1 ≤ kt + 1} (current value and lag 1 value forward and backward), we might expect wij,t to be independent of {Yij,k : k > t + 1, k < t - 1} (values more than lag 1 away). To incorporate these features into the model, we specify a hierarchical prior distribution that essentially allows components of bi to be 0’s, borrowing ideas from Smith and Kohn (2002). This also will facilitate reducing the number of dependence parameters, which can be quite large.

A key feature of the hierarchical prior is that each component of bi (recall that each component is on the same scale because they are standardized regression coefficients) can be exactly 0 with positive probability. To attain this, we introduce the latent indicator vector δi associated with bi such that


where bi,tl is the component of the tth row and lth column in Bi and δi,tl is the corresponding binary indicator of δi that is associated with bi,tl. The nonzero components of the vector bi (i.e., the components for which δi,tl = 1) are given a normal prior (conditional on δi,tl = 1) with mean

E[bi[mid ]Y,β,δ,Σ22*]=(j=1niCijδ(niΣi,22*)1Cijδ)1×j=1niCijδ(niΣi,22*)1zij

and covariance matrix

var[bi[mid ]Y,β,δ,Σ22*]=(j=1niCijδ(nii,22*)1Cijδ)1,

where Zij = Wij - X2ijβ2, Cijδ is obtained by removing from Cij the columns corresponding to zero elements of bi, and Cij is T × T2 matrix with elements that are functions of Yij and the tth row, Cij,t=(0ij,1,,0ij,(t1),(YijX1ijβ1)t,0ij,(t+1),,0ij,T).The vector δi is then given the prior

π(δi[mid ]pi)=[product]t=1T[product]l=1Tπ(δi,tl[mid ]pi)
=[product]t=1T[product]l=1TBernoulli(pi[mid ]t1[mid ]1/a0+1),π(pi)=Beta(ri,λi),

where δi = (δi,11, . . . , δi,1T, . . . , δi,T1, . . . , δi,TT)’; δi,tl (i = 1, . . . , m; l, t = 1, . . . , T) is the element of δi associated with bi,tl, which is the regression coefficient of wij,t on yij,l, a0 is a tuning parameter; and (ri, λi) are the corresponding hyperparameters.

The prior for bi, given δi, is derived based on

π(bi[mid ]δi,β,Σi,22*)[proportional, variant]f1/ni(Wi[mid ]Yi,β,bi,δi,Σi,22*).

The rationale for (10) is that the prior provides only 1/nth of the weight provided by the likelihood.

Based on this prior construction, the quantity pi[mid ]tl[mid ]1/a0+1 in prior (8) can be considered the prior probability that bi,tl will require a nonzero estimate, and |t - l|1/a0 implies that bi,tl is more likely to be 0 as |t - l|1/a0 grows larger. Here a0 is a tuning parameter that controls the rate of decay of the probability of a nonzero component as a function of lag. Expression (8) implies that the components of Bi become smaller a priori as they move away from the main diagonal (in the longitudinal setting, become smaller as they move farther away in time). This exponent also can be adjusted if we expect, a priori, a lagged relationship.

3.2.2 Using the Prior for the Association Matrix in Practice

A complication with the prior for bi is that it is a function of (Y,β,Σi,22*), which complicates the form of the full conditional distributions for the MCMC algorithm described in the Web appendix, Section I ( To address this issue, we replace the mean and covariance for the prior with μ^b and covariance [Sigma]b, defined later. These quantities are computed from an MCMC run with bi = 0 as kE(bi[mid ]Y(k),β(k),δ,Σ22(k)*) and Σ^biδ=kvar(bi[mid ]Y(k),β(k),δ,Σ22(k)*), where k indexes the posterior sample. An alternative would be to just use a mean and variance with the posterior means of (Y,β,Σ22*) plugged in.

3.2.3 Propriety of the Joint Posterior Distribution

Given that we have little prior information on β, Σi,22* and Ri,11, we propose using flat priors for β, Σi,22*, and Ri,11. Because the priors for β and Σi,22* are improper, we need to check that the joint posterior distribution is integrable. In the Web appendix, Section II (, we state (and prove) a theorem that guarantees integrability of the posterior when four easily checked conditions are met. The proof of this theorem extends some results of Chen and Shao (1999) for the propriety of posterior distributions for multivariate categorical response models and of Daniels (2006) for the propriety of the posterior for linear regression with cor related and/or heterogeneous errors.


4.1 The MCMC Sampling Algorithm

We develop an MCMC algorithm to sample from the posterior distribution of the parameter θ=(β,R1,b,δ,p,Σ2*), where δ=(δ1,,δm), p = (p1, . . . , pm)’, and R1, b, and Σ2* are defined as in (4). To simplify implementation of the algorithm, we include a data augmentation (DA) step (Tanner and Wong 1987) that we use to impute the latent data and the missing values. Let Qobs be the vector of observed binary outcomes, Qmis be the vector of missing binary outcomes, Wobs be the vector of observed continuous outcomes, and Wmis be the vector of missing continuous outcomes. Let Yobs denote the latent vector associated with Qobs and let Ymis denote the latent vector related to Qmis. Define Y to be (Yobs,Ymis) Q to be (Qobs,Qmis) and W to be (Wobs,Wmis). We use the generic notation f for the distribution of responses, π for the prior and posterior distributions of related parameters, and L for the likelihood function. Our algorithm comprises a DA imputation (DAI) step and a posterior sampling step as follows:

  1. DAI step. Sample latent data Y and missing values Wmis from f (Y, Wmis|θ,Qobs, Wobs). To do this, factor this distribution as
    f(Y,Wmis[mid ]θ,Qobs,Wobs)=f(Wmis[mid ]θ,Yobs,Ymis,Wobs)f(Ymis[mid ]θ,Yobs,Wobs)×f(Yobs[mid ]θ,Qobs,Wobs).
  2. PS step (posterior sampling step). Generate θ from f(θ|Y, W) using Gibbs sampling. The DAI step involves sampling in the order from [Yobs|θ, Qobs, Wobs], [Ymis|θ, Yobs, Wobs], and [Wmis|θ, Yobs, Ymis, Wobs]. Once we obtain latent data and missing values, we need to sample from full conditional distributions of the components of θ. This can be completed using the Gibbs sampler. All of the full conditional distributions for the Gibbs sampler are derived in the Web appendix, Section I (

4.2 Technical Issues

To implement the MCMC algorithm efficiently, two technical issues must be addressed: imputing the latent data and sampling the correlation matrix.

4.2.1 Imputing the Latent Data

One of the challenges with the multivariate probit models is the simulation of latent variables from the TMVN distribution of Yobs given (θ, Qobs, Wobs). We propose an algorithm to sample from this distribution efficiently.

For simplicity, assume that Y is a T × 1 vector and Y ~ N(μ, Σ)IU1, where U1 [set membership] CT is a truncation region. If we partition Y, μ, and Σ as




then we have yt[mid ]Y(t)=y(t)~N(μt*,σtt*)IU1t where


Here U1t = {yt [set membership] C : (yt, y(-t)) [set membership] U1}. Geweke (1991) and Robert (1995) proposed a Gibbs sampling algorithm for sampling Y. The kernel of the Markov chain y(k)=(y1(k),,yT(k)) in this algorithm was obtained by successively generating the components of y from their full conditional distributions π(yt(k)[mid ]y1(k),,yt1(k),,yt+1(k1),,yT(k1)). A disadvantage of this algorithm is that it requires repeatedly computing the T means and variances given in (12) and often results in high autocorrelations. The following proposition provides a simple way to sample from the TMVN distribution without the need to compute (12) each time; we expect it to provide lower autocorrelations as well.

Proposition 1

Suppose that Y~ TN(μ, Σ)IU1, where U1 is the truncation region of Y. Decompose Σ as PP’, where P is a lower triangular matrix. If U1 is a convex set, then simulating Y from TN(μ, Σ)IU1 is equivalent to first sampling Z from TN(ν,IT × T)IU2 and then translating back to Y through Y = PZ. Here, ν = P-1μ, and U2 is the transformed version of U1 (see the Web appendix, Section II for the proof).

This proposition is motivated by the integral calculation method mentioned by Chib and Greenberg (1998, p. 354). The idea behind this proposition is to obtain a more efficient implementation of the Gibbs sampler based on the new set of T conditional distributions of components of Z. These distributions are simple in the sense that we do not need to compute means and variances in (12), and the univariate truncation intervals U2t can be easily derived. For example, in our model, Y~ TN(μ, R)IU1 with U1 = {y [set membership] CT: S1y ≤ 0}, where S1 is an diagonal matrix with S1 (t, t) = 1 if Qt = 0 and -1 if Qt = 1 (t = 1, . . . , T). By the transformation Z = P-1Y, we have Z~ TN(ν, IT × T)IU2, where U2 = {z [set membership] CT:S2z ≤ 0} and S2 = S1P.

Thus, at iteration k, zt(k)[mid ]z1(k),,zt1(k),zt+1(k1),,zt(k1)~N(νt,1)IU2t, where νt is the tth element of ν and U2t = {zt[set membership]C:S2z ≤ 0}. Let S2(-t) be the matrix {s1, . . . , st-1, st+1, . . . , sT}’, and let z(-t) denote the vector {z1, . . . , zt-1, zt+1, . . . , zT}’. Then U2t is given by

U2t={zt[set membership]C:stztS2(t)z(t)}.

The Markov chain y(k)=(y1(k),,yT(k)) in our implementation of the Gibbs sampler can be obtained by first generating all components of z one by one from π(zt(k)[mid ]zt(k),,zt1(k),zt+1(k1),,zT(k1))(t = 1, . . . , T), and then translating back to y(k) by y(k) = Pz(k).

4.2.2 Sampling the Correlation Matrix

Sampling correlation matrices in MCMC algorithms can be problematic. In addition to the positive definite constraint of covariance matrices, they have diagonal elements fixed at 1. The ideas for data augmentation and parameter expansion, introduced by Liu, Rubin, and Wu (1998), Liu and Wu (1999), and van Dyk and Meng (2001) to speed up convergence of algorithms (EM, DA, or others), provide a useful tool for addressing this problem. Liu (2007) and Liu and Daniels (2006) developed two-stage parameter expanded reparameterization and Metropolis-Hastings (PX-RPMH) algorithms for sampling a correlation matrix R by extending this idea. In these algorithms, the difficulty of simulating R can be overcome by creating an “expanded” model in which R can be transformed to a less constrained covariance matrix Ψ by borrowing the scale parameters from an expansion parameter matrix. In what follows, we derive an PX-RPMH algorithm for sampling the correlation matrix Ri,11 in the joint models in Section 3. For notational convenience, we denote Ri,11 as Ri in this section.

Define θ(-Ri) to be the parameter vector not including Ri, and define Di to be the expansion parameter, which is a diagonal matrix that we introduce to transform Ri into a less constrained covariance matrix, Ψi = DiRiDi. Consider the following one-to-one mapping from {Yij, Ri,Bi to {Yij*,Ψi,Bi*}:


where Σj=1niYij,k*2=1 for any t = 1, . . . , T. Given β, the step that draws Yij implicitly draws Yij* and Di, because Σj=1ni(Yij,tx1ij,tβ)2=Di,tt1Σj=1niYij,t*2=Di,tt1, where Di,tt is the tth element of Di and x1ij,t is the tth row of X1ij. The space for (Yij*,ψi,Bi*) is higher-dimensional than that for (Yij, Ri, Bi), because Ri has fewer parameters than ψi. The constraints Σj=1niYij,t*2=1 for any t = 1, . . . , T, are needed to make the candidate transformation a one-to-one mapping. By specifying the candidate prior for Ri, given by

π(Ri)[proportional, variant][mid ]Ri[mid ]ai/2I{ri,jk:ri,jk=1(j=k),[mid ]ri,jk[mid ]<1(jk)andRi,11is positive definite},

where ai is a constant to be determined, we can derive a (parameter-expanded) candidate density (PXCD) for Ψi based on the following proposition. Note that the candidate prior is introduced solely to derive a candidate density for the Metropolis-Hastings algorithm. It is not used for inference.

Proposition 2

If we choose priors as specified in Section 3.2, then, from the likelihood function for the complete data in (2), transformation (14) and candidate prior (15), we obtain

π(Ψi[mid ]Yi*,Bi*,β)[proportional, variant][mid ]Ψi[mid ](νi+T+1)/2×exp{12tr(SiΨi1)},

where νi = ni - T, Si=Σj=1niYij*Yij*, Yi*=(Yi1*,,Yini*), and Yij*=Di(YijX1ijβ). That is, Ψi[mid ]Yi*,Bi,β, Bi, β has an inverse-Wishart distribution with degrees of freedom νi and scale parameter Si (see the Web appendix, Section II for the proof).

Proposition 2 gives the PXCD of Ψi to use as the proposal density in the Metropolis-Hastings stage. In this stage, we first simulate Ψi from (16), and then obtain the correlation matrix Ri through the reduction function P(Ψi)=Di1ΨiDi1. Second, we keep the candidate Ri with probability αi (the acceptance rate in the Metropolis-Hastings algorithm). Sampling Ri based on this algorithm is given in the following theorem.

Theorem 1

Assume that θ(-Ri) and Ri are a priori independent, that is, π(θ(-Ri)), Ri) = π(θ(-Ri)π(Ri). If we choose priors as specified in Section 3.2 for θ(-Ri) and Ri, then, under transformation (14) and candidate prior (15), simulating Ri is equivalent to simulating Ψi first from the inverse-Wishart distribution (16) and then translating it back to Ri through Ri=Di1ΨiDi1 in (14) and accepting the candidate Ri using a Metropolis-Hastings step with some acceptance rate αi, where αi=min{1,exp(ai2(log[mid ]Ri[mid ]log[mid ]Ri(k)[mid ]))} at iteration k + 1 (see the Web appendix, Section II for the proof).

Theorem 1 provides a simple way to simulate the correlation matrix in the models proposed in this study.

4.2.3 Efficiency of the New Algorithms

Previous work (Liu 2007; Liu and Daniels 2006) has shown that the PX-RPMH algorithm is more efficient than other methods, such as the GG sampler (Ritter and Tanner 1992) and RW-MH algorithms (Chib and Greenberg 1998), for sampling a correlation matrix. Those authors compared the performance of the algorithms in detail, and the algorithms will generalize to our setting, the joint models specified in Section 3.

For sampling from the TMVN as described in Section 5.2.1, we conducted several simulations to compare our method with the Gibbs sampling technique of Robert (1995). In particular, we evaluated the mixing of Markov chain output from the two algorithms by calculating the lag-n (n = 1, 2, 3,. . .) autocorrelation of each component of Y. (The faster the decay of autocorrelation, the faster the mixing.) We denote our algorithm by LD-A and that of Robert (1995) by R-A. The decay of the autocorrelation was much faster for the LD-A algorithm than for the R-A algorithm; see the Web appendix, Section III ( for additional details. We next provide some summary remarks.

Remark 1

One computational inefficiency of the R-A for sampling from the TMVN distribution is that we need to repeatedly compute conditional means and variances in (12). The LD-A does not require this updating. The computational gains from this aspect are minimal, however.

Remark 2

The mixing of the chain from the R-A grows slower as the truncation region U1 increases; however, the LD-A has the advantage of fast mixing, regardless of the volume of U1. Specifically, when the truncation region approaches infinity (i.e., no truncation), the LD-A provides an iid sample.

Remark 3

The correlation between components of Y has little influence on the mixing of the chain from the LD-A, whereas it affects the performance of the R-A. We see this when comparing the two algorithms for various choices for Σ.


5.1 Model Comparison

We now consider the problem of comparing alternative models. For the models in Section 3, competing models arise from restrictions on the prior for the association matrix Bi, correlation matrices Ri,11, and/or conditional covariance matrices Σi,22* For example, we might assume that Ri,11 = R11 and Σi,22*=Σ22* when there are insufficient data to estimate all of the parameters. Although Bayes factors and marginal likelihoods are appealing, these approaches are very difficult to implement in a complex hierarchical model, such as the one proposed in this article (Liu 2006). As a result, we derive the deviance information criterion (DIC) (Spiegelhalter, Best, Carlin, and van der Linde 2002) to compare the alternative models. A main reason that we use the DIC is that its computation will be a ready byproduct of the MCMC simulations. But we note its drawbacks, which include lack of invariance to reparameterization and potential ambiguity in the choice of likelihood.

The DIC is defined as a classical estimate of fit plus a penalty,


where D(θ) is the posterior mean of the deviance and pD=D(θ)D(θ) is the effective number of parameters in the model. Given that we have missing data from subjects dropping out, we use the observed data likelihood as the likelihood to construct the DIC (Celeux, Forbes, Robert, and Titterington 2006; Daniels and Hogan 2008). Thus D(θ) is defined as

D(θ)=2Eθ[mid ]Qobs,Wobs(logf(Qobs,Wobs[mid ]θ))+2logf(Qobs,Wobs)

and pD is given by

PD=2Eθ[mid ]Qobs,Wobs(logf(Qobs,Wobs[mid ]θ))+2logf(Qobs,Wobs[mid ]θ),

where θ = (β, Σ), and θ‾ is the posterior mean. Details of the computation of the DIC for the joint models are given in the Web appendix, Section IV (

5.2 Model Checking

The “best” model chosen from models under consideration according to the DIC still may not fit the data well. To check model fit, we use posterior predictive checks based on a discrepancy function (Gelman, Meng, and Stern 1996). The “significance” of these checks often is summarized with the posterior predictive p value (ppp) (Gelman et al. 1996); however, the ppp has the major problem that its distribution under the null tends to concentrate around .5 (Robins, Ventura, and van der Vaart 2000). As a result, we check the goodness of fit of the selected joint models based on the calibrated ppp (cppp) (Hjort, Dahl, and Steinbakk 2006), defined as


where u = (Q’,W’)’, ppp(uobs) is the ppp derived by Gelman, Mechelen, Verbeke, Heitjan, and Meulders (2004), and U has the distribution implied by the prior and the model. Note that uobs corresponds to the observed vector of responses along with the data augmented responses sampled at each iteration (i.e., Qmis and Wmis). We discuss the form of the discrepancy function in the example. Hjort et al. (2006) showed that the distribution of ppp(U) is a Uniform(0,1) and that the cppp in (17) is a proper p value. To calculate ppp(U) in (17), we need to first produce a sample of U. We can obtain U by first sampling θ from (4) and then drawing U from f (U|θ). The problem arises in sampling θ from π(θ), because in (4), we specified flat priors for β and Σi,22*. We address this sampling issue next.

To approximately sample β and Σi,22* from flat priors, we use the following diffuse priors in place of the improper priors for β and Σi,22* As we did for the prior for bi in Section 3.2.2, we run the independence model with bi = 0 to obtain the posterior mean and covariance of β (denoted by μβ and Vβ) and the posterior mean of Σi,22* (denoted by μΣi,22*). Then the approximate diffuse prior for β can be set as a normal with mean μβ and covariance matrix Vβ multiplied by the number of subjects (as with a unit information prior [Kass and Wasserman 1996]), and the diffuse prior for Σi,22* can be set as an inverse-Wishart with degrees of freedom T and scale matrix μΣi,22*. Note that to sample Ri,11 from the uniform prior, we use the algorithm of Joe (2006).


We use the methodology described in Sections 3-5 to analyze the association of longitudinal quit status and weight change in the CTQ II clinical trial described in Section 2. We removed the observations at week 1 and 2 from the data, because the quit rates in these two weeks were very low (0 and 1.10%) due to the design of the study in which subjects were not supposed to try to quit smoking until week 3. Although participants were encouraged to make up missed sessions, there still existed intermittent missing values in quit status and/or weight gain. In addition, a large number of subjects dropped out before the end of the experiment. In what follows, we assume this missingness is ignorable.

For the mean of the two longitudinal responses as a function of covariates, we set β in (1) to be the vector of means at each time point across treatments; thus, the tth row of the design matrix is a vector of 0’s, with a 1 in the tth slot. Exploratory analysis suggested setting a0 in (8) to be 4; this implies a slower decrease than the raw lag.

6.1 Comparing the Competing Models

We considered several models arising from restrictions on the association matrixes Bi, the prior for the association matrices Bi, the correlation matrices Ri,11 and conditional covariance matrices Σi,22*. We fit a total of nine models to the CTQ II trial data. Denote the kth alternative model by Mk. Table 1 gives the details of all of the models considered.

Table 1
Models under consideration

For each model in Table 1, we computed the DIC using the methodology derived in Section 5. The DICs for all of the models are given in Table 2. From Table 2, we can see that the DIC is the smallest for model M5 and the largest for model M3. In general, the models using shrinkage priors for association matrices fit better.

Table 2
Estimates of the DIC

6.2 Checking the Goodness of Fit

For the CTQ II data, we defined the discrepancy function to be weekly quit rates or weekly average weight gains. Table 3 gives the cppp for quit rates and average weight gains at each week across treatments for model 5 (M5). These p values were calculated based on comparison of 2,000 pairs of [ppp(u), ppp(uobs)], with the ppp obtained using 5,000 iterations after burn-in. From this table, we can see that there are no extreme p values (<.05 or >.95). These checks suggest that the joint models (M5) fit the mean structure of the CTQ II clinical trial data well.

Table 3
Calibrated posterior predictive p value for quit rate (QR) and average weight gain (AWG) at each week across treatments for model 5 (M5)

6.3 Inference on the Quit Rates and Association

Given the results presented in Sections 6.1 and 6.2, we base inference on model M5 (Table 1). We ran the MCMC algorithm described in Section 4.1 until convergence (determined by examining trace plots of multiple chains) and based inference on the last 10,000 iterations after burn-in.

Posterior means and 95% credible intervals (CIs) for quit rates across treatments are given in Table 4. The quit rates over time were slightly lower in the exercise treatment than in the wellness treatment. The 95% CI for the difference of quit rates between two treatments at the final week was (-.014, .150), marginally significant. This suggests that the exercise treatment does not have a positive effect on smoking cessation (more later).

Table 4
Posterior means and 95% CIs for quit rates

We now turn our attention to the association between smoking cessation and weight gain. The pi (i = 1, 2) defined in (9) can be viewed as a summary measure of the overall magnitude of the association between quit status and weight gain for the two treatments. The estimates of pi are p1 = .26 (no exercise) and p2= .18 (exercise); the 95% CI for their difference is (.025, .134). These results support the hypothesis that exercise weakens the association between quit status and weight gain.

This weakening also can be seen by examining the posterior means of association matrixes across treatments, as given in Table 5. We have removed from the table those elements of the Bi matrix with probabilities of the corresponding indicators being equal to 1 of <.1. The weakened associations between smoking cessation and weight gain is obvious by noting the presence of more 0’s under the exercise treatment and the larger magnitude of the (standardized) coefficients.

Table 5
Posterior means of the association matrices with 95% CIs for each treatment

Table 6 shows the posterior means of pairwise correlations with 95% credible intervals; correlations whose 95% credible intervals covered 0 are excluded. We can see that smoking cessation and weight gain appear to have a lagged correlation structure, and that exercise weakens pairwise correlations. In particular, we point out the 2 × 2 blocks in the upper right corners of pairwise correlation matrixes under both treatments (in bold-type). For the wellness treatment, the four pairwise correlations between weight gain at the beginning of the study and quit status at weeks 5 and 6 are all negative. This means that people who gain weight early in the trial are more likely to be smoking at the end. The corresponding correlations are essentially 0’s (no longer significant) under the exercise arm. In addition, looking at the last row in Table 6 for the wellness arm, the correlations indicate those who quit early in the study are more likely to gain weight by the end (week 6); the corresponding relationship in the exercise arm is weaker.

Table 6
Posterior means of pairwise correlations with 95% CIs for each treatment


We have developed joint models for the association of longitudinal binary and continuous processes and applied them to the analysis of the CTQ II clinical trial to gain insight into the joint evolution of smoking status and weight gain. The results show that moderate-intensity exercise was not successful for smoking cessation, but that it did appear to weaken the association between smoking status and weight gain, supporting the hypothesis that exercise has an effect on smoking cessation by weakening the association between quitting smoking and gaining weight.

But we should be cautious in overinterpreting these results, because of the low compliance in the exercise arm. We might expect the low compliance to negatively bias the smoking cessation results (probability of quitting was too low) on the exercise arm; that is, the intention-to-treat effect (randomization to the exercise arm) reported here might be expected to be quite different from the causal effect (adherence to the exercise regimen). But the ability to still see the weakened association between weight gain and smoking on the exercise arm despite the low compliance supports this as the mechanism of action for exercise as a therapy for smoking cessation. In future work, we will extend these joint models to estimate causal effects and allow for nonignorable dropout.

An alternative way to factor model (1) is to decompose it into a marginal model for Wij and a conditional probit regression model for Yij, conditioning on Wij. This factorization could potentially greatly increase the computational burden in sampling the correlation matrix if the diagonal elements of Σi,11 were still fixed at 1 (due to not being identified); however, this computational problem could be avoided by fixing the diagonal elements of the conditional covariance matrix Σi,11*=Σi,11Σi,12Σi,221Σi,21 to be l’s (i.e., make this matrix a correlation matrix). For interpretation of the mean parameters, β, this computationally simpler approach would require adjusting the β components corresponding to the longitudinal binary process with the diagonal elements of the marginal covariance matrix Σi,11, which in this case would have nonidentical diagonal elements.

The general methodology proposed here can be applied to analysis of other data sets where there are two processes and the question of interest is the association between the two processes. The methodology also can be directly extended to other longitudinal cases, such as modeling the association between longitudinal ordinal and continuous processes or between two continuous processes (Liu 2006).

Supplementary Material

web appendix

Contributor Information

Xuefeng Liu, Department of Biostatistics and Epidemiology, East Tennessee State University, Johnson City, TN 37614.

Michael J. Daniels, Department of Statistics, University of Florida, Gainesville, FL 32611.

Bess Marcus, Department of Psychiatry & Human Behavior, Brown University, Providence, RI 02912.


  • Barnard J, McCulloch R, Meng XL. Modeling Covariance Matrices in Terms of Standard Deviations and Correlations With Application to Shrinkage. Statistica Sinica. 2000;10:1281–1311.
  • Bartlett MS. On the Theory of Statistical Regression. Proceedings of the Royal Society of Edinburgh. 1933;53:260–283.
  • Carlin BP, Chib S. Bayesian Model Choice via Markov Chain Monte Carlo Methods. Journal of the Royal Statistical Society, Ser. B. 1995;57:473–484.
  • Catalano PJ, Ryan LM. Bivariate Latent Variable Models for Clustered Discrete and Continuous Outcomes. Journal of the American Statistical Association. 1992;87:651–658.
  • Celeux G, Forbes F, Robert CP, Titterington DM. “Deviance Information Criteria for Missing Data Models” (with discussion) Bayesian Analysis. 2006;1:651–706.
  • Chen M-H, Shao Q-M. Properties of Prior and Posterior Distributions for Multivariate Categorical Response Data Models. Journal of Multivariate Analysis. 1999;71:277–296.
  • Chib S, Greenberg E. Bayesian Analysis of Multivariate Probit Models. Biometrika. 1998;85:347–361.
  • Chipman H. Bayesian Variable Selection With Related Predictors. Canadian Journal of Statistics. 1996;24:17–36.
  • Cox DR, Wermuth N. Response Models for Mixed Binary and Quantitative Variables. Biometrika. 1992;79:441–461.
  • Daniels MJ. Bayesian Modelling of Several Covariance Matrices and Some Results on Propriety of the Posterior for Linear Regression With Correlated and/or Heterogeneous Errors. Journal of Multivariate Analysis. 2006;97:1185–1207.
  • Daniels MJ, Hogan JW. Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman & Hall/CRC Press; Boca Ration, FL: 2008.
  • Daniels MJ, Kass RE. Shrinkage Estimators for Covariance Matrices. Biometrics. 2001;57:1173–1184. [PMC free article] [PubMed]
  • Daniels MJ, Normand SL. Longitudinal Profiling of Health Care Units Based on Continuous and Discrete Patient Outcomes. Biostatistics. 2006;7:1–15. [PMC free article] [PubMed]
  • Dunson DB. Bayesian Latent Variable Models for Clustered Mixed Outcomes. Journal of the Royal Statistical Society, Ser. B. 2000;62:355–366.
  • Dunson DB. Dynamic Latent Trait Models for Multidimensional Longitudinal Data. Journal of the American Statistical Association. 2003;98:555–563.
  • Fitzmaurice GM, Laird NM. Regression Models for a Bivariate Discrete and Continuous Outcome With Clustering. Journal of the American Statistical Association. 1995;90:845–852.
  • Gelman A, Mechelen IV, Verbeke G, Heitjan DF, Meulders M. Multiple Imputation for Model Checking: Completed-Data Plots With Missing and Latent Data. Biometrics. 2004;61:74–85. [PubMed]
  • Gelman A, Meng X-L, Stern H. “Posterior Predictive Assessment of Model Fitness via Realized Discrepancies” (with discussion) Statistica Sinica. 1996;6:733–807.
  • George EI, McCulloch RE. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association. 1993;88:881–889.
  • George EI, McCulloch RE. Approaches for Bayesian Variable Selection. Statistica Sinica. 1997;7:339–373.
  • Geweke J. Efficient Simulation From the Multivariate Normal and Student t-Distributions Subject to Linear Constraints; Computer Sciences and Statistics, Proceedings of the 23rd Symposium Interface; 1991.pp. 571–578.
  • Gueorguieva RV, Agresti A. A Correlated Probit Model for Joint Modelling of Clustered Binary and Continuous Responses. Journal of the American Statistical Association. 2001;96:1102–1112.
  • Gueorguieva RV, Sanacora G. Joint Analysis of Repeatedly Observed Continuous and Ordinal Measures of Disease Severity. Statistics in Medicine. 2006;25:1307–1322. [PubMed]
  • Hjort NL, Dahl FA, Steinbakk GH. Post-Processing Posterior Predictive p Values. Journal of the American Statistical Association. 2006;101:1157–1174.
  • Hoeting J, Raftery AE, Madigan D. A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression. Computational Statistics and Data Analysis. 1996;22:251–270.
  • Joe H. Generating Random Correlation Matrices Based on Partial Correlations. Journal of Multivariate Analysis. 2006;97:2177–2189.
  • Kass RE, Wasserman L. The Selection of Prior Distributions by Formal Rules. Journal of the American Statistical Association. 1996;91:1343–1370. Corr. (1998), 93, 412.
  • Liu C, Rubin DB, Wu YN. Parameter Expansion to Accelerate EM: The PX-EM Algorithm. Biometrika. 1998;85:755–770.
  • Liu JS, Wu Y. Parameter Expansion for Data Augmentation. Journal of the American Statistical Association. 1999;94:1264–1274.
  • Liu XF. unpublished doctoral dissertation. University of Florida, Dept. of Statistics; 2006. Bayesian Methodology for Models With Multivariate (Longitudinal) Outcomes.
  • Liu XF. Parameter Expansion for Sampling a Correlation Matrix: An Efficient GPX-RPMH Algorithm. Journal of Statistical Computation and Simulation. 2008;78:1065–1076.
  • Liu XF, Daniels MJ. A New Algorithm for Simulating a Correlation Matrix Based on Parameter Expansion and Re-Parameterization. Journal of Computational and Graphical Statistics. 2006;16:897–914.
  • Marcus BH, Albrecht AE, King TK, Parisi AF, Pinto BM, Roberts M, Niaura RS, Abrams DB. The Efficacy of Exercise as an Aid for Smoking Cessation in Women. Archives of Internal Medicine. 1999;36:479–492. [PubMed]
  • Marcus BH, King TK, Albrecht AE, Parisi AF, Abrams D. Rationale, Design, and Baseline Data for Commit to Quit: An Exercise Efficacy Trial for Smoking Cessation Among Women. Preventive Medicine. 1997;26:586–597. [PubMed]
  • Marcus BH, Lewis BA, Hogan J, King TK, Albrecht AE, Bock B, Parisi AF, Niaura R, Abrams DB. The Efficacy of Moderate-Intensity Exercise as an Aid for Smoking Cessation in Women: A Randomized Controlled Trial. Nicotine & Tobacco Research. 2005;7:871–880. [PubMed]
  • Marcus BH, Lewis BA, King TK, Albrecht AE, Hogan J, Bock B, Parisi AF, Abrams DB. Rationale, Design, and Baseline Data for Commit to Quit II: An Evaluation of the Efficacy of Moderate-Intensity Physical Activity as an Aid to Smoking Cessation in Women. Preventive Medicine. 2003;36:479–492. [PubMed]
  • Pourahmadi M, Daniels MJ. Dynamic Conditionally Linear Mixed Models for Longitudinal Data. Biometrics. 2002;58:225–231. [PMC free article] [PubMed]
  • Regan MM, Catalano PJ. Likelihood Models for Clustered Binary and Continuous Outcomes: Application to Developmental Toxicology. Biometrics. 1999;55:760–768. [PubMed]
  • Ritter C, Tanner MA. Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler. Journal of the American Statistical Association. 1992;87:861–868.
  • Robert CP. Simulation of Truncated Normal Variables. Statistics and Computing. 1995;5:121–125.
  • Robins JM, Ventura V, van der Vaart A. Asymptotic Distribution of P Values in Composite Null Models. Journal of the American Statistical Association. 2000;95:1143–1156.
  • Roy J, Lin X. Latent Variable Models for Longitudinal Data With Multiple Continuous Outcomes. Biometrics. 2000;56:1047–1054. [PubMed]
  • Sammel MD, Ryan LM, Legler JM. Latent Variable Models for Mixed Discrete and Continuous Outcomes. Journal of the Royal Statistical Society, Ser. B. 1997;59:667–678.
  • Smith M, Kohn R. Parsimonious Covariance Matrix Estimation for Longitudinal Data. Journal of the American Statistical Association. 2002;97:1141–1153.
  • Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian Measures of Model Complexity and Fit” (with discussion) Journal of the Royal Statistical Society, Ser. B. 2002;64:583–639.
  • Tanner MA, Wong WH. The Calculation of Posterior Distribution by Data Augmentation” (with discussion) Journal of the American Statistical Association. 1987;82:528–550.
  • van Dyk DA, Meng XL. The Art of Data Augmentation” (with discussion) Journal of Computational and Graphical Statistics. 2001;10:1–111.
  • Wakefield JC, Bennett JE. The Bayesian Modelling of Covariates for Population Pharmacokinetic Models. Journal of the American Statistical Association. 1996;91:917–927.