

Lifetime Data Anal. Author manuscript; available in PMC 2010 April 22.

Published in final edited form as: *Lifetime Data Anal*. Published online 2008 October 3. doi: 10.1007/s10985-008-9101-5

PMCID: PMC2858597

NIHMSID: NIHMS128678

Joseph G. Ibrahim, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA, e-mail: ibrahim@bios.unc.edu


**Abstract**

In this paper, we develop Bayesian methodology and computational algorithms for variable subset selection in Cox proportional hazards models with missing covariate data. A new joint semi-conjugate prior for the piecewise exponential model is proposed in the presence of missing covariates and its properties are examined. The covariates are assumed to be missing at random (MAR). Under this new prior, a version of the Deviance Information Criterion (DIC) is proposed for Bayesian variable subset selection in the presence of missing covariates. Monte Carlo methods are developed for computing the DICs for all possible subset models in the model space. A Bone Marrow Transplant (BMT) dataset is used to illustrate the proposed methodology.

**1 Introduction**

Bayesian variable selection in survival analysis is still one of the most challenging problems encountered in practice due to issues regarding (i) prior elicitation, (ii) evaluation of a model selection criterion due to the complication of censoring, and (iii) numerical computation of the criterion for all possible models in the model space. In the context of survival analysis, these issues have been discussed in Ibrahim et al. (1999a, 2001a) and the many references therein. There have been numerous papers in the statistical literature on Bayesian variable selection and model comparison, including articles by George and McCulloch (1993, 1997); Laud and Ibrahim (1995); George et al. (1996); Raftery (1996); Smith and Kohn (1996); Raftery et al. (1997); Brown et al. (1998, 2002); Clyde (1999); Chen et al. (1999, 2003, 2008); Dellaportas and Forster (1999); Chipman et al. (1998, 2001, 2003); George (2000); George and Foster (2000); Ibrahim et al. (2000); Ntzoufras et al. (2003) and Clyde and George (2004). However, the literature on Bayesian variable selection in the presence of missing data, and in particular for survival data with missing covariates, is still quite sparse. Part of the reason for this is that in the presence of missing covariate data, models can become quite complex and closed forms are not available even in the simplest of models. Thus, computing quantities such as Bayes factors, posterior model probabilities, the Akaike Information Criterion (AIC) (Akaike 1973), the Bayesian Information Criterion (BIC) (Schwarz 1978), and the Deviance Information Criterion (DIC) (Spiegelhalter et al. 2002) becomes a serious computational challenge. For example, to compute BIC in the presence of missing covariate data, one would need to maximize the observed data likelihood.
There are two challenging issues with this: (i) the observed data likelihood does not have a closed form for most models, even for the linear model when the covariates are not normally distributed, and a suitable approximation is often not available, and (ii) maximizing the observed data likelihood can be a huge challenge even when it is available in closed form. There are also several technical issues in computing AIC and BIC in the presence of missing covariates. One could argue that these measures are not well defined in the context of missing covariate data since the penalty term is not clearly defined. In particular, if we use the observed data likelihood obtained by averaging over the possible missing values of the covariates according to the missing covariate distribution, it is not clear how to appropriately define the dimensional penalty for AIC and BIC. We elaborate more on this issue in Sect. 5.

This issue becomes even more complex when computing Bayes factors, as one has to integrate over a very large space and the integrals easily become very high dimensional even in the simplest missing data problems. Specifically, it is well known that for methods based on Bayes factors or posterior model probabilities, proper prior distributions are needed. It is a major task to specify prior distributions for all models in the model space, especially if the model space is large. For survival models with missing covariates, it becomes even more challenging to specify prior distributions, as in this case, one needs to specify priors not only for the regression coefficients in the survival model, but also for the parameters involved in the models for the missing covariates. The prior elicitation issue has been discussed in detail by several authors including Laud and Ibrahim (1995); Chen et al. (1999) and Ibrahim and Chen (2000). In addition, it is well known that Bayes factors and posterior model probabilities are generally sensitive to the choices of prior hyperparameters, and thus one cannot simply select vague proper priors to get around the elicitation issue. Even when informative prior distributions are available, computing Bayes factors and posterior model probabilities is difficult and expensive as one needs to compute prior and posterior normalizing constants for each model in the model space. It may be practically infeasible to compute these quantities in the context of variable subset selection for survival models with missing data. Alternatively, criterion-based methods can be attractive in the sense that they do not require proper prior distributions in general, and thus have an advantage over posterior model probabilities in this sense. Several recent papers advocating the use of Bayesian criteria for model assessment include Geisser and Eddy (1979); Gelfand et al. (1992); Gelfand and Dey (1994); Ibrahim and Laud (1994); Laud and Ibrahim (1995); Gelman et al. (1996); Dey et al. (1997); Gelfand and Ghosh (1998); Ibrahim et al. (2001b); Spiegelhalter et al. (2002); Chen et al. (2004); Huang et al. (2005); Hanson (2006); Celeux et al. (2006) and Kim et al. (2007).

To overcome some of the methodologic and computational issues mentioned above, we develop two methodologies in this paper: (i) a class of semi-conjugate priors in the presence of MAR covariate data, and (ii) a variation of DIC for survival models with missing covariates. The proposed class of priors overcomes the elicitation issues mentioned above as well as the computational challenges. The proposed priors make elicitation easier than other conventional informative priors by basing the elicitation on a prediction for observable quantities rather than on the parameters themselves, together with a scalar quantifying the confidence in that prediction. This approach is especially attractive in variable selection since the regression coefficients have a different contextual meaning and interpretation in every model in the model space, and thus specifying hyperparameters for all of the models is a monumental task. This elicitation challenge can be overcome by focusing on constructing a prior based on a prediction for the response variable, as pointed out by Laud and Ibrahim (1995) and Chen and Ibrahim (2003). The proposed priors are also computationally attractive in that they lead to full conditionals that are log-concave and hence easily sampled via Adaptive Rejection Sampling (ARS) (Gilks and Wild 1992) within the Gibbs sampler. Thus, sampling the posterior with these priors is computationally very efficient.

The proposed version of DIC extends the version discussed in Huang et al. (2005) for generalized linear models with missing covariates. For survival data with censored observations and missing covariates, DIC has a computational advantage over other criterion-based methods, such as AIC or BIC. With the computational methods developed in Sect. 4, the DIC measures can be easily computed for all models in the model space for a moderate number of covariates. In contrast, computation of AIC or BIC becomes quite difficult and challenging for variable subset selection for survival data with censored observations and missing covariates.

The rest of this paper is organized as follows. Section 2 presents a detailed development of the semi-conjugate prior under the piecewise exponential model in the presence of MAR covariates. Section 3 sets up all necessary formulas for the survival models, priors, and posteriors in the context of variable subset selection and presents a novel version of DIC for survival data with missing covariates. Section 4 presents the computational algorithms for computing the DIC measures for all models in the model space. A detailed analysis of the BMT data is given in Sect. 5. We conclude the article with brief remarks in Sect. 6.

**2 The model, prior and posterior**

Let *y _{i}* denote the minimum of the censoring time and the survival time for the *i*th subject, and let ν_{i} = 1 if *y _{i}* is a failure time and 0 otherwise, for *i* = 1, 2, …, *n*.

We consider the Cox proportional piecewise exponential hazards model for [*y _{i}*|**x**_{i}, **β**, **λ**], where **x**_{i} is the *k* × 1 covariate vector for the *i*th subject. The survival function is given by

$$S({y}_{i}|{\mathit{x}}_{i},\mathit{\beta},\mathbf{\lambda})=\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\text{exp}\phantom{\rule{thinmathspace}{0ex}}({\mathit{x}}_{i}^{\prime}\mathit{\beta}){H}_{0}({y}_{i}|\mathbf{\lambda})\},$$

(2.1)

where *H*_{0}(*t*|**λ**) is the baseline cumulative hazard function. The piecewise exponential model is assumed for the baseline hazard function *h*_{0}(*t*). Specifically, we first partition the time axis into *J* intervals: (*s*_{0}, *s*_{1}], (*s*_{1}, *s*_{2}], …, (*s*_{J−1}, *s _{J}*], where 0 = *s*_{0} < *s*_{1} < ⋯ < *s _{J}*, with *s _{J}* ≥ *y _{i}* for all *i*, and assume a constant baseline hazard *h*_{0}(*y*) = λ_{j} for *y* ∈ (*s*_{j−1}, *s _{j}*]. The cumulative baseline hazard is then

$${H}_{0}(y|\mathbf{\lambda})={\lambda}_{j}(y-{s}_{j-1})+{\displaystyle \sum _{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})}$$

(2.2)

for *s*_{j−1} < *y* ≤ *s _{j}*, where **λ** = (λ_{1}, λ_{2}, …, λ_{J})′.

We write ${\mathit{x}}_{i}^{\prime}=({\mathit{x}}_{1i}^{\prime},{\mathit{x}}_{2i}^{\prime})\prime $, where **x**_{1i} is a *k*_{1} × 1 vector of covariates that are observed for all *n* observations, **x**_{2i} is a *k*_{2} × 1 vector of covariates that have at least one missing value in the *n* observations, and *k*_{1} + *k*_{2} = *k* with *k*_{1} ≥ 0 and *k*_{2} ≥ 1. Furthermore, we let **x**_{2i,mis} denote the vector of covariates that are missing for the *i*th case and let **x**_{2i,obs} be the vector of covariates that are observed for the *i*th case. Let *D* = {(*y _{i}*, ν_{i}, **x**_{1i}, **x**_{2i,mis}, **x**_{2i,obs}), *i* = 1, 2, …, *n*} denote the complete data. Then the complete data likelihood is given by

$$\begin{array}{cc}L(\mathit{\beta},\mathbf{\lambda}|D)=\hfill & {\displaystyle \prod _{i=1}^{n}{\displaystyle \prod _{j=1}^{J}{\{{\lambda}_{j}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{\delta}_{ij}{\nu}_{i}}}}\hfill \\ \hfill & \times \phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\left[-{\delta}_{ij}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\phantom{\rule{thinmathspace}{0ex}}\left\{{\lambda}_{j}({y}_{i}-{s}_{j-1})+{\displaystyle \sum _{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})}\right\}\right]\phantom{\rule{thinmathspace}{0ex}},\hfill \end{array}$$

(2.3)

where δ_{ij} = 1 if the *i*th subject failed or was right censored in the *j*th interval (*s*_{j−1}, *s _{j}*], and 0 otherwise.
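The cumulative hazard (2.2) and the complete data likelihood (2.3) are straightforward to evaluate numerically. A minimal NumPy sketch (the helper names are ours, not the authors' code): the interval index *j* with *s*_{j−1} < *y* ≤ *s _{j}* plays the role of δ_{ij}, and the sum over *g* ≤ *j* − 1 accumulates the hazard mass of the completed intervals.

```python
import numpy as np

def cum_baseline_hazard(y, s, lam):
    """Cumulative baseline hazard H_0(y | lambda) of Eq. (2.2).

    s   : interval endpoints (s_0, s_1, ..., s_J), with s_0 = 0
    lam : piecewise-constant hazards (lambda_1, ..., lambda_J)
    """
    j = np.searchsorted(s, y, side="left")          # y in (s_{j-1}, s_j]
    j = np.clip(j, 1, len(lam))
    mass = np.cumsum(lam * np.diff(s))              # sum_{g<=j} lambda_g (s_g - s_{g-1})
    head = np.concatenate(([0.0], mass))[j - 1]     # contribution of intervals g <= j-1
    return lam[j - 1] * (y - s[j - 1]) + head

def loglik(beta, lam, y, nu, X, s):
    """Complete-data log-likelihood of Eq. (2.3): each subject contributes
    nu_i * (log lambda_{j(i)} + x_i' beta) - exp(x_i' beta) H_0(y_i | lambda)."""
    eta = X @ beta
    j = np.clip(np.searchsorted(s, y, side="left"), 1, len(lam))
    return np.sum(nu * (np.log(lam[j - 1]) + eta)
                  - np.exp(eta) * cum_baseline_hazard(y, s, lam))

# Example: two intervals (0,1], (1,2] with hazards 2 and 3
# cum_baseline_hazard(np.array([1.5]), np.array([0., 1., 2.]), np.array([2., 3.]))
# → array([3.5])   (= 3*(1.5-1) + 2*(1-0))
```

With `side="left"`, `searchsorted` assigns a time equal to an endpoint *s _{j}* to the half-open interval (*s*_{j−1}, *s _{j}*], matching the partition in (2.2).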

We first specify a prior distribution for (**β, λ**). To this end, we extend the conjugate prior proposed by Chen and Ibrahim (2003) for the generalized linear model (GLM) to the piecewise exponential model in (2.1). Let *X* denote the *n* × *k* matrix with its *i*th row equal to ${\mathit{x}}_{i}^{\prime}$. Given *X*, we propose a semi-conjugate prior as follows:

$$\begin{array}{c}\pi (\mathit{\beta},\mathbf{\lambda}|{\mathit{y}}_{0},X,{a}_{0})\propto ({\displaystyle \prod _{i=1}^{n}{\displaystyle \prod _{j=1}^{J}{\{{\lambda}_{j}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}{\delta}_{0ij}}}}\hfill \\ \hspace{1em}\times \phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\left[-{a}_{0}{\delta}_{0ij}\phantom{\rule{thinmathspace}{0ex}}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\phantom{\rule{thinmathspace}{0ex}}\left\{{\lambda}_{j}({y}_{0i}-{s}_{j-1})+{\displaystyle \sum _{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})}\right\}\right]\phantom{\rule{thinmathspace}{0ex}})\phantom{\rule{thinmathspace}{0ex}}{\pi}_{0}\left(\mathbf{\lambda}\right),\hfill \end{array}$$

(2.4)

where *a*_{0} > 0 is a scalar prior parameter, *y*_{0} = (*y*_{01}, …, *y*_{0n})′ is an *n* × 1 vector of prior parameters, δ_{0ij} = 1 if *s*_{j−1} < *y*_{0i} ≤ *s _{j}* and 0 otherwise, and π_{0}(**λ**) is an initial prior for **λ**.

The prior (2.4) is called semi-conjugate since, by ignoring π_{0}(**λ**), the prior has a form identical to the complete data likelihood given in (2.3). As discussed in Chen and Ibrahim (2003), *y*_{0i} can be viewed as a prior prediction for the marginal mean of *y _{i}*. Since elicitation is simplest with a common prior prediction, we take *y*_{01} = ⋯ = *y*_{0n} = *y*_{0} with *y*_{0} ∈ (*s*_{0}, *s*_{1}], so that δ_{0i1} = 1 and δ_{0ij} = 0 for *j* ≥ 2, and (2.4) reduces to

$$\pi (\mathit{\beta},\mathbf{\lambda}|{y}_{0},X,{a}_{0})\propto {\displaystyle \prod _{i=1}^{n}{\{{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}\{-{a}_{0}{y}_{0}{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}{\pi}_{0}(\mathbf{\lambda}).}$$

(2.5)

We further specify π_{0}(**λ**) as follows

$${\pi}_{0}(\mathbf{\lambda})\propto \frac{1}{{\lambda}_{1}}{\displaystyle \prod _{j=2}^{J}{\lambda}_{j}^{{b}_{1}-1}\text{exp}(-{b}_{2}{\lambda}_{j}),}$$

(2.6)

where *b*_{1} > 0 and *b*_{2} > 0. Note that in (2.5), we assume an improper uniform initial prior for **β** and an improper Jeffreys-type initial prior for λ_{1}. Thus, π_{0}(**λ**) introduced in (2.4) and further specified in (2.6) is an improper prior. However, under some mild conditions, the prior (2.5) is proper and (log λ_{1}, **β**) has a prior mode of (−log *y*_{0}, 0, …, 0)′. We formally characterize these properties in the following theorem.
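The stated mode of (2.5) can be verified numerically. A small finite-difference check under simulated covariates (the covariate matrix, *a*_{0}, and *y*_{0} values below are illustrative only): in the (log λ_{1}, **β**) parametrization the Jeffreys factor 1/λ_{1} is absorbed by the change of variables, and with at least *k* + 1 linearly independent complete rows including an intercept, the gradient of the negative log prior should vanish at (−log *y*_{0}, 0, …, 0)′.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3                      # n >= k + 1 complete rows, as Theorem 2.1 requires
a0, y0 = 0.5, 2.0                 # illustrative scalar prior parameters
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept included

def neg_log_prior(theta):
    """Negative log of the semi-conjugate prior (2.5), parametrized in
    (log lambda_1, beta); the Jeffreys factor 1/lambda_1 cancels against
    the Jacobian of the log transform."""
    phi, beta = theta[0], theta[1:]
    eta = phi + X @ beta          # log lambda_1 + x_i' beta
    return -np.sum(a0 * eta - a0 * y0 * np.exp(eta))

claimed_mode = np.concatenate(([-np.log(y0)], np.zeros(k)))
eps = 1e-6
grad = np.array([(neg_log_prior(claimed_mode + eps * e)
                  - neg_log_prior(claimed_mode - eps * e)) / (2 * eps)
                 for e in np.eye(k + 1)])
print(np.max(np.abs(grad)))       # effectively zero: a stationary point
```

Since the Hessian of the negative log prior is positive definite under the rank condition, the stationary point is indeed the prior mode.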

**Theorem 2.1** *Let X _{obs} denote a submatrix of X with rows consisting of those completely observed x_{i}’s and* ${\mathit{x}}_{\mathit{\text{mis}}}=({\mathit{x}}_{2i,\mathit{\text{mis}}}^{\prime},i=1,2,\dots ,n)\prime $.

The proof of Theorem 2.1 is given in Appendix A. We note that the conditions of Theorem 2.1 require at least *k* + 1 complete observations with linearly independent covariate vectors including an intercept. From Theorem 2.1, we see that when *y*_{01} = ⋯ = *y*_{0n} = *y*_{0}, the prior mode of **β** is **0** and with this prior prediction for the *y _{i}*, both

Next, we specify the distribution for the missing covariates. Since we are primarily interested in inferences about **β**, we only need to model *x*_{2i} since **x**_{1i} is observed for all *n* observations. Therefore, we model *x*_{2i} conditioning on the completely observed covariates **x**_{1i} throughout. Using a sequence of one-dimensional conditional distributions proposed by Lipsitz and Ibrahim (1996) and Ibrahim et al. (1999b), we specify the distribution of the *k*_{2}-dimensional covariate vector *x*_{2i} = (*x*_{2i1}, *x*_{2i2}, …, *x*_{2ik2})′ as

$$\begin{array}{cc}f({\mathit{x}}_{2i}|{\mathit{x}}_{1i},\mathit{\alpha})=\hfill & f({x}_{2i1}|{\mathit{x}}_{i1},{\mathit{\alpha}}_{1})f({x}_{2i2}|{x}_{2i1},{\mathit{x}}_{i1},{\mathit{\alpha}}_{2})\dots \hfill \\ \hfill & f({x}_{2i{k}_{2}}|{x}_{2i,{k}_{2}-1},\dots ,{x}_{2i1},{\mathit{x}}_{i1},{\mathit{\alpha}}_{{k}_{2}}),\hfill \end{array}$$

(2.7)

where **α**_{l} is a vector of parameters for the *l*th conditional distribution, the **α**_{l}’s are distinct, and moreover, $\mathit{\alpha}=({\mathit{\alpha}}_{1}^{\prime},{\mathit{\alpha}}_{2}^{\prime},\dots ,{\mathit{\alpha}}_{{k}_{2}}^{\prime})\prime $. To complete the prior specification, we take independent priors for **α**_{1}, …, ${\mathit{\alpha}}_{{k}_{2}}$ so that

$$\pi (\mathit{\alpha})={\displaystyle \prod _{l=1}^{{k}_{2}}\pi ({\mathit{\alpha}}_{l}).}$$

(2.8)
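The factorization (2.7) leaves the form of each one-dimensional conditional unspecified. As an illustration only (this is not the BMT specification, and the function and parameter layout are hypothetical), a sketch in which every partially observed covariate is binary and each conditional is a logistic regression on the fully observed covariates and the preceding components of **x**_{2i}:

```python
import numpy as np

def log_sigmoid(z):
    # numerically stable log(1 / (1 + exp(-z)))
    return -np.logaddexp(0.0, -z)

def log_f_x2(x2, x1, alphas):
    """Log of the sequential factorization (2.7), assuming each x_{2il}
    is binary with a logistic regression on (1, x_1i, x_{2i1}, ..., x_{2i,l-1}).
    alphas[l] holds the coefficient vector alpha_{l+1} of that conditional."""
    total = 0.0
    for l, a in enumerate(alphas):
        z = np.concatenate(([1.0], x1, x2[:l])) @ a   # linear predictor
        total += log_sigmoid(z) if x2[l] == 1 else log_sigmoid(-z)
    return total
```

Any mix of conditional families (logistic, normal, Poisson, …) fits the same telescoping structure, as long as each **α**_{l} remains distinct.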

Let ${\mathit{x}}_{\mathit{\text{obs}}}=(({\mathit{x}}_{1i}^{\prime},{\mathit{x}}_{2i,\mathit{\text{obs}}}^{\prime}),i=1,2,\dots ,n)\prime $. Using (2.5)–(2.8), the joint prior for **β, λ, x**_{mis}, and **α** is given by

$$\begin{array}{cc}\pi (\mathit{\beta},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},\mathit{\alpha}|{y}_{0},{a}_{0},{\mathit{x}}_{\mathit{\text{obs}}})\propto \hfill & \left[{\displaystyle \prod _{i=1}^{n}{\{{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}\{-{a}_{0}{y}_{0}{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}\right]\hfill \\ \hfill & \hspace{1em}\hspace{1em}\times \phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{n}f({\mathit{x}}_{2i}|{\mathit{x}}_{1i},\mathit{\alpha})}\right]\phantom{\rule{thinmathspace}{0ex}}{\pi}_{0}(\mathbf{\lambda})\pi (\mathit{\alpha}).\hfill \end{array}$$

(2.9)

Let *D _{obs}* = (**y**, **ν**, **x**_{obs}) denote the observed data, where **y** = (*y*_{1}, …, *y _{n}*)′ and **ν** = (ν_{1}, …, ν_{n})′. The joint posterior distribution of (**β, λ, x**_{mis}, **α**) is then given by

$$\pi (\mathit{\beta},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},\mathit{\alpha}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})\propto L(\mathit{\beta},\mathbf{\lambda}|D)\pi (\mathit{\beta},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},\mathit{\alpha}|{y}_{0},{a}_{0},{\mathit{x}}_{\mathit{\text{obs}}}),$$

(2.10)

where *L*(**β, λ**|*D*) and π(**β, λ, x**_{mis}, **α**|*y*_{0}, *a*_{0}, **x**_{obs}) are given by (2.3) and (2.9), respectively.

**3 Bayesian variable subset selection**

Let $\mathcal{M}$ denote the model space. We enumerate the models in $\mathcal{M}$ by *m* = 1, 2, …, $\mathcal{K}$, where $\mathcal{K}$ is the dimension of $\mathcal{M}$ and model $\mathcal{K}$ denotes the full model. Also, let ${\mathit{\beta}}^{(\mathcal{K})}$ = (β_{1}, β_{2}, …, β_{k})′ denote the regression coefficients for the full model including an intercept, and let ${\mathit{x}}_{i}^{(m)}$ and **β**^{(m)} denote the *k _{m}* × 1 vectors of covariates and regression coefficients for model *m*.

Under model *m*, let ${D}_{m}=\{({y}_{i},{\nu}_{i},{\mathit{x}}_{1i}^{(m)},{\mathit{x}}_{2i,\mathit{\text{mis}}}^{(m)},{\mathit{x}}_{2i,\mathit{\text{obs}}}^{(m)}),i=1,2,\dots ,n\}$ denote the complete data and then the complete data likelihood function is given by

$$\begin{array}{cc}L({\mathit{\beta}}^{(m)},\mathbf{\lambda}|{D}_{m})=\hfill & {\displaystyle \prod _{i=1}^{n}{\displaystyle \prod _{j=1}^{J}{\left\{{\lambda}_{j}\phantom{\rule{thinmathspace}{0ex}}\text{exp}(({\mathit{x}}_{i}^{\left(m\right)})\prime {\mathit{\beta}}^{(m)})\right\}}^{{\delta}_{ij}{\nu}_{i}}}}\hfill \\ \hfill & \times \phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}[-{\delta}_{ij}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\left\{({\mathit{x}}_{i}^{(m)})\prime {\mathit{\beta}}^{(m)}\right\}\phantom{\rule{thinmathspace}{0ex}}\{{\lambda}_{j}({y}_{i}-{s}_{j-1})\hfill \\ \hfill & \hspace{1em}\hspace{1em}+{\displaystyle \sum _{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})}\}]\phantom{\rule{thinmathspace}{0ex}},\hfill \end{array}$$

(3.1)

where δ_{ij} is defined in (2.3). Using exactly the same order of the sequence of one-dimensional conditional distributions for the covariates **x**_{2i} in (2.7) by deleting ${\mathit{x}}_{2i}^{(-m)}$, we specify the distribution of the *k*_{2m}-dimensional covariate vector ${\mathit{x}}_{2i}^{(m)}=({x}_{2i1}^{(m)},{x}_{2i2}^{(m)},\dots ,{x}_{2i{k}_{2m}}^{(m)})\prime $ as

$$\begin{array}{cc}f({\mathit{x}}_{2i}^{(m)}|{\mathit{x}}_{1i}^{(m)},{\mathit{\alpha}}^{(m)})=\hfill & f({x}_{2i1}^{(m)}|{\mathit{x}}_{i1}^{(m)},{\mathit{\alpha}}_{1}^{(m)})f({x}_{2i2}^{(m)}|{x}_{2i1}^{(m)},{\mathit{x}}_{i1}^{(m)},{\mathit{\alpha}}_{2}^{(m)})\hfill \\ \hfill & \times \cdots \times f({x}_{2i{k}_{2m}}^{(m)}|{x}_{2i,{k}_{2m}-1}^{(m)},\dots ,{x}_{2i1}^{(m)},{\mathit{x}}_{i1}^{(m)},{\mathit{\alpha}}_{{k}_{2m}}^{(m)}),\hfill \end{array}$$

(3.2)

where ${\mathit{\alpha}}^{(m)}=({\mathit{\alpha}}_{1}^{(m)\prime},{\mathit{\alpha}}_{2}^{(m)\prime},\dots ,{\mathit{\alpha}}_{{k}_{2m}}^{(m)\prime})\prime $. It is important to note that in (3.2), **α**^{(m)} is a subvector of **α** in (2.7). We further write **α** = (**α**^{(m)′}, **α**^{(−m)′})′ where **α**^{(−m)} is **α** with **α**^{(m)} deleted. Similar to (2.8), the prior for **α**^{(m)} is specified as $\pi ({\mathit{\alpha}}^{(m)})={\displaystyle {\prod}_{l=1}^{{k}_{2m}}\pi ({\mathit{\alpha}}_{l}^{(m)}})$.

Let ${\mathit{x}}_{\mathit{\text{obs}}}^{(m)}=(({{\mathit{x}}_{1i}^{(m)}}^{\prime},{{\mathit{x}}_{2i,\mathit{\text{obs}}}^{(m)}}^{\prime}),i=1,2,\dots ,n)\prime \phantom{\rule{thinmathspace}{0ex}}\text{and}\phantom{\rule{thinmathspace}{0ex}}{\mathit{x}}_{\mathit{\text{mis}}}^{(m)}=({{\mathit{x}}_{2i,\mathit{\text{mis}}}^{(m)}}^{\prime},i=1,2,\dots ,n)\prime $. By applying the semi-conjugate prior (2.5) to model *m*, we have the joint prior for **β**^{(m)}, **λ**, ${\mathit{x}}_{\mathit{\text{mis}}}^{(m)}$, and **α**^{(m)} given by

$$\begin{array}{c}\pi \phantom{\rule{thinmathspace}{0ex}}\left({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)}|{y}_{0},{a}_{0},{\mathit{x}}_{\mathit{\text{obs}}}^{(m)}\right)\hfill \\ \propto \phantom{\rule{thinmathspace}{0ex}}\left({\displaystyle \prod _{i=1}^{n}{[{\lambda}_{1}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{{\mathit{x}}_{i}^{(m)\prime}{\mathit{\beta}}^{(m)}\}]}^{{a}_{0}}\phantom{\rule{thinmathspace}{0ex}}\text{exp}[-{a}_{0}{y}_{0}{\lambda}_{1}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{{\mathit{x}}_{i}^{(m)\prime}{\mathit{\beta}}^{(m)}\}]}\right)\hfill \\ \times \phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{n}\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}\left({\mathit{x}}_{2i}^{(m)}|{\mathit{x}}_{1i}^{(m)},{\mathit{\alpha}}^{(m)}\right)}\right]\phantom{\rule{thinmathspace}{0ex}}{\pi}_{0}(\mathbf{\lambda})\pi ({\mathit{\alpha}}^{(m)}),\hfill \end{array}$$

(3.3)

where π_{0}(**λ**) and $f\phantom{\rule{thinmathspace}{0ex}}({\mathit{x}}_{2i,\mathit{\text{mis}}}^{(m)},{\mathit{x}}_{2i,\mathit{\text{obs}}}^{(m)}|{\mathit{x}}_{1i}^{(m)},{\mathit{\alpha}}^{(m)})$ are defined by (2.6) and (3.2), respectively. Note that all models in the model space share the same prior for **λ**. Let ${D}_{m,\mathit{\text{obs}}}=(\mathit{y},\mathit{\nu},{\mathit{x}}_{\mathit{\text{obs}}}^{(m)})$ denote the completely observed data. Under model *m*, the joint posterior distribution is given by

$$\begin{array}{c}\pi ({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})\propto L({\mathit{\beta}}^{(m)},\mathbf{\lambda}|{D}_{m})\hfill \\ \times \phantom{\rule{thinmathspace}{0ex}}\pi ({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)}|{y}_{0},{a}_{0},{\mathit{x}}_{\mathit{\text{obs}}}^{(m)}),\hfill \end{array}$$

(3.4)

where *L*(**β**^{(m)}, **λ**|*D _{m}*) and $\pi ({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)}|{y}_{0},{a}_{0},{\mathit{x}}_{\mathit{\text{obs}}}^{(m)})$ are given by (3.1) and (3.3), respectively.

We carry out Bayesian variable selection via DIC, originally proposed by Spiegelhalter et al. (2002). The use of DIC for missing data models has been discussed in detail in Celeux et al. (2006). Let ${\mathit{\theta}}^{(m)}=({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)})$. DIC is defined as follows:

$${\text{DIC}}_{m}={\text{Dev}}_{m}({\overline{\mathit{\theta}}}^{(m)})+2{p}_{m},$$

(3.5)

where Dev_{m}(**θ**^{(m)}) is a deviance function and ${\overline{\mathit{\theta}}}^{(m)}$ is the posterior mean of **θ**^{(m)}. In (3.5), *p _{m}* is the effective number of model parameters, which is calculated as

$${p}_{m}={\overline{\text{Dev}}}_{m}({\mathit{\theta}}^{(m)})-{\text{Dev}}_{m}({\overline{\mathit{\theta}}}^{(m)}),$$

(3.6)

where

$${\overline{\text{Dev}}}_{m}({\mathit{\theta}}^{(m)})=E[{\text{Dev}}_{m}({\mathit{\theta}}^{(m)})|{D}_{m,\mathit{\text{obs}}}]$$

(3.7)

and the expectation is taken with respect to the posterior distribution given in (3.4). Since we are primarily interested in inferences about the survival model, we define the deviance function, Dev_{m}(**θ**^{(m)}) in (3.5) as follows:

$${\text{Dev}}_{m}({\mathit{\theta}}^{(m)})=-2\text{log}L({\mathit{\beta}}^{(m)},\mathbf{\lambda}|{D}_{m}),$$

where *L*(**β**^{(m)}, **λ**|*D _{m}*) is given by (3.1). Following Huang et al. (2005), we compute ${\text{Dev}}_{m}({\overline{\mathit{\theta}}}^{(m)})$ as

$$\begin{array}{cc}{\text{Dev}}_{m}({\overline{\mathit{\theta}}}^{(m)})=-2{\displaystyle \sum _{i=1}^{n}}{\displaystyle \sum _{j=1}^{J}}\hfill & ({\delta}_{ij}{\nu}_{i}[\text{log}E[{\lambda}_{j}|{D}_{m,\mathit{\text{obs}}}]+E[({\mathit{x}}_{i}^{(m)})\prime {\mathit{\beta}}^{(m)}|{D}_{m,\mathit{\text{obs}}}]\hfill \\ \hfill & -{\delta}_{ij}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{E[({\mathit{x}}_{i}^{(m)})\prime {\mathit{\beta}}^{(m)}|{D}_{m,\mathit{\text{obs}}}]\}\{E[{\lambda}_{j}|{D}_{m,\mathit{\text{obs}}}]\hfill \\ \hfill & \times \phantom{\rule{thinmathspace}{0ex}}({y}_{i}-{s}_{j-1})+{\displaystyle \sum _{g=1}^{j-1}E[{\lambda}_{g}|{D}_{m,\mathit{\text{obs}}}]({s}_{g}-{s}_{g-1})}\left\}\right),\hfill \end{array}$$

(3.8)

where all expectations are taken with respect to the posterior distribution in (3.4). In (3.8), instead of computing $(E[{\mathit{x}}_{i}^{(m)}|{D}_{m,\mathit{\text{obs}}}])\prime E[\mathit{\beta}|{D}_{m,\mathit{\text{obs}}}],$ we compute $E[({\mathit{x}}_{i}^{(m)})\prime {\mathit{\beta}}^{(m)}|{D}_{m,\mathit{\text{obs}}}]$ in the presence of missing covariates, which yields a more appropriate dimensional penalty term *p _{m}*.

The DIC defined above is a Bayesian measure of predictive model performance, which decomposes into a measure of fit and a measure of model complexity (*p _{m}*). The smaller the value of DIC, the better the model will predict new observations generated in the same way as the data. As discussed and shown in Chen et al. (2008), the performance of DIC is similar to that of AIC. Moreover, the DIC defined in (3.5) has a nice computational property for Bayesian variable selection, which will be discussed in detail in the next section.
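A minimal sketch of how (3.5)–(3.8) combine, written for the *J* = 1 (exponential) special case (the function name and interface are our illustration, not the authors' code). Note that the per-draw values of ${\mathit{x}}_{i}^{\prime}\mathit{\beta}$ are averaged, with each draw carrying its own imputed covariates, so the plug-in uses $E[({\mathit{x}}_{i}^{(m)})\prime {\mathit{\beta}}^{(m)}|{D}_{m,\mathit{\text{obs}}}]$ rather than the product of means, as required by (3.8):

```python
import numpy as np

def dic_from_draws(lam_draws, eta_draws, y, nu):
    """DIC of Eq. (3.5) for the J = 1 special case.

    lam_draws : (Q,) posterior draws of lambda_1
    eta_draws : (Q, n) posterior draws of x_i' beta; each row pairs the
                sampled beta with that iteration's imputed covariates.
    """
    def dev(lam, eta):
        # -2 log L for the exponential model: H_0(y) = lambda * y
        return -2.0 * np.sum(nu * (np.log(lam) + eta) - np.exp(eta) * lam * y)

    dev_bar = np.mean([dev(l, e) for l, e in zip(lam_draws, eta_draws)])   # (3.7)
    dev_at_mean = dev(np.mean(lam_draws), np.mean(eta_draws, axis=0))      # (3.8)
    p_m = dev_bar - dev_at_mean            # effective number of parameters (3.6)
    return dev_at_mean + 2.0 * p_m, p_m    # DIC (3.5)
```

For *J* > 1, `dev` would be replaced by the full piecewise exponential deviance built from (3.1), with the λ_{j} draws averaged componentwise.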

**4 Computation of DIC measures**

To carry out Bayesian variable selection, we need to compute DIC_{m} in (3.5) for *m* = 1, 2, …, $\mathcal{K}$. Due to the complexity of the survival model in (3.1), analytical evaluation of DIC_{m} does not appear possible. Thus, a Monte Carlo (MC) method is needed to compute all DIC_{m}’s in the model space. To this end, we propose two approaches for computing the DIC_{m}’s. The first approach, called “the direct sampling method”, is based on direct Monte Carlo samples from each model in the model space. The second approach is “the single MC sample method,” which was proposed by Chen et al. (2008). The latter method requires only one Markov chain Monte Carlo (MCMC) sample from the posterior distribution under the full model and computes the Bayesian criterion simultaneously for all possible subset models in the model space. From (3.7) and (3.8), we observe that for DIC_{m} in (3.5), we need to compute the following quantities: (i) *E*[Dev_{m}(**θ**^{(m)})|*D*_{m,obs}]; (ii) $E[({\mathit{x}}_{i}^{(m)})\prime {\mathit{\beta}}^{(m)}|{D}_{m,\mathit{\text{obs}}}]$; and (iii) *E*[λ_{j}|*D _{m,obs}*] for *j* = 1, 2, …, *J*. All of these quantities take the form

$${g}_{m}=E[g({\mathit{\theta}}^{(m)})|{D}_{m,\mathit{\text{obs}}}],$$

(4.1)

for various functions *g*, where ${\mathit{\theta}}^{(m)}=({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)})$ and the expectation is taken with respect to the joint posterior distribution in (3.4) under model *m*.

First, we discuss the direct sampling method. Using the Gibbs sampling algorithm given in Appendix B, we generate a Monte Carlo sample $\{{\mathit{\theta}}_{q}^{(m)},q=1,2,\dots ,Q\}$ from the joint posterior distribution in (3.4) under model *m*. Then, a Monte Carlo estimate of *g _{m}* is given by

$${\widehat{g}}_{m}=\frac{1}{Q}{\displaystyle \sum _{q=1}^{Q}g({\mathit{\theta}}_{q}^{(m)})}$$

(4.2)

for all *g*’s listed in (i)–(iii). Then, plugging the various *ĝ _{m}*’s into (3.5) gives a Monte Carlo estimate of DIC_{m}. However, this approach requires a separate MCMC sample for each model in the model space, which becomes expensive when the model space is large.

Next, we discuss the single MC sample method. Using the notation given in Sect. 3, we write $\mathit{\gamma}=(\mathit{\beta},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},\mathit{\alpha}),{\mathit{\gamma}}^{(m)}=({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)})$, and ${\mathit{\gamma}}^{(-m)}=({\mathit{\beta}}^{(-m)},{\mathit{x}}_{\mathit{\text{mis}}}^{(-m)},{\mathit{\alpha}}^{(-m)})$, where ${\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}$ is **x**_{mis} with ${\mathit{x}}_{\mathit{\text{mis}}}^{(m)}$ deleted and **β**^{(−m)} is **β** with **β**^{(m)} deleted. Define

$${C}_{m}={\displaystyle \int {\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})d{\mathit{\gamma}}^{(m)},}$$

(4.3)

where

$$\begin{array}{c}{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})\hfill \\ =L({\mathit{\beta}}^{(m)},\mathbf{\lambda}|{D}_{m})({\displaystyle \prod _{i=1}^{n}}{[{\lambda}_{1}\text{exp}\{{\mathit{x}}_{i}^{(m)\prime}{\mathit{\beta}}^{(m)}\}]}^{{a}_{0}}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}[-{a}_{0}{y}_{0}{\lambda}_{1}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{{\mathit{x}}_{i}^{(m)\prime}{\mathit{\beta}}^{(m)}\}])\hfill \\ \hspace{1em}\times \phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{n}f({\mathit{x}}_{2i}^{(m)}|{\mathit{x}}_{1i}^{(m)},{\mathit{\alpha}}^{(m)})}\right]\phantom{\rule{thinmathspace}{0ex}}{\pi}_{0}^{*}(\mathbf{\lambda})\pi ({\mathit{\alpha}}^{(m)}),\hfill \end{array}$$

(4.4)

${\pi}_{0}^{*}(\mathbf{\lambda})=\frac{1}{{\lambda}_{1}}{\displaystyle {\prod}_{j=2}^{J}}{\lambda}_{j}^{{b}_{1}-1}\phantom{\rule{thinmathspace}{0ex}}\text{exp}({-b}_{2}{\lambda}_{j})$, and *L*(**β**^{(m)}, **λ**|*D _{m}*), π(**α**^{(m)}), and $f({\mathit{x}}_{2i}^{(m)}|{\mathit{x}}_{1i}^{(m)},{\mathit{\alpha}}^{(m)})$ are as defined in Sect. 3. Then *g _{m}* in (4.1) can be rewritten as

$${g}_{m}=E[g({\mathit{\theta}}^{(m)})|{D}_{m,\mathit{\text{obs}}}]={\displaystyle \int g({\mathit{\theta}}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{C}_{m}}d{\mathit{\gamma}}^{(m)},}$$

(4.5)

where *C _{m}* is defined in (4.3).

For any given function *g* such that *E*[|*g*(**θ**^{(m)})||*D _{m,obs}*] < ∞, we have the following identity

$${g}_{m}=\frac{{C}_{\mathcal{K}}}{{C}_{m}}E\phantom{\rule{thinmathspace}{0ex}}[g({\mathit{\theta}}^{(m)})w({\mathit{\gamma}}^{(-m)}|{\mathit{\gamma}}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}(\mathit{\gamma}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}\left|{D}_{\mathit{\text{obs}}}\right]\phantom{\rule{thinmathspace}{0ex}},$$

(4.6)

where ${C}_{\mathcal{K}}$ is the marginal likelihood given in (4.3) under the full model, π*(**γ**|*y*_{0}, *a*_{0}, *D _{obs}*) is given in (4.4) corresponding to the full model, which is essentially the kernel of the joint posterior distribution in (2.10), and the expectation is taken with respect to the joint posterior distribution in (2.10) under the full model. In (4.6), *w*(**γ**^{(−m)}|**γ**^{(m)}) is a completely known conditional density of **γ**^{(−m)} given **γ**^{(m)}.

Observe that as a special case of (4.1), we have *g _{m}* = 1 when *g*(**θ**^{(m)}) ≡ 1, and in this case (4.6) yields

$$\frac{{C}_{m}}{{C}_{\mathcal{K}}}=E\phantom{\rule{thinmathspace}{0ex}}[w({\mathit{\gamma}}^{(-m)}|{\mathit{\gamma}}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}(\mathit{\gamma}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}\left|{D}_{\mathit{\text{obs}}}\right]\phantom{\rule{thinmathspace}{0ex}}.$$

(4.7)

Using (4.6) and (4.7), we have

$${g}_{m}=\frac{E\phantom{\rule{thinmathspace}{0ex}}[g({\mathit{\theta}}^{(m)})w({\mathit{\gamma}}^{(-m)}|{\mathit{\gamma}}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}(\mathit{\gamma}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}\left|{D}_{\mathit{\text{obs}}}\right]}{E\phantom{\rule{thinmathspace}{0ex}}[w({\mathit{\gamma}}^{(-m)}|{\mathit{\gamma}}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}(\mathit{\gamma}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}\left|{D}_{\mathit{\text{obs}}}\right]}\phantom{\rule{thinmathspace}{0ex}}.$$

(4.8)

We note that as the dimension of **λ** does not change across all models, π*(**λ**) cancels out in the ratio $\frac{{\pi}^{*}({\mathit{\gamma}}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}(\mathit{\gamma}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}$.

Let {**γ**_{q} = (**β**_{q}, **λ**_{q}, * x_{mis,q}*, **α**_{q}), *q* = 1, 2, …, *Q*} denote a Gibbs sample from the joint posterior distribution in (2.10) under the full model. Then, based on (4.8), a Monte Carlo estimate of *g _{m}* is given by

$${\widehat{g}}_{m}=\frac{{\sum}_{q=1}^{Q}g({\mathit{\theta}}_{q}^{(m)})w({\mathit{\gamma}}_{q}^{(-m)}|{\mathit{\gamma}}_{q}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}_{q}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}({\mathit{\gamma}}_{q}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}}{{\sum}_{q=1}^{Q}w({\mathit{\gamma}}_{q}^{(-m)}|{\mathit{\gamma}}_{q}^{(m)})\frac{{\pi}^{*}({\mathit{\gamma}}_{q}^{(m)}|{y}_{0},{a}_{0},{D}_{m,\mathit{\text{obs}}})}{{\pi}^{*}({\mathit{\gamma}}_{q}|{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})}}.$$

(4.9)

Under certain regularity conditions, such as ergodicity, we have

$$\underset{Q\to \infty}{\text{lim}}{\widehat{g}}_{m}={g}_{m},$$

implying that *ĝ _{m}* is a consistent estimator of *g _{m}*.
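To make the ratio form of the estimator in (4.9) concrete, here is a small self-contained sketch in Python. The two-dimensional Gaussian posterior, the reduced-model kernel, and the weight density are all hypothetical stand-ins for the actual unnormalized kernels; the point is only that both sums share the same weights, so unknown normalizing constants cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 50_000

# Stand-in "Gibbs sample" from the full-model posterior: gamma = (gamma1, gamma2)
# with independent standard normal components (a toy full model).
gamma1 = rng.standard_normal(Q)
gamma2 = rng.standard_normal(Q)

# Unnormalized posterior kernels: pi*(gamma | D_obs) for the full model and
# pi*(gamma^(m) | D_m,obs) for a hypothetical reduced model m (a N(0.5, 1) kernel).
log_full = -0.5 * (gamma1**2 + gamma2**2)
log_reduced = -0.5 * (gamma1 - 0.5) ** 2

# Weight density w(gamma^(-m) | gamma^(m)): here a N(0, 1) density in gamma2.
log_w = -0.5 * gamma2**2 - 0.5 * np.log(2 * np.pi)

# Self-normalized ratio estimator (4.9)
log_r = log_w + log_reduced - log_full
r = np.exp(log_r - log_r.max())        # stabilize before exponentiating
g = gamma1**2                          # g(theta^(m)) whose posterior mean we want
g_hat = np.sum(g * r) / np.sum(r)      # estimates E[gamma1^2] = 1.25 under model m
```

The same pattern applies with the actual kernels π*(**γ**^{(m)}|*y*_{0}, *a*_{0}, *D _{m,obs}*) and π*(**γ**|*y*_{0}, *a*_{0}, *D _{obs}*) in place of the toy Gaussians.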

As shown in Chen et al. (2008), the optimal choice of *w*(**γ**^{(−m)} | **γ**^{(m)}) is the conditional posterior distribution of **γ**^{(−m)} given **γ**^{(m)} under the full model, in the sense that *ĝ _{m}* achieves the minimum asymptotic variance. However, the optimal choice is generally not available in closed form. Following Chen et al. (2008), we consider a weight function of the form

$$w({\mathit{\gamma}}^{(-m)}|{\mathit{\gamma}}^{(m)})=w({\mathit{\beta}}^{(-m)}|{\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}})w({\mathit{\alpha}}^{(-m)}|{\mathit{\alpha}}^{(m)},{\mathit{x}}_{\mathit{\text{mis}}})w({\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}|{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)}).$$

(4.10)

Note that in (4.10), when model *m* includes all missing covariates **x**_{2i}, we do not need to compute $w({\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}|{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)})$ as in this case, ${\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}$ is a null vector in the sense that it has zero dimension. In (4.10), a good *w*(**β**^{(−m)}|**β**^{(m)}, **λ**, * x_{mis}*), which is close to the optimal choice, can be constructed based on an asymptotic approximation to the joint conditional posterior distribution. Let

$${\widehat{\mathit{\beta}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}})=\underset{{\mathit{\beta}}^{(-m)}}{\text{arg max}}\phantom{\rule{thinmathspace}{0ex}}\text{log}\phantom{\rule{thinmathspace}{0ex}}{\pi}^{*}(\mathit{\beta}|\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}}),$$

(4.11)

where

$$\begin{array}{c}\text{log}\phantom{\rule{thinmathspace}{0ex}}{\pi}^{*}(\mathit{\beta}|\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},{y}_{0},{a}_{0},{D}_{\mathit{\text{obs}}})\hfill \\ ={\displaystyle \sum}_{i=1}^{n}{\displaystyle \sum}_{j=1}^{J}\phantom{\rule{thinmathspace}{0ex}}[{\delta}_{ij}{\nu}_{i}{\mathit{x}}_{i}^{\prime}\mathit{\beta}-{\delta}_{ij}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\phantom{\rule{thinmathspace}{0ex}}\{{\lambda}_{j}({y}_{i}-{s}_{j-1})+{\displaystyle \sum}_{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})\}]\hfill \\ \hspace{1em}+{\displaystyle \sum}_{i=1}^{n}[{a}_{0}{\mathit{x}}_{i}^{\prime}\mathit{\beta}-{a}_{0}{y}_{0}{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})]\hfill \end{array}$$

(4.12)

and then compute

$$\begin{array}{c}{\widehat{\mathrm{\Sigma}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{mis}})\hfill \\ ={[-\frac{{\partial}^{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}{\pi}^{*}(\mathit{\beta}|\mathbf{\lambda},{\mathit{x}}_{\mathit{mis}},{y}_{0},{a}_{0},{D}_{\mathit{obs}})}{\partial {\mathit{\beta}}^{(-m)}\partial {\mathit{\beta}}^{(-m)\prime}}{|}_{{\mathit{\beta}}^{(-m)}={\widehat{\mathit{\beta}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{mis}})}]}^{-1}.\hfill \end{array}$$

Thus, a good *w*(**β**^{(−m)} | **β**^{(m)}, **λ**, * x_{mis}*) can be constructed as follows:

$$\begin{array}{c}w({\mathit{\beta}}^{(-m)}|{\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}})\hfill \\ \text{}={(2\pi )}^{-\frac{k-{k}_{m}}{2}}{|{\widehat{\mathrm{\Sigma}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}})|}^{-\frac{1}{2}}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2}({\mathit{\beta}}^{(-m)}-{\widehat{\mathit{\beta}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}))\prime \hfill \\ \hspace{1em}\times \phantom{\rule{thinmathspace}{0ex}}{[{\widehat{\mathrm{\Sigma}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}})]}^{-1}({\mathit{\beta}}^{(-m)}-{\widehat{\mathit{\beta}}}^{(-m)}({\mathit{\beta}}^{(m)},\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}}))\},\hfill \end{array}$$

(4.13)

which approximates the joint conditional posterior π(**β**^{(−m)} | **β**^{(m)}, **λ**, * x_{mis}*, *y*_{0}, *a*_{0}, *D _{obs}*) under the full model. A good *w*(**α**^{(−m)} | **α**^{(m)}, * x_{mis}*) can be constructed in a similar fashion. For the weight function of the missing covariates, we take
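The mode-plus-curvature construction behind (4.13) can be sketched numerically. The quadratic log posterior below is a hypothetical stand-in for log π*(**β** | **λ**, * x_{mis}*, *y*_{0}, *a*_{0}, *D _{obs}*); the conditional mode is found by Newton's method on finite-difference derivatives, and the Gaussian weight density is built from the inverse negative curvature at the mode.

```python
import numpy as np

# Toy unnormalized log posterior for beta = (b1, b2): a correlated Gaussian
# standing in for log pi*(beta | lambda, x_mis, y0, a0, D_obs).
A = np.array([[2.0, 0.6], [0.6, 1.5]])   # precision (negative Hessian)

def log_post(b1, b2):
    b = np.array([b1, b2])
    return -0.5 * b @ A @ b

def gaussian_weight(b2, b1, eps=1e-5):
    """w(beta^(-m) | beta^(m)): a N(mode, Sigma) density built from the
    conditional mode and curvature of log_post in b2 at fixed b1, as in (4.13)."""
    m = 0.0
    for _ in range(50):                   # Newton iterations over b2, b1 fixed
        g = (log_post(b1, m + eps) - log_post(b1, m - eps)) / (2 * eps)
        h = (log_post(b1, m + eps) - 2 * log_post(b1, m)
             + log_post(b1, m - eps)) / eps**2
        m -= g / h
    sigma2 = -1.0 / h                     # inverse negative curvature
    dens = np.exp(-0.5 * (b2 - m) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return m, sigma2, dens

mode, var, dens = gaussian_weight(0.3, b1=1.0)
# Exact conditional for this toy target: mode = -A[1,0]/A[1,1] * b1, var = 1/A[1,1]
```

In the actual algorithm the inner maximization is over the full block **β**^{(−m)} and the scalar curvature becomes the Hessian matrix ${\widehat{\mathrm{\Sigma}}}^{(-m)}$.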

$$w({\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}|{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)})=\frac{1}{Q}{\displaystyle \sum _{q=1}^{Q}w({\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}|{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)},{\mathit{\alpha}}_{q}^{(-m)}),}$$

(4.14)

where

$$w({\mathit{x}}_{\mathit{\text{mis}}}^{(-m)}|{\mathit{x}}_{\mathit{\text{mis}}}^{(m)},{\mathit{\alpha}}^{(m)},{\mathit{\alpha}}_{q}^{(-m)})\propto {\displaystyle \prod _{i=1}^{n}f({\mathit{x}}_{2i}|{\mathit{x}}_{1i},{\mathit{\alpha}}^{(m)},{\mathit{\alpha}}_{q}^{(-m)}),}$$

and $f({\mathit{x}}_{2i}|{\mathit{x}}_{1i},{\mathit{\alpha}}^{(m)},{\mathit{\alpha}}_{q}^{(-m)})$ is given by (2.7) under the full model.

The BMT data set consists of *n* = 2397 cases who received an HLA-identical sibling transplant from 1995 to 2004 for AML or ALL in CR1 (pre-transplant status = 1st complete remission) with graft source of BM or PB/PB+BM. Infants (age < 2 years) were excluded. The outcome variable, *y _{i}* in years, was the time from transplant to death or end of follow-up, and ν_{i} is the death indicator, equal to 1 if the *i*th patient died and 0 if the patient was censored at the end of follow-up.

Let *x*_{1} = disease, *x*_{2} = age, *x*_{3} = yeartx, *x*_{4} = karnofprg, *x*_{5} = gsource, **x**_{6} = (sexmatch1, sexmatch2, sexmatch3)′, **x**_{7} = (regimprg1, regimprg2, regimprg3)′, **x**_{8} = (prevgvh11, prevgvh12, prevgvh13, prevgvh14)′, **x**_{9} = (cytoabnew1, cytoabnew2, cytoabnew3)′ and *x*_{10} = log(wbcdx). Of these 10 covariates, *x*_{1}, *x*_{2}, …, **x**_{8} were completely observed for all cases, while **x**_{9} and *x*_{10} had missing information. There were 488 (20.36%) individuals with cytogenetics (**x**_{9}) missing, 230 (9.6%) individuals with WBC missing, and 96 individuals with both cytogenetics and WBC missing. Overall, there were 623 (25.99%) individuals with at least one covariate missing. We assume that the missing covariates are MAR. In all computations, we standardized all completely observed covariates.

For the BMT data, we fit the piecewise exponential model given by (2.1) and (2.2) for the outcome variable *y _{i}*, where the joint distribution of the two covariates subject to missingness is specified through a sequence of one-dimensional conditional distributions as in (2.7), namely *f*(**x**_{9}, *x*_{10} | *x*_{1}, …, **x**_{8}, **α**) = *f*(*x*_{10} | *x*_{1}, …, **x**_{9}, **α**_{10}) *f*(**x**_{9} | *x*_{1}, …, **x**_{8}, **α**_{9}). The proportional odds logistic regression model for **x**_{9} is given by

$$\begin{array}{ccc}P({\mathit{x}}_{9}\hfill & =\hfill & (0,0,0)\prime |{x}_{1},{x}_{2},\dots ,{\mathit{x}}_{8},{\mathit{\alpha}}_{9})\hfill \\ \hfill & =\hfill & F({\alpha}_{9,10}+{\alpha}_{91}{x}_{1}+\cdots +{\alpha}_{95}{x}_{5}+{\mathit{\alpha}}_{96}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{97}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{98}^{\prime}{\mathit{x}}_{8}),\hfill \\ P({\mathit{x}}_{9}\hfill & =\hfill & \left(1,0,0\right)\prime |{x}_{1},{x}_{2},\dots ,{\mathit{x}}_{8},{\mathit{\alpha}}_{9})\hfill \\ \hfill & =\hfill & F({\alpha}_{9,20}+{\alpha}_{91}{x}_{1}+\cdots +{\alpha}_{95}{x}_{5}+{\mathit{\alpha}}_{96}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{97}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{98}^{\prime}{\mathit{x}}_{8})\hfill \\ \hfill & \hfill & -F({\alpha}_{9,10}+{\alpha}_{91}{x}_{1}+\cdots +{\alpha}_{95}{x}_{5}+{\mathit{\alpha}}_{96}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{97}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{98}^{\prime}{\mathit{x}}_{8}),\hfill \\ P({\mathit{x}}_{9}\hfill & =\hfill & (0,1,0)\prime |{x}_{1},{x}_{2},\dots ,{\mathit{x}}_{8},{\mathit{\alpha}}_{9})\hfill \\ \hfill & =\hfill & F({\alpha}_{9,30}+{\alpha}_{91}{x}_{1}+\cdots +{\alpha}_{95}{x}_{5}+{\mathit{\alpha}}_{96}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{97}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{98}^{\prime}{\mathit{x}}_{8})\hfill \\ \hfill & \hfill & -F({\alpha}_{9,20}+{\alpha}_{91}{x}_{1}+\cdots +{\alpha}_{95}{x}_{5}+{\mathit{\alpha}}_{96}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{97}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{98}^{\prime}{\mathit{x}}_{8}),\hfill \end{array}$$

and $P({\mathit{x}}_{9}=(0,0,1)\prime |{x}_{1},{x}_{2},\dots ,{\mathit{x}}_{8},{\mathit{\alpha}}_{9})=1-F({\alpha}_{9,30}+{\alpha}_{91}{x}_{1}+\cdots +{\alpha}_{95}{x}_{5}+{\mathit{\alpha}}_{96}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{97}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{98}^{\prime}{\mathit{x}}_{8})$, where *F*(*u*) = exp(*u*)/{1+exp(*u*)}, α_{9,10} ≤ α_{9,20} ≤ α_{9,30}, ${\mathit{\alpha}}_{96}^{\prime}=({\alpha}_{96,1},{\alpha}_{96,2},{\alpha}_{96,3})$, ${\mathit{\alpha}}_{97}^{\prime}=({\alpha}_{97,1},{\alpha}_{97,2},{\alpha}_{97,3})$, ${\mathit{\alpha}}_{98}^{\prime}=({\alpha}_{98,1},{\alpha}_{98,2},{\alpha}_{98,3},{\alpha}_{98,4})$ and ${\mathit{\alpha}}_{9}=({\alpha}_{9,10},{\alpha}_{9,20},{\alpha}_{9,30},{\alpha}_{91},\dots ,{\alpha}_{95},{\mathit{\alpha}}_{96}^{\prime},{\mathit{\alpha}}_{97}^{\prime},{\mathit{\alpha}}_{98}^{\prime})\prime $. We note that α_{9,10}, α_{9,20}, and α_{9,30} are the three intercepts in the proportional odds logistic regression model. Furthermore, *f* (*x*_{10}|*x*_{1}, *x*_{2}, …, **x**_{8}, **x**_{9}, **α**_{10}) is taken to be the density of a $N({\alpha}_{10,0}+{\alpha}_{10,1}{x}_{1}+\cdots +{\alpha}_{10,5}{x}_{5}+{\mathit{\alpha}}_{10,6}^{\prime}{\mathit{x}}_{6}+{\mathit{\alpha}}_{10,7}^{\prime}{\mathit{x}}_{7}+{\mathit{\alpha}}_{10,8}^{\prime}{\mathit{x}}_{8}+{\mathit{\alpha}}_{10,9}^{\prime}{\mathit{x}}_{9},{\alpha}_{10,10})$ distribution, where α_{10,10} > 0 denotes the variance. The prior for (**β**, **λ**) is given by (2.5) and (2.6). In (2.5), we consider several values for *a*_{0}, such as *a*_{0} = 0.1, 0.01, 0.001, and 0.0001, and in (2.6), we take *b*_{1} = *b*_{2} = 0.001. For the parameters in the models for the missing covariates, an inverse gamma prior with scale and shape parameters equal to 0.001 is specified for α_{10,10}, and independent normal priors, *N*(0, 1000), are specified for all other parameters.
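The four cell probabilities of the cumulative-logit (proportional odds) model above can be sketched in a few lines; the cutpoint values and linear predictor below are hypothetical numbers used only for illustration.

```python
import numpy as np

def F(u):
    # logistic CDF: F(u) = exp(u) / (1 + exp(u))
    return np.exp(u) / (1.0 + np.exp(u))

def cytogenetics_probs(eta, cuts=(-1.0, 0.0, 1.5)):
    """Cell probabilities of the four cytogenetics categories under the
    proportional odds model; `cuts` plays the role of the ordered intercepts
    alpha_{9,10} <= alpha_{9,20} <= alpha_{9,30}, and `eta` is the shared
    linear predictor in the covariates (both hypothetical here)."""
    a1, a2, a3 = cuts
    p = np.array([F(a1 + eta),
                  F(a2 + eta) - F(a1 + eta),
                  F(a3 + eta) - F(a2 + eta),
                  1.0 - F(a3 + eta)])
    return p

p = cytogenetics_probs(eta=0.7)
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()
```

The ordering of the cutpoints guarantees that every differenced probability is nonnegative.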
We wish to compare the following 2^{10} = 1024 models: no covariates (null model), (*x*_{1}), …, (*x*_{10}), (*x*_{1}, *x*_{2}), …, (*x*_{1}, *x*_{2}, *x*_{3}, *x*_{4}, *x*_{5}, **x**_{6}, **x**_{7}, **x**_{8}, **x**_{9}, *x*_{10}) (full model). In all computations, the Gibbs sampling algorithm given in Appendix B was used to sample from the posterior distributions and 10,000 Gibbs samples after a burn-in of 1,000 iterations were used to compute all DIC measures and other posterior estimates. The convergence of the Gibbs sampling algorithm was checked using several diagnostic procedures as recommended by Cowles and Carlin (1996).

We first carry out the complete case (CC) analysis of the BMT data. There were *n** = 1,774 subjects with all ten covariates completely observed. In the CC analysis, we first perform variable subset selection using the AIC and BIC criteria, since with no missing data, these two criteria can be easily computed. Let *L _{cc}*(**β**^{(m)}, **λ** | *D _{cc,m}*) denote the likelihood function of (**β**^{(m)}, **λ**) under model *m* based on the *n** complete cases. Then the AIC for model *m* is defined as

$${\text{AIC}}_{m}=-2\phantom{\rule{thinmathspace}{0ex}}\text{log}{L}_{cc}({\widehat{\mathit{\beta}}}^{(m)},\widehat{\mathbf{\lambda}}|{D}_{cc,m})+2{p}_{m},$$

(5.1)

where ${\widehat{\mathit{\beta}}}^{(m)}$ and $\widehat{\mathbf{\lambda}}$ are the maximum likelihood estimates of **β**^{(m)} and **λ**, and *p _{m}* = *k _{m}* + *J* is the total number of parameters in (**β**^{(m)}, **λ**) under model *m*. Similarly, the BIC for model *m* is defined as

$${\text{BIC}}_{m}=-2\phantom{\rule{thinmathspace}{0ex}}\text{log}\phantom{\rule{thinmathspace}{0ex}}{L}_{cc}({\widehat{\mathit{\beta}}}^{(m)},\widehat{\mathbf{\lambda}}|{D}_{cc,m})+[\text{log}({n}^{*})]{p}_{m}.$$

(5.2)
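Both criteria are trivial to compute once the complete-case likelihood has been maximized. The sketch below assumes the penalty *p _{m}* counts the *k _{m}* regression coefficients plus the *J* hazard parameters, and uses a hypothetical maximized log-likelihood value.

```python
import math

def aic_bic(loglik_hat, k_m, J, n_star):
    """AIC_m and BIC_m as in (5.1)-(5.2), assuming p_m = k_m + J
    (k_m regression coefficients plus J piecewise hazard parameters)."""
    p_m = k_m + J
    aic = -2.0 * loglik_hat + 2.0 * p_m
    bic = -2.0 * loglik_hat + math.log(n_star) * p_m
    return aic, bic

# hypothetical maximized log-likelihood for a model with k_m = 8, J = 15
aic, bic = aic_bic(loglik_hat=-2500.0, k_m=8, J=15, n_star=1774)
```

Because log(1774) ≈ 7.48 > 2, BIC penalizes each parameter more heavily than AIC, which is why its best models in Table 1 are smaller.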

Table 1 shows the best three AIC or BIC models for *J* = 10, 15, and 20. From Table 1, we see that the best three AIC models are (*x*_{1}, *x*_{2}, *x*_{4}, **x**_{9}), (*x*_{1}, *x*_{2}, *x*_{4}, *x*_{5}, **x**_{9}), and (*x*_{1}, *x*_{2}, *x*_{3}, *x*_{4}, *x*_{5}, **x**_{9}), and the best three BIC models are (*x*_{1}, *x*_{2}, **x**_{9}), (*x*_{1}, *x*_{2}, *x*_{4}, **x**_{9}), and (*x*_{1}, *x*_{2}, *x*_{5}, **x**_{9}). In Table 1, under the model (*x*_{1}, *x*_{2}, *x*_{3}, *x*_{4}, *x*_{5}, **x**_{9}), the values of *p _{m}* increase with *J*, since the *J* hazard parameters are counted in the dimension penalty.

For the CC case, we used the direct sampling method to compute the DIC values for all 1024 models under various choices of *J* and *a*_{0}. The results based on the best three DIC models under various choices of *J* and *a*_{0} are shown in Table 2. From Table 2, we see that when *a*_{0} is small, for example, *a*_{0} = 0.001 or *a*_{0} = 0.0001, the DIC values are very close to the corresponding values of AIC, and the values of *p _{m}* in DIC are also very close to those in AIC. We also observe that there is a convex pattern in the DICs as functions of *a*_{0}.

For the best two DIC models with *J* = 15 and *a*_{0} = 0.001, we also computed the posterior means (Estimates), the posterior standard deviations (SD’s), and 95% highest posterior density (HPD) intervals of the model parameters. The results are shown in Table 3. Under model (*x*_{1}, *x*_{2}, *x*_{4}, **x**_{9}), all 95% HPD intervals do not contain 0, indicating the importance of all these covariates. The results given in Table 3 indicate that an ALL patient has a higher risk of death compared to an AML patient, an older patient has a higher risk of death, a higher Karnofsky score at pre-transplant leads to a lower risk of death, and a patient with poor cytogenetics is likely to have a high risk of death. Under model (*x*_{1}, *x*_{2}, *x*_{4}, *x*_{5}, **x**_{9}), all covariates except for gsource have 95% HPD intervals that do not contain 0.
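The 95% HPD intervals reported here can be computed from posterior draws as the shortest interval containing 95% of the sample. A sketch on hypothetical draws for a single regression coefficient:

```python
import numpy as np

def hpd_interval(samples, level=0.95):
    """Empirical HPD interval: the shortest interval spanning `level` of the
    sorted posterior draws."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    m = int(np.ceil(level * n))
    widths = x[m - 1:] - x[: n - m + 1]     # all candidate intervals of m draws
    j = int(np.argmin(widths))
    return x[j], x[j + m - 1]

rng = np.random.default_rng(1)
draws = rng.normal(loc=0.8, scale=0.2, size=20_000)   # hypothetical beta draws
lo, hi = hpd_interval(draws)
significant = not (lo <= 0.0 <= hi)   # "significant": the 95% HPD interval excludes 0
```

For a roughly symmetric posterior the HPD interval nearly coincides with the equal-tailed interval, but it is shorter when the posterior is skewed.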

Posterior estimates of **β** under best two DIC models with *J* = 15 and *a*_{0} = 0.001 for the completely observed BMT data

Next, we carry out an all case (AC) analysis of the BMT data, that is, an analysis incorporating all of the cases. In the AC case, due to the additional complication of modeling the missing covariates, AIC and BIC are computationally infeasible, as discussed earlier. In fact, one could even argue that these measures are not well defined here, since the penalty term is not clearly defined. In particular, if we use the marginal likelihood $L({\widehat{\mathit{\beta}}}^{(m)},\widehat{\mathbf{\lambda}}|{D}_{m})$, obtained by averaging over all of the possible missing values of the covariates according to the missing covariate distribution, it is not clear how to appropriately define the dimension penalty *p _{m}*. We therefore use only the DIC measures for variable selection in the AC analysis.

For the best two DIC models with *J* = 15 and *a*_{0} = 0.001, we also computed the posterior estimates of the model parameters, and the results are shown in Table 5. Under both of the best two DIC models, all covariates except for gsource in the survival model for the time from transplant to death have 95% HPD intervals that do not contain 0. As **x**_{9} is the only missing covariate in both models, using (3.2), we only need to model **x**_{9} via the proportional odds logistic regression model conditional on the other covariates, namely, disease, age, karnofprg, gsource, and prevgvh1 for model (*x*_{1}, *x*_{2}, *x*_{4}, *x*_{5}, **x**_{8}, **x**_{9}) and disease, age, karnofprg, and gsource for model (*x*_{1}, *x*_{2}, *x*_{4}, *x*_{5}, **x**_{9}). The corresponding posterior estimates for these two missing covariate models are also shown in Table 5. Under these two models for the missing covariate cytoabnew (**x**_{9}), we see that all covariates except for karnofprg have 95% HPD intervals that do not contain 0. For the second best model, Table 3 compares the posterior estimates from the AC analysis to those of the CC analysis. From Table 3, we see that the AC analysis leads to smaller posterior standard deviations and shorter HPD intervals for all parameters in the survival model. In particular, gsource is nearly “significant” in the response model and “significant” in the covariate model in the AC analysis, where significance means that the 95% HPD interval does not contain 0.

We have proposed a joint semi-conjugate prior for the regression coefficients **β** and piecewise hazard parameters **λ** and examined their theoretical properties in the piecewise exponential model for right censored survival data. The proposed prior is quite attractive in the context of variable subset selection for survival data with missing covariates. It is proper and the functional form of the prior is immediately determined for all models once the functional form of the prior is written for the full model. In addition, the prior is completely specified by only one hyper-parameter, namely, *a*_{0}. This indeed makes the elicitation of priors for all models in the model space much easier. Otherwise, prior elicitation would be an enormous task. In addition, we have empirically shown that the DIC measure can be used to guide the choice of *a*_{0} to achieve the best posterior predictive performance. In Sect. 5, for the BMT data, we see that the best model for the AC is different than the one based on a CC analysis. This empirical result demonstrates that one cannot do variable selection just based on the completely observed cases. In fact, it is important to use all cases in performing variable selection.

Our computational methods in this paper are intended for situations where the number of models in the model space can be enumerated, so with this in mind, our proposed procedure works best when the number of covariates is 9–15. We have considered two Monte Carlo methods for computing the DIC measures. The direct sampling method is easy to implement. However, care needs to be taken in monitoring convergence of the Gibbs sampling algorithm for each model in the model space. On the other hand, the single MC sample method requires only one Gibbs sample from the posterior distribution under the full model. Thus, one needs to monitor convergence of the Gibbs sampling algorithm only once. However, in this case, one needs to construct a “good” weight function $w({\mathit{\gamma}}_{q}^{(-m)}|{\mathit{\gamma}}_{q}^{(m)})$ to obtain an efficient single MC sample method. The choice of $w({\mathit{\gamma}}_{q}^{(-m)}|{\mathit{\gamma}}_{q}^{(m)})$ proposed in Sect. 4 works well. However, it requires finding the conditional posterior modes, which may be computationally expensive. Finding a less efficient but less computationally expensive weight function is an important future project, which is currently under investigation. We note that both Monte Carlo methods can be easily implemented using multiple computers. Thus, a parallel computing system can greatly speed up the computation of the DIC measures for variable selection. With a Linux cluster, the proposed computational procedure can work well when the number of covariates is up to 20.
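The enumeration that drives both Monte Carlo methods is a simple powerset loop over covariate subsets. The sketch below uses a mock scoring function in place of a full Gibbs-sampler DIC evaluation; since each model's DIC is computed independently, the scoring loop is exactly the work one would farm out across cluster nodes.

```python
from itertools import chain, combinations

def dic_for_model(subset):
    # placeholder score standing in for a full Gibbs-sampler DIC evaluation
    return 100.0 - 2.0 * len(subset) + 0.05 * sum(subset) ** 1.5

covariates = tuple(range(1, 11))               # x_1, ..., x_10
models = list(chain.from_iterable(combinations(covariates, r)
                                  for r in range(len(covariates) + 1)))
assert len(models) == 1024                     # 2^10 subset models, incl. the null model

# Each model's DIC is independent of the others, so this loop is trivially
# parallelizable (e.g. one chunk of `models` per node on a Linux cluster).
dics = [dic_for_model(m) for m in models]
best = min(zip(dics, models))[1]               # model with the smallest DIC
```

With 2^{20} ≈ 10^6 models the same loop still enumerates, which is why the procedure remains workable up to about 20 covariates on a cluster.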

Another important criterion used in model assessment is Bayesian Model Averaging (BMA). Since we have focused this paper on variable selection and selecting a set of top models, we have not addressed the issue of BMA at all, as this is an entirely different topic with different inferential goals and different computational strategies. The performance of the proposed semi-conjugate priors in the presence of MAR covariates and the effects of covariates such as sexmatch within the BMA context will be explored in future work.

The authors wish to thank Dr. Mei-Jie Zhang for providing the BMT data. The authors also wish to thank the Editor-in-Chief, the Editor, and a referee for their helpful comments and suggestions, which have improved the paper. This research was partially supported by NIH grants #GM 70335 and #CA 74015.

Observe that the marginal prior of (λ_{1}, **β**) is of the form

$$\pi ({\lambda}_{1},\mathit{\beta}|{y}_{0},X,{a}_{0})\propto {\displaystyle \prod _{i=1}^{n}{\{{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}}\{-{a}_{0}{y}_{0}{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}\frac{1}{{\lambda}_{1}}.$$

Let β_{0} = log λ_{1}. We have

$$\pi ({\beta}_{0},\mathit{\beta}|{y}_{0},X,{a}_{0})\propto {\displaystyle \prod _{i=1}^{n}{\{\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}\{-{a}_{0}{y}_{0}\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})\}.}$$

(A.1)

Since ${X}_{\mathit{\text{obs}}}^{*}$ is of full rank, *X** = (**1**, *X*) is of full rank. It is easy to show that π(β_{0}, **β**|*y*_{0}, *X*, *a*_{0}) is log-concave in (β_{0}, **β**′)′. This implies that π(β_{0}, **β**|*y*_{0}, *X*, *a*_{0}) has a unique mode. Set

$$\frac{\partial}{\partial ({\beta}_{0},\mathit{\beta}\prime )\prime}\text{log}\phantom{\rule{thinmathspace}{0ex}}\pi ({\beta}_{0},\mathit{\beta}|{y}_{0},X,{a}_{0})={a}_{0}{\displaystyle \sum _{i=1}^{n}[1-{y}_{0}\phantom{\rule{thinmathspace}{0ex}}\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})](1,{\mathit{x}}_{i}^{\prime})\prime =0.}$$

(A.2)

Thus, (−log *y*_{0}, 0, …, 0)′ is the unique solution of (A.2). This implies that (−log *y*_{0}, 0, …, 0)′ is the unique prior mode of (log λ_{1}, **β**).

For (ii), it suffices to prove that π(β_{0}, **β**|*y*_{0}, *X*, *a*_{0}) in (A.1) is proper, since π_{0}(λ_{2}, …, λ_{J}) is a proper prior. We write

$$\pi *({\beta}_{0},\mathit{\beta}|{y}_{0},X,{a}_{0})={\displaystyle \prod _{i=1}^{n}{\{\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}\{-{a}_{0}{y}_{0}\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})\}.}$$

It is easy to observe that

$${\{\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}\{-{a}_{0}{y}_{0}\text{exp}({\beta}_{0}+{\mathit{x}}_{i}^{\prime}\mathit{\beta})\}\le {K}_{0}$$

for *i* = 1, 2, …, *n*, where *K*_{0} > 0 is a constant. Since ${X}_{\mathit{\text{obs}}}^{*}$ is of full rank, there exist *i*_{1} < *i*_{2} < ⋯ < *i*_{k+1} such that ${\mathit{x}}_{{i}_{1}},{\mathit{x}}_{{i}_{2}},\dots ,{\mathit{x}}_{{i}_{k+1}}$ are completely observed and the (*k*+1) × (*k*+1) matrix ${X}^{**}=((1,{\mathit{x}}_{{i}_{g}}^{\prime}),g=1,2,\dots ,k+1)$ is of full rank. Let **u** = (*u*_{1}, *u*_{2}, …, *u*_{k+1})′ = *X*** (β_{0}, **β**′)′. Making the change of variables from (β_{0}, **β**′)′ to **u**, we obtain

$$\begin{array}{c}\int {\pi}^{*}({\beta}_{0},\mathit{\beta}|{y}_{0},X,{a}_{0})d{\beta}_{0}d\mathit{\beta}\hfill \\ \text{}\le {K}_{0}^{n-k-1}\int {\displaystyle \prod _{g=1}^{k+1}}{\{\text{exp}({\beta}_{0}+{\mathit{x}}_{{i}_{g}}^{\prime}\mathit{\beta})\}}^{{a}_{0}}\text{exp}\{-{a}_{0}{y}_{0}\text{exp}({\beta}_{0}+{\mathit{x}}_{{i}_{g}}^{\prime}\mathit{\beta})\}d{\beta}_{0}d\mathit{\beta}\hfill \\ \text{}={K}_{1}{\displaystyle \prod _{g=1}^{k+1}{\displaystyle \int \text{exp}({a}_{0}{u}_{g})\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{-{a}_{0}{y}_{0}\phantom{\rule{thinmathspace}{0ex}}\text{exp}({u}_{g})\}d{u}_{g}<\infty ,}}\hfill \end{array}$$

(A.3)

which completes the proof of Theorem 2.1.
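The finiteness of each one-dimensional integral in (A.3) can be checked directly: substituting *t* = *a*_{0}*y*_{0}*e*^{u} (so *du* = *dt*/*t*) turns it into a gamma integral,

$$\int_{-\infty}^{\infty}\text{exp}({a}_{0}u)\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{-{a}_{0}{y}_{0}\phantom{\rule{thinmathspace}{0ex}}\text{exp}(u)\}du={\displaystyle \int_{0}^{\infty}{\left(\frac{t}{{a}_{0}{y}_{0}}\right)}^{{a}_{0}}{e}^{-t}\frac{dt}{t}}=\frac{\mathrm{\Gamma}({a}_{0})}{{({a}_{0}{y}_{0})}^{{a}_{0}}}<\infty ,$$

since *a*_{0} > 0 and *y*_{0} > 0.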

In this appendix, we discuss how to sample from the posterior distribution under the full model given in (2.10). To this end, we propose a Gibbs sampling algorithm, which requires sampling from the following full conditional distributions in turn:

- [**β** | **λ**, * x_{mis}*, *a*_{0}, *D _{obs}*];
- [**λ** | **β**, * x_{mis}*, *a*_{0}, *D _{obs}*];
- [* x_{mis}* | **β**, **λ**, **α**, *a*_{0}, *D _{obs}*];
- [**α** | * x_{mis}*, *a*_{0}, *D _{obs}*].

We briefly discuss how we sample from each of the above full conditional distributions. For (i), the full conditional density of **β** given **λ**, * x_{mis}*, *a*_{0}, and *D _{obs}* is

$$\begin{array}{c}\pi (\mathit{\beta}|\mathbf{\lambda},{\mathit{x}}_{\mathit{\text{mis}}},{a}_{0},{D}_{\mathit{\text{obs}}})\propto {\displaystyle \prod _{i=1}^{n}\text{exp}}\{({\nu}_{i}+{a}_{0}){\mathit{x}}_{i}^{\text{'}}\mathit{\beta}-{a}_{0}{y}_{0}{\lambda}_{1}\text{exp}({\mathit{x}}_{i}^{\text{'}}\mathit{\beta})\}\hfill \\ \times \phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}[-{\displaystyle \sum _{j=1}^{J}{\delta}_{ij}\text{exp}\phantom{\rule{thinmathspace}{0ex}}({\mathit{x}}_{i}^{\text{'}}\mathit{\beta})\phantom{\rule{thinmathspace}{0ex}}\{{\lambda}_{j}({y}_{i}-{s}_{j-1})+{\displaystyle \sum _{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})}}\}].\hfill \end{array}$$

It is easy to show that π(**β**|**λ**, * x_{mis}*, *a*_{0}, *D _{obs}*) is log-concave in **β**, and hence **β** can be sampled via the adaptive rejection algorithm of Gilks and Wild (1992). For (ii), let *h _{ij}* = δ_{ij}(*y _{i}* − *s*_{j−1}) + 1{*y _{i}* > *s _{j}*}(*s _{j}* − *s*_{j−1}) denote the time the *i*th subject spends at risk in the *j*th interval. The full conditional distributions of the hazard parameters are then gamma distributions:

$${\lambda}_{1}\sim \text{Gamma}\phantom{\rule{thinmathspace}{0ex}}(n{a}_{0}+{\displaystyle \sum _{i=1}^{n}{\delta}_{i1}{\nu}_{i},}{\displaystyle \sum _{i=1}^{n}\{({a}_{0}{y}_{0}+{h}_{i1})\phantom{\rule{thinmathspace}{0ex}}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}}),$$

(B.1)

and

$${\lambda}_{j}\sim \phantom{\rule{thinmathspace}{0ex}}\text{Gamma}\phantom{\rule{thinmathspace}{0ex}}({b}_{1}+{\displaystyle \sum _{i=1}^{n}{\delta}_{ij}{\nu}_{i},{b}_{2}+}{\displaystyle \sum _{i=1}^{n}{h}_{ij}\phantom{\rule{thinmathspace}{0ex}}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})}),$$

(B.2)

for *j* = 2, …, *J*. Hence, sampling the λ_{j} from (B.1) and (B.2) is straightforward.
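A sketch of the two gamma updates follows, with *h _{ij}* reconstructed as the time subject *i* spends at risk in interval (*s*_{j−1}, *s _{j}*] (our reading of the terms in (B.1)–(B.2)); the toy data are hypothetical. Note that `numpy` parameterizes the gamma by shape and scale, so the rate must be inverted.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_lambdas(beta, x, y, s, delta, nu, a0, y0, b1=0.001, b2=0.001):
    """Draw (lambda_1, ..., lambda_J) from the gamma full conditionals
    (B.1)-(B.2); h[i, j] is the time subject i is at risk in interval j."""
    n, J = delta.shape
    eta = np.exp(x @ beta)
    # partial width if y_i falls in interval j, full width if y_i passes s_j
    h = (delta * (y[:, None] - s[:-1][None, :])
         + (y[:, None] > s[1:][None, :]) * np.diff(s)[None, :])
    lam = np.empty(J)
    # (B.1): lambda_1 carries the extra prior terms in a0 and y0
    shape1 = n * a0 + np.sum(delta[:, 0] * nu)
    rate1 = np.sum((a0 * y0 + h[:, 0]) * eta)
    lam[0] = rng.gamma(shape1, 1.0 / rate1)        # numpy's scale = 1 / rate
    # (B.2): lambda_j for j = 2, ..., J
    for j in range(1, J):
        shape_j = b1 + np.sum(delta[:, j] * nu)
        rate_j = b2 + np.sum(h[:, j] * eta)
        lam[j] = rng.gamma(shape_j, 1.0 / rate_j)
    return lam

# toy data: n = 3 subjects, J = 2 intervals with cut points s = (0, 1, 2)
s = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.0])
delta = np.array([[1, 0], [0, 1], [0, 1]])         # interval containing y_i
nu = np.array([1, 1, 0])                           # 1 = death, 0 = censored
x = np.array([[0.2], [-0.1], [0.4]])
lam = sample_lambdas(np.array([0.5]), x, y, s, delta, nu, a0=0.01, y0=1.0)
```

Inside a full Gibbs sweep this step alternates with the draws of **β**, * x_{mis}*, and **α**.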

For (iii), given **β**, **λ**, and **α**, the **x**_{2i,mis}’s are conditionally independent, and the conditional distribution for **x**_{2i,mis} is

$$\begin{array}{c}\pi \phantom{\rule{thinmathspace}{0ex}}({\mathit{x}}_{2i,\mathit{\text{mis}}}|\mathit{\beta},\mathbf{\lambda},\mathit{\alpha},{a}_{0},{D}_{\mathit{\text{obs}}})\propto f({\mathit{x}}_{2i}|{\mathit{x}}_{1i},\mathit{\alpha})\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{({\nu}_{i}+{a}_{0}){\mathit{x}}_{i}^{\prime}\mathit{\beta}\}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\{-{a}_{0}{y}_{0}{\lambda}_{1}\hfill \\ \hspace{1em}\times \phantom{\rule{thinmathspace}{0ex}}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})\}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}[-{\displaystyle \sum _{j=1}^{J}{\delta}_{ij}\phantom{\rule{thinmathspace}{0ex}}\text{exp}({\mathit{x}}_{i}^{\prime}\mathit{\beta})}\phantom{\rule{thinmathspace}{0ex}}\{{\lambda}_{j}({y}_{i}-{s}_{j-1})+{\displaystyle \sum _{g=1}^{j-1}{\lambda}_{g}({s}_{g}-{s}_{g-1})}\}]\phantom{\rule{thinmathspace}{0ex}}.\hfill \end{array}$$

Thus, the conditional distribution of *x*_{2i,mis} depends on the form of *f* (*x*_{2i}|*x*_{1i}, **α**). In Sect. 5, for the BMT data, *f* (*x*_{2i}|*x*_{1i}, **α**) is a product of a proportional odds logistic density and a normal density, and hence, sampling *x*_{2i,mis} is relatively straightforward. In fact, the conditional distribution for (cytoabnew1, cytoabnew2, cytoabnew3) is multinomial, while the conditional distribution for log(wbcdx) is log-concave, which can be sampled via the adaptive rejection algorithm of Gilks and Wild (1992). For (iv), the full conditional distribution is $\pi \phantom{\rule{thinmathspace}{0ex}}(\mathit{\alpha}|{\mathit{x}}_{\mathit{\text{mis}}},{a}_{0},{D}_{\mathit{\text{obs}}})\propto {\displaystyle {\prod}_{i=1}^{n}f({\mathit{x}}_{2i}|{\mathit{x}}_{1i},\mathit{\alpha})\pi (\mathit{\alpha})}$. For various covariate distributions specified through a series of one-dimensional conditional distributions, sampling **α** is straightforward. For example, in Sect. 5, the full conditional distribution for each component of **α**_{9} is log-concave, and hence we can sample these α_{9j}’s via the adaptive rejection algorithm of Gilks and Wild (1992), and the full conditional distributions for the components of **α**_{10} are either normal or inverse gamma, which are easy to sample from.
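Sampling a missing categorical covariate from its multinomial full conditional amounts to reweighting the covariate-model cell probabilities by the survival-likelihood factor evaluated at each candidate category and renormalizing; all numbers below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_missing_category(prior_probs, log_lik_by_cat):
    """Draw a missing categorical covariate (e.g. cytogenetics) from its
    multinomial full conditional: covariate-model probabilities times the
    survival-likelihood factor at each candidate category, renormalized."""
    logp = np.log(prior_probs) + log_lik_by_cat
    p = np.exp(logp - logp.max())      # work on the log scale for stability
    p /= p.sum()
    return rng.choice(len(p), p=p)

# hypothetical f(x_9 | x_1..x_8, alpha) probabilities and per-category
# log-likelihood contributions from the survival model
cat = sample_missing_category(np.array([0.2, 0.4, 0.3, 0.1]),
                              np.array([-1.0, -0.8, -1.5, -2.0]))
```

The same enumerate-and-renormalize step is exact for any covariate with finite support; the continuous covariate log(wbcdx) instead requires the rejection-based draw described above.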

Joseph G. Ibrahim, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA, e-mail: ibrahim@bios.unc.edu.

Ming-Hui Chen, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA, e-mail: mhchen@stat.uconn.edu.

Sungduk Kim, Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, NIH, Rockville, MD 20852, USA, e-mail: kims2@mail.nih.gov.

- Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F, editors. International symposium on information theory; Budapest: Akademiai Kiado; 1973. pp. 267–281.
- Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian variable selection and prediction. J R Stat Soc B. 1998;60:627–641.
- Brown PJ, Vannucci M, Fearn T. Bayes model averaging with selection of regressors. J R Stat Soc B. 2002;64:519–536.
- Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models (with discussion) Bayesian Anal. 2006;1:651–674.
- Chen MH, Ibrahim JG. Conjugate priors for generalized linear models. Stat Sinica. 2003;13:461–476.
- Chen MH, Ibrahim JG, Yiannoutsos C. Prior elicitation, variable selection, and Bayesian computation for logistic regression models. J R Stat Soc B. 1999;61:223–242.
- Chen MH, Ibrahim JG, Shao QM, Weiss RE. Prior elicitation for model selection and estimation in generalized linear mixed models. J Stat Plan Inference. 2003;111:57–76.
- Chen MH, Dey DK, Ibrahim JG. Bayesian criterion based model assessment for categorical data. Biometrika. 2004;91:45–63.
- Chen MH, Huang L, Ibrahim JG, Kim S. Bayesian variable selection and computation for generalized linear models with conjugate priors. Bayesian Anal. 2008;3:585–614. [PMC free article] [PubMed]
- Chipman HA, George EI, McCulloch RE. Bayesian CART model search (with discussion) J Am Stat Assoc. 1998;93:935–960.
- Chipman HA, George EI, McCulloch RE. The practical implementation of Bayesian model selection (with discussion) In: Lahiri P, editor. Model selection. Beachwood: Institute of Mathematical Statistics; 2001. pp. 63–134.
- Chipman HA, George EI, McCulloch RE. Bayesian treed generalized linear models (with discussion) In: Bernardo JM, Bayarri M, Berger JO, Dawid AP, Heckerman D, Smith AFM, editors. Bayesian statistics. vol 7. Oxford: Oxford University Press; 2003. pp. 85–103.
- Clyde M. Bayesian model averaging and model search strategies (with discussion) In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics. vol 6. Oxford: Oxford University Press; 1999. pp. 157–185.
- Clyde M, George EI. Model uncertainty. Stat Sci. 2004;19:81–94.
- Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. J Am Stat Assoc. 1996;91:883–904.
- Dellaportas P, Forster JJ. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika. 1999;86:615–633.
- Dey DK, Chen MH, Chang H. Bayesian approach for the nonlinear random effects models. Biometrics. 1997;53:1239–1252.
- Geisser S, Eddy W. A predictive approach to model selection. J Am Stat Assoc. 1979;74:153–160.
- Gelfand AE, Dey DK. Bayesian model choice: asymptotics and exact calculations. J R Stat Soc B. 1994;56:501–514.
- Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods (with discussion) In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian statistics. vol 4. Oxford: Oxford University Press; 1992. pp. 147–167.
- Gelfand AE, Ghosh SK. Model choice: a minimum posterior predictive loss approach. Biometrika. 1998;85:1–13.
- Gelman A, Meng XL, Stern HS. Posterior predictive assessment of model fitness via realized discrepancies (with discussion) Stat Sinica. 1996;6:733–807.
- George EI. The variable selection problem. J Am Stat Assoc. 2000;95:1304–1308.
- George EI, Foster DP. Calibration and empirical Bayes variable selection. Biometrika. 2000;87:731–747.
- George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88:881–889.
- George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sinica. 1997;7:339–374.
- George EI, McCulloch RE, Tsay R. Two approaches to Bayesian model selection with applications. In: Berry D, Chaloner K, Geweke J, editors. Bayesian analysis in statistics and econometrics: essays in honor of Arnold Zellner. New York: Wiley; 1996. pp. 339–348.
- Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. J R Stat Soc C (Appl Stat) 1992;41:337–348.
- Hanson TE. Inference for mixtures of finite Polya tree models. J Am Stat Assoc. 2006;101:1548–1565.
- Huang L, Chen MH, Ibrahim JG. Bayesian analysis for generalized linear models with nonignorably missing covariates. Biometrics. 2005;61:767–780. [PubMed]
- Ibrahim JG, Chen MH. Power prior distributions for regression models. Stat Sci. 2000;15:46–60.
- Ibrahim JG, Laud PW. A predictive approach to the analysis of designed experiments. J Am Stat Assoc. 1994;89:309–319.
- Ibrahim JG, Chen MH, McEachern SN. Bayesian variable selection for proportional hazards models. Can J Stat. 1999a;27:701–717.
- Ibrahim JG, Lipsitz SR, Chen MH. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. J R Stat Soc B. 1999b;61:173–190.
- Ibrahim JG, Chen MH, Ryan LM. Bayesian variable selection for time series count data. Stat Sinica. 2000;10:971–987.
- Ibrahim JG, Chen MH, Sinha D. Bayesian survival analysis. New York: Springer-Verlag; 2001a.
- Ibrahim JG, Chen MH, Sinha D. Criterion based methods for Bayesian model assessment. Stat Sinica. 2001b;11:419–443.
- Ibrahim JG, Chen MH, Lipsitz SR, Herring AH. Missing data methods in regression models. J Am Stat Assoc. 2005;100:332–346.
- Kim S, Chen MH, Dey DK, Gamerman D. Bayesian dynamic models for survival data with a cure fraction. Lifetime Data Anal. 2007;13:17–35. [PubMed]
- Laud PW, Ibrahim JG. Predictive model selection. J R Stat Soc B. 1995;57:247–262.
- Lipsitz SR, Ibrahim JG. A conditional model for incomplete covariates in parametric regression models. Biometrika. 1996;83:916–922.
- Little RJA, Rubin DB. Statistical analysis with missing data. 2nd edn. New York: Wiley; 2002.
- Ntzoufras I, Dellaportas P, Forster JJ. Bayesian variable and link determination for generalised linear models. J Stat Plan Inference. 2003;111:165–180.
- Raftery AE. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika. 1996;83:251–266.
- Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. J Am Stat Assoc. 1997;92:179–191.
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
- Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. J Econom. 1996;75:317–343.
- Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion) J R Stat Soc B. 2002;64:583–639.