Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2909687

Formats

Article sections

- Abstract
- 1 Introduction
- 2 IM Priors for Generalized Linear Models
- 3 IM Ridge Priors for GLMs when p > n
- 4 Empirical Studies
- 5 Analysis of nucleosomal positioning data
- 6 Discussion
- Supplementary Material
- References

Authors

Related links

Stat Sin. Author manuscript; available in PMC 2010 July 26.

Published in final edited form as:

Stat Sin. 2009; 19(4): 1641–1663.

PMCID: PMC2909687

NIHMSID: NIHMS120936

See other articles in PMC that cite the published article.

An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates *p* in the model greatly exceeds the sample size *n* (sometimes termed the “*p* > *n*” problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which *p* < *n*, and then extended to the *p* > *n* case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner’s g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys’ prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.

In the analysis of data arising in many scientific applications, one often faces a scenario in which the number of variables (*p*) greatly exceeds the sample size (*n*), often termed the “*p* > *n*” problem. In these problems, fitting many types of statistical models leads to model nonidentifiability in which parameters cannot be estimated via maximum likelihood. The specification of proper priors can alleviate such a nonidentifiability problem and lead to proper posterior distributions as long as one uses a valid probability density for the data, i.e., ∫ *f*(*y*|θ) *dy* < ∞. Specification of proper priors in the *p* > *n* context is not easy since it is desirable that the prior (i) leads to existence of prior or posterior moments, which is not guaranteed and theoretically checking this is not easy; (ii) is relatively non-informative so that the data can essentially drive the inference; (iii) is at least somewhat semi-automatic in nature requiring relatively little or minimal hyper-parameter specification; and (iv) is easy to interpret and computationally feasible.

Properties (i) – (iv) are important to investigate in considering a class of priors for posterior inference. The existence of posterior moments is important since Bayesian inference often relies on the estimation of posterior quantities (e.g. means), via Markov chain Monte Carlo (MCMC) or other sampling procedures. Computing Bayes factors and other model assessment criteria often requires the existence of posterior moments (Chen, Ibrahim, and Yiannoutsos (1999)). Constructing estimates through a black box use of MCMC, without knowing that these estimates are well defined for the class of priors considered, may lead to nonsensical posterior inference. Non-informativeness is desirable in model selection and model assessment settings, or wherever prior information is not available. The specification of semi-automatic priors has been advocated by Spiegelhalter and Smith (1982), Zellner (1986), Mitchell and Beauchamp (1988), Berger and Pericchi (1996), Bedrick, Christensen, and Johnson (1996), Chen, Ibrahim, and Yiannoutsos (1999), and Ibrahim and Chen (2003), and is attractive in high dimensional settings where it is difficult to find contextual interpretations for all parameters.

In the *p* > *n* paradigm, there has been little work on the specification of desirable priors and, in particular, priors that attain (i) – (iv) above. West (2003) specifies singular *g*-priors (Zellner (1986)) for the Gaussian model in the context of Bayesian factor analysis for *p* > *n*. Liang, Paulo, Molina, Clyde, and Berger (2008) advocate the use of mixtures of g-priors to resolve consistency issues in model selection but their adaptation does not directly extend to cases where *p* > *n*. When *p* > *n*, *N _{p}*(0, γ

The class of priors we consider are called Information Matrix (IM) priors. They can be applied in any parametric regression context but we initially focus on generalized linear models (GLMs). The functional form of the IM prior is obtained once a parametric statistical model is specified for the data, the kernel of the IM prior being specified through the Fisher Information matrix.

Let (*y*_{1},…, *y _{n}*) be independent univariate response variables with density

$${I}_{\mathit{\text{ij}}}(\mathit{\beta})=-E\phantom{\rule{thinmathspace}{0ex}}(\frac{{\partial}^{2}}{\partial {\beta}_{i}\partial {\beta}_{j}}\text{log}(L(\mathit{\beta})))\phantom{\rule{thinmathspace}{0ex}},$$

where the expectation is with respect to *y*|**β**, and β_{i} is the *i-th* component of **β**, (*i, j* = 1, … *p*). Further, assume that the *p* × *p* Fisher information matrix, *I*(**β**), is non-singular, as is often the case when *p* < *n*. The general IM prior is now defined as

$${\pi}_{\mathit{\text{IM}}}(\mathit{\beta})\propto {|I(\mathit{\beta})|}^{1/2}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2{c}_{0}}(\mathit{\beta}-{\mathit{\mu}}_{0})\prime I(\mathit{\beta})(\mathit{\beta}-{\mathit{\mu}}_{0})\},$$

(1.1)

where **μ**_{0} and *c*_{0} ≥ 0 are specified location and dispersion hyperparameters. The IM prior (1.1) captures the prior covariance of **β** via the Fisher information matrix, which seems an attractive specification since this matrix plays a major role in the determination of the large sample covariance of **β** in both Bayesian and frequentist inference. The use of the design matrix *X* is attractive since *X* may reveal redundant covariates. The prior (1.1) is semi-automatic, requiring specifications only for **μ**_{0} (which can be taken to be 0), and the scalar *c*_{0}. It should be mentioned that some recent work by Wang and George (2007) and Wang (2002) uses a related form of the prior for **β**, with two important differences: (i) the form of the prior for all GLMs is taken to be Gaussian, and (ii) the covariance is dependent on the *observed* information matrix, rather than the *expected* one, leading to a data-dependent prior, unlike our specification here. We now focus on the development of these priors and the resultant posterior estimators in the class of GLMs, where many theoretical and computational properties can be characterized for *p* < *n* and *p* > *n*.

We first consider the IM prior for GLMs when *p* < *n*, and where the dispersion parameter **ϕ** is known or intrinsically fixed, as for example in the binomial, Poisson, and exponential regression models. For a GLM, *y _{i}*|

$$p({y}_{i}|{\mathit{x}}_{i},\mathit{\beta},\varphi )=\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{{a}_{i}^{-1}(\varphi )({y}_{i}{\theta}_{i}-b({\theta}_{i}))+c({y}_{i},\varphi )\},i=1,2,\dots ,n,$$

(2.1)

where the canonical parameter θ_{i} satisfies the equations θ_{i} = θ(η_{i}), and *a*(·), *b*(·) and *c*(·) determine a particular family in the class. The function θ(.) is the link function for the GLM, often referred to as the θ-link, and η_{i} = * x_{i}*′

$$p({y}_{i}|{x}_{i},\mathit{\beta})=\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{{y}_{i}{\theta}_{i}-b({\theta}_{i})+c({y}_{i})\},i=1,2,\dots ,n.$$

(2.2)

Now, let *X* be the *n* × *p* matrix of covariates with *i-th* row ${\mathit{x}}_{i}^{\prime},\mathit{\theta}(X\mathit{\beta})$ be a component-wise function of *X* **β** that depends on the link, and * J* be a

$$p(\mathit{y}|\mathit{\beta})=\text{exp}\{\mathit{y}\prime \mathit{\theta}(X\mathit{\beta})-\mathit{J}\prime b(\mathit{\theta}(X\mathit{\beta}))+c(\mathit{y})\}.$$

(2.3)

For the class of GLMs (with ϕ = 1 and *w _{i}* = 1), the Fisher information matrix for

$$I(\mathit{\beta})=X\prime \mathrm{\Omega}(\mathit{\beta})X,$$

(2.4)

$$\text{where}\mathrm{\Omega}(\mathit{\beta})=\mathrm{\Delta}(\mathit{\beta})V(\mathit{\beta})\mathrm{\Delta}(\mathit{\beta}),$$

(2.5)

where *V* (**β**) is the *n* × *n* diagonal matrix of variance functions with *i-th* diagonal element ${\upsilon}_{i}=\upsilon ({\mathit{x}}_{i}^{\prime}\mathit{\beta})={d}^{2}b({\theta}_{i})/d{\theta}_{i}^{2}$, and Δ(**β**) is an *n* × *n* diagonal matrix of “link adjustments”, with *i-th* diagonal element ${\delta}_{i}=\delta ({\mathit{x}}_{i}^{\prime}\mathit{\beta})=d{\theta}_{i}/d{\eta}_{i}$. We denote the *i-th* diagonal element of Ω(**β**) by ω_{i}. We now can write the IM prior for **β** as

$${\pi}_{\mathit{\text{IM}}}(\mathit{\beta})\propto {|X\prime \mathrm{\Omega}(\mathit{\beta})X|}^{1/2}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2{c}_{0}}(\mathit{\beta}-{\mathit{\mu}}_{0})\prime (X\prime \mathrm{\Omega}(\mathit{\beta})X)(\mathit{\beta}-{\mathit{\mu}}_{0})\}.$$

(2.6)

The prior in (2.6) can be viewed as a generalization of several types of priors. As *c*_{0} → ∞, (2.6) converges to Jeffreys’ prior for GLMs, given by

$${\pi}_{J}(\mathit{\beta})\propto {|X\prime \mathrm{\Omega}(\mathit{\beta})X|}^{1/2}.$$

(2.7)

Jeffreys’ prior is a popular noninformative prior for Bayesian inference in multiparameter settings involving regression coefficients, due to its invariance and local uniformity properties (Kass (1989, 1990)). Also, as long as *X*′Ω(**β**) *X* is bounded (for example in the logistic model) as **β** → ∞, the ratio of prior (2.6) to (2.7) is zero, hence showing the IM prior has lighter tails than Jeffreys’ in the limit. Jeffreys’ prior for GLMs is improper for most models, except for binomial regression models (Ibrahim and Laud (1991)). For the Gaussian linear model, (2.6) reduces to Zellner’s g-prior (Zellner (1986)). In this case, Ω(**β**) = σ^{2} *I*, where σ^{2} denotes the variance of *y _{i}* in the Gaussian linear model. For any given GLM, if Ω(

The tail behavior of the IM prior can be theoretically compared with the Gaussian prior for specific GLMs. To show that its tails are heavier than a Gaussian distribution, we need

$$\underset{\Vert \mathit{\beta}\Vert \to \infty}{\text{lim}}\frac{{\pi}_{\mathit{\text{IM}}}(\mathit{\beta})}{\varphi (\mathit{\beta};\nu ,\Sigma )}=\infty ,$$

(2.8)

where ϕ (**β**;ν, Σ) denotes the *p* dimensional multivariate normal density with mean ν and covariance matrix Σ. (Showing that the expression in (2.8) goes to zero indicates the tails are lighter than the Gaussian.) For the binomial regression model and the inverse Gaussian model (with canonical link), it can be shown that (2.8) holds, so that the IM prior under these models has heavier tails (Figure 1). For most GLMs, if the IM prior is proper and Ω(*β*) → 0 elementwise as ‖**β**‖ → ∞, then the tails of the IM prior will be heavier than those of Gaussian priors. When the elements of Ω(**β**) become infinite as ‖**β**‖ → ∞, the IM prior has lighter tails than the Gaussian prior, as with the Poisson model with canonical link (Figure 2). Although the IM prior for the Poisson model has lighter tails than a Gaussian prior, the IM prior for this model is much flatter in the “middle” part of the distribution and hence effectively acts as a more noninformative prior.

Density of IM prior under a logistic model with p = 1 compared with Jeffreys’ ( “Jeff”) and a Gaussian prior (“Norm”) for a simulated data set with n = 20. The plot is shown at three different scales. The total **...**

Density of IM prior under a Poisson model with p = 1 compared with a Gaussian prior for a simulated data set with n = 20, shown at two different scales.

When the dispersion parameter **ϕ** is unknown, the IM prior based on **α**, where **α** = (**β**, **ϕ**), is typically not proper for GLMs and has properties that are difficult to characterize. In these cases, one can construct the IM prior of **β** conditional on **ϕ**, and then specify a proper prior for **ϕ** such as a gamma prior or a product of independent gamma densities (Ibrahim and Laud (1991)). The conditional IM prior of **β** given **ϕ** is defined as

$${\pi}_{\mathit{\text{IM}}}(\mathit{\beta}|\mathit{\varphi})\propto {|I(\mathit{\beta}|\mathit{\varphi})|}^{1/2}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2{c}_{0}}(\mathit{\beta}-{\mathit{\mu}}_{0})\prime I(\mathit{\beta}|\mathit{\varphi})(\mathit{\beta}-{\mathit{\mu}}_{0})\},$$

(2.9)

$$\begin{array}{cc}\text{where}& {I}_{\mathit{\text{ij}}}(\mathit{\beta}|\mathit{\varphi})=-\phantom{\rule{thinmathspace}{0ex}}E\phantom{\rule{thinmathspace}{0ex}}(\frac{{\partial}^{2}}{\partial {\beta}_{i}\partial {\beta}_{j}}\text{log}(L(\mathit{\beta},\mathit{\varphi}))),\end{array}$$

and *L*(**β**, **ϕ**) is the likelihood function of (**β**, **ϕ**). We specify the joint prior of (**β**, **ϕ**) as

$$\pi (\mathit{\beta},\mathit{\varphi})\propto {\pi}_{\mathit{\text{IM}}}(\mathit{\beta}|\mathit{\varphi})\pi (\mathit{\varphi}),$$

(2.10)

where π (**ϕ**) is a proper prior for **ϕ**. When *q* = 1, π(**ϕ**) can be taken to be a gamma prior and, for a general *q*, π(ϕ_{1}, …, ϕ_{q}) can be assumed to be a product of *q* independent gamma densities. For GLMs, following the results of Section 3, it can be shown that (2.10) is jointly proper for (**β**, **ϕ**), when **ϕ** is distributed as a product of gamma densities.

We now present theoretical results establishing conditions for the existence of the prior and posterior moment generating functions (MGFs) using the IM prior for GLMs. The proofs of these results are presented as special cases of more general theorems given in Section 3.

The sufficient condition for the existence of the prior MGF of **β** for the IM prior in (2.6) is that

$$\int \text{exp}\{\tau {\theta}^{-1}(r)\}{\left(\frac{{d}^{2}b(r)}{d{r}^{2}}\right)}^{\frac{1}{2}}\mathit{\text{dr}}<\infty ,$$

for τ in some (−, ). This is derived as a special case of Theorem 1, (Corollary 1.2), and matches the condition of MGF existence for Jeffreys’ prior when *p* < *n* (Ibrahim and Laud (1991)).

The necessary condition for existence of the prior MGF of **β** for the IM prior in (2.6) is the finiteness of the *p*-dimensional integral

$$\int {\displaystyle \prod _{i=1}^{p}{[{\upsilon}_{j}(\mathit{\beta}){\delta}_{j}^{2}(\mathit{\beta})]}^{1/2}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2{c}_{0}}(\mathit{\beta}-{\mathit{\mu}}_{0})\prime I(\mathit{\beta})(\mathit{\beta}-{\mathit{\mu}}_{0})+\mathit{t}\prime \mathit{\beta}\}d\mathit{\beta},}$$

(2.11)

for any ** t** in some

A sufficient condition for the existence of the posterior MGF of **β** for the IM prior in (2.6) is the finiteness of the following one-dimensional integral for τ (−, ), some > 0:

$$\int \text{exp}\{\tau {\theta}^{-1}(r)+{\varphi}^{-1}w(yr-b(r))\}{\left(\frac{{d}^{2}b(r)}{d{r}^{2}}\right)}^{\frac{1}{2}}\mathit{\text{dr}}.$$

(2.12)

This is the same condition as for Jeffreys’ prior in Ibrahim and Laud (1991). Since the sufficient conditions for the prior and posterior are the same as those for using Jeffreys’ prior, the examples which satisfy Sufficiency conditions for Jeffreys’ prior in Ibrahim and Laud (1991) also satisfy the sufficient conditions here, e.g., the posterior MGFs exist for binomial, Poisson and Gamma GLMs if none of the observed *y*’s is zero.

When *p* > *n*, **β** is not identifiable in the likelihood function and the MLE of **β** does not exist. Specifying a proper prior guarantees a proper posterior in this setting; however, not all proper priors yield the existence of prior and posterior moments which are often the end objectives in Bayesian inference. To generalize the IM priors to a proper prior in the *p* > *n* case, we introduce a scalar “ridge” parameter λ in the prior construction, defining the IM Ridge (IMR) prior as

$${\pi}_{\mathit{\text{IMR}}}(\mathit{\beta})\propto {|X\prime \mathrm{\Omega}(\mathit{\beta})X+\mathrm{\lambda}I|}^{1/2}\text{exp}\{-\frac{1}{2{c}_{0}}(\mathit{\beta}-{\mathit{\mu}}_{0})\prime (X\prime \mathrm{\Omega}(\mathit{\beta})X+\mathrm{\lambda}I)(\mathit{\beta}-{\mathit{\mu}}_{0})\}.$$

(3.1)

Here λ can be considered a “ridge” parameter as used in regression models for high dimensional covariates, and to reduce effects of collinearity (Hoerl and Kennard (1970)). With the introduction of λ, the matrix *X*′Ω(**β**)*X* + λ*I* is always positive definite regardless of the rank of *X* and the form of Ω, for any GLM. As *c*_{0} → ∞ in 3.1, the IMR prior converges to a generalized Jeffreys’ prior given by
${\pi}_{J}^{*}(\mathit{\beta})\propto {|X\prime \mathrm{\Omega}(\mathit{\beta})X+\mathrm{\lambda}I|}^{1/2}$, which is useful in the *p* > *n* setting since the usual Jeffreys’ prior (2.7) does not exist when *X*′Ω(**β**)*X* is singular. This idea is similar in spirit (but considerably different in form) to the introduction of a constant matrix Λ added to the sample covariance matrix *S* in the construction of a data-dependent prior for the population covariance matrix Σ in multivariate normal settings (Schafer (1997)).

To understand the IMR prior in (3.1), first consider the IMR prior for the linear model. In this case, the IMR prior for **β**|σ^{2} is Gaussian and the posterior distribution is also Gaussian. Specifically, consider the linear model *Y* = *X***β** + , where ~ *N _{n}*(0, σ

$$\mathit{\beta}|\mathit{y},{\sigma}^{2}\phantom{\rule{thinmathspace}{0ex}}~\phantom{\rule{thinmathspace}{0ex}}N\phantom{\rule{thinmathspace}{0ex}}(\Sigma \phantom{\rule{thinmathspace}{0ex}}(\frac{1}{{c}_{0}}(X\prime X+\mathrm{\lambda}I){\mathit{\mu}}_{0}+X\prime \mathit{y}),{\sigma}^{2}\Sigma )\phantom{\rule{thinmathspace}{0ex}},$$

(3.2)

where Σ = [(1 + 1/*c*_{0})*X′X* + (λ/*c*_{0})*I _{p}*]

The prior MGF must exist for the posterior MGF to exist, when *p* > *n*. We first provide necessary and sufficient conditions for prior MGF existence.

*A sufficient condition for the existence of the MGF of β based on the prior (3.1) when p > n is that the one-dimensional integral*

$$\int \text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{\mathrm{\lambda}M}{2{c}_{0}}{[{\theta}^{-1}(r)]}^{2}+{\tau}_{i}{\theta}^{-1}(r)\}{\left[\frac{{d}^{2}b(r)}{d{r}^{2}}\right]}^{\frac{1}{2}}\mathit{\text{dr}}<\infty ,$$

(3.3)

*for some* τ_{i} (−, ), *where* rank (*X*) = *n, M is such that*
$|{X}^{*}|\le {M}^{-\frac{p}{2}},{X}^{*\prime}=[X\prime \vdots {x}_{0}^{\prime}]$, *with x _{0} as a (p − n) × p matrix selected such that X^{*} is positive definite*.

Proofs of all theorems are given in the Appendices (in the online Article Supplement).

Let ω_{k}(β) denote the k-th diagonal element of Ω(**β**) in (2.5). If ${\sum}_{k=1}^{n}{\omega}_{k}(\beta )\le 1$, the prior MGF of **β** from (3.1) always exists, as the integral is dominated by a Gaussian MGF.

The sufficient condition (3.3) also holds and is identical in the p < n case and, additionally, reduces to the sufficient condition for Jeffreys’ prior if λ = 0.

The IMR prior thus can be useful (and more desirable than the IM prior) even in the *p* < *n* case, in the face of high-dimensionality, collinearity, or weak identifiability.

*If the prior MGF exists when p > n, the p-dimensional integral*

$$\int {a}_{s}(\mathit{\beta})\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2{c}_{0}}(\mathit{\beta}-{\mathit{\mu}}_{0})\prime (I(\mathit{\beta})+\mathrm{\lambda}I)(\mathit{\beta}-{\mathit{\mu}}_{0})+t\prime \mathit{\beta}\}d\mathit{\beta}$$

(3.4)

*is finite for some t* (−, ), *for s* = 0, 1, …, *p, where a _{s}*(

Note here that *a*_{0}(**β**) = |*I*(**β**)|, *a*_{p−1}(**β**) = trace(*I*(**β**)), and *a _{p}*(

$${{\displaystyle \int {a}_{11}(\beta )}}^{\frac{1}{2}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{1}{2{c}_{0}}{(\beta -{\mu}_{0})}^{2}({a}_{11}(\beta )+\mathrm{\lambda})+t\beta \}d\beta $$

is finite, where
${a}_{11}(\beta )=b\u2033(\theta )\frac{d\theta}{d\eta}$. If this condition is not satisfied, the MGF does not exist. If the necessity condition is satisfied for *p* = 1, we need to check the condition for larger *p*; if for any *p* we find that the necessity condition is not satisfied, the prior MGF does not exist.

*The result in Theorem 2 holds when n > p, and simplifies to (2.11) when λ = 0*.

*A sufficient condition for the existence of the posterior MGF for π _{I M R}(β) is that*

$$\int \text{exp}\phantom{\rule{thinmathspace}{0ex}}\{-\frac{{M}_{1}\mathrm{\lambda}}{2{c}_{0}}{[{\theta}^{-1}(r)]}^{2}+\tau {\theta}^{-1}(r)+{\varphi}^{-1}w(\mathit{\text{yr}}-b(r))\}{\left[\frac{{d}^{2}b(r)}{d{r}^{2}}\right]}^{1/2}\mathit{\text{dr}}$$

(3.5)

*is finite for some τ in an open neighborhood about zero, for j = 1, …, p. For j = n + 1, …, p, the condition is the same as for the existence of the prior MGF*.

*When p < n (and λ = 0), a sufficient condition for the existence of the posterior MGF is that*

$$\int \text{exp}\phantom{\rule{thinmathspace}{0ex}}\{\tau {\theta}^{-1}(r)+{\varphi}^{-1}w(\mathit{\text{yr}}-b(r))\}{\left[\frac{{d}^{2}b(r)}{d{r}^{2}}\right]}^{1/2}\mathit{\text{dr}}$$

(3.6)

*is finite for some τ in an open neighborhood about zero, and that the MLE exists*.

Here we additionally require that the MLE exists (i.e., the likelihood function is bounded above). However, existence of the prior MGF is not necessary.

The next result demonstrates the usage of the necessary and sufficient conditions in some specific examples of GLMs to determine existence of prior and posterior MGFs. Without loss of generality, we assume that **μ**_{0} = 0.

- According to condition (3.4), prior and posterior MGFs do not exist for the Gamma model with canonical link.

Note that, however, the prior and posterior MGFs do exist for the Gamma model with a log link, due to a result shown in Section 3.2. All derivations are given in the Appendices.

When the information matrix Ω(**β**) = Δ(**β**)*V*(**β**)Δ(**β**) is independent of **β**, the IMR prior reduces to a “ridge” g-prior for a Gaussian linear model, and is of Gaussian form, proper, and its MGF exists. This provides a quick way to determine existence of the prior and posterior MGF for models for which the Sufficiency conditions do not hold, but the necessary ones do. Examples are the gamma model with log-link and the Gaussian model with canonical link. For the gamma model with log link, note that *b*(θ) = −log(−θ), so that *b*″(θ) = υ_{i}(**β**) = 1/θ^{2}; and θ = −*e*^{−η}, thus $\frac{d\theta}{d\eta}={\delta}_{i}(\mathit{\beta})=\theta $. The diagonal elements of Ω(**β**) are ${\upsilon}_{i}(\mathit{\beta}){\delta}_{i}^{2}(\mathit{\beta})=1$. Hence Ω(**β**) reduces to an identity matrix, and π_{I M R}(**β**) is a Gaussian distribution with mean **μ**_{0} and covariance matrix *c*_{0}(*X′X* + λ*I*)^{−1}, which is always proper, and for which the MGF exists even if *p* > *n*.

To determine the influence of λ and *c*_{0} in the IMR prior on the posterior estimates, we first explore the case of Gaussian linear models, where closed forms exist. For simplicity, assume **μ**_{0} = 0.

*The posterior covariance matrix of β for the Gaussian linear model*,

$${\sigma}^{2}\Sigma ={\sigma}^{2}{\left[\right(1+\frac{1}{{c}_{0}})X\prime X+\frac{\mathrm{\lambda}}{{c}_{0}}{I}_{p}]}^{-1},$$

(3.7)

*satisfies*

$${\left(\frac{{c}_{0}}{\mathrm{\lambda}}\right)}^{p/2}\frac{{(1+\frac{{c}_{0}+1}{\mathrm{\lambda}}{x}_{0}^{(p)})}^{-n}}{{n}^{n/2}}\le |\Sigma |\le {\left(\frac{{c}_{0}}{\mathrm{\lambda}}\right)}^{p/2},$$

*Where*
${x}_{0}^{(p)}=\mathit{\text{max}}\{\mathit{\text{diag}}(X\prime X)\}$, *if c*_{0} > λ.

*If p* → ∞, *c*_{0} > λ, *and n is finite, then* |Σ| → ∞.

*The posterior bias of β, for the IMR prior, satisfies*

$$\frac{1}{p}{\displaystyle \sum _{i=1}^{p}\text{bias}({\beta}_{p})\le \frac{1}{p}{\displaystyle \sum _{i=1}^{p}|{\beta}_{i}|}}.$$

*Equivalently, for the determinant of the bias matrix, D* = Σ*X′X*−*I*, ‖ *D* β ‖^{2}= **β**′*D′D***β** ≤ **β**′**β**.

It is also of interest to compare the bias and MSE of estimates arising from the use of the IMR prior to those obtained using a Gaussian prior, *N* (0, *c*_{0}*I _{p}*). For simplicity, assume σ

$$\frac{|{D}_{\mathit{\text{IM}}}|}{|{D}_{N}|}=\frac{|{\{(1+{c}_{0})X\prime X+\mathrm{\lambda}I\}}^{-1}X\prime X{c}_{0}-I|}{|{\{{(X\prime X+{c}_{0}I)}^{-1}X\prime X-I\}}^{-1}X\prime X{c}_{0}-I|}\frac{|X\prime X+\mathrm{\lambda}I||X\prime X+{c}_{0}I|}{|(1+{c}_{0})X\prime X+\mathrm{\lambda}I||{c}_{0}I|}$$

(3.8)

and, similarly, the ratio of the determinants of the mean square error (MSE) matrices is

$$\frac{|{\mathit{\text{MSE}}}_{\mathit{\text{IM}}}|}{|{\mathit{\text{MSE}}}_{N}|}=\frac{|(\Sigma X\prime X-I)\prime (\Sigma X\prime X-I)+\Sigma |}{|[{(X\prime X+{c}_{0}I)}^{-1}X\prime X-I]\prime [{(X\prime X+{c}_{0}I)}^{-1}X\prime X-I]+{(X\prime X+{c}_{0}I)}^{-1}|},$$

(3.9)

where Σ is as given in (3.7). Figure 3 depicts the behavior of these ratios for a set of choices of (*n, p*). The bias ratio (averaged over 5 data sets) decreases with an increase in *c*_{0}, and the IMR prior is uniformly better when *c*_{0} is sufficiently large (> 3) irrespective of whether *n* is larger or smaller than *p*. When *n* ≥ *p*, the MSE using the IMR prior is uniformly better when *c*_{0} is moderately large. However, when *p* > *n*, an increase in *c*_{0} leads to a an increase in the MSE with the IM prior, and a sharper Gaussian prior (with a large bias) is favored in terms of the MSE.

The parameter λ plays an important role in the construction of the IMR prior since its introduction makes the prior covariance matrix of **β** nonsingular regardless of *p*. One question is whether to take λ fixed or random in the model. Empirical experience suggests that if λ is random, any prior for λ would have to be quite sharp since there is no information in the data for estimating it. Empirical studies show that taking λ fixed gives similar results with far less computational effort. When taking 0 < λ ≤ 1, posterior inference about **β** appears quite robust for a wide range of values of λ in (0, 1) (Section 4.2). Note that when *c*_{0} is large, λ and the design matrix *X* have a small impact on the posterior analysis of **β**. Since our main aim is to develop a relatively noninformative IM prior for Bayesian inference, other practical hyperparameter choices include **μ**_{0} = **0** (consistent with Zellner’s g-prior) and a moderately large *c*_{0} (*c*_{0} ≥ 1).

We simulated data under a logistic model with sample size *n* = 20, varying the number of covariates in the range 1 ≤ *p* ≤ 500, to get an idea of the behavior of the IMR prior for *p* < *n* and *p* > *n*. For the prior π_{I M R}(**β**), setting **μ**_{0} = 0, the posterior density of **β** is

$$p(\mathit{\beta}|\mathit{y})\propto \frac{{|\Sigma |}^{-\frac{1}{2}}\phantom{\rule{thinmathspace}{0ex}}\text{exp}\phantom{\rule{thinmathspace}{0ex}}[{\displaystyle {\sum}_{i=1}^{n}{y}_{i}{\mathit{x}}_{i}^{\prime}\mathit{\beta}-\frac{1}{2}\mathit{\beta}\prime {\Sigma}^{-1}\mathit{\beta}]}}{{\displaystyle {\prod}_{i=1}^{n}(1+{e}^{{\mathit{x}}_{i}^{\prime}\mathit{\beta}})}},$$

(4.1)

where Σ = Σ(**β**) = *c*_{0}[*X*′Ω(**β**)*X* + λ*I _{p}*]

We investigated the performance of estimates using the IM prior with different settings of *c*_{0} and for values of *p* from 10 to 500, with a sample size *n* = 150. Data were generated from a logistic model with a design matrix *x* = (*x _{ij}*),

We tested the effect of λ on posterior estimates when *p* > *n*. Here we present results from a simulation with data from a logistic model with *p* = 100, *n* = 50, *c*_{0} = 1, and λ chosen at approximately equal intervals between 0 to 1 (λ > 0). Figure 6 shows that very small values of λ, close to zero, led to unstable estimates; however, interestingly, the estimates were remarkably consistent, exhibiting little or no difference over a large range of λ values, between 0.4 and 1. Since the results were robust to this wide range of λ for both the *p* < *n* and *p* > *n* cases, we chose the value λ = .5 for all analyses.

We compared the performance of estimates based on the IMR prior to Gaussian priors and a “g-prior” for the logistic model, as the sample size decreased relative to *p*. We generated sets of data for a fixed *p* = 100, while *n* varied between *p*/2 = 50 and 2*p* = 200. Five data sets were generated for each combination of (*n, p*), while **β** was generated from a *N*(**β**_{0}, *c*_{0}[* x′W*(

Sampling from the posterior distribution was done with a multivariate *t* trial density. In addition to the bias and MSE, we also computed the theoretical “asymptotic covariance” of the estimates from the inverse of the Hessian matrix of the log-posterior, as ${V}_{\mathit{\text{as}}}={[-\frac{{\delta}^{2}}{\delta {\beta}^{2}}\text{log}\phantom{\rule{thinmathspace}{0ex}}p(\mathit{\beta}|\mathit{y},\mathit{x})]}^{-1}$. *V _{as}* cannot be evaluated in closed form, so we used a numerical computation routine in R to get an approximate estimate. The results from the IM prior (Figure 7) were compared with estimates using (i)

In high-dimensional regression applications, an alternative method to estimating the full model is Bayesian model averaging (BMA) of the posterior estimates (Hoeting, Madigan, Raftery, and Volinsky (1999)). We compared predictions using the IMR priors in a full model to those obtained using BMA with a g-prior in a scenario where *p* > *n* in logistic regression. Since the g-prior is undefined when *p* > *n*, the model averaging procedure was restricted to involve only *p* < *n* models. IMR was compared with (i) BMA using BIC, in a stepwise variable selection algorithm (Yeung, Bumgarner, and Raftery (2005)) implemented as the iBMA routine in the R package “BMA”, and (ii) BMA under Zellner’s g-prior with a marginal likelihood evaluated using the generalized Laplace approximation (Bollen, Ray, and Zavisca (2005)) with model selection through the evolutionary Monte Carlo (EMC) algorithm (Liang and Wong (2000)). The approximation to the marginal likelihood is given by $p(\mathit{y})\approx \text{exp}[{\displaystyle {\sum}_{i=1}^{n}\{{y}_{i}{\mathit{x}}_{i}^{\prime}\widehat{\mathit{\beta}}-\text{log}(1+{e}^{{\mathit{x}}_{i}^{\prime}\widehat{\mathit{\beta}}})\}}]{|{c}_{0}{(X\prime X)}^{-1}|}^{\frac{1}{2}}\varphi (\widehat{\mathit{\beta}};\mathbf{0},{[X\prime \mathrm{\Omega}(\widehat{\mathit{\beta}})X]}^{-1}+{c}_{0}{(X\prime X)}^{-1})$, where ϕ(* y*;

We first adapted a procedure proposed in Hoeting *et al.* (1999) to compare the predictive performance of the three methods. The data was randomly split into halves, and each model selection method was applied on the first half of the data (“training set”, T). The performance was then measured on the second half (“test set”, t) of the data through an approximation to the *predictive logarithmic score* (Good (1952)) which is given by the sum of the logarithms of the observed ordinates of the predictive density under the model *M* for each observation in the test set − ∑_{dεt} log *P*(*d|D ^{T} , M*), where

The misregulation of the chromatin structure in DNA is associated with the progression of cancer, aging, and developmental defects (Johnson (2000)). It is known that the accessibility of genetic information in DNA is dependent on the positioning of histone proteins packaging the chromatin, forming *nucleosomes* (Kornberg and Lorch (1999)), which in turn is dependent upon the underlying DNA sequence. Nucleosomes typically comprise regions of about 147 bp of DNA separated by stretches of “open” DNA (nucleosome-free regions, or NFRs). Nucleosome positioning is known to be influenced by di- and tri-nucleotide repeats (Thastrom, Bingham, and Widom (2004)), but overall, the sequence signals infiuencing positioning are relatively weak and difficult to detect.

To determine how sequence features affect nucleosome positioning, we obtained data from a genome-wide study of chromatin structure in yeast (Hogan, Lee, and Lieb (2006)). The data consist of normalized log-ratios of intensities measured for a tiled array for chromosome III, consisting of 50-mer oligonucleotide probes that overlap every 20 bp. We first fitted a two-state Gaussian hidden Markov model, or HMM (Juang and Rabiner (1991)) to determine the nucleosomal state for each probe. We then refrained from any further use of the probe-level microarray data, as it was our primary interest to determine whether certain sequence features are predictive of nucleosome and nucleosome-free positions in the genome, which would enable us to make predictions for genomic regions for which experimental data is currently unavailable or difficult to obtain.

In order to determine how sequence features might influence the positions of nucleosome-free regions, we concentrated on a region of about 1400 adjacent probes on yeast chromosome III. For each probe, the covariate vector was the set of observed frequencies of nucleotide k-tuples, with *k* = 1, 2, 3, 4. This led to a total of *p* = 340 covariates (without including an intercept in the model). The HMM-based classification gave the observed “state” of each probe, whether corresponding to a nucleosome-free region (NFR) or a nucleosome (N). The results reported here are with *c*_{0} = 1, results with *c*_{0} = 10 were essentially similar.

We carried out ten-fold cross validation to test the predictive power of the model. The set of probes, with the associated covariates, were divided into ten non-overlapping pairs of training sets (90% of probes) and test sets (10% of probes). Each training set thus had a sample size of *n* = 1260, which is greater than the number of covariates (340). For each training-test set pair, the logistic regression model was first fitted to the training data set (with the three priors: IMR, *N* (0, *I*) and *N* (0, 10^{6}*I*)), and the fitted values of **β** used to compute the posterior probabilities of classification into the NFR state, for the corresponding test set. The sensitivity and specificity of cross validation using the three different priors was compared, where any region having an estimated posterior probability of 50% or more with the logistic model was classified as an NFR. The IMR prior showed uniformly higher sensitivity while its specificity was comparable to the other two priors (Table 1). The IMR predicted a slightly higher percentage of NFRs than the true percentage, while the other two methods consistently underestimated the number of NFRs. Overall, using the IMR was about 16% more accurate in predicting NFRs compared to the next best method. We also compared the results with Bayesian model averaging, using a g-prior and using the set of 340 covariates- in which case it performed equally well as the full model with an IMR prior. The computational cost of using BMA was much higher than IMR- 5,000 iterations under this setting took about 104 hours on a 1.261 GHz Intel Pentium III compute node running Red Hat Enterprise Linux 4. In comparison, generating 5000 independent samples from the posterior distribution of **β** using the IMR prior required less than an hour.

Overall sensitivity and specificity of methods using three types of priors,under the full model, and BMA using the g-prior (gBMA), by ten-fold cross validation, on set (a): p = 340, n = 1260 and set (b): p = 1364, n = 1260. %pN F R: percentage of predicted **...**

Out of 340 covariates using the IMR prior, 93 and 91 coefficients had approximate 95% HPD intervals above and below zero. Among the significant dinucleotides, “aa”, “at”, “tg”, and “tt” had a positive effect on the possibility of being an NFR, while “ac”, “ag”, “cc”, “ct”, “gc”, and “gg” had the opposite effect. It was previously found that “aa” or “tt” repeats have an effect of making DNA rigid, and thus difficult to form nucleosomes, while “gg” and “cc” lead to less rigid DNA for which it is easier to form nucleosomes (Thastrom *et al.* (2004)). Thus these results seem reasonable compared to biological knowledge. On the other hand, we see that dinucleotides alone do not seem to have the strongest power to distinguish NFRs from nucleosomal regions, suggesting that the relationship between sequence factors and nucleosome positioning could be more complex than linear.

Next, we increased the number of predictors to test how far the improved model fit, by including the 5-tuple counts, would be offset by the increased covariate dimensionality. We repeated the same process for creating the 10 training-test data set pairs as in the earlier case, except that we included *k*-tuples for *k* = 1, …, 5, leading to *p* = 1364, with the training set size as 1260. We next carried out the ten-fold cross-validation procedure over each training-test set pair. As seen in Table 1, the IMR prior in this case is the only method that could predict even a proportion of the NFRs correctly. However, the overall predictive power using 5-mers decreased, due to a combination of overfitting with sparse data as well as induced bias due to the massive increase in dimensionality.

The above empirical studies are mainly to illustrate that the use of the IMR prior is beneficial in situations where the number of observations does not significantly exceed, or is even less than the number of observed covariates, and the lack of prior knowledge of the situation prevents selection of a fewer number of covariates in advance of the analysis. We observed some dissimilarities between the sets of covariates found significant between the two situations; however, we suspect the main reason for this is that the covariates are highly collinear (the average correlation between them ranges from −0.7 to 0.9) and the values are highly sparse, especially the counts for k-mers with a large *k*. The results, however, indicate that the positioning of nucleosomes may indeed depend on a number of underlying sequence factors, and not just a few dinucleotides, as was previously thought. The logistic regression model may be a simplification of the actual relationship between the covariates and response, but is a first step towards modeling a more complex, possibly non-linear relationship with the covariates. Using the logistic model to connect sequence features with nucleosomal state, rather than modeling the probe-level data as an intermediate step, is a direct attempt to determine how sequence features influence positioning. For instance, a model with high predictive power of correct nucleosomal state can provide useful surrogate information for other applications when experimental data, which are expensive to generate, are not available.

The proposed IM and IMR priors can be viewed as a broad generalization of the “g-prior” (Zellner (1986)) for Gaussian linear models, reducing to Jeffreys’ prior as a limiting case. Although the g-prior was originally conceived (and is still most frequently used) in the context of model selection, the proposed priors provide a desirable alternative to Gaussian or improper priors with high-dimensional data in generalized linear models. The IM and IMR priors appear to produce results similar to a diffuse Gaussian prior, but are computationally more stable with collinear variables. They provide a desirable alternative to Jeffreys’ prior, being proper for most GLMs, but giving more flexibility than Jeffreys’ prior in the choice of tuning parameters, and being less subjective than the choice of an arbitrary Gaussian prior. Theoretical and computational properties of the IM and IMR priors were investigated, demonstrating their effectiveness in a variety of situations. The IM and IMR priors for many GLMs are proper and their moment generating functions (MGFs) exist.

Numerical studies demonstrated that the IMR prior, even with the full model, compared favorably with a more complex Bayesian model averaging procedure with a *g*-prior that involves dimension reduction. With extremely high dimensional data, the BMA procedure becomes computationally infeasible in our examples. The BMA procedure could also be used with an IMR prior– it would be interesting to explore the possibility of improving variable selection methods in GLMs, as in Hans, Dobra, and West (2007) and Liang *et al.* (2008), through the use of an IMR prior instead of conventional priors. This would require the ability to compute accurate approximations for the marginal likelihoods, which is a complex problem outside the Gaussian family of priors. Future work is also needed in developing alternative methods for eliciting λ, such as choosing the λ that maximizes the marginal likelihood. Although our current focus is on GLMs, the IMR prior framework can be easily extended to a variety of other models, for instance, to survival and longitudinal models, and to others used in a multitude of scientific applications.

We would like to thank two anonymous referees whose valuable insights and suggestions led to major improvements in the overall clarity and presentation of this paper, and to thank Jason Lieb and Greg Hogan for making available the yeast nucleosomal array data. This research was supported in part by the NIH grants GM070335 and HG004946.

The appendices are available in the online version of the article at http://www.stat.sinica.tw/statistica.

- Bedrick EJ, Christensen R, Johnson W. A new perspective on priors for generalized linear models. J. Am. Stat. Assoc. 1996;91:1450–1460.
- Berger J, Pericchi L. The intrinsic Bayes factor for model selection and prediction. J. Amer. Stat. Assoc. 1996;91:109–122.
- Bollen K, Ray S, Zavisca J. A scaled unit information prior approximation to the Bayes factor; Technical report, SAMSI LVSSS Transition Workshop: Latent Variable Models in the Social Sciences; 2005.
- Chen MH, Ibrahim JG, Yiannoutsos C. Prior elicitation, variable selection and Bayesian computation for logistic regression models. J. Roy. Stat. Soc. B. 1999;61:223–242.
- Gilks WR, Thomas A, Spiegelhalter DJ. A language and program for complex Bayesian modelling. The Statistician. 1994;43:169–178.
- Gilks WR, Best NG, Tan KKC. Adaptive rejection Metropolis sampling. Applied Statistics. 1995;44:455–472.
- Good IJ. Rational decisions. J. Roy. Statist. Soc. B. 1952;14:107–114.
- Hans C, Dobra A, West M. Shotgun stochastic search for “large p” regression. J. Amer. Statist. Assoc. 2007;102:507–516.
- Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
- Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Statist. Sci. 1999;14:382–417.
- Hogan GJ, Lee C-K, Lieb JD. Cell cycle-specified fluctuation of nucleosome occupancy at gene promoters. PLoS Genet. 2006;2:e158. [PubMed]
- Ibrahim J, Chen M-H. Conjugate priors for generalized linear models. Statistica Sinica. 2003;13:461–476.
- Ibrahim JG, Laud PW. On Bayesian analysis of generalized linear models using Jeffreys’ prior. J. Amer. Statist. Assoc. 1991;86:981–986.
- Johnson CA. Chromatin modification and disease. J Med Genet. 2000;37:905–915. [PMC free article] [PubMed]
- Juang B-H, Rabiner LR. Hidden Markov models for speech recognition. Technometrics. 1991;33:251–272.
- Kass RE. The geometry of asymptotic inference. Statistical Science. 1989;4:188–234.
- Kass RE. Data-translated likelihood and Jeffreys’ rules. Biometrika. 1990;77:107–114.
- Kornberg RD, Lorch Y. Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell. 1999;98:285–294. [PubMed]
- Liang F, Wong WH. Evolutionary Monte Carlo: applications to
*c*model sampling and change point problem. Statistica Sinica. 2000;10:317–342._{p} - Liang F, Paulo R, Molina G, Clyde MA, Berger JO. Mixtures of g-priors for Bayesian variable selection. J. Am. Stat. Assoc. 2008;103:410–423.
- Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J. Am. Stat. Assoc. 1988;83:1023–1032.
- R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2004. ISBN 3-900051-07-0.
- Schafer JL. Analysis of Incomplete Multivariate Data. London: Chapman and Hall; 1997.
- Spiegelhalter DJ, Smith AFM. Bayes factors for linear and log-linear models with vague prior information. J. Roy. Stat. Soc. B. 1982;44:377–387.
- Thastrom A, Bingham LM, Widom J. Nucleosomal locations of dominant DNA sequence motifs for histone-DNA interactions and nucleosome positioning. J Mol Biol. 2004;338:695–709. [PubMed]
- Wang X. Ph.D. thesis. Austin: The University of Texas; 2002. Bayesian Variable Selection for GLM.
- Wang X, George EI. Adaptive Bayesian criteria in variable selection for generalized linear models. Statistica Sinica. 2007;17:667–690.
- West M. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics. 2003;7:723–732.
- Yeung K, Bumgarner R, Raftery A. Bayesian model averaging: Development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics. 2005;21:2394–2402. [PubMed]
- Zellner A. In: On assessing prior distributions and Bayesian regression analysis with g-prior distributions, volume Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Goel PK, Zellner A, editors. North-Holland, Amsterdam: 1986. p. 233.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |