
Inst Math Stat Collect. Author manuscript; available in PMC 2011 January 1.

Published in final edited form as: Inst Math Stat Collect. 2010; 6: 87–98. doi: 10.1214/10-IMSCOLL607

PMCID: PMC2990974. NIHMSID: NIHMS199314

**Abstract**

In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension *p* increases with sample size *n*: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.

The Bernstein-von Mises theorem is a formalization of conditions under which Bayesian posterior credible intervals agree approximately with frequentist confidence intervals constructed from likelihood theory. It is traditionally formulated in situations in which the number of parameters *p* is fixed and the sample size *n* → ∞. The situation is very different in high dimensional settings in which *p* is allowed to grow with *n*. In this primarily expository paper, we use simple Gaussian sequence models to draw some conclusions about when a version of Bernstein-von Mises can hold.

We begin with a somewhat informal statement of the classical theorem. Suppose that *Y*_{1}, …, *Y*_{n} are i.i.d. observations from a distribution *P*_{θ}, θ ∈ ℝ^{p}, with density *p*_{θ}(*y*) and log-density

$${\ell}_{\theta}=\mathrm{log}{p}_{\theta}\left(y\right),$$

and, as usual, the score function vector and Fisher information matrix are given by

$${\stackrel{.}{\ell}}_{\theta}=(\partial \u2215\partial \theta )\mathrm{log}{p}_{\theta}\left(y\right)\phantom{\rule{1em}{0ex}}{I}_{\theta}={E}_{\theta}{\stackrel{.}{\ell}}_{\theta}{\stackrel{.}{\ell}}_{\theta}^{T}.$$

Writing *Y*^{n} = (*Y*_{1}, …, *Y*_{n}) for the sample, the log-likelihood is

$${L}_{n}\left(\theta \right)=\sum _{k=1}^{n}{\ell}_{\theta}\left({Y}_{k}\right),$$

and we write ${\widehat{\theta}}_{\mathit{MLE}}$ for a maximizer of *L _{n}*(θ). Classical likelihood theory says that any (nice) estimator satisfies the information bound

$$Va{r}_{\theta}\widehat{\theta}\ge {n}^{-1}{I}_{\theta}^{-1},$$

in the usual ordering of nonnegative definite matrices, and that the bound is asymptotically attained by the MLE, which is also asymptotically Gaussian:

$${\widehat{\theta}}_{MLE}\mid \theta ~{N}_{p}(\theta ,{n}^{-1}{I}_{\theta}^{-1}).$$

Now suppose that π(θ) is the density of a prior distribution with respect to Lebesgue measure. Then the posterior distribution of θ given *Y*^{n} is given by Bayes' rule; we denote it simply by ${P}_{\theta \mid {Y}^{n}}$.

The Bernstein-von Mises theorem says, informally, that this posterior distribution is, in large samples, approximately normal with mean approximately the MLE ${\widehat{\theta}}_{\mathit{MLE}}$ and variance matrix approximately ${n}^{-1}{I}_{{\theta}_{0}}^{-1}$ (here θ_{0} is the `true' value of θ generating the observations *Y*_{1}, …, *Y*_{n}). Using the scalar case for simplicity, and writing ${\sigma}_{n}^{2}={n}^{-1}{I}_{{\theta}_{0}}^{-1}$ and ${z}_{\alpha}={\bar{\Phi}}^{-1}(\alpha )$ for the upper Gaussian quantile, an approximate 100(1 − α)% credible interval for θ would be given by ${\widehat{\theta}}_{\mathit{MLE}}\pm {z}_{\alpha /2}{\widehat{\sigma}}_{n}$. This is exactly the frequentist confidence interval based on asymptotic normality of the MLE. Thus in large samples the effect of the prior density π disappears: “the data overwhelms the prior”.

A somewhat more formal statement uses the notion of variation distance between probability measures *P* and *Q*, and an equivalent expression in terms of the densities *p* = *dP/d*μ and *q* = *dQ/d*μ relative to a dominating measure μ:

$$\parallel P-Q\parallel =\underset{A}{\mathrm{max}}\mid P\left(A\right)-Q\left(A\right)\mid =\frac{1}{2}\int \mid p-q\mid d\mu .$$
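As a small numerical illustration of this identity (the pair N(0,1), N(1,1) is an arbitrary choice, not from the text): the supremum over sets *A* is attained at *A* = {*p* > *q*}, and the two expressions agree.

```python
import math
import numpy as np

# Variation distance between P = N(0,1) and Q = N(1,1), two ways.
t = np.linspace(-12.0, 13.0, 250001)
dt = t[1] - t[0]
p = np.exp(-t**2 / 2) / math.sqrt(2 * math.pi)
q = np.exp(-(t - 1)**2 / 2) / math.sqrt(2 * math.pi)
half_l1 = 0.5 * np.abs(p - q).sum() * dt          # (1/2) integral |p - q|

# max_A |P(A) - Q(A)| is attained at A = {p > q} = {t < 1/2}
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
max_set = Phi(0.5) - (1 - Phi(0.5))               # P(A) - Q(A) = 2*Phi(1/2) - 1
```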

Suppose that π(θ) is continuous and positive at the `true' value θ_{0}, and that θ → *P*_{θ} is differentiable in quadratic mean and satisfies a further mild separation condition; then

$$\parallel {P}_{\theta \mid {Y}^{n}}-N({\widehat{\theta}}_{MLE},{n}^{-1}{I}_{{\theta}_{0}}^{-1})\parallel \to 0$$

(1)

in probability under ${P}_{{\theta}_{0}}^{n}$.

In other words, the variation distance between the posterior and the approximating Gaussian distribution is a random variable depending on *Y*^{n}, which converges to zero in probability under repeated draws from ${P}_{{\theta}_{0}}^{n}$.

A development of the Bernstein-von Mises theorem as formulated above may be found in van der Vaart (1998, §10.2). A proof due to Bickel is given in Lehmann and Casella (1998, §6.8). Extensions from independent to dependent sampling settings are possible; see e.g. Borwanker et al. (1971); Heyde and Johnstone (1979). For further references and methods of proof of the classical results, see Ghosh and Ramamoorthi (2003, §1.4 and §1.5).

In nonparametric and semiparametric settings the situation is very different. Even frequentist consistency of nonparametric Bayesian methods is a difficult issue with a large literature of both positive and negative results (e.g. Ghosh and Ramamoorthi (2003); Ghosal and van der Vaart, (2010)). One cannot therefore expect Bernstein-von Mises phenomena in any great generality for the full posterior.

**1. Growing Gaussian location model**

In this largely expository paper, we do some simple calculations in symmetric Gaussian sequence models. The Gaussian sequence structure makes possible an elementary set of examples that avoid the technical challenges posed by, and sophistication needed for, posterior Gaussian approximation in high dimensional settings (see references in Section 5). Nevertheless, the Gaussian examples can conveniently illustrate some of the issues related to validity of the Bernstein-von Mises theorem in high dimensional models. Depending on the frequentist or Bayesian perspective, we assume that *p* = *p*(*n*) grows with *n*, and adopt one, or both, of

$$\begin{array}{l}(\mathbf{D})\ \mathrm{Data:}\quad \bar{Y}\mid \theta \sim {N}_{p}(\theta ,{\sigma}_{n}^{2}I),\quad \mathrm{and}\\ (\mathbf{P})\ \mathrm{Prior:}\quad \theta \sim {N}_{p}(0,{\tau}_{n}^{2}I).\end{array}$$

The notation $\bar{Y}$ suggests an average $({Y}_{1}+\cdots +{Y}_{n})/n$ of observations individually of variance ${\sigma}_{0}^{2}$, so that in this case ${\sigma}_{n}^{2}={\sigma}_{0}^{2}/n$. [If *p* were held fixed, not depending on *n*, then ${\sigma}_{n}^{2}$ would match with the definition given in the introductory section.] We also allow the prior variance ${\tau}_{n}^{2}$ to depend on the sample size *n*.

Our goal is to compare the Bayesian posterior distribution $\mathcal{L}(\theta \mid Y)$ with frequentist distributions, in particular those of the MLE $\mathcal{L}({\widehat{\theta}}_{\mathit{MLE}}\mid \theta )$ and of the posterior mean Bayes estimator $\mathcal{L}({\widehat{\theta}}_{B}\mid \theta )$. A key simplification is that since both prior and likelihood are Gaussian, so also is the posterior distribution, and hence all the behavior will be determined by centering and scaling. Thus from standard results, the posterior is given by

$$\theta \mid \bar{Y}=\bar{y}\sim {N}_{p}({w}_{n}\bar{y},\,{w}_{n}{\sigma}_{n}^{2}I),\qquad {w}_{n}={\tau}_{n}^{2}/({\sigma}_{n}^{2}+{\tau}_{n}^{2}).$$

(2)
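The conjugate-Gaussian updating behind (2) can be checked numerically on a single coordinate; the values of σ_n, τ_n and the observation below are arbitrary illustrations, not from the paper.

```python
import numpy as np

# Illustrative values (not from the paper)
sigma_n, tau_n = 0.5, 2.0
w_n = tau_n**2 / (sigma_n**2 + tau_n**2)        # shrinkage factor in (2)

ybar = 1.3                                       # one observed coordinate
post_mean, post_var = w_n * ybar, w_n * sigma_n**2   # posterior from (2)

# Cross-check by quadrature: posterior density is proportional to
# likelihood times prior, N(ybar; theta, sigma_n^2) * N(theta; 0, tau_n^2).
t = np.linspace(-12.0, 12.0, 200001)
dt = t[1] - t[0]
dens = np.exp(-(ybar - t)**2 / (2 * sigma_n**2)) * np.exp(-t**2 / (2 * tau_n**2))
dens /= dens.sum() * dt                          # normalize
m = (t * dens).sum() * dt                        # quadrature posterior mean
v = ((t - m)**2 * dens).sum() * dt               # quadrature posterior variance
```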

*Remarks:* 1. The reference to Gaussian sequence models becomes clearer if, as will be helpful later, we write out assumptions (D) and (P) in co-ordinates:

$$\begin{array}{l}({\mathbf{D}}_{\mathrm{seq}})\ \mathrm{Data:}\quad {\bar{y}}_{k}={\theta}_{k}+{\sigma}_{n}{\epsilon}_{k},\quad \mathrm{and}\\ ({\mathbf{P}}_{\mathrm{seq}})\ \mathrm{Prior:}\quad {\theta}_{k}={\tau}_{n}{\zeta}_{k},\end{array}$$

with ε_{k} and ζ_{k} all i.i.d standard Gaussian, for *k* = 1, …, *p*(*n*).

Strictly speaking, the indexing by *n* of parameters σ_{n}, τ_{n} and *p*(*n*) creates a sequence of sequence models. However, one can, as needed for almost sure results, think of the infinite sequences {$({\u220a}_{k},{\zeta}_{k}),k\in \mathbb{N}$} as being drawn from a single common probability space.

2. We also consider the infinite sequence Gaussian white noise model

$${Y}_{t}={\int}_{0}^{t}f(s)\,ds+{\sigma}_{n}{W}_{t},\qquad 0\le t\le 1,$$

(3)

or equivalently, when expressed in any orthonormal basis {φ_{k}(*t*)} for *L*_{2}[0, 1],

$${y}_{k}={\theta}_{k}+{\sigma}_{n}{\epsilon}_{k},\qquad {\epsilon}_{k}\stackrel{\mathrm{iid}}{\sim}N(0,1),$$

(4)

where it is assumed that ${\sigma}_{n}={\sigma}_{0}\u2215\sqrt{n}$. For some examples, it is helpful to use doubly indexed orthonormal bases {${\phi}_{jk}\left(t\right),k=1,\dots ,{2}^{j},j\in \mathbb{N}$} such as arise with systems of orthonormal wavelets.

The forthcoming book Johnstone (2010) will have more on estimation in such Gaussian sequence models.

We develop three perspectives on the Bernstein-von Mises phenomenon:

- global convergence of the posterior,
- behavior of a non-linear functional, the squared error ${\Vert \theta -\widehat{\theta}\Vert}^{2}$, and
- estimation of linear functionals.

We shall see that these situations are progressively “less demanding” in terms of validity of the Bernstein-von Mises phenomenon. Indeed, case (1) requires that *w*_{n} → 1 at a sufficiently fast rate, while setting (2) needs only *w*_{n} → 1. In case (3), the formulation itself delivers *w*_{n} → 1, and covers at least all bounded linear functionals.

**2. Global convergence of posterior**

The first calculation considers the *p*-dimensional posterior distribution (2) and shows that the convergence in (1) occurs, even in the best possible case that θ_{0} = 0, only if the shrinkage factor *w*_{n} approaches 1 at a sufficiently fast *rate*.

**Proposition 1** *Let θ_{0} = 0. The variation distance between the posterior distribution ${P}_{\theta \mid {Y}^{n}}$ and $N({\widehat{\theta}}_{\mathit{MLE}},{n}^{-1}{I}_{{\theta}_{0}}^{-1})$ converges to zero in ${P}_{{\theta}_{0}}$-probability if and only if $\sqrt{p}{\sigma}_{n}^{2}/{\tau}_{n}^{2}\to 0$, or equivalently, if*

$${w}_{n}=1-o(1/\sqrt{{p}_{n}}).$$

(5)

PROOF. We introduce notation *P*_{y,n}(*d*θ) for the posterior distribution of θ|*Y*^{n} = *y* and *Q*_{y,n}(*d*θ) for the distribution centered at ${\widehat{\theta}}_{\mathit{MLE}}=\stackrel{\u2012}{y}$. Thus

$$\begin{array}{l}{P}_{y,n}(d\theta )\leftrightarrow {N}_{p}({w}_{n}\bar{y},\,{\sigma}_{n}^{2}{w}_{n}I)\\ {Q}_{y,n}(d\theta )\leftrightarrow {N}_{p}(\bar{y},\,{\sigma}_{n}^{2}I).\end{array}$$

(6)

Let $\rho (P,Q)=\int \sqrt{p}\sqrt{q}d\mu $ denote the Hellinger affinity between two probability measures *P,Q* having densities *p, q* with respect to a common dominating measure μ. We recall an elementary bound (van der Vaart, 1998, p. 212) for variation distance in terms of Hellinger distance and hence Hellinger affinity:

$$2[1-\rho (P,Q)]\le \parallel P-Q\parallel \le \sqrt{8}{[1-\rho (P,Q)]}^{1\u22152}$$

(7)

Thus $\Vert {P}_{y,n}-{Q}_{y,n}\Vert \to 0$ if and only if ρ(*P*_{y,n}, *Q*_{y,n}) → 1. We recall also that affinity commutes with products:

$$\rho (\Pi {P}_{i},\Pi {Q}_{i})=\Pi \rho ({P}_{i},{Q}_{i}).$$

An elementary calculation shows that

$${\rho}^{2}(N({\theta}_{1},{\sigma}_{1}^{2}),N({\theta}_{2},{\sigma}_{2}^{2}))=\left(\frac{2{\sigma}_{1}{\sigma}_{2}}{{\sigma}_{1}^{2}+{\sigma}_{2}^{2}}\right)\mathrm{exp}\left\{-\frac{{({\theta}_{1}-{\theta}_{2})}^{2}}{2({\sigma}_{1}^{2}+{\sigma}_{2}^{2})}\right\}.$$

(8)
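Formula (8) is easy to verify numerically; the sketch below compares the closed form against direct integration of $\sqrt{pq}$, for an arbitrary parameter pair of the kind arising in the proof (the specific numbers are illustrative).

```python
import numpy as np

def rho_closed_form(m1, s1, m2, s2):
    # Hellinger affinity of N(m1, s1^2) and N(m2, s2^2): square root of (8)
    rho2 = (2*s1*s2 / (s1**2 + s2**2)) * np.exp(-(m1 - m2)**2 / (2*(s1**2 + s2**2)))
    return np.sqrt(rho2)

def rho_numeric(m1, s1, m2, s2):
    # rho = integral of sqrt(p*q), by Riemann sum on a wide grid
    t = np.linspace(-40.0, 40.0, 400001)
    dt = t[1] - t[0]
    p = np.exp(-(t - m1)**2 / (2*s1**2)) / (s1*np.sqrt(2*np.pi))
    q = np.exp(-(t - m2)**2 / (2*s2**2)) / (s2*np.sqrt(2*np.pi))
    return np.sqrt(p*q).sum() * dt

# e.g. the pair arising in the proof with w_n = 0.8, sigma_n = 1, ybar_i = 1.5
a = rho_closed_form(0.8*1.5, np.sqrt(0.8), 1.5, 1.0)
b = rho_numeric(0.8*1.5, np.sqrt(0.8), 1.5, 1.0)
```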

When applied to *P*_{y,n} and *Q*_{y,n}, we set ${\theta}_{1i}={w}_{n}{\bar{y}}_{i}$, ${\theta}_{2i}={\bar{y}}_{i}$ and ${\sigma}_{1}^{2}={\sigma}_{n}^{2}{w}_{n}$, ${\sigma}_{2}^{2}={\sigma}_{n}^{2}$ to obtain

$$\rho ({P}_{y,n},{Q}_{y,n})=\mathrm{exp}\left\{-\frac{p}{2}\mathrm{log}\,\frac{1}{2}({w}_{n}^{1/2}+{w}_{n}^{-1/2})-\frac{{(1-{w}_{n})}^{2}{\Vert \bar{y}\Vert}^{2}}{4(1+{w}_{n}){\sigma}_{n}^{2}}\right\}.$$

(9)

Introduce ${r}_{n}={\sigma}_{n}^{2}/{\tau}_{n}^{2}={w}_{n}^{-1}-1$. Suppose first that $p{r}_{n}^{2}\to 0$. Since ${w}_{n}={(1+{r}_{n})}^{-1}$, we have

$$\mathrm{log}\,\frac{1}{2}({w}_{n}^{1/2}+{w}_{n}^{-1/2})=-\frac{1}{2}\mathrm{log}(1+{r}_{n})+\mathrm{log}(1+\frac{1}{2}{r}_{n})\le {c}_{1}{r}_{n}^{2}$$

for ${r}_{n}\le \frac{1}{2}$, say, since $\mathrm{log}(1+{r}_{n}/2)\le {r}_{n}/2$ while $\frac{1}{2}\mathrm{log}(1+{r}_{n})\ge \frac{1}{2}({r}_{n}-{r}_{n}^{2}/2)$. When *p* → ∞, we have with probability tending to one that ${\Vert \bar{y}\Vert}^{2}/{\sigma}_{n}^{2}<2p$, and so for ${r}_{n}\le \frac{1}{2}$,

$$\frac{{(1-{w}_{n})}^{2}}{4(1+{w}_{n})}\cdot \frac{{\Vert \bar{y}\Vert}^{2}}{{\sigma}_{n}^{2}}\le \frac{2p{r}_{n}^{2}}{(1+{r}_{n})(1+{r}_{n}/2)}\le {c}_{2}p{r}_{n}^{2}.$$

Consequently, when $p{r}_{n}^{2}\to 0$,

$$\rho ({P}_{y,n},{Q}_{y,n})\ge \mathrm{exp}\{-{c}_{1}p{r}_{n}^{2}-{c}_{2}p{r}_{n}^{2}\}\to 1.$$

Suppose now that $p{r}_{n}^{2}$ does not approach 0. Again with probability tending to one, ${\Vert \bar{y}\Vert}^{2}/{\sigma}_{n}^{2}>p/2$, and since $\frac{1}{2}({w}_{n}^{1/2}+{w}_{n}^{-1/2})>1$, we have from (9) that

$$-\mathrm{log}\,\rho ({P}_{y,n},{Q}_{y,n})>\frac{p{(1-{w}_{n})}^{2}}{8(1+{w}_{n})}>{c}_{3}\,\mathrm{min}\{p{r}_{n}^{2},p\}$$

which cannot converge to zero if $p{r}_{n}^{2}$ does not.
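The dichotomy in the proof can be seen numerically from (9) alone. The sketch below evaluates −log ρ with ‖ȳ‖²/σ_n² set to its mean *p* (recall θ_0 = 0), once for a rate with *p* r_n² → 0 and once for the boundary rate *p* r_n² ≡ 1; the specific rates are illustrative.

```python
import numpy as np

def neg_log_rho(p, r, ybar_sq_over_sig2):
    # -log of the affinity (9), with w_n = 1/(1 + r_n)
    w = 1.0 / (1.0 + r)
    return (0.5 * p * np.log(0.5 * (np.sqrt(w) + 1/np.sqrt(w)))
            + (1 - w)**2 * ybar_sq_over_sig2 / (4 * (1 + w)))

# Compare r_n = p^{-3/4} (so p r_n^2 -> 0) with r_n = p^{-1/2} (p r_n^2 = 1),
# taking ||ybar||^2 / sigma_n^2 = p, its expected value under theta_0 = 0.
vanishing, stuck = [], []
for p in [10**2, 10**4, 10**6]:
    vanishing.append(neg_log_rho(p, p**-0.75, p))
    stuck.append(neg_log_rho(p, p**-0.5, p))
```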

*Remark.* If θ_{0} = θ_{0n} ≠ 0, so that the data mean differs from the prior mean, then the rate condition is replaced by

$${w}_{n}-1=o(1/{q}_{n}({\theta}_{0n})),\qquad {q}_{n}({\theta}_{0n})=\sqrt{{p}_{n}}+\Vert {\theta}_{0n}\Vert /{\sigma}_{n}.$$

*Example.* We illustrate the result by considering estimation in the Gaussian white noise model (3). When expressed in a suitable orthonormal basis of wavelets, we obtain ${y}_{jk}\stackrel{\mathrm{ind}}{\sim}N({\theta}_{j,k},{\sigma}_{n}^{2})$, for *k* = 1, …, 2^{j} and $j\in \mathbb{N}$. Pinsker's theorem (Pinsker, 1980) describes the minimax linear estimator of *f*, or equivalently of (θ_{jk}), under squared error loss when it is assumed that *f* has α mean-square derivatives, and shows that such minimax linear estimators are asymptotically minimax among *all* estimators as σ_{n} → 0.

Pinsker's estimator is necessarily the posterior-mean Bayes estimator for a corresponding Gaussian prior. The mean square differentiability condition can be equivalently expressed in terms of the coefficients as

$$\sum _{j,k}{2}^{2j\alpha}{\theta}_{j,k}^{2}\le {C}^{2},$$

and the corresponding least favorable Gaussian prior puts

$${\theta}_{j,k}\stackrel{\mathrm{ind}}{\sim}N(0,{\tau}_{j}^{2}),\qquad {\tau}_{j}^{2}={\sigma}_{n}^{2}{({\mu}_{n}{2}^{-j\alpha}-1)}_{+},$$

(10)

where μ_{n} = *c*_{αn}(*C*/σ_{n})^{2α/(2α+1)}. The constant *c*_{αn} satisfies bounds independent of *n*, *c*_{1α} ≤ *c*_{αn} ≤ *c*_{2α}, whose precise values are unimportant here; for further details see Johnstone (2010).

We consider the validity of the Bernstein-von Mises phenomenon for the collection of coefficients {θ_{jk}, *k* = 1, …, 2^{j}} at a given level *j* = *j*(*n*)–possibly fixed, or possibly varying with *n*.

The prior variances ${\tau}_{j}^{2}$ decrease with *j*, and vanish above a “critical level” *j*_{*} = *j*_{*}(α, *C*; *n*). Since *j*_{*} ~ (2/(2α+1)) log_{2}(*C*/σ_{n}) grows with *n*, so does the number of parameters θ_{j*,k} at the critical level. From (10), we conclude that

$${\tau}_{{j}_{\ast}}^{2}/{\sigma}_{n}^{2}\le {2}^{\alpha}-1,$$

and hence that *w*_{n} ≤ 1 − 2^{−α} does not approach 1, so that the condition of Proposition 1 fails.

On the other hand, at a *fixed* level *j*_{0}, we have *p* = 2^{j_0} fixed and ${\tau}_{{j}_{0}}^{2}/{\sigma}_{n}^{2}={\mu}_{n}{2}^{-{j}_{0}\alpha}-1\to \infty$, so that $\sqrt{p}{\sigma}_{n}^{2}/{\tau}_{{j}_{0}}^{2}\to 0$ and Proposition 1 applies. Thus we may say informally that the Bernstein-von Mises phenomenon holds at a fixed level but fails at the critical level.
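A small computation makes the contrast concrete. The smoothness α, radius *C*, and the constant *c*_{αn} (simply set to 1) are illustrative placeholders: the signal-to-noise ratio τ_j²/σ_n² at the critical level stays below 2^α − 1, while at a fixed level it diverges, so the shrinkage factor tends to 1 there.

```python
# Critical-level computation for the Pinsker prior (10); alpha, C, c_alpha illustrative.
alpha, C, c_alpha = 2.0, 1.0, 1.0

for n in [10**4, 10**8, 10**12]:
    sigma_n = n**-0.5
    mu_n = c_alpha * (C / sigma_n)**(2*alpha / (2*alpha + 1))
    # j_star = largest level with tau_j^2 > 0, i.e. mu_n 2^{-j alpha} > 1
    j_star = 0
    while mu_n * 2.0**(-(j_star + 1) * alpha) > 1:
        j_star += 1
    snr_crit = mu_n * 2.0**(-j_star * alpha) - 1   # tau_{j*}^2 / sigma_n^2
    w_crit = snr_crit / (1 + snr_crit)             # stays away from 1
    snr_fixed = mu_n * 2.0**(-alpha) - 1           # fixed level j_0 = 1
    w_fixed = snr_fixed / (1 + snr_fixed)          # tends to 1 as n grows
```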

**3. Behavior of the squared loss**

In this section, we pay homage to a remarkable paper by Freedman (1999), itself stimulated by Cox (1993), which sets out the failure of the Bernstein-von Mises theorem in a simple sequence model of function estimation in Gaussian white noise. To further simplify the calculations, we use the growing Gaussian location model (D), (P), yielding results parallel to, but not identical with, Freedman's. Hence, define

$${T}_{n}(\theta ,Y)={\Vert \theta -{\widehat{\theta}}_{B}\Vert}^{2}=\sum _{k=1}^{p(n)}{({\theta}_{k}-{\widehat{\theta}}_{B,k})}^{2}.$$

The posterior distribution of θ|*Y* is described by (2); in particular the shrinkage factor ${w}_{n}={\tau}_{n}^{2}\u2215({\sigma}_{n}^{2}+{\tau}_{n}^{2})$ again plays a critical role.

**Theorem 2** (Bayesian) The posterior distribution $\mathcal{L}({T}_{n}\mid Y)$ is given by

$${T}_{n}={C}_{n}+\sqrt{{D}_{n}}\,{Z}_{1n},$$

where

$${C}_{n}=p{\sigma}_{n}^{2}{w}_{n}$$

(11)

$$\sqrt{{D}_{n}}=\sqrt{2p}\,{\sigma}_{n}^{2}{w}_{n}$$

(12)

and the random variable Z_{1n} has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞.

*Proof*. From (2), the posterior distribution of *T*_{n} given $\bar{Y}=\bar{y}$ is that of ${\sigma}_{n}^{2}{w}_{n}{\chi}_{(p)}^{2}$; standardizing the chi-squared variable gives

$${T}_{n}=p{\sigma}_{n}^{2}{w}_{n}+\sqrt{2p}\,{\sigma}_{n}^{2}{w}_{n}{Z}_{1n},$$

and the theorem follows because $({\chi}_{p}^{2}-p)\u2215\sqrt{2p}\Rightarrow N(0,1)$ as *p* → ∞.
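Theorem 2 can be checked by simulation: given *Y*, the posterior law of *T_n* is exactly that of σ_n² w_n χ²_p, so its draws should match C_n and D_n. The parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma_n, tau_n = 2000, 0.3, 1.0          # illustrative values
w_n = tau_n**2 / (sigma_n**2 + tau_n**2)
C_n = p * sigma_n**2 * w_n                  # (11)
D_n = 2 * p * sigma_n**4 * w_n**2           # square of (12)

# Given Y, theta - theta_hat_B ~ N(0, w_n sigma_n^2 I), so T_n ~ w_n sigma_n^2 chi^2_p
T = w_n * sigma_n**2 * rng.chisquare(p, size=200_000)
Z1 = (T - C_n) / np.sqrt(D_n)               # should be approximately N(0, 1)
```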

Turn now to the frequentist perspective, in which θ is a fixed and unknown (sequence of) parameters. We will therefore use the decomposition ${\bar{y}}_{k}={\theta}_{k}+{\sigma}_{n}{\epsilon}_{k}$ from (**D**_{seq}), so that

$${\theta}_{k}-{\widehat{\theta}}_{B,k}=(1-{w}_{n}){\theta}_{k}-{w}_{n}{\sigma}_{n}{\epsilon}_{k}.$$

(13)

Some of the conclusions will be valid only for “most” θ: to formulate this it is useful to give θ a distribution. The natural one to use is (**P**), despite the possible confusion arising because, for the frequentist, this is *not* an a priori law!

**Theorem 3** (Frequentist) The conditional distribution $\mathcal{L}({T}_{n}\mid \theta )$ is given by

$${T}_{n}={C}_{n}+\sqrt{{F}_{n}}\,{Z}_{2n}(\theta )+\sqrt{{G}_{n}(\theta )}\,{Z}_{3n}(\theta ,\epsilon ),$$

(14)

where C_{n} is as in Theorem 2, while Z_{3n} (θ, ε) has mean 0 and variance 1.

If θ is distributed according to (**P**), then Z_{2n}(θ) has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞. In addition, if w_{n} → w = 1 − cos ω,

$$\sqrt{{F}_{n}}\sim \sqrt{{D}_{n}}\,\mathrm{cos}\,\omega ,\qquad \sqrt{{G}_{n}(\theta )}\sim \sqrt{{D}_{n}}\,\mathrm{sin}\,\omega ,$$

(15)

and

$${Z}_{3n}(\theta ,\cdot )\Rightarrow N(0,1)$$

(16)

Formulas (15) and (16) hold as n → ∞, for almost all θ's generated from (**P**).

*Proof*. Using (13), and ${(1-{w}_{n})}^{2}{\tau}_{n}^{2}+{w}_{n}^{2}{\sigma}_{n}^{2}={\sigma}_{n}^{2}{w}_{n}$, we may write

$$\begin{array}{ll}{T}_{n}&=\sum _{k}{[(1-{w}_{n}){\theta}_{k}-{w}_{n}{\sigma}_{n}{\epsilon}_{k}]}^{2}\\ &=p{\sigma}_{n}^{2}{w}_{n}+\sqrt{2p}\,{\tau}_{n}^{2}{(1-{w}_{n})}^{2}\cdot \frac{\Sigma {\theta}_{k}^{2}-p{\tau}_{n}^{2}}{\sqrt{2p}\,{\tau}_{n}^{2}}+{R}_{n}(\theta ,\epsilon ),\end{array}$$

(17)

with

$${R}_{n}(\theta ,\epsilon )=-2{w}_{n}(1-{w}_{n}){\sigma}_{n}\Sigma {\theta}_{k}{\epsilon}_{k}+{w}_{n}^{2}{\sigma}_{n}^{2}\Sigma ({\epsilon}_{k}^{2}-1).$$

This leads immediately to the representation (14) after observing that ${\tau}_{n}^{2}(1-{w}_{n})={\sigma}_{n}^{2}{w}_{n}$ and setting

$$\begin{array}{ll}\sqrt{{F}_{n}}&=\sqrt{2p}\,{\sigma}_{n}^{2}{w}_{n}(1-{w}_{n}),\\ {Z}_{2n}(\theta )&=\Sigma ({\theta}_{k}^{2}-{\tau}_{n}^{2})/(\sqrt{2p}\,{\tau}_{n}^{2}),\\ {G}_{n}(\theta )&=\mathrm{Var}\,{R}_{n}(\theta ,\epsilon )=4{w}_{n}^{2}{(1-{w}_{n})}^{2}{\sigma}_{n}^{2}\Sigma {\theta}_{k}^{2}+{w}_{n}^{4}\cdot 2p{\sigma}_{n}^{4}={G}_{1n}(\theta )+{G}_{2n}.\end{array}$$

Turning to the final assertions, we may rewrite

$${R}_{n}(\theta ,\epsilon )=\sqrt{{G}_{1n}(\theta )}\,{Z}_{4n}(\theta )+\sqrt{{G}_{2n}}\,{Z}_{5n},$$

where

$${Z}_{4n}(\theta )=\Sigma {\theta}_{k}{\epsilon}_{k}/{(\Sigma {\theta}_{k}^{2})}^{1/2},\qquad {Z}_{5n}=(\Sigma {\epsilon}_{k}^{2}-p)/\sqrt{2p}.$$

Using again ${\sigma}_{n}^{2}{w}_{n}={\tau}_{n}^{2}(1-{w}_{n})$, we have

$${G}_{1n}\left(\theta \right)=2p{\sigma}_{n}^{4}\cdot {w}_{n}^{2}(2{w}_{n}-2{w}_{n}^{2})\cdot {p}^{-1}\sum _{1}^{p}{({\theta}_{k}\u2215{\tau}_{n})}^{2}.$$

For almost all θ's generated from (**P**), ${p}^{-1}{\Sigma}_{1}^{p}{({\theta}_{k}/{\tau}_{n})}^{2}\to 1$, and since ${G}_{n}(\theta )={G}_{1n}(\theta )+{G}_{2n}$ with ${G}_{2n}=2p{\sigma}_{n}^{4}{w}_{n}^{4}$, we obtain ${G}_{n}(\theta )\sim 2p{\sigma}_{n}^{4}{w}_{n}^{2}(2{w}_{n}-{w}_{n}^{2})={D}_{n}(1-{(1-{w}_{n})}^{2})\sim {D}_{n}{\mathrm{sin}}^{2}\omega$ when ${w}_{n}\to w=1-\mathrm{cos}\,\omega$; since also $\sqrt{{F}_{n}}/\sqrt{{D}_{n}}=1-{w}_{n}\to \mathrm{cos}\,\omega$, this establishes (15).

Clearly *Z*_{4n}(θ) ~ *N*(0, 1), free of θ, while *Z*_{5n} ⇒ *N*(0, 1), and so (16) follows.

*Remark*. The doctrinaire frequentist would not contemplate the joint distribution of (θ, *Y* ) in (**D**, **P**); but anyone else would observe that in that joint distribution, ${T}_{n}~{\sigma}_{n}^{2}{w}_{n}{\chi}_{\left(p\right)}^{2}$, as follows easily in two ways, either from the proof of Theorem 2, or from (17).

The Bernstein-von Mises theorem fails if lim *w*_{n} = *w* < 1: comparing Theorems 2 and 3, the frequentist distribution of *T*_{n} then carries both a random centering term $\sqrt{{F}_{n}}{Z}_{2n}(\theta )$ of the same order as the posterior spread and a stochastic term of reduced variance ${G}_{n}(\theta )\sim {D}_{n}{\mathrm{sin}}^{2}\omega$.

[Figure: The top panel, for the Bayesian, has $\theta -{\widehat{\theta}}_{B}$ as a noise vector, and the posterior distribution $\mathcal{L}(T\mid Y)$ approximately N(C_{n}, D_{n}). The bottom panel, for the frequentist, shows the effect of the bias of $E[{\widehat{\theta}}_{B}\mid $ …]

Under the assumption (**P**), ${\theta}_{i}\stackrel{\mathrm{iid}}{\sim}N(0,{\tau}_{n}^{2})$, the `wobble' in the frequentist mean can be arbitrarily large relative to $\sqrt{{D}_{n}}$: from the law of the iterated logarithm, with probability one

$$\mathrm{lim}\,\mathrm{sup}\,{Z}_{2n}(\theta )/\sqrt{2\,\mathrm{log}\,\mathrm{log}\,p}=1.$$

By contrast, if lim *w _{n}* = 1, then the wobble disappears: $\sqrt{{F}_{n}}=o\left(\sqrt{{D}_{n}}\right)$ and the Bayesian SD equals the frequentist SD asymptotically: $\sqrt{{G}_{n}\left(\theta \right)}~\sqrt{{D}_{n}}$.
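The frequentist decomposition (13)–(17) can likewise be checked by simulation: fix one θ drawn from (**P**), simulate over the noise ε only, and compare the conditional mean and variance of *T_n* with the formulas of Theorem 3. With w_n = 1/2 bounded away from 1, the conditional variance G_n(θ) ≈ D_n(2w − w²) is visibly smaller than D_n. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma_n, tau_n = 1000, 1.0, 1.0       # so w_n = 1/2, bounded away from 1
w = tau_n**2 / (sigma_n**2 + tau_n**2)
theta = tau_n * rng.standard_normal(p)   # one "typical" theta from (P), then held fixed

# Monte Carlo over the noise only: the frequentist distribution L(T_n | theta)
reps, chunk = 8000, 1000
T = np.empty(reps)
for i in range(0, reps, chunk):
    eps = rng.standard_normal((chunk, p))
    resid = (1 - w) * theta - w * sigma_n * eps      # decomposition (13)
    T[i:i+chunk] = (resid**2).sum(axis=1)

C_n = p * sigma_n**2 * w
D_n = 2 * p * sigma_n**4 * w**2
mean_exact = (1 - w)**2 * (theta**2).sum() + w**2 * sigma_n**2 * p
G_theta = 4 * w**2 * (1 - w)**2 * sigma_n**2 * (theta**2).sum() + 2 * p * sigma_n**4 * w**4
```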

**4. Linear functionals**

We turn now to the least demanding of our three scenarios for the Bernstein-von Mises theorem: the behavior of linear functionals. We change the setting slightly to the infinite sequence Gaussian white noise model (3). We consider linear functionals *Lf* such as integrals ${\int}_{B}f$ or derivatives *f*^{(r)}(*t*_{0}): if *f* has expansion $f(t)=\Sigma {\theta}_{k}{\phi}_{k}(t)$, then on setting ${a}_{k}=L{\phi}_{k}$, we have

$$Lf=\Sigma {\theta}_{k}L{\phi}_{k}=\Sigma {\theta}_{k}{a}_{k}.$$

Again, for maximum simplicity, we consider Gaussian priors on the coefficients:

$${\theta}_{k}\stackrel{\mathrm{ind}}{~}N(0,{\tau}_{k}^{2}).$$

(18)

In order that $\Sigma {\theta}_{k}^{2}<\infty$ with probability 1, it is necessary and sufficient that $\Sigma {\tau}_{k}^{2}<\infty$.

Consequently, the posterior laws are Gaussian

$${\theta}_{k}\mid {y}_{k}\stackrel{\mathrm{ind}}{~}N({w}_{kn}{y}_{k},{w}_{kn}{\sigma}_{n}^{2}),$$

again with

$${w}_{kn}={\tau}_{k}^{2}\u2215({\sigma}_{n}^{2}+{\tau}_{k}^{2})$$

(19)

so that the posterior mean estimate is

$${\widehat{Lf}}_{n}=\sum _{k}{a}_{k}{w}_{kn}{y}_{k}.$$

*Centering at posterior mean*. For the Bayesian, the posterior distribution is

$$Lf\mid y~N({\widehat{Lf}}_{n},{V}_{yn}),\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}{V}_{yn}={\sigma}_{n}^{2}\sum _{k}{a}_{k}^{2}{w}_{kn}$$

while for the frequentist, the conditional distribution is

$${\widehat{Lf}}_{n}\mid f~N({E}_{f}{\widehat{Lf}}_{n},{V}_{Fn}),\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}{V}_{Fn}={\sigma}_{n}^{2}\sum _{k}{a}_{k}^{2}{w}_{kn}^{2}.$$

The Bayesian might use 100(1 − α)% posterior credible intervals of the form ${\widehat{Lf}}_{n}\pm {z}_{\alpha /2}\sqrt{{V}_{yn}}$, while the frequentist might employ 100(1 − α)% confidence intervals ${\widehat{Lf}}_{n}\pm {z}_{\alpha /2}\sqrt{{V}_{Fn}}$. This leads us to consider the variance ratio

$$\frac{{V}_{Fn}}{{V}_{yn}}=\frac{\Sigma {a}_{k}^{2}{w}_{kn}^{2}}{\Sigma {a}_{k}^{2}{w}_{kn}}<1,$$

(20)

from which we see that the frequentist intervals are narrower, because the frequentist bias ${E}_{f}L\widehat{f}-Lf$ is being ignored for now, along with the attendant implications for coverage (but see below).
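The variance ratio (20) is easy to explore numerically. The sequences a_k = 1/k and τ_k² = k^{−2} below are illustrative choices (any square-summable sequences would do): as σ_n² → 0 the ratio approaches 1, while for appreciable noise it sits well below 1.

```python
import numpy as np

# Illustrative bounded functional and prior: a_k = 1/k (sum a_k^2 < inf)
# and tau_k^2 = k^{-2}; neither choice comes from the paper.
k = np.arange(1, 200_001, dtype=float)
a2 = k**-2.0            # a_k^2
tau2 = k**-2.0          # tau_k^2

def variance_ratio(sigma2):
    w = tau2 / (sigma2 + tau2)                    # weights (19)
    return (a2 * w**2).sum() / (a2 * w).sum()     # ratio (20)

r_noisy = variance_ratio(1e-1)
r_fine = variance_ratio(1e-8)   # small noise: w_kn -> 1 marginally, ratio -> 1
```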

As sample size *n* → ∞, the noise variance ${\sigma}_{n}^{2}\to 0$ and so for a given Gaussian prior (18), the weights (19) converge marginally: *w*_{kn} → 1 for each fixed *k*.

Let ${\mathbb{P}}_{f}^{n}$ denote the measure corresponding to (3). If Lf = ∫ af is a bounded linear functional, then the variation distance between Bayesian and frequentist distributions converges to zero:

$$\Vert N({\widehat{Lf}}_{n},{V}_{yn})-N({\widehat{Lf}}_{n},{V}_{Fn})\Vert \stackrel{{\mathbb{P}}_{f}^{n}}{\to}0.$$

(21)

*Proof.* We again use the Hellinger affinity (7) and apply (8) to the laws $P=N({\widehat{Lf}}_{n},{V}_{yn})$ and $Q=N({\widehat{Lf}}_{n},{V}_{Fn})$ to obtain

$${\rho}^{2}(P,Q)=\frac{2\sqrt{{V}_{Fn}\u2215{V}_{yn}}}{1+{V}_{Fn}\u2215{V}_{yn}}.$$

When $\Sigma {a}_{k}^{2}<\infty$, dominated convergence in (20) shows that ${V}_{Fn}/{V}_{yn}\to 1$; hence ${\rho}^{2}(P,Q)\to 1$ and (21) follows from (7).

*Remarks.* 1. Examples of bounded functionals include polynomials $a(t)={\Sigma}_{k=0}^{K}{c}_{k}{t}^{k}$ and “regions of interest” *a*(*t*) = *I*{*t* ∈ *B*}.

2. Examples of unbounded functionals are given by evaluation of a function (or its derivatives) at a point: *Lf* = *f*^{(r)}(*t*_{0}). We shall see that the variance ratio does not converge to 1, and so the Bernstein-von Mises theorem fails. Indeed, in the Fourier basis

$${\phi}_{0}(t)\equiv 1,\qquad {\phi}_{2k-1}(t)=\sqrt{2}\,\mathrm{sin}\,2\pi kt,\quad {\phi}_{2k}(t)=\sqrt{2}\,\mathrm{cos}\,2\pi kt,\qquad k=1,2,\dots $$

we find that ${a}_{k}={\phi}_{k}^{(r)}({t}_{0})$ has $|{a}_{k}|$ of order ${(2\pi k)}^{r}$, so that $\Sigma {a}_{k}^{2}=\infty$. If the prior variances are taken as ${\tau}_{k}^{2}={k}^{-2m}$, so that from (19) ${w}_{kn}={(1+{\sigma}_{n}^{2}{k}^{2m})}^{-1}$, then ${V}_{yn}$ and ${V}_{Fn}$ both take the form

$${V}_{jn}=2{(2\pi )}^{2r}{\sigma}_{n}^{2}\sum _{k}{k}^{2r}{(1+{\sigma}_{n}^{2}{k}^{2m})}^{-j},$$

with *j* = 1 for ${V}_{yn}$ and *j* = 2 for ${V}_{Fn}$.

As λ → 0, sums of this form satisfy

$$\sum _{k=0}^{\infty}{k}^{p}{(1+\lambda {k}^{q})}^{-r}\sim \kappa \,{\lambda}^{-(p+1)/q},$$

with $\kappa =\kappa (p,r;q)={\int}_{0}^{\infty}{v}^{p}{(1+{v}^{q})}^{-r}dv=\Gamma (r-\mu )\Gamma (\mu )/(q\Gamma (r))$ and $\mu =(p+1)/q$. In the present case, with *p* = 2*r*, *q* = 2*m* and *r* = *j*, we conclude that

$$\frac{{V}_{Fn}}{{V}_{yn}}\to 1-\frac{2r+1}{2m}<1.$$
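The limiting ratio can be confirmed numerically from the sums V_jn directly; the orders r = 1 and m = 3 below are illustrative, giving the limit 1 − 3/6 = 1/2.

```python
import numpy as np

# Numerical check of the limiting variance ratio for an unbounded functional;
# r (derivative order) and m (prior smoothness) are illustrative choices.
r, m = 1, 3

def V(j, lam, kmax=100_000):
    # Proportional to V_yn (j = 1) or V_Fn (j = 2); the common factor
    # 2 (2 pi)^{2r} sigma_n^2 cancels in the ratio.
    k = np.arange(1.0, kmax)
    return (k**(2*r) * (1 + lam * k**(2*m))**-j).sum()

lam = 1e-10                       # lambda = sigma_n^2 -> 0
ratio = V(2, lam) / V(1, lam)
limit = 1 - (2*r + 1) / (2*m)     # here 1 - 3/6 = 1/2
```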

*Centering* at the *MLE.* For a bounded linear functional, the MLE $L{\widehat{f}}_{M}=\Sigma {a}_{k}{y}_{k}$ is well defined and unbiased, with mean $E\left(L{\widehat{f}}_{M}\right)=Lf$ and frequentist variance ${V}_{Mn}=\mathrm{Var}\left(L{\widehat{f}}_{M}\right)={\sigma}_{n}^{2}{\Sigma}_{k}{a}_{k}^{2}$. A frequentist might prefer to use 100(1 − α)% intervals $L{\widehat{f}}_{M}\pm {z}_{\alpha \u22152}\sqrt{{V}_{Mn}}$ which will have the correct coverage property. However, extra conditions are required for the Bernstein-von Mises result to hold in this case.

Assume that Lf = ∫ af is a bounded linear functional. Suppose also that the coefficients ${\theta}_{k}=\langle f,{\phi}_{k}\rangle$ of the `true' f, and the variances ${\tau}_{k}^{2}$ of the Gaussian prior, together satisfy Σ|a_{k}θ_{k}/τ_{k}| < ∞. Then the distance between the Bayesian and frequentist distributions satisfies

$$\parallel N({\widehat{Lf}}_{n},{V}_{yn})-N(L{\widehat{f}}_{M},{V}_{Mn})\parallel \stackrel{{\mathbb{P}}_{f}^{n}}{\to}0.$$

(22)

*Proof*. The argument is a slight elaboration of that used in the previous proposition. We use (7) and $P=N({\widehat{Lf}}_{n},{V}_{yn})$ as before, but now $Q=N(L{\widehat{f}}_{M},{V}_{Mn})$ and (8) yields

$${\rho}^{2}(P,Q)=\frac{2\sqrt{{V}_{yn}{V}_{Mn}}}{{V}_{yn}+{V}_{Mn}}\mathrm{exp}\{-\frac{1}{2}\frac{{(L{\widehat{f}}_{M}-{\widehat{Lf}}_{n})}^{2}}{{V}_{yn}+{V}_{Mn}}\}.$$

As before, ${V}_{yn}/{V}_{Mn}=\Sigma {a}_{k}^{2}{w}_{kn}/\Sigma {a}_{k}^{2}\to 1$ by dominated convergence. Using this, the expression ${V}_{Mn}={\sigma}_{n}^{2}\Sigma {a}_{k}^{2}$, and the bounds (7), the conclusion (22) is equivalent to ${\sigma}_{n}^{-1}|L{\widehat{f}}_{M}-{\widehat{Lf}}_{n}|\stackrel{{\mathbb{P}}_{f}^{n}}{\to}0$. We may write

$${\sigma}_{n}^{-1}(L{\widehat{f}}_{M}-{\widehat{Lf}}_{n})\stackrel{\mathcal{D}}{=}\sum _{k}{a}_{k}{\sigma}_{n}^{-1}(1-{w}_{kn}){\theta}_{k}+\Sigma {a}_{k}(1-{w}_{kn}){\epsilon}_{k}.$$

The stochastic term has mean 0 and variance $\Sigma {a}_{k}^{2}{(1-{w}_{kn})}^{2}\to 0$, again by dominated convergence. Thus we may focus on the deterministic term, and note that the merging in (22) occurs if and only if

$${\sigma}_{n}\sum _{k}\frac{{a}_{k}{\theta}_{k}}{{\sigma}_{n}^{2}+{\tau}_{k}^{2}}\to 0.$$

The bound ${\sigma}_{n}{\tau}_{k}/({\sigma}_{n}^{2}+{\tau}_{k}^{2})\le \frac{1}{2}$ along with the dominated convergence theorem then shows that $\Sigma |{a}_{k}{\theta}_{k}{\tau}_{k}^{-1}|<\infty$ is a sufficient condition for (22), as claimed.
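The role of the summability condition can be illustrated numerically. The sequences below (a_k = k^{−1}, θ_k = k^{−2}, τ_k² = k^{−2.4}, all illustrative) satisfy Σ|a_kθ_k/τ_k| = Σk^{−1.8} < ∞, and the deterministic term indeed decreases to 0 as σ_n → 0.

```python
import numpy as np

# Illustrative sequences satisfying the sufficient condition:
# sum |a_k theta_k / tau_k| = sum k^{-1.8} < infinity.
k = np.arange(1, 2_000_001, dtype=float)
a, theta, tau2 = k**-1.0, k**-2.0, k**-2.4

def drift(sigma_n):
    # the deterministic term sigma_n * sum a_k theta_k / (sigma_n^2 + tau_k^2)
    return sigma_n * (a * theta / (sigma_n**2 + tau2)).sum()

drifts = [drift(s) for s in (1e-2, 1e-4, 1e-6)]
```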

**5. Related work**

As remarked earlier, this paper avoids the important Gaussian approximation part of the Bernstein-von Mises phenomenon by focusing on examples with Gaussian likelihoods and priors. A growing literature addresses the approximation challenges; we give a brief listing here, and refer to the books (Ghosh and Ramamoorthi, 2003; Ghosal and van der Vaart, 2010) and the survey discussion in Ghosal (2010, §2.7) for more detailed discussion.

Ghosal (1997, 1999, 2000) developed posterior normality results for the full posterior in cases where the dimension of the parameter space increases sufficiently slowly. In each case, the emphasis is on conditions under which a non-Gaussian likelihood and an appropriate prior sequence can yield approximately Gaussian posteriors. However, Ghosal (2000, Sec. 4) specializes his results to our setting (D) with ${\sigma}_{n}^{2}=1/n$ and notes that one can choose priors (in general not Gaussian) so that the posterior distribution centered by the MLE is approximately Gaussian if *p*^{3}(log *p*)/*n* → 0.

In survival analysis, Bernstein-von Mises theorems for the cumulative hazard function are established by Kim and Lee (2004) and for the cumulative hazard and fixed dimensional covariate regression parameter in a proportional hazards model in Kim (2006).

Boucheron and Gassiat (2009) develop a Bernstein-von Mises theorem for discrete probability distributions of growing dimension, and consider application to functionals such as Shannon and Renyi entropies.

In a semiparametric setting, where a finite dimensional parameter of interest can be separated from an infinite dimensional nuisance parameter, Castillo (2008) obtains conditions leading to a Bernstein-von Mises theorem on the parametric part, clarifying an earlier work of Shen (2002).

Rivoirard and Rousseau (2009) give conditions under which Bernstein-von Mises holds for linear functionals of a nonparametrically specified probability density function.

This work was supported in part by NIH grant R01 EB 001988 and NSF grant DMS 0906812.

**References**

- Borwanker J, Kallianpur G, Prakasa Rao BLS. The Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist. 1971;42:1241–1253.
- Boucheron S, Gassiat E. A Bernstein-von Mises theorem for discrete probability distributions. Electron. J. Stat. 2009;3:114–148.
- Castillo I. A semiparametric Bernstein-von Mises theorem. 2008. submitted.
- Cox DD. An analysis of Bayesian inference for nonparametric regression. Ann. Statist. 1993;21(2):903–923.
- Freedman D. On the Bernstein-von Mises theorem with infinite dimensional parameters. Annals of Statistics. 1999;27:1119–1140.
- Ghosal S. Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 1997;6(3):332–348.
- Ghosal S. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli. 1999;5(2):315–331.
- Ghosal S. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 2000;74(1):49–68.
- Ghosal S. The Dirichlet process, related priors and posterior asymptotics. In: Hjort NL, Holmes C, Müller P, Walker SG, editors. Bayesian Nonparametrics. Cambridge University Press; 2010. chapter 2.
- Ghosal S, van der Vaart A. Theory of Nonparametric Bayesian Inference. Cambridge University Press; 2010. in preparation.
- Ghosh JK, Ramamoorthi RV. Bayesian Nonparametrics. Springer-Verlag; New York: 2003. (Springer Series in Statistics).
- Heyde CC, Johnstone IM. On asymptotic posterior normality for stochastic processes. J. Roy. Statist. Soc. Ser. B. 1979;41(2):184–189.
- Johnstone IM. Function estimation and Gaussian sequence models. 2010. Book manuscript at www-stat.stanford.edu.
- Kim Y. The Bernstein-von Mises theorem for the proportional hazard model. Ann. Statist. 2006;34(4):1678–1700.
- Kim Y, Lee J. A Bernstein-von Mises theorem in the nonparametric right-censoring model. Ann. Statist. 2004;32(4):1492–1512.
- Lehmann EL, Casella G. Springer Texts in Statistics. second edn Springer-Verlag; New York: 1998. Theory of Point Estimation.
- Pinsker M. Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission. 1980;16:120–133. Originally in Russian in *Problemy Peredachi Informatsii*. 1980;16:52–68.
- Rivoirard V, Rousseau J. Bernstein-von Mises theorem for linear functionals of the density. 2009. submitted.
- Shen X. Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc. 2002;97(457):222–235.
- van der Vaart AW. Asymptotic statistics, Vol. 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press; Cambridge: 1998.
