Inst Math Stat Collect. Author manuscript; available in PMC 2011 January 1.
Published in final edited form as:
Inst Math Stat Collect. 2010; 6: 87–98.
PMCID: PMC2990974
NIHMSID: NIHMS199314

# High dimensional Bernstein-von Mises: simple examples

## Abstract

In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension p increases with sample size n: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.

Keywords: high dimensional inference, Gaussian sequence, linear functional, squared error loss, posterior distribution, frequentist

The Bernstein-von Mises theorem is a formalization of conditions under which Bayesian posterior credible intervals agree approximately with frequentist confidence intervals constructed from likelihood theory. It is traditionally formulated in situations in which the number of parameters p is fixed and the sample size n → ∞. The situation is very different in high dimensional settings in which p is allowed to grow with n. In this primarily expository paper, we use simple Gaussian sequence models to draw some conclusions about when a version of Bernstein-von Mises can hold.

We begin with a somewhat informal statement of the classical theorem. Suppose that Y1, …, Yn are i.i.d. observations from a distribution Pθ having density pθ(y) relative to a dominating measure μ, where $\theta \in \Theta \subset \mathbb{R}^p$. The log-likelihood for a single observation is

$\ell_\theta = \log p_\theta(y),$

and, as usual, the score function vector and Fisher information matrix are given by

$\dot\ell_\theta = (\partial/\partial\theta)\log p_\theta(y), \qquad I_\theta = E_\theta\, \dot\ell_\theta \dot\ell_\theta^T.$

Writing Yn = (Y1, …, Yn) for the full data, the log-likelihood

$L_n(\theta) = \sum_{k=1}^n \ell_\theta(Y_k),$

and we write $\hat\theta_{\mathrm{MLE}}$ for a maximizer of Ln(θ). Classical likelihood theory says that any (nice) estimator $\hat\theta$ satisfies the information bound

$\mathrm{Var}_\theta\, \hat\theta \ge n^{-1} I_\theta^{-1},$

in the usual ordering of nonnegative definite matrices, and that the bound is asymptotically attained by the MLE, which is also asymptotically Gaussian:

$\hat\theta_{\mathrm{MLE}} \mid \theta \approx N_p(\theta,\; n^{-1} I_\theta^{-1}).$

Now suppose that π(θ) is the density of a prior distribution with respect to Lebesgue measure. Then the posterior distribution of θ given Yn is given by Bayes' rule; we denote it simply by $P_{\theta \mid Y^n}$.

The Bernstein-von Mises theorem says, informally, that this posterior distribution is, in large samples, approximately normal with mean approximately the MLE $\hat\theta_{\mathrm{MLE}}$ and variance matrix approximately $n^{-1} I_{\theta_0}^{-1}$ (here θ0 is the 'true' value of θ generating the observations Y1, …, Yn). Using the scalar case for simplicity, and writing $\sigma_n^2 = n^{-1} I_{\theta_0}^{-1}$ and $z_\alpha = \tilde\Phi^{-1}(\alpha)$, with $\tilde\Phi$ the upper tail probability of the standard Gaussian, we have that an approximate 100(1 – α)% credible interval for θ would be given by $\hat\theta_{\mathrm{MLE}} \pm z_{\alpha/2}\hat\sigma_n$. This is exactly the same as the frequentist confidence interval based on asymptotic normality of the MLE. Thus in large samples the effect of the prior density π disappears: “the data overwhelms the prior”.
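
Before turning to the high dimensional setting, a minimal numerical sketch may help fix ideas; the conjugate Gaussian model, prior and all constants below are illustrative choices of ours, not taken from the development above.

```python
# A minimal sketch of the classical statement: in a scalar Gaussian location
# model with a conjugate N(0, tau^2) prior, the 95% posterior credible interval
# and the frequentist confidence interval based on the MLE merge as n grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0, sigma0, tau = 1.0, 1.0, 0.5    # 'true' mean, noise sd, prior sd
z = stats.norm.ppf(0.975)              # z_{alpha/2} for alpha = 0.05

for n in [10, 100, 10_000]:
    y = rng.normal(theta0, sigma0, size=n)
    ybar = y.mean()                                  # the MLE
    post_prec = n / sigma0**2 + 1 / tau**2
    post_mean = (n / sigma0**2) * ybar / post_prec   # posterior mean
    post_sd = post_prec**-0.5
    print(f"n={n:6d}  credible: ({post_mean - z*post_sd:+.3f}, {post_mean + z*post_sd:+.3f})"
          f"   confidence: ({ybar - z*sigma0/n**0.5:+.3f}, {ybar + z*sigma0/n**0.5:+.3f})")
```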

A somewhat more formal statement uses the notion of variation distance between probability measures P and Q, and an equivalent expression in terms of the densities p = dP/dμ and q = dQ/dμ relative to a dominating measure μ:

$\|P - Q\| = \max_A |P(A) - Q(A)| = \tfrac{1}{2}\int |p - q|\, d\mu.$

Suppose that π(θ) is continuous and positive at the 'true' value θ0, and that θ ↦ Pθ is differentiable in quadratic mean and satisfies a further mild separation condition; then

$\big\|P_{\theta \mid Y^n} - N\big(\hat\theta_{\mathrm{MLE}},\, n^{-1} I_{\theta_0}^{-1}\big)\big\| \to 0$
(1)

in probability under $P_{\theta_0}^n$.

In other words, the variation distance between the posterior and the approximating Gaussian distribution is a random variable depending on Yn that converges to zero in probability under repeated draws from $P_{\theta_0}$.

A development of the Bernstein-von Mises theorem as formulated above may be found in van der Vaart (1998, §10.2). A proof due to Bickel is given in Lehmann and Casella (1998, §6.8). Extensions from independent to dependent sampling settings are possible; see, e.g., Borwanker et al. (1971) and Heyde and Johnstone (1979). For further references and methods of proof of the classical results, see Ghosh and Ramamoorthi (2003, §1.4 and §1.5).

## 1. Growing Gaussian location model

In nonparametric and semiparametric settings the situation is very different. Even frequentist consistency of nonparametric Bayesian methods is a difficult issue with a large literature of both positive and negative results (e.g. Ghosh and Ramamoorthi (2003); Ghosal and van der Vaart (2010)). One cannot therefore expect Bernstein-von Mises phenomena in any great generality for the full posterior.

In this largely expository paper, we do some simple calculations in symmetric Gaussian sequence models. The Gaussian sequence structure makes possible an elementary set of examples that avoid the technical challenges posed by, and sophistication needed for, posterior Gaussian approximation in high dimensional settings (see references in Section 5). Nevertheless, the Gaussian examples can conveniently illustrate some of the issues related to validity of the Bernstein-von Mises theorem in high dimensional models. Depending on the frequentist or Bayesian perspective, we assume that p = p(n) grows with n, and one, or both, of

$(D)\ \text{Data: } \bar Y \mid \theta \sim N_p(\theta, \sigma_n^2 I), \qquad\text{and}\qquad (P)\ \text{Prior: } \theta \sim N_p(0, \tau_n^2 I).$

The notation $\bar Y$ suggests an average $(Y_1 + \cdots + Y_n)/n$ of observations individually of variance $\sigma_0^2$, so that in this case $\sigma_n^2 = \sigma_0^2/n$. [If p were held fixed, not depending on n, then $\sigma_n^2$ would match the definition given in the introductory section.] We also allow the prior variance $\tau_n^2$ to depend on the sample size n.

Our goal is to compare the Bayesian posterior distribution $\mathcal{L}(\theta \mid Y)$ with frequentist distributions, in particular those of the MLE $\mathcal{L}(\hat\theta_{\mathrm{MLE}} \mid \theta)$ and of the posterior mean Bayes estimator $\mathcal{L}(\hat\theta_B \mid \theta)$. A key simplification is that since both prior and likelihood are Gaussian, so also is the posterior distribution, and hence all the behavior will be determined by centering and scaling. Thus from standard results, the posterior is given by

$\theta \mid \bar Y = \bar y \sim N_p\big(w_n \bar y,\; w_n \sigma_n^2 I\big), \qquad w_n = \tau_n^2/(\sigma_n^2 + \tau_n^2).$
(2)
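
As a quick sanity check of (2), the following sketch (our illustration; all parameter values are arbitrary) conditions Monte Carlo draws of a single coordinate pair (θ, ȳ) on ȳ falling near a fixed value y0, and compares the empirical conditional moments with $w_n y_0$ and $w_n\sigma_n^2$.

```python
# Monte Carlo sanity check of the posterior formula (2) in one coordinate:
# theta ~ N(0, tau^2) as in (P), ybar | theta ~ N(theta, sigma^2) as in (D).
# Conditioning on ybar = y0 is approximated by keeping draws near y0.
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, y0, eps = 0.3, 1.0, 0.8, 0.01
w = tau**2 / (sigma**2 + tau**2)        # shrinkage factor w_n

theta = rng.normal(0.0, tau, size=2_000_000)
ybar = theta + rng.normal(0.0, sigma, size=theta.size)
keep = np.abs(ybar - y0) < eps          # crude conditioning on ybar near y0

print("E[theta | ybar]   empirical:", theta[keep].mean(), " formula:", w * y0)
print("Var[theta | ybar] empirical:", theta[keep].var(), " formula:", w * sigma**2)
```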

Remarks: 1. The reference to Gaussian sequence models becomes clearer if, as will be helpful later, we write out assumptions (D) and (P) in co-ordinates:

$(D_{\mathrm{seq}})\ \text{Data: } \bar y_k = \theta_k + \sigma_n \epsilon_k, \qquad\text{and}\qquad (P_{\mathrm{seq}})\ \text{Prior: } \theta_k = \tau_n \zeta_k,$

with the εk and ζk all i.i.d. standard Gaussian, for k = 1, …, p(n).

Strictly speaking, the indexing by n of parameters σn, τn and p(n) creates a sequence of sequence models. However, one can, as needed for almost sure results, think of the infinite sequences $\{(\epsilon_k, \zeta_k),\, k \in \mathbb{N}\}$ as being drawn from a single common probability space.

2. We also consider the infinite sequence Gaussian white noise model

$Y_t = \int_0^t f(s)\,ds + \sigma_n W_t, \qquad 0 \le t \le 1,$
(3)

or equivalently, when expressed in any orthonormal basis $\{\varphi_k(t)\}$ for L2[0, 1],

$y_k = \theta_k + \sigma_n \epsilon_k, \qquad \epsilon_k \overset{\mathrm{ind}}{\sim} N(0, 1),$
(4)

where it is assumed that $\sigma_n = \sigma_0/\sqrt{n}$. For some examples, it is helpful to use doubly indexed orthonormal bases $\{\varphi_{jk}(t),\, k = 1, \ldots, 2^j,\, j \in \mathbb{N}\}$ such as arise with systems of orthonormal wavelets.

The forthcoming book Johnstone (2010) will have more on estimation in such Gaussian sequence models.

We develop three perspectives on the Bernstein-von Mises phenomenon:

1. global convergence of the posterior,
2. behavior of a non-linear functional $\|\theta - \hat\theta\|^2$, and of
3. linear functionals Lf, in the white noise model (3)–(4).

We shall see that these situations are progressively “less demanding” in terms of validity of the Bernstein-von Mises phenomenon. Indeed, case (1) requires that wn → 1 at a sufficiently fast rate, while setting (2) needs only wn → 1. In case (3), the formulation itself delivers wn → 1, and covers at least all bounded linear functionals.

## 2. Global convergence of posterior

The first calculation considers the p-dimensional posterior distribution (2) and shows that the convergence in (1) occurs, even in the best possible case that θ0 = 0, only if the shrinkage factor wn approaches 1 at a sufficiently fast rate.

### Proposition 1

Let θ0 = 0. The variation distance between the posterior distribution $P_{\theta \mid Y^n}$ and $N(\hat\theta_{\mathrm{MLE}}, n^{-1} I_{\theta_0}^{-1})$ converges to zero in $P_{\theta_0}$-probability if and only if $\sqrt{p}\,\sigma_n^2/\tau_n^2 \to 0$, or equivalently, if

$w_n = 1 - o(1/\sqrt{p_n}).$
(5)

PROOF. We introduce the notation $P_{y,n}(d\theta)$ for the posterior distribution of θ|Yn = y and $Q_{y,n}(d\theta)$ for the distribution centered at $\hat\theta_{\mathrm{MLE}} = \bar y$. Thus

$P_{y,n}(d\theta) \leftrightarrow N_p\big(w_n \bar y,\, \sigma_n^2 w_n I\big), \qquad Q_{y,n}(d\theta) \leftrightarrow N_p\big(\bar y,\, \sigma_n^2 I\big).$
(6)

Let $\rho(P, Q) = \int \sqrt{pq}\, d\mu$ denote the Hellinger affinity between two probability measures P, Q having densities p, q with respect to a common dominating measure μ. We recall an elementary bound (van der Vaart, 1998, p. 212) for variation distance in terms of Hellinger distance and hence Hellinger affinity:

$2[1 - \rho(P, Q)] \le \|P - Q\| \le \{8[1 - \rho(P, Q)]\}^{1/2}.$
(7)

Thus $\|P_{y,n} - Q_{y,n}\| \to 0$ if and only if $\rho(P_{y,n}, Q_{y,n}) \to 1$. We recall also that affinity commutes with products:

$\rho\Big(\prod P_i, \prod Q_i\Big) = \prod \rho(P_i, Q_i).$

An elementary calculation shows that

$\rho^2\big(N(\theta_1, \sigma_1^2),\, N(\theta_2, \sigma_2^2)\big) = \left(\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}\right)\exp\left\{-\frac{(\theta_1 - \theta_2)^2}{2(\sigma_1^2 + \sigma_2^2)}\right\}.$
(8)
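
As a check on (8), the following sketch compares the closed form with direct quadrature of $\sqrt{pq}$ for two scalar Gaussians; the parameter values are arbitrary.

```python
# Numerical check of the two-Gaussian affinity formula (8) by quadrature.
import numpy as np
from scipy import stats, integrate

t1, s1 = 0.3, 0.8     # theta_1, sigma_1 (arbitrary)
t2, s2 = -0.2, 1.1    # theta_2, sigma_2

closed = (2*s1*s2 / (s1**2 + s2**2)) * np.exp(-(t1 - t2)**2 / (2*(s1**2 + s2**2)))

rho, _ = integrate.quad(
    lambda x: np.sqrt(stats.norm.pdf(x, t1, s1) * stats.norm.pdf(x, t2, s2)),
    -20, 20)
print("rho^2: closed form", closed, " quadrature", rho**2)
```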

When applied to Py,n and Qy,n, we set $\theta_{1i} = w_n \bar y_i$, $\theta_{2i} = \bar y_i$ and $\sigma_1^2 = \sigma_n^2 w_n$, $\sigma_2^2 = \sigma_n^2$ to obtain

$\rho(P_{y,n}, Q_{y,n}) = \exp\left\{-\frac{p}{2}\log\tfrac12\big(w_n^{1/2} + w_n^{-1/2}\big) - \frac{(1 - w_n)^2\|\bar y\|^2}{4(1 + w_n)\sigma_n^2}\right\}.$
(9)

Introduce $r_n = \sigma_n^2/\tau_n^2 = w_n^{-1} - 1$. Suppose first that $p r_n^2 \to 0$. Since $w_n = (1 + r_n)^{-1}$, we have

$\log\tfrac12\big(w_n^{1/2} + w_n^{-1/2}\big) = -\tfrac12\log(1 + r_n) + \log\big(1 + \tfrac12 r_n\big) \le 2c_1 r_n^2$

for $r_n \le \tfrac12$, say. When p → ∞, we have with probability tending to one that $\|\bar y\|^2/\sigma_n^2 < 2p$, and so for $r_n \le \tfrac12$,

$\frac{(1 - w_n)^2}{4(1 + w_n)}\frac{\|\bar y\|^2}{\sigma_n^2} \le \frac{2p r_n^2}{(1 + r_n)(1 + r_n/2)} \le c_2 p r_n^2.$

Consequently, when $p r_n^2 \to 0$,

$\rho(P_{y,n}, Q_{y,n}) \ge \exp\big\{-c_1 p r_n^2 - c_2 p r_n^2\big\} \to 1.$

Suppose now that $p r_n^2$ does not approach 0. Again with probability tending to one, $\|\bar y\|^2/\sigma_n^2 > p/2$, and since $\tfrac12\big(w_n^{1/2} + w_n^{-1/2}\big) > 1$, we have from (9) that

$-\log\rho(P_{y,n}, Q_{y,n}) > \frac{p(1 - w_n)^2}{8(1 + w_n)} > c_3\min\big\{p r_n^2,\, p\big\},$

which cannot converge to zero if $p r_n^2$ does not.

Remark. If θ0 = θ0n ≠ 0, so that the data mean differs from the prior mean, then the rate condition is replaced by

$1 - w_n = o\big(1/q_n(\theta_{0n})\big), \qquad q_n(\theta_{0n}) = \sqrt{p_n} + \|\theta_{0n}\|/\sigma_n.$
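The dichotomy in Proposition 1 is easy to see numerically. The sketch below (our illustration, with an arbitrary growth rate for p) evaluates the affinity (9) at a simulated ȳ under θ0 = 0 with $\sigma_n^2 = 1/n$: with $\tau_n^2$ fixed we have $p r_n^2 \to 0$ and ρ → 1, while with $\tau_n^2 = \sigma_n^2$ the shrinkage factor stays at 1/2 and ρ → 0; by (7), ρ → 1 is exactly merging in variation distance.

```python
# Affinity (9) under theta_0 = 0, sigma_n^2 = 1/n, for two prior scalings.
# rho -> 1 corresponds to Bernstein-von Mises holding (via the bounds (7)).
import numpy as np

rng = np.random.default_rng(2)

def affinity(p, sigma2, tau2):
    w = tau2 / (sigma2 + tau2)
    ybar = rng.normal(0.0, np.sqrt(sigma2), size=p)   # data with theta_0 = 0
    logrho = (-0.5 * p * np.log(0.5 * (w**0.5 + w**-0.5))
              - (1 - w)**2 * np.sum(ybar**2) / (4 * (1 + w) * sigma2))
    return np.exp(logrho)

for n in [10**2, 10**4, 10**6]:
    p, sigma2 = n // 10, 1.0 / n          # p grows with n (illustrative rate)
    print(f"n={n:8d}  rho with tau_n^2 = 1        : {affinity(p, sigma2, 1.0):.6f}")
    print(f"            rho with tau_n^2 = sigma_n^2: {affinity(p, sigma2, sigma2):.2e}")
```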

Example. We illustrate the result by considering estimation in the Gaussian white noise model (3). When expressed in a suitable orthonormal basis of wavelets, we obtain $y_{jk} \overset{\mathrm{ind}}{\sim} N(\theta_{jk}, \sigma_n^2)$, for $k = 1, \ldots, 2^j$ and $j \in \mathbb{N}$. Pinsker's theorem (Pinsker, 1980) describes the minimax linear estimator of f, or equivalently of (θjk), under squared error loss when it is assumed that f has α mean square derivatives, and shows that such minimax linear estimators are asymptotically minimax among all estimators as σn → 0.

Pinsker's estimator is necessarily the posterior mean Bayes estimator for a corresponding Gaussian prior. The mean square differentiability condition can be equivalently expressed in terms of the coefficients as

$\sum_{j,k} 2^{2j\alpha}\theta_{jk}^2 \le C^2,$

and the corresponding least favorable Gaussian prior puts

$\theta_{jk} \overset{\mathrm{ind}}{\sim} N(0, \tau_j^2), \qquad \tau_j^2 = \sigma_n^2\big(\mu_n 2^{-j\alpha} - 1\big)_+,$
(10)

where $\mu_n = c_{\alpha n}(C\sqrt{n})^{2\alpha/(2\alpha+1)}$. The constant $c_{\alpha n}$ satisfies bounds independent of n, $\underline{c} \le c_{\alpha n} \le \bar{c}$, whose precise values are unimportant here; for further details see Johnstone (2010).

We consider the validity of the Bernstein-von Mises phenomenon for the collection of coefficients $\{\theta_{jk}, k = 1, \ldots, 2^j\}$ at a given level j = j(n), possibly fixed, or possibly varying with n.

The prior variances $\tau_j^2$ decrease with j, and vanish above a “critical level” j* = j*(α, C; n). Since $j^* \sim (2/(2\alpha+1))\log_2(C\sqrt{n})$ grows with n, so does the number of parameters θj*,k at the critical level. From (10), we conclude that

$\tau_{j^*}^2/\sigma_n^2 \le 2^\alpha - 1,$

and hence that $w_n \le 1 - 2^{-\alpha}$ does not approach 1, so that the condition of Proposition 1 fails.

On the other hand, at a fixed level j0, we have $p = 2^{j_0}$ fixed and $\tau_{j_0}^2/\sigma_n^2 = \mu_n 2^{-j_0\alpha} - 1 \to \infty$, so that $\sqrt{p}\,\sigma_n^2/\tau_{j_0}^2 \to 0$ and so Proposition 1 applies. Thus we may say informally that the Bernstein-von Mises phenomenon holds at a fixed level but fails at the critical level.
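
A small computation makes this dichotomy concrete. In the sketch below we take the calibration $c_{\alpha n} = 1$, an illustrative assumption of ours: the weight at a fixed level approaches 1 while the weight at the critical level stays below the bound $1 - 2^{-\alpha}$.

```python
# Shrinkage weights for the Pinsker-type prior (10), with the illustrative
# calibration c_{alpha,n} = 1, i.e. mu_n = (C*sqrt(n))**(2*alpha/(2*alpha+1)).
import numpy as np

alpha, C, j0 = 2.0, 1.0, 2           # smoothness, radius, a fixed level

def w(j, mu, sigma2):
    tau2 = sigma2 * max(mu * 2.0**(-j * alpha) - 1.0, 0.0)   # variance in (10)
    return tau2 / (sigma2 + tau2)

for n in [10**4, 10**6, 10**8]:
    sigma2 = 1.0 / n
    mu = (C * np.sqrt(n))**(2 * alpha / (2 * alpha + 1))
    jstar = int(np.log2(mu) / alpha)  # critical level: tau_j^2 > 0 iff j <= j*
    print(f"n={n:10d}  j*={jstar}  w at fixed level j0={j0}: {w(j0, mu, sigma2):.4f}"
          f"  w at critical level: {w(jstar, mu, sigma2):.4f}"
          f"  bound 1 - 2^-alpha = {1 - 2**-alpha}")
```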

## 3. Behavior of the squared loss

In this section, we pay homage to a remarkable paper by Freedman (1999), itself stimulated by Cox (1993), which sets out the failure of the Bernstein-von Mises theorem in a simple sequence model of function estimation in Gaussian white noise. To further simplify the calculations, we use the growing Gaussian location model (D), (P), yielding results parallel to, but not identical with, Freedman's. Hence, define

$T_n(\theta, Y) = \|\theta - \hat\theta_B\|^2 = \sum_{k=1}^{p(n)}\big(\theta_k - \hat\theta_{B,k}\big)^2.$

The posterior distribution of θ|Y is described by (2); in particular the shrinkage factor $w_n = \tau_n^2/(\sigma_n^2 + \tau_n^2)$ again plays a critical role.

Theorem 2 (Bayesian). The posterior distribution $\mathcal{L}(T_n \mid Y)$ is given by

$T_n = C_n + D_n Z_{1n},$

where

$C_n = p\sigma_n^2 w_n$
(11)

$D_n = \sqrt{2p}\,\sigma_n^2 w_n$
(12)

and the random variable Z1n has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞.

Proof. From (2), the posterior distribution of Tn given Y is $\sigma_n^2 w_n \chi^2_{(p)}$, and in particular it is free of Y. Hence we have the representation

$T_n = p\sigma_n^2 w_n + \sqrt{2p}\,\sigma_n^2 w_n Z_{1n},$

and the theorem follows because $\big(\chi_p^2 - p\big)/\sqrt{2p} \Rightarrow N(0, 1)$ as p → ∞.
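
A short simulation (ours, with arbitrary parameter values) confirms the representation: posterior draws of $\theta - \hat\theta_B$ produce a $T_n$ whose mean and standard deviation match (11) and (12).

```python
# Simulation check of Theorem 2: under the posterior (2), theta - theta_B is
# N_p(0, w_n sigma_n^2 I) free of Y, so T_n ~ sigma_n^2 w_n chi^2_p.
import numpy as np

rng = np.random.default_rng(3)
p, sigma2, tau2 = 2000, 1.0, 3.0      # arbitrary illustrative values
w = tau2 / (sigma2 + tau2)

resid = rng.normal(0.0, np.sqrt(w * sigma2), size=(5_000, p))
Tn = (resid**2).sum(axis=1)           # 5000 posterior draws of T_n

Cn = p * sigma2 * w                   # (11)
Dn = np.sqrt(2 * p) * sigma2 * w      # (12)
print("simulated mean, sd:", Tn.mean(), Tn.std())
print("theory    mean, sd:", Cn, Dn)
```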

Turn now to the frequentist perspective, in which θ is a fixed and unknown (sequence of) parameters. We will therefore use the decomposition $y_k = \theta_k + \sigma_n\epsilon_k$ with $\epsilon_k \overset{\mathrm{iid}}{\sim} N(0, 1)$, cf. (Dseq) above. Since $\hat\theta_{B,k} = w_n y_k$ we have

$\theta_k - \hat\theta_{B,k} = (1 - w_n)\theta_k - w_n\sigma_n\epsilon_k.$
(13)

Some of the conclusions will be valid only for “most” θ: to formulate this it is useful to give θ a distribution. The natural one to use is (P), despite the possible confusion arising because, for the frequentist, this is not an a priori law!

Theorem 3 (Frequentist). The conditional distribution $\mathcal{L}(T_n \mid \theta)$ is given by

$T_n = C_n + F_n Z_{2n}(\theta) + G_n(\theta)Z_{3n}(\theta, \epsilon),$
(14)

where Cn is as in Theorem 2, while Z3n (θ, ε) has mean 0 and variance 1.

If θ is distributed according to (P), then Z2n(θ) has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞. In addition, if $w_n \to w = 1 - \cos\omega$,

$F_n \sim D_n\cos\omega, \qquad G_n(\theta) \sim D_n\sin\omega,$
(15)

and

$Z_{3n}(\theta, \cdot) \Rightarrow N(0, 1).$
(16)

Formulas (15) and (16) hold as n → ∞, for almost all θ's generated from (P).

Proof. Using (13), and $(1 - w_n)^2\tau_n^2 + w_n^2\sigma_n^2 = \sigma_n^2 w_n$, we may write

$T_n = \sum_k\big[(1 - w_n)\theta_k - w_n\sigma_n\epsilon_k\big]^2 = p\sigma_n^2 w_n + \sqrt{2p}\,\tau_n^2(1 - w_n)^2 \cdot \frac{\Sigma\theta_k^2 - p\tau_n^2}{\sqrt{2p}\,\tau_n^2} + R_n(\theta, \epsilon),$
(17)

with

$R_n(\theta, \epsilon) = -2w_n(1 - w_n)\sigma_n\,\Sigma\,\theta_k\epsilon_k + w_n^2\sigma_n^2\,\Sigma\big(\epsilon_k^2 - 1\big).$

This leads immediately to the representation (14) after observing that $\tau_n^2(1 - w_n) = \sigma_n^2 w_n$ and setting

$F_n = \sqrt{2p}\,\sigma_n^2 w_n(1 - w_n), \qquad Z_{2n}(\theta) = \Sigma\big(\theta_k^2 - \tau_n^2\big)\big/\big(\sqrt{2p}\,\tau_n^2\big),$

$G_n^2(\theta) = \operatorname{Var} R_n(\theta, \epsilon) = 4w_n^2(1 - w_n)^2\sigma_n^2\,\Sigma\theta_k^2 + w_n^4 \cdot 2p\sigma_n^4 = G_{1n}(\theta) + G_{2n}.$

Turning to the final assertions, we may rewrite

$R_n(\theta, \epsilon) = \sqrt{G_{1n}(\theta)}\,Z_{4n}(\theta) + \sqrt{G_{2n}}\,Z_{5n},$

where

$Z_{4n}(\theta) = \Sigma\,\theta_k\epsilon_k\big/\big(\Sigma\theta_k^2\big)^{1/2}, \qquad Z_{5n} = \big(\Sigma\epsilon_k^2 - p\big)\big/\sqrt{2p}.$

Using again $\sigma_n^2 w_n = \tau_n^2(1 - w_n)$, we have

$G_{1n}(\theta) = 2p\sigma_n^4 \cdot w_n^2\big(2w_n - 2w_n^2\big) \cdot p^{-1}\sum_1^p(\theta_k/\tau_n)^2.$

For almost all θ's generated from (P), $p^{-1}\sum_1^p(\theta_k/\tau_n)^2 \to 1$, and since $G_n^2(\theta) = G_{1n}(\theta) + G_{2n}$, (15) follows.

Clearly $Z_{4n}(\theta) \sim N(0, 1)$, free of θ, while $Z_{5n} \Rightarrow N(0, 1)$, and so (16) follows.

Remark. The doctrinaire frequentist would not contemplate the joint distribution of (θ, Y) in (D, P); but anyone else would observe that in that joint distribution, $T_n \sim \sigma_n^2 w_n\chi^2_{(p)}$, as follows easily in two ways, either from the proof of Theorem 2, or from (17).

The Bernstein-von Mises theorem fails if lim wn = w < 1, as may be seen in Figure 1. For the Bayesian, conditional on Y, $\theta - \hat\theta_B$ is a noise vector, and Theorem 2 says that the distribution of $\|\theta - \hat\theta_B\|^2$ is approximately normal with mean Cn and standard deviation Dn. For the frequentist, $E[\hat\theta_B \mid \theta] = w_n\theta$ is biased (also asymptotically), and some of $\|\hat\theta_B - \theta\|^2$ comes from this bias. As a result, Theorem 3 says that, conditional on θ, $\|\hat\theta_B - \theta\|^2$ is approximately normal with mean $C_n + F_n Z_{2n}(\theta)$ and standard deviation $G_n(\theta)$. Comparing (12) and (15) shows that the frequentist SD $G_n(\theta)$ is smaller than the Bayesian SD $D_n$.

Figure 1. The top panel, for the Bayesian, has $\theta - \hat\theta_B$ as a noise vector, and the posterior distribution $\mathcal{L}(T \mid Y)$ approximately $N(C_n, D_n^2)$. The bottom panel, for the frequentist, shows the effect of the bias of $E[\hat\theta_B \mid \theta]$ …

Under the assumption (P), $\theta_i \overset{\mathrm{iid}}{\sim} N(0, \tau_n^2)$, the 'wobble' in the frequentist mean can be arbitrarily large relative to $D_n$: from the law of the iterated logarithm, with probability one

$\limsup_n\, Z_{2n}(\theta)\big/\sqrt{2\log\log p} = 1.$

By contrast, if lim wn = 1, then the wobble disappears: $F_n = o(D_n)$ and the Bayesian SD equals the frequentist SD asymptotically: $G_n(\theta) \sim D_n$.
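
The contrast is easy to reproduce by simulation; the sketch below (our illustration, with w = 1/2 and arbitrary sizes) draws a few θ's from (P) and, for each, simulates $\mathcal{L}(T_n \mid \theta)$: the centers wobble around Cn by multiples of Fn, while the conditional spread stays close to $D_n\sin\omega$, smaller than the Bayesian Dn.

```python
# Simulation of Theorem 3 with w_n = 1/2 (so cos(omega) = 1/2): for each theta
# drawn from (P), T_n | theta is centered at C_n + F_n Z_2n(theta), which
# varies from theta to theta, with sd close to D_n * sin(omega) < D_n.
import numpy as np

rng = np.random.default_rng(4)
p, sigma2, tau2 = 5000, 1.0, 1.0
sigma, tau = np.sqrt(sigma2), np.sqrt(tau2)
w = tau2 / (sigma2 + tau2)
Cn = p * sigma2 * w
Dn = np.sqrt(2 * p) * sigma2 * w
sin_omega = np.sqrt(1 - (1 - w)**2)

for rep in range(3):
    theta = rng.normal(0.0, tau, size=p)          # one 'true' theta from (P)
    eps = rng.normal(size=(2_000, p))
    Tn = (((1 - w) * theta - w * sigma * eps)**2).sum(axis=1)   # uses (13)
    print(f"theta #{rep}: mean(T_n)={Tn.mean():8.1f} (C_n={Cn:.1f})"
          f"  sd(T_n)={Tn.std():6.2f} (D_n sin omega={Dn*sin_omega:.2f}, D_n={Dn:.2f})")
```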

## 4. Linear Functionals

We turn now to the least demanding of our three scenarios for the Bernstein-von Mises theorem: the behavior of linear functionals. We change the setting slightly to the infinite sequence Gaussian white noise model (3). We consider linear functionals Lf such as integrals $\int_B f$ or derivatives $f^{(r)}(t_0)$: if f has expansion $f(t) = \Sigma\,\theta_k\varphi_k(t)$, then on setting $a_k = L\varphi_k$, we have

$Lf = \Sigma\,\theta_k L\varphi_k = \Sigma\,\theta_k a_k.$

Again, for maximum simplicity, we consider Gaussian priors on the coefficients:

$\theta_k \overset{\mathrm{ind}}{\sim} N(0, \tau_k^2).$
(18)

In order that $\Sigma\theta_k^2 < \infty$ with probability 1, it is necessary and sufficient that $\Sigma\tau_k^2 < \infty$.

Consequently, the posterior laws are Gaussian,

$\theta_k \mid y_k \overset{\mathrm{ind}}{\sim} N\big(w_{kn}y_k,\; w_{kn}\sigma_n^2\big),$

again with

$w_{kn} = \tau_k^2/(\sigma_n^2 + \tau_k^2),$
(19)

so that the posterior mean estimate is

$\widehat{Lf}_n = \sum_k a_k w_{kn}y_k.$

Centering at posterior mean. For the Bayesian, the posterior distribution

$Lf \mid y \sim N\big(\widehat{Lf}_n, V_{yn}\big), \qquad V_{yn} = \sigma_n^2\sum_k a_k^2 w_{kn},$

while for the frequentist, the conditional distribution

$\widehat{Lf}_n \mid f \sim N\big(E_f\widehat{Lf}_n, V_{Fn}\big), \qquad V_{Fn} = \sigma_n^2\sum_k a_k^2 w_{kn}^2.$

The Bayesian might use 100(1 − α)% posterior credible intervals of the form $\widehat{Lf}_n \pm z_{\alpha/2}\sqrt{V_{yn}}$, while the frequentist might employ 100(1 − α)% confidence intervals $\widehat{Lf}_n \pm z_{\alpha/2}\sqrt{V_{Fn}}$. This leads us to consider the variance ratio

$\frac{V_{Fn}}{V_{yn}} = \frac{\Sigma a_k^2 w_{kn}^2}{\Sigma a_k^2 w_{kn}} < 1,$
(20)

from which we see that the frequentist intervals are narrower, because the frequentist bias $E_f\widehat{Lf}_n - Lf$ is being ignored for now, along with the attendant implications for coverage (but see below).

As sample size n → ∞, the noise variance $\sigma_n^2 \to 0$ and so for a given Gaussian prior (18), the weights (19) converge marginally: wkn → 1 for each fixed k. This alone does not imply convergence of the variance ratio VFn/Vyn → 1, as a later example shows. A sufficient condition is that the linear functional Lf be bounded (as a mapping from L2[0, 1] to $\mathbb{R}$). This amounts to saying that Lf has the representation $Lf = \int_0^1 a(t)f(t)\,dt$ with $\int a^2(t)\,dt < \infty$, or equivalently, in sequence terms, that $\Sigma a_k^2 < \infty$.
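
For a concrete bounded example, take $a_k = 1/k$ and $\tau_k^2 = k^{-2}$ (our illustrative choices); the sketch below shows the ratio (20) approaching 1 as $\sigma_n^2 \to 0$.

```python
# Variance ratio (20) for the bounded functional a_k = 1/k with prior
# variances tau_k^2 = k^{-2}; dominated convergence gives V_Fn/V_yn -> 1.
import numpy as np

k = np.arange(1, 1_000_000, dtype=float)
a2, tau2 = k**-2.0, k**-2.0            # a_k^2 summable: L is bounded

for sigma2 in [1e-2, 1e-4, 1e-6]:
    w = tau2 / (sigma2 + tau2)         # weights (19)
    print(f"sigma_n^2={sigma2:.0e}  V_Fn/V_yn = {np.sum(a2*w**2)/np.sum(a2*w):.4f}")
```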

### Proposition 4

Let $P_f^n$ denote the measure corresponding to (3). If Lf = ∫ af is a bounded linear functional, then the variation distance between Bayesian and frequentist distributions converges to zero:

$\big\|N\big(\widehat{Lf}_n, V_{yn}\big) - N\big(\widehat{Lf}_n, V_{Fn}\big)\big\| \overset{P_f^n}{\longrightarrow} 0.$
(21)

Proof. We again use the Hellinger affinity (7) and apply (8) to the laws $P = N\big(\widehat{Lf}_n, V_{yn}\big)$ and $Q = N\big(\widehat{Lf}_n, V_{Fn}\big)$ to obtain

$\rho^2(P, Q) = \frac{2\sqrt{V_{Fn}/V_{yn}}}{1 + V_{Fn}/V_{yn}}.$

In view of (7), the merging in (21) occurs if and only if

$V_{Fn}/V_{yn} \to 1.$

When $\Sigma a_k^2 < \infty$, this convergence follows from (20) and the dominated convergence theorem.

Remarks. 1. Examples of bounded functionals include polynomials $a(t) = \sum_{k=0}^K c_k t^k$ and “regions of interest” $a(t) = I\{t \in B\}$.

2. Examples of unbounded functionals are given by evaluation of a function (or its derivatives) at a point: $Lf = f^{(r)}(t_0)$. We shall see that the variance ratio does not converge to 1, and so the Bernstein-von Mises theorem fails. Indeed, in the Fourier basis

$\varphi_0(t) \equiv 1, \qquad \varphi_{2k-1}(t) = \sqrt{2}\sin 2\pi kt, \quad \varphi_{2k}(t) = \sqrt{2}\cos 2\pi kt, \qquad k = 1, 2, \ldots$

we find that $a_k = L\varphi_k = d^r\varphi_k(t_0)/dt^r$, and an easy calculation shows that $a_{2k-1}^2 + a_{2k}^2 = 2(2\pi k)^{2r}$. We use a Gaussian prior (18) with $\tau_{2k-1}^2 = \tau_{2k}^2 = k^{-2m}$ and 2m > 2r + 1. It follows from (19) that, after writing V1n and V2n for Vyn and VFn respectively, we have

$V_{jn} = 2(2\pi)^{2r}\sigma_n^2\sum_k k^{2r}\big(1 + \sigma_n^2 k^{2m}\big)^{-j}.$

As λ → 0, sums of the form

$\sum_{k=0}^\infty k^p\big(1 + \lambda k^q\big)^{-r} \sim \kappa\,\lambda^{-(p+1)/q},$

with $\kappa = \kappa(p, r; q) = \int_0^\infty v^p(1 + v^q)^{-r}\,dv = \Gamma(r - \mu)\Gamma(\mu)/(q\,\Gamma(r))$ and $\mu = (p + 1)/q$. In the present case, with p = 2r, q = 2m and r = j, we conclude that

$\frac{V_{Fn}}{V_{yn}} \to 1 - \frac{2r + 1}{2m} < 1.$
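
The limit is easy to check numerically; the sketch below evaluates the sums defining Vjn (dropping the common factor $2(2\pi)^{2r}\sigma_n^2$, which cancels in the ratio) for illustrative choices of r and m.

```python
# Numerical check that V_Fn / V_yn -> 1 - (2r+1)/(2m) for point evaluation,
# using the sums in V_jn with the common constant factor cancelled.
import numpy as np

r, m = 0, 2.0                          # illustrative values with 2m > 2r + 1
k = np.arange(1, 1_000_000, dtype=float)

for sigma2 in [1e-4, 1e-6, 1e-8]:
    V1, V2 = (np.sum(k**(2*r) * (1 + sigma2 * k**(2*m))**-j) for j in (1, 2))
    print(f"sigma_n^2={sigma2:.0e}  V_Fn/V_yn = {V2/V1:.4f}")

print("limit 1 - (2r+1)/(2m) =", 1 - (2*r + 1) / (2*m))
```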

Centering at the MLE. For a bounded linear functional, the MLE $\widehat{Lf}_M = \Sigma\,a_k y_k$ is well defined and unbiased, with mean $E\big(\widehat{Lf}_M\big) = Lf$ and frequentist variance $V_{Mn} = \operatorname{Var}\big(\widehat{Lf}_M\big) = \sigma_n^2\Sigma_k a_k^2$. A frequentist might prefer to use 100(1 − α)% intervals $\widehat{Lf}_M \pm z_{\alpha/2}\sqrt{V_{Mn}}$, which will have the correct coverage property. However, extra conditions are required for the Bernstein-von Mises result to hold in this case.

### Proposition 5

Assume that Lf = ∫ af is a bounded linear functional. Suppose also that the coefficients $\theta_k$ of the 'true' f and the variances $\tau_k^2$ of the Gaussian prior together satisfy $\Sigma|a_k\theta_k/\tau_k| < \infty$. Then the distance between Bayesian and frequentist distributions

$\big\|N\big(\widehat{Lf}_n, V_{yn}\big) - N\big(\widehat{Lf}_M, V_{Mn}\big)\big\| \overset{P_f^n}{\longrightarrow} 0.$
(22)

Proof. The argument is a slight elaboration of that used in the previous proposition. We use (7) and $P = N\big(\widehat{Lf}_n, V_{yn}\big)$ as before, but now $Q = N\big(\widehat{Lf}_M, V_{Mn}\big)$, and (8) yields

$\rho^2(P, Q) = \frac{2\sqrt{V_{yn}V_{Mn}}}{V_{yn} + V_{Mn}}\exp\left\{-\frac12\frac{\big(\widehat{Lf}_M - \widehat{Lf}_n\big)^2}{V_{yn} + V_{Mn}}\right\}.$

As before, $V_{yn}/V_{Mn} = \Sigma a_k^2 w_{kn}/\Sigma a_k^2 \to 1$ by dominated convergence. Using this and the expression $V_{Mn} = \sigma_n^2\Sigma a_k^2$, and in view of the bounds (7), the conclusion (22) is equivalent to $\sigma_n^{-1}\big|\widehat{Lf}_M - \widehat{Lf}_n\big| \overset{P_f^n}{\to} 0$. We may write

$\sigma_n^{-1}\big(\widehat{Lf}_M - \widehat{Lf}_n\big) \overset{\mathcal{D}}{=} \sum_k a_k\sigma_n^{-1}(1 - w_{kn})\theta_k + \Sigma\,a_k(1 - w_{kn})\epsilon_k.$

The stochastic term has mean 0 and variance $\Sigma a_k^2(1 - w_{kn})^2 \to 0$, again by dominated convergence. Thus we may focus on the deterministic term, and note that the merging in (22) occurs if and only if

$\sigma_n\sum_k\frac{a_k\theta_k}{\sigma_n^2 + \tau_k^2} \to 0.$

The bound $\sigma_n\tau_k/(\sigma_n^2 + \tau_k^2) \le \tfrac12$ along with the dominated convergence theorem then shows that $\Sigma\big|a_k\theta_k\tau_k^{-1}\big| < \infty$ is a sufficient condition for (22), as claimed.
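
To see the sufficiency condition in action, the sketch below uses illustrative sequences of our own choosing, $a_k = 1/k$, $\tau_k = 1/k$ and $\theta_k = k^{-2}$ (so that $\Sigma|a_k\theta_k/\tau_k| < \infty$), and shows the deterministic term vanishing as $\sigma_n \to 0$.

```python
# The deterministic bias term sigma_n * sum a_k theta_k / (sigma_n^2 + tau_k^2)
# of Proposition 5, for sequences with sum |a_k theta_k / tau_k| < infinity.
import numpy as np

k = np.arange(1, 1_000_000, dtype=float)
a, tau, theta = 1.0/k, 1.0/k, k**-2.0   # illustrative; sum |a*theta/tau| = sum k^-2

for sigma in [1e-1, 1e-2, 1e-3, 1e-4]:
    term = sigma * np.sum(a * theta / (sigma**2 + tau**2))
    print(f"sigma_n={sigma:.0e}  bias term = {term:.3e}")
```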

## 5. Related work

As remarked earlier, this paper avoids the important Gaussian approximation part of the Bernstein-von Mises phenomenon by focusing on examples with Gaussian likelihoods and priors. A growing literature addresses the approximation challenges; we give a brief listing here, and refer to the books (Ghosh and Ramamoorthi, 2003; Ghosal and van der Vaart, 2010) and the survey discussion in Ghosal (2010, §2.7) for more detailed discussion.

Ghosal (1997, 1999, 2000) developed posterior normality results for the full posterior in cases where the dimension of the parameter space increases sufficiently slowly. In each case, the emphasis is on conditions under which a non-Gaussian likelihood and appropriate prior sequence can yield approximately Gaussian posteriors. However, Ghosal (2000, Sec. 4) specializes his results to our setting (D) with $\sigma_n^2 = 1/n$ and notes that one can choose priors, in general not Gaussian, so that the posterior distribution centered by the MLE is approximately Gaussian if $p^3(\log p)/n \to 0$.

In survival analysis, Bernstein-von Mises theorems are established for the cumulative hazard function by Kim and Lee (2004), and for the cumulative hazard and a fixed dimensional covariate regression parameter in a proportional hazards model by Kim (2006).

Boucheron and Gassiat (2009) develop a Bernstein-von Mises theorem for discrete probability distributions of growing dimension, and consider applications to functionals such as Shannon and Rényi entropies.

In a semiparametric setting, where a finite dimensional parameter of interest can be separated from an infinite dimensional nuisance parameter, Castillo (2008) obtains conditions leading to a Bernstein-von Mises theorem for the parametric part, clarifying earlier work of Shen (2002).

Rivoirard and Rousseau (2009) give conditions under which Bernstein-von Mises holds for linear functionals of a nonparametrically specified probability density function.

## Acknowledgments

This work was supported in part by NIH grant R01 EB 001988 and NSF grant DMS 0906812.

## References

• Borwanker J, Kallianpur G, Prakasa Rao BLS. The Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist. 1971;42:1241–1253.
• Boucheron S, Gassiat E. A Bernstein-von Mises theorem for discrete probability distributions. Electron. J. Stat. 2009;3:114–148.
• Castillo I. A semiparametric Bernstein-von Mises theorem. 2008. submitted.
• Cox DD. An analysis of Bayesian inference for nonparametric regression. Ann. Statist. 1993;21(2):903–923.
• Freedman D. On the Bernstein-von Mises theorem with infinite dimensional parameters. Annals of Statistics. 1999;27:1119–1140.
• Ghosal S. Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 1997;6(3):332–348.
• Ghosal S. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli. 1999;5(2):315–331.
• Ghosal S. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 2000;74(1):49–68.
• Ghosal S. The Dirichlet process, related priors and posterior asymptotics. In: Hjort NL, Holmes C, Müller P, Walker SG, editors. Bayesian Nonparametrics. Cambridge University Press; 2010. chapter 2.
• Ghosal S, van der Vaart A. Theory of Nonparametric Bayesian Inference. Cambridge University Press; 2010. in preparation.
• Ghosh JK, Ramamoorthi RV. Bayesian Nonparametrics. Springer-Verlag; New York: 2003. (Springer Series in Statistics).
• Heyde CC, Johnstone IM. On asymptotic posterior normality for stochastic processes. J. Roy. Statist. Soc. Ser. B. 1979;41(2):184–189.
• Johnstone IM. Function estimation and Gaussian sequence models. 2010. Book manuscript at www-stat.stanford.edu.
• Kim Y. The Bernstein-von Mises theorem for the proportional hazard model. Ann. Statist. 2006;34(4):1678–1700.
• Kim Y, Lee J. A Bernstein-von Mises theorem in the nonparametric right-censoring model. Ann. Statist. 2004;32(4):1492–1512.
• Lehmann EL, Casella G. Springer Texts in Statistics. second edn Springer-Verlag; New York: 1998. Theory of Point Estimation.
• Pinsker M. Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission. 1980;16:120–133. Originally in Russian: Problemy Peredachi Informatsii 16:52–68.
• Rivoirard V, Rousseau J. Bernstein von Mises theorem for linear functionals of the density. 2009. submitted.
• Shen X. Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc. 2002;97(457):222–235.
• van der Vaart AW. Asymptotic statistics, Vol. 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press; Cambridge: 1998.
