Inst Math Stat Collect. Author manuscript; available in PMC 2011 January 1.
Published in final edited form as:
Inst Math Stat Collect. 2010; 6: 87–98.
doi:  10.1214/10-IMSCOLL607
PMCID: PMC2990974

High dimensional Bernstein-von Mises: simple examples


In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension p increases with sample size n: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.

Keywords: high dimensional inference, Gaussian sequence, linear functional, squared error loss, posterior distribution, frequentist

The Bernstein-von Mises theorem is a formalization of conditions under which Bayesian posterior credible intervals agree approximately with frequentist confidence intervals constructed from likelihood theory. It is traditionally formulated in situations in which the number of parameters p is fixed and the sample size n → ∞. The situation is very different in high dimensional settings in which p is allowed to grow with n. In this primarily expository paper, we use simple Gaussian sequence models to draw some conclusions about when a version of Bernstein-von Mises can hold.

We begin with a somewhat informal statement of the classical theorem. Suppose that Y1, …, Yn are i.i.d. observations from a distribution Pθ having density pθ(y)dμ(y) where $\theta \in \Theta \subset \mathbb{R}^p$. The log-likelihood for a single observation


and, as usual, the score function vector and Fisher information matrix are given by


Writing Yn = (Y1, …, Yn) for the full data, the log-likelihood


and we write $\hat\theta_{\mathrm{MLE}}$ for a maximizer of Ln(θ). Classical likelihood theory says that any (nice) estimator satisfies the information bound


in the usual ordering of nonnegative definite matrices, and that the bound is asymptotically attained by the MLE, which is also asymptotically Gaussian:


Now suppose that π(θ) is the density of a prior distribution with respect to Lebesgue measure. Then the posterior distribution of θ given Yn is given by Bayes' rule; we denote it simply by Pθ|Yn.

The Bernstein-von Mises theorem says, informally, that this posterior distribution is, in large samples, approximately normal with mean approximately the MLE $\hat\theta_{\mathrm{MLE}}$ and variance matrix approximately $n^{-1} I_{\theta_0}^{-1}$ (here θ0 is the `true' value of θ generating the observations Y1, …, Yn). Using the scalar case for simplicity, and writing $\sigma_n^2 = n^{-1} I_{\theta_0}^{-1}$ and $z_\alpha = \tilde\Phi^{-1}(\alpha)$, an approximate 100(1 – α)% credible interval for θ is given by $\hat\theta_{\mathrm{MLE}} \pm z_{\alpha/2}\,\hat\sigma_n$. This is exactly the same as the frequentist confidence interval based on asymptotic normality of the MLE. Thus in large samples the effect of the prior density π disappears: “the data overwhelms the prior”.

A somewhat more formal statement uses the notion of variation distance between probability measures P and Q, and an equivalent expression in terms of the densities p = dP/dμ and q = dQ/dμ relative to a dominating measure μ:


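If π(θ) is continuous and positive at the `true' value θ0, and θ → Pθ is differentiable in quadratic mean and satisfies a further mild separation condition, then

$$ \left\| P_{\theta \mid Y_n} - N\!\left(\hat\theta_{\mathrm{MLE}},\; n^{-1} I_{\theta_0}^{-1}\right) \right\| \to 0 \qquad (1) $$

in probability under $P_{\theta_0}^n$.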

In other words, the variation distance between posterior and the approximating Gaussian distribution is a random variable depending on Yn, and which converges to zero in probability under repeated draws from Pθ0.

A development of the Bernstein-von Mises theorem as formulated above may be found in van der Vaart (1998, §10.2). A proof due to Bickel is given in Lehmann and Casella (1998, §6.8). Extensions from independent to dependent sampling settings are possible; see e.g. Borwanker et al. (1971) and Heyde and Johnstone (1979). For further references and methods of proof of the classical results, see Ghosh and Ramamoorthi (2003, §1.4 and §1.5).

1. Growing Gaussian location model

In nonparametric and semiparametric settings the situation is very different. Even frequentist consistency of nonparametric Bayesian methods is a difficult issue with a large literature of both positive and negative results (e.g. Ghosh and Ramamoorthi, 2003; Ghosal and van der Vaart, 2010). One cannot therefore expect Bernstein-von Mises phenomena in any great generality for the full posterior.

In this largely expository paper, we do some simple calculations in symmetric Gaussian sequence models. The Gaussian sequence structure makes possible an elementary set of examples that avoid the technical challenges posed by, and sophistication needed for, posterior Gaussian approximation in high dimensional settings (see references in Section 5). Nevertheless, the Gaussian examples can conveniently illustrate some of the issues related to validity of the Bernstein-von Mises theorem in high dimensional models. Depending on the frequentist or Bayesian perspective, we assume that p = p(n) grows with n, and one, or both, of

$$ Y \mid \theta \sim N_p(\theta,\; \sigma_n^2 I), \qquad (D) $$
$$ \theta \sim N_p(0,\; \tau_n^2 I). \qquad (P) $$
The notation Y suggests an average $(Y_1 + \cdots + Y_n)/n$ of observations individually of variance $\sigma_0^2$, so that in this case $\sigma_n^2 = \sigma_0^2/n$. [If p were held fixed, not depending on n, then $\sigma_n^2$ would match with the definition given in the introductory section.] We also allow the prior variance $\tau_n^2$ to depend on the sample size n.

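Our goal is to compare the Bayesian posterior distribution $\mathcal{L}(\theta \mid Y)$ with frequentist distributions, in particular those of the MLE $\mathcal{L}(\hat\theta_{\mathrm{MLE}} \mid \theta)$ and of the posterior mean Bayes estimator $\mathcal{L}(\hat\theta_B \mid \theta)$. A key simplification is that since both prior and likelihood are Gaussian, so also is the posterior distribution, and hence all the behavior will be determined by centering and scaling. Thus from standard results, the posterior is given by

$$ \theta \mid Y = y \;\sim\; N_p\!\left(w_n y,\; w_n \sigma_n^2 I\right), \qquad w_n = \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2}. \qquad (2) $$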
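The conjugate Gaussian update can be sketched in a few lines. This is a minimal illustration, assuming the model Y | θ ~ N_p(θ, σ_n² I) with prior θ ~ N_p(0, τ_n² I); the function name `posterior_params` and the numerical inputs are hypothetical and are not taken from the paper.

```python
def posterior_params(y, sigma2, tau2):
    """Posterior of theta given Y = y in the Gaussian location model.

    Assumes Y | theta ~ N_p(theta, sigma2 * I) and theta ~ N_p(0, tau2 * I).
    Returns the shrinkage factor w, the posterior mean vector w * y,
    and the common posterior variance w * sigma2 of each coordinate.
    """
    w = tau2 / (sigma2 + tau2)          # shrinkage factor w_n
    post_mean = [w * yk for yk in y]    # posterior mean = Bayes estimator
    post_var = w * sigma2               # same variance in every coordinate
    return w, post_mean, post_var


# Hypothetical numbers, chosen only for illustration.
w, post_mean, post_var = posterior_params([2.0, -1.0], sigma2=1.0, tau2=3.0)
print(w, post_mean, post_var)
```

The posterior variance w·σ² equals the harmonic combination 1/(1/σ² + 1/τ²), so the posterior is always tighter than the sampling distribution of the MLE unless w is close to 1.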
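Remarks: 1. The reference to Gaussian sequence models becomes clearer if, as will be helpful later, we write out assumptions (D) and (P) in co-ordinates:

$$ y_k = \theta_k + \sigma_n \varepsilon_k \quad (D_{seq}), \qquad \theta_k = \tau_n \zeta_k \quad (P_{seq}), $$

with $\varepsilon_k$ and $\zeta_k$ all i.i.d. standard Gaussian, for k = 1, …, p(n).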

Strictly speaking, the indexing by n of parameters σn, τn and p(n) creates a sequence of sequence models. However, one can, as needed for almost sure results, think of the infinite sequences $\{(\varepsilon_k, \zeta_k),\ k \in \mathbb{N}\}$ as being drawn from a single common probability space.

2. We also consider the infinite sequence Gaussian white noise model


or equivalently, when expressed in any orthonormal basis $\{\varphi_k(t)\}$ for L2[0, 1],


where it is assumed that $\sigma_n = \sigma_0/\sqrt{n}$. For some examples, it is helpful to use doubly indexed orthonormal bases $\{\varphi_{jk}(t),\ k = 1, \ldots, 2^j,\ j \in \mathbb{N}\}$ such as arise with systems of orthonormal wavelets.

The forthcoming book Johnstone (2010) will have more on estimation in such Gaussian sequence models.

We develop three perspectives on the Bernstein-von Mises phenomenon:

  1. global convergence of the posterior,
  2. behavior of a non-linear functional $\|\theta - \hat\theta\|^2$, and of
  3. linear functionals Lf, in the white noise model (3) – (4).

We shall see that these situations are progressively “less demanding” in terms of validity of the Bernstein-von Mises phenomenon. Indeed, case (1) requires that wn → 1 at a sufficiently fast rate, while setting (2) needs only wn → 1. In case (3), the formulation itself delivers wn → 1, and covers at least all bounded linear functionals.

2. Global convergence of posterior

The first calculation considers the p-dimensional posterior distribution (2) and shows that the convergence in (1) occurs, even in the best possible case that θ0 = 0, only if the shrinkage factor wn approaches 1 at a sufficiently fast rate.

Proposition 1

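Let θ0 = 0. The variation distance between the posterior distribution $P_{\theta \mid Y_n}$ and $N(\hat\theta_{\mathrm{MLE}}, n^{-1} I_{\theta_0}^{-1})$ converges to zero in $P_{\theta_0}$-probability if and only if $\sqrt{p}\,\sigma_n^2/\tau_n^2 \to 0$, or equivalently, if

$$ \sqrt{p}\,(1 - w_n) \to 0. $$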
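PROOF. We introduce notation $P_{y,n}(d\theta)$ for the posterior distribution of θ|Yn = y and $Q_{y,n}(d\theta)$ for the distribution centered at $\hat\theta_{\mathrm{MLE}} = y$. Thus

$$ P_{y,n} = N_p(w_n y,\; w_n \sigma_n^2 I), \qquad Q_{y,n} = N_p(y,\; \sigma_n^2 I). $$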
Let $\rho(P, Q) = \int \sqrt{pq}\, d\mu$ denote the Hellinger affinity between two probability measures P, Q having densities p, q with respect to a common dominating measure μ. We recall an elementary bound (van der Vaart, 1998, p. 212) for variation distance in terms of Hellinger distance and hence Hellinger affinity:


Thus $\|P_{y,n} - Q_{y,n}\| \to 0$ if and only if $\rho(P_{y,n}, Q_{y,n}) \to 1$. We recall also that affinity commutes with products:


An elementary calculation shows that


When applied to $P_{y,n}$ and $Q_{y,n}$, we set $\theta_{1i} = w_n y_i$, $\theta_{2i} = y_i$ and $\sigma_1^2 = \sigma_n^2 w_n$, $\sigma_2^2 = \sigma_n^2$ to obtain


Introduce $r_n = \sigma_n^2/\tau_n^2 = w_n^{-1} - 1$. Suppose first that $p r_n^2 \to 0$. Since $w_n = (1 + r_n)^{-1}$, we have


for $r_n \le 1/2$, say. When p → ∞, we have with probability tending to one that $\|y\|^2/\sigma_n^2 < 2p$, and so for $r_n \le 1/2$,


Consequently, when $p r_n^2 \to 0$,


Suppose now that $p r_n^2$ does not approach 0. Again with probability tending to one, $\|y\|^2/\sigma_n^2 > p/2$, and since $\tfrac12\left(w_n^{1/2} + w_n^{-1/2}\right) > 1$, we have from (9) that


which cannot converge to zero if $p r_n^2$ does not.
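The dichotomy in the proof can be checked numerically. The sketch below assumes the standard closed form for the Hellinger affinity between two univariate Gaussians, ρ = sqrt(2σ1σ2/(σ1² + σ2²)) · exp(−(θ1 − θ2)²/(4(σ1² + σ2²))), multiplies it over coordinates, and takes σ_n = 1 with θ0 = 0. The function names, the seed and the values of p and r_n are hypothetical choices made for illustration.

```python
import math
import random


def gaussian_affinity(m1, v1, m2, v2):
    """Hellinger affinity between N(m1, v1) and N(m2, v2); v1, v2 are variances."""
    scale = math.sqrt(2.0 * math.sqrt(v1 * v2) / (v1 + v2))
    return scale * math.exp(-((m1 - m2) ** 2) / (4.0 * (v1 + v2)))


def posterior_vs_mle_affinity(y, sigma2, tau2):
    """Affinity between N_p(w*y, w*sigma2*I) and N_p(y, sigma2*I).

    Affinity of a product measure is the product of coordinate affinities,
    accumulated here on the log scale for numerical stability.
    """
    w = tau2 / (sigma2 + tau2)
    log_rho = 0.0
    for yk in y:
        log_rho += math.log(gaussian_affinity(w * yk, w * sigma2, yk, sigma2))
    return math.exp(log_rho)


def simulate(p, r, seed=0):
    """Draw y ~ N_p(0, I) (theta_0 = 0, sigma_n = 1) with sigma_n^2 / tau_n^2 = r."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(p)]
    return posterior_vs_mle_affinity(y, 1.0, 1.0 / r)


p = 10000
print("p * r^2 = 0.01:", simulate(p, 0.1 / math.sqrt(p)))   # affinity close to 1
print("p * r^2 = 100 :", simulate(p, 10.0 / math.sqrt(p)))  # affinity close to 0
```

With p r_n² small the affinity stays near 1, so the variation distance is small; with p r_n² large the affinity collapses even though r_n itself is only 0.1.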

Remark. If θ0 = θ0n ≠ 0, so that the data mean differs from the prior mean, then the rate condition is replaced by


Example. We illustrate the result by considering estimation in the Gaussian white noise model (3). When expressed in a suitable orthonormal basis of wavelets, we obtain $y_{jk} \overset{ind}{\sim} N(\theta_{jk}, \sigma_n^2)$, for k = 1, …, $2^j$, and $j \in \mathbb{N}$. Pinsker's theorem (Pinsker, 1980) describes the minimax linear estimator of f, or equivalently of (θjk), under squared error loss when it is assumed that f has α mean square derivatives, and shows that such minimax linear estimators are asymptotically minimax among all estimators as σn → 0.

Pinsker's estimator is necessarily posterior mean Bayes for a corresponding Gaussian prior. The mean square differentiability condition can be equivalently expressed in terms of the coefficients as


and the corresponding least favorable Gaussian prior puts


where $\mu_n = c_{\alpha n}(Cn)^{2\alpha/(2\alpha+1)}$. The constant $c_{\alpha n}$ is bounded above and below by constants independent of n, whose precise values are unimportant here; for further details see Johnstone (2010).

We consider the validity of the Bernstein-von Mises phenomenon for the collection of coefficients $\{\theta_{jk},\ k = 1, \ldots, 2^j\}$ at a given level j = j(n), possibly fixed, or possibly varying with n.

The prior variances $\tau_j^2$ decrease with j, and vanish above a “critical level” j* = j*(α, C; n). Since j* ~ (2/(2α+1)) log(Cn) grows with n, so does the number of parameters $\theta_{j^*,k}$ at the critical level. From (10), we conclude that


and hence that $w_n \le 1 - 2^{-\alpha}$ does not approach 1, so that the condition of Proposition 1 fails.

On the other hand, at a fixed level j0, we have $p = 2^{j_0}$ fixed and $\tau_{j_0}^2/\sigma_n^2 = \mu_n 2^{-j_0\alpha} - 1$, so that $\sqrt{p}\,\sigma_n^2/\tau_{j_0}^2 \to 0$ and so Proposition 1 applies. Thus we may say informally that the Bernstein-von Mises phenomenon holds at a fixed level but fails at the critical level.

3. Behavior of the squared loss

In this section, we pay homage to a remarkable paper by Freedman (1999), itself stimulated by Cox (1993), which sets out the failure of the Bernstein-von Mises theorem in a simple sequence model of function estimation in Gaussian white noise. To further simplify the calculations, we use the growing Gaussian location model (D), (P), yielding results parallel to, but not identical with, Freedman's. Hence, define

$$ T_n = \|\theta - \hat\theta_B\|^2. $$
The posterior distribution of θ|Y is described by (2); in particular the shrinkage factor $w_n = \tau_n^2/(\sigma_n^2 + \tau_n^2)$ again plays a critical role.

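Theorem 2 (Bayesian) The posterior distribution $\mathcal{L}(T_n \mid Y)$ is given by

$$ T_n = C_n + D_n Z_{1n}, \qquad C_n = p\,\sigma_n^2 w_n, \qquad D_n = \sqrt{2p}\,\sigma_n^2 w_n, $$

and the random variable Z1n has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞.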

Proof. From (2), the posterior distribution of Tn given Y is $\sigma_n^2 w_n \chi^2_{(p)}$ and in particular it is free of Y. Hence we have the representation


and the theorem follows because $(\chi^2_p - p)/\sqrt{2p} \Rightarrow N(0, 1)$ as p → ∞.
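A small Monte Carlo check of the chi-square representation is sketched below. It assumes that, under the posterior, the coordinates of θ minus the posterior mean are i.i.d. N(0, σ_n² w_n), so that T_n has mean p σ_n² w_n and standard deviation sqrt(2p) σ_n² w_n. The function names, seed, and parameter values are hypothetical.

```python
import math
import random


def theoretical_center_scale(p, sigma2, tau2):
    """Mean and SD of sigma2 * w * chi^2_p, the posterior law of T_n."""
    w = tau2 / (sigma2 + tau2)
    return p * sigma2 * w, math.sqrt(2.0 * p) * sigma2 * w


def simulate_posterior_loss(p, sigma2, tau2, reps=4000, seed=1):
    """Simulate T_n = ||theta - posterior mean||^2 under the posterior.

    The posterior law of theta minus its mean does not depend on y,
    so no data vector needs to be generated.
    """
    rng = random.Random(seed)
    w = tau2 / (sigma2 + tau2)
    sd = math.sqrt(w * sigma2)
    draws = [sum(rng.gauss(0.0, sd) ** 2 for _ in range(p)) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((t - mean) ** 2 for t in draws) / (reps - 1)
    return mean, math.sqrt(var)


sim_mean, sim_sd = simulate_posterior_loss(p=50, sigma2=1.0, tau2=4.0)
c_n, d_n = theoretical_center_scale(p=50, sigma2=1.0, tau2=4.0)
print("simulated  :", sim_mean, sim_sd)
print("theoretical:", c_n, d_n)
```

The simulated centre and spread should agree with the theoretical values up to Monte Carlo error, which shrinks as `reps` grows.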

Turn now to the frequentist perspective, in which θ is a fixed and unknown (sequence of) parameters. We will therefore use the decomposition $y_k = \theta_k + \sigma_n \varepsilon_k$ with $\varepsilon_k \overset{iid}{\sim} N(0, 1)$, cf. (Dseq) above. Since $\hat\theta_{B,k} = w_n y_k$ we have


Some of the conclusions will be valid only for “most” θ: to formulate this it is useful to give θ a distribution. The natural one to use is (P), despite the possible confusion arising because, for the frequentist, this is not an a priori law!

Theorem 3 (Frequentist) The conditional distribution $\mathcal{L}(T_n \mid \theta)$ is given by


where Cn is as in Theorem 2, while Z3n (θ, ε) has mean 0 and variance 1.

If θ is distributed according to (P), then Z2n(θ) has mean 0, variance 1 and converges in distribution to N(0, 1) as n → ∞. In addition, if $w_n \to w = 1 - \cos\omega$,




Formulas (15) and (16) hold as n → ∞, for almost all θ's generated from (P).

Proof. Using (13), and $(1 - w_n)^2 \tau_n^2 + w_n^2 \sigma_n^2 = \sigma_n^2 w_n$, we may write




This leads immediately to the representation (14) after observing that $\tau_n^2 (1 - w_n) = \sigma_n^2 w_n$ and setting


Turning to the final assertions, we may rewrite




Using again $\sigma_n^2 w_n = \tau_n^2 (1 - w_n)$, we have


For almost all θ's generated from (P), $p^{-1} \sum_{k=1}^{p} (\theta_k/\tau_n)^2 \to 1$, and since Gn (θ) = G1n(θ) + G2n, (15) follows.

Clearly Z4n(θ) ~ N(0, 1), free of θ, while $Z_{5n} \Rightarrow N(0, 1)$ and so (16) follows.

Remark. The doctrinaire frequentist would not contemplate the joint distribution of (θ, Y) in (D, P); but anyone else would observe that in that joint distribution, $T_n \sim \sigma_n^2 w_n \chi^2_{(p)}$, as follows easily in two ways, either from the proof of Theorem 2, or from (17).

The Bernstein-von Mises theorem fails if lim wn = w < 1, as may be seen in Figure 1. For the Bayesian, conditional on Y, $\theta - \hat\theta_B$ is a noise vector, and Theorem 2 says that the distribution of $\|\theta - \hat\theta_B\|^2$ is approximately normal with mean Cn and standard deviation Dn. For the frequentist, $E[\hat\theta_B \mid \theta] = w_n \theta$ is biased (also asymptotically), and some of $\|\hat\theta_B - \theta\|^2$ comes from this bias. As a result, Theorem 3 says that, conditional on θ, $\|\hat\theta_B - \theta\|^2$ is approximately normal with mean $C_n + F_n Z_{2n}(\theta)$ and standard deviation Gn(θ). Comparing (12) and (15) shows that the frequentist SD is smaller than the Bayesian: $G_n(\theta) < D_n$.

Fig 1
The top panel, for the Bayesian, has $\theta - \hat\theta_B$ as a noise vector, and the posterior distribution $\mathcal{L}(T \mid Y)$ approximately N(Cn, Dn). The bottom panel, for the frequentist, shows the effect of the bias of E[$\hat\theta_B$ ...

Under the assumption (P), $\theta_i \overset{iid}{\sim} N(0, \tau_n^2)$, the `wobble' in the frequentist mean can be arbitrarily large relative to Dn: from the law of the iterated logarithm, with probability one



By contrast, if lim wn = 1, then the wobble disappears: Fn=o(Dn) and the Bayesian SD equals the frequentist SD asymptotically: Gn(θ)~Dn.

4. Linear Functionals

We turn now to the least demanding of our three scenarios for the Bernstein-von Mises theorem: the behavior of linear functionals. We change the setting slightly to the infinite sequence Gaussian white noise model (3). We consider linear functionals Lf such as integrals $\int_B f$ or derivatives $f^{(r)}(t_0)$: if f has expansion $f(t) = \sum_k \theta_k \varphi_k(t)$, then on setting $a_k = L\varphi_k$, we have

$$ Lf = \sum_k a_k \theta_k. $$
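Again, for maximum simplicity, we consider Gaussian priors on the coefficients:

$$ \theta_k \overset{ind}{\sim} N(0, \tau_k^2), \qquad k \in \mathbb{N}. \qquad (18) $$

In order that $\sum \theta_k^2 < \infty$ with probability 1, it is necessary and sufficient that $\sum \tau_k^2 < \infty$.

Consequently, the posterior laws are Gaussian

$$ \theta_k \mid y_k \overset{ind}{\sim} N\!\left(w_{kn} y_k,\; \sigma_n^2 w_{kn}\right), $$

again with

$$ w_{kn} = \frac{\tau_k^2}{\sigma_n^2 + \tau_k^2}, \qquad (19) $$

so that the posterior mean estimate is

$$ \widehat{Lf}_n = \sum_k a_k w_{kn} y_k. $$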
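Centering at posterior mean. For the Bayesian, the posterior distribution is

$$ Lf \mid y \sim N\!\left(\widehat{Lf}_n,\; V_{yn}\right), \qquad V_{yn} = \sigma_n^2 \sum_k a_k^2 w_{kn}, $$

while for the frequentist, the conditional distribution is

$$ \widehat{Lf}_n \mid f \sim N\!\Big(\sum_k a_k w_{kn} \theta_k,\; V_{Fn}\Big), \qquad V_{Fn} = \sigma_n^2 \sum_k a_k^2 w_{kn}^2. $$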
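The Bayesian might use 100(1 − α)% posterior credible intervals of the form $\widehat{Lf}_n \pm z_{\alpha/2}\sqrt{V_{yn}}$, while the frequentist might employ 100(1 − α)% confidence intervals $\widehat{Lf}_n \pm z_{\alpha/2}\sqrt{V_{Fn}}$. This leads us to consider the variance ratio

$$ \frac{V_{Fn}}{V_{yn}} = \frac{\sum_k a_k^2 w_{kn}^2}{\sum_k a_k^2 w_{kn}} \le 1, \qquad (20) $$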
from which we see that the frequentist intervals are narrower, because the frequentist bias $E_f \widehat{Lf}_n - Lf$ is being ignored for now, along with the attendant implications for coverage (but see below).

As sample size n → ∞, the noise variance $\sigma_n^2 \to 0$ and so for a given Gaussian prior (18), the weights (19) converge marginally: wkn → 1 for each fixed k. This alone does not imply convergence of the variance ratio VFn/Vyn → 1, as a later example shows. A sufficient condition is that the linear functional Lf be bounded (as a mapping from L2[0, 1] to R). This amounts to saying that Lf has the representation $Lf = \int_0^1 a(t) f(t)\,dt$ with $\int a^2(t)\,dt < \infty$, or equivalently, in sequence terms, that $\sum a_k^2 < \infty$.
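The contrast between bounded and unbounded functionals can be illustrated numerically. The sketch below rests on assumptions that are not spelled out in this text: it takes the variance ratio to be Σ a_k² w_kn² / Σ a_k² w_kn with w_kn = τ_k²/(σ_n² + τ_k²), uses prior variances τ_k² = k^(−2m) with m = 2, truncates the sums at a hypothetical cutoff K, and compares a square-summable sequence a_k² = k^(−2) with a constant sequence a_k² = 1 (a stand-in for point evaluation, r = 0, with constants dropped since they cancel in the ratio).

```python
def variance_ratio(a_sq, tau_sq, sigma2, K):
    """Truncated ratio sum(a_k^2 w_k^2) / sum(a_k^2 w_k) for k = 1..K.

    a_sq(k) returns a_k^2, tau_sq(k) returns the prior variance tau_k^2,
    and w_k = tau_k^2 / (sigma2 + tau_k^2) is the coordinate shrinkage weight.
    """
    num = 0.0
    den = 0.0
    for k in range(1, K + 1):
        t2 = tau_sq(k)
        w = t2 / (sigma2 + t2)
        num += a_sq(k) * w * w
        den += a_sq(k) * w
    return num / den


m = 2                                    # hypothetical prior smoothness index
tau_sq = lambda k: k ** (-2.0 * m)       # tau_k^2 = k^(-2m)
bounded = lambda k: k ** -2.0            # sum of a_k^2 is finite
unbounded = lambda k: 1.0                # sum of a_k^2 diverges

for sigma2 in (1e-4, 1e-6, 1e-8):
    rb = variance_ratio(bounded, tau_sq, sigma2, K=100000)
    ru = variance_ratio(unbounded, tau_sq, sigma2, K=100000)
    print(sigma2, rb, ru)
```

Under these assumptions the ratio for the bounded sequence drifts toward 1 as σ_n² decreases, while the ratio for the constant sequence stays bounded away from 1.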

Proposition 4

Let $P_f^n$ denote the measure corresponding to (3). If Lf = ∫ af is a bounded linear functional, then the variation distance between Bayesian and frequentist distributions converges to zero:


Proof. We again use the Hellinger affinity (7) and apply (8) to the laws $P = N(\widehat{Lf}_n, V_{yn})$ and $Q = N(\widehat{Lf}_n, V_{Fn})$ to obtain


In view of (7), the merging in (21) occurs if and only if


When $\sum a_k^2 < \infty$, this convergence follows from (20) and the dominated convergence theorem.

Remarks. 1. Examples of bounded functionals include polynomials $a(t) = \sum_{k=0}^{K} c_k t^k$ and “regions of interest” a(t) = I{t ∈ B}.

2. Examples of unbounded functionals are given by evaluation of a function (or its derivatives) at a point: $Lf = f^{(r)}(t_0)$. We shall see that the variance ratio does not converge to 1, and so the Bernstein-von Mises theorem fails. Indeed, in the Fourier basis


we find that $a_k = L\varphi_k = d^r \varphi_k(t_0)/dt^r$, and an easy calculation shows that $a_{2k-1}^2 + a_{2k}^2 = 2(2\pi k)^{2r}$. We use a Gaussian prior (18) with $\tau_{2k-1}^2 = \tau_{2k}^2 = k^{-2m}$ and 2m > 2r + 1. It follows from (19) that, after writing V1n and V2n for Vyn and VFn respectively, we have


As λ → 0, sums of the form


with $\kappa = \kappa(p, r; q) = \int_0^\infty v^p (1 + v^q)^{-r}\,dv = \Gamma(r - \mu)\Gamma(\mu)/(q\,\Gamma(r))$ and $\mu = (p+1)/q$. In the present case, with p = 2r, q = 2m and r = j, we conclude that


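The Gamma-function expression for κ quoted above can be sanity-checked by direct quadrature. This is a minimal sketch using only the standard library: the integral over (0, ∞) is mapped to (0, 1) by the substitution v = t/(1 − t) and approximated by a midpoint rule. The function names, the number of nodes and the test values of (p, r, q) are hypothetical; convergence of the integral requires μ = (p + 1)/q < r.

```python
import math


def kappa_closed_form(p, r, q):
    """Gamma(r - mu) * Gamma(mu) / (q * Gamma(r)) with mu = (p + 1) / q."""
    mu = (p + 1.0) / q
    return math.gamma(r - mu) * math.gamma(mu) / (q * math.gamma(r))


def kappa_numeric(p, r, q, n=200000):
    """Midpoint-rule approximation of the integral of v^p (1 + v^q)^(-r) over (0, inf).

    Uses the substitution v = t / (1 - t), dv = dt / (1 - t)^2, t in (0, 1).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        v = t / (1.0 - t)
        total += v ** p * (1.0 + v ** q) ** (-r) / (1.0 - t) ** 2
    return total * h


for args in ((2, 2, 6), (0, 1, 4), (0, 2, 4)):
    print(args, kappa_closed_form(*args), kappa_numeric(*args))
```

The two columns should agree to many digits for each parameter triple, which supports reading the garbled expression as the Beta-integral identity.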
Centering at the MLE. For a bounded linear functional, the MLE $\widehat{Lf}_M = \sum_k a_k y_k$ is well defined and unbiased, with mean $E(\widehat{Lf}_M) = Lf$ and frequentist variance $V_{Mn} = \mathrm{Var}(\widehat{Lf}_M) = \sigma_n^2 \sum_k a_k^2$. A frequentist might prefer to use 100(1 − α)% intervals $\widehat{Lf}_M \pm z_{\alpha/2}\sqrt{V_{Mn}}$, which will have the correct coverage property. However, extra conditions are required for the Bernstein-von Mises result to hold in this case.

Proposition 5

Assume that Lf = ∫ af is a bounded linear functional. Suppose also that the coefficients $\theta_k = \langle f, \varphi_k \rangle$ of the `true' f, and the variances $\tau_k^2$ of the Gaussian prior, together satisfy $\sum |a_k \theta_k|/\tau_k < \infty$. Then the distance between Bayesian and frequentist distributions


Proof. The argument is a slight elaboration of that used in the previous proposition. We use (7) and $P = N(\widehat{Lf}_n, V_{yn})$ as before, but now $Q = N(\widehat{Lf}_M, V_{Mn})$ and (8) yields


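As before, $V_{yn}/V_{Mn} = \sum a_k^2 w_{kn} / \sum a_k^2 \to 1$ by dominated convergence. Using this and the expression $V_{Mn} = \sigma_n^2 \sum a_k^2$, and in view of the bounds (7), the conclusion (22) is equivalent to $\sigma_n^{-1}(\widehat{Lf}_M - \widehat{Lf}_n) \to 0$ in $P_f^n$-probability. We may write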


The stochastic term has mean 0 and variance $\sum a_k^2 (1 - w_{kn})^2 \to 0$, again by dominated convergence. Thus we may focus on the deterministic term, and note that the merging in (22) occurs if and only if


The bound $\sigma_n \tau_k/(\sigma_n^2 + \tau_k^2) \le 1/2$ along with the dominated convergence theorem then shows that $\sum |a_k \theta_k| \tau_k^{-1} < \infty$ is a sufficient condition for (22) as claimed.

5. Related work

As remarked earlier, this paper avoids the important Gaussian approximation part of the Bernstein-von Mises phenomenon by focusing on examples with Gaussian likelihoods and priors. A growing literature addresses the approximation challenges; we give a brief listing here, and refer to the books (Ghosh and Ramamoorthi, 2003; Ghosal and van der Vaart, 2010) and the survey discussion in Ghosal (2010, §2.7) for more detailed discussion.

Ghosal (1997, 1999, 2000) developed posterior normality results for the full posterior in cases where the dimension of the parameter space increases sufficiently slowly. In each case, the emphasis is on conditions under which a non-Gaussian likelihood and appropriate prior sequence can yield approximate Gaussian posteriors. However, Ghosal (2000, Sec. 4) specializes his results to our setting (D) with $\sigma_n^2 = 1/n$ and notes that one can choose priors, in general not Gaussian, so that the posterior distribution centered by the MLE is approximately Gaussian if $p^3 (\log p)/n \to 0$.

In survival analysis, Bernstein-von Mises theorems for the cumulative hazard function are established by Kim and Lee (2004) and for the cumulative hazard and fixed dimensional covariate regression parameter in a proportional hazards model in Kim (2006).

Boucheron and Gassiat (2009) develop a Bernstein-von Mises theorem for discrete probability distributions of growing dimension, and consider application to functionals such as Shannon and Renyi entropies.

In a semiparametric setting, where a finite dimensional parameter of interest can be separated from an infinite dimensional nuisance parameter, Castillo (2008) obtains conditions leading to a Bernstein-von Mises theorem on the parametric part, clarifying an earlier work of Shen (2002).

Rivoirard and Rousseau (2009) give conditions under which Bernstein-von Mises holds for linear functionals of a nonparametrically specified probability density function.


This work was supported in part by NIH grant RO1 EB 001988 and NSF DMS 0906812.


  • Borwanker J, Kallianpur G, Prakasa Rao BLS. The Bernstein-von Mises theorem for Markov processes. Ann. Math. Statist. 1971;42:1241–1253.
  • Boucheron S, Gassiat E. A Bernstein-von Mises theorem for discrete probability distributions. Electron. J. Stat. 2009;3:114–148.
  • Castillo I. A semiparametric Bernstein-von Mises theorem. 2008. submitted.
  • Cox DD. An analysis of Bayesian inference for nonparametric regression. Ann. Statist. 1993;21(2):903–923.
  • Freedman D. On the Bernstein-von Mises theorem with infinite dimensional parameters. Annals of Statistics. 1999;27:1119–1140.
  • Ghosal S. Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 1997;6(3):332–348.
  • Ghosal S. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli. 1999;5(2):315–331.
  • Ghosal S. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 2000;74(1):49–68.
  • Ghosal S. The Dirichlet process, related priors and posterior asymptotics. In: Hjort NL, Holmes C, Müller P, Walker SG, editors. Bayesian Nonparametrics. Cambridge University Press; 2010. chapter 2.
  • Ghosal S, van der Vaart A. Theory of Nonparametric Bayesian Inference. Cambridge University Press; 2010. in preparation.
  • Ghosh JK, Ramamoorthi RV. Bayesian Nonparametrics. Springer-Verlag; New York: 2003. (Springer Series in Statistics).
  • Heyde CC, Johnstone IM. On asymptotic posterior normality for stochastic processes. J. Roy. Statist. Soc. Ser. B. 1979;41(2):184–189.
  • Johnstone IM. Function estimation and Gaussian sequence models. 2010. Book manuscript at
  • Kim Y. The Bernstein-von Mises theorem for the proportional hazard model. Ann. Statist. 2006;34(4):1678–1700.
  • Kim Y, Lee J. A Bernstein-von Mises theorem in the nonparametric right-censoring model. Ann. Statist. 2004;32(4):1492–1512.
  • Lehmann EL, Casella G. Theory of Point Estimation. 2nd ed. Springer-Verlag; New York: 1998. (Springer Texts in Statistics).
  • Pinsker M. Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission. 1980;16:120–133. Originally in Russian in Problemy Peredatsii Informatsii 16:52–68.
  • Rivoirard V, Rousseau J. Bernstein von Mises theorem for linear functionals of the density. 2009. submitted.
  • Shen X. Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc. 2002;97(457):222–235.
  • van der Vaart AW. Asymptotic statistics, Vol. 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press; Cambridge: 1998.