Home | About | Journals | Submit | Contact Us | Français |

**|**PLoS One**|**v.12(3); 2017**|**PMC5354282

Formats

Article sections

Authors

Related links

PLoS One. 2017; 12(3): e0173453.

Published online 2017 March 16. doi: 10.1371/journal.pone.0173453

PMCID: PMC5354282

Gui-Quan Sun, Editor^{}

Shanxi University, CHINA

**Conceptualization:**RH RP LMB.**Data curation:**RH.**Formal analysis:**RH RP LMB.**Investigation:**RH RP LMB.**Methodology:**RH RP LMB.**Project administration:**RH RP LMB.**Software:**RH.**Supervision:**RH RP LMB.**Validation:**RH RP LMB.**Visualization:**RH.**Writing – original draft:**RH RP LMB.**Writing – review & editing:**RH RP LMB.

* E-mail: ude.uso.tats@iebreh

Received 2016 November 7; Accepted 2017 February 22.

Copyright © 2017 Herbei et al

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

We examine the performance of a strategy for Markov chain Monte Carlo (MCMC) developed by simulating a discrete approximation to a stochastic differential equation (SDE). We refer to the approach as *diffusion MCMC*. A variety of motivations for the approach are reviewed in the context of Bayesian analysis. In particular, implementation of diffusion MCMC is very simple to set-up, even in the presence of nonlinear models and non-conjugate priors. Also, it requires comparatively little problem-specific tuning. We implement the algorithm and assess its performance for both a test case and a glaciological application. Our results demonstrate that in some settings, diffusion MCMC is a faster alternative to a general Metropolis-Hastings algorithm.

The advent of Markov Chain Monte Carlo (MCMC) has led to major advances in the application of Bayesian analysis in complex problems. The idea is simply put: faced with a posterior distribution too complicated to compute or simulate from directly (i.e., we cannot readily obtain the normalizer or denominator appearing in Bayes’ Theorem), one develops a Markov chain whose stationary distribution is known to coincide with the target posterior distribution. One then runs that chain, knowing that eventually realizations from the chain form an approximate dependent sample from the posterior. Those realizations are then used to estimate features of the posterior (i.e., posterior expectations of interesting quantities, predictive densities, etc.) [1–3].

For example, in some settings, nonlinearity and/or nonconjugacy of certain components of a large model render the standard Gibbs Sampler unusable. Metropolis-Hastings algorithms and Gibbs-Metropolis hybrids can be suggested, though these approaches can be taxing and may require substantial tuning.

In response to such difficulties, we explore diffusion based strategies for MCMC analysis. That is, one develops a diffusion (a solution, in the sense of Itô, to a stochastic differential equation) whose stationary distribution is the target posterior, see Chapter 5 of [4]. The key idea is certainly not new. Indeed, Langevin MCMC procedures are often suggested for generating candidate states in Metropolis steps in MCMC. In this article, we suggest diffusion MCMC as a stand-alone algorithm.

In part, our motivation for suggesting diffusion MCMC is its simplicity in terms of set-up. There are no probability calculations to perform, as in Gibbs’ Sampling, nor any need for choosing and updating distributions for generating candidate states. Indeed, the approach is recommended as an “off-the-shelf” strategy that can be readily implemented. However, as indicated below, it is not a panacea. Further, issues such as burn-in, mixing, convergence rates, and output analysis remain challenging.

Consider a Bayesian analysis for an unknown quantity *θ* (after the introduction, we allow vector-valued unknowns) based on observational data **Y**, having conditional density *g*(**y** *θ*). Let *π*(*θ*) denote our prior distribution for *θ*. We are to obtain the posterior distribution for *θ* based on the fixed observation **Y** = **y**,

$$\begin{array}{c}\hfill p\left(\theta \right)\phantom{\rule{0.166667em}{0ex}}\stackrel{\text{def}}{=}\phantom{\rule{0.166667em}{0ex}}p(\theta \mid \mathbf{y})=C{\left(\mathbf{y}\right)}^{-1}g(\mathbf{y}\mid \theta )\pi \left(\theta \right),\end{array}$$

(1)

where *C*(**y**) is the normalizing constant. Consider a one-dimensional stochastic differential equation (SDE)

(2)

where *θ*_{0} is some fixed initial value and the *drift*
*b*() and *diffusion*
*σ*() > 0 are specified functions, such that Eq (2) admits a unique weak solution. The initial state *θ*_{0} is a random variable with specified density *p*(*θ*, 0); and *dW*(*t*) represents *white noise*. Specifically, {*W*(*t*): *t* ≥ 0} is a standard Brownian motion process or a Wiener process. Consider the temporal evolution of the probability density function, *p*(*θ*, *t*), of a solution *θ*(*t*). Under regularity conditions requiring *b* and *σ* to be differentiable and satisfy a Lipschitz condition

|*b*(*θ*) - *b*(*θ*^{′})| + |*σ*(*θ*) - *σ*(*θ*^{′})| ≤ *K*|*θ* - *θ*^{′}|,

for some constant *K* and for all *θ*, *θ*′, *p*(*θ*, *t*) is the solution to the ordinary partial differential equation, known as the *Fokker-Planck* or *Kolmogorov Forward* Equation,

$$\begin{array}{c}\hfill \frac{\partial p}{\partial t}=0.50\phantom{\rule{0.166667em}{0ex}}\frac{{\partial}^{2}}{\partial {\theta}^{2}}\left({\sigma}^{2}\phantom{\rule{0.166667em}{0ex}}p\right)-\frac{\partial}{\partial \theta}\left(b\phantom{\rule{0.166667em}{0ex}}p\right),\end{array}$$

(3)

subject to the initial condition *p*(*θ*, 0) and the assumption that *θ*(0) and {*W*(*t*): *t* ≥ 0} are independent.

Our interest is in stationary solutions, i.e., solutions that are functionally independent of time, so the partial derivative with respect to *t* is zero. Setting the right-hand side of Eq (3) equal to zero and integrating the result once w.r.t. *θ*, it suffices to find a stationary density *p*(*θ*) such that

$$\begin{array}{c}\hfill 0.50\phantom{\rule{0.166667em}{0ex}}\frac{\partial}{\partial \theta}\left({\sigma}^{2}\left(\theta \right)\phantom{\rule{0.166667em}{0ex}}p\left(\theta \right)\right)=b\left(\theta \right)\phantom{\rule{0.166667em}{0ex}}p\left(\theta \right)\phantom{\rule{4pt}{0ex}},\end{array}$$

(4)

for all values of *θ* in the parameter space. The general solution of Eq (4) is

$$\begin{array}{c}\hfill p\left(\theta \right)={\left(c{\sigma}^{2}\left(\theta \right)\right)}^{-1}\phantom{\rule{0.166667em}{0ex}}exp\left({\int}_{0}^{\theta}\frac{2b\left(z\right)}{{\sigma}^{2}\left(z\right)}dz\right),\end{array}$$

where the constant *c* is a normalizer.

We let *p*(*θ*) be the target posterior Eq (1) for our Bayesian model and find appropriate functions *b*(*θ*) and *σ*(*θ*) such that *p*() satisfies the Eq (4). That is,

$$\begin{array}{cc}\hfill b\left(\theta \right)& =0.50\left({\sigma}^{2}{\left(\theta \right)}^{\prime}+{\sigma}^{2}\left(\theta \right)\frac{p{\left(\theta \right)}^{\prime}}{p\left(\theta \right)}\right)=0.5{\sigma}^{2}\left(\theta \right)\left(\frac{{\sigma}^{2}{\left(\theta \right)}^{\prime}}{{\sigma}^{2}\left(\theta \right)}+\frac{p{\left(\theta \right)}^{\prime}}{p\left(\theta \right)}\right)\end{array}$$

where the derivatives are taken with respect to *θ*. When *p*(*θ*) = *g*(**y** | *θ*)*π*(*θ*) as in Eq (1), we look for functions *b*() and *σ*^{2}() satisfying

$$\begin{array}{ccc}\hfill b\left(\theta \right)& =& 0.50{\sigma}^{2}\left(\theta \right)\left(\frac{d}{d\theta}log\left(g\left(\mathbf{y}\phantom{\rule{4pt}{0ex}}\right|\phantom{\rule{4pt}{0ex}}\theta )\pi \left(\theta \right){\sigma}^{2}\left(\theta \right)\right)\right)\hfill \\ & =& 0.50{\sigma}^{2}\left(\theta \right)\left(\frac{g{\left(\mathbf{y}\phantom{\rule{4pt}{0ex}}\right|\phantom{\rule{4pt}{0ex}}\theta )}^{\prime}}{g\left(\mathbf{y}\phantom{\rule{4pt}{0ex}}\right|\phantom{\rule{4pt}{0ex}}\theta )}+\frac{\pi {\left(\theta \right)}^{\prime}}{\pi \left(\theta \right)}+\frac{{\sigma}^{2}{\left(\theta \right)}^{\prime}}{{\sigma}^{2}\left(\theta \right)}\right),\hfill \end{array}$$

(5)

Having completed this step, we can simulate the diffusion and proceed as in MCMC. This is typically accomplished by forming a discrete-time approximation to Eq (2), that is, a Markov chain approximation to the continuous-time process. There may be many pairs (*b*(), *σ*^{2}()) that work for a fixed Bayesian model. It is important to note that the core of a *diffusion MCMC* (DMCMC) implementation has been completely described.

In this article we discretize the diffusion Eq (2) using the Euler scheme. This method has been extensively discussed in the literature, see for example [5] for a comprehensive overview and [6–8] for recent developments. The solution of the stochastic differential Eq (2) is approximated using a discrete time Markov chain {*θ*_{m}}_{m ≥ 0},

(6)

where *θ*_{0} = *θ*(0), *h* > 0 is the discretization step-size, and *Z*_{m+1} is a realization from a standard Gaussian distribution. Abusing notation, we use *θ* to denote both the continuous-time defined in Eq (2) and the discrete-time process from Eq (6). It is typical to extend {*θ*_{m}}_{m ≥ 0} to a continuous time process via interpolation; however this step is not necessary for this paper. From a practical perspective, we are interested in the process {*θ*_{m}}_{m ≥ 0}. Two critical questions arise:

- Does the discrete stochastic process converge to a stationary, ergodic distribution?
- If so, is that stationary distribution “close” enough to the target posterior distribution to justify the use of conventional output analysis to enable approximate Bayesian inference?

Unfortunately, there are situations where for any choice of the time step *h* > 0, the Markov chain described by Eq (6) will behave drastically different than the continuous time version Eq (2), see [9] for a discussion on this issue. Nevertheless, in many cases the ergodic properties of the discretized process {*θ*_{m}}_{m ≥ 0} are similar to those of {*θ*(*t*)}_{t ≥ 0}. In particular, in [10] the author shows that under regularity conditions, the Euler discretization scheme does have a stationary measure which converges at an appropriate rate to the unique stationary measure of the continuous-time SDE. Furthermore, in [9] the authors provide conditions under which the continuous-time Langevin diffusion (defined below in Eq (7)) as well as the discretized version Eq (6) are geometrically ergodic. In [7], the authors study the asymptotic properties of time averages $(1/M){\sum}_{m=1}^{M}F\left({\theta}_{m}\right)$, where *F* is a given function. This statistic is the natural estimate for the expected value *E*(*F*(*θ*)) = ∫*F*(*θ*)*dp*. They use Poisson equations to show that under mild regularity conditions, any stationary measure of the Euler-discretized process Eq (6) will be close to the unique stationary measure of the underlying SDE. Their Theorem 5.1 also shows that the time average estimator is of order *O*(*h* + 1/*M*).

To illustrate *diffusion-MCMC* and indicate its potential value and simplicity, examples are reviewed in the next section. Henceforth, we set *σ* 1, and then find the function *b*(); this approach often has the tag *Langevin*. That is, we restrict ourselves to diffusion processes of the form

$$\begin{array}{c}\hfill d\theta \left(t\right)=\frac{1}{2}\frac{\partial}{\partial \theta}logp\left(\theta \left(t\right)\right)dt+dW\left(t\right),\phantom{\rule{1.em}{0ex}}t\ge 0,\end{array}$$

(7)

where *p*() is the posterior distribution Eq (1). We note that application of Eq (6) yields the corresponding transition distribution as

We emphasize again that our goal is to present the benefits of a procedure that has been present in the literature for over a decade. Due to its wild behavior even in some simple cases, it has received little attention, especially from practitioners and applied scientists. It has been proposed (see for example the MALA algorithm presented in [9]) that an additional Metropolis-Hastings step will correct the explosive behavior. The MALA algorithm has been further studied and extended in [11] and [12]. In both cases the improvement comes with an increase in computational complexity. We stress that our goal is to avoid a Metropolis-Hastings accept-reject step and this work is motivated by recent theoretical advances in this direction, see [7]. We explore the efficiency and applicability of DMCMC to high-dimensional problems arising in a Bayesian framework, **without** performing the Metropolis-Hastings correction step. When classical (or adaptive) MCMC fails (for example, due to computational time restrictions or inability to select good proposals), we show that diffusion MCMC is a viable alternative which requires little input from the user and can be computationally more efficient.

The multivariate form of the diffusion Eq (2) is written as

(8)

where {* θ*(

One class of problems in which diffusion MCMC may be useful involve nonlinearity. For example, suppose the likelihood function *g* depends on * θ* via a “link” function

**Example 1**. Assume that for *τ*^{2} known, *Y* | *θ* ~ *N*(*θ*, *τ*^{2}) and *θ* ~ *N*(*μ*, *η*^{2}). Of course, we know that *θ* | *Y* = *y* is normally distributed with easily computed mean and variance (these will appear below). Applying Eq (5) with the choice of *σ*(*θ*) = 1, yields

$$\begin{array}{c}\hfill b\left(\theta \right)=0.50\left(\frac{y}{{\tau}^{2}}+\frac{\mu}{{\eta}^{2}}-\theta (\frac{1}{{\tau}^{2}}+\frac{1}{{\eta}^{2}})\right).\end{array}$$

Let *α* = 0.50(*τ*^{−2}
*y* + *η*^{−2}
*μ*) and *β* = 0.50(*τ*^{−2} + *η*^{−2}). The solution to

is

$$\begin{array}{c}\hfill \theta \left(t\right)={\int}_{0}^{t}\alpha {e}^{-\beta (t-s)}ds+{\int}_{0}^{t}{e}^{-\beta (t-s)}dW\left(s\right)+\theta \left(0\right){e}^{-\beta t}.\end{array}$$

It follows that

$$E\left(\theta \left(t\right)\right)={\displaystyle {\int}_{0}^{t}\alpha {e}^{-\beta \left(t-s\right)}ds+E\left(\theta \left(0\right)\right){e}^{-\beta t}}$$

(9)

$$=\frac{\alpha}{\beta}\left(1-{e}^{-\beta t}\right)+E\left(\theta \left(0\right)\right){e}^{-\beta t}.$$

(10)

It can be shown that

$$\text{var}\left(\theta \left(t\right)\right)={\displaystyle {\int}_{0}^{t}{e}^{-2\beta \left(t-s\right)}ds+\text{var}\left(\theta \left(0\right){e}^{-\beta t}\right)}$$

(11)

= (2*β*)^{−1}(1 − *e*^{−2βt}) + var(*θ*(0))*e*^{−2βt}.

(12)

Returning to the original parameterization, we conclude that as *t* → ∞,

$$\begin{array}{c}\hfill E\left(\theta \left(t\right)\right)\to {\left(\frac{1}{{\tau}^{2}}+\frac{1}{{\eta}^{2}}\right)}^{-1}\frac{y}{{\tau}^{2}}+\frac{\mu}{{\eta}^{2}}\end{array}$$

and

$$\begin{array}{c}\hfill \text{var}\left(\theta \left(t\right)\right)\to {\left(\frac{1}{{\tau}^{2}}+\frac{1}{{\eta}^{2}}\right)}^{-1},\end{array}$$

which are the usual posterior mean and variance for this Bayesian model. If the initial condition *θ*(0) is normally distributed, then for each *t*, *θ*(*t*) is also normally distributed. Note that the convergence rate to the stationary distribution is exponentially fast.

**Example 2**. Diffusion MCMC is useful in combining data from highly different likelihoods. Let * θ* = (

$$\begin{array}{c}\hfill \frac{\partial}{\partial {\theta}_{i}}\nabla logp\left(\mathit{\theta}\right)=-\sum _{j=1}^{{r}_{i}}\frac{({\theta}_{i}-{Y}_{ij})}{{\tau}^{2}}-\sum _{i=1}^{K}\frac{2({\theta}_{i}-\mu )}{{A}^{2}+{({\theta}_{i}-\mu )}^{2}}.\end{array}$$

Note that conjugacy plays no direct role in this approach, though the presence of the Cauchy distribution makes a Gibbs sampler infeasible. This example is further analyzed in the next Section.

**Example 3. (Mixture Models)** Suppose *Y*_{1}, …, *Y*_{n} are conditionally independent and identically distributed given * θ* according to a finite mixture of

$$\begin{array}{c}\hfill g(\mathbf{y}\mid \theta )=\prod _{j=1}^{n}\left(\sum _{i=1}^{m}{\alpha}_{i}{g}_{i}({y}_{j}\mid \theta )\right),\end{array}$$

where **y** = (*y*_{1}, …, *y _{n}*)′,

**Example 4**. **(Hierarchical Models)**. In many models *g* and *π* are products of a variety of terms, e.g., for conditionally independent observations, *g* is a product; *π* is often represented as a product of hierarchical components. In such cases, we have that

$$\begin{array}{c}\hfill \frac{\partial log\left(g\pi \right)}{\partial {\theta}_{i}}=\frac{\partial log\left({g}^{\left(i\right)}{\pi}^{\left(i\right)}\right)}{\partial {\theta}_{i}},\end{array}$$

where the superscripts indicate that only those components of *g* and *π* that explicitly depend on *θ*_{i} are involved in the calculation. This parallels the familiar step in computing full conditionals in setting up a Gibbs Sampler. Namely, for each *i*, one computes the distributions

$$\begin{array}{c}\hfill [{\theta}_{i}\mid \text{all}\phantom{\rule{4.pt}{0ex}}\text{other}\phantom{\rule{4.pt}{0ex}}{\theta}_{j}]=\frac{{g}^{\left(i\right)}{\pi}^{\left(i\right)}}{\int {g}^{\left(i\right)}{\pi}^{\left(i\right)}d{\theta}_{i}}.\end{array}$$

(13)

Suppose that the Bayesian model takes the form **Y** | *θ*_{1}, … *θ _{q}* ~

We adapt the notation in Eq (5) as follows: for a function *f*(*θ*_{1}, … *θ*_{q}) define

$$\begin{array}{c}\hfill {f}_{\left(i\right)}=\frac{\partial f}{\partial {\theta}_{i}},\phantom{\rule{0.166667em}{0ex}}i=1,\dots ,q.\end{array}$$

Hence,

$$\begin{array}{c}\hfill \frac{\partial log\left(g\pi \right)}{\partial {\theta}_{i}}=\frac{{g}_{\left(i\right)}}{g}+\sum _{j=1}^{i}\frac{{\pi}_{\left(i\right)}^{j}}{{\pi}^{j}}\end{array}$$

(14)

We note that Gibbs sampling is useful when the full conditionals Eq (13) are readily obtained and simulated. This typically arises when the full conditionals actually depend on a small subset of the parameters in the conditions. This is not necessary in diffusion MCMC.

To provide insight into diffusion MCMC (DMCMC), we present a standard test case and a real-data example. Our goal is to assess the performance of DMCMC, especially in comparison with the current *state-of-the art* adaptive MCMC approach, see [13, 14]. The DMCMC methodology is compared to a multivariate adaptive Metropolis sampler (AM). For the AM algorithm, the proposal distribution at iteration *m* is given by

(1 - *β*)*N*(*x*, (2.38)^{2}*Σ*_{m}/*q*) + *β**N*(0, (0.1)^{2}*I*_{q}/*q*)

where Σ_{m} is the current estimate of the covariance matrix of the target distribution and *β* is a small positive constant (we take *β* = 0.05). The AM algorithm is widely accepted as one of the best sampling algorithms, especially for complex target distributions where dependencies among parameters make it difficult to select proposal distributions. We refer the reader to [15] for several comparisons between MCMC algorithms. The scaling factor (2.38)^{2} can also be “adapted”; in this case we refer to the procedure as *adaptive scaling within adaptive MCMC*. However, user input is not eliminated completely as there remain tuning parameters to be specified.

For comparisons, we inspect trace-plots to assess convergence and compare algorithms via their *averaged squared jumping distance*

This quantity is estimated by $\frac{1}{M}{\sum}_{m=1}^{M}{({X}_{m}-{X}_{m-1})}^{2}$ for both AM and DMCMC algorithms. Comparatively large *ASJD* indicates the desirable property of fast mixing.

We also add a computational constraint for our examples. We limit ourselves to relatively short runs of the Markov chains (AM and DMCMC). This can be very dangerous for classical MCMC since one will have difficulty assessing whether the chains have reached stationarity. Our examples will show that the diffusion approach quickly finds regions with high posterior probability and explores them thoroughly.

Assume that *Y*_{i1}, …, *Y*_{iri}|*θ*_{i}, *γ* are an iid sample from a *Gaussian*(*θ*_{i}, *V*(*γ*)) distribution, where 1 ≤ *i* ≤ 1000 and 1 ≤ *j* ≤ *r*_{i}. We specify the variance *V*(*γ*) to be

$$\begin{array}{c}\hfill V\left(\gamma \right)=\frac{a+b\phantom{\rule{0.166667em}{0ex}}{e}^{\gamma}}{1+{e}^{\gamma}}\phantom{\rule{0.166667em}{0ex}},\phantom{\rule{1.em}{0ex}}\phantom{\rule{4.pt}{0ex}}\text{for}\phantom{\rule{4.pt}{0ex}}\gamma \in \mathbb{R},\end{array}$$

(15)

where 0 < *a* < *b* < ∞ are specified constants. The reason behind Eq (15) is twofold: (1) we require that all the parameters of the model be supported on the entire real line, hence a transformation is required for all variances, and (2) we aim for a *Uniform*(*a*, *b*) prior distribution for the variance *V*(*γ*). Certainly, other distributions (such as Gamma or Inverse Gamma) can be considered. Using an inverse transformation, this is equivalent to specifying the prior density for *γ* as

$$\begin{array}{c}\hfill f\left(\gamma \right)=\frac{{e}^{\gamma}}{{(1+{e}^{\gamma})}^{2}}\phantom{\rule{0.166667em}{0ex}},\phantom{\rule{1.em}{0ex}}\gamma \in \mathbb{R}\phantom{\rule{4pt}{0ex}}.\end{array}$$

We let the sample sizes *r*_{i} vary between 5 and 500. For *θ*_{1}, *θ*_{2}, …, *θ*_{500} we specify independent prior distributions, *θ*_{i}|*μ*, *A* ~ *Cauchy*(*μ*, *A*), with density proportional to [1 + ((*θ*_{i} − *μ*)/*A*)^{2}]^{−1}. The parameter *A* is held fixed for this example, although it can be treated similarly to the data variance *V*(*γ*). For the hyperparameter *μ* we specify a *Gaussian*(0, 1) prior distribution. Using the independence assumption, the likelihood function is written as

$$\begin{array}{ccc}\hfill {g}_{\mathbf{y}}\left(\mathit{\theta}\right)& \equiv & {g}_{\mathbf{y}}({\theta}_{1},\dots ,{\theta}_{1000},\gamma ,\mu )\hfill \\ & \propto & \prod _{i=1}^{1000}\left(V{\left(\gamma \right)}^{-{r}_{i}/2}\right)exp\left\{-\frac{{\sum}_{j=1}^{{r}_{i}}{({Y}_{ij}-{\theta}_{i})}^{2}}{2V\left(\gamma \right)}\right\},\hfill \end{array}$$

(16)

and the prior density is proportional to

$$\begin{array}{ccc}\hfill \pi \left(\mathit{\theta}\right)& \equiv & \pi ({\theta}_{1},\dots ,{\theta}_{1000},\gamma ,\mu )\hfill \\ & \propto & \left[\prod _{i=1}^{1000}\frac{1}{1+{\left(\frac{{\theta}_{i}-\mu}{A}\right)}^{2}}\right]\xb7\frac{{e}^{\gamma}}{{(1+{e}^{\gamma})}^{2}}\xb7exp\{-{\mu}^{2}/2\}\phantom{\rule{4pt}{0ex}}.\hfill \end{array}$$

(17)

**Selection of the time step**. The selection of “good” time steps for DMDMC is challenging. First, time steps that are too large may result in explosive (transient) processes. In general, the user faces a conundrum: a very small time step typically results in chain whose dynamics are similar to those of the target continuous-time diffusion, but is slowly mixing.

In our experiments we observed that selecting *h* = *O*(1/*q*) results in a good performance of the DMCMC algorithm for this example. In Fig 1 we display the trace plots and autocorrelation functions for three parameters (results for all parameters where quite similar) from the DMCMC approach. Fig 2 shows trace plots for the same parameters resulting from two adaptive approaches as described above. In each case the chains were run for 20000 iterations and thinned by ten. It is evident from these figures that the adaptive approach has difficulty exploring the state space. The key issue is the dimension of the state space (1000+ in this example). The adaptive approach will not “learn” a one thousand dimensional covariance matrix properly. In Table 1 we summarize the target values and estimates (posterior means) for two parameters, *θ*_{1} and *θ*_{201}. Table 2 contains corresponding estimates of ASJD. We see that the estimated ASJD indicates that the mixing rates of the diffusion MCMC algorithm is much higher than the adaptive case.

In [16], the authors present a hierarchical Bayesian analysis for inferring features of the dynamics of the Northeast Ice Stream in Greenland. For our purposes, we use a subset of their data and a simplified version of their model. Glaciers flow under the influence of gravity moderated by resistive forces at its base and sides. For a path roughly along the center of the glacier as it flows toward the sea, physical reasoning and simplifying approximations lead to a model for surface velocity of the stream as a function of ice thickness and the shape of the surface. Let *x* (*m*) denote a spatial location. The model for the surface velocity *u*(*x*) (*ms*^{−1}) used here is

$$\begin{array}{c}\hfill u\left(x\right)={u}_{bx}+(0.50)\phantom{\rule{0.166667em}{0ex}}A\left(x\right)\phantom{\rule{0.166667em}{0ex}}{\left(\rho g\right)}^{3}\phantom{\rule{0.166667em}{0ex}}{H}^{4}\left(x\right)\phantom{\rule{0.166667em}{0ex}}{\left(\frac{ds\left(x\right)}{dx}\right)}^{3},\end{array}$$

(18)

where *u*_{bx} is sliding velocity, *s*(*x*) (*m*) is ice-surface elevation, *H*(*x*) (*m*) is ice thickness, *ρ* = 911 *kgm*^{−3} is the density of ice, and *g* = 9.81 *ms*^{−2} is the gravity constant. Though the quantity *A*(*x*) depends on temperature, it is often treated as a constant *flow parameter*. In this article we model *A* using a Fourier expansion

$$\begin{array}{c}\hfill A\left(x\right)={a}_{0}+\sum _{k=1}^{3}{a}_{k}cos\left(kx\omega \right)+{b}_{k}sin\left(kx\omega \right).\end{array}$$

(19)

We assume the following quadratic model for the surface:

(20)

where *β*_{0} and *β*_{1} are unknown parameters. The authors in [16] use a different, more complicated functional form for *s*. We found Eq (20) sufficient for our purposes. Let *B*(*x*) be the elevation of the base of the glacier so that

The dataset consists of vectors **S** observed at fill-in spatial locations covering approximately 200 *km* of the ice stream, the observed surface topography; **B**, the observed basal topography; and **U**, surface velocities. Additional description of the data is given in [16].

Let * θ* represent the set of all parameters introduced in the modeling. We assume that

**Surface Data**. The data model for **S** is a conventional Gaussian measurement error model:

$$\begin{array}{c}\hfill \mathbf{S}\mid \mathit{\theta}\sim N(\mathbf{s},{\sigma}_{S}^{2}\phantom{\rule{0.166667em}{0ex}}I),\end{array}$$

(21)

where **s** is the vector of values of Eq (20) at the observation locations; ${\sigma}_{S}^{2}$ is an unknown measurement error variance; and *I* is the identity matrix.

**Basal Data**. It is argued (see [16]) that the basal data must be smoothed to be useful in Eq (18). Following their approach we partition the domain of the data into 2^{10} = 1024 bins of equal length (189.5 m). All basal observations within each bin are averaged, leading to a data vector $\overline{\mathbf{B}}$ of length 1024. As in [16], we use a wavelet model with two sets of wavelets to provide smoothing. The first group of wavelets captures a smooth signal; the second captures fine-scale or detail signals. Based on the results of [16], we used four smooth and 28 detail wavelets. Specifically, we assume that

$$\begin{array}{c}\hfill \overline{\mathbf{B}}\mid \mathit{\theta}\phantom{\rule{0.166667em}{0ex}}\sim N(\mathbf{W}\mathbf{C},{\sigma}_{B}^{2}\phantom{\rule{0.166667em}{0ex}}\text{diag}\left\{{n}_{i}^{-1}\right\}),\end{array}$$

(22)

where **W** is the 1024 × 32 matrix of discretized wavelet basis functions; **C** is the 32-dimensional vector of wavelet coefficients; ${\sigma}_{B}^{2}$ is an error variance; and $\text{diag}\left\{{n}_{i}^{-1}\right\}$ is a 1024 × 1024 diagonal matrix with diagonal elements equal to ${n}_{i}^{-1}$ where *n*_{i} is the number of observations averaged in bin *i* (all *n*_{i} are either one or two). We selected Daubechies wavelets, see [17] and [18] for discussion.

**Velocity Model**. We assume that

$$\begin{array}{c}\hfill \mathbf{U}\mid \mathit{\theta}\sim N(\mathbf{u},{\sigma}_{U}^{2}\phantom{\rule{0.166667em}{0ex}}I),\end{array}$$

(23)

where **u** is the vector of values given by Eq (18) at the observation locations; ${\sigma}_{U}^{2}$ is the unknown measurement error variance; and *I* is the identity matrix. Note that the sliding velocity is assumed to be a constant over the study range.

**Error Variances**. The measurement error variances ${\sigma}_{S}^{2}$, ${\sigma}_{B}^{2}$, and ${\sigma}_{U}^{2}$ were assigned independent, inverse gamma distributions with means and standard deviations (100, 10), (2500, 500), and (9, 3), respectively.

**Surface Model Parameters**. The prior distributions for *β*_{0} and *β*_{1} were specified to be independent normal distributions with large variances: 10,000 for *β*_{0} and 10 for *β*_{1}. The means of these normal distributions were set to be equal to the least squares estimates of *β*_{0} and *β*_{1} derived from a traditional analysis fitting the model in Eq (20) to the surface observations.

**Basal Model Parameters**. The prior used for the four coefficients of the smooth-signal wavelets is

$$\begin{array}{c}\hfill {\mathbf{C}}_{s}\sim N({\mathit{\mu}}_{s},{\sigma}_{c}^{2}\phantom{\rule{0.166667em}{0ex}}{I}_{4})\end{array}$$

where * μ_{s}* is the vector of conventional least squares estimates of Haar-wavelet coefficients. The prior for the remaining 28 coefficients is

$$\begin{array}{c}\hfill {\mathbf{C}}_{d}\sim N(\mathbf{0},{\sigma}_{d}^{2}\phantom{\rule{0.166667em}{0ex}}{I}_{28}).\end{array}$$

We set ${\sigma}_{c}^{2}=2000$ and ${\sigma}_{d}^{2}=10000$.

**Velocity Model Parameters**. The prior for the sliding velocity is

To develop reasonable priors for the Fourier coefficients in Eq (19), we first obtained the least squares fits to the surface and the basal models (20) and (22). These fitted models were substituted into the velocity model (23). We then fitted the result via least squares. As above, the least squares estimates of the Fourier coefficients were used as prior means for the corresponding parameters. These values were on the order of 10^{−16} (which is consistent with the theoretical value of the parameter *A*) except for the frequency parameter *ω* which was estimated to be roughly 10^{−5}. These parameters were all assumed to be independent, normal random variables with prior variances equal to 10.

**Performance**. Fig 3 shows trace plots for various parameters. We ran the algorithm for 100000 iterations and thinned it by fifty steps. The diffusion MCMC algorithm performs very well. It appears that it explores the state space properly and mixes very fast. In Fig 4 we show posterior means for the surface, velocity and basal processes. For comparison, we added the posterior means (dashed lines) for the three processes from a much longer adaptive MCMC run. This plot confirms that diffusion MCMC performs as expected, giving similar results to the adaptive MCMC approach with the added benefit of a much shorter computing time.

Simulation of a diffusion formulated to have stationary distribution coinciding with a target posterior distribution is a viable MCMC method. The approach is comparatively simple to implement since it requires no probability computations such as those needed in Gibbs’ sampling nor any accept-reject steps as in Metropolis algorithms. These advantages can be significant in a variety of settings including mixture likelihoods and/or priors, hierarchical models, nonconjugate priors, and nonlinear models.

The key problem that arises in diffusion MCMC is the approximation of the desired continuous time diffusion by a discrete time Markov chain. Our implementations use Euler discretizations. As reviewed in the Introduction, there are results in the literature providing sufficient conditions under which the discrete approximation has a stationary distribution that approximates that of the target, continuous-time diffusion. Though beyond our scope here, selection of the time-step *h* can be done adaptively, see [6] for some recent theoretical developments in this area.

We implemented diffusion MCMC for a familiar test problem and compared it to an adaptive MCMC procedure. We found that diffusion MCMC out-performed the “state-of-the-art” adaptive MCMC. Next, we implemented the diffusion MCMC approach in a complicated, nonlinear model involving glacial dynamics. Again, we found that our suggested approach performs well, mixing very fast.

In summary, we believe that diffusion MCMC is a valuable addition to the MCMC toolbox. By construction, the DMCMC algorithm has the ability to quickly find important regions of the target distribution, while a classical, even adaptive MCMC, may require longer exploration times (as seen in the glaciological example). It can be applied in great generality and with ease in some complicated contexts for which other MCMC methods are difficult or very time-consuming to implement. DMCMC does carry the baggage of temporal discretization and concern for the quality of the resulting approximation. Nevertheless, the potential power of diffusion MCMC justifies its application and further development.

We provide the surface, basal and velocity data used in this manuscript.

(ZIP)

Click here for additional data file.^{(11K, zip)}

RH has been supported in part by the National Science Foundation (www.nsf.gov) grant number DMS-1209142. RP has been supported in part by the National Science Foundation (www.nsf.gov) grant NSF-CMMI-1537379. LMB was supported in part by the National Science Foundation (www.nsf.gov) grants ATM-07-3024403 and DMS-10-49064. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Data Availability

The surface, basal and velocity data used in the current manuscript are available for download as a MATLAB file in the supporting information section. Data used in this paper are from the study below whose corresponding author may be contacted at ude.uso.tats@bm: Berliner, L. M., Jezek, K., Cressie, N., Kim, Y., Lam, C. Q., and van der Veen, C.J., Modeling dynamic controls on ice streams: a Bayesian statistical approach. Journal of Glaciology 54 (2008), pp. 705714. Available at: http://www.ingentaconnect.com/content/igsoc/jog/2008/00000054/00000187/art00014.

1.
Bernardo J. M. and Smith A. F. M., *Bayesian Theory*, Wiley, 1994.

2.
Gilks W. R., Richardson S., and Spiegelhalter D. J. (ed.), *Markov chain Monte Carlo in practice*, Chapman and Hall, New York, 1995.

3.
Robert C.P. and Casella G., *Monte Carlo Statistical Methods (second edition)*. New York: Springer-Verlag, 2004.

4.
Karatzas I. and Shreve S. E., *Brownian Motion and Stochastic Calculus*. Springer-Verlag, New York, 1991.

5.
Kloeden P. E. and Platen E., *Numerical Solution of Stochastic Differential Equations*.
Springer-Verlag, Berlin, 1992.

6.
Lamba H., Mattingly J.C., Stuart A.M., An adaptive Euler-Maruyama scheme for SDEs: convergence and stability. IMA Journal of Numerical Analysis, 27 (2007), pp. 479–506. doi: 10.1093/imanum/drl032

7.
Mattingly J. C., Stuart A. M., Tretyakov M. V., Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM J. Numer. Anal., 48 (2010), pp. 552–577. doi: 10.1137/090770527

8.
Roberts G. O. and Stramer O., Langevin Diffusions and Metropolis-Hastings Algorithms. Methodology and Computing in Applied Probability, 4 (2002), pp. 337–357. doi: 10.1023/A:1023562417138

9.
Roberts G. O. and Tweedie R. L., Exponential Convergence of Langevin Distributions and Their Discrete Approximations. Bernoulli, 4 (1996) pp. 341–363. doi: 10.2307/3318418

10.
Talay D., Second order discretization schemes of stochastic differential systems for the computation of the invariant law. Stochastics and Stochastics Reports, 29 (1990), pp. 13–36. doi: 10.1080/17442509008833606

11.
Girolami M. and Calderhead B., Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Statist. Soc. B, 73 (2011), pp. 123–214. doi: 10.1111/j.1467-9868.2010.00765.x

12.
Bou-Rabee N., Hairer M., Vanden-Eijnden E., Non-asymptotic mixing of the MALA algorithm, IMA Journal of Numerical Analysis, 33, (2012), 80–110. doi: 10.1093/imanum/drs003

13.
Gilks W. R., Roberts G. O. and Sahu S. K., Adaptive Markov chain Monte Carlo. J. Amer. Statist. Assoc., 93, (1998), pp. 1045–1054. doi: 10.1080/01621459.1998.10473766

14.
Haario H., Saksman E. and Tamminen J., An adaptive Metropolis algorithm. Bernoulli, 7 (2001), pp. 223–242. doi: 10.2307/3318737

15.
Roberts G. O. and Rosenthal J. S., Examples of Adaptive MCMC. J. Comp. Graph. Stat., 18 (2009), pp. 349–367. doi: 10.1198/jcgs.2009.06134

16.
Berliner L. M., Jezek K., Cressie N., Kim Y., Lam C. Q., and van der Veen C.J., Modeling dynamic controls on ice streams: a Bayesian statistical approach. Journal of Glaciology
54 (2008), pp. 705–714. doi: 10.3189/002214308786570917

17.
Bruce A. and Gao H. Y., *Applied wavelet analysis with S-PLUS*, New York, Springer-Verlag, 1996.

18.
Vidakovic B. *Statistical Modeling by Wavelets*. John Wiley & Sons, New York, 1999.

Articles from PLoS ONE are provided here courtesy of **Public Library of Science**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |