Home | About | Journals | Submit | Contact Us | Français |

**|**Europe PMC Author Manuscripts**|**PMC2703732

Formats

Article sections

Authors

Related links

Neuroimage. Author manuscript; available in PMC 2009 July 15.

Published in final edited form as:

Published online 2009 March 20. doi: 10.1016/j.neuroimage.2009.03.025

PMCID: PMC2703732

EMSID: UKMS5226

Klaas Enno Stephan,^{1,}^{2} Will D. Penny,^{1} Jean Daunizeau,^{1} Rosalyn J. Moran,^{1} and Karl J. Friston^{1}

Address for correspondence: Klaas Enno Stephan, Wellcome Trust Centre for Neuroimaging, Institute of Neurology, UCL, 12 Queen Square, London, UK, WC1N 3BG, Tel (44) 20 7833 7472, Fax (44) 20 7813 1420, Email ku.ca.lcu.noi.lif@nahpets.k

The publisher's final edited version of this article is available at Neuroimage

See other articles in PMC that cite the published article.

Bayesian model selection (BMS) is a powerful method for determining the most likely among a set of competing hypotheses about the mechanisms that generated observed data. BMS has recently found widespread application in neuroimaging, particularly in the context of dynamic causal modelling (DCM). However, so far, combining BMS results from several subjects has relied on simple (fixed effects) metrics, *e.g.* the group Bayes factor (*GBF*), that do not account for group heterogeneity or outliers. In this paper, we compare the *GBF* with two random effects methods for BMS at the between-subject or group level. These methods provide inference on model-space using a classical and Bayesian perspective respectively. First, a classical (frequentist) approach uses the log model evidence as a subject-specific summary statistic. This enables one to use analysis of variance to test for differences in log-evidences over models, relative to inter-subject differences. We then consider the same problem in Bayesian terms and describe a novel hierarchical model, which is optimised to furnish a probability density on the models themselves. This new variational Bayes method rests on treating the model as a random variable and estimating the parameters of a Dirichlet distribution which describes the probabilities for all models considered. These probabilities then define a multinomial distribution over model space, allowing one to compute how likely it is that a specific model generated the data of a randomly chosen subject as well as the exceedance probability of one model being more likely than any other model. Using empirical and synthetic data, we show that optimising a conditional density of the model probabilities, given the log-evidences for each model over subjects, is more informative and appropriate than both the *GBF* and frequentist tests of the log-evidences. In particular, we found that the hierarchical Bayesian approach is considerably more robust than either of the other approaches in the presence of outliers. We expect that this new random effects method will prove useful for a wide range of group studies, not only in the context of DCM, but also for other modelling endeavours, *e.g.* comparing different source reconstruction methods for EEG/MEG or selecting among competing computational models of learning and decision-making.

Model comparison and selection is central to the scientific process, in that it allows one to evaluate different hypotheses about the way data are caused (Pitt & Myung 2002). Nearly all scientific reporting rests upon some form of model comparison, which represents a probabilistic statement about the beliefs in one hypothesis relative to some other(s), given observations or data. The fundamental Neyman-Pearson lemma states that the best statistic upon which to base model selection is simply the probability of observing the data under one model, divided by the probability under another model (Neyman & Pearson 1933). This is known as a *log-likelihood ratio*. In a classical (frequentist) setting, the distribution of the log-likelihood ratio, under the null hypothesis that there is no difference between models, can be computed relatively easily for some models. Common examples include Wilk’s Lambda for linear multivariate models and the *F*- and *t*-statistics for univariate models. In a Bayesian setting, the equivalent to the log-likelihood ratio is the log-evidence ratio, which is commonly known as a *Bayes factor* (Kass & Raftery 1995). An important property of Bayes factors are that they can deal both with nested and non-nested models. In contrast, frequentist model comparison can be seen as a special case of Bayes factors where, under certain hierarchical restrictions on the models, their null distribution is readily available.

In this paper, we will consider the general case of how to use the model evidence for analyses at the group level, without putting any constraints on the models compared. These models can be nonlinear, possibly dynamic and, critically, do not necessarily bear a hierarchical relationship to each other, *i.e.* they are not necessarily nested. The application domain we have in mind is the comparison of dynamic causal models (DCMs) for fMRI or electrophysiological data (Friston *et al.* 2003; Stephan *et al.* 2007a) that have been inverted for each subject. However, the theoretical framework described in this paper can be applied to any model, for example when comparing different source reconstruction methods for EEG/MEG or selecting among competing computational models of learning and decision-making.

This paper is structured as follows. First, to ensure this paper is self-contained, particularly for readers without an in-depth knowledge of Bayesian statistics, we summarise the concept of log-evidence as a measure of model goodness and review commonly used approximations to it, *i.e.* the Akaike Information Criterion (*AIC*; Akaike 1974), the Bayesian Information Criterion (*BIC;*
Schwarz 1978), and the negative free-energy (*F*). These approximations differ in how they trade-off model fit against model complexity. Given any of these approximations to the log-evidence, we then consider model comparison at the group level. We address this issue both from a classical and Bayesian perspective. First, in a frequentist setting, we consider classical inference on the log-evidences themselves by treating them as summary statistics that reflect the evidence for each model for a given subject. Subsequently, using a hierarchical model and variational Bayes (VB), we describe a novel technique for inference on the conditional density of the models *per se*, given data (or log-evidences) from all subjects. This rests on treating the model as a random variable and estimating the parameters of a Dirichlet distribution, which describes the probabilities for all models considered. These probabilities then define a multinomial distribution over model space, allowing one to compute how likely it is that a specific model generated the data of a subject chosen at random.

We compare and contrast these random effects approaches to the conventional use of the group Bayes factor (*GBF*), an approach for model comparison at the between-subject level that has been used extensively in previous group studies in neuroimaging. For example, the *GBF* has been used frequently to decide between competing dynamic causal models fitted to fMRI (Acs & Greenlee 2008; Allen *et al.* 2008; Grol *et al.* 2007; Heim *et al.* 2008; Kumar *et al.* 2007; Leff *et al.* 2008; Smith *et al.* 2006; Stephan *et al.* 2007b, 2007c; Summerfield & Koechlin 2008) and EEG data (Garrido *et al.* 2007, 2008). While the *GBF* is a simple and straightforward index for model comparison at the group level, it assumes that all subjects’ data are generated by the same model (*i.e.* a fixed effects approach) and can be influenced adversely by violations of this assumption.

The novel Bayesian framework presented in this paper does not suffer from these shortcomings: it can quantify the probability that a particular model generated the data for any randomly selected subject, relative to other models, and it is robust to the presence of outliers. In the analyses below, we illustrate the advantages of this new approach using synthetic and empirical data. We show that computing a conditional density of the model probabilities, given the log-evidences for all subjects, can be superior to both the *GBF* and frequentist tests applied to the log-evidences. In particular, we found that our Bayesian approach is markedly more robust than either of the other approaches in the presence of outlying subjects.

The model evidence *p*(*y* | *m*) is the probability of obtaining observed data *y* given a particular model *m*. It can be considered the holy grail of any model inversion and is necessary to compare different models or hypotheses. The evidence for some models can be computed relatively easily (*e.g.*, for linear models); however, in general, computing the model evidence entails integrating out any dependency on the model parameters :

$$p\left(y\mid m\right)=\int p(y\mid \vartheta ,m)p\left(\vartheta \mid m\right)d\vartheta $$

(1)

In many cases, this integration is analytically intractable and numerically difficult to compute. Usually, it is therefore necessary to use computationally tractable approximations to the model evidence (or the log-evidence^{1}). A detailed description of some of the most common approximations is contained by Appendix A.

A systematic evaluation of the relative usefulness of different approximations to the log-evidence is not at the focus of this paper and will be presented in forthcoming work. This article deals with a different question, namely: Given a particular approximation to the log-evidence and a number of inverted models, how can we infer which of several competing models is most likely to have generated the data from a group of subjects? In other words, how can we make inference on model space at the group level, taking into account potential heterogeneity across the group?

In this section, we consider inference at the group level, using subject-specific model-evidences obtained by inverting a generative model for each subject. We will first describe a classical approach, testing the null hypothesis that there are no differences among the relative log-evidences for various models over subjects. We then move on to more formal Bayesian inference on model space *per se*. In contrast to the *GBF*, which, as described above, represents a fixed effects analysis, both the classical and Bayesian approaches are random effects procedures and thus consider inter-subject heterogeneity explicitly.

A straightforward random effects procedure to evaluate the between-subject consistency of evidence for one model relative to others is to use the log-evidences across subjects as the basis for a classical log-likelihood ratio statistic, testing the null hypothesis that no single model is better (in terms of their log-evidences) than any other. This essentially involves performing an ANOVA, using the log-evidence as a summary statistic of model adequacy for each subject. This ANOVA then compares the differences among models to the differences among subjects with a classical *F*-statistic. If this statistic is significant one can then compare the best model with the second best using a *post hoc t*-test. Effectively, this tests for differences between models that are consistent and large in relation to differences within models over subjects. The most general implementation would be a repeated-measures ANOVA, where the log-evidences for the different models represent the repeated measure. At its simplest, the comparison of just two models over subjects reduces to a simple paired *t*-test on the log-evidences (or a one-sample *t*-test on the log-evidence differences). Log-evidences tend to be fairly well behaved, and the residuals of a simple ANOVA model, or tests of normality like Kolmogorov-Smirnoff, usually indicate that parametric assumptions are appropriate. In those cases when they are not, *e.g.* due to outlier subjects, one can use robust regression methods that are less sensitive to violations of normality (Diedrichsen *et al.* 2005; Wager *et al.* 2005) or non-parametric tests that do not make any distributional assumptions (*e.g.* a Wilcoxon signed rank test; see one of our examples below).

This classical random effects approach is simple to implement, straightforward and easily interpreted. In this sense, there seems little reason not to use it. However, as shown in the empirical examples below, this type of inference can be affected markedly by group heterogeneity, even when the distribution of log-evidence differences is normal. A more robust analysis obtains by quantifying the density on model space itself, using a Bayesian approach as described in the next section.

Previously, we have suggested the use of a group Bayes factor (*GBF*) that is simply the product of Bayes factors over *N* subjects (Stephan *et al.* 2007b). This is equivalent to a fixed effects analysis that rests on multiplying the likelihoods over subjects to furnish the probability of the multi-subject data, conditioned on each model:

$${\mathit{GBF}}_{i,j}=\prod _{n=1}^{N}B{F}_{i,j}^{\left(n\right)}$$

(2)

Here, the subscripts *i,j* refer to the models being compared, and the bracketed superscript refers to the *n*-th subject. The reason one can simply multiply the probabilities (or add the log-evidences) is that the measured data can be regarded as conditionally independent samples over subjects. However, this does not represent a formal evaluation of the conditional density of a particular model given data from all subjects. Furthermore, it rests upon a very particular generative model for group data: first, select one of *K* models from a multinomial distribution and then generates data, under this model, for each of the *N* subjects. This is fundamentally different from a generative model which treats subjects as random effects: here we would select a model for each subject by sampling from a multinomial distribution, and then generate data under that subject-specific model. The distinction between these two generative models is illustrated graphically in Figure 1.

Bayesian dependency graphs for fixed effects (A) and random effects generative models for multi-subject data (B, C). The graphical model in Figures 1B and 1C are equivalent; we show both because 1B is more intuitive for readers unfamiliar with graphical **...**

In short, the *GBF* encodes the relative probability that the data were generated by one model relative to another, assuming the data were generated *by the same model* for all subjects. What we often want, however, is the density from which models are sampled to generate subject-specific data. In other words, we seek the conditional estimates of the multinomial parameters, *i.e.* the model probabilities *r*=[*r*_{1},...,*r _{K}*], that generate switches or indicator variables,

We will deal with *K* models with probabilities *r*=[*r*_{1},...,*r _{K}*] that are described by a Dirichlet distribution

$$\begin{array}{c}p\left(r\mid \alpha \right)=Dir(r,\alpha )=\frac{1}{Z\left(\alpha \right)}\prod _{k}{r}_{k}^{{\alpha}_{k}-1}\hfill \\ Z\left(\alpha \right)=\frac{\prod _{k}\Gamma \left({\alpha}_{k}\right)}{\Gamma \left(\sum _{k}{\alpha}_{k}\right)}\hfill \end{array}$$

(3)

Here, *α*=[*α*_{1},...,*α _{K}*] are related to the unobserved “occurrences” of models in the population;

$$p\left({m}_{n}\mid r\right)=\prod _{k}{r}_{k}^{{m}_{nk}}$$

(4)

For any given subject *n*, we can sample from this multinomial distribution to obtain a particular model *k*. The marginal likelihood of the data in the *n*-th subject, given this model *k*, is then obtained by integrating over the parameters of the model selected

$$p\left({y}_{n}\mid {m}_{nk}\right)=\int p\left(y\mid \vartheta \right)p\left(\vartheta \mid {m}_{nk}\right)d\vartheta $$

(5)

The graphical model summarising the dependencies among *r, m* and *y* as described by Equations 3-5 is shown in Figure 1B and 1C. Our goal is to invert this hierarchical model and estimate the posterior distribution over *r*.

Given the structure of the hierarchical model in Figure 1, the joint probability of the parameters and the data *y* can be written as:

$$\begin{array}{cc}\hfill p(y,r,m)& =p\left(y\mid m\right)p\left(m\mid r\right)p\left(r\mid {\alpha}_{0}\right)\hfill \\ \hfill & =p\left(r\mid {\alpha}_{0}\right)\left[\prod _{n}p\left({y}_{n}\mid {m}_{n}\right)p\left({m}_{n}\mid r\right)\right]\hfill \\ \hfill & =\frac{1}{Z\left({\alpha}_{0}\right)}\left[\prod _{k}{r}_{k}^{{\alpha}_{0{k}^{-1}}}\right]\left[\prod _{n}p\left({y}_{n}\mid {m}_{n}\right)\prod _{k}{r}_{k}^{{m}_{nk}}\right]\hfill \\ \hfill & =\frac{1}{Z\left({\alpha}_{0}\right)}\prod _{n}\left[\prod _{k}{\left[p\left({y}_{n}\mid {m}_{nk}\right){r}_{k}\right]}^{{m}_{nk}}{r}_{k}^{{\alpha}_{0{k}^{-1}}}\right]\hfill \end{array}$$

(6)

The log joint probability is therefore given by

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p(y,r,m)=-\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}Z\left({\alpha}_{0}\right)+\sum _{n}\sum _{k}(({\alpha}_{0k}-1)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}+{m}_{nk}(\mathrm{log}p\left({y}_{n}\mid {m}_{nk}\right)+\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}))$$

(7)

The inversion of our hierarchical model relies on the following variational Bayesian (VB) approach in which we assume that an approximate posterior density *q* can be described by the following mean-field factorisation:

$$\begin{array}{c}\hfill q(r,m)=q\left(r\right)q\left(m\right)\hfill \\ \hfill q\left(r\right)\propto \mathrm{exp}\left(I\left(r\right)\right)\hfill \\ \hfill q\left(m\right)\propto \mathrm{exp}\left(I\left(m\right)\right)\hfill \\ \hfill I\left(r\right)={\langle \mathrm{log}p(y,r,m)\rangle}_{q\left(m\right)}\hfill \\ \hfill I\left(m\right)={\langle \mathrm{log}p(y,r,m)\rangle}_{q\left(r\right)}\hfill \end{array}$$

(8)

Here, *I*(*r*) and *I*(*r*) are variational energies for the mean-field partition. The mean-field assumption in Equation 8 means that the VB posterior will only be approximate but, as we shall see, it provides a particularly simple and intuitive algorithm (c.f. Equation 14). This algorithm provides precise estimates of the parameters *α* defining the approximate Dirichlet posterior *q*(*r*) ≈ *p*(*r* | *y*); this was verified by comparisons with a sampling method which is described in Appendix B.

To obtain the approximate posterior *q*(*m*) ≈ *p*(*m* | *y*), we have to do two things: first, compute *I*(*m*) and second, determine the normalising constant or partition function for exp(*I*(*m*)), which renders *q*(*m*) a probability density. Making use of the log joint probability in Equation 7 and omitting terms that do not depend on *m*, the variational energy is:

$$\begin{array}{cc}\hfill I\left(m\right)& =\int q\left(r\right)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p(y,r,m)dr\hfill \\ \hfill & =\sum _{n}\sum _{k}{m}_{nk}(\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p\left({y}_{n}\mid {m}_{nk}\right)+\int q\left({r}_{k}\right)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}d{r}_{k})\hfill \\ \hfill & =\sum _{n}\sum _{k}{m}_{nk}(\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p\left({y}_{n}\mid {m}_{nk}\right)+\Psi \left({\alpha}_{k}\right)-\Psi \left({\alpha}_{S}\right))\hfill \end{array}$$

(9)

Here ${\alpha}_{S}={\sum}_{k}{\alpha}_{k}$ and Ψ is the digamma function^{2}

$$\Psi \left({\alpha}_{k}\right)=\frac{\partial \phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}\Gamma \left({\alpha}_{k}\right)}{\partial {\alpha}_{k}}$$

(10)

The next step is to obtain the approximate posterior, *q*(*m*): If *g _{nk}* is our (normalized) posterior belief that model

$$\begin{array}{c}{g}_{nk}=\frac{{u}_{nk}}{{u}_{n}}\hfill \\ {u}_{nk}=\mathrm{exp}(\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p\left({y}_{n}\mid {m}_{nk}\right)+\Psi \left({\alpha}_{k}\right)-\Psi \left({\alpha}_{s}\right))\hfill \\ {u}_{n}=\sum _{k}{u}_{nk}\hfill \end{array}$$

(11)

where *u _{nk}* is the equivalent (non-normalized) belief and

We now repeat the above procedure but this time for the approximate posterior over *r*. By substituting in the log joint probability from Equation 7 and omitting terms that do not depend on *r*, we have

$$\begin{array}{cc}\hfill I\left(r\right)& =\int q\left(m\right)\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}p(y,r,m)dm\hfill \\ \hfill & =\sum _{k}[({\alpha}_{0k}-1)\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}+\sum _{n}{g}_{nk}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}]\hfill \\ \hfill & =\sum _{k}({\alpha}_{0k}+{\beta}_{k}-1)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}\hfill \end{array}$$

(12)

Here, ${\beta}_{k}={\sum}_{n}{g}_{nk}$ is the expected number of subjects whose data we believe were generated by model *k*. Now, from Equation 8 we have log *q*(*r*)=*I*(*r*)+... and from Equation 3 we see that the log of a Dirichlet density is given by $\mathit{Dir}(r;\alpha )={\sum}_{k}({\alpha}_{k}-1)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}+\dots $ Hence, by comparing terms we see that the approximate posterior *q*(*r*)=*Dir*(*r*;*α*) where:

$$\alpha ={\alpha}_{0}+\beta $$

(13)

In short, Equation 13 simply adds the ‘data counts’, *β*, to the ‘prior counts’, *α*_{0}. This is an example of a free-form VB approximation, where the optimal form of the approximate posterior (in this case a Dirichlet), has been derived rather than assumed before-hand (*c.f.* fixed-form VB approximations; Friston *et al.* 2007). It should be stressed, however, that due to the mean-field assumption used by our VB approach (see Equation 8), *q*(*r*) is only an approximate posterior and the true posterior distribution *p*(*r* | *y*) does not have the exact form of a Dirichlet distribution.

The above equations can be implemented as an optimisation algorithm which updates estimates of *α* iteratively until convergence. By combining Equations 11, 12 and 13 we get the following pseudo-code of a simple algorithm that gives us the parameters of the conditional density we seek, i.e. *q*(*r*)=*Dir*(*r*;*α*)

$$\alpha ={\alpha}_{0}$$

*Until convergence*

$$\begin{array}{c}{u}_{nk}=\mathrm{exp}(\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p\left({y}_{n}\mid {m}_{nk}\right)+\Psi \left({\alpha}_{k}\right)-\Psi \left(\sum _{k}{\alpha}_{k}\right))\hfill \\ {\beta}_{k}=\sum _{n}\frac{{u}_{nk}}{\sum _{k}{u}_{nk}}\hfill \\ \alpha ={\alpha}_{0}+\beta \hfill \end{array}$$

(14)

*end*

We make the usual assumption that, *a priori*, no models have been “seen” (*i.e.* the Dirichlet prior is *α*_{0} = [1,...,1]).^{3} Critically, this scheme requires only the log-evidences over models and subjects (*c.f.*
Equation 11):

After the above optimization of the Dirichlet parameters, *α*, the Dirichlet density *p*(*r* | *y*;*α*) can be used for model comparisons at the group level. There are several ways to report this comparison that result in equivalent model rankings. The simplest option is to report the estimates of the Dirichlet parameter estimates *α*. Another possibility is to use those estimates to compute the expected multinomial parameters *r _{k}* and thus the expected likelihood of obtaining the

$${\langle {r}_{k}\rangle}_{q}={\alpha}_{k}\u2215({\alpha}_{1}+\dots +{\alpha}_{K})$$

(15)

A third option is to use the conditional model probability *p*(*r* | *y*;*α*) to quantify an *exceedance probability, i.e.* our belief that a particular model *k* is more likely than any other model (of the *K* models tested), given the group data:

$$\begin{array}{c}\exists k\in \left\{1\dots K\right\},\forall j\in \{1\dots K\mid j\ne k\}:\hfill \\ {\phi}_{k}=p({r}_{k}>{r}_{j}\mid y;\alpha )\hfill \end{array}$$

(16)

The exceedance probabilities * _{k}* sum to one over all models tested. They are particularly intuitive when comparing two models (or model subsets, see below). In this case, because the conditional probabilities of the models

$$\begin{array}{cc}\hfill {\phi}_{1}& =p({r}_{1}>{r}_{2}\mid y;\alpha )\hfill \\ \hfill & =p({r}_{1}>0.5\mid y;\alpha )\hfill \end{array}$$

(17)

The analyses of empirical data below include several examples where two models are compared; the associated exceedance probabilities are shown in Figures Figures3,3, ,6,6, ,99 and and1313.

The Dirichlet density for the nonlinear partition of model space, defined by the parameter estimates shown by Figure 12C. The exceedance probability of _{1} = 98.6% (shaded area) indicates the probability that nonlinear hemodynamic models were better than **...**

Either the Dirichlet parameter estimates *α*, the conditional expectations of model probabilities *r _{k}* or the exceedance probabilities

A particular strength of the approach presented in this paper is that it can not only be used to compare specific models, but also to compare particular classes or subsets of models, resulting from a partition of model space. For example, one may want to compute the probability that a specific model attribute, say the presence vs. absence of a particular connection in a DCM, improves or reduces model performance, regardless of any other differences among the models considered. This type of inference rests on comparing two (or more) subsets of model space, pooling information over all models in these subsets. This effectively removes uncertainty about any aspect of model structure, other than the attribute of interest (which defines the partition). Heuristically, this sort of analysis can be considered a Bayesian analogue of tests for “main effects” in classical ANOVA.

Within our framework this type of analysis can be performed by exploiting the agglomerative property of the Dirichlet distribution. Generally, for any partition of model space into *J* disjoint subsets, *N*_{1},*N*_{2},...,*N _{J}*, this property ensures that

$$\begin{array}{cc}\hfill ({r}_{1},{r}_{2},\dots ,{r}_{K})& ~\mathit{Dir}({\alpha}_{1},{\alpha}_{2},\dots ,{\alpha}_{k})\hfill \\ \hfill & \Rightarrow {r}_{1}^{\ast}=\sum _{k\in {N}_{1}}{r}_{k},{r}_{2}^{\ast}=\sum _{k\in {N}_{2}}{r}_{k},\dots ,{r}_{J}^{\ast}=\sum _{k\in {N}_{J}}{r}_{k}\hfill \\ \hfill & ~\mathit{Dir}[{\alpha}_{1}^{\ast}=\sum _{k\in {N}_{1}}{\alpha}_{k},{\alpha}_{2}^{\ast}=\sum _{k\in {N}_{2}}{\alpha}_{k},\dots ,{\alpha}_{J}^{\ast}=\sum _{k\in {N}_{J}}{\alpha}_{k}]\hfill \end{array}$$

(18)

In other words, once we have estimates of the Dirichlet parameters *α _{k}* for all

An example of model space partitioning applied to the case of DCMs which were identical in network architecture (the same as *m*_{1} in Figure 8) but differed in the hemodynamic forward model employed (for details, see Stephan *et al.* 2007c).
**...**

- Eight different

In what follows, we compare classical inference, the *GBF* (fixed effects) and inference on model space (random effects) using both synthetic and real data. These data have been previously published and have been analysed in various ways, including group level model inference using *GBF*s (Stephan *et al.* 2007b, 2007c; Stephan *et al.* 2008).

To demonstrate the face validity of our method, we used simulated data, where the true model was known. Specifically, we used one of the synthetic data sets described by Stephan *et al.* (2008), consisting of twenty synthetic BOLD time-series that were generated using a three-area nonlinear DCM with fixed parameters and adding Gaussian observation noise to achieve a signal-to-noise ratio (SNR) of two. Each time-series consisted of 100 data points that were obtained by sampling the model output at a frequency of 1 Hz over a period of 100 seconds. For each time-series, we fitted (i) a nonlinear DCM with the same model structure as the model that generated the data (“correct model” in Fig. 2, model *m*_{1}), and (ii) a second DCM that was similar in structure but included a bilinear (instead of a nonlinear) modulatory influence (“incorrect model” in Fig. 2, model *m*_{2}). Using the negative free-energy approximation to the log-evidence, the differences in log-evidences for all twenty time-series are plotted in the lower part of Fig. 2. It can be seen that in 17 out of 20 cases the nonlinear model was correctly identified as the more likely model. The overall *GBF* (9 × 10^{14}) was also clearly in favour of the correct model.

Synthetic data consisting of twenty time-series that were generated using a three-area nonlinear DCM and adding random observation noise (see Stephan *et al.* 2008 for details). To each of these time-series, two models were fitted and compared: (i) a nonlinear **...**

Here, we revisit this synthetic data set using random effects BMS procedures. We first used classical inference, applying a paired *t*-test to the log-evidences of the two models. This test rejected the null hypothesis of no difference in model goodness (*t* = 4.615, *df* = 19, *p* < 10^{-4}). Applying the novel hierarchical BMS approach gave an even clearer (and arguably also more useful) answer: the exceedance probability _{1}, *i.e.* the probability of *m*_{1} being a more likely model than *m*_{2}, was 100% (Figure 3). In other words, using the exceedance probability as a criterion, the correct model was identified perfectly, given all twenty data sets and the chosen level of noise. To further corroborate this result, we compared the result from our VB algorithm to an independent method which estimates the parameters *α* by sampling from the approximate Dirichlet posterior *q*(*r*) ≈ *p*(*r* | *y*) . This comparison showed that the VB estimate of *α* resulted in an estimate of the negative free-energy *F*(*y,α*) ≤ ln *p*(*y* | *α*) that was consistent with the results from the sampling approach (Figure 4). This provides an additional validation of our VB technique. We used this sampling approach to verify the correctness of our VB estimates in all subsequent analyses.

It should be noted that this simulation study concerned the extreme case that only one model had generated all data, i.e. *r*_{1}=100% and *r*_{2}=0%, making it easy to intuitively understand the performance of the proposed model selection procedure. However, this simulation did not probe the robustness of our method when randomly sampling from a heterogeneous population of subjects whose data had been generated by different models. We will revisit this scenario in a later section of this paper once we have introduced and compared two alternative DCMs of inter-hemispheric interactions using empirical data.

As a first empirical application, we investigated a case we had encountered in our previous research (Stephan *et al.* 2007b) and which had actually triggered our interest in developing more powerful group level inference about models. This model comparison concerned DCMs describing alternative mechanisms of inter-hemispheric integration in terms of context-dependent modulation of connections. In one of the analyses of the original report (Stephan *et al.* 2007b), competing DCMs had been constructed for the ventral stream of the visual system by systematically changing which of the experimentally controlled conditions modulated the intra- and/or the inter-hemispheric connections.

First, we focused on the six-area model of the ventral stream, comprising the lingual gyrus (LG), middle occipital gyrus (MOG) and fusiform gyrus (FG) in both hemispheres, and revisit the comparison of the best two models as indexed by the *GBF*. In the first model, *m*_{1}, inter-hemispheric connections were modulated by a letter decision task, but conditional on the visual field of stimulus presentation (LD|VF); intra-hemispheric connections were modulated by LD alone (see right side of Figure 5). In the second model, *m*_{2}, these modulations were reversed: inter-hemispheric connections were modulated by LD and intra-hemispheric connections were modulated by LD|VF (see left side of Figure 5). The distribution of log-evidence differences (approximated by *AIC*/*BIC*, following the procedure suggested by Penny *et al.* 2004) is shown in the centre of Figure 5: Although *m*_{1} was robustly superior in 11 of the 12 subjects, a single outlier was so extreme that the *GBF* indicated an overall superiority of *m*_{2} (*GBF*=15 in favour of *m*_{2}). In contrast, model comparison using our novel Bayesian method was not affected by this outlier: the exceedance probability in favour of *m*_{1} was very high (_{1} = 99.7%), and the conditional expectation *r*_{1} that *m*_{1} generated the data of any randomly selected subject was 84.3% (Figure 6). The estimates of our VB method were confirmed by the sampling approach (Figure 7).

Comparison of DCMs describing alternative mechanisms of inter-hemispheric integration in terms of context-dependent modulation of connections (Stephan *et al.* 2007b). Two variants of a six-area model of the ventral stream, comprising the lingual gyrus **...**

For comparison, we also applied frequentist statistics to the log-evidences as described above. The single outlier subject made the distribution of the log-evidence differences non-normal (Kolmogorov-Smirnov test: *p* < 10^{-7}, *D _{N}* = 0.822), and thus prevented detection of a significant difference between the two models by a one-tailed paired

Next, we investigated a variant of the previous case where the distribution of log-evidences across subjects was more heterogeneous. This model comparison was essentially identical to the previous one, except that the models in question only contained four areas (LG and FG in both hemispheres), instead of six. Visual inspection of the distribution of log-evidence differences (Figure 8) shows that the same subject as in the previous example favoured *m*_{2}, albeit far less strongly; in addition three more subjects showed evidence in favour of *m*_{2}, albeit only weakly. Given this constellation, the original analysis by Stephan *et al.* (2007b) only found a relatively weak superiority of *m*_{1} (*GBF* = 8). In contrast, the VB method gave a exceedance probability of _{1} = 92.8% in favour of *m*_{1}, indicating more clearly that *m*_{1} is a superior model (Figure 9). As above, the estimates of our VB method were confirmed by sampling (Figure 10).

When comparing this result to the frequentist random effects approach, a one-tailed paired *t*-test was unable to detect a significant difference between the two models (*t* = 0.165, *df* = 11, *p* = 0.436). In contrast to the previous example, this failure was not due to outlier-induced deviations from normality: a Kolmogorov-Smirnov test applied to the log-evidences was unable to reject the null hypothesis that they were normally distributed (*p* = 0.743). Here, the between-subject variability, while in accordance with normality assumptions, was simply too large to reject the null hypothesis with the classical *t*-test. A nonparametric Wilcoxon signed rank test did not fare any better (*p* = 0.266).

In a second simulation study, we examined the robustness of our method when randomly sampling from a heterogeneous population of subjects. Specifically, we dealt with a population in which 70% of subjects showed brain responses as generated by model *m*_{1} shown in Figure 8, whereas brain activity in the remaining 30% of the population was generated by model *m*_{2}. We randomly sampled 20 subjects from this population and generated synthetic fMRI data by integrating the state equations of the associated models with fixed parameters and inputs^{5} and adding Gaussian observation noise to achieve an SNR of two. Each synthetic data set had exactly the same structure as the empirical data described in the previous section (700 data points, TR = 3 s). Both *m*_{1} and *m*_{2} were then fitted to all 20 synthetic data sets, and the resulting log-evidences were used to perform both fixed effects BMS and random effects BMS, using the VB method described in this paper. This sampling and data generation procedure was repeated 20 times, resulting in a total of 400 generated data sets and 800 fitted models. For each of the 20 sets of 20 subjects, we computed the different indices provided by random effects BMS (i.e., *α*, *r*, ) and fixed effects BMS (log GBF). The means of these indices are plotted in Figure 11, together with 95% confidence intervals (CI). If our random effects BMS method were perfect in uncovering the underlying structure of the population we sampled from, one would expect to find the following average estimates: (i) *α*_{1}=22×0.7=15.4,*α*_{2}=22×0.3=6.6 for the Dirichlet parameters, (ii) *r*_{1}=0.7,*r*_{2}=0.3 for the posterior expectations of model probabilities, and (iii) _{1}=1,_{2}=0 as exceedance probabilities (note that the exceedance probability is not the posterior model probability itself, but a statement of belief about the posterior probability of one model being higher than the posterior probability of any other model). The actual estimates of the BMS indices for the simulated data were (i) *α*_{1} = 15.4 (CI: 14.1 - 16.7) and *α*_{2} = 6.6 (CI: 5.3 - 7.9), (ii) *r*_{1}=0.7 (CI: 0.64 - 0.76) and *r*_{2}=0.3 (CI: 0.24 - 0.36), and (iii) _{1}=0.89 (CI: 0.83 - 0.96) and _{1}=0.11 (CI: 0.04 - 0.17). For comparison, the average log GBF in favour of model *m*_{1} was 548.9 (CI: 446.2 - 651.6).

Summary of the results from a simulation study in which we examined the robustness of our method when randomly sampling from a heterogeneous population of subjects. Specifically, we dealt with a population in which 70% of subjects showed brain responses **...**

In conclusion, while our random effects BMS method provides a slightly overconservative estimate of exceedance probabilities for the chosen sample size, it shows very good performance overall, providing BMS indices that accurately reflect the structure of the population we sampled from. In particular, the Dirichlet parameters and posterior expectations of model probabilities (which represent the expected probability of obtaining the *k*-th model when randomly selecting a subject) were estimated very precisely. This result not only validates the results obtained for the empirical data set described above, but demonstrates more generally that our BMS procedure is robust when randomly sampling from a heterogeneous population of subjects.

Finally, we revisited a comparison of DCMs, which were identical in network architecture (the same as *m*_{1} in Figure 8) but differed in the hemodynamic forward model employed (Stephan *et al.* 2007c). A three-factor design was used to construct 8 different models: (i) nonlinear vs. linear BOLD equations, (ii) classical vs. revised coefficients of the BOLD equation, and (iii) free vs. fixed parameter (*ε*) for the ratio of intra- and extravascular signal changes. In the original analysis by Stephan *et al.* (2007c), the *GBF* (based on the negative free-energy approximation) was used to establish the best among the eight models. The best model, abbreviated as RBM_{N}(*ε*) in Figure 12, was characterised by (i) a nonlinear BOLD equation, (ii) revised coefficients of the BOLD equation, and (iii) free *ε*. The difference of its summed log-evidence compared to the second-best model, its linear counterpart RBM_{L}(*ε*), was 5.26, corresponding to a *GBF* of 192 in favour of the nonlinear model. The summed log-evidences for all 8 models are shown in Figure 12A.

Here, we demonstrate how one can use the agglomerative property of the Dirichlet distribution (Equation 18) to go beyond selective comparisons of specific models and instead examine the relative importance of particular model attributes or model subspaces. Given the three factors above, we focussed on the importance of nonlinearities: what is the posterior probability that nonlinear BOLD equations improve the model compared to linear BOLD equations, regardless of any other dimensions of model space (*i.e.*, classical vs. revised coefficients and free vs. fixed *ε*)?

Following Equation 18, this question is addressed easily. In a first step, the VB procedure was applied to the entire set of eight models, yielding posterior estimates of the Dirichlet parameters *α*_{1},...,*α*_{8} (see Figure 12B). Subsequently, a new Dirichlet density reflecting the partition of model space into nonlinear and linear subspaces was computed by summing *α _{k}* separately for the nonlinear and linear models (Figure 12C; for simplicity the ordering of the models in Figure 12 has been chosen such that the first four models are nonlinear [left of the dashed line], whereas the last four models are linear [right of the dashed line]) The resulting Dirichlet can then be used to compare nonlinear and linear models in exactly the same way as one compares two models;

For comparison, we also used classical inference, applying a repeated-measure ANOVA (with Greenhouse-Geisser correction for non-sphericity) to the log-evidences of the eight models. The result of this test was compatible with the above analysis, rejecting the null hypothesis that linear and nonlinear models were equal in log-evidence (*F* = 24.330, *df* = 1,11, *p* < 0.0004).

In this paper, we have introduced a novel approach for model selection at the group level. Provisional experience suggests that this approach represents a more powerful way of quantifying one’s belief that a particular model is more likely than any other at the group level, relative to the conventional *GBF*. Critically, this variational Bayesian approach rests on treating the model switches *m _{i}* as a random variable, within a full hierarchical model for multi-subject data (see Figure 1), and thus accommodates random effects at the between-subject level. Notably, this inference procedure needs only the log-evidences for each model and subject.

In the empirical examples above, we showed two cases where frequentist tests failed to indicate clear differences between models, while the novel Bayesian approach succeeded. In one case (the six-area ventral stream model), a strong outlier subject made the distribution of log-evidences non-normal and thus rendered the *t*-test (but not a non-parametric test) unable to find a significant difference between models. In another case (the four-area ventral stream model), the distribution of log-evidences was normal, but with a between-subject variance that was big enough to prevent significant results by frequentist tests (parametric or non-parametric). It should be noted, however, that the frequentist and Bayesian approaches do not test the same thing. The frequentist approach tries to reject the null hypothesis that there are no differences in log-evidence across models. In contrast, the Bayesian approach estimates the models’ probabilities, given the data, and enables inference in terms of exceedance probabilities: the exceedance probability * _{k}* is the probability that a given model

The exceedance probability of a model differs in a subtle but important way from the conventional posterior probability of a model in Bayesian model comparison: Because we have a hierarchical model, the posterior probability that any particular model caused the data from a subject chosen at random, is itself a random variable (*r* in the derivations above). This means that the exceedance probability is a statement of belief about the posterior probability, not the posterior probability itself. So, for example, when we say that the exceedance probability is 98%, we mean that we can be 98% confident that the favoured model has a greater posterior probability than any other model tested. This is not the same as saying that the posterior probability of the favoured model is 98%. The advantage of using exceedance probabilities is that they are sensitive to the confidence in the posterior probability and easily interpretable (since they sum to unity over all models tested).

As can be seen from Equations 9 and 11, our method is sensitive to both the distribution and the magnitude of log-evidence differences. The same is true for frequentist tests applied to log-evidence differences, *e.g. t*-tests. However, a critical difference between these frequentist approaches and the VB method is that for the latter the influence of outliers has a natural bound. There is a simple and intuitive reason for this nice property of the VB method: if we keep increasing the log-evidence of model *k* for a particular subject *n*, our posterior belief that *k* generated the data of subject *n* (that is, *g _{nk}*=

Another important advantage of the method proposed here is that it can go beyond the selective comparison of specific models and enables one to assess the importance of changes along any specific dimension of model space. This type of inference, which could be seen as a Bayesian analogue of testing for “main effects” in classical ANOVA, rests on comparing two (or more) subsets of models (*i.e*., model subspaces). These partitions would typically reflect those components of model structure that one seeks inference about; *e.g.* whether a specific connection should be included in the model or not, whether a particular connection is modulated by one experimental condition or another, or whether certain effects are linear or nonlinear. We used this approach to demonstrate that hemodynamic models with nonlinear BOLD equations are superior to those with linear ones. This result is in accordance with previous studies that highlight the importance of nonlinearities in the BOLD signal (Deneux & Faugeras 2006; Friston *et al.* 2000; Miller *et al.* 2001; Stephan *et al.* 2007c; Vazquez & Noll 1998; Wager *et al.* 2005). However, in these earlier studies, this conclusion was based on comparisons of specific and single instances of linear and nonlinear hemodynamic models.

The inferential advance achieved by the present method is that arbitrarily large set of models can be considered together, allowing one to integrate out uncertainty over any aspect of model structure, other than the one of interest.

At first glance, it may appear surprising that the hierarchical model described above has been introduced as a generative model for the data *y*, given its inversion does not need the data but the model evidence, *p*(*y* | *m*). This apparent contradiction could be resolved by noting that the log-evidence is a function of the data and represents a sufficient ‘summary statistic’. To generate data, one would need to introduce the model parameters * _{k}* to the graphical model shown in Figure 1B,C. In the context of DCM, for example, once one has drawn a model

One property of the method proposed in this paper is that for each subject *n* our posterior beliefs about model *k* having generated their data sum to one over all models that are considered, that is $\forall n:\sum _{k=1}^{K}{g}_{nk}=1$ (*c.f.*
Equation 11). In other words, our posterior belief about which model *k* is most likely to have generated the data for a given subject *n* is a function of the entire set of models considered. This means that reducing or extending model space can change our inference about which model is most likely at the group level. Although this is a fairly trivial corollary, it should not be forgotten when using this method in practice. In short, one should infer the most likely model by comparing the entire set of plausible models at once, instead of selectively analysing subparts of model space.

To our knowledge, there has been relatively little work on group level methods for Bayesian model comparison so far. In addition to the *GBF* (Stephan *et al.* 2007b), we had previously suggested a metric called the “positive evidence ratio” (*PER*; Stephan *et al.* 2007b, 2007c). Based on the conventional definition of “positive evidence” as a Bayes factor larger than three (Kass & Raftery 1995), the *PER* is simply the number of subjects where there is positive (or stronger) evidence for model 1 divided by the number of subjects with positive (or stronger) evidence for model 2. While the *PER* is insensitive to outliers, it is also insensitive to the magnitude of the differences across subjects. More importantly, however, it is only a descriptive index that does not allow for probabilistic inference in a straightforward manner. In the approach described in this paper, the sufficient statistics for the model frequencies are the posterior estimates of the Dirichlet parameters (*α*). When the differences in model evidences are very strong, these simply boil down to the number of subjects with positive (and more) evidence in favour of a particular model. In that case where for each subject there is one highly superior model, the expected model frequencies become identical to the PER. From this perspective, the present approach can be considered a (probabilistic) generalisation of the PER.

The only other work on group level methods for Bayesian model comparison that we are aware of is a recent paper by Li *et al.* (2008) who suggested a “group-level BIC score”. This score is derived by summing the BIC for each model across subjects. As explained earlier in this paper, the BIC is a well-known approximation to the log-evidence (Schwarz 1978). The group-level BIC score by Li *et al.* (2008) thus approximates the sum of log-evidences and simply corresponds to the log *GBF*. Effectively, the analysis by Li *et al.* (2008) thus used a fixed effects analysis across models that is formally identical to that used in reports of DCM studies (*e.g*. Acs & Greenlee 2008; Allen *et al.* 2008; Grol *et al.* 2007; Heim *et al.* 2008; Kumar *et al.* 2007; Smith *et al.* 2006; Stephan *et al.* 2007a,b; Summerfield & Koechlin 2008).

Finally, it should be noted that a random effects model selection approach is not necessarily preferable to a fixed effects approach. The choice between fixed and random effects BMS depends on the specific scientific question addressed. In the context of basic mechanisms that are unlikely to differ across subjects, the conventional GBF is both sufficient and appropriate. For example, it is unlikely that subjects differ with regard to basic physiological mechanisms such as the involvement of sodium ion channels in action potential generation or the presence of certain types of connections in the brain. In this context, it is perfectly tenable to assume that all subjects generate data under the same model; and the data from all subjects can be pooled to select this model in the usual way. In contrast, whenever subjects can exhibit different models or functional architectures, the random effects BMS technique presented in this paper is a more appropriate method. For example, there is evidence that many higher cognitive functions can rely on more than one neurobiological system (Price & Friston 2002). Also, it is likely that in some mental diseases, e.g. schizophrenia, patients with identical symptoms show heterogeneity with regard to the pathophysiological processes involved (Stephan *et al.* 2006).

In summary, in contrast to the *GBF* and other established approaches for group-level model comparison, the approach suggested in this paper rests on a hierarchical model for multi-subject data that accommodates random effects at the between-subject level (Figure 1) and thus provides a generic framework for hypothesis testing. We expect this method to be a useful tool for group studies, not only in the context of dynamic causal modelling, but also for a range of other modelling endeavours; for example, comparing different source reconstruction methods for EEG/MEG at the group level (Henson *et al.* 2007; Litvak & Friston 2008; Mattout *et al.* 2007), or selecting among competing computational models of learning and decision-making, given data from a group of subjects (Brodersen *et al.* 2008; Hampton *et al.* 2006).

This work was funded by the Wellcome Trust (KES, WDP, RJM, KJF) and the University Research Priority Program “Foundations of Human Social Behaviour” at the University of Zurich (KES). JD is funded by Marie Curie Fellowship. We are very grateful to Marcia Bennett for helping prepare this manuscript, to the FIL Methods Group, particularly Justin Chumbley, for useful discussions and to Jon Roiser and Dominik Bach for helpful comments on practical applications. Finally, we would like to thank the two anonymous reviewers for their constructive comments which have greatly helped to improve this paper.

With the exception of some special cases (e.g., linear models), the integral expression for the model evidence (Equation 1) is analytically intractable and numerically difficult to compute. Under these circumstances, people generally adopt a bound approach where, instead of evaluating the integral above, one optimises a bound on the integral using iterative sampling or analytic techniques. The most common approach of the latter kind is variational Bayes. In this framework, one posits an approximating conditional or posterior density on the unknown parameters, *q*(), and optimises this density with respect to a free-energy bound, *F*, on the log-evidence:^{6}

$$F=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}p(y\mid m)-KL[q\left(\vartheta \right),p(\vartheta \mid y,m)]$$

A.1

Because of its relation to variational calculus and Gibb’s free-energy in statistical physics, this free-energy bound *F* is often referred to as the “negative free-energy” or “variational free-energy” (Friston *et al.* 2007; MacKay 2003; Neal & Hinton 1998). Its second term is the Kullback-Leibler (KL) divergence (Kullback & Leibler 1951) between the approximating posterior density *q*() and the true posterior *p*( | *y,m*), which is always positive (or zero when *q*() becomes identical to *p*( | *y,m*)). By iterative optimisation, the negative free-energy *F* is made an increasingly tighter lower bound on the desired log-evidence, ln *p*(*y* | *m*); as a consequence, the KL divergence between the approximating and true posterior is minimised. There are a number of approximations that are used when specifying the form of *q*(). These include the ubiquitous mean-field approximation, where various sets of unknown parameters are assumed to be independent, so that the conditional density can be factorised. A common example here would be a bipartition into the regression coefficients of a general linear model and the parameters controlling random effects or error variance. Another common approximation within the mean-field framework is to assume that the conditional density is multivariate Gaussian. This is also known as the Laplace approximation, a full treatment of which can be found in Friston *et al.* (2007).

For any approximation to the conditional density, the free-energy bound on the log-evidence can be re-written as a mixture of accuracy and complexity:

$$F={\langle \mathrm{log}\phantom{\rule{thinmathspace}{0ex}}p(y\mid \vartheta ,m)\rangle}_{q}-KL[q\left(\vartheta \right),p(\vartheta \mid m)]$$

A.2

The *accuracy* (first term) is simply the log-likelihood of the data expected under the conditional density. The *complexity* (second term) is the Kullback-Leibler divergence between the approximating posterior and prior density. In other words, it reflects the amount of information obtained about the model parameters, from the data. Clearly, model complexity will increase with the number of parameters (provided that they can be estimated precisely and that they diverge from their prior values). However, model complexity depends on factors other than the mere number of parameters, *e.g.* how much these parameters are dependent on each other, both *a priori* and *a posteriori*. This is seen easily under the Laplace approximation, *i.e.* assuming that the conditional density is multivariate Gaussian. In this case, the complexity can be written as follows (see the Appendix of Penny *et al.* 2004):

$$KL[q\left(\vartheta \right),p(\vartheta \mid m)]=\frac{1}{2}\mid {C}_{\vartheta}\mid -\frac{1}{2}\mid {C}_{\vartheta \mid y}\mid +\frac{1}{2}{({\mu}_{\vartheta \mid y}-{\mu}_{\vartheta})}^{T}{C}_{\vartheta}^{-1}({\mu}_{\vartheta \mid y}-{\mu}_{\vartheta})$$

A.3

Here, |*C _{}*| and |

In addition to the free-energy bound approximation, there are two other commonly used approximations to the log-evidence, which appeal to the behaviour of the complexity term as the number of observations becomes infinite. We will call these limit-approximations. These include the *AIC* and *BIC* (see Penny *et al.* 2004). The key difference between the free-energy bound and these limit approximations is that the latter assume a much simpler approximation to the complexity. Under Gaussian assumptions about the error:

$$\begin{array}{c}\mathit{BIC}={\langle \mathrm{log}\phantom{\rule{thinmathspace}{0ex}}p(y\mid \vartheta ,m)\rangle}_{q}-\frac{p}{2}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}n\hfill \\ \mathit{AIC}={\langle \mathrm{log}\phantom{\rule{thinmathspace}{0ex}}p(y\mid \vartheta ,m)\rangle}_{q}-p\hfill \end{array}$$

A.4

It can be seen that the *AIC* and *BIC* approximate the complexity with the number of parameters or the number of parameters *p*, scaled by the log of the number of observations, *n*. These can be useful approximations when it is difficult to invert the model or optimise the free-energy bound, because one only needs to compute the accuracy or fit of the model to provide an estimate of the log-evidence. However, comparing the complexity terms in these expressions to Equation A.3, shows that both the *AIC* and *BIC* will fail in various situations. An obvious example is redundant parameterisation; the true complexity will not change when we add a parameter whose effect is identical to another parameter in measurement space. While the free-energy bound would take this redundancy into account, retaining the same complexity, the *AIC* and *BIC* approximations would indicate that complexity has increased. In practice, many models show partial dependencies amongst parameters, meaning that *AIC* and *BIC* routinely over-estimate the effect that adding or removing parameters has on model complexity.

In this appendix, we introduce a sampling procedure that provides an approximation to the negative free energy *F*(*y,α*) ≤ ln *p*(*y* | *α*) which is independent from the VB approach described in the main text. This sampling procedure can be used to demonstrate the correctness of the proposed VB procedure by verifying that the algorithm described by Equation 14 provides an accurate solution for the variational energies in the mean-field approximation of Equation 8. In this context, it should be noted that we are assuming that the exact posterior *p*(*r* | *y*) can be adequately approximated by a Dirichlet density *q*(*r*); therefore, the procedure proposed in this appendix samples from the approximate posterior *q*(*r*), not from the exact posterior *p*(*r* | *y*).

We seek the posterior density on the multinomial parameters *r*=[*r*_{1},...,*r _{K}*] that generate switches or indicator variables,

$$q(r;\alpha )=D\left(\alpha \right)=\frac{1}{Z\left(\alpha \right)}\prod _{k=1}^{k}{r}_{k}^{{\alpha}_{k}-1}\Rightarrow \mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}q(r;\alpha )=-\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}Z\left(\alpha \right)+\sum _{k}\left({\alpha}_{k-1}\right)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{r}_{k}$$

B.1

where the expected multinomial parameters (*i.e*., conditional expectation that the *k*-th model will be selected at random) are

$$\begin{array}{c}\hfill {\langle {r}_{k}\rangle}_{q}=\frac{{\alpha}_{k}}{{\alpha}_{S}}\hfill \\ \hfill {\alpha}_{S}=\sum _{k=1}^{K}{\alpha}_{k}\hfill \end{array}$$

B.2

Note that a Dirichlet form ensures that $\sum _{k=1}^{K}{r}_{k}=1$. The normalising or partition coefficient in B.1 is

$$Z\left(\alpha \right)=\frac{\prod _{k}\Gamma \left({\alpha}_{k}\right)}{\Gamma \left({\alpha}_{S}\right)}\Rightarrow \mathrm{ln}Z\left(\alpha \right)=-\mathrm{ln}\Gamma \left({\alpha}_{S}\right)+\sum _{k}\mathrm{ln}\Gamma \left({\alpha}_{k}\right)$$

B.3

We can now construct a free-energy bound in the usual way, assuming Dirichlet priors *α*_{0} (which would usually be *α*_{0}=[1,...,1] unless one had prior beliefs about which model is more likely to be selected):

$$F(y,\alpha )={\langle \mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{1}\mid r)+...+\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{N}\mid r)+\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p(r\mid {\alpha}_{0})-\mathrm{ln}q(r\mid \alpha )\rangle}_{q}$$

B.4

This can be decomposed into three terms:

$$\begin{array}{c}{\langle \mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{n}\mid r)\rangle}_{q}={\langle \mathrm{ln}\sum _{k}p({y}_{n}\mid {m}_{nk}=1){r}_{k}\rangle}_{q}\hfill \\ {\langle \mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}q(r\mid \alpha )\rangle}_{q}=-\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}Z\left(\alpha \right)+\sum _{k}({\alpha}_{k}-1)[\Psi \left({\alpha}_{k}\right)-\Psi \left({\alpha}_{S}\right)]\hfill \\ {\langle \mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p(r\mid {\alpha}_{0})\rangle}_{q}=-\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}Z\left({\alpha}_{0}\right)+\sum _{k}({\alpha}_{0k}-1)[\Psi \left({\alpha}_{k}\right)-\Psi \left({\alpha}_{S}\right)]\hfill \end{array}$$

B.5

The last two terms only depend on the priors *α*_{0k} and the parameters *α* of the Dirichlet and can thus be computed directly. The first term can be computed numerically by drawing a large number of samples from *q*(*r*;*α*) . In this paper, we gridded the possible range for values of *α _{k}, i.e.* [1 ...

$$\alpha =\underset{\alpha}{\mathrm{arg}\phantom{\rule{thinmathspace}{0ex}}\mathrm{max}}F(y,\alpha )$$

B.6

As a final note, we would like to point out that one could also use Jensen’s inequality to simplify the first term in B.5:

$$\begin{array}{cc}\hfill {\langle \mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{n}\mid r)\rangle}_{q}& ={\langle \mathrm{ln}\sum _{k}p({y}_{n}\mid {m}_{nk}=1){r}_{k}\rangle}_{q}\hfill \\ \hfill & \ge {\langle \sum _{k}{r}_{k}\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{n}\mid {m}_{nk}=1)\rangle}_{q}=\sum _{k}\frac{{\alpha}_{k}}{{\alpha}_{S}}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{n}\mid {m}_{nk}=1)\hfill \end{array}$$

B.7

This effectively provides a lower-bound on a lower-bound, which can be simplified to give

$$\begin{array}{cc}\hfill & \stackrel{~}{F}(y,\alpha )=\hfill \\ \hfill & \sum _{k}\left(\frac{{\alpha}_{k}}{{\alpha}_{S}}\sum _{n}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}p({y}_{n}\mid {m}_{nk}=1)-({\alpha}_{k}-{\alpha}_{k}^{0})[\Psi \left({\alpha}_{k}\right)-\Psi \left({\alpha}_{S}\right)]+\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\Gamma \left({\alpha}_{k}\right)\right)-\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\Gamma \left({\alpha}_{S}\right)\hfill \end{array}$$

B.8

Given the priors, *α*_{0}, and the log-evidences ln *p*(*y _{n}* |

^{1}Due to the monotonic nature of the logarithmic function, model comparisons yield equivalent results regardless whether one maximises the model evidence or the log-evidence. Since the latter is numerically easier, it is usually the preferred metric.

^{2}See Appendix B in Bishop (2006) concerning the use of the digamma function in Equation 10.

^{3}Note that this choice of Dirichlet prior is a “flat” prior, assigning uniform probabilities to all models. In contrast, a Dirichlet prior with elements below unity results in a highly concave probability density that concentrates the probability mass around zero and one, respectively.

^{4}For the special case of “drawing” a single “sample” (model), the multinomial distribution of models reduces to *p*(*m _{nk}*=1 |

^{5}The coupling parameters of all endogenous connections were set to 0.1 s^{-1}, except for the inhibitory self-connections whose strengths were set to -1 s^{-1}. Furthermore, the strengths of all modulatory and driving inputs were set to 0.3 s^{-1}. The input functions were the same as in the empirical dataset described above.

^{6}Because of the monotonic nature of the logarithm, one can maximise the model evidence or the log-evidence; the latter, however, is numerically more convenient to deal with. Please note that for simplicity and clarity we have removed constant terms from the definition of all approximations to the log-evidence discussed in this paper.

^{7}It is helpful to note that the determinant of a covariance matrix can be treated as a measure of the volume spanned by a set of vectors (Woodruff 2005). This volume increases with the degree of independence amongst the vectors.

Software note

The method described in this paper is freely available to the community as part of the open-source software package Statistical Parametric Mapping (SPM8; http://www.fil.ion.ucl.ac.uk/spm).

- Acs F, Greenlee MW. Connectivity modulation of early visual processing areas during covert and overt tracking tasks. NeuroImage. 2008;41:380–388. [PubMed]
- Akaike H. A new look at the statistical model identification. IEEE Trans. Automatic Control. 1974;19:716–723.
- Allen P, Mechelli A, Stephan KE, Day F, Dalton J, Williams S, McGuire PK. Fronto-temporal interactions during overt verbal initiation and suppression. J. Cogn. Neurosci. 2008;20:1656–1669. [PubMed]
- Bishop CM. Pattern recognition and machine learning. Springer; Berlin: 2006.
- Brodersen KH, Penny WD, Harrison LM, Daunizeau J, Ruff CC, Duzel E, Friston KJ, Stephan KE. Integrated Bayesian models of learning and decision making for saccadic eye movements. Neural Netw. 2008;21:1247–1260. [PMC free article] [PubMed]
- Deneux T, Faugeras O. Using nonlinear models in fMRI data analysis: Model selection and activation detection. NeuroImage. 2006;32:1669–1689. [PubMed]
- Diedrichsen J, Shadmehr R. Detecting and adjusting for artifacts in fMRI time series data. NeuroImage. 2005;27:624–634. [PMC free article] [PubMed]
- Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973;1:209–230.
- Friston KJ, Mechelli A, Turner R, Price CJ. Nonlinear responses in fMRI: the Balloon model, Volterra kernels, and other hemodynamics. NeuroImage. 2000;12:466–477. [PubMed]
- Friston KJ, Harrison L, Penny W. Dynamic causal modelling. NeuroImage. 2003;19:1273–1302. [PubMed]
- Friston KJ, Mattout J, Trujillo-Barreto N, Ashburner A, Penny WD. Variational free-energy and the Laplace approximation. NeuroImage. 2007;34:220–234. [PubMed]
- Garrido MI, Kilner JM, Kiebel SJ, Stephan KE, Friston KJ. Dynamic causal modelling of evoked potentials: a reproducibility study. NeuroImage. 2007;36:571–580. [PMC free article] [PubMed]
- Garrido MI, Friston KJ, Kiebel SJ, Stephan KE, Baldeweg T, Kilner JM. The functional anatomy of the MMN: A DCM study of the roving paradigm. NeuroImage. 2008;42:936–944. [PMC free article] [PubMed]
- Grol MJ, Majdandzić J, Stephan KE, Verhagen L, Dijkerman HC, Bekkering H, Verstraten FA, Toni I. Parieto-frontal connectivity during visually guided grasping. J. Neurosci. 2007;27:11877–11887. [PMC free article] [PubMed]
- Hampton AN, Bossaerts P, O’Doherty JP. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 2006;26:8360–8367. [PubMed]
- Heim S, Eickhoff SB, Ischebeck AK, Friederici AD, Stephan KE, Amunts K. Effective connectivity of the left BA 44, BA 45, and inferior temporal gyrus during lexical and phonological decisions identified with DCM. Hum. Brain Mapp. 30:392–402. [PubMed]
- Henson RN, Mattout J, Singh KD, Barnes GR, Hillebrand A, Friston KJ. Population-level inferences for distributed MEG source localization under multiple constraints: application to face-evoked fields. NeuroImage. 2007;38:422–438. [PubMed]
- Kass RE, Raftery AE. Bayes factors. J. Am. Stat. Assoc. 1995;90:773–795.
- Kullback S, Leibler RA. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86.
- Kumar S, Stephan KE, Warren JD, Friston KJ, Griffiths TD. Hierarchical processing of auditory objects in humans. PLoS Comput. Biol. 2007;3:e100. [PubMed]
- Leff AP, Schofield AM, Stephan KE, Crinion JT, Friston KJ, Price CJ. The cortical dynamics of intelligible speech. J. Neurosci. 2008;28:13209–13215. [PMC free article] [PubMed]
- Li J, Wang J, Palmer SJ, McKeown MJ. Dynamic Bayesian network modelling of fMRI:A comparison of group-analysis methods. NeuroImage. 2008;41:398–407. [PubMed]
- Litvak V, Friston K. Electromagnetic source reconstruction for group studies. NeuroImage. 42:1490–1498. [PMC free article] [PubMed]
- MacKay DJC. Information theory, inference, and learning algorithms. Cambridge University Press; Cambridge: 2003.
- Mattout J, Henson RN, Friston KJ. Canonical Source Reconstruction for MEG. Comput. Intell. Neurosci. 2007:67613. [PMC free article] [PubMed]
- Miller KL, Luh WM, Liu TT, Martinez A, Obata T, Wong EC, Frank LR, Buxton RB. Nonlinear temporal dynamics of the cerebral blood flow response. Hum. Brain Mapp. 2001;13:1–12. [PubMed]
- Neal RM, Hinton GE. A view of the EM algorithm that justifies incremental sparse and other variants. In: Jordan MI, editor. Learning in Graphical Models. Kluwer Academic Publishers; Dordrecht: 1998.
- Neyman J, Pearson E. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A. 1933;231:289–337.
- Penny WD, Stephan KE, Mechelli A, Friston KJ. Comparing dynamic causal models. NeuroImage. 2004;22:1157–1172. [PubMed]
- Pitt MA, Myung IJ. When a good fit can be bad. Trends Cogn. Sci. 2002;6:421–425. [PubMed]
- Price CJ, Friston KJ. Degeneracy and cognitive anatomy. Trends Cogn. Sci. 2002;6:416–421. [PubMed]
- Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6:461–464.
- Smith AP, Stephan KE, Rugg MD, Dolan RJ. Task and content modulate amygdala-hippocampal connectivity in emotional retrieval. Neuron. 2006;49:631–638. [PubMed]
- Stephan KE, Baldeweg T, Friston KJ. Synaptic plasticity and dysconnection in schizophrenia. Biol. Psychiatry. 2006;59:929–939. [PubMed]
- Stephan KE, Harrison LM, Kiebel SJ, David O, Penny WD, Friston KJ. Dynamic causal models of neural system dynamics: current state and future extensions. J. Biosci. 2007a;32:129–144. [PMC free article] [PubMed]
- Stephan KE, Marshall JC, Penny WD, Friston KJ, Fink GR. Interhemispheric integration of visual processing during task-driven lateralization. J. Neurosci. 2007b;27:3512–3522. [PMC free article] [PubMed]
- Stephan KE, Weiskopf N, Drysdale PM, Robinson PA, Friston KJ. Comparing hemodynamic models with DCM. NeuroImage. 2007c;38:387–401. [PMC free article] [PubMed]
- Stephan KE, Kasper L, Harrison LM, Daunizeau J, den Ouden HEM, Breakspear M, Friston KJ. Nonlinear dynamic causal models for fMRI. NeuroImage. 2008;42:649–662. [PMC free article] [PubMed]
- Summerfield C, Koechlin E. A neural representation of prior information during perceptual inference. Neuron. 2008;59:336–347. [PubMed]
- Vazquez AL, Noll DC. Nonlinear aspects of the BOLD response in functional MRI. NeuroImage. 1998;7:108–118. [PubMed]
- Wager TD, Vazquez A, Hernandez L, Noll DC. Accounting for nonlinear bold effects in fMRI: parameter estimates and a model for prediction in rapid event-related studies. NeuroImage. 2005;25:206–218. [PubMed]
- Wager TD, Keller MC, Lacey SC, Jonides J. Increased sensitivity in neuroimaging analyses using robust regression. NeuroImage. 2005;26:99–113. [PubMed]
- Woodruff DL. General purpose metrics for solution variety. In: Rego C, Alidaee B, editors. Metaheuristic optimization via memory and evolution: tabu search and scatter search. Springer; Berlin: 2005.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |