PLoS Comput Biol. 2013 July; 9(7): e1003150. Published online 2013 July 25. doi: 10.1371/journal.pcbi.1003150. PMCID: PMC3723502

A Mixture of Delta-Rules Approximation to Bayesian Inference in Change-Point Problems

Tim Behrens, Editor

Abstract

Error-driven learning rules have received considerable attention because of their close relationships to both optimal theory and neurobiological mechanisms. However, basic forms of these rules are effective under only a restricted set of conditions in which the environment is stable. Recent studies have defined optimal solutions to learning problems in more general, potentially unstable, environments, but the relevance of these complex mathematical solutions to how the brain solves these problems remains unclear. Here, we show that one such Bayesian solution can be approximated by a computationally straightforward mixture of simple error-driven ‘Delta’ rules. This simpler model can make effective inferences in a dynamic environment and matches human performance on a predictive-inference task using a mixture of a small number of Delta rules. This model represents an important conceptual advance in our understanding of how the brain can use relatively simple computations to make nearly optimal inferences in a dynamic world.

Author Summary

The ability to make accurate predictions is important to thrive in a dynamic world. Many predictions, like those made by a stock picker, are based, at least in part, on historical data thought also to reflect future trends. However, when unexpected changes occur, like an abrupt change in the value of a company that affects its stock price, the past can become irrelevant and we must rapidly update our beliefs. Previous research has shown that, under certain conditions, human predictions are similar to those of mathematical, ideal-observer models that make accurate predictions in the presence of change-points. Despite this progress, these models require superhuman feats of memory and computation and thus are unlikely to be implemented directly in the brain. In this work, we address this conundrum by developing an approximation to the ideal-observer model that drastically reduces the computational load with only a minimal cost in performance. We show that this model better explains human behavior than other models, including the optimal model, and suggest it as a biologically plausible model for learning and prediction.

Introduction

Decisions are often guided by beliefs about the probability and utility of potential outcomes. These beliefs are learned through past experiences that, in stable environments, can be used to generate accurate predictions. However, in dynamic environments, changes can occur that render past experiences irrelevant for predicting future outcomes. For example, after a change in government, historical tax rates may no longer be a reliable predictor of future tax rates. Thus, an important challenge faced by a decision-maker is to identify and respond to environmental change-points, corresponding to when previous beliefs should be abandoned and new beliefs should be formed.

A toy example of such a situation is shown in figure 1A, where we plot the price of a fictional stock over time. In this example, the stock price on a given day (red dots) is generated by sampling from a Gaussian distribution with variance $1 and a mean (dashed black line) that starts at $10 before changing abruptly to $20 at a change-point, perhaps caused by the favorable resolution of a court case. A trader only sees the stock price and not the underlying mean but has to make predictions about the stock price on the next day.

Figure 1
An example change-point problem.

One common strategy for computing this prediction is based on the Delta rule:

$$\hat{X}_{t+1} = \hat{X}_t + \alpha\,\delta_t \qquad (1)$$

According to this rule, an observation, $X_t$, is used to update an existing prediction, $\hat{X}_t$, based on the learning rate, $\alpha$, and the prediction error, $\delta_t = X_t - \hat{X}_t$. Despite its simplicity, this learning rule can provide effective solutions to a wide range of machine-learning problems [1], [2]. In certain forms, it can also account for numerous behavioral findings that are thought to depend on prediction-error signals represented in brainstem dopaminergic neurons, their inputs from the lateral habenula, and their targets in the basal ganglia and the anterior cingulate cortex [3]–[15].
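
As a concrete illustration, equation 1 amounts to only a few lines of code. The following Python sketch runs the rule on a toy price series like that in figure 1; the function and variable names, and the specific 'fast' learning rate of 0.8, are ours rather than values taken from the paper.

```python
import numpy as np

def delta_rule(observations, alpha, x_hat0=0.0):
    """Run a fixed-learning-rate Delta rule (equation 1) over a sequence."""
    x_hat = x_hat0
    predictions = []
    for x in observations:
        predictions.append(x_hat)        # prediction made before seeing x
        delta = x - x_hat                # prediction error
        x_hat = x_hat + alpha * delta    # Delta-rule update
    return np.array(predictions)

# Toy stock example from figure 1: the mean jumps from $10 to $20 at a change-point.
rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(10, 1, 20), rng.normal(20, 1, 20)])
slow = delta_rule(prices, alpha=0.2)   # tracks the stable mean well, adapts slowly
fast = delta_rule(prices, alpha=0.8)   # adapts quickly, but is noisier when stable
```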

Unfortunately, this rule does not perform particularly well in the presence of change-points. We illustrate this problem with a toy example in figure 1B and C. In panel B, we plot the predictions of this model for the toy data set when $\alpha$ is set to 0.2. In this case, the algorithm does an excellent job of computing the mean stock value before the change-point. However, it takes a long time to adjust its predictions after the change-point, undervaluing the stock for several days. In figure 1C, we plot the predictions of the model when $\alpha$ is set to a much higher value. In this case, the model responds rapidly to the change-point but has larger errors during periods of stability.

One way around this problem is to dynamically update the learning rate on a trial-by-trial basis between zero, indicating that no weight is given to the last observed outcome, and one, indicating that the prediction is equal to the last outcome [16], [17]. During periods of stability, a decreasing learning rate can match the current belief to the average outcome. After change-points, a high learning rate shifts beliefs away from historical data and towards more recent, and more relevant, outcomes.

These adaptive dynamics are captured by Bayesian ideal-observer models that determine the rate of learning based on the statistics of change-points and the observed data [18]–[20]. An example of the behavior of the Bayesian model is shown in figure 1D. In this case, the model uses a low learning rate in periods of stability to make predictions that are very close to the mean, then changes to a high learning rate after a change-point to adapt more quickly to the new circumstances.

Recent experimental work has shown that human subjects adaptively adjust learning rates in dynamic environments in a manner that is qualitatively consistent with these algorithms [16], [17], [21]. However, it is unlikely that subjects are basing these adjustments on a direct neural implementation of the Bayesian algorithms, which are complex and computationally demanding. Thus, in this paper we ask two questions: 1) Is there a simpler, general algorithm capable of adaptively adjusting its learning rate in the presence of change-points? And 2) Does the new model better explain human behavioral data than either the full Bayesian model or a simple Delta rule? We address these questions by developing a simple approximation to the full Bayesian model. In contrast to earlier work that used a single Delta rule with an adaptive learning rate [17], [21], our model uses a mixture of biologically plausible Delta rules, each with its own, fixed learning rate, to adapt its behavior in the presence of change-points. We show that the model provides a better match to human performance than the other models. We conclude with a discussion of the biological plausibility of our model, which we propose as a general model of human learning.

Methods

Ethics statement

Human subject protocols were approved by the University of Pennsylvania institutional review board. Informed consent was obtained from all participants prior to taking part in the study.

Change-point processes

To familiarize readers with change-point processes and the Bayesian model, we first review these topics in some detail and then turn our attention to the reduced model.

In this paper we are concerned with data generated from change-point processes. An example of such a process generating Gaussian data is given in figure 2. We start by defining a hazard rate, $h$, that in the general case can be variable over time but for our purposes is assumed to be constant. Change-point locations are then generated by sampling from a Bernoulli distribution with this hazard rate, such that the probability of a change-point occurring at time $t$ is $h$ (figure 2A). In between change-points, in periods we term 'epochs,' the generative parameters of the data are constant. Within each epoch, the values of the generative parameters, $\theta$, are sampled from a prior distribution $p(\theta|\chi_p, \nu_p)$, for some hyper-parameters $\chi_p$ and $\nu_p$ that will be described in more detail in the following sections. For the Gaussian example, $\theta_t$ is simply the mean of the Gaussian at each time point. We generate this mean for each epoch (figure 2B) by sampling from the prior distribution shown in figure 2C. Finally, we sample the data points at each time $t$, $x_t$, from the generative distribution $p(x_t|\theta_t)$ (figure 2D and E).
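
The generative process just described is easy to simulate. The sketch below does so for the Gaussian case; all parameter values are illustrative choices of ours, not those used in the figures.

```python
import numpy as np

def generate_changepoint_data(T, hazard, prior_mean=0.0, prior_std=10.0,
                              noise_std=1.0, seed=0):
    """Sample Gaussian data whose mean is resampled from the prior at change-points."""
    rng = np.random.default_rng(seed)
    means, data = np.empty(T), np.empty(T)
    mu = rng.normal(prior_mean, prior_std)           # mean for the first epoch
    for t in range(T):
        if rng.random() < hazard:                    # Bernoulli change-point
            mu = rng.normal(prior_mean, prior_std)   # resample the generative mean
        means[t] = mu
        data[t] = rng.normal(mu, noise_std)          # observed data point
    return means, data

means, data = generate_changepoint_data(T=500, hazard=0.05)
```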

Figure 2
An example of the generative process behind a change-point data set with Gaussian data.

Full Bayesian model

The goal of the full Bayesian model [18], [19] is to make accurate predictions in the presence of change-points. This model infers the predictive distribution, $p(x_{t+1}|\mathbf{x}_{1:t})$, over the next data point, $x_{t+1}$, given the data observed up to time $t$, $\mathbf{x}_{1:t}$.

In the case where the change-point locations are known, computing the predictive distribution is straightforward. In particular, because the parameters of the generative distribution are resampled independently at a change-point (more technically, the change-points separate the data into product partitions [22]), only data seen since the last change-point are relevant for predicting the future. Therefore, if we define the run-length at time $t$, $r_t$, as the number of time steps since the last change-point, we can write

$$p(x_{t+1}|r_t, \mathbf{x}_{1:t}) = p(x_{t+1}|\mathbf{x}_t^{(r)}) \qquad (2)$$

where we have introduced the shorthand $p(x_{t+1}|\mathbf{x}_t^{(r)})$ to denote the predictive distribution given the last $r_t$ time points, with $\mathbf{x}_t^{(r)}$ denoting those data points. Assuming that our generative distribution is parameterized by parameters $\theta$, then $p(x_{t+1}|\mathbf{x}_t^{(r)})$ is straightforward to write down (at least formally) as the marginal over $\theta$

$$p(x_{t+1}|\mathbf{x}_t^{(r)}) = \int p(x_{t+1}|\theta)\, p(\theta|\mathbf{x}_t^{(r)})\, d\theta \qquad (3)$$

where $p(\theta|\mathbf{x}_t^{(r)})$ is the inferred distribution over $\theta$ given the last $r_t$ time points, and $p(x_{t+1}|\theta)$ is the likelihood of the data given the generative parameters.

When the change-point locations are unknown, the situation is more complex. In particular, we need to compute a probability distribution over all possible values for the run-length given the observed data. This distribution is called the run-length distribution, $p(r_t|\mathbf{x}_{1:t})$. Once we have the run-length distribution, we can compute the predictive distribution in the following way. First we compute the run-length distribution expected on the next trial, $p(r_{t+1}|\mathbf{x}_{1:t})$; i.e.,

$$p(r_{t+1}|\mathbf{x}_{1:t}) = \sum_{r_t} p(r_{t+1}|r_t)\, p(r_t|\mathbf{x}_{1:t}) \qquad (4)$$

where the sum is over all possible values of the run-length at time $t$ and $p(r_{t+1}|r_t)$ is the change-point prior that describes the dynamics of the run-length over time. In particular, because the run-length either increases by one, with probability $1-h$, in between change-points, or decreases to zero, with probability $h$, at a change-point, the change-point prior, $p(r_{t+1}|r_t)$, takes the following form

$$p(r_{t+1}|r_t) = \begin{cases} h & \text{if } r_{t+1} = 0 \\ 1-h & \text{if } r_{t+1} = r_t + 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

Given the distribution $p(r_{t+1}|\mathbf{x}_{1:t})$, we can then compute the predictive distribution of the data on the next trial, $p(x_{t+1}|\mathbf{x}_{1:t})$, in the following manner,

$$p(x_{t+1}|\mathbf{x}_{1:t}) = \sum_{r_{t+1}} p(x_{t+1}|\mathbf{x}_t^{(r)})\, p(r_{t+1}|\mathbf{x}_{1:t}) \qquad (6)$$

where the sum is over all possible values of the run-length at time $t+1$.

All that then remains is to compute the run-length distribution itself, which can be done recursively using Bayes' rule

$$p(r_{t+1}|\mathbf{x}_{1:t+1}) \propto \sum_{r_t} p(x_{t+1}|\mathbf{x}_t^{(r)})\, p(r_{t+1}|r_t)\, p(r_t|\mathbf{x}_{1:t}) \qquad (7)$$

Substituting in the form of the change-point prior for $p(r_{t+1}|r_t)$ we get

$$p(r_{t+1}|\mathbf{x}_{1:t+1}) \propto \begin{cases} h \sum_{r_t} p(x_{t+1}|\mathbf{x}_t^{(r)})\, p(r_t|\mathbf{x}_{1:t}) & \text{if } r_{t+1} = 0 \\ (1-h)\, p(x_{t+1}|\mathbf{x}_t^{(r)})\, p(r_t|\mathbf{x}_{1:t}) & \text{if } r_{t+1} = r_t + 1 \end{cases} \qquad (8)$$

Thus, for each value of the run-length, all but two of the terms in equation 7 vanish and the algorithm requires a number of computations per timestep that grows only linearly in $t$. Unfortunately, although this is a substantial improvement over the exponential complexity of a more naïve change-point model, this computation is still quite demanding. In principle, the total number of run-lengths we must consider is infinite, because we must allow for the possibility that a change-point occurred at any time in the past. In practice, however, it is usual to introduce a maximum run-length, $R$, and define the change-point prior there to be

$$p(r_{t+1}|r_t = R) = \begin{cases} h & \text{if } r_{t+1} = 0 \\ 1-h & \text{if } r_{t+1} = R \end{cases} \qquad (9)$$

With this procedure, the complexity of the computation is bounded but still can remain dauntingly high.
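
To make the recursion concrete, here is a compact Python sketch of the full Bayesian filter for the Gaussian case with known variance. It uses the conjugate-prior updates derived in the next subsection, and for brevity it lets the set of run-lengths grow with $t$ rather than capping it at a maximum run-length; all names and default parameter values are our own.

```python
import numpy as np
from scipy.stats import norm

def full_bayes_gaussian(data, hazard, sigma=1.0, mu_p=0.0, sigma_p=10.0):
    """Full Bayesian run-length filter (equations 4-8) for Gaussian data with known
    variance and a conjugate Gaussian prior over the mean."""
    nu_p = sigma**2 / sigma_p**2                 # prior pseudo-count (see equation 14)
    chi_p = nu_p * mu_p
    w = np.array([1.0])                          # run-length distribution p(r_t | x_1:t)
    chi, nu = np.array([chi_p]), np.array([nu_p])
    preds = []
    for x in data:
        mu_r = chi / nu                                          # mean for each run-length
        preds.append(hazard * mu_p + (1 - hazard) * (w @ mu_r))  # predictive mean (eqs. 4, 6)
        like = norm.pdf(x, mu_r, np.sqrt(sigma**2 * (1 + 1 / nu)))
        grow = (1 - hazard) * w * like           # run-length grows by one (no change-point)
        cp = hazard * np.sum(w * like)           # run-length resets to zero (change-point)
        w = np.concatenate(([cp], grow))
        w /= w.sum()                             # normalized recursion (equations 7-8)
        chi = np.concatenate(([chi_p], chi + x)) # sufficient statistics (equations 16-18)
        nu = np.concatenate(([nu_p], nu + 1))
    return np.array(preds)
```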

Efficient solution for exponential families

This inference algorithm is particularly well suited to problems that involve exponential family distributions (such as the Gaussian, Bernoulli, or Laplace distributions) with a conjugate prior [23]. For these cases, the predictive distribution given the run-length, $p(x_{t+1}|\mathbf{x}_t^{(r)})$, can be represented with a finite number of parameters, called sufficient statistics, that are easily updated when new data arrive.

Specifically, we assume that $x_{t+1}$ is sampled from a distribution with parameters $\theta$, $p(x_{t+1}|\theta)$, which can be related to $p(x_{t+1}|\mathbf{x}_t^{(r)})$ as

$$p(x_{t+1}|\mathbf{x}_t^{(r)}) = \int p(x_{t+1}|\theta)\, p(\theta|\mathbf{x}_t^{(r)})\, d\theta \qquad (10)$$

If $p(x|\theta)$ is an exponential family distribution and we assume a conjugate prior, then this equation is relatively straightforward to compute. Specifically, we assume that $p(x|\theta)$ has the form

$$p(x|\theta) = f(x)\, g(\theta)\, \exp\!\left(\phi(\theta)^\top u(x)\right) \qquad (11)$$

where the forms of $f(x)$, $g(\theta)$, $\phi(\theta)$ and $u(x)$ determine the specific type of exponential family distribution. For example, for a Gaussian distribution with unknown mean, $\mu$, and known variance, $\sigma^2$, we have

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{x^2}{2\sigma^2}\right), \quad g(\mu) = \exp\!\left(-\frac{\mu^2}{2\sigma^2}\right), \quad \phi(\mu) = \frac{\mu}{\sigma^2}, \quad u(x) = x \qquad (12)$$

We further assume that the generative parameters, $\theta$, are resampled at each change-point from a conjugate prior distribution of the form

$$p(\theta|\chi_p, \nu_p) \propto g(\theta)^{\nu_p} \exp\!\left(\phi(\theta)^\top \chi_p\right) \qquad (13)$$

where $\chi_p$ and $\nu_p$ are the prior hyperparameters and the forms of $g(\theta)$ and $\phi(\theta)$ determine the nature of the prior distribution.

For example, for a Gaussian prior distribution over $\mu$ with standard deviation $\sigma_p$ and mean $\mu_p$, we set

$$\nu_p = \frac{\sigma^2}{\sigma_p^2}, \qquad \chi_p = \frac{\sigma^2}{\sigma_p^2}\,\mu_p \qquad (14)$$

With this conjugate prior, the posterior distribution over the parameters given the last $r_t$ data points, $p(\theta|\mathbf{x}_t^{(r)})$, has the same form as the prior, $p(\theta|\chi_t^{(r)}, \nu_t^{(r)})$, and we can write

$$p(\theta|\mathbf{x}_t^{(r)}) \propto g(\theta)^{\nu_t^{(r)}} \exp\!\left(\phi(\theta)^\top \chi_t^{(r)}\right) \qquad (15)$$

This posterior distribution, $p(\theta|\chi_t^{(r)}, \nu_t^{(r)})$ (and thus also the likelihood $p(x_{t+1}|\mathbf{x}_t^{(r)})$, by equation 10), is parameterized by the sufficient statistics $\nu_t^{(r)}$ and $\chi_t^{(r)}$. Crucially, these statistics are straightforward to compute, as follows

$$\nu_t^{(r)} = \nu_p + r_t \qquad (16)$$

and

$$\chi_t^{(r)} = \chi_p + \sum_{i=t-r_t+1}^{t} u(x_i) \qquad (17)$$

Thus, $\nu_t^{(r)}$ is constant for a given run-length, and $\chi_t^{(r)}$ computes a running sum of the most recent $r_t$ data points (transformed by the function $u$).

It is useful to write the equation for $\chi_t^{(r)}$ as an update rule; that is, in terms of the sufficient statistics at an earlier time point. In particular, for $r_{t+1} = r_t + 1$, we can write the update in terms of the sufficient statistic at the previous time point and run-length; i.e.,

$$\chi_{t+1}^{(r_t+1)} = \chi_t^{(r_t)} + u(x_{t+1}) \qquad (18)$$

Dividing through by $\nu_{t+1}^{(r_t+1)} = \nu_p + r_t + 1$ gives a Delta-rule update for the mean, $\mu_t^{(r)} = \chi_t^{(r)} / \nu_t^{(r)}$:

$$\mu_{t+1}^{(r_t+1)} = \mu_t^{(r_t)} + \frac{1}{\nu_p + r_t + 1}\left(u(x_{t+1}) - \mu_t^{(r_t)}\right) \qquad (19)$$

Note that in this case the learning rate, $\alpha = 1/(\nu_p + r_t + 1)$, decays as the run-length increases.
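
A short numerical check makes the connection to standard Bayesian updating explicit: iterating equation 19 from the prior reproduces the conjugate posterior mean. The values below are arbitrary illustrative choices.

```python
import numpy as np

def posterior_mean_via_delta_rule(xs, nu_p=1.0, mu_p=0.0):
    """Apply equation 19: a Delta rule whose learning rate decays with run-length."""
    mu = mu_p
    for r, x in enumerate(xs):            # r is the run-length before seeing x
        alpha = 1.0 / (nu_p + r + 1)      # decaying learning rate
        mu += alpha * (x - mu)            # Delta-rule update
    return mu

xs = np.array([2.0, 4.0, 6.0])
# For mu_p = 0 (so chi_p = 0), this reproduces the conjugate posterior mean
# (chi_p + sum of data) / (nu_p + number of data points):
assert np.isclose(posterior_mean_via_delta_rule(xs), (0.0 + xs.sum()) / (1.0 + len(xs)))
```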

Graphical interpretation

The previous sections showed that, for conjugate exponential distributions, the Bayesian model needs to keep track of only the run-length distribution, $p(r_t|\mathbf{x}_{1:t})$, and the sufficient statistics, $\chi_t^{(r)}$ and $\nu_t^{(r)}$, for each run-length to fully compute the predictive distribution, $p(x_{t+1}|\mathbf{x}_{1:t})$. This algorithm also has an intuitive interpretation in terms of message passing on a graph (Figure 3A). Each node in this graph represents a run-length, $r_t$, with two properties: 1) the sufficient statistics, $\chi_t^{(r)}$ and $\nu_t^{(r)}$, associated with that run-length, and 2) a 'weight' representing the probability that the run-length at time $t$ is $r_t$; i.e., $p(r_t|\mathbf{x}_{1:t})$. The weights of the nodes are computed by passing messages along the edges of the graph. Specifically, each node, $r_t$, sends out two messages: 1) an 'increasing' message to node $r_t + 1$ that corresponds to an increase in run-length if no change-point occurred, with weight $(1-h)\,p(r_t|\mathbf{x}_{1:t})$, and 2) a 'change-point' message, to the node with run-length 0, corresponding to a decrease in run-length at a change-point, with weight $h\,p(r_t|\mathbf{x}_{1:t})$. The weight of each node is then updated by summing all of its incoming messages and multiplying the result by the likelihood $p(x_{t+1}|\mathbf{x}_t^{(r)})$, which implements equation 8.

Figure 3
Schematic of algorithms.

Reduced model

Despite the elegance of the full Bayesian algorithm, it is complex, requiring a memory for a large number of different run-lengths (up to the maximum run-length, $R$), which, in the worst case, is equivalent to keeping track of all the past data. Thus, it seems an unlikely model of human cognition, and a key question is whether comparable predictive performance can be achieved with a simpler, more biologically plausible algorithm. Here we introduce an approximation to the full model that addresses these issues. First we reduce the model's complexity by removing nodes from the update graph (Figure 3). Then we transform the update equation for $\chi$ into a Delta-rule update equation in which the sufficient statistic on each node updates independently of the other nodes. The resulting algorithm is a biologically plausible mixture of Delta rules that is able to flexibly adapt its overall learning rate in the presence of change-points and whose performance is comparable to that of the full Bayesian model at a fraction of the computational cost. Below we derive new update equations for the sufficient statistics and the weights of each new node for this reduced model.

To more easily distinguish the full and reduced models, we use $l_i$ to denote run-length in the reduced model and $r_t$ to denote run-length in the full model. Thus, the reduced model has $N$ nodes, where node $i$ has run-length $l_i$. The set of run-lengths, $\{l_1, \ldots, l_N\}$, is ordered such that $l_1 < l_2 < \cdots < l_N$. Unlike the full model, the run-lengths in the reduced model can take on non-integer values, which allows greater flexibility.

The first step in our approximation is to remove nodes from the update graph. This step reduces the memory demands of the algorithm but also requires us to change the update rule for the sufficient statistic and the form of the change-point prior.

Consider a node with run-length $l_i$. In the full Bayesian model, the sufficient statistic for this node would be

$$\chi_{t+1}^{(l_i)} = \chi_t^{(l_i - 1)} + u(x_{t+1}) \qquad (20)$$

Note that this form of the update relies on having computed $\chi_t^{(l_i - 1)}$, which is the sufficient statistic at run-length $l_i - 1$. In the full Bayesian model, this procedure is straightforward because all possible run-lengths are represented. In contrast, the reduced model includes only a subset of possible run-lengths, and thus a node with run-length $l_i - 1$ will not exist for some values of $l_i$. Therefore, the reduced model must include a new method for updating the sufficient statistic and a new form of the change-point prior.

We first note that another way of writing the update for $\chi_{t+1}^{(l_i)}$ is as

$$\chi_{t+1}^{(l_i)} = \chi_t^{(l_i)} + u(x_{t+1}) - u(x_{t+1-l_i}) \qquad (21)$$

This sliding-window update equation depends only on information available at node $l_i$ and thus does not rely on knowing the sufficient statistic at node $l_i - 1$. However, this update also has a high memory demand because, to update the sliding window, we have to subtract $u(x_{t+1-l_i})$, which we can only do if we keep track of the previous $l_i$ data points on each node.

In our model, we remove the dependence on $u(x_{t+1-l_i})$, and hence the additional memory demands, by taking the average of equation 21. This procedure leads to a memoryless (yet approximate) form of the update equation for each node. In particular, if we take the average of equation 21 with respect to $x_{t+1-l_i}$, we have

$$\hat{\chi}_{t+1}^{(l_i)} = \hat{\chi}_t^{(l_i)} + u(x_{t+1}) - \mu_t^{(l_i)} \qquad (22)$$

where we have introduced $\hat{\chi}_t^{(l_i)}$ as the Delta-rule's approximation to the mean sufficient statistic and

$$\mu_t^{(l_i)} = \frac{\hat{\chi}_t^{(l_i)}}{l_i + \nu_p} \qquad (23)$$

as the mean of the node. Dividing equation 22 by $l_i + \nu_p$ gives us the following form of the update for the mean

$$\mu_{t+1}^{(l_i)} = \mu_t^{(l_i)} + \frac{1}{l_i + \nu_p}\left(u(x_{t+1}) - \mu_t^{(l_i)}\right) \qquad (24)$$

Note that this equation for the update of $\mu_t^{(l_i)}$ is a Delta rule, just like equation 1, with a fixed learning rate, $\alpha_i = 1/(l_i + \nu_p)$. Thus, the reduced model simply has to keep track of $\mu_t^{(l_i)}$ for each node and update it using only the most recent data point. This form of update rule also allows us to interpret non-integer values of the run-length, $l_i$, in terms of changes in the learning rate of the Delta rule on a continuum. In figure 4 we show the effect of this approximation on the extent to which past data points are used to compute the mean of each node. The sliding-window rule computes the average across the last $l_i$ data points, ignoring all previous data. In contrast, the Delta rule computes a weighted average with weights that decay exponentially over time, which tends to slightly under-emphasize the contributions of recent data and over-emphasize the contributions of distant data relative to the sliding window.
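
In code, the entire memory of the reduced model is one mean per node, each updated by equation 24 with its own fixed learning rate. The node spacing and prior pseudo-count ($\nu_p = 1$) below are illustrative choices of ours.

```python
import numpy as np

def update_node_means(mu, x, run_lengths, nu_p=1.0):
    """Equation 24: each node runs its own Delta rule with a fixed learning rate."""
    alphas = 1.0 / (run_lengths + nu_p)     # one fixed learning rate per node
    return mu + alphas * (x - mu)           # independent, memoryless updates

run_lengths = np.array([1.0, 4.0, 16.0])    # hypothetical node spacing
mu = np.zeros(3)
for x in [10.0, 11.0, 9.5, 20.0]:           # toy data ending with a jump
    mu = update_node_means(mu, x, run_lengths)
# After the jump, the short-run-length node has moved most of the way to 20,
# while the long-run-length node still reflects the older data.
```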

Figure 4
Comparison of the extent to which the sliding window and Delta rule updates weigh past information for different run-lengths.

Reducing the number of nodes in the model also requires us to change how we update the weights of each node. In particular, the update for the weights, $p(l_i|\mathbf{x}_{1:t+1})$, is given as

$$p(l_i|\mathbf{x}_{1:t+1}) \propto p(x_{t+1}|\mathbf{x}_t^{(l_i)}) \sum_{j} p(l_i|l_j)\, p(l_j|\mathbf{x}_{1:t}) \qquad (25)$$

This equation is similar to equation 7 but differs in the number of run-lengths available. Crucially, this difference requires an adjustment to the change-point prior. The adjusted prior should approximate the full change-point prior (equation 5) as closely as possible. Recall that the full prior captures the fact that the run-length either decreases to zero if there is a change-point (with prior probability $h$) or increases by one if there is no change-point (with prior probability $1-h$).

To see how to compute this adjusted prior in the reduced model, we first decompose the change-point prior into two terms corresponding to the possibility that a change-point will occur or not; i.e.,

$$p(l_i|l_j) = h\, p(l_i|l_j, \mathrm{CP}) + (1-h)\, p(l_i|l_j, \overline{\mathrm{CP}}) \qquad (26)$$

where $p(l_i|l_j, \mathrm{CP})$ is the probability that the run-length is $l_i$ given that there was a change-point and that the previous run-length was $l_j$. Similarly, $p(l_i|l_j, \overline{\mathrm{CP}})$ is the probability that the run-length is $l_i$ given that the previous run-length was $l_j$ and there was not a change-point.

The change-point case is straightforward, because a change-point always results in a transition to the shortest run-length; i.e., $p(l_i|l_j, \mathrm{CP})$ is zero, except when $l_i = l_1$, when it takes the value 1.

The no-change-point case, however, is more difficult. In the full model the run-length increases by 1 when there is no change-point, so we would like to have

$$p(l_i|l_j, \overline{\mathrm{CP}}) = \begin{cases} 1 & \text{if } l_i = l_j + 1 \\ 0 & \text{otherwise} \end{cases} \qquad (27)$$

However, because the nodes have variable spacing in the reduced model, this form is not possible, as there may be no node with a run-length $l_j + 1$. We thus seek an approximation such that the prior defines an average increase in run-length of 1 if there is not a change-point. That is, we require

$$\sum_{i} l_i\, p(l_i|l_j, \overline{\mathrm{CP}}) = l_j + 1 \qquad (28)$$

For nodes whose spacing satisfies $l_{j+1} - l_j \ge 1$, we can match this expectation exactly by setting

$$p(l_{j+1}|l_j, \overline{\mathrm{CP}}) = \frac{1}{l_{j+1} - l_j}, \qquad p(l_j|l_j, \overline{\mathrm{CP}}) = 1 - \frac{1}{l_{j+1} - l_j} \qquad (29)$$

For nodes spaced less than one unit apart, we approximate $p(l_i|l_j, \overline{\mathrm{CP}})$ using

$$p(l_{j+1}|l_j, \overline{\mathrm{CP}}) = 1 \qquad (30)$$

In this case we do not match the expected increase in run-length. For the final node, $l_N$, it is impossible to transition to a longer run-length and so we simply have a self-transition with probability 1; i.e.,

$$p(l_N|l_N, \overline{\mathrm{CP}}) = 1 \qquad (31)$$

Taken together with equation 26, equations 29, 30 and 31 define the change-point prior in the reduced model.
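
The adjusted prior can be assembled into a single transition matrix, as in the sketch below. Equations 29-31 are reconstructed from the surrounding text, and the handling of nodes spaced less than one unit apart (equation 30) in particular is our reading: the 'increase' probability is simply capped at 1.

```python
import numpy as np

def changepoint_prior(run_lengths, hazard):
    """Build the reduced model's change-point prior as a transition matrix
    T[i, j] = p(l_i at t+1 | l_j at t), following equations 26 and 29-31."""
    N = len(run_lengths)
    no_cp = np.zeros((N, N))                 # p(l_i | l_j, no change-point)
    for j in range(N - 1):
        gap = run_lengths[j + 1] - run_lengths[j]
        p_up = min(1.0, 1.0 / gap)           # gives a mean increase of 1 when gap >= 1
        no_cp[j + 1, j] = p_up
        no_cp[j, j] = 1.0 - p_up
    no_cp[N - 1, N - 1] = 1.0                # final node: self-transition only
    cp = np.zeros((N, N))
    cp[0, :] = 1.0                           # a change-point resets to the shortest node
    return hazard * cp + (1 - hazard) * no_cp

T = changepoint_prior(np.array([1.0, 2.0, 5.0, 12.0]), hazard=0.1)
assert np.allclose(T.sum(axis=0), 1.0)       # each column is a proper distribution
```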

Like the full Bayesian model, our reduced model also has a graphical interpretation. Again, each node, $l_i$, keeps track of two quantities: 1) the mean $\mu_t^{(l_i)}$, computed according to equation 24, and 2) the weight $p(l_i|\mathbf{x}_{1:t})$. As in the full model, the weights are computed by passing messages along the edges of the graph. However, the structure of the graph is slightly different, with no increasing message being sent by node $l_N$ and an extra 'self' message from each node to itself. The increasing message has weight

$$(1-h)\, p(l_{i+1}|l_i, \overline{\mathrm{CP}})\, p(l_i|\mathbf{x}_{1:t}) \qquad (32)$$

the self message has weight

$$(1-h)\, p(l_i|l_i, \overline{\mathrm{CP}})\, p(l_i|\mathbf{x}_{1:t}) \qquad (33)$$

and the change-point message has weight

$$h\, p(l_i|\mathbf{x}_{1:t}) \qquad (34)$$

Finally, the new weight for each node is computed by summing all of its incoming messages and multiplying the result by the node's likelihood, $p(x_{t+1}|\mathbf{x}_t^{(l_i)})$, to implement equation 25.
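
Putting the pieces together, one time step of the reduced model for Gaussian data can be sketched as follows. The two-node example at the end uses a hand-built change-point prior consistent with equations 26, 29 and 31; all parameter values are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import norm

def reduced_model_step(x, mu, w, run_lengths, T, sigma=1.0, nu_p=1.0):
    """One time step of the mixture-of-Delta-rules model for Gaussian data:
    Delta-rule mean updates (equation 24) plus the weight update (equation 25)."""
    pred = w @ mu                                            # prediction before seeing x
    pred_sd = np.sqrt(sigma**2 * (1 + 1 / (run_lengths + nu_p)))
    like = norm.pdf(x, mu, pred_sd)                          # per-node likelihood of x
    w_new = like * (T @ w)                                   # equation 25, unnormalized
    w_new /= w_new.sum()
    mu_new = mu + (x - mu) / (run_lengths + nu_p)            # equation 24
    return pred, mu_new, w_new

# Two-node example (run-lengths 1 and 5, hazard 0.1) with a hand-built prior
# (gap = 4, so the 'increase' probability is 1/4):
h, run_lengths = 0.1, np.array([1.0, 5.0])
T = np.array([[h + (1 - h) * 0.75, h    ],
              [(1 - h) * 0.25,     1 - h]])
mu, w = np.zeros(2), np.array([0.5, 0.5])
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 30), rng.normal(20, 1, 30)])
for x in data:
    pred, mu, w = reduced_model_step(x, mu, w, run_lengths, T)
```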

Results

In this section we present the results of simple simulations comparing the reduced and full models, investigate the error between the reduced model's predictions and the ground truth, and use our model to fit human behavior on a simple prediction task with change-points.

Simulations

First we consider the simplest cases of one and two nodes with Gaussian data. These cases have particularly simple update rules, and their output is easy to understand. We then consider the more general case of many nodes to show how the reduced model retains many of the useful properties of the full model, such as keeping track of an approximate run-length distribution and being able to handle different kinds of data.

One and two nodes

To better understand the model it is useful to consider the special cases of one and two nodes with Gaussian data. When there is only one node, the model has only one run-length, $l_1$. The update for the mean of this single node is given by

$$\mu_{t+1}^{(l_1)} = \mu_t^{(l_1)} + \frac{1}{l_1 + \nu_p}\left(x_{t+1} - \mu_t^{(l_1)}\right) \qquad (35)$$

where we have used the fact that, for Gaussian data with a known variance, $\sigma^2$, we have $u(x) = x$. This update rule is, of course, equivalent to a simple Delta rule with a fixed learning rate. Because there is only one node, computing the run-length distribution is trivial, as $p(l_1|\mathbf{x}_{1:t}) = 1$ for all $t$, and thus the predictions of this model are simply the mean of the single Delta rule.

In the two-node case the model has two nodes with run-lengths $l_1$ and $l_2$. The means of these nodes update according to independent Delta rules

$$\mu_{t+1}^{(l_i)} = \mu_t^{(l_i)} + \frac{1}{l_i + \nu_p}\left(x_{t+1} - \mu_t^{(l_i)}\right), \quad i = 1, 2 \qquad (36)$$

The prediction of the two-node model is given as the weighted sum of these two nodes

$$\hat{x}_{t+1} = w_1\, \mu_t^{(l_1)} + w_2\, \mu_t^{(l_2)} \qquad (37)$$

where the weights, $w_1 = p(l_1|\mathbf{x}_{1:t})$ and $w_2 = p(l_2|\mathbf{x}_{1:t})$, are the components of the run-length distribution that update according to equation 25. For node 1, $w_1$ updates as

$$w_1^{t+1} \propto p(x_{t+1}|\mathbf{x}_t^{(l_1)}) \left[h + (1-h)\, p(l_1|l_1, \overline{\mathrm{CP}})\, w_1^{t}\right] \qquad (38)$$

where we have used the fact that $w_1 + w_2 = 1$ because the run-length distribution is normalized. For node 2, $w_2$ updates as

$$w_2^{t+1} \propto p(x_{t+1}|\mathbf{x}_t^{(l_2)})\, (1-h) \left[p(l_2|l_1, \overline{\mathrm{CP}})\, w_1^{t} + w_2^{t}\right] \qquad (39)$$

Thus, for the two-node case, the run-length distribution is closely tied to the likelihood of the data for each of the nodes, $p(x_{t+1}|\mathbf{x}_t^{(l_1)})$ and $p(x_{t+1}|\mathbf{x}_t^{(l_2)})$. These likelihoods are computed in a straightforward manner given the mean and run-length of each node. For Gaussian data these likelihoods take the form

$$p(x_{t+1}|\mathbf{x}_t^{(l_i)}) = \frac{1}{\sqrt{2\pi\sigma_{l_i}^2}} \exp\!\left(-\frac{\left(x_{t+1} - \mu_t^{(l_i)}\right)^2}{2\sigma_{l_i}^2}\right), \quad \text{with } \sigma_{l_i}^2 = \sigma^2\left(1 + \frac{1}{l_i + \nu_p}\right) \qquad (40)$$

An illustration of the output of the one- and two-node models is shown in figure 5A. This figure shows the predictions of one- and two-node models when faced with a relatively simple change-point task. To generate this figure, the one-node model had a single run-length, $l_1$, whereas the two-node model had two run-lengths, $l_1$ and $l_2$, with $l_1 < l_2$. The hazard rate in each model was set to 0.1, and the noise standard deviation, $\sigma$, was set to 0.5. The two-node model is much better able to adapt to the change-point than the one-node model. Figure 5B shows the evolving weights of the two nodes, determined from the run-length distribution. Before the change-point, the model has a high weight on the long run-length node and a low weight on the short run-length node. At the change-point, this trend reverses abruptly but then returns after the model stabilizes to the mean of the new data.

Figure 5
Output of one- and two- node models on a simple change-point task.

Many nodes

Here we illustrate the utility of the approximate algorithm for solving simulated change-point problems using three different types of generative distribution. The first is a Bernoulli process with a piecewise-constant rate, $\theta$ (Figure 6A), in which the generative distribution takes the following exponential family form

$$p(x|\theta) = (1-\theta)\, \exp\!\left(x \ln\frac{\theta}{1-\theta}\right) \qquad (41)$$

with a uniform prior distribution over the rate, defined by the hyperparameters $\chi_p$ and $\nu_p$.

Figure 6
Examples comparing estimates and run-length distributions from the full Bayesian model and our reduced approximation.

The second is a Gaussian distribution with known standard deviation, $\sigma = 5$, but unknown mean (Figure 6B). In this case, the generative distribution takes on the following exponential family form

$$p(x|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{x^2}{2\sigma^2}\right) \exp\!\left(-\frac{\mu^2}{2\sigma^2}\right) \exp\!\left(\frac{\mu x}{\sigma^2}\right) \qquad (42)$$

with prior hyperparameters $\chi_p$ and $\nu_p$.

The third is a Gaussian distribution with a known mean, $\mu = 0$, and a changing standard deviation, $\sigma_t$ (Figure 6C). In this case, the generative distribution takes on the following exponential family form (written in terms of the precision, $\theta = 1/\sigma_t^2$)

$$p(x|\theta) = \sqrt{\frac{\theta}{2\pi}} \exp\!\left(-\frac{\theta x^2}{2}\right) \qquad (43)$$

with prior hyperparameters $\chi_p$ and $\nu_p$.

For all three cases, both the full and reduced models used a fixed hazard rate (equal to 0.05 for the first and third cases and 0.025 for the second case). The reduced models were initialized with the prior sufficient statistics, $\chi_p$ and $\nu_p$, appropriate to each case and had 18 nodes spaced logarithmically between 1 and 100.
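
In code, this node placement is a one-liner; the prior pseudo-count used here to convert run-lengths into learning rates is assumed to be 1 for illustration.

```python
import numpy as np

run_lengths = np.logspace(0, 2, 18)    # 18 run-lengths from 1 to 100, log-spaced
alphas = 1.0 / (run_lengths + 1.0)     # implied fixed learning rates (nu_p = 1 assumed)
```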

In figure 6, the top row shows the true value of the parameter of interest for the generative process (the Bernoulli rate in panel A, the mean in panel B, and the standard deviation in panel C), the generated data, and the inferred value of the parameter from the full (blue) and reduced (red) models. For all three cases, there is a close correspondence between the values inferred by the full and reduced models. For the Bernoulli case, the full model has an average mean squared error (relative to ground truth) of 0.037, versus 0.041 for the reduced model. For the Gaussian case with known variance, the mean squared errors are 13.9 for the full model and 16.4 for the reduced model. For the Gaussian case with known mean, the errors are 4.3 and 6.2, respectively. We also show the run-length distributions inferred by both models (middle and bottom rows), which are more sparsely sampled by the reduced models but still pick up the major trends seen in the full model.

For these examples, we used more nodes in the reduced model than were necessary to solve these problems effectively, because this approach allowed us to better visualize the run-length distribution in the reduced model and to facilitate comparison with the full model. In the next section, we explore the relationship between the effectiveness of the reduced model and its number of nodes in more detail.

Performance of the reduced model relative to ground truth

Here we derive an approximate, but analytic, expression for the average discrepancy between the predictions made by the reduced model and the ground truth generative parameters. We then use this result to compute approximately optimal node arrangements for a variety of conditions and investigate how the error varies as a function of the parameters in the model.

Analytic expression for error

Although there are many measures we could use to quantify the error between the approximation and the ground truth, for reasons of analytic tractability we focus here on the squared error. More specifically, we compute the expected value, over data and time, of the squared error between the predictive mean of the reduced model, $\hat{x}_{t+1}$, and the ground-truth mean on the next time step, $\mu_{t+1}$; i.e.,

$$E = \left\langle \left(\hat{x}_{t+1} - \mu_{t+1}\right)^2 \right\rangle \qquad (44)$$

Because our model is a mixture model, the mean $\hat{x}_{t+1}$ is given by

$$\hat{x}_{t+1} = \sum_{i=1}^{N} p(l_i|\mathbf{x}_{1:t})\, \mu_t^{(l_i)} \qquad (45)$$

For notational convenience we drop the $l$ subscripts and refer to node $l_i$ simply by its index $i$, writing $w_i = p(l_i|\mathbf{x}_{1:t})$ and $\mu_i = \mu_t^{(l_i)}$. We also refer to the learning rate of node $i$ as $\alpha_i = 1/(l_i + \nu_p)$. Finally, we refer to the set of nodes in the reduced model as $\mathcal{N}$, such that the above equation, in our new notation, becomes

$$\hat{x}_{t+1} = \sum_{i \in \mathcal{N}} w_i\, \mu_i \qquad (46)$$

Substituting this expression into equation 44 for the error we get

$$E \approx \sum_{i, j \in \mathcal{N}} \bar{W}_i \bar{W}_j \left\langle \mu_i \mu_j \right\rangle - 2 \sum_{i \in \mathcal{N}} \bar{W}_i \left\langle \mu_i \mu_{t+1} \right\rangle + \left\langle \mu_{t+1}^2 \right\rangle \qquad (47)$$

Here we have made a mean-field approximation along the lines of

$$\left\langle w_i w_j \mu_i \mu_j \right\rangle \approx \left\langle w_i w_j \right\rangle \left\langle \mu_i \mu_j \right\rangle \approx \bar{W}_i \bar{W}_j \left\langle \mu_i \mu_j \right\rangle \qquad (48)$$

where $\bar{W}_i$ is the average run-length distribution over the reduced model. This assumption is clearly not strictly true, because the weights of the two nodes are driven by at least some of the same data points. Accordingly, this approximation breaks down under certain conditions. For example, when change-point locations are known exactly, $w_i$ and $w_j$ are strongly correlated, because if $w_i = 1$, then $w_j$ is necessarily zero. Thus, under these conditions, $\langle w_i w_j \rangle$ is only non-zero when $i = j$, which is not true in the approximation. However, in noisy environments, change-point locations are rarely known exactly and this approximation is far less problematic. As we show below, the approximation provided a reasonably close match to the actual squared error measured from simulations for both Bernoulli and Gaussian data.

Equations 47 and 48 imply that, to compute the error, we need to compute four quantities: the averages $\langle \mu_i \mu_j \rangle$, $\langle \mu_i \mu_{t+1} \rangle$, and $\langle \mu_{t+1}^2 \rangle$, in addition to the expected run-length distribution, $\bar{W}_i$. A full derivation of these terms is presented in the Supplementary Material; here we focus on presenting how this error varies with model parameters in the specific cases of Bernoulli and Gaussian data. To facilitate comparison between these two different data types, we compute the error relative to the variance of the prior distribution over the data,

$$\tilde{E} = \frac{E}{\mathrm{var}[x]} \qquad (49)$$

where the variance is computed under the prior over the data, $p(x) = \int p(x|\theta)\, p(\theta|\chi_p, \nu_p)\, d\theta$. This variance is the mean squared error obtained if the algorithm simply predicted the mean of the prior distribution at each time step. Thus the 'relative error,' $\tilde{E}$, takes a value of one when the algorithm picks the mean of the prior distribution, which is the limiting case as the learning rate approaches zero.

Error for one node

We first consider how the relative error varies as a function of hazard rate and learning rate for a model with just one node (figure 7). The one-node case is useful because we can easily visualize the results and, because the run-length distribution has only one non-zero term, $\bar{W}_1 = 1$, the expression for the error is exact. Figures 7A and B consider Bernoulli data with a uniform prior. For different settings of the hazard rate, there is a unique learning rate (which is bounded between 0 and 1) that minimizes the error. The value of this optimal learning rate tends to increase as a function of increasing hazard rate, except at high hazard rates, when it decreases to near zero. This decrease at high hazard rates is due to the fact that when a change happens on nearly every trial, the best guess is the mean of the prior distribution, which is better learned with a smaller learning rate that averages over multiple change-points.

Figure 7
Error and optimal learning rates from the one-node model.

Figures 7C and D consider a Gaussian distribution with unknown mean and known variance (using parameters that match the experimental setup: standard deviation = 10, with prior hyperparameters $\chi_p$ and $\nu_p$ set accordingly). These plots show the same qualitative pattern as the Bernoulli case, except that the relative error is smaller and the optimal learning rate varies over a wider range. This variability results from the fact that the costs involved in making a wrong prediction can be much higher in the Gaussian case (because of the larger variance) than in the Bernoulli case, in which the error is bounded between −1 and 1.
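
These analytic curves can also be checked numerically by brute force. The sketch below estimates the one-node error for the Gaussian case by simulating a long change-point sequence and scanning over learning rates; the parameter values are illustrative and the result is a Monte-Carlo approximation rather than the analytic expression.

```python
import numpy as np

def one_node_error(alpha, hazard, T=20000, prior_std=10.0, noise_std=1.0, seed=0):
    """Monte-Carlo estimate of the squared error between a single fixed-rate
    Delta rule and the ground-truth generative mean (cf. equation 44)."""
    rng = np.random.default_rng(seed)
    mu_true = rng.normal(0.0, prior_std)
    mu_hat, total = 0.0, 0.0
    for _ in range(T):
        if rng.random() < hazard:                      # change-point: resample the mean
            mu_true = rng.normal(0.0, prior_std)
        x = rng.normal(mu_true, noise_std)
        total += (mu_hat - mu_true) ** 2               # error of the current prediction
        mu_hat += alpha * (x - mu_hat)                 # one-node Delta rule (equation 35)
    return total / T

alphas = np.linspace(0.01, 0.99, 50)
errors = [one_node_error(a, hazard=0.1) for a in alphas]
best_alpha = alphas[int(np.argmin(errors))]            # numerical optimum for this hazard
```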

Error for multiple nodes

Next we consider the case of multiple nodes. Figure 8 shows the optimal learning rates as a function of hazard rate for the reduced model with 1–3 nodes for Bernoulli (panels A–C) and Gaussian (panels D–F) data. In the Bernoulli case, going to two nodes adds a second, larger learning rate that shows the same non-monotonic dependence on hazard rate as with one node. However, the hazard rate at which the smaller learning rate goes to zero is lower than in the one-node case. For three nodes, the relationship between optimal learning rate and hazard rate is more complicated. We see numerical instability in the optimization procedure at low hazard rates, caused by the presence of several distinct local minima. We also see complex behavior at higher hazard rates: as the smallest learning rate goes to zero, the behavior of the other two learning rates changes dramatically. Similar results were obtained for the Gaussian case, except that for three nodes, the optimal node positions become degenerate as the highest two learning rates converge for intermediate values of the hazard rate.

Figure 8
Optimal learning rates.

In figure 9 we show the relative error as a function of hazard rate at the optimal learning rate settings computed both from simulation and our analytic expression. The close agreement between theory and simulation provides some justification for the approximations we used. More generally, we see that the relative error increases with hazard rate and decreases slightly with more nodes. The biggest improvement in performance comes from increasing from one to two nodes.

Figure 9
Error (normalized by the variance of the prior) as a function of hazard rate for the reduced model at the optimal parameter settings.

Fits to experimental data

In this section, we ask how well our model describes human behavior by fitting versions of the model to behavioral data from a predictive-inference task [24]. Briefly, in this task, 30 human subjects (19 female, 11 male) were shown a sequence of numbers between 0 and 300 that were generated by a Gaussian change-point process. This process had a mean that was randomly sampled at every change-point and a standard deviation that was constant (set to either 5 or 10) for blocks of 200 trials. Samples were constrained to be between 0 and 300 by keeping the generative means away from these bounds (the generative means were sampled from a uniform distribution from 40 to 260) and resampling the small fraction of samples outside of this range until they lay within it. The hazard rate was set at 0.1, except for the first three trials following a change-point, in which case the hazard rate was zero.

The subjects were required to predict the next number in the sequence and obtained more reward the closer their predictions were to the actual outcome. In particular, subjects were required to minimize the mean absolute error between prediction and outcome, which we denote E. Because prediction errors depended substantially on the specific sequence of numbers generated in a given session, the exact conversion between error and monetary reward was computed by comparing performance with two benchmarks: a lower benchmark (LB) and a higher benchmark (HB). The LB was computed as the mean absolute difference between successive generated numbers. The HB was the mean difference between the mean of the generative distribution on the previous trial and the generated number. Payout was then computed as follows:

$$\text{payout} \;\propto\; \frac{LB - E}{LB - HB} \qquad (50)$$
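The error and the two benchmarks follow directly from the definitions above; the sketch below computes E, LB and HB for one session. The final linear scaling of performance between the benchmarks is only an illustrative assumption, since the exact conversion to money is not reproduced here.

import numpy as np

def payout_benchmarks(samples, gen_means, predictions):
    """Compute the error measure and the two payout benchmarks.

    E  : mean absolute error between the subject's predictions and the outcomes
    LB : mean absolute difference between successive outcomes (lower benchmark)
    HB : mean difference between the previous trial's generative mean and the
         current outcome (higher benchmark)
    """
    samples = np.asarray(samples, dtype=float)
    gen_means = np.asarray(gen_means, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    E = np.mean(np.abs(predictions - samples))
    LB = np.mean(np.abs(np.diff(samples)))
    HB = np.mean(np.abs(gen_means[:-1] - samples[1:]))
    # Hypothetical scaling of performance between the two benchmarks
    # (0 at the lower benchmark, 1 at the higher benchmark).
    score = np.clip((LB - E) / (LB - HB), 0.0, 1.0)
    return E, LB, HB, score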

A benefit of this task design is that the effective learning rates used by subjects on a trial-by-trial basis can be computed in terms of their predictions following each observed outcome, using the relationships in equation 1. Our previous studies indicated that these learning rates varied systematically as a function of properties of the generative process, including its standard deviation and the occurrence of change-points [17], [24].
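Assuming that equation 1 has the standard Delta-rule form (new prediction = old prediction + learning rate × prediction error), the implied trial-by-trial learning rates can be recovered as in the sketch below; the function name and the guard against near-zero prediction errors are our additions.

import numpy as np

def effective_learning_rates(predictions, outcomes):
    """Learning rates implied by a Delta-rule update of the form
    prediction[t+1] = prediction[t] + alpha[t] * (outcome[t] - prediction[t])."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    update = predictions[1:] - predictions[:-1]   # how far the prediction moved
    delta = outcomes[:-1] - predictions[:-1]      # prediction error on trial t
    alpha = np.full(update.shape, np.nan)
    ok = np.abs(delta) > 1e-12                    # avoid dividing by ~zero errors
    alpha[ok] = update[ok] / delta[ok]
    return alpha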

To better understand the computational basis for these behavioral findings, we compared five different inference models: the full Bayesian model (‘full’), the reduced model with 1 to 3 nodes and the approximately Bayesian model of Nassar et al [17]. The Nassar et al model instantiates an alternative hypothesis to the mixture of fixed Delta rules by using a single Delta rule with a single, adaptive learning rate to approximate Bayesian inference.

On each trial, each of these models, M, produces a prediction $X_t^{(M)}$ about the location of the next data point. To simulate the effects of decision noise, we assume that the subjects' reported predictions, $\hat{X}_t$, are subject to noise, such that

$$\hat{X}_t = X_t^{(M)} + \epsilon_t \qquad (51)$$

where $\epsilon_t$ is sampled from a Gaussian distribution with mean 0 and standard deviation $\sigma_n$, which we fit as a free parameter for all models.

In addition to this noise parameter, we fit the following free parameters for each model: the full model and the model of Nassar et al. have a hazard rate as their only other parameter; the one-node model has a single learning rate; and the remaining models with $N$ nodes ($N > 1$) have a hazard rate as well as the $N$ learning rates.

Our fits identified the model parameters that maximized the log likelihood of the observed human predictions, $\hat{X}_{1:T}$, given each of the models, M, which is given by

$$\log L_M = \sum_{t=1}^{T} \log \mathcal{N}\!\left(\hat{X}_t;\, X_t^{(M)},\, \sigma_n^2\right) \qquad (52)$$

We used the maximum likelihood value to approximate the log Bayesian evidence, $\log p(\hat{X}_{1:T} \mid M)$, for each model using the standard Bayesian information criterion (BIC) approximation [25], which takes into account the different numbers of parameters in the different models; i.e.,

$$\log p(\hat{X}_{1:T} \mid M) \approx \log L_M^{\max} - \tfrac{1}{2}\, k_M \log T \qquad (53)$$

where $k_M$ is the number of free parameters in model M and $T$ is the number of trials.
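The computations in equations 51–53 amount to a Gaussian log likelihood plus a BIC penalty; a minimal sketch, with function names of our choosing, is:

import numpy as np

def log_likelihood(subject_preds, model_preds, sigma_n):
    """Log likelihood of the reported predictions under a model's predictions
    plus Gaussian decision noise (equations 51-52)."""
    resid = np.asarray(subject_preds, dtype=float) - np.asarray(model_preds, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_n**2) - resid**2 / (2 * sigma_n**2))

def bic_log_evidence(max_log_lik, n_params, n_trials):
    """BIC approximation to the log model evidence (equation 53)."""
    return max_log_lik - 0.5 * n_params * np.log(n_trials)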

Models were then compared at the group level using the Bayesian method of Stephan et al. [26]. Briefly, this method aggregates the evidence from each of the models for each of the subjects to estimate two measures of model fit. The first, which we refer to as the ‘model probability’, is an estimate of how likely it is that a given model generated the data from a randomly chosen subject. The second, termed the ‘exceedance probability’, is the probability that one model is more likely than any of the others to have generated the behavior of all of the subjects.
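For reference, the following is a minimal sketch of this random-effects procedure, assuming the standard variational scheme of Stephan et al. [26] with exceedance probabilities estimated by Monte Carlo sampling from the fitted Dirichlet distribution; it is not the authors' analysis code.

import numpy as np
from scipy.special import digamma

def group_bms(log_evidence, n_samples=100000, max_iter=200, tol=1e-8, rng=None):
    """Random-effects group Bayesian model selection.

    log_evidence : (n_subjects, n_models) array of per-subject log evidences
                   (e.g., the BIC approximations above).
    Returns the expected model probabilities and the exceedance probabilities.
    """
    rng = np.random.default_rng() if rng is None else rng
    log_evidence = np.asarray(log_evidence, dtype=float)
    n_models = log_evidence.shape[1]
    alpha0 = np.ones(n_models)                     # uniform Dirichlet prior
    alpha = alpha0.copy()
    for _ in range(max_iter):
        # Posterior over models for each subject, given the current Dirichlet.
        log_u = log_evidence + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)  # numerical stability
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)
        alpha_new = alpha0 + g.sum(axis=0)         # update Dirichlet counts
        if np.max(np.abs(alpha_new - alpha)) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    model_prob = alpha / alpha.sum()               # expected model probabilities
    draws = rng.dirichlet(alpha, size=n_samples)   # Monte Carlo exceedance probabilities
    exceedance = np.bincount(draws.argmax(axis=1), minlength=n_models) / n_samples
    return model_prob, exceedance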

An important question when interpreting the model fits is the extent to which the different models are identifiable using these analyses. In particular, we are interested in the extent to which different models can be separated on the basis of their behavior and the accuracy with which the parameters of each model can be recovered.

The question of model identifiability is addressed in figure 10, where we plot two confusion matrices showing the model probability (A) and the exceedance probability (B) for simulated data. These matrices were generated using simulations that matched the human-subjects experiments, with the same values of the observed stimuli, the same number of trials per experiment and the same parameter settings as found by fitting the human data. Ideally, both confusion matrices should be the identity matrix, indicating that data generated by model M are always best fit by model M and never by any other model (e.g., [27]). However, because of noise in the data and the limited number of trials in the experiment, it is often the case that not all of the models are completely separable. In the present case, there is good separation among the Nassar et al., full, 1-node and 2-node models and reasonable separation between the 3-node model and the others. When we extended this analysis to include 4- and 5-node models, we found that they were indistinguishable from the 3-node model. Thus, these models are not included in our analyses, and we consider the ‘3-node model’ to represent a model with 3 or more nodes. Note that the confusion matrix showing the exceedance probability (figure 10B) is closer to the identity than the model-probability confusion matrix (figure 10A). This result reflects the fact that the exceedance probability is computed at the group level (i.e., over all the simulated data sets generated by model M), whereas the model probability estimates the chance that any given simulated data set is best fit by model M.

Figure 10
Confusion matrices.

To address the question of parameter estimability, we computed correlations between the simulated parameters and the parameter values recovered by the fitting procedure for each of the models. There was strong correspondence between the simulated and fit parameter values for all of the models and all correlations were significant (see supplementary table S1).

The 3-node model most effectively describes the human data (figure 11), producing slightly better fits than the model of Nassar et al. at the group level. Figure 11A shows the model probability, the estimated probability that any given subject is best fit by each of the models. This measure showed a slight preference for the 3-node model over the model of Nassar et al. Figure 11B shows the exceedance probability for each of the models, the probability that each model best fits the data at the group level. Because this measure aggregates across the group, it magnifies the differences between the models and showed a clearer preference for the 3-node model. Table 1 reports the means of the corresponding fit parameters for each of the models (see also supplementary figure S1 for plots of the full distributions of the fit parameters). Consistent with the optimal parameters derived in the previous section (figure 8E), for the 2- and 3-node models the learning rate of the first node is close to one (mean ~0.95).

Figure 11
Results of the model-fitting procedure.
Table 1
Table of mean fit parameter values for all models ± s.e.m.

Discussion

The world is an ever-changing place. Humans and animals must recognize these changes to make accurate predictions and good decisions. In this paper, we considered dynamic worlds in which periods of stability are interrupted by abrupt change-points that render the past irrelevant for predicting the future. Previous experimental work has shown that humans modulate their behavior in the presence of such change-points in a way that is qualitatively consistent with Bayesian models of change-point detection. However, these models appear to be too computationally demanding to be implemented directly in the brain. Thus we asked two questions: 1) Is there a simple and general algorithm capable of making good predictions in the presence of change-points? And 2) Does this algorithm explain human behavior? In this section we discuss the extent to which we have answered these questions, followed by a discussion of the question that motivated this work: Is this algorithm biologically plausible? Throughout we consider the broader implications of our answers and potential avenues for future research.

Does the reduced model make good predictions?

To address this question, we derived an approximation to the Bayesian model based on a mixture of Delta rules, each implemented in a separate ‘node’ of a connected graph. In this reduced model, each Delta rule has its own fixed learning rate. The overall prediction is generated by computing a weighted sum of the predictions from each node. Because only a small number of nodes are required, the model is substantially less complex than the full Bayesian model. Qualitatively, the outputs of the reduced and full Bayesian models share many features, including the ability to quickly increase the learning rate following a change-point and to reduce it during periods of stability. These features were apparent for the reduced model even with a small number (2 or 3) of nodes. Thus, effective solutions to change-point problems can be achieved with minimal computational cost.
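The sketch below illustrates these two ingredients, a set of fixed learning-rate Delta-rule nodes whose predictions are combined through normalized weights, for the Gaussian case. The weight update shown here is a deliberately simplified stand-in for the run-length update of equation 25, and the class name, the reset heuristic and the example parameter values are ours; it should be read as an illustration of the architecture rather than as the reduced model itself.

import numpy as np

class DeltaMixture:
    """Schematic mixture of fixed learning-rate Delta rules (Gaussian case)."""

    def __init__(self, learning_rates, hazard, sigma, prior_mean=150.0):
        self.alphas = np.asarray(learning_rates, dtype=float)
        self.hazard = hazard
        self.sigma = sigma
        self.mus = np.full(self.alphas.shape, float(prior_mean))     # per-node estimates
        self.w = np.full(self.alphas.shape, 1.0 / len(self.alphas))  # node weights

    def predict(self):
        # Overall prediction: weighted sum of the nodes' predictions.
        return np.dot(self.w, self.mus)

    def update(self, x):
        # Predictive likelihood of the outcome under each node's current estimate
        # (a small floor avoids numerical underflow).
        lik = np.exp(-0.5 * ((x - self.mus) / self.sigma) ** 2) + 1e-12
        # A change-point (probability = hazard) shifts weight onto the fastest node;
        # otherwise the weights persist. This heuristic replaces the exact
        # redistribution of weight across run lengths (equation 25).
        reset = np.zeros_like(self.w)
        reset[np.argmax(self.alphas)] = 1.0
        self.w = ((1.0 - self.hazard) * self.w + self.hazard * reset) * lik
        self.w /= self.w.sum()                    # divisive normalization of the weights
        # Delta-rule update of each node's estimate with its fixed learning rate.
        self.mus += self.alphas * (x - self.mus)
        return self.predict()

# Example usage with arbitrary learning rates on a simulated sequence:
# model = DeltaMixture(learning_rates=[0.1, 0.4, 1.0], hazard=0.1, sigma=10)
# predictions = [model.update(x) for x in samples]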

For future work, it would be interesting to consider other generative distributions, such as a Gaussian with unknown mean and variance, or multidimensional data (e.g., multidimensional Gaussians), to better assess the generality of this solution. In principle, these extensions should be straightforward to deal with in the current model, which would simply require the sufficient statistic to be a vector instead of a scalar. Another obvious extension would be to consider generative parameters that drift over time (perhaps in addition to abrupt changes at change-points) or a hazard rate that changes as a function of run-length and/or time.

Does the reduced model explain human behavior?

To address this question, we used a model-based analysis of human behavior on a prediction task with change-points. The reduced model fit the behavioral data better than either the full Bayesian model or a single learning-rate Delta rule. Our fits also suggest that a three-node model can in many cases be sufficient to explain human performance on the task. However, our experiment did not have the power to distinguish models with more than three nodes. Thus, although the results imply that the three-node model is better than the other models we tested, we cannot rule out the possibility that humans use significantly more than three learning rates.

Despite this qualification, it is an intriguing idea that the brain might use just a handful of learning rates. Our theoretical analysis suggests that this scheme would yield only a small cost in performance for the variety of different problems considered here. In this regard, our model can be seen as complementary to recent work showing that in many probabilistic-inference problems faced by humans [28] and pigeons [29], as few as just one sample from the posterior can be enough to generate good solutions.

It is also interesting to note that, for models with more than one node, the fastest learning rate was always close to one. Such a high learning rate corresponds to a Delta rule that does not integrate any information over time and simply uses the last outcome to form a prediction. This qualitative difference in the behavior of the fastest node could indicate a very different underlying process, such as working memory for the last trial, as proposed in [30], [31].

One situation in which many nodes would be advantageous is when the hazard rate changes as a function of run-length. In that case, having only a few run-lengths available would be problematic, because the changing hazard rate would be difficult to represent. Experiments designed to measure the effects of variable hazard rates on the ability to make predictions might therefore be able to distinguish whether multiple Delta rules are indeed present.

Is the reduced model biologically plausible?

The question of biological plausibility is always difficult to answer in computational neuroscience. This is especially true when, as in this study, the focus of the model is at the algorithmic level and is not directly tied to a specific neural architecture. Nevertheless, one useful approach to guide an answer to this question is to associate key components of the algorithm with known neurobiological mechanisms. Here we support the biological plausibility of our reduced model by showing that signatures of all the elements necessary to implement it have been observed in neural data.

In the reduced model, the update of each node uses a simple Delta rule with a fixed learning rate. The ‘Delta’ of such an update rule corresponds to a prediction error, correlates of which have been found throughout the brain, most notably in brainstem dopaminergic neurons and their targets, and have been used extensively to model behavioral data [3]–[15].

More recently, several studies have also shown evidence for representations of different learning rates, as required by the model. Human subjects performing a statistical-learning task used a pair of learning rates, one fast and one slow, that were associated with BOLD activity in two different brain areas, with the hippocampus responsible for slow learning and the striatum for fast learning [32]. A related fMRI study showed different temporal integration in one network of brain areas, including the amygdala, versus another, more sensory network [33]. Complementary work at the neural level found a reservoir of many different learning rates in three brain regions (anterior cingulate cortex, dorsolateral prefrontal cortex and the lateral intraparietal area) of monkeys performing a competitive game [34]. Likewise, neural correlates of different learning rates have been identified in both the ventral tegmental area and the habenula [35]. Finally, outside of the reward system, other fMRI studies using scrambled movies have found evidence for temporal receptive fields with increasingly long time scales (equivalent to decreasingly small learning rates) moving up the sensory processing hierarchy [36].

Applied to our model, these results suggest that each node is implemented in a distinct, although not necessarily anatomically separated, population of neurons. For our task and the above-referenced studies, in which trials last on the order of seconds, we speculate that the mean of a node is encoded in persistent firing of neurons. Alternatively, for tasks requiring learning over longer timescales, other mechanisms such as changes in synaptic weights might play key roles in these computations.

Our model also depends on the run-length distribution. Functionally, this distribution serves as a weighting function, determining how each of the different nodes (corresponding to different run lengths) contributes to the final prediction. In this regard, the run-length distribution can be thought of as an attentional filter, similar to mechanisms of spatial or feature-based attention, evident in multiple brain regions, that enhance the output of certain signals and suppress others. For longer timescales, this kind of weighting process might have analogies to certain mechanisms of perceptual decision-making that involve the readout of appropriate sensory neurons [37]. Intriguingly, these readout mechanisms are thought to be shaped by experience – governed by a Delta-rule learning process – to ultimately enhance the most reliable sensory outputs and suppress the others [38], [39]. We speculate that a similar process might help select, from a reservoir of nodes with different learning rates, those that can most effectively solve a particular task.

The brain must also solve another challenge to directly implement the run-length distribution in our model. In particular, the update equation for the weights (Eq. 25) includes a constant of proportionality that serves to normalize the probability distribution. On a computer, ensuring that the run-length distribution is normalized is relatively straightforward: after the update we just divide by the sum of the node weights. In the brain, this procedure requires some kind of global divisive normalization among all areas coding different nodes. While such divisive normalization is thought to occur in the brain [40], it may be more difficult to implement over different brain regions that are far apart.

Mixture of Delta rules versus direct modulation of learning rate

An alternative account of variability in learning rates is that the brain uses a single Delta rule whose learning rate is modulated directly. This kind of model has been used previously to explain certain behavioral and imaging results in the context of change-point tasks [17], [21]. A leading candidate for this role is the neuromodulator norepinephrine (NE), which is released from the locus coeruleus (LC) and has been proposed to encode the unexpected uncertainty associated with change-points [41]. The wide-ranging projections of LC, which include most cortical and subcortical structures, and the neuromodulatory properties of NE, which adapts the gain of neural response functions [42], make this system ideally suited to deliver a global signal such as the learning rate. Control of LC could come from top-down projections from anterior cingulate cortex [16], amygdala [43], and posterior cingulate cortex [44], all of which have been proposed to encode learning rate.

Indirect evidence for this account comes from putative correlates of LC activity, such as pupil dilation [43] and the skin conductance response [43], that have been found to correlate with observed learning rates. However, such results are also consistent with our model if we assume that LC activity signals shifts in attentional focus toward Delta rules with higher learning rates, or with a modified version of our model in which the learning rates of the different nodes adapt.

Our model-based analysis of behavioral data provides some evidence in favor of the present model over the single, adaptive learning-rate model of Nassar et al. However, because the experiment was not specifically designed to tease apart these two alternatives, and we did not consider every possible implementation of a variable learning-rate model, this result should be treated with caution. Fully distinguishing between these two accounts will require careful experimentation to determine whether the learning rates of individual neurons (using recordings from animals) or of whole brain areas (using fMRI in humans) are variable or fixed.

Supporting Information

Figure S1

Histograms of fit parameter values for all models. Each column represents a model, with the name of the model given at the top. Each row represents a single parameter, in order from top to bottom: hazard rate, decision-noise standard deviation, learning rate 1, learning rate 2 and learning rate 3. Where a particular model does not have a particular parameter, that box is left empty.

(TIF)

Table S1

Table showing correlation coefficient between simulated and fit parameter values.

(PDF)

Text S1

Derivation of the error relative to ground truth.

(PDF)

Funding Statement

This work was supported by grants NIH R01 EY015260 (to JIG) and F21 MH093099 (to MRN). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Bertsekas D, Tsitsiklis JN (1996) Neurodynamic Programming. Belmont, NJ: Athena Scientific.
2. Sutton RS, Barto AG (1998) Reinforcement Learning : An Introduction. Cambridge, Massachusetts: The MIT Press.
3. Rescorla RA,Wagner AR (1972) A Theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In: Black AH, Prokasy WF, editors, Classical conditioning II: current research and theory. New York: Appleton Century Crofts. chapter 3. pp. 64–99.
4. Miller RR, Barnet RC, Grahame NJ (1995) Assessment of the Rescorla-Wagner model. Psychological Bulletin 117: 363–386. [PubMed]
5. Schultz W, Dayan P, Montague PR (1997) A Neural Substrate of Prediction and Reward. Science 275: 1593–1599. [PubMed]
6. Holroyd CB, Coles MGH (2002) The Neural Basis of Human Error Processing: Reinforcement Learning, Dopamine, and the Error-Related Negativity. Psychological Review 109: 679–709. [PubMed]
7. O'Doherty J, Critchley H, Deichmann R, Dolan RJ (2003) Dissociating Valence of Outcome from Behavioral Control in Human Orbital and Ventral Prefrontal Cortices. The Journal of Neuroscience 23: 7931–7939. [PubMed]
8. Brown JW, Braver TS (2005) Learned Predictions of Error Likelihood in the Anterior Cingulate Cortex. Science 307: 1118–1121. [PubMed]
9. Debener S, Ullsperger M, Siegel M, Fiehler K, Cramon DYV, et al. (2005) Trial-by-Trial Coupling of Concurrent Electroencephalogram and Functional Magnetic Resonance Imaging Identifies the Dynamics of Performance Monitoring. The Journal of Neuroscience 25: 11730–11737. [PubMed]
10. Seo H, Lee D (2007) Temporal Filtering of Reward Signals in the Dorsal Anterior Cingulate Cortex during a Mixed-Strategy Game. The Journal of Neuroscience 27: 8366–8377. [PMC free article] [PubMed]
11. Matsumoto M, Hikosaka O (2007) Lateral habenula as a source of negative reward signals in dopamine neurons. Nature 447: 1111–1115. [PubMed]
12. Matsumoto M, Matsumoto K, Abe H, Tanaka K (2007) Medial prefrontal cell activity signaling prediction errors of action values. Nature Neuroscience 10: 647–656. [PubMed]
13. Kennerley SW, Behrens TEJ, Wallis JD (2011) Double dissociation of value computations in orbitofrontal and anterior cingulate neurons. Nature Neuroscience 14: 1581–1589. [PMC free article] [PubMed]
14. Silvetti M, Seurinck R, Verguts T (2011) Value and prediction error in medial frontal cortex: integrating the single-unit and systems levels of analysis. Frontiers in Human Neuroscience 5: 1–15. [PMC free article] [PubMed]
15. Hayden BY, Pearson JM, Platt ML (2011) Neuronal basis of sequential foraging decisions in a patchy environment. Nature Neuroscience 14: 933–939. [PMC free article] [PubMed]
16. Behrens TEJ, Woolrich MW, Walton ME, Rushworth MFS (2007) Learning the value of information in an uncertain world. Nature Neuroscience 10: 1214–1221. [PubMed]
17. Nassar MR, Wilson RC, Heasly B, Gold JI (2010) An Approximately Bayesian Delta-Rule Model Explains the Dynamics of Belief Updating in a Changing Environment. The Journal of Neuroscience 30: 12366–12378. [PMC free article] [PubMed]
18. Adams RP, Mackay DJC (2007) Bayesian Online Changepoint Detection. Technical report, Cambridge University, Cambridge.
19. Fearnhead P, Liu Z (2007) On-line inference for multiple changepoint problems. J R Statist Soc B 69: 589–605.
20. Wilson RC, Nassar MR, Gold JI (2010) Bayesian Online Learning of the Hazard Rate in Change-Point Problems. Neural Computation 2476: 2452–2476. [PMC free article] [PubMed]
21. Krugel LK, Biele G, Mohr PNC, Li SC, Heekeren HR (2009) Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt decisions. PNAS 106: 17951–17956. [PubMed]
22. Barry D, Hartigan JA (1992) Product Partition Models for Change Point Problems. The Annals of Statistics 20: 260–279.
23. Wainwright MJ, Jordan MI (2008) Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning 1: 1–305.
24. Nassar MR, Rumsey KM, Wilson RC, Parikh K, Heasly B, et al. (2012) Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience 15: 1040–1046. [PMC free article] [PubMed]
25. Schwarz G (1978) Estimating the dimension of a model. The Annals of Statistics 6: 461–464.
26. Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009) Bayesian model selection for group studies. Neuroimage 46: 1004–17. [PMC free article] [PubMed]
27. Steyvers M, Lee MD, Wagenmakers EJ (2009) A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology 53: 168–179.
28. Vul E, Goodman ND, Griffiths TL, Tenenbaum JB (2008) One and Done? Optimal Decisions From Very Few Samples. In: Proceedings of the 31st Annual Conference of the Cognitive Science Society.
29. Daw ND, Courville AC (2008) The pigeon as particle filter. In: Platt J, Koller D, Singer Y, Roweis S, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA: MIT Press. pp. 369–376.
30. Collins AGE, Frank MJ (2012) How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Eur J Neurosci 35: 1024–35. [PMC free article] [PubMed]
31. Collins A, Koechlin E (2012) Reasoning, learning, and creativity: frontal lobe function and human decision-making. PLoS Biol 10: e1001293. [PMC free article] [PubMed]
32. Bornstein AM, Daw ND (2012) Dissociating hippocampal and striatal contributions to sequential prediction learning. European Journal of Neuroscience 35: 1011–1023. [PMC free article] [PubMed]
33. Gläscher J, Büchel C (2005) Formal learning theory dissociates brain regions with different temporal integration. Neuron 47: 295–306. [PubMed]
34. Bernacchia A, Seo H, Lee D, Wang XJ (2011) A reservoir of time constants for memory traces in cortical neurons. Nature Neuroscience 14: 366–372. [PMC free article] [PubMed]
35. Bromberg-Martin ES, Matsumoto M, Nakahara H, Hikosaka O (2010) Multiple Timescales of Memory in Lateral Habenula and Dopamine Neurons. Neuron 67: 499–510. [PMC free article] [PubMed]
36. Hasson U, Yang E, Vallines I, Heeger DJ, Rubin N (2008) A Hierarchy of Temporal Receptive Windows in Human Cortex. Journal of Neuroscience 28: 2539–2550. [PMC free article] [PubMed]
37. Gold JI, Shadlen MN (2007) The neural basis of decision making. Annu Rev Neurosci 30: 535–74. [PubMed]
38. Law CT, Gold JI (2008) Neural correlates of perceptual learning in a sensory-motor, but not a sensory, cortical area. Nat Neurosci 11: 505–13. [PMC free article] [PubMed]
39. Law CT, Gold JI (2009) Reinforcement learning can account for associative and perceptual learning on a visual-decision task. Nat Neurosci 12: 655–63. [PMC free article] [PubMed]
40. Heeger D (1993) Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. Journal of Neurophysiology 70: 1885–1898. [PubMed]
41. Yu AJ, Dayan P (2005) Uncertainty, Neuromodulation, and Attention. Neuron 46: 681–692. [PubMed]
42. Servan-Schreiber D, Printz H, Cohen JD (1990) A Network Model of Catecholamine Effects: Gain, Signal-to-Noise Ratio, and Behavior. Science 249: 892–895. [PubMed]
43. Li J, Schiller D, Schoenbaum G, Phelps EA, Daw ND (2011) Differential roles of human striatum and amygdala in associative learning. Nature Neuroscience 14: 1250–1252. [PMC free article] [PubMed]
44. Pearson JM, Heilbronner SR, Barack DL, Hayden BY, Platt ML (2011) Posterior cingulate cortex: adapting behavior to a changing world. Trends in Cognitive Sciences 15: 143–151. [PMC free article] [PubMed]
