


Article sections

- Abstract
- 1 Introduction
- 2 Inference when hazard rate is known
- 3 On-line inference of a constant hazard rate
- 4 Estimation of hazard rates in a change-point hierarchy
- 5 Node pruning for efficient computation
- 6 Simulations
- 7 Conclusion
- Supplementary Material
- References


Neural Comput. Author manuscript; available in PMC 2010 December 1.

PMCID: PMC2966286

NIHMSID: NIHMS236982

Robert C. Wilson, Department of Psychology, Green Hall, Princeton University, Princeton, NJ 08540, USA;



Change-point models are generative models of time-varying data in which the underlying generative parameters undergo discontinuous changes at different points in time known as change-points. Change-points often represent important events in the underlying processes, like a change in brain state reflected in EEG data or a change in the value of a company reflected in its stock price. However, change-points can be difficult to identify in noisy data streams. Previous attempts to identify change-points on-line using Bayesian inference relied on specifying in advance the rate at which they occur, called the hazard rate (*h*). This approach leads to predictions that can depend strongly on the choice of *h* and is unable to deal optimally with systems in which *h* is not constant in time. In this paper we overcome these limitations by developing a hierarchical extension to earlier models. This approach allows *h* itself to be inferred from the data, which in turn helps to identify when change-points occur. We show that our approach can effectively identify change-points in both toy and real data sets with complex hazard rates and how it can be used as an ideal-observer model for human and animal behavior when faced with rapidly changing inputs.

Whether one is a rat in a lab or a banker on Wall Street, making good, on-line inferences about the present to help predict the future is important for survival. However, such inferences can be difficult in a noisy and dynamic environment. In this paper we consider the problem of Bayesian on-line inference in models of dynamic environments that are characterized by “change-points,” which are defined as abrupt and potentially unsignaled events that have the effect of separating the observed time-series into independent epochs.

Such change-point models have been developed and applied to a variety of data, including stock markets (Chen and Gupta, 1997; Xuan and Murphy, 2007; Koop and Potter, 2004; Hsu, 1977), process control (Aroian and Levene, 1950), disease demographics (Denison and Holmes, 2001), DNA (Liu and Lawrence, 1999; Fearnhead and Liu, 2007), EEG (Bodenstein and Praetorius, 1977; Barlow et al., 1981), NMR (Adams and MacKay, 2007; Fearnhead, 2006), and the bee “waggle dance” (Xuan and Murphy, 2007). However, most of this progress has been restricted to offline inference, which uses the entire data stream (past, present, and future) to infer change-point locations (Smith, 1975; Barry and Hartigan, 1993; Stephens, 1994; Green, 1995; Chib, 1998; Fearnhead, 2006). Existing on-line approaches also have practical limitations, in particular requiring the unrealistic assumption that the frequency with which change-points occur, known as the hazard rate, *h*, is fixed and known in advance (Steyvers and Brown, 2006; Fearnhead and Liu, 2007; Adams and MacKay, 2007).

In this paper we remove this limitation and present a novel Bayesian algorithm that can make on-line inferences about change-points in data in which the hazard rate is unknown and can itself undergo unsignaled change-points. The paper is organized as follows. We first review the previous Bayesian models that form the basis of our approach, paying particular attention to the limitations implied by a pre-specified hazard rate (section 2). We then show how these models can be extended to achieve on-line inference of a constant hazard rate (section 3) and a piecewise constant hazard rate in a change-point hierarchy (section 4). We then address ways of making the computations tractable by node pruning (section 5), give some numerical examples to show the effectiveness of this approach (section 6), and conclude (section 7).

Here we briefly review previous models that provide exact and efficient means for optimal, on-line inference in change-point problems (Fearnhead and Liu, 2007; Adams and MacKay, 2007). These models are based on the observation that the ability to identify a change-point depends critically on knowledge of the generative process that was active before the change-point. Thus, these models keep track of “runs” of data generated under stable conditions. However, these models require the hazard rate to be specified in advance, which we show can strongly affect model output.

For example, consider the simple change-point process illustrated in figure 1A. The data points (filled circles) are generated by adding Gaussian random noise to a mean value (dashed line). The mean value varies in a piecewise constant manner over time, changing abruptly at change-points but otherwise staying constant. Thus, sample-by-sample fluctuations in the data can reflect both noise and change-points. To predict the position of the data point at time 14, only the last four points should be used to compute the current mean, because they minimize the effects of sample-by-sample noise without mixing in data from before the last change-point. These points define the run-length, $r_t$, the number of time steps since the most recent change-point. Run-lengths are plotted for the current example in figure 1B. Note that the run-length grows by one at each time step and resets to zero at each change-point.

Cartoon illustrating a simple change-point problem. **A.** Data points (circles) are generated by adding random noise to an underlying mean (dashed line) that undergoes two change-points, at times 5 and 10. **B.** The number of points since the last change-point, i.e., the run-length $r_t$.

In general, the positions of change-points are not specified in advance but instead must be inferred from the data. Thus, optimal predictions of the next data point should consider all possible run-lengths and weigh them by the probability of each run-length given the data. More formally, if we write $\mathbf{x}_t$ for the data at time $t$ and $\mathbf{x}_{1:t}$ for all data observed up to time $t$, then the predictive distribution is

$$p({\mathbf{x}}_{t+1}\mid {\mathbf{x}}_{1:t})=\sum _{{r}_{t}}p({\mathbf{x}}_{t+1}\mid {\mathbf{x}}_{t}^{({r}_{t})})p({r}_{t}\mid {\mathbf{x}}_{1:t})$$

(1)

where
${\mathbf{x}}_{t}^{({r}_{t})}(={\mathbf{x}}_{t-{r}_{t}+1:t})$ is the set of most recent data corresponding to run-length $r_t$. The run-length distribution is given by Bayes' rule:

$$p({r}_{t}\mid {\mathbf{x}}_{1:t})=\frac{p({r}_{t},{\mathbf{x}}_{1:t})}{p({\mathbf{x}}_{1:t})}$$

(2)

where

$$p({\mathbf{x}}_{1:t})=\sum _{{r}_{t}}p({r}_{t},{\mathbf{x}}_{1:t})$$

(3)

The recursion relation for $p(r_t, \mathbf{x}_{1:t})$ follows from marginalizing over the previous run-length, $r_{t-1}$:

$$\begin{array}{l}p({r}_{t},{\mathbf{x}}_{1:t})=\sum _{{r}_{t-1}}p({r}_{t},{r}_{t-1},{\mathbf{x}}_{1:t})\\ =\sum _{{r}_{t-1}}p({r}_{t},{\mathbf{x}}_{t}\mid {r}_{t-1},{\mathbf{x}}_{1:t-1})p({r}_{t-1},{\mathbf{x}}_{1:t-1})\\ =\sum _{{r}_{t-1}}p({r}_{t}\mid {r}_{t-1})p({\mathbf{x}}_{t}\mid {\mathbf{x}}_{t-1}^{({r}_{t-1})})p({r}_{t-1},{\mathbf{x}}_{1:t-1})\end{array}$$

(4)

where
$p({\mathbf{x}}_{t}\mid {\mathbf{x}}_{t-1}^{({r}_{t-1})})$ is the predictive distribution given the most recent data points, and $p(r_t \mid r_{t-1})$ is the change-point prior, given by

$$p({r}_{t}\mid {r}_{t-1})=\{\begin{array}{ll}1-h\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1\hfill \\ h\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}$$

(5)

These equations lead to the "message-passing update rule" for $p(r_t, \mathbf{x}_{1:t})$: at each time step, each run-length node passes two messages forward, one that increments the run-length, weighted by $(1-h)$ times the predictive probability of the new datum, and one that resets the run-length to zero, weighted by $h$ times that same predictive probability.
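
To make the update concrete, here is a minimal Python sketch of this forward pass (equations 4 and 5) for one convenient choice of data model, a Gaussian with known variance and a conjugate Gaussian prior on its mean; the function name and parameters are ours, not the paper's:

```python
import numpy as np

def runlength_filter(x, h, mu0=0.0, var0=10.0, var=1.0):
    """Forward message pass over run-length hypotheses (equations 4-5),
    for Gaussian data with known variance `var` and a Normal(mu0, var0)
    prior on the mean of each run."""
    T = len(x)
    R = np.zeros((T + 1, T + 1))   # R[r, t] = p(r_t = r | x_{1:t}) after normalisation
    R[0, 0] = 1.0
    mu, v = np.array([mu0]), np.array([var0])  # posterior over the mean, per run-length
    dists = []
    for t in range(T):
        # predictive probability of x[t] under each run-length hypothesis
        pv = v + var
        pred = np.exp(-0.5 * (x[t] - mu) ** 2 / pv) / np.sqrt(2 * np.pi * pv)
        R[1 : t + 2, t + 1] = R[: t + 1, t] * pred * (1 - h)  # grow: r_t = r_{t-1} + 1
        R[0, t + 1] = (R[: t + 1, t] * pred).sum() * h        # reset: r_t = 0
        R[:, t + 1] /= R[:, t + 1].sum()                      # -> p(r_t | x_{1:t})
        dists.append(R[: t + 2, t + 1].copy())
        # conjugate update of the per-run posterior over the mean
        new_v = 1.0 / (1.0 / v + 1.0 / var)
        new_mu = new_v * (mu / v + x[t] / var)
        mu = np.concatenate([[mu0], new_mu])
        v = np.concatenate([[var0], new_v])
    return dists
```

Run on data with an abrupt jump, the returned distributions show probability mass moving back to short run-lengths shortly after the change.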

This message-passing approach can be thought of as the forward sweep of a forward-backward algorithm applied to a hidden Markov model (HMM) in which the states correspond to different values of the run-length. The HMM approach to change-point models was developed in (Chib, 1998), where the author used Monte Carlo methods to perform inference in a model with a pre-specified number of change-points. See (Paquet, 2007) for more details of the HMM interpretation of the Adams-MacKay algorithm.

Despite excellent performance on many different data sets (Adams and MacKay, 2007; Fearnhead and Liu, 2007), a major limitation of this approach is that the hazard rate, *h*, must be specified in advance. Figure 2 highlights this limitation by plotting the output of the algorithm for the same data over time for different settings of *h*. Setting *h* to zero disallows the possibility of finding any change-points (figure 2A). In this case, the predictive mean at time *t* + 1 is just the mean of all the data points up to time *t*. The run-length distribution over time (bottom half of the panel) is zero everywhere except at $r_t = t$, which grows linearly as each new data point arrives. Conversely, larger values of *h* make the algorithm increasingly quick to infer change-points, so its predictions reset more readily and are based on shorter windows of recent data.

Effect of changing the hazard rate on the mean of the predictive distribution generated by the algorithm in (Adams and MacKay, 2007) for a single data set. Each panel shows a different setting of the hazard rate *h*, as indicated. The top half of each panel shows the data and the mean of the predictive distribution; the bottom half shows the run-length distribution over time.

Thus, the performance of models that require a pre-specified hazard rate can depend critically on which value is chosen, but which value is “best” is not always obvious in advance. Here we aim to remove this limitation by proposing a novel, hierarchical Bayesian approach for the on-line estimation of the hazard rate in change-point problems. This approach allows for optimal inference in more demanding problems in which the hazard rate is not given and can vary (in a piecewise constant manner) over time.

In this section we develop a Bayesian model for inferring a constant hazard rate from raw data sequences. We model change-points as i.i.d. samples from a Bernoulli distribution, the rate of which is given by the unknown hazard rate. We develop our model by first considering the straightforward case in which change-point locations are known, then address the more challenging case in which change-point locations are unknown.

Suppose that the locations of change-points in a given sequence are known. Because change-points are all-or-nothing events, we can think of them as being generated (in discrete time) as samples from a Bernoulli process with a rate equal to the hazard rate. The hazard rate can then be inferred as the rate of the Bernoulli process given the binary observations.

Let $y_t$ be a binary variable that denotes the presence ($y_t = 1$) or absence ($y_t = 0$) of a change-point at time $t$. The predictive probability of a change-point at the next time step is then

$$\begin{array}{l}p({y}_{t+1}\mid {y}_{1:t})={\int}_{0}^{1}p({y}_{t+1}\mid h)p(h\mid {y}_{1:t})dh\\ =\frac{{\int}_{0}^{1}p(h){\prod}_{i=1}^{t+1}p({y}_{i}\mid h)dh}{{\int}_{0}^{1}p(h){\prod}_{i=1}^{t}p({y}_{i}\mid h)dh}\end{array}$$

(6)

where *p*(*h*) is the prior over the hazard rate.

By definition, $p(y_i = 1 \mid h) = h$ and $p(y_i = 0 \mid h) = 1 - h$. Choosing the conjugate beta prior over $h$,

$$p(h\mid {a}_{0},{b}_{0})=\text{Beta}(h;{a}_{0},{b}_{0})=\frac{\mathrm{\Gamma}({a}_{0}+{b}_{0})}{\mathrm{\Gamma}({a}_{0})\mathrm{\Gamma}({b}_{0})}{h}^{{a}_{0}-1}{(1-h)}^{{b}_{0}-1}$$

(7)

gives

$$\begin{array}{l}p({y}_{t+1}\mid {y}_{1:t})=\frac{{\int}_{0}^{1}{h}^{{a}_{t+1}+{a}_{0}-1}{(1-h)}^{{b}_{t+1}+{b}_{0}-1}dh}{{\int}_{0}^{1}{h}^{{a}_{t}+{a}_{0}-1}{(1-h)}^{{b}_{t}+{b}_{0}-1}dh}\\ =\frac{\mathrm{\Gamma}({a}_{t+1}+{a}_{0})\mathrm{\Gamma}({b}_{t+1}+{b}_{0})\mathrm{\Gamma}({a}_{t}+{b}_{t}+{a}_{0}+{b}_{0})}{\mathrm{\Gamma}({a}_{t}+{a}_{0})\mathrm{\Gamma}({b}_{t}+{b}_{0})\mathrm{\Gamma}({a}_{t+1}+{b}_{t+1}+{a}_{0}+{b}_{0})}\end{array}$$

(8)

where $a_t$ is the number of change-points up to and including time $t$ and $b_t = t - a_t$ is the number of time steps without a change-point.

Also by definition, $a_{t+1}$ and $b_{t+1}$ follow from $a_t$ and $b_t$: if $y_{t+1} = 1$ then $a_{t+1} = a_t + 1$ and $b_{t+1} = b_t$, while if $y_{t+1} = 0$ then $a_{t+1} = a_t$ and $b_{t+1} = b_t + 1$. Substituting into equation 8 and simplifying with the identity $\Gamma(x+1) = x\Gamma(x)$ gives

$$\begin{array}{l}p({y}_{t+1}=1\mid {y}_{1:t})={\stackrel{\sim}{h}}_{t+1}=\frac{{a}_{t}+{a}_{0}}{{a}_{t}+{b}_{t}+{a}_{0}+{b}_{0}}\\ p({y}_{t+1}=0\mid {y}_{1:t})=1-{\stackrel{\sim}{h}}_{t+1}=\frac{{b}_{t}+{b}_{0}}{{a}_{t}+{b}_{t}+{a}_{0}+{b}_{0}}\end{array}$$

(9)

Therefore, because $b_t = t - a_t$, the expected hazard rate, ${\stackrel{\sim}{h}}_{t+1}$, can be maintained on-line simply by counting the number of change-points observed so far.
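
As a sketch, the counting scheme of equation 9 amounts to a few lines of Python (the function name is ours; `a0` and `b0` are the beta-prior parameters):

```python
def hazard_estimate(y, a0=1.0, b0=1.0):
    """On-line expected hazard rate from a sequence of change-point flags
    y_t in {0, 1}, per equation 9: (a_t + a0) / (a_t + b_t + a0 + b0)."""
    a = b = 0
    estimates = []
    for yt in y:
        a += yt          # count change-points
        b += 1 - yt      # count non-change-points
        estimates.append((a + a0) / (a + b + a0 + b0))
    return estimates
```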

When the locations of the change-points are unknown, we do not know the change-point count, $a_t$, with certainty, and we must maintain a joint probability distribution over both $r_t$ and $a_t$. In analogy with equation 1, the predictive distribution is

$$\begin{array}{l}p({\mathbf{x}}_{t+1}\mid {\mathbf{x}}_{1:t})=\sum _{{r}_{t}}\sum _{{a}_{t}}p({\mathbf{x}}_{t+1}\mid {r}_{t},{a}_{t},{\mathbf{x}}_{1:t})p({r}_{t},{a}_{t}\mid {\mathbf{x}}_{1:t})\\ =\sum _{{r}_{t}}\sum _{{a}_{t}}p({\mathbf{x}}_{t+1}\mid {\mathbf{x}}_{t}^{({r}_{t})})p({r}_{t},{a}_{t}\mid {\mathbf{x}}_{1:t})\end{array}$$

(10)

$p(r_t, a_t \mid \mathbf{x}_{1:t})$ is computed in analogy with equation 2:

$$p({r}_{t},{a}_{t}\mid {\mathbf{x}}_{1:t})=\frac{p({r}_{t},{a}_{t},{\mathbf{x}}_{1:t})}{{\sum}_{{r}_{t}}{\sum}_{{a}_{t}}p({r}_{t},{a}_{t},{\mathbf{x}}_{1:t})}$$

(11)

and $p(r_t, a_t, \mathbf{x}_{1:t})$ is updated recursively, in analogy with equation 4:

$$\begin{array}{l}p({r}_{t},{a}_{t},{\mathbf{x}}_{1:t})=\sum _{{r}_{t-1}}\sum _{{a}_{t-1}}p({r}_{t},{r}_{t-1},{a}_{t},{a}_{t-1},{\mathbf{x}}_{1:t})\\ =\sum _{{r}_{t-1}}\sum _{{a}_{t-1}}p({r}_{t},{a}_{t},{\mathbf{x}}_{t}\mid {r}_{t-1},{a}_{t-1},{\mathbf{x}}_{1:t-1})p({r}_{t-1},{a}_{t-1},{\mathbf{x}}_{1:t-1})\\ =\sum _{{r}_{t-1}}\sum _{{a}_{t-1}}p({r}_{t},{a}_{t}\mid {r}_{t-1},{a}_{t-1})p({\mathbf{x}}_{t}\mid {\mathbf{x}}_{t-1}^{({r}_{t-1})})p({r}_{t-1},{a}_{t-1},{\mathbf{x}}_{1:t-1})\end{array}$$

(12)

Furthermore, the explicit form of the change-point prior, $p(r_t, a_t \mid r_{t-1}, a_{t-1})$, is obtained by marginalizing over the unknown hazard rate:

$$\begin{array}{l}p({r}_{t},{a}_{t}\mid {r}_{t-1},{a}_{t-1})={\int}_{0}^{1}p({r}_{t},{a}_{t}\mid h,{r}_{t-1},{a}_{t-1})p(h\mid {r}_{t-1},{a}_{t-1})dh\\ =\frac{\mathrm{\Gamma}({a}_{t-1}+{b}_{t-1}+2)}{\mathrm{\Gamma}({a}_{t-1}+1)\mathrm{\Gamma}({b}_{t-1}+1)}{\int}_{0}^{1}p({r}_{t},{a}_{t}\mid h,{r}_{t-1},{a}_{t-1}){h}^{{a}_{t-1}}{(1-h)}^{{b}_{t-1}}dh\end{array}$$

(13)

with

$$p({r}_{t},{a}_{t}\mid h,{r}_{t-1},{a}_{t-1})=\{\begin{array}{ll}1-h\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{a}_{t}={a}_{t-1}\hfill \\ h\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{a}_{t}={a}_{t-1}+1\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}$$

(14)

where the last line of equation 13 and the form of equation 14 come directly from our assumption that change-points are generated as i.i.d. samples from a Bernoulli distribution.

Thus, performing the requisite integrals gives:

$$p({r}_{t},{a}_{t}\mid {r}_{t-1},{a}_{t-1})=\{\begin{array}{ll}{\scriptstyle \frac{{b}_{t-1}+1}{{a}_{t-1}+{b}_{t-1}+2}}=1-{\stackrel{\sim}{h}}_{t}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{a}_{t}={a}_{t-1}\hfill \\ {\scriptstyle \frac{{a}_{t-1}+1}{{a}_{t-1}+{b}_{t-1}+2}}={\stackrel{\sim}{h}}_{t}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{a}_{t}={a}_{t-1}+1\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}$$

(15)

Note the similarity between this expression and equation 5; the only difference is that the inferred hazard rate ${\stackrel{\sim}{h}}_{t}$ appears in place of the pre-specified hazard rate $h$.
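
The resulting update can be sketched as follows. This hypothetical `transition_step` applies only the change-point prior of equation 15 (the data-likelihood term of equation 12 is omitted for clarity), assuming a uniform Beta(1, 1) prior on the hazard rate so that $\tilde{h}_t = (a_{t-1}+1)/(a_{t-1}+b_{t-1}+2)$:

```python
def transition_step(nodes, t):
    """Apply the change-point prior of equation 15 to a set of (r, a) nodes.

    `nodes` maps (run-length r, change-point count a) -> probability mass at
    time t-1; b = (t-1) - a is the implied non-change-point count.  Returns
    the prior-predicted mass over (r, a) at time t (likelihood terms omitted)."""
    new = {}
    for (r, a), w in nodes.items():
        b = (t - 1) - a
        h_tilde = (a + 1) / (a + b + 2)      # inferred hazard rate
        # no change-point: run-length grows, change-point count unchanged
        key = (r + 1, a)
        new[key] = new.get(key, 0.0) + w * (1 - h_tilde)
        # change-point: run-length resets, change-point count increments
        key = (0, a + 1)
        new[key] = new.get(key, 0.0) + w * h_tilde
    return new
```

Starting from a single node $(r, a) = (0, 0)$ and iterating, the mass spreads over a triangular grid of run-length and change-point-count hypotheses, exactly the node set the full algorithm must track.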

In this section we relax the constraint that the hazard rate is constant over time. Instead, we assume that the hazard rate is piecewise constant and is itself generated by a change-point process with its own, smaller hazard rate. More generally, we consider the possibility that this "second-order" hazard rate is generated from yet another change-point process with an even smaller, third-order hazard rate, and so on, to create what we term a "change-point hierarchy" of arbitrary depth. We first describe this generative model and then show how it is used for on-line inference.

A schematic illustrating the general change-point hierarchy generative model is shown in figure 4. The inputs to the generative model are the top-level hazard rate, $h^{(0)}$; the parameters for the priors on intermediate-level hazard-rate distributions,
${\mathbf{a}}_{0}=\left\{{a}_{0}^{(1)},{a}_{0}^{(2)},\dots ,{a}_{0}^{(N)}\right\}$ and
${\mathbf{b}}_{0}=\left\{{b}_{0}^{(1)},{b}_{0}^{(2)},\dots ,{b}_{0}^{(N)}\right\}$; and the parameters, $\chi_p$, that determine the prior distribution over the parameters, $\eta_t$, of the data-generating distribution.

Illustration of the generative model corresponding to the change-point hierarchy. See text for details.

The generative process starts at level 0 and uses the top-level hazard rate, $h^{(0)}$, to sample the change-point locations for level 1. Specifically, for $t = 1$ to $T_{max}$, ${y}_{t}^{(1)}$ is sampled from a Bernoulli distribution with rate $h^{(0)}$. If ${y}_{t}^{(1)}=1$, there is a change-point and a new hazard rate ${h}_{t}^{(1)}$ is sampled from a beta distribution with parameters ${a}_{0}^{(1)}$ and ${b}_{0}^{(1)}$; otherwise, ${h}_{t}^{(1)}={h}_{t-1}^{(1)}$.

Next, to get the change-point locations for level 2, ${y}_{t}^{(2)}$ is sampled from a Bernoulli process with rate ${h}_{t}^{(1)}$. As for level 1, if ${y}_{t}^{(2)}=1$ then there is a change-point and ${h}_{t}^{(2)}$ is sampled from a beta distribution with parameters ${a}_{0}^{(2)}$ and ${b}_{0}^{(2)}$. Otherwise, if ${y}_{t}^{(2)}=0$ there is no change-point and ${h}_{t}^{(2)}={h}_{t-1}^{(2)}$.

This pattern is then continued down the hierarchy until level $N$ is reached. At that point, instead of sampling a hazard rate at each change-point, we sample the parameters of the generative distribution, $\eta_t$, from the prior parameterized by $\chi_p$; the data, $\mathbf{x}_t$, are then sampled from the generative distribution with parameters $\eta_t$.

Note that this model of a change-point hierarchy assumes that all of the hazard rates are independent of the time since the last change-point. This simplifying assumption is key to making the inference problem tractable but comes at the expense of excluding generative models with hazard rates that depend on the current run-length.
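
A minimal sampler for a three-level version of this generative model might look as follows (a sketch under our own naming; `sample_eta` and `sample_x` stand in for the data-level prior and generative distribution):

```python
import random

def sample_hierarchy(T, h0, a0, b0, sample_eta, sample_x):
    """Sample T data points from a three-level change-point hierarchy:
    a fixed top-level hazard h0 triggers resampling of the hazard rate
    h1 ~ Beta(a0, b0), and h1 in turn triggers resampling of the
    data-generating parameters eta."""
    h1 = random.betavariate(a0, b0)
    eta = sample_eta()
    xs, h1s = [], []
    for _ in range(T):
        if random.random() < h0:     # level-1 change-point: new hazard rate
            h1 = random.betavariate(a0, b0)
        if random.random() < h1:     # level-2 change-point: new parameters
            eta = sample_eta()
        xs.append(sample_x(eta))
        h1s.append(h1)
    return xs, h1s
```

For example, Gaussian data with piecewise-constant mean can be generated with `sample_eta=lambda: random.gauss(0, 5)` and `sample_x=lambda eta: random.gauss(eta, 1)`.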

We now show how to infer hazard rates using the simplest example of a change-point hierarchy, with just three levels. We leave the general, *N*-level case to the supplementary material.

The approach here is similar to that of section 3, with two primary differences. The first is that we must now keep track of two kinds of run-length: the data-level run-length that we have already encountered, corresponding to the number of data points used to compute the predictive distribution of the data, ${r}_{t}^{(2)}$; and the high-level run-length, corresponding to the number of data points used in the computation of the hazard rate, ${r}_{t}^{(1)}$. The higher-order run-length can be written as

$${r}_{t}^{(1)}=({a}_{t}-{a}_{0})+({b}_{t}-{b}_{0})$$

(16)

where $a_0$ and $b_0$ are the pre-defined prior parameters of the beta distribution on ${h}_{t}^{(1)}$, and $a_t$ and $b_t$ are the current values of these parameters, incremented as change-points and non-change-points, respectively, are observed.

The second difference with section 3 is that, because we allow for change-points in level 1,
${r}_{t}^{(1)}$ can transition to zero. Thus, to compute the predictive distribution, we must maintain a probability distribution over
${r}_{t}^{(2)}$, ${r}_{t}^{(1)}$, and $a_t$. Equivalently, because of equation 16, this distribution can be expressed over ${r}_{t}^{(2)}$, $a_t$, and $b_t$.

Thus, we can write the predictive distribution, $p(\mathbf{x}_{t+1}\mid \mathbf{x}_{1:t})$, as

$$p({\mathbf{x}}_{t+1}\mid {\mathbf{x}}_{1:t})=\sum _{{r}_{t}}\sum _{{a}_{t}}\sum _{{b}_{t}}p({\mathbf{x}}_{t+1}\mid {\mathbf{x}}_{t}^{({r}_{t})})p({r}_{t},{a}_{t},{b}_{t}\mid {\mathbf{x}}_{1:t})$$

(17)

where $p(r_t, a_t, b_t \mid \mathbf{x}_{1:t})$ (in which we drop the superscript on the data-level run-length for brevity) is computed in analogy with equation 11:

$$p({r}_{t},{a}_{t},{b}_{t}\mid {\mathbf{x}}_{1:t})=\frac{p({r}_{t},{a}_{t},{b}_{t},{\mathbf{x}}_{1:t})}{{\sum}_{{r}_{t}}{\sum}_{{a}_{t}}{\sum}_{{b}_{t}}p({r}_{t},{a}_{t},{b}_{t},{\mathbf{x}}_{1:t})}$$

(18)

and $p(r_t, a_t, b_t, \mathbf{x}_{1:t})$ is updated recursively as

$$\begin{array}{l}p({r}_{t},{a}_{t},{b}_{t},{\mathbf{x}}_{1:t})=\sum _{{r}_{t-1}}\sum _{{a}_{t-1}}\sum _{{b}_{t-1}}p({r}_{t},{r}_{t-1},{a}_{t},{a}_{t-1},{b}_{t},{b}_{t-1},{\mathbf{x}}_{1:t})\\ =\sum _{{r}_{t-1}}\sum _{{a}_{t-1}}\sum _{{b}_{t-1}}p({r}_{t},{a}_{t},{b}_{t},{\mathbf{x}}_{t}\mid {r}_{t-1},{a}_{t-1},{b}_{t-1},{\mathbf{x}}_{1:t-1})p({r}_{t-1},{a}_{t-1},{b}_{t-1},{\mathbf{x}}_{1:t-1})\\ =\sum _{{r}_{t-1}}\sum _{{a}_{t-1}}\sum _{{b}_{t-1}}p({r}_{t},{a}_{t},{b}_{t}\mid {r}_{t-1},{a}_{t-1},{b}_{t-1})p({\mathbf{x}}_{t}\mid {\mathbf{x}}_{t-1}^{({r}_{t-1})})p({r}_{t-1},{a}_{t-1},{b}_{t-1},{\mathbf{x}}_{1:t-1})\end{array}$$

(19)

The change-point prior, $p(r_t, a_t, b_t \mid r_{t-1}, a_{t-1}, b_{t-1})$, is computed by marginalizing over the level-1 hazard rate:

$$p({r}_{t},{a}_{t},{b}_{t}\mid {r}_{t-1},{a}_{t-1},{b}_{t-1})={\int}_{0}^{1}p({r}_{t},{a}_{t},{b}_{t}\mid {h}_{t}^{(1)},{r}_{t-1},{a}_{t-1},{b}_{t-1})p({h}_{t}^{(1)}\mid {r}_{t-1},{a}_{t-1},{b}_{t-1}){dh}_{t}^{(1)}$$

(20)

Now, by definition,

$$p({r}_{t},{a}_{t},{b}_{t}\mid {h}_{t}^{(1)},{r}_{t-1},{a}_{t-1},{b}_{t-1})=\{\begin{array}{ll}(1-{h}_{t}^{(1)})(1-{h}^{(0)})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1,{a}_{t}={a}_{t-1}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{t-1}+1\hfill \\ {h}_{t}^{(1)}(1-{h}^{(0)})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0,{a}_{t}={a}_{t-1}+1\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{t-1}\hfill \\ (1-{h}_{t}^{(1)}){h}^{(0)}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1,{a}_{t}={a}_{0}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{0}\hfill \\ {h}_{t}^{(1)}{h}^{(0)}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0,{a}_{t}={a}_{0}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{0}\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}$$

(21)

and

$$p({h}_{t}^{(1)}\mid {r}_{t-1},{a}_{t-1},{b}_{t-1})=\frac{\mathrm{\Gamma}({a}_{t-1}+{b}_{t-1}+2)}{\mathrm{\Gamma}({a}_{t-1}+1)\mathrm{\Gamma}({b}_{t-1}+1)}{\left({h}_{t}^{(1)}\right)}^{{a}_{t-1}}{\left(1-{h}_{t}^{(1)}\right)}^{{b}_{t-1}}$$

(22)

Therefore, if we define ${\stackrel{\sim}{h}}_{t}^{(1)}$ as

$${\stackrel{\sim}{h}}_{t}^{(1)}=\frac{{a}_{t-1}+1}{{a}_{t-1}+{b}_{t-1}+2}$$

(23)

then the change-point prior becomes

$$p({r}_{t},{a}_{t},{b}_{t}\mid {r}_{t-1},{a}_{t-1},{b}_{t-1})=\{\begin{array}{ll}(1-{\stackrel{\sim}{h}}_{t}^{(1)})(1-{h}^{(0)})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1,{a}_{t}={a}_{t-1}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{t-1}+1\hfill \\ {\stackrel{\sim}{h}}_{t}^{(1)}(1-{h}^{(0)})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0,{a}_{t}={a}_{t-1}+1\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{t-1}\hfill \\ (1-{\stackrel{\sim}{h}}_{t}^{(1)}){h}^{(0)}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}={r}_{t-1}+1,{a}_{t}={a}_{0}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{0}\hfill \\ {\stackrel{\sim}{h}}_{t}^{(1)}{h}^{(0)}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}{r}_{t}=0,{a}_{t}={a}_{0}\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{b}_{t}={b}_{0}\hfill \\ 0\hfill & \text{otherwise}.\hfill \end{array}$$

(24)

This expression leads to the simple message-passing algorithm for prediction outlined in box 2 in the supplementary material. The algorithm is quite similar to that for inferring a constant hazard rate (box 1 supplementary material), with the main difference being the number of messages that each node sends to its children. In particular, the form of the change-point prior in equation 24 implies that each node now has four children, corresponding to the two different run-lengths independently increasing or going to zero.
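
A sketch of this four-child message pass, applying only the change-point prior of equation 24 (likelihood terms omitted, and the prior pseudo-counts fixed at $a_0 = b_0 = 1$ for illustration; the function name is ours):

```python
def transition_step3(nodes, h0, a0=1, b0=1):
    """Apply the change-point prior of equation 24: each (r, a, b) node
    sends its mass to four children (likelihood terms omitted)."""
    new = {}
    def add(key, w):
        new[key] = new.get(key, 0.0) + w
    for (r, a, b), w in nodes.items():
        h1 = (a + 1) / (a + b + 2)                        # inferred level-1 hazard
        add((r + 1, a, b + 1), w * (1 - h1) * (1 - h0))   # no change at either level
        add((0, a + 1, b), w * h1 * (1 - h0))             # data-level change only
        add((r + 1, a0, b0), w * (1 - h1) * h0)           # hazard-level change only
        add((0, a0, b0), w * h1 * h0)                     # change at both levels
    return new
```

Note that the two hazard-level-change children reset the counts $(a, b)$ to the prior values, mirroring the third and fourth branches of equation 24.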

We now consider practical aspects of implementation. In particular, a naïve implementation of the theory leads to an algorithm whose computational complexity increases rapidly over time, with the number of nodes per time step growing as $t^{2N-3}$, where $N$ is the number of levels in the hierarchy. Such an increase very quickly becomes a burden, and in this section we present a method for systematically reducing the complexity of the computations in the hierarchical model through node pruning. The pruning algorithm is based on stratified resampling and reduces the number of nodes while maintaining the representational capacity of the remaining node set. Our method dramatically improves the efficiency of the algorithm and is complementary to, and thus can be combined with, other node-pruning algorithms (Adams and MacKay, 2007; Fearnhead and Liu, 2007).

Previous approaches to the node-pruning problem (Adams and MacKay, 2007; Fearnhead and Liu, 2007) have used the weight associated with each node (e.g., $p(r_t \mid \mathbf{x}_{1:t})$) to decide which nodes to keep, for example by discarding nodes whose weights fall below a threshold or by retaining only a fixed number of the highest-weight nodes.

Although such approaches are intuitive, easy to implement, and fast (essentially requiring a constant number of operations per time step), they all encounter the same problem: for on-line inference in change-point problems, nodes with relatively small weights at the present time can become important in the future. Empirically, we have found this problem to be particularly cumbersome in the 3-level hierarchies, because *h*^{(0)} is small and thus nodes representing a high-level change-point always have a high chance of being removed. To overcome this problem we propose a novel pruning algorithm that does not use the weights and instead reduces the number of nodes by merging ‘similar’ nodes (predictive distributions) together. The key insight is that predictive distributions in the model tend to be similar if they correspond to similar run-lengths. We first discuss the case of binary data sampled from a Bernoulli distribution before extending this approach to the general case.

Here we consider the question: how close must the run-lengths of any two nodes be before we consider the nodes similar? We consider a bin of length $\delta(r_1)$ that includes all run-lengths between $r_1$ and $r_2$ ($= r_1 + \delta(r_1)$) and consider a worst-case scenario for how much the means of the predictive distributions could change across this bin. Specifically, if $\alpha(r_i)$ counts the change-points observed in a run of length $r_i$ (so that, with a uniform prior, the posterior over the Bernoulli rate is $\text{Beta}(\alpha(r_i)+1, r_i-\alpha(r_i)+1)$), then the mean of the predictive distribution at $r_2$ is

$$\overline{\rho}({r}_{2})=\frac{\alpha ({r}_{2})+1}{{r}_{2}+2}$$

(25)

With a bin size of $\delta$, the largest possible value for the mean at $r_1$ ($= r_2 - \delta$), written $\overline{\rho}{({r}_{1})}_{\mathit{max}}$, given $\alpha(r_2)$ is

$$\overline{\rho}{({r}_{1})}_{\mathit{max}}=\frac{\alpha ({r}_{2})+1}{{r}_{2}-\delta +2}$$

(26)

and the smallest possible value is

$$\overline{\rho}{({r}_{1})}_{\mathit{min}}=\frac{\alpha ({r}_{2})-\delta +1}{{r}_{2}-\delta +2}$$

(27)

Thus the change in $\overline{\rho}$ across a bin of size $\delta$ is bounded by the larger of ${\mathrm{\Delta}}_{\mathit{min}}$ and ${\mathrm{\Delta}}_{\mathit{max}}$, where

$$\begin{array}{l}{\mathrm{\Delta}}_{\mathit{min}}=\overline{\rho}({r}_{2})-\overline{\rho}{({r}_{1})}_{\mathit{min}}\\ =\frac{\delta ({r}_{2}-\alpha ({r}_{2})+1)}{({r}_{2}+2)({r}_{2}-\delta +2)}\end{array}$$

(28)

and

$$\begin{array}{l}{\mathrm{\Delta}}_{\mathit{max}}=\overline{\rho}{({r}_{1})}_{\mathit{max}}-\overline{\rho}({r}_{2})\\ =\frac{\delta (\alpha ({r}_{2})+1)}{({r}_{2}+2)({r}_{2}-\delta +2)}\end{array}$$

(29)

By definition, $\alpha(r_2) \le r_2$, and therefore both ${\mathrm{\Delta}}_{\mathit{max}}$ and ${\mathrm{\Delta}}_{\mathit{min}}$ are bounded above by

$$\mathrm{\Delta}=\frac{\delta}{{r}_{2}-\delta +2}$$

(30)

Now, if we require that Δ is constant as a function of *r*_{2}, which is equivalent to requiring that the pruning procedure not group together any nodes that have a mean differing by more than Δ regardless of the run-length, then we can find the following expression for *δ* as a function of *r*_{2}:

$$\delta =\frac{({r}_{2}+2)\mathrm{\Delta}}{1+\mathrm{\Delta}}=({r}_{1}+2)\mathrm{\Delta}$$

(31)

which implies that

$$log({r}_{2}+2)-log({r}_{1}+2)=log(1+\mathrm{\Delta})$$

(32)

i.e., that $\log(r + 2)$ is binned uniformly with bin size equal to $\log(1 + \mathrm{\Delta})$.

Equation 32 implies that we can prune *M* nodes down to
$min\phantom{\rule{0.16667em}{0ex}}\left[{\scriptstyle \frac{logM}{log(1+\mathrm{\Delta})}},M\right]$ nodes with a loss of precision in the estimate of *ρ* that is bounded by Δ. For large *M* this will result in a substantial reduction in computational complexity. The left hand column of figure 5 (panels A, D, G, and J) shows the effects of the pruning algorithm on model output and computational complexity in the case of Bernoulli data.

Similar results can be found for more general distributions with the proviso that the constraints are probabilistic. In particular, we consider the change in mean across a bin of size *δ* between run-lengths *r*_{1} and *r*_{2}. The mean at *r*_{2} is given by

$$\mu ({r}_{2})=\frac{1}{{r}_{2}+{v}_{p}}\left(\sum _{i=t-{r}_{2}+1}^{t}{x}_{i}+{\chi}_{0}\right)$$

(33)

where $\chi_0$ and $v_p$ are parameters of the prior distribution. Note that in this section we consider only scalar data; a similar analysis for the vector case yields similar constraints on the change in each component of the mean.

Because the data, $x_i$, in the general case might not be bounded, we can no longer give hard constraints for the maximum and minimum values of the mean at run-length $r_1$. Instead, we suppose that, with probability $f$, the magnitude of each data point is bounded by some constant $S$, i.e., $|x_i| \le S$. The extreme values of the mean at $r_1$ are then

$$\mu {({r}_{1})}_{\pm}=\frac{1}{{r}_{2}+{v}_{p}-\delta}\left(\sum _{i=t-{r}_{2}+1}^{t}{x}_{i}\pm S\delta +{\chi}_{0}\right).$$

(34)

If we define the two possible changes in mean across the bin, Δ_{+} and Δ_{−} as

$${\mathrm{\Delta}}_{+}=\mu {({r}_{1})}_{+}-\mu ({r}_{2})\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\mathrm{\Delta}}_{-}=\mu ({r}_{2})-\mu {({r}_{1})}_{-}$$

(35)

then we can write

$${\mathrm{\Delta}}_{\pm}=\frac{(S\pm \mu ({r}_{2}))\delta}{{r}_{2}+{v}_{p}-\delta}$$

(36)

By definition, with probability *f*

$$-S\le \mu ({r}_{2})\le S$$

(37)

Therefore, with probability *f*, we have

$${\mathrm{\Delta}}_{\pm}\le \mathrm{\Delta}=\frac{2S\delta}{{r}_{2}+{v}_{p}-\delta}$$

(38)

Dividing through by 2*S* gives

$$\frac{\delta}{{r}_{2}+{v}_{p}-\delta}=\frac{\mathrm{\Delta}}{2S}=k$$

(39)

which is reminiscent of equation 30. Indeed if we require that this fraction, *k*, is constant, then we find that

$$\delta =\frac{({r}_{2}+{v}_{p})k}{1+k}=k({r}_{1}+{v}_{p})$$

(40)

which gives that

$$log({r}_{2}+{v}_{p})-log({r}_{1}+{v}_{p})=log(1+k)$$

(41)

Thus $\log(r + v_p)$ is binned uniformly into bins of size $\log(1 + k)$.

In the three-level case, the similarity between nodes is based not only on the low-level run-length, $r^{(2)}$, but also on the high-level run-length, $r^{(1)}$, and the node's estimate of the hazard rate, $\stackrel{\sim}{h}$. (Note that we have reintroduced the superscript notation, $r^{(2)}$ for the low-level run-length and $r^{(1)}$ for the high-level run-length, to avoid ambiguity.) The variables $r^{(1)}$ and $\stackrel{\sim}{h}$ relate to the change-point and non-change-point counts, $a$ and $b$, introduced in section 4.2 via

$${r}^{(1)}=a+b$$

(42)

$$\stackrel{\sim}{h}=\frac{a+{a}_{p}}{{r}^{(1)}+{a}_{p}+{b}_{p}}$$

(43)

As before, we grid the low-level run-length, $r^{(2)}$, logarithmically; i.e., $\log(r^{(2)} + v_p)$ is uniformly binned into bins of size $\log(1 + k)$. The high-level run-length, $r^{(1)}$, and the hazard-rate estimate, $\stackrel{\sim}{h}$, are binned in an analogous manner.

In this section we demonstrate the applicability of the algorithm by testing it on three different data sets. The main motivation in developing our model was as an ideal observer for rule-switching and other change-point tasks used in numerous behavioral, neurophysiological, and neuroimaging studies (Behrens et al., 2007; Steyvers and Brown, 2006; Brown and Steyvers, 2009; Yu and Dayan, 2005; Lau and Glimcher, 2005; Berg, 1948; Corrado et al., 2005; Averbeck and Lee, 2007; Sugrue et al., 2004). Therefore, the first two examples demonstrate how our algorithm aids the understanding of specific, published experiments. As an additional example, we show how the model can be used to find high-level change-point structure in real-world stock-market data.

In (Behrens et al., 2007), human subjects were engaged in a task that required tracking the rate of a Bernoulli process given binary input (i.e., reward or no reward), as in figure 7A. As shown in figure 7B, this Bernoulli rate, $\rho_t$, undergoes a series of change-points. Moreover, the rate of these change-points varies over time, starting from an initial 'volatile' phase, in which change-points occur roughly every 25 time steps, and then moving into a stable phase, in which there is only one change-point in about 600 time steps.

Figure 7. Bernoulli data example, similar to that in (Behrens et al., 2007). **A.** Binary data representing the presence (vertical black line) or absence (vertical white line) of reward at a given time. **B.** The probability of reward delivery, *ρ*_{t} (gray dashed …)

In (Behrens et al., 2007) the authors developed an approximate model for inference in this system in which they assumed random-walk dynamics for the reward rate. They then analyzed fMRI data for correlates of the parameters in their model, thus mapping the abstract computation onto a network of brain structures. However, because their model cannot capture the abrupt changes that are present in their behavioral experiment, it is not, in fact, the ideal observer. In contrast, our model was developed explicitly to identify these abrupt changes. Consequently, a three-level hierarchical model with a Bernoulli data generative distribution is the ideal-observer model for this experiment and may be better suited for functional imaging analysis. An additional advantage of our model is that, by changing the data generative distribution to (for example) Gaussian, it can also deal with continuous instead of binary reward values.

As shown in figure 7, our model does an excellent job of tracking both the reward rate (figure 7B) and, with a slightly longer lag, the hazard rate *h*_{t} (figure 7C) over time. This lag is a consequence of the fact that the hazard rates in this model are fairly low and that it takes on the order of 1/(Δ

In figure 8 we demonstrate the ability of the algorithm to deal with a change-point task that extends the one in (Steyvers and Brown, 2006). In this task, subjects are asked to predict the next location of a stimulus whose position is sampled from a Gaussian distribution. The mean and variance of this distribution change randomly at change-points whose locations are drawn from a Bernoulli process with a time-varying hazard rate.
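A minimal sketch of this generative process follows; `mu_scale` and `sigma_range`, the priors from which the mean and standard deviation are redrawn, are our illustrative choices rather than the task's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_gaussian_task(T, h, mu_scale=10.0, sigma_range=(0.5, 2.0)):
    """Stimulus positions sampled from a Gaussian whose mean and
    standard deviation are redrawn at change-points occurring
    with hazard h[t]."""
    mu = rng.normal(0, mu_scale)
    sigma = rng.uniform(*sigma_range)
    y = np.empty(T)
    for t in range(T):
        if rng.uniform() < h[t]:          # change-point: redraw both
            mu = rng.normal(0, mu_scale)
            sigma = rng.uniform(*sigma_range)
        y[t] = rng.normal(mu, sigma)
    return y
```

Only the generative distribution differs from the Bernoulli example; the change-point hierarchy over *h*_{t} is identical.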

Figure 8. Gaussian example. **A.** The data over time (thin gray line) are sampled from a Gaussian distribution whose mean and variance undergo change-points. The black line indicates the model’s estimate of the predictive mean. **B.** The model’s estimate …

In (Steyvers and Brown, 2006), the authors also developed an ideal-observer model of this task; however, their model assumed a fixed hazard rate, was run offline, and was based on Monte Carlo sampling of the posterior distributions. In contrast, our approach computes optimal online behavior in more challenging environments in which the hazard rate changes over time.

As in the Bernoulli example, a three-level hierarchical model with node pruning does a good job of tracking both the data (figure 8A) and, with a slightly longer lag, the underlying hazard rate (figure 8B) over time. These effects result from the model’s ability to identify both low- and high-level change-points (figures 8C–E).

Figure 8A shows the input (gray line) and the predictive mean (black line) computed by the algorithm, which follows the data quite faithfully. The same can be said of the model’s estimate of the hazard rate (black line, panel B), which closely follows the true, underlying hazard rate (dashed gray line).

To demonstrate the general applicability of the model, we used it to analyze the daily returns from General Motors (GM) stock from January 10th, 1972, to December 23rd, 2009. Following (Adams and MacKay, 2007), we compute the log daily returns as

$${R}_{t}=\log\left(\frac{{p}_{t}^{\text{close}}}{{p}_{t-1}^{\text{close}}}\right)$$

(44)

where ${p}_{t}^{\text{close}}$ is the closing price on day *t*. We model *R*_{t} as being sampled from a zero-mean Laplace distribution with a variable scale factor, *λ*_{t}; i.e., between two change-points, *R*_{t} is sampled from

$${R}_{t}\sim \frac{1}{Z}exp\left(-\frac{\mid {R}_{t}\mid}{{\lambda}_{t}}\right)$$

(45)
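Equations 44 and 45 can be written directly in code; the normalization *Z* = 2*λ*_{t} used below is the standard normalizer for a zero-mean Laplace density:

```python
import numpy as np

def log_returns(close):
    """Log daily returns R_t = log(p_t / p_{t-1}) (equation 44),
    given a sequence of closing prices."""
    close = np.asarray(close, dtype=float)
    return np.log(close[1:] / close[:-1])

def laplace_loglik(R, lam):
    """Log-likelihood of a return under the zero-mean Laplace model
    of equation 45, with normalization Z = 2 * lam."""
    return -np.abs(R) / lam - np.log(2 * lam)

# a tiny hypothetical price series for illustration
R = log_returns([100.0, 105.0, 103.0])
```

Using a heavy-tailed Laplace rather than a Gaussian makes the between-change-point model more robust to the large single-day moves typical of stock returns.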

The results of running the model on these data are shown in figure 9. The model tracks the log daily returns faithfully (figure 9A). This tracking is based on an estimate of the hazard rate that changes over time, in particular showing an increase in the frequency of low-level change-points towards the end of the time series. This effect reflects shorter low-level run-lengths starting around 2001, implying a high-level change-point at that time. It is interesting to note, although by no means a strong causal claim, that this time is not long after the appointment of Richard Wagoner as CEO in June 2000. Accordingly, the model’s estimate of the hazard rate shows a marked increase towards the end of the simulation, when the high-level change-point around 2000 becomes apparent (panel B). Note that this higher-order structure in the data cannot be revealed without the hierarchical change-point model.

Figure 9. Change-point analysis of daily GM stock returns, 1972–2009. **A.** Daily log return (*R*_{t}, gray line; equation 44) and the estimated scale factor (*λ*_{t}, black line; equation 45) plotted as a function of time. **B.** The model’s estimate of the hazard …

We introduced a novel Bayesian model for on-line inference of change-points from noisy data streams. Unlike previous approaches, our model does not require the rate of occurrence of change-points, known as the hazard rate, to be specified in advance. Rather, our model generates on-line estimates of the hazard rate itself, which can then be used to help identify change-points. The novelty of our approach rests primarily on a “change-point hierarchy” in which the hazard rate is governed by an arbitrary number of higher-order change-point processes. Thus, the model can infer hazard rates that change with arbitrary complexity. This approach also includes a novel pruning algorithm that dramatically reduces the computational complexity of this and related models. The model is an ideal observer for several different psychophysics paradigms and has applications to real-world data sets such as stock prices.

Robert C. Wilson, Department of Psychology, Green Hall, Princeton University, Princeton, NJ 08540, USA.

Matthew R. Nassar, Department of Neuroscience, 116 Johnson Pavilion, University of Pennsylvania, Philadelphia, PA 19104, USA.

Joshua I. Gold, Department of Neuroscience, 116 Johnson Pavilion, University of Pennsylvania, Philadelphia, PA 19104, USA.

- Adams RP, MacKay DJ. Technical report. University of Cambridge; Cambridge, UK: 2007. Bayesian online changepoint detection.
- Aroian LA, Levene H. The effectiveness of quality control charts. Journal of the American Statistical Association. 1950;45(252):520–529.
- Averbeck BB, Lee D. Prefrontal neural correlates of memory for sequences. Journal of Neuroscience. 2007;27(9):2204–2211. [PubMed]
- Barlow J, Creutzfeldt O, Michael D, Houchin J, Epelbaum H. Automatic adaptive segmentation of clinical EEGs. Electroencephalography and Clinical Neurophysiology. 1981;51:512–525. [PubMed]
- Barry D, Hartigan JA. A bayesian analysis for change point problems. Journal of the American Statistical Association. 1993;88(421):309–319.
- Behrens TEJ, Woolrich MW, Walton ME, Rushworth MFS. Learning the value of information in an uncertain world. Nature Neuroscience. 2007;10(9):1214–1221. [PubMed]
- Berg EA. A simple objective technique for measuring flexibility in thinking. J Gen Psychol. 1948;39:15–22. [PubMed]
- Bodenstein G, Praetorius HM. Feature extraction from the electroencephalogram by adaptive segmentation. Proceedings of the IEEE. 1977;65(5):642–652.
- Brown SD, Steyvers M. Detecting and predicting changes. Cognitive Psychology. 2009;58:49–67. [PubMed]
- Chen J, Gupta AK. Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association. 1997;92(438):739–747.
- Chib S. Estimation and comparison of multiple change-point models. Journal of Econometrics. 1998;86:221–241.
- Corrado GS, Sugrue LP, Seung HS, Newsome WT. Linear-nonlinear-poisson models of primate choice dynamics. Journal of Experimental Analysis of Behavior. 2005;84(3):581–617. [PMC free article] [PubMed]
- Denison DGT, Holmes CC. Bayesian partitioning for estimating disease risk. Biometrics. 2001;57:143–149. [PubMed]
- Fearnhead P. Exact and efficient bayesian inference for multiple changepoint problems. Stat Comput. 2006;16:203–213.
- Fearnhead P, Liu Z. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2007;69(4):589–605.
- Green PJ. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika. 1995;82(4):711–32.
- Hsu DA. Tests for variance shift at an unknown time point. Applied Statistics. 1977;26(3):279–284.
- Koop GM, Potter SM. Technical report. Federal Reserve Bank; New York: 2004. Forecasting and estimating multiple change-point models with an unknown number of change points.
- Lau B, Glimcher PW. Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the Experimental Analysis of Behavior. 2005;84(3):555–579. [PMC free article] [PubMed]
- Liu JS, Lawrence CE. Bayesian inference on biopolymer models. Bioinformatics. 1999;15(1):38–52. [PubMed]
- Paquet U. Empirical bayesian change point detection. 2007. http://www.ulrichpaquet.com/notes/changepoints.pdf.
- Smith AFM. A bayesian approach to inference about a change-point in a sequence of random variables. Biometrika. 1975;62(2):407–416.
- Stephens DA. Bayesian retrospective multiple-changepoint identification. Applied Statistics. 1994;43(1):159–178.
- Steyvers M, Brown S. Prediction and change detection. In: Weiss Y, Scholkopf B, Platt J, editors. Advances in Neural Information Processing Systems. Vol. 18. MIT Press; 2006. pp. 1281–1288.
- Sugrue LP, Corrado GS, Newsome WT. Matching behavior and the representation of value in the parietal cortex. Science. 2004;304:1782–1787. [PubMed]
- Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. 2008;1(1–2):1–305.
- Xuan X, Murphy K. Modeling changing dependency structure in multivariate time series. 24th International Conference on Machine Learning.2007.
- Yu AJ, Dayan P. Uncertainty, neuromodulation, and attention. Neuron. 2005;46(4):681–692. [PubMed]