
Bayesian Anal. Author manuscript; available in PMC 2010 September 21.

Published in final edited form as:

Bayesian Anal. 2010; 5(3): 1–22.

PMCID: PMC2943396

NIHMSID: NIHMS232288


One of the challenges in using Markov chain Monte Carlo for model analysis in studies with very large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full dataset is used to guide selection sampling of a further set of observations *targeted* at a scientifically interesting, low probability region. We define a Sequential Monte Carlo strategy in which the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. An example from flow cytometry illustrates the ability of the approach to increase the resolution of inferences for rare cell subtypes.

Advances in technology in biological research, as in other fields, are challenging our ability to routinely analyse increasingly large data sets. In the motivating application area of flow cytometry, routine assays generate multiple measurements on cell surface markers on each of tens of thousands to millions of individual cells (Chan et al. 2008). Mixture models are applied for cell subtype classification and discrimination, and specific interests often relate to characteristics of rather rare subtypes. For example, polyfunctional lymphocyte subsets that are of interest in predicting vaccine efficacy (Seder et al. 2008) may have frequencies of 0.01% or less of the total blood cell population. Markov chain Monte Carlo is a powerful tool for drawing inferences in mixture models, but its use requires computations on the full dataset at each iteration. This is a drawback in cases of very large datasets, so several approaches have been developed that focus on randomly drawn subsamples of data. For example, Ridgeway and Madigan (2003) proposed a two-step algorithm of drawing subsamples via Sequential Monte Carlo, and this was improved upon by Balakrishnan and Madigan (2006) by introducing a rejuvenation step based on a kernel smoothing approximation similar to Liu and West (2000). Here, the observations of interest are rare, so that random subsamples typically contain very few observations of the rare subtype, and new approaches are required.

Generally, we are interested in drawing inferences about low probability regions of sample space in mixture model analyses of large datasets. We use traditional Bayesian mixtures admitting uncertainty about the number of components (e.g. Müller et al. 1996; MacEachern 1998; Ishwaran and James 2002; Suchard et al. 2010). We focus on inferences about a low probability mixture component, or a group of several low probability components that together represent a scientifically relevant subpopulation. Our central idea is to use an initial random subsample of data in order to construct a weight function directed around the region of interest, and use this to subsequently draw a *targeted subsample* of data preferentially selected from that region of interest. This builds on traditional ideas of selection and weighted sampling (e.g. Heckman 1979; Bayarri and Berger 1998) and their application in discovery sampling (West 1994, 1996). Here the use of non-parametric Bayesian mixture models allows us to link regions in sample space with specific components of the model and naturally identify subsets of observations which are relevant to the scientific question at hand through a component-driven weight function. We implement a two-step Markov chain Monte Carlo approach that first uses the random subsample to obtain an initial posterior, then adds the targeted subsample to draw component-specific inferences. We extend the method to a Sequential Monte Carlo algorithm whereby the targeted subsample is augmented sequentially, guided by a stopping rule, to successively refine inferences on the rare subpopulation, to the extent feasible.

In contexts such as our motivating flow cytometry applications, Gaussian mixtures are used as flexible overall models and scientifically relevant subpopulations are identified by (typically, small) *subsets* of Gaussian components that can reflect non-Gaussianity within subpopulations (Chan et al. 2008). Hence, with no loss of generality here, we consider a Gaussian mixture for *p*–variate data, with *N* samples *x*_{i}, (*i* = 1, …, *N*), defining a data set *X.* The density of the mixture is

$$f({x}_{i}\mid \mu ,\Sigma )=\sum _{k=1}^{K}{\pi}_{k}N({x}_{i}\mid {\mu}_{k},{\Sigma}_{k}),$$

and we adopt a standard truncated Dirichlet process model to define the prior over the mixing probabilities based on some (large) upper bound *K* (Ishwaran and James 2002). Let

$$\theta =\{\alpha ,{\pi}_{1:K},{\varphi}_{1:K}\},\phantom{\rule{1em}{0ex}}{\varphi}_{k}=\{{\mu}_{k},{\Sigma}_{k}\}.$$

(1)

The mixture model can be realized through the configuration indicators *z*_{i} for each observation *x*_{i} with prior *p*(*z*_{i} = *k* | π) = π_{k}, so that we obtain the standard hierarchical model

$$({x}_{i}\mid {z}_{i}=k,{\varphi}_{k})\sim N({x}_{i}\mid {\varphi}_{k}),\phantom{\rule{1em}{0ex}}({\varphi}_{k}\mid G)\sim G,\phantom{\rule{1em}{0ex}}(G\mid \alpha ,{G}_{0})\sim DP(\alpha ,{G}_{0}),$$

(2)

where *G*(·) is an uncertain distribution function, *G*_{0}(·) is the prior mean of *G*(·) and α > 0 the total mass, or precision of the DP. From the Pólya urn scheme,

$${\varphi}_{k}\mid {\varphi}_{1},\dots ,{\varphi}_{k-1}\sim \frac{\alpha}{k-1+\alpha}{G}_{0}(\cdot )+\frac{1}{k-1+\alpha}\sum _{i=1}^{k-1}{\delta}_{{\varphi}_{i}}(\cdot ).$$

(3)

The truncated Dirichlet process prior is such that π_{1} = *V*_{1} and ${\pi}_{k}={V}_{k}\times {\Pi}_{i=1}^{k-1}(1-{V}_{i})$, *k* > 1, where *V*_{i} ~ *Be*(1, α) independently for *i* < *K*, and *V*_{K} = 1. Prior specification for each component *k* is completed with a traditional normal-inverse Wishart form,

$${G}_{0}({\mu}_{k},{\Sigma}_{k})=N({\mu}_{k}\mid {\mu}_{0},{t}_{0}{\Sigma}_{k})\,IW({\Sigma}_{k}\mid {s}_{0},{S}_{0})$$

(4)

and with a Gamma prior for the Dirichlet concentration parameter, α ~ *Ga*(η_{1}, η_{2}). Placing a prior on α (see Ishwaran and James 2002) allows us to draw inferences about the number of mixture components through the role of α in the Pólya urn scheme (3). Finally, we label components in order of decreasing weight π_{k} as an identifiability constraint to address the label-switching problem.
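For concreteness, the prior specification above can be simulated directly. The sketch below (Python with NumPy/SciPy; the function name and the toy hyperparameter values are our own illustration, not from the paper) draws the stick-breaking weights and the component means and covariances from the truncated DP prior with normal-inverse Wishart base measure (4).

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

def draw_truncated_dp_prior(K, alpha, mu0, t0, s0, S0):
    """Draw (pi, mu, Sigma) from the truncated DP prior with NIW base measure G0."""
    # Stick-breaking: V_k ~ Be(1, alpha) for k < K, and V_K = 1.
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    # Base measure G0: Sigma_k ~ IW(s0, S0), then mu_k | Sigma_k ~ N(mu0, t0 * Sigma_k).
    Sigma = np.stack([invwishart.rvs(df=s0, scale=S0, random_state=rng) for _ in range(K)])
    mu = np.stack([rng.multivariate_normal(mu0, t0 * Sig) for Sig in Sigma])
    return pi, mu, Sigma

# Toy hyperparameters, illustrative only.
pi, mu, Sigma = draw_truncated_dp_prior(K=10, alpha=1.0, mu0=np.zeros(2),
                                        t0=10.0, s0=5.0, S0=np.eye(2))
```

Because *V*_{K} = 1, the truncated weights sum to one exactly, whatever the value of α.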

In general, a scientific question may define focus on a specific region of sample space. Here we take one key example, that of identifiable regions of low probability but of high scientific importance. For specificity, we assume this is defined by a low probability component of the mixture. In more general versions, this could be defined by a small set of mixture components, or other devices. Hence, we focus on targeting inferences about the parameters ϕ_{k*} = (μ_{k*}, Σ_{k*}) of a low-probability component *k**.

The objective is to identify and analyze subsamples of the data which contain information about the specific subset of the parameters of interest. The key idea is to obtain a rough estimate of the low-probability component *k** based on a random subsample, which is subsequently used to draw weighted subsamples of the data that are more likely to be relevant to our analysis, providing us with higher resolution about the structure of the distribution in the region of interest.

The direct approach is to follow a two-step procedure of Markov chain Monte Carlo samplers. We use an initial, randomly drawn subsample from the data in order to obtain an estimate of the parameters, and use this estimate to draw a more informative subsample. The two subsamples are then combined in a joint Markov chain Monte Carlo sampler to provide a posterior distribution for ϕ_{k*} that more closely approximates the total posterior based on the whole sample. Although interest specifically lies in estimating the parameters of component *k** given by μ_{k*}, Σ_{k*}, inference on the full set of μ, Σ is required in order to carry out the analysis.

We denote the two subsamples (random and targeted) *X*^{R} and *X*^{T}, of sizes *n*^{R} and *n*^{T} respectively, where *n*^{T} < *n*^{R} throughout this paper. The first is drawn randomly from the data, whereas the second is drawn according to weights *w*_{i}, 1 ≤ *i* ≤ *N*. We aim to choose the weights so that the targeted subsample is expected to be enriched in observations from component *k**; one specific choice is to take

$${w}_{i}=N({x}_{i}\mid m,{S}_{\tau}),$$

the density of the normal distribution in which *m*, *S* are estimates of μ_{k*}, Σ_{k*} from the initial analysis based on an initial random subsample, and *S*_{τ} = *TST* where $T=\mathrm{diag}({\tau}_{1}^{1/2},\dots ,{\tau}_{p}^{1/2})$ is a *p* × *p* diagonal matrix based on a set of positive *variance multipliers* τ_{j}, (*j* = 1, …, *p*). These allow us to concentrate (or expand) targeted resampling differentially in different dimensions with due regard for correlation structure, if desired. We use the notations *w*_{i} = *w*(*x*_{i}) and *w*_{i} = *w*(*x*_{i} | *m*, τ, *S*), the latter when the explicit dependence of the weight function on *m*, τ, *S* is to be highlighted.
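The selection step itself is simple to implement. The following sketch (our own illustration; the toy dataset, seed and function name are assumptions) evaluates the weights *w*_{i} = *N*(*x*_{i} | *m*, *S*_{τ}) and draws a targeted subsample without replacement with probability proportional to the weights.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

def targeted_subsample(X, m, S, tau, n_T):
    """Draw a targeted subsample of size n_T with probability proportional to
    w_i = N(x_i | m, S_tau), where S_tau = T S T and T = diag(tau^(1/2))."""
    T = np.diag(np.sqrt(tau))
    S_tau = T @ S @ T
    w = multivariate_normal(mean=m, cov=S_tau).pdf(X)   # weights w_i
    idx = rng.choice(len(X), size=n_T, replace=False, p=w / w.sum())
    return X[idx], idx

# Toy data: a large background component plus a rare component near (5, 5).
X = np.vstack([rng.normal(0.0, 1.0, size=(5000, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
X_T, idx = targeted_subsample(X, m=np.array([5.0, 5.0]), S=np.eye(2),
                              tau=np.array([1.0, 1.0]), n_T=40)
```

With the weight function centred near the rare component, almost all selected indices fall in the rare group (indices 5000 and above in this toy set), illustrating the intended enrichment.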

The likelihood of the data (*X*^{R}, *X*^{T}) then takes the following form. For observations in the random subsample:

$$\begin{array}{cc}\hfill f({x}_{i}^{R}\mid \pi ,\mu ,\Sigma )=& \sum _{k=1}^{K}{\pi}_{k}N({x}_{i}^{R}\mid {\mu}_{k},{\Sigma}_{k}),\phantom{\rule{1em}{0ex}}i=1,\dots ,{n}^{R}.\hfill \\ \hfill f({x}_{i}^{R}\mid {z}_{i}^{R}=k,\mu ,\Sigma )=& N({x}_{i}^{R}\mid {\mu}_{k},{\Sigma}_{k}),\phantom{\rule{1em}{0ex}}i=1,\dots ,{n}^{R},\hfill \end{array}$$

(5)

The first expression provides the likelihood of the standard mixture model, whereas the second represents the likelihood *conditional* on the configuration indicator ${z}_{i}^{R}$, the component to which observation ${x}_{i}^{R}$ belongs.

Similarly, for observations in the targeted subsample:

$$f({x}_{i}^{T}\mid \pi ,\mu ,\Sigma ,m,\tau ,S)=\sum _{k=1}^{K}{\stackrel{~}{\pi}}_{k}(\theta ,m,\tau ,S)N({x}_{i}^{T}\mid {\stackrel{~}{\mu}}_{k},{\stackrel{~}{\Sigma}}_{k}),\phantom{\rule{1em}{0ex}}i=1,\dots ,{n}^{T},$$

(6)

with

$$\begin{array}{cc}\hfill {\stackrel{~}{\pi}}_{k}(\theta ,m,\tau ,S)=& \frac{{\pi}_{k}N({\mu}_{k}\mid m,({S}_{\tau}+{\Sigma}_{k}))}{\sum _{k=1}^{K}{\pi}_{k}N({\mu}_{k}\mid m,({S}_{\tau}+{\Sigma}_{k}))},\hfill \\ \hfill {\stackrel{~}{\Sigma}}_{k}=& {({\Sigma}_{k}^{-1}+{S}_{\tau}^{-1})}^{-1},\hfill \\ \hfill {\stackrel{~}{\mu}}_{k}=& {\stackrel{~}{\Sigma}}_{k}({\Sigma}_{k}^{-1}{\mu}_{k}+{S}_{\tau}^{-1}m),\hfill \end{array}$$

(7)

and

$$\begin{array}{cc}\hfill f({x}_{i}^{T}\mid {z}_{i}^{T}=k,\mu ,\Sigma ,m,\tau ,S)\propto & w({x}_{i}^{T}\mid m,\tau ,S)N({x}_{i}^{T}\mid {\mu}_{k},{\Sigma}_{k})\hfill \\ \hfill \propto & N({x}_{i}^{T}\mid {\stackrel{~}{\mu}}_{k},{\stackrel{~}{\Sigma}}_{k}),i=1,\dots ,{n}^{T}.\hfill \end{array}$$

(8)

Note here that, conditional on all the parameters, targeted observations are independent, under the infinite-population assumption for the weights *w*(*x*_{i}): that is, we assume a very large number of observations within the region of non-negligible support of the weight function.
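A small helper computing the tilted quantities of Equation (7) might look as follows (our naming; a direct transcription of the formulas, using explicit matrix inverses for clarity rather than numerical robustness).

```python
import numpy as np
from scipy.stats import multivariate_normal

def tilted_components(pi, mu, Sigma, m, S_tau):
    """Tilted weights and moments of the targeted-subsample likelihood, Eq. (7)."""
    K = len(pi)
    S_tau_inv = np.linalg.inv(S_tau)
    # pi_tilde_k proportional to pi_k * N(mu_k | m, S_tau + Sigma_k).
    pi_t = np.array([pi[k] * multivariate_normal(mean=m, cov=S_tau + Sigma[k]).pdf(mu[k])
                     for k in range(K)])
    pi_t /= pi_t.sum()
    # Sigma_tilde_k = (Sigma_k^{-1} + S_tau^{-1})^{-1}; mu_tilde_k as in Eq. (7).
    Sigma_t = np.stack([np.linalg.inv(np.linalg.inv(Sigma[k]) + S_tau_inv)
                        for k in range(K)])
    mu_t = np.stack([Sigma_t[k] @ (np.linalg.inv(Sigma[k]) @ mu[k] + S_tau_inv @ m)
                     for k in range(K)])
    return pi_t, mu_t, Sigma_t

# A dominant background component and a rare component at (5, 5); the weight
# function is centred on the rare component.
pi_t, mu_t, Sigma_t = tilted_components(pi=np.array([0.99, 0.01]),
                                        mu=np.array([[0.0, 0.0], [5.0, 5.0]]),
                                        Sigma=np.stack([np.eye(2), 0.2 * np.eye(2)]),
                                        m=np.array([5.0, 5.0]),
                                        S_tau=0.1 * np.eye(2))
```

Even though the rare component has weight 0.01 in the overall mixture, its tilted weight is close to 1: targeted observations are expected to come almost exclusively from that component.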

The first Markov chain Monte Carlo sampler is a standard blocked Gibbs sampler (Ishwaran and James 2002) to simulate *p*(α, π, μ, Σ | *X*^{R}). In order to carry out the second Markov chain Monte Carlo sampling based on the random and targeted subsamples combined, the posterior distributions of the parameters α, π, μ, Σ have to be re-calculated to define efficient proposals. Although conjugacy for μ, Σ, π is lost, we keep the standard normal-inverse Wishart and Dirichlet priors for the mixture model.

**The posterior for z** for both subsamples is multinomial with probabilities

$$p({z}_{i}=k\mid {X}^{R},{X}^{T},\pi ,\mu ,\Sigma )\propto {\pi}_{k}N({x}_{i}\mid {\mu}_{k},{\Sigma}_{k}).$$

(9)
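This conditional is easy to vectorize across observations; a sketch (our code, not the authors' implementation) drawing all indicators in one pass:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)

def update_z(X, pi, Mu, Sigma):
    """Gibbs update of the configuration indicators, Eq. (9):
    p(z_i = k) proportional to pi_k * N(x_i | mu_k, Sigma_k)."""
    K = len(pi)
    lik = np.column_stack([pi[k] * multivariate_normal(mean=Mu[k], cov=Sigma[k]).pdf(X)
                           for k in range(K)])
    probs = lik / lik.sum(axis=1, keepdims=True)
    # One categorical draw per observation via the inverse-CDF trick.
    u = rng.random(len(X))
    return (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)

# Two well-separated components; each point is assigned to the nearby one.
z = update_z(X=np.array([[0.1, -0.2], [9.8, 10.1]]),
             pi=np.array([0.5, 0.5]),
             Mu=np.array([[0.0, 0.0], [10.0, 10.0]]),
             Sigma=np.stack([np.eye(2), np.eye(2)]))
```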

**The posterior for π** has density

$$\begin{array}{cc}\hfill p(\pi \mid {X}^{R},{X}^{T},z,\mu ,\Sigma ,m,\tau ,S)=& p(\pi \mid {X}^{R},{z}^{R},\mu ,\Sigma )\prod _{k=1}^{K}{\stackrel{~}{\pi}}_{k}^{{n}_{k}^{T}}\hfill \\ \hfill =& p(\pi \mid {X}^{R},{z}^{R},\mu ,\Sigma )\prod _{k=1}^{K}{\left(\frac{{\pi}_{k}N({\mu}_{k}\mid m,{S}_{\tau}+{\Sigma}_{k})}{\sum _{j=1}^{K}{\pi}_{j}N({\mu}_{j}\mid m,{S}_{\tau}+{\Sigma}_{j})}\right)}^{{n}_{k}^{T}}.\hfill \end{array}$$

The contribution of the targeted subsample to the posterior distribution for π provides little additional information about the distribution of π when the elements *S*_{τ} are small, and becomes more significant as the elements of *S*_{τ} increase, allowing observations in the targeted subsample to belong to components other than *k**.

**The posterior for α** only depends on the data through *V* and thus will have the usual posterior distribution (see Ishwaran and James 2002)

$$\alpha \phantom{\rule{thickmathspace}{0ex}}\sim \phantom{\rule{thickmathspace}{0ex}}Ga\left({\eta}_{1}+K-1,{\eta}_{2}-\sum _{i=1}^{K-1}\text{log}(1-{V}_{i})\right).$$

(10)
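This Gibbs step is a single Gamma draw; a minimal sketch under our naming conventions:

```python
import numpy as np

rng = np.random.default_rng(4)

def update_alpha(V, eta1, eta2):
    """Gibbs update for the DP precision alpha, Eq. (10):
    alpha ~ Ga(eta1 + K - 1, eta2 - sum_{i<K} log(1 - V_i))."""
    K = len(V)
    # V[-1] = 1 is excluded from the sum, avoiding log(0).
    rate = eta2 - np.sum(np.log1p(-V[:-1]))
    return rng.gamma(shape=eta1 + K - 1, scale=1.0 / rate)

alpha_new = update_alpha(V=np.array([0.3, 0.5, 1.0]), eta1=1.0, eta2=1.0)
```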

**The posterior for μ _{k}** can be calculated exactly as

$${\mu}_{k}\mid X,z,{\Sigma}_{k},m,\tau ,S\sim N({m}_{k}^{\mu},{S}_{k}^{\mu}),$$

(11)

where ${S}_{k}^{\mu}$, ${m}_{k}^{\mu}$ are given by

$$\begin{array}{cc}\hfill {S}_{k}^{\mu}& ={\Sigma}_{k}{\left(1/{t}_{0}+{n}_{k}^{R}+{n}_{k}^{T}{({S}_{\tau}^{-1}{\Sigma}_{k}+I)}^{-1}\right)}^{-1},\hfill \\ \hfill {m}_{k}^{\mu}& ={S}_{k}^{\mu}\left({n}_{k}{\Sigma}_{k}^{-1}{\overline{x}}_{k}-{n}_{k}^{T}{({S}_{\tau}^{-1}{\Sigma}_{k}+I)}^{-1}{S}_{\tau}^{-1}m+{\mu}_{0}\right),\hfill \end{array}$$

(12)

where *n*_{k} is the total number of data points in component *k*.

**The posterior for Σ _{k}** has density

$$\begin{array}{l}p({\Sigma}_{k}\mid X,z,{\mu}_{k},m,\tau ,S)\propto {\mid {\Sigma}_{k}\mid}^{-{s}_{0}}{\mid {\Sigma}_{k}\mid}^{-{n}_{k}^{R}/2}\times {\mid {S}_{\tau}^{-1}+{\Sigma}_{k}^{-1}\mid}^{{n}_{k}^{T}/2}\\ \times \text{exp}\left\{-\frac{\text{tr}({S}_{0}{\Sigma}_{k}^{-1})}{2}-\sum _{i=1}^{{n}_{k}}\frac{{x}_{i}^{T}{\Sigma}_{k}^{-1}{x}_{i}}{2}+{n}_{k}^{R}{\mu}_{k}^{T}{\Sigma}_{k}^{-1}{\overline{x}}^{R}-\frac{{n}_{k}^{R}}{2}{\mu}_{k}^{T}{\Sigma}_{k}^{-1}{\mu}_{k}\right.\\ \left.-\frac{{n}_{k}^{T}}{2}{\mu}_{k}^{T}{({S}_{\tau}^{-1}{\Sigma}_{k}+I)}^{-1}{\Sigma}_{k}^{-1}{\mu}_{k}-{n}_{k}^{T}{\mu}_{k}^{T}{({S}_{\tau}^{-1}{\Sigma}_{k}+I)}^{-1}{S}_{\tau}^{-1}m-\frac{{n}_{k}^{T}}{2}{m}^{T}{({\Sigma}_{k}^{-1}{S}_{\tau}+I)}^{-1}{S}_{\tau}^{-1}m\right\}.\end{array}$$

(13)

The non-standard posteriors for π and Σ_{k} (13) lead to the need for creative approximations to define efficient MCMC proposals, as now developed.

We construct an MCMC sampler with stationary distribution *p*(μ, Σ, *z*, *V*, α | *X*^{R}, *X*^{T}). The chain is initialized by drawing μ, Σ, *z*, *V* and α from the initial posterior based on the random subsample, and then sweeps through the following steps:

- Update *z* by generating from the posterior given in Equation (9).

- Update *V* through a Metropolis-Hastings step, proposing from the posterior based only on the initial random subsample,

  $$p({V}_{k}\mid {X}^{R})\sim Be(1+{n}_{k}^{R},\alpha +\sum _{l=k+1}^{K}{n}_{l}^{R}),$$

  (14)

  with *V*_{K} = 1. Set ${\pi}_{k}={V}_{k}{\Pi}_{j=1}^{k-1}(1-{V}_{j})$ and accept the proposed move with probability

  $$\text{min}\left(1,\prod _{i=1}^{K}{\left(\frac{{\stackrel{~}{\pi}}_{i}^{\prime}(\theta ,m,\tau ,S)}{{\stackrel{~}{\pi}}_{i}(\theta ,m,\tau ,S)}\right)}^{{n}_{i}^{T}}\right).$$

  Recall that $\stackrel{~}{\pi}(\theta ,m,\tau ,S)$ given in Equation (7) corresponds to the component probability weights in the targeted subsample. If the targeted subsample is indeed drawn such that most of its points belong to component *k**, the acceptance probability will be ≈ 1.

- Update α from its posterior given *V*, Equation (10).

- Update each μ_{k} through a Gibbs step using

  $${\mu}_{k}\mid X,z,{\Sigma}_{k},m,S\sim N({m}_{k}^{\mu},{S}_{k}^{\mu})$$

  (15)

  given in Equation (11).

- For each Σ_{k}, we construct a proposal $q({\Sigma}_{k}^{\prime}\mid {X}^{R},{X}^{T},z,\mu )$ for a Metropolis-Hastings step using the fact that

  $$f({x}_{i}^{T}\mid {X}^{R},{z}_{i}=k,\mu ,\Sigma ,m,\tau ,S)=N({x}_{i}^{T}\mid {\stackrel{~}{\mu}}_{k},{\stackrel{~}{\Sigma}}_{k}).$$

  We use the inverse transformation to obtain

  $${\stackrel{~}{x}}_{i}\mid {z}_{i}=k,\mu ,\Sigma ,m,\tau ,S\sim N({\mu}_{k},{\Sigma}_{k}),$$

  (16)

  where

  $$\begin{array}{cc}\hfill {\stackrel{~}{x}}_{i}^{R}=& {x}_{i}^{R},\hfill \\ \hfill {\stackrel{~}{x}}_{i}^{T}=& {\Sigma}_{k}({\stackrel{~}{\Sigma}}_{k}^{-1}{x}_{i}^{T}-{S}_{\tau}^{-1}m).\hfill \end{array}$$

  (17)

  In practice, of course, Σ_{k} is not known. A similar transformation of *X*^{T} can be obtained using an estimate of Σ_{k}, providing a proposal distribution

  $$q({\Sigma}_{k}^{\prime}\mid {X}^{R},{X}^{T},z,\mu )\sim IW({W}_{k}+{S}_{0},{n}_{k}+{s}_{0}+p-1),$$

  (18)

  where

  $${W}_{k}=\sum _{{z}_{i}=k}{\stackrel{~}{x}}_{i}{\stackrel{~}{x}}_{i}^{T}-\frac{1}{{n}_{k}}\sum _{{z}_{i}=k}{\stackrel{~}{x}}_{i}\sum _{{z}_{i}=k}{\stackrel{~}{x}}_{i}^{T}.$$

  (19)

  In order to increase the variance of the proposal kernel, a discount factor may also be used.
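The transformation (17) and the resulting inverse Wishart proposal (18)-(19) can be sketched as follows (our function and names; note that a current estimate of Σ_{k} stands in for the unknown Σ_{k} in the transformation):

```python
import numpy as np
from scipy.stats import invwishart

def propose_Sigma(XR_k, XT_k, Sigma_hat, S_tau, m, S0, s0):
    """Inverse Wishart proposal for Sigma_k, Eqs. (17)-(19): targeted points
    are mapped back to the N(mu_k, Sigma_k) scale using the estimate Sigma_hat."""
    # Sigma_k tilde = (Sigma_k^{-1} + S_tau^{-1})^{-1}.
    Sigma_t = np.linalg.inv(np.linalg.inv(Sigma_hat) + np.linalg.inv(S_tau))
    # Eq. (17): x_tilde^T = Sigma_k (Sigma_tilde^{-1} x^T - S_tau^{-1} m).
    XT_tilde = (Sigma_hat @ (np.linalg.inv(Sigma_t) @ XT_k.T
                             - np.linalg.inv(S_tau) @ m[:, None])).T
    X_tilde = np.vstack([XR_k, XT_tilde])
    n_k, p = X_tilde.shape
    # Eq. (19): centred scatter matrix of the transformed points.
    centred = X_tilde - X_tilde.mean(axis=0)
    W_k = centred.T @ centred
    # Eq. (18): draw the proposal.
    return invwishart.rvs(df=n_k + s0 + p - 1, scale=W_k + S0)

rng = np.random.default_rng(7)
Sigma_prop = propose_Sigma(XR_k=rng.normal(size=(30, 2)),
                           XT_k=rng.normal(size=(10, 2)) + 5.0,
                           Sigma_hat=np.eye(2), S_tau=np.eye(2),
                           m=np.full(2, 5.0), S0=np.eye(2), s0=3.0)
```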

The Markov chain Monte Carlo sampler sweeps through the updates described above, yielding estimates for the posterior distribution of the parameters of interest. However, due to the high number of parameters to be estimated and the difficulty in defining efficient proposals, the acceptance rate quickly drops to zero for targeted subsamples of moderate size.

The dimensionality of the problem, combined with the difficulty of constructing efficient proposals, results in Markov chain Monte Carlo samplers which require very long running times in order to reach stationarity with respect to the posterior distribution. At the same time, the approach described above does not exploit the results from the initial run based on the random sample, except for extracting the estimates of μ_{k*}, Σ_{k*}. We describe how the dimensionality of the problem can be greatly reduced using the posterior distribution estimates obtained from the initial Markov chain Monte Carlo simulation.

Notice that the objective is to draw inferences about a region in the sample space which has very low probability. Consequently, very few points in the initial random sample will belong to that region. On the other hand, the targeted sample will, generally, contain observations from the low-probability region. This implies that the posterior distribution of the parameters based on both the random and targeted samples (*X*^{R}, *X*^{T}),

$$p(\pi ,\mu ,\Sigma \mid {X}^{R},{X}^{T})\phantom{\rule{thickmathspace}{0ex}}=\phantom{\rule{thickmathspace}{0ex}}\sum _{{z}^{R}}p(\pi ,\mu ,\Sigma \mid {X}^{R},{X}^{T},{z}^{R})\times p({z}^{R}\mid {X}^{R},{X}^{T}),$$

can be approximated as

$$p(\pi ,\mu ,\Sigma \mid {X}^{R},{X}^{T})\phantom{\rule{thickmathspace}{0ex}}=\phantom{\rule{thickmathspace}{0ex}}\sum _{{z}^{R}}\underset{\left(a\right)}{\underbrace{p(\pi ,\mu ,\Sigma \mid {X}^{R},{X}^{T},{z}^{R})}}\times \underset{\left(b\right)}{\underbrace{p({z}^{R}\mid {X}^{R})}},$$

(20)

using *p*(*z*^{R} | *X*^{R}) in factor (*b*) as an approximation to *p*(*z*^{R} | *X*^{R}, *X*^{T}).

The second Markov chain Monte Carlo step is then adapted to a set of chains, one per retained draw of *z*^{R}. Each chain provides posterior estimates for the parameters *conditionally* on a fixed draw of *z*^{R}, so that, marginally, we obtain a set of samples from the (approximate) posterior distribution shown in Equation (20).

The focused approach drastically reduces the dimensionality of the algorithm, and as a result the computational complexity. However, Metropolis-Hastings updates still show low acceptance rates, since proposals built from the posterior conditional on *X*^{R} alone can sit far from the posterior conditional on (*X*^{R}, *X*^{T}). This motivates the sequential approach developed next.

For each of a set of particles *j* = 1 : *J*, draw a sample of (*z*, π, μ, Σ) | *X*^{R} from the posterior distribution estimates obtained in the Markov chain Monte Carlo sampler. Then repeatedly augment the targeted subsample and mutate the parameter draws through the following steps.

For *j* = 1 : *J* and for a fixed sequence of variance scaling vectors τ = *t*_{1}, …, *t*_{J}:

- Select a particle *u* uniformly at random and set ${m}_{j}={\left\{{\mu}_{{k}^{\ast}}^{j}\right\}}_{u}$ and ${S}_{j}={\left\{{\Sigma}_{{k}^{\ast}}^{j}\right\}}_{u}$, where ${\left\{{\varphi}_{{k}^{\ast}}^{j}\right\}}_{u}$ is the sample of the *u*th particle at step *j* for component *k**.

- Draw another batch of *B* targeted observations ${X}_{j}^{T}$ *without replacement* according to weights *w*_{i} ∝ *w*(*x*_{i} | *m*_{j}, *t*_{j}, *S*_{j}).

- Using a fixed number of Metropolis-Hastings steps following the iterates described in the Markov chain Monte Carlo approach above, update the configuration indicators *z*^{T}, the weights π, the concentration parameter α and the component-specific parameters μ, Σ by repeating the following updates:

  (a) Update *z*^{T}, π and α as in the corresponding steps of the Markov chain Monte Carlo sampler above.

  (b) Similar to the posterior distribution for μ given in (15), the posterior distribution of μ_{k} now becomes

  $${\mu}_{k}\mid {X}^{R},{X}_{1:j}^{T},z,{\Sigma}_{k}\sim N({m}_{k}^{\mu},{S}_{k}^{\mu}),$$

  (21)

  where

  $$\begin{array}{cc}\hfill {S}_{k}^{\mu}=& {\left({\Sigma}_{k}^{-1}/{t}_{0}+{n}_{k}^{R}{\Sigma}_{k}^{-1}+B\sum _{i=1}^{j}{({U}_{i}^{-1}{\Sigma}_{k}+I)}^{-1}{\Sigma}_{k}^{-1}\right)}^{-1}\hfill \\ \hfill {m}_{k}^{\mu}=& {S}_{k}^{\mu}\left({n}_{k}{\Sigma}_{k}^{-1}{\overline{x}}_{k}-B\sum _{i=1}^{j}{({U}_{i}^{-1}{\Sigma}_{k}+I)}^{-1}{U}_{i}^{-1}{m}_{i}+{\mu}_{0}\right),\hfill \end{array}$$

  (22)

  with *U*_{i} = *S*_{τ} evaluated at the values *S* = *S*_{i} and τ = *t*_{i}.

  (c) Update each Σ_{k} using a proposal distribution similar to (18), replacing *m*, *S* in Equation (17) by

  $$\begin{array}{cc}\hfill {\stackrel{~}{x}}_{i}^{R}=& {x}_{i}^{R},\hfill \\ \hfill {\stackrel{~}{x}}_{i}^{T}=& {\Sigma}_{k}\left({\stackrel{~}{\Sigma}}_{k}^{-1}{x}_{i}^{T}-{U}_{batch\left(i\right)}^{-1}{m}_{batch\left(i\right)}\right),\hfill \end{array}$$

  (23)

  where *batch*(*i*) denotes the batch of targeted observations in which ${x}_{i}^{T}$ was sampled.
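The outer loop of the sequential scheme can be sketched as below (our illustration; the Metropolis-Hastings mutation of the particles, described above, is left as a stub so the skeleton stays short):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def smc_targeted_sampling(X, particles, k_star, taus, B, n_batches):
    """Skeleton of the sequential targeted-resampling loop. `particles` is a
    list of dicts holding per-particle draws of mu and Sigma."""
    remaining = np.arange(len(X))
    batches = []
    for j in range(n_batches):
        u = rng.integers(len(particles))             # pick a particle at random
        m_j = particles[u]["mu"][k_star]
        S_j = particles[u]["Sigma"][k_star]
        T = np.diag(np.sqrt(taus[j]))
        w = multivariate_normal(mean=m_j, cov=T @ S_j @ T).pdf(X[remaining])
        idx = rng.choice(remaining, size=B, replace=False, p=w / w.sum())
        batches.append(idx)                          # batch j of the targeted sample
        remaining = np.setdiff1d(remaining, idx)     # sampling without replacement
        # ...mutate each particle's (z, pi, alpha, mu, Sigma) here by a few
        # Metropolis-Hastings sweeps conditional on the augmented subsample.
    return batches

X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
particles = [{"mu": np.array([[0.0, 0.0], [5.0, 5.0]]),
              "Sigma": np.stack([np.eye(2), 0.2 * np.eye(2)])}]
batches = smc_targeted_sampling(X, particles, k_star=1,
                                taus=[np.ones(2)] * 3, B=10, n_batches=3)
```

Drawing each batch from the points not yet selected keeps the accumulated targeted subsample free of duplicates, as the algorithm requires.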

The sequential Monte Carlo algorithm described does not include a particle resampling step; since each particle has a fixed *z*^{R}, re-sampling would result in poor coverage of the *z*^{R} space. The algorithm can be classified as a Sequential Importance Sampler (Gordon et al. 1993). At each draw *j*, the samples of μ and Σ are obtained through a Metropolis-Hastings kernel targeting the posterior $p(\mu ,\Sigma \mid {X}^{R},{X}_{1:j}^{T},{m}_{1:j},{S}_{1:j})$. In this Sequential Importance Sampling setting, the approximation converges to the true posterior as the number of particles grows, since each particle (given a reasonable number of Metropolis-Hastings updates) is indeed sampling from the posterior. A larger number of Metropolis-Hastings updates will be needed when the component structure changes through the targeted sample, in order to fit the emerging components. The tuning parameter vector τ allows monitoring of the dispersal of the targeted sample and also of the adequacy of the infinite population sampling assumption (see Appendix 6).

Because the parameters *m*, *S* of the weight function are fixed anew at each step of the re-sampling, weight functions located around different regions of sample space may be chosen. When the low-probability component follows a mixture distribution between different regions of sample space, this will be reflected in the estimates obtained from each particle, with each particle potentially corresponding to a different component. Through our adaptive algorithm, the sample space is explored flexibly and posterior estimates of the parameters are updated incrementally as the targeted subsample is augmented, allowing more efficient inferences.

This approach immediately poses the question of when to stop drawing observations for the targeted subsample. Ideally, we would like the targeted sample to contain all data points of component *k**. In order to address this, we introduce a decision rule such that the targeted sample stops being augmented when no more data points in the remaining original data show a high probability of belonging to component *k**. A natural approach to use is the Bayes Factor for that component, using for example a threshold of the order *e*^{2}; see West and Harrison (1997). In other words, we introduce an extra decision step, as follows:

- 5a. For each unsampled observation *x*_{i}, define the Bayes Factor

  $$B{F}_{{k}^{\ast}}({x}_{i})=\frac{{\pi}_{{k}^{\ast}}({x}_{i})/(1-{\pi}_{{k}^{\ast}}({x}_{i}))}{{\pi}_{{k}^{\ast}}/(1-{\pi}_{{k}^{\ast}})},$$

  where π_{k*}(*x*_{i}) ∝ π_{k*} *N*(*x*_{i} | μ_{k*}, Σ_{k*}). Stop sampling if there are no observations such that *BF*_{k*}(*x*_{i}) > *BF*_{threshold}, a specified threshold.

The calculation of the Bayes Factor is computationally demanding; as an alternative, the stopping rule may be expressed purely as a function of the weights. In other words:

- 5b. If there are fewer than *N*_{threshold} unsampled observations within a *c*_{threshold} contour of the weight function, stop.

The threshold values are important for monitoring the impact of the infinite population assumption on our inferences, representing the 'number of points carrying significant weight'. The values chosen will depend upon the number *B* of batched targeted observations drawn at each iteration of the sampler and the dimension of the data; for example, one may use *N*_{threshold} = 3*B* and *c*_{threshold} = exp(−*p*/4), where *p* is the number of markers.
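Rule 5b is cheap to evaluate. A sketch (our code; we read the *c*_{threshold} contour as density relative to the mode of the weight function, and use the suggested *N*_{threshold} = 3*B*):

```python
import numpy as np
from scipy.stats import multivariate_normal

def stop_sampling(X_unsampled, m, S_tau, B, p):
    """Stopping rule 5b with N_threshold = 3B and c_threshold = exp(-p/4);
    the contour is interpreted as density relative to the mode."""
    dist = multivariate_normal(mean=m, cov=S_tau)
    rel = dist.pdf(X_unsampled) / dist.pdf(m)   # density relative to the mode
    n_inside = int(np.sum(rel > np.exp(-p / 4)))
    return n_inside < 3 * B

# Unsampled points far from the weight-function mode: the rule says stop.
X_rest = np.random.default_rng(3).normal(0.0, 1.0, size=(1000, 2))
stop = stop_sampling(X_rest, m=np.array([8.0, 8.0]), S_tau=np.eye(2), B=10, p=2)
# stop -> True
```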

The Sequential Monte Carlo approach provides an efficient method of drawing inferences about parameters relevant to a low probability region of sample space, at the same time allowing the algorithm to automatically monitor the number of observations in the region of interest.

We implement our methods on a 2-dimensional synthetic dataset with 5,000 observations from a Gaussian mixture model with 5 components. This serves to illustrate the approach in a synthetic context where the relatively small sample size allows comparison with posterior computations based on the full data set. The model structure, shown in Figure 1, was chosen so that the component of interest can be visually separated from the rest while still overlapping significantly with the remaining components. The component *k** is chosen as the component whose estimated mean, based on initial samples from the posterior given a random subsample of size 700, lies within a pre-specified region and closest to its centre. If no such component exists, the weight function is initialized with *m* = *m*_{0}, the centre of the pre-specified region, and *S* = *S*_{0}/*s*_{0}, the prior mode of Σ. Using our sequential Monte Carlo approach, we draw *B* = 10 observations at a time, fixing τ_{j} = 1, *j* = 1, …, *p*. We define the stopping rule using *N*_{threshold} = 20 and *c*_{threshold} = exp(−0.5), resulting in 200 targeted observations. We carry out a standard Gibbs sampler on the full dataset, and compare the results between the two methods.

Figure 1. On the left, a scatter plot of the 2-dimensional data. The full data are shown in yellow, the random subsample in red, and the targeted subsample after the SMC algorithm in blue. The contours show the component structure of a single draw from the total posterior.

The simulation results show that the posterior distribution conditional on the combined subsample of 900 observations (the initial random sample of 700 plus 200 targeted observations) improves the posterior estimates to a degree comparable with using the complete dataset, while using less than 20% of the total observations; this by far exceeds the expected improvement from a random subsample of the same size, providing a much more efficient algorithm. For example, the boxplot in Figure 1, representing posterior inferences on the first dimension of the mean vector of component *k**, is centered closer to the true mean with a much tighter posterior variance. As expected, the posterior variance is still lower using the full data set: this is both because there are more data in component *k** (the stopping rule will always leave some data out in order to monitor the infinite population assumption), and because the full data set carries more information about the component weights π.

The motivating example for this study is a problem arising in flow cytometry, where cellular subtypes may be associated with one (or more) components of a Gaussian mixture model (see Chan et al. 2008). Flow cytometers detect fluorescent reporter markers that typically correspond to specific cell surface or intracellular proteins on individual cells, and can assay millions of such cells in a fluid stream in minutes. Datasets are typically very large, and as a result inference on the full data is computationally prohibitive. Interest lies in identifying and characterizing rare cell subtypes using a mixture model fitted on those markers. The ability to identify such rare cell subsets plays important roles in many medical contexts: for example, the detection of antigen-specific cells with MHC class I or class II markers, the identification of polyfunctional T lymphocytes that correlate with vaccine efficacy or host resistance to pathogens, or the resolution of variants of already low-frequency cell types, e.g. subtypes of conventional dendritic cells.

We use a dataset of 50,000 data points from human peripheral blood cells, with 6 marker measurements each: Forward Scatter (a measure of cell size), Side Scatter (a measure of cell granularity), CD4 (a marker for helper T cells), IFNg+IL-2 (effector cytokines), CD8 (a marker for cytotoxic T cells) and CD3 (a marker for all T cells). The objective is to provide higher resolution on the structure and patterns of covariation of cells of a specific cell subtype, specifically cells high in CD3 and/or CD8 secreting IL-2/IFN-g when challenged with a specific viral antigen. In other words, we are particularly interested in observations with large values in the 4th dimension together with the 5th and/or 6th. The data show a clear component structure for some of the markers (see Figure 2), whereas in others the rare cell subtypes of interest are not separated. To illustrate our methods and for ease of exposition, we adapt our algorithm by targeting inferences towards the component with CD4 and IFNg+IL-2 (3rd and 4th dimensions) centred closest to a specific point, and set each τ_{j} to be very large, namely τ_{j} = 1000, for all but the 4th dimension, focusing on the secretion of IFNg+IL-2. Here we set *K* = 200, as we expect the number of components needed to be far fewer than 200. Owing to the structure of the model, using an upper bound much larger than the number of components necessary does not affect the accuracy of the posterior estimates.

Figure 2. Scatter plots of the data for the last 4 markers: CD4, IFNg, CD8 and CD3. The complete data set is shown in yellow. We aim to use the random subsample (shown in red) in order to obtain samples from the initial posterior *p*(μ,Σ,π,α| …

An initial random subsample of size 5,000 is drawn, providing initial estimates *m, S* of the mean and covariance of the component closest to the high CD4+ region. Due to the strong covariation between the markers, several components are needed (see Figure 3) to capture the inhomogeneity of the data. Using initial weights *w*(*x*) = *N*(*x* | *m, S*_{τ}), we apply our sequential Monte Carlo algorithm to obtain a complete targeted subsample, as determined by the stopping rule, together with posterior samples for all parameters. We draw 30 observations at a time, so that *B* = 30, with *c*_{threshold} = 0.2 and *N*_{threshold} = 30, resulting in a total of 390 targeted observations.
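As a concrete illustration, the batched selection-sampling step above can be sketched as follows. This is Python rather than the Matlab of the released code; the stopping check on the remaining total weight is a simplified stand-in for the rule based on *c*_{threshold} and *N*_{threshold} defined in Section 5, and the function name `targeted_subsample` and all numerical settings are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def targeted_subsample(X, m, S_tau, B=30, c_threshold=0.2, n_rounds_max=100):
    """Sequentially draw batches of B observations without replacement,
    with selection weights w(x) = N(x | m, S_tau).  The stopping check on
    the remaining total weight is a simplified stand-in for the paper's
    rule based on c_threshold and N_threshold."""
    remaining = np.arange(len(X))   # indices not yet selected
    chosen = []
    for _ in range(n_rounds_max):
        if len(remaining) == 0:
            break
        w = multivariate_normal.pdf(X[remaining], mean=m, cov=S_tau)
        total = w.sum()
        if total < c_threshold:     # little weight left near the target
            break
        idx = rng.choice(len(remaining), size=min(B, len(remaining)),
                         replace=False, p=w / total)
        chosen.extend(remaining[idx])
        remaining = np.delete(remaining, idx)
    return np.array(chosen)
```

In the application above, *m* and *S* would come from the component of the initial fit closest to the high CD4+ region, and *S*_{τ} from inflating *S* by the variance multipliers τ_{j}.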

Flow cytometry data analysis using the Sequential Monte Carlo targeted re-sampling algorithm: sample realization of the mixture model (a) based on the random subsample and (b) based on both the random subsample and the targeted subsample. Crosses …

Figure 3 shows estimated posterior distributions for the total number of components based on the initial random subsample, and subsequently after the SMC sampler given both the random and targeted subsamples. The efficacy of the targeted approach is reflected in part through improved identification of the structure of the model density in the targeted region (Figure 4). More specifically, comparing samples from the mixture model (see Figure 3) in the CD4 and IFNg/IL-2 markers before and after the targeted subsample, we see that the targeted approach has led to finer resolution on the mixture structure in the region of the targeted rare cell subtypes, and hence on the structure and covariation of their markers. Importantly, this revealed components in the low probability subregion which emerge due to the covariation with the remaining markers. This is confirmed by the analysis of the full dataset, shown in Appendix 6, where several more components are inferred. The improvement in the estimates obtained using our selection sampling approach on a total subsample (random and targeted) of size 5,390 far exceeds the expected improvement from a random subsample of the same size. The findings agree with the biologists' expectation that cell subtypes often have a non-Gaussian structure (see Chan et al. 2008; Pyne et al. 2009), and provide an efficient method of detecting and drawing inferences about rare populations in the presence of very large datasets.

One of the key aspects of this work lies in defining the low probability region of interest and specifying the weight function used in the targeted sample. Naturally, the low probability region in sample space is strongly driven by the scientific question at hand. Based on that, and taking into account algorithmic tractability and efficiency, different weight functions may be used. In this work we presented methods relating to inferences about a specific component, defined in terms of an identifying criterion; in the flow cytometry example used in this paper, this was chosen as the component with mean closest to a specific point. Although the weight function used had a Gaussian shape, the analysis revealed a non-Gaussian structure in the region of interest; using mixtures of components as a weight function would be a straightforward extension of our methods. In fact, using a hierarchical model of mixtures of Gaussian mixtures may provide a better fit to the non-Gaussian, inhomogeneous structure of the flow cytometry data; our targeted subsampling approach can be implemented using such models at little additional computational cost.

In the case of the flow cytometry data, an alternative to the identifying criterion of the component of interest would be to use the component containing a specific cell of known rare cell subtype. A natural extension of the weight functions used in this work stems from the fact that, in the original flow cytometry data, the identifying criterion for the component of interest is not defined on a fixed number of dimensions. Instead, it is defined as the set of markers which are significant in identifying the component in the region of low probability in sample space, which is itself unknown. In other words, the Gaussian mixture may be defined only on an (unknown) subset of the *p* markers, such that we draw inferences about the parameters of the mixture $p({\theta}^{q}\mid X)$, with ${x}_{i}\in {\mathbb{R}}^{q}$, corresponding to variable dimensions 1 : *q, q* ≤ *p*. The targeted learning about θ^{q} can be incorporated in the analysis such that, within the sequential design, the weight function *w*(*x*) = *N*(*x* | *m, S*) is updated at each round of re-sampling not only in terms of the mean *m* and covariance *S* of the Gaussian distribution, but also in terms of the markers over which the weight function is defined. In the case of flow cytometry data, this can be viewed as soft gating of cells into cell subtypes, based both on the values of the individual markers and on the set of significant markers.
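A weight function restricted to a variable subset of markers could be sketched as below; the function name `subset_weight` and its inputs are hypothetical, not code from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def subset_weight(x, m, S, dims):
    """Weight w(x) = N(x[dims] | m[dims], S[dims, dims]), defined only over
    the markers in `dims` -- the current estimate of the set of markers
    that identify the component of interest.  Markers outside `dims` do
    not influence the weight (soft gating on a subset of dimensions)."""
    dims = np.asarray(dims)
    return multivariate_normal.pdf(x[dims], mean=m[dims],
                                   cov=S[np.ix_(dims, dims)])
```

Updating `dims` between rounds of re-sampling then corresponds to revising which markers define the soft gate, alongside the usual updates of *m* and *S*.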

One of the main challenges in drawing inferences from targeted subsamples is constructing efficient proposals for the parameters of interest, as the convergence of the algorithms is influenced by several factors. The size of the targeted subsample in relation to the random subsample plays a significant role. This becomes especially important when the assumption of a large number of observations within the region of interest is breached, as the likelihood used for the targeted subsample would then deviate severely from the true likelihood because of sampling without replacement. The multiplicative elements of τ also play a significant role in constructing a weight function that is wide enough not to violate the infinite-weights assumption while still targeting the region of interest.

Furthermore, the size of the initial subsample affects the posterior variance of the estimators. In fact, the random subsample may contain no observations in the low probability region. The weight function can be treated as having a prior mean and variance, to be used when no further information is available in the random subsample; as more observations are drawn sequentially from the specific region, the weight function is updated to include that information. Although the posterior estimates will be comparable, a very small initial subsample will strongly affect the efficiency of the proposal kernels, possibly requiring longer overall running times to obtain similar results.

Research was partially supported by grants to Duke University from the National Science Foundation (DMS-0342172) and the National Institutes of Health (grants P50-GM081883, P30-AI064518-0 and contract HHSN268200500019C). Aspects of the research were also partially supported by the NSF grant DMS-0635449 to the Statistical and Applied Mathematical Sciences Institute. CC and MW were also partially supported on this research by NIH grant RC1AI086032. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.

In the case of the flow cytometry dataset under study, MCMC inference on the full dataset is feasible. The results are shown below in Figure 5, comparing the 3 relevant GMM structures in the case of (a) a random subsample, (b) a random and a targeted subsample and (c) the full data. Here it is not possible to draw a direct comparison between the posterior distributions of the component *k**, because the component structure changes significantly between the random, targeted and full-data cases (with a much larger number of components in the full-data case, as expected under the Dirichlet process prior). The comparison shows that there are indeed more components in the region of interest when using the full data, as is also suggested by our targeted approach.

In both the Markov chain Monte Carlo and Sequential Monte Carlo approaches described above, the targeted sample was weighted according to *w*(*x*) = *N*(*x* | *m,S*_{τ}), where *m* and *S* are estimates of the mean and covariance of the low-probability component *k**, and *S*_{τ} = *TST* where $T=\mathrm{diag}({\tau}_{1}^{1/2},\dots ,{\tau}_{p}^{1/2})$ is a *p* × *p* diagonal matrix based on a set of positive *variance multipliers* τ_{j}, (*j* = 1, . . . , *p*). The τ_{j} are tuning parameters. Larger values allow wider dispersal of the targeted subsample, accounting for uncertainty in the initial estimate of ϕ_{k*}. As the elements τ_{j} decrease, the weights *w _{i}* in the targeted sample become more concentrated and so favour fewer potential data points. One result of this is that the assumption of an infinite number of points with non-negligible weight becomes invalid; if, in addition, our initial estimate of (μ_{k*}, Σ_{k*}) is inaccurate, the concentrated weights may direct the targeted subsample away from the component of interest.
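The construction of *S*_{τ} from *S* and the variance multipliers is mechanical; a minimal sketch (the values of *S* and τ here are illustrative, not from the analysis):

```python
import numpy as np

def inflate_covariance(S, tau):
    """S_tau = T S T with T = diag(tau_j ** 0.5): entry (i, j) of S is
    multiplied by sqrt(tau_i * tau_j), so the j-th variance scales by
    tau_j while all correlations are preserved."""
    T = np.diag(np.sqrt(np.asarray(tau, dtype=float)))
    return T @ S @ T

S = np.array([[1.0, 0.5],
              [0.5, 2.0]])
tau = [1000.0, 1.0]                  # a very large tau_j, as used for all but
S_tau = inflate_covariance(S, tau)   # the 4th marker in the example, effectively
                                     # flattens the weight in that dimension
```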

Consider the one-dimensional case where *S*_{τ} = τ*S*, so that *w*(*x _{i}*) = *N*(*x _{i}* | *m*, τ*S*).

Example in one dimension: the blue curve represents the mixture *f*(*x* | π,μ,Σ), the red line the density of the low-probability component *N*(*x* | μ_{k*},Σ_{k*}), and the black curve the weight function …

Considering the overlap between the distribution of the targeted subsample and the low-probability component, we plot the common area, denoted by *A*(τ), for varying τ, obtaining the graph shown in Figure 7. As seen from Figure 7, the value of τ that maximizes the overlap between the low probability component and the targeted subsample varies. Specifically, the closer the remaining components are to the component of interest (and, similarly, the higher their variance), the lower the optimal τ; the same happens as the weight of the component of interest decreases.
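*A*(τ) can be approximated numerically as the common area ∫ min(*g*_{τ}(*x*), *N*(*x* | μ_{k*}, Σ_{k*})) d*x*, where *g*_{τ} is the density of the targeted subsample, proportional to *f*(*x*)*w*(*x*). A one-dimensional sketch, with a hypothetical two-component mixture and illustrative numerical values:

```python
import numpy as np
from scipy.stats import norm

def overlap_A(tau, pi_star=0.01, mu_star=4.0, sd_star=0.5, mu0=0.0, sd0=1.0):
    """Common area between the low-probability component N(mu_star, sd_star^2)
    and the targeted-subsample density, proportional to f(x) * w(x) with
    w(x) = N(x | mu_star, tau * sd_star^2), computed on a grid."""
    grid = np.linspace(-6.0, 10.0, 4000)
    dx = grid[1] - grid[0]
    f = (1 - pi_star) * norm.pdf(grid, mu0, sd0) \
        + pi_star * norm.pdf(grid, mu_star, sd_star)      # the mixture f
    w = norm.pdf(grid, mu_star, np.sqrt(tau) * sd_star)   # weight function
    g = f * w
    g /= g.sum() * dx                                     # normalized g_tau
    comp = norm.pdf(grid, mu_star, sd_star)               # component k*
    return np.minimum(g, comp).sum() * dx                 # common area
```

Evaluating `overlap_A` over a grid of τ values reproduces the qualitative behaviour described above: the overlap is maximized at a finite τ whose location depends on the separation, spread and weight of the components.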

Example of *A*(τ) for several values of (μ_{k*}, π_{k*}), using a numerical approximation of the integral in order to calculate the common area.

Combining the above results with the fact that a large τ will improve the acceptance rate for μ, Σ but reduce the acceptance rate for π, and taking into account the uncertainty in $S={\widehat{\Sigma}}_{{k}^{\ast}}$, it is apparent that the optimal τ is not simply 1, and that its choice affects many levels of the analysis.

Matlab code implementing the analysis described here, with examples, is available at http://ftp.stat.duke.edu/WorkingPapers/09-26.html.

^{1}Data from an NIAID/BD IntraCellular Staining Quality Assurance Panel (ICS QAP), kindly provided by the Duke Center for AIDS Research (CFAR) Immune Monitoring Core.

- Balakrishnan S, Madigan D. A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis. 2006;1(2):345–362.
- Bayarri MJ, Berger J. Robust Bayesian analysis of selection models. Annals of Statistics. 1998;26(2):645–659.
- Carvalho C, Johannes M, Lopes H, Polson N. Particle Learning and Smoothing. Statistical Science. 2010 To appear.
- Chan C, Feng F, Ottinger J, Foster D, West M, Kepler T. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry A. 2008;73:693–701. [PMC free article] [PubMed]
- Chopin N. A sequential particle filter method for static models. Biometrika. 2002;89(3):539–552.
- Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo Methods in Practice. Springer-Verlag; New York: 2001.
- Gordon NJ, Salmond DJ, Smith AFM. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings. 1993;140:107–113.
- Heckman JJ. Sample selection bias as a specification error. Econometrica: Journal of the econometric society. 1979;47(1):153–161.
- Ishwaran H, James L. Approximate Dirichlet process computing in finite normal mixtures: Smoothing and prior information. Journal of Computational and Graphical Statistics. 2002;11:508–532.
- Liu J, West M. Combined parameter and state estimation in simulation-based filtering. In: Doucet A, De Freitas JFG, Gordon NJ, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag; New York: 2000. pp. 197–223.
- MacEachern SN. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics. 1998;7(2):223–238.
- Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83(1):67.
- Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jagera PL, Mesirova JP. Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences. 2009;106(21):8519. [PubMed]
- Ridgeway G, Madigan D. A sequential Monte Carlo method for Bayesian analysis of massive datasets. Data Mining and Knowledge Discovery. 2003;7(3):301–319. [PMC free article] [PubMed]
- Seder R, Darrah P, Roederer M. T-cell quality in memory and protection: implications for vaccine design. Nature Reviews Immunology. 2008;8(4):247–258. [PubMed]
- Suchard M, Wang Q, Chan C, Frelinger J, Cron A, West M. Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics. 2010;19:419–438. [PMC free article] [PubMed]
- West M. Discovery sampling and selection models. In: Gupta SS, Berger JO, editors. Statistical Decision Theory and Related Topics. Springer-Verlag; New York: 1994. pp. 221–235.
- West M. Inference in successive sampling discovery models. Journal of Econometrics. 1996;75(1):217–238.
- West M, Harrison PJ. Bayesian Forecasting and Dynamic Models. 2nd edition Springer-Verlag; New York: 1997.
