Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Bayesian Anal. Author manuscript; available in PMC 2010 September 21.
Published in final edited form as:
Bayesian Anal. 2010; 5(3): 1–22.
PMCID: PMC2943396

Selection Sampling from Large Data Sets for Targeted Inference in Mixture Modeling


One of the challenges in using Markov chain Monte Carlo for model analysis in studies with very large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full dataset is used to guide selection sampling of a further set of observations targeted at a scientifically interesting, low probability region. We define a Sequential Monte Carlo strategy in which the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. An example from flow cytometry illustrates the ability of the approach to increase the resolution of inferences for rare cell subtypes.

Keywords: Flow cytometry, large data sets, mixture models, rare events, resampling, selection sampling, sequential Monte Carlo

1 Introduction

Advances in technology in biological research, as in other fields, are challenging our ability to routinely analyse increasing large data sets. In the motivating application area of flow cytometry, routine assays generate multiple measurements on cell surface markers on each of tens of thousands to millions of individual cells (Chan et al. 2008). Mixture models are applied for cell subtype classification and discrimination, and specific interests often relate to characteristics of rather rare subtypes. For example, polyfunctional lymphocyte subsets that are of interest in predicting vaccine efficacy (Seder et al. 2008) may have frequencies of 0.01% or less of the total blood cell population. Markov chain Monte Carlo is a powerful tool for drawing inferences in mixture models, but its use requires computations on the full dataset at each iteration. This is a drawback in cases of very large datasets, so some approaches have been developed to focus on randomly drawn subsamples of data. For example, Ridgeway and Madigan (2003) proposed a two-step algorithm of drawing subsamples via Sequential Monte Carlo, and this was improved upon by Balakrishnan and Madigan (2006) by introducing a rejuvenation step based on a kernel smoothing approximation similar to Liu and West (2000). Here, the observations of interest are rare, so that random subsamples typically contain very few observations of the rare subtype, and new approaches are required.

Generally, we are interested in drawing inferences about low probability regions of sample space in mixture model analyses of large datasets. We use traditional Bayesian mixtures admitting uncertainty about the number of components (e.g. Müller et al. 1996; MacEachern 1998; Ishwaran and James 2002; Suchard et al. 2010). We focus on inferences about a low probability mixture component, or a group of several low probability components that together represent a scientifically relevant subpopulation. Our central idea is to use an initial random subsample of data in order to construct a weight function directed around the region of interest, and use this to subsequently draw a targeted subsample of data preferentially selected from that region of interest. This builds on traditional ideas of selection and weighted sampling (e.g. Heckman 1979; Bayarri and Berger 1998) and their application in discovery sampling (West 1994, 1996). Here the use of non-parametric Bayesian mixture models allows us to link regions in sample space with specific components of the model and naturally identify subsets of observations which are relevant to the scientific question at hand through a component-driven weight function. We implement a two-step Markov chain Monte Carlo approach that first uses the random subsample to obtain an initial posterior, then adds the targeted subsample to draw component-specific inferences. We extend the method to a Sequential Monte Carlo algorithm whereby the targeted subsample is augmented sequentially, guided by a stopping rule, to successively refine inferences on the rare subpopulation, to the extent feasible.

2 Modelling and posterior distributions

In contexts such as our motivating flow cytometry applications, Gaussian mixtures are used as flexible overall models and scientifically relevant subpopulations are identified by (typically, small) subsets of Gaussian components that can reflect non-Gaussianity within subpopulations (Chan et al. 2008). Hence, with no loss of generality here, we consider a Gaussian mixture for p–variate data, with N samples xi, (i = 1, …, N), defining a data set X. The density of the mixture is


and we adopt a standard truncated Dirichlet process model to define the prior over the mixing probabilities based on some (large) upper bound K (Ishwaran and James 2002). Let


The mixture model can be realized through the configuration indicators zi for each observation xi with prior p(zi = k | π) = πk, so that we obtain the standard hierarchical model


where G(·) is an uncertain distribution function, G0(·) is the prior mean of G(·) and α > 0 the total mass, or precision of the DP. From the Pólya urn scheme,


The truncated Dirichlet process prior is such that π1 = V1 and πk=Vk×Πi=1k1(1Vi), k > 1, where Vi ~ Be(1, α), i < K independently over i and VK = 1. Prior specification for each component k is completed with a traditional normal-inverse Wishart form,


and with a Gamma prior for the Dirichlet concentration parameter α ~ Ga1, η2). Placing a prior on α (see Ishwaran and James 2002) allows us to draw inferences about the number of mixture components through the role of α of the Pólya urn scheme as the prior number of observations in each component. Finally, we label components in decreasing weight π as an identifiability criterion to address the label-switching problem.

In general, a scientific question may define focus on a specific region of sample space. Here we take one key example, that of identifiable regions of low probability but of high scientific importance. For specificity, we assume this is defined by a low probability component of the mixture. In more general versions, this could be defined by a small set of mixture components, or other devices. Hence, we focus on targeting inferences about the parameters ϕk* = (μk*, Σk*) of a low-probability component k*.

The objective is to identify and analyze subsamples of the data which contain information about the specific subset of the parameters of interest. The key idea is to obtain a rough estimate of the low-probability component k* based on a random subsample, which is subsequently used to draw weighted subsamples of the data that are more likely to be relevant to our analysis, providing us with higher resolution about the structure of the distribution in the region of interest.

The direct approach is to follow a two-step procedure of Markov chain Monte Carlo samplers. We use an initial, randomly drawn subsample from the data in order to obtain an estimate of the parameters, and use this estimate to draw a more informative subsample. The two subsamples are then combined in a joint Markov chain Monte Carlo sampler to provide us with a posterior distribution of ϕk* which will be an improved estimate with respect to the total posterior based on the whole sample. Although interest specifically lies in estimating the parameters of component k* given by μk*, Σk*, inference on the full set of μ, Σ is required in order to carry out the analysis.

We denote the two subsamples (random and targeted) XR and XT of size nR and nT respectively, where nT [dbl less than] nR throughout this paper. The first is drawn randomly from the data, whereas the second is drawn according to weights wi 1 ≤ iN. We aim to choose the weights so that the targeted subsample is expected to be enriched in observations from component k*; one specific choice is to take


the density of the normal distribution in which m, S are estimates of μk*, Σk* from the initial analysis based on an initial random subsample, and Sτ = TST where T=diag(τ112,,τp12) is a p × p diagonal matrix based on a set of positive variance multipliers τj, (j = 1, …, p). These allow us to concentrate (or expand) targeted resampling differentially in different dimensions with due regard for correlation structure, if desired. We use the notations wi = w(xi) and wi = w(xi | m, τ, S), the latter when the explicit dependence of the weight function on m, τ, S is to be high-lighted.

The likelihood of the data (XR, XT) then takes the following form. For observations in the random subsample:


The first expression provides the likelihood of the standard mixture model, whereas the second expression represents the likelihood conditionally on the configuration indicator ziR, the component where observation xiR belongs.

Similarly, for observations in the targeted subsample:






Note here that, conditional on all the parameters, targeted observations are independent, using the infinite population assumption for the weights w(xi). This means that we assume a very large number of observations within the region of non-negligible support of the weight function w(x), creating an inherent trade-off between the validity of the infinite population assumption and the focus of the targeted subsample. These can be monitored through the normalizing constant of the sampling weights Σi=1Nw(xi), and tuned through the parameter vector τ (see Appendix for a more detailed analysis of the role of τ).

The first Markov chain Monte Carlo sampler is a standard blocked Gibbs sampler (Ishwaran and James 2002) to simulate p(α, π, μ, Σ | XR). In order to carry out the second Markov chain Monte Carlo sampling based on the random and targeted sub-samples combined, the posterior distributions of the parameters of α, π, μ, Σ have to be re-calculated to define efficient proposals. Although conjugacy for μ, Σ, π is lost, we keep the standard normal-inverse Wishart and Dirichlet priors for the mixture model.

The posterior for z for both subsamples is multinomial with probabilities


The posterior for π has density


The contribution of the targeted subsample to the posterior distribution for π provides little additional information about the distribution of π when the elements Sτ are small, and becomes more significant as the elements of Sτ increase, allowing observations in the targeted subsample to belong to components other than k*.

The posterior for α only depends on the data through V and thus will have the usual posterior distribution (see Ishwaran and James 2002)


The posterior for μk can be calculated exactly as


where Skμ, mkμ are given by


where nk is the total number of data points in component k and nkT is the number of data points in that component coming from the targeted subsample. Notice that the contribution of the targeted subsample to the posterior variance of μk is nkT(Sτ1Σk+I)1, and since S is an estimate of Σk, this quantity has diagonal elements of the order of nkTτj(1+τj)1 for j = 1, …, p; a more concentrated weight function reduces the information about μk.

The posterior for Σk has density


The non-standard posteriors (10) and (13) lead to the need for creative approximations to define efficient MCMC proposals, as now developed.

3 Markov chain Monte Carlo approach

We construct a MCMC sampler with stationary distribution p(μ, Σ, z, V, α | XR, XT). The chain is initialized by drawing μ, Σ, z, V, α from their priors, then iterates through the following steps.

  1. Update z by generating from the posterior given in Equation (9).
  2. Update V through a Metropolis-Hastings step by generating from the posterior based only on the initial random subsample,
    with VK = 1. Set πk=VkΠj=1k1(1Vj) and accept the proposed move with probability
    Recall that π~(θ,m,τ,S) given in Equation (7) corresponds to the component probability weights in the targeted subsample. If the targeted subsample is indeed drawn such that most of its points belong to component k*, the acceptance probability will be ≈ 1.
  3. Update α from its posterior given V given in Equation (10).
  4. Update each μk through a Gibbs step using
    given in Equation (11).
  5. For each Σk, we construct a proposal q(ΣkXR,XT,z,μ) for a Metropolis-Hastings step using the fact that
    We use the inverse transformation to obtain
    In practice, of course, Σk is not known. A similar transformation of XT can be obtained using an estimate of Σk, providing a proposal distribution
    In order to increase the variance of the proposal kernel, a discount factor may also be used.

The Markov chain Monte Carlo sampler sweeps through the updates described above, yielding estimates for the posterior distribution of the parameters of interest. However, due to the high number of parameters to be estimated and the difficulty in defining efficient proposals, the acceptance rate quickly drops to zero for targeted subsamples of moderate size.

4 Focusing on the low-probability component

The dimensionality of the problem, combined with the difficulty to construct efficient proposals, results in Markov chain Monte Carlo samplers which require very long running times in order to reach stationarity with respect to the posterior distribution. At the same time, the approach described above does not exploit the results from the initial run based on the random sample, except for extracting the estimates of μk*, Σk*. We describe how the dimensionality of the problem can be greatly reduced using the posterior distribution estimates obtained from the initial Markov chain Monte Carlo simulation.

Notice that the objective is to draw inferences about a region in the sample space which has very low probability. Consequently, very few points in the initial random sample will belong to that region. On the other hand, the targeted sample will, generally, contain observations from the low-probability region. This implies that the posterior distribution of the parameters based on both the random and targeted samples (XR, XT)


can be approximated as


using p(zR | XR, XT) ≈ p(zR | XR). The approximation is reasonable since XT is centred around a specific region, implying that the component structure of most of the sample space remains unchanged after introducing XT, while the number of observations in zR in the region of interest is far outnumbered by the zT s in that region. The approximation (a) requires integrating over a much smaller set of parameters zT and can be calculated much more efficiently, and (b) is known from the first Markov chain Monte Carlo run, allowing us to update only zT and draw zR from the existing posterior samples. This de-couples the z-dependence of the random and the targeted sample, greatly reducing the dimensionality of the second analysis, since nT [dbl less than] nR. By fixing the configuration indicators of the random subsample, the component sums of XR and updates of zR remain unchanged.

The second Markov chain Monte Carlo is then adapted to a set of chains run for a set of samples. Each chain will provide posterior estimates for the parameters conditionally on a fixed draw of zR, so that, marginally, we obtain a set of samples from the (approximate) posterior distribution shown in Equation (20). Specifically, for chains l = 1 : L, draw a sample from (z, π, μ, Σ)l | XR and apply the second sampler for each chain only on π, zT, μ, Σ, α | XR, XT, (zR)l keeping zR fixed, combining samples at the end. In effect, the algorithm amounts to an Importance Sampler (Doucet et al. 2001). This approach greatly reduces both the complexity of the calculations per sweep, as well as the total number of samples required in order to obtain a good approximation of the posterior distribution. However, because the posteriors μk*, Σk* | XR and μk*, Σk* | XR, XT may differ substantially, the sampler still suffers from very low acceptance rates and with a moderately sized targeted subsample can fail to reach the region in parameter space of high posterior probability.

5 Sequential Monte Carlo approach

The focused approach drastically reduces the dimensionality of the algorithm, and as a result the computational complexity. However, Metropolis-Hastings updates still show low acceptance rates, since the posteriors conditional on XR and (XR, XT), respectively, can be very different in the subspace of parameters related to the targeted component. In addition, the size of the targeted subsample is chosen manually rather than through an automated procedure. Both drawbacks may be addressed drawing the targeted sample through a Sequential Monte Carlo simulation rather than using a two-step procedure. A large number of random samples (particles) is used to approximate the targeted sequence of distributions, so that asymptotically this converges to the true target distribution; see Doucet et al. (2001), Chopin (2002), Carvalho et al. (2010). We consider a sequential scheme such that the targeted sample is selected B data points at a time, at each draw updating parameter estimates for a set of particles.

For each of a set of particles j = 1 : J, draw a sample of (z, π, μ, Σ) | XR from the posterior distribution estimates obtained in the Markov chain Monte Carlo sampler. Then repeatedly augment the targeted subsample and mutate the parameter draws through the following steps.

For j = 1 : J and for a fixed sequence of variance scaling vectors τ = t1, …, tJ:

  1. Select a particle u uniformly at random and set mj={μkj}u and Sj={Σkj}u where {ϕkj}u is the sample of the uth particle at step j for component k*.
  2. Draw another batch of B targeted observations XjT without replacement according to weights wi [proportional, variant] w(xi | mj, tj, Sj).
  3. Using a fixed number of Metropolis-Hastings steps following the iterates described in the Markov chain Monte Carlo approach above, update the configuration indicators zT, the weights π, the concentration parameter α and the component-specific parameters μ, Σ by repeating the following updates:
    • (a)
      Update zT through update (9), π through update (14) and α through (10).
    • (b)
      Similar to the posterior distribution for μ given in (15), the posterior distribution of μk now becomes
      with Ui = Sτ at the values S = Si and τ = ti.
    • (c)
      Update each Σk using a proposal distribution similar to (18), replacing m, S in Equation (17) by
      where batch(i) represents the batch of targeted observations that xiT was sampled in.

The sequential Monte Carlo algorithm described does not include a particle resampling step; since each particle has a fixed zR, re-sampling would result in poor coverage of the zR space. The algorithm can be classified as a Sequential Important Sampler (Gordon et al. 1993). At each draw j, the samples of μ and Σ are obtained through a Metropolis-Hastings kernel targeting the posterior p(μ,ΣXR,X1:jT,m1:j,S1:j). In this Sequential Importance Sampling setting, asymptotically (as the number of particles becomes sufficiently large), the approximation will converge to the true posterior, since each particle (provided a reasonable number of Metropolis-Hastings updates) is indeed sampling from the posterior. A larger number of Metropolis-Hastings updates will be needed when the component structure changes through the targeted sample in order to fit the emerging components. The tuning parameter vector τ allows monitoring of dispersal of the targeted sample and also the adequacy of the infinite population sampling assumption (see Appendix 6).

Owing to the method in which the parameters m, S of the weight function are fixed at each step of the re-sampling, weight functions located around different regions of sample space may be chosen. When the low-probability component follows a mixture distribution between different regions of sample space, this will be reflected in the estimates obtained from each particle, resulting in each particle potentially corresponding to a different component. Through our adaptive algorithm, the sample space is explored flexibly and posterior estimates of the parameters are updated incrementally as the targeted subsample is augmented, allowing more efficient inferences.

This approach immediately poses the question of when to stop drawing observations for the targeted subsample. Ideally, we would like the targeted sample to contain all data points of component k*. In order to address this, we introduce a decision rule such that the targeted sample stops being augmented when no more data points in the remaining original data show a high probability of belonging to component k*. A natural approach to use is the Bayes Factor for that component, using for example a threshold of the order e2; see West and Harrison (1997). In other words, we introduce an extra decision step, as follows:

  • 5a
    For each unsampled observation xi, define the Bayes Factor
    where πk* (xi) ∞ πk* N(xi | μk*, Σk*). Stop sampling if there are no observations such that BFk* (xi) > BFthreshold, a specified threshold.

The calculation of the Bayes Factor is computationally demanding; as an alternative, the stopping rule may be expressed purely as a function of the weights. In other words:

  • 5b
    If there are fewer than Nthreshold unsampled observations within a cthreshold contour of the weight function, stop.

The threshold values are important for monitoring the impact of the infinite population assumption on our inferences, representing the `number of points carrying significant weight'. The values chosen will depend upon the number B of batched targeted observations drawn at each iteration of the sampler and the dimension of the data; for example, one may use Nthreshold = 3B and cthreshold = exp(−p/4), where p is the number of markers.

The Sequential Monte Carlo approach provides an efficient method of drawing inferences about parameters relevant to a low probability region of sample space, at the same time allowing the algorithm to automatically monitor the number of observations in the region of interest.

5.1 Example: Synthetic dataset

We implement our methods on a 2-dimensional synthetic dataset with 5,000 observations from a Gaussian mixture model with 5 components. This serves to illustrate the approach in a synthetic context where the relatively small sample size allows comparison with posterior computations based on the full data set. The model structure, shown in Figure 1, was chosen so that the component of interest can be visually separated from the rest, at the same time having significant overlap with the remaining components. The component k* is defined as centered within a certain region, and closest to its estimate of the mean after the initial samples from the posterior distribution based on a random subsample of size 700. If no such component exists, the weight function is initialized with m = m0, the centre of the pre-specified region, and S = S0/s0, the prior mode of Σ. Using our sequential Monte Carlo approach, we draw B = 10 observations at a time, fixing τj = 1, j = 1,…,p. We define the stopping rule using Nthreshold = 20 and cthreshold = exp(−0.5), resulting in 200 targeted obervations. We carry out a standard Gibbs sampler on the full dataset, and compare the results between the two methods.

Figure 1
On the left, a scatter plot of the 2 dimensional data. The full data are shown in yellow, the random subsample in red, and the targeted subsample after the SMC algorithm in blue. The contours show the component structure of a single draw from the total ...

The simulation results show that the posterior distribution conditional on a targeted subsample of 900 observations, in addition to the random sample of 700 initial observations, shows an improvement of the posterior estimates comparable to those obtained using the complete dataset, but using less than 20% of the total observations; this by far exceeds the expected improvement using a random subsample of the same size, providing a much more efficient algorithm. For example, the boxplot shown in Figure 1 representing posterior inferences on the first dimension of the mean vector of component k* is centered closer to the true mean with a much tighter posterior variance. As expected, the posterior variance is still lower using the full data set: this is both because there are more data in component k* (the stopping rule will always leave some data out in order to monitor the infinite popultion assumption), and because it carries more information about the component weights π.

5.2 Example: Flow cytometry

The motivating example for this study is a problem arising in flow cytometry, where cellular subtypes may be associated with one (or more) components of a Gaussian mixture model (see Chan et al. 2008). Flow cytometers detect fluorescent reporter markers that typically correspond to specific cell surface or intracellular proteins on individual cells, and can assay millions of such cells in a fluid stream in minutes. Datasets are typically very large, and as a result inference on the full data is computationally prohibitive. Interest lies in identifying and characterizing rare cell subtypes using a mixture model fitted on those markers. The ability to identify such rare cell subsets plays important roles in many medical contexts - for example, the detection of antigen-specific cells with MHC class I or class II markers, identification of polyfunctional T lymphocytes that correlate with vaccine efficacy or host resistance to pathogens, or in resolving variants of already low frequency cell types, e.g. subtypes of conventional dendritic cells.

We use a dataset of 50,000 data points from human peripheral blood cells, with 6 marker measurements each: Forward Scatter (measure of cell size), Side Scatter (measure of cell granularity), CD4 (marker for helper T cells), IFNg+IL-2 (effector cytokines), CD8 (a marker for cytotoxic T cells), CD3 (marker for all T cells)1. The objective is to provide higher resolution on the structure and patterns of covariation of cells of a specific cell subtype, specifically cells high in CD3 and/or CD8 secreting IL-2/IFN-g when challenged with a specific viral antigen. In other words, we are particularly interested in observations with large values in the 4th dimension together with the 5th and/or 6th. The data show a clear component structure for some of the markers (see Figure 2), whereas in others the rare cell subtypes of interest are not separated. To illustrate our methods and for ease of exposition, we adapt our algorithm by targeting inferences towards the component with CD4 and IFNg+IL-2 (3rd and 4th dimension) centred closest to a specific point, and set each τj to be very large, namely τj = 1000 for all but the 4th dimension, focusing on the secretion of IFNg+IL-2. Here we set K = 200, as we expect the maximum number of components to be far fewer than 200. Owing to the structure of the model, using an upper bound much larger than the number of components necessary does not affect the accuracy of the posterior estimates.

Figure 2
Scatter plots of the data for the last 4 markers: CD4, IFNg, CD8 and CD3. The complete data set is shown in yellow. We aim to use the random subsample (shown in red) in order to obtain samples from the initial posterior p(μ,Σ,π,α| ...

An initial random subsample of size 5,000 is drawn, providing us with initial estimates m, S for the mean and covariance of the component closest to the high CD4+ region. Due to the strong covariation between the markers, several components are needed (see Figure 3) in order to capture the inhomogeneity of the data. Using initial weights w(x) = N(x | m, Sτ), we apply our sequential Monte Carlo algorithm to obtain a complete targeted subsample in terms of the stopping rule as well as posterior samples for all our parameters. We draw 30 observations at a time so that B = 30, with cthreshold = 0.2 and Nthreshold = 30, resulting in a total of 390 targeted observations.

Figure 3
Flow cytometry data analysis via using the Sequential Monte Carlo targeted re-sampling algorithm: sample realization of the mixture model (a) based on the random subsample and (b) based on both the random subsample and the targeted subsample. Crosses ...

Figure 3 shows estimated posterior distributions for the total number of components based on the initial random subsample, and subsequently after the SMC sampler given both the random and targeted subsamples. We observe the efficacy of the targeted approach, reflected in part through improved identification of the structure of the model density in the targeted region (Figure 4). More specifically, observing samples from the mixture model (see Figure 3) in the CD4 and IFNg/IL-2 markers before and after the targeted subsample, we see that the targeted approach has led to finer resolution on the mixture structure in the region of the targeted rare cell subtypes, providing higher resolution about the structure and covariation of their markers. Importantly, this revealed components in the low probability subregion which emerge due to the covariation with the remaining markers. This is confirmed by the analysis on the full dataset, shown in Appendix 6, where several more components are inferred. The improvement of the estimates obtained using our selection sampling approach on a total subsample (random and targeted) of size 5390 by far exceeds the expected improvement using a random subsample of the same size. The findings agree with the biologists' expectation that cell subtypes often have a non-Gaussian structure, (see Chan et al. 2008; Pyne et al. 2009), and provide an efficient method of detecting and drawing inferences about rare populations in the presence of very large datasets.

Figure 4
Posterior distributions for the number of components in the Gaussian mixture model, given only the random subsample p(k | XR) shown in black and given both the random and targeted subsample p(k | XR,XT) shown in white.

6 Discussion

One of the key aspects of this work consists in defining the low probability region of interest and specifying the weight function used in the targeted sample. Naturally, the low probability region in sample space is strongly driven by the scientific question in hand. Based on that, and taking into account algorithmic tractability and efficiency, different weight functions may be used. In this work we presented methods relating to inferences about a specific component, defined in terms of an identifying criterion. In the flow cytometry example used in this paper, this was chosen as the component with mean closest to a specific point. Although the weight function used had a Gaussian shape, the analysis revealed a non-Gaussian structure in the region of interest; using mixtures of components as a weight function would be a straightforward extension of our methods. In fact, using a hierarchical model of mixtures of Gaussian mixtures may provide a better fit to the non-Gaussian inhomogeneous structure of the flow cytometry data; our targeted subsampling approach can be implemented using such models at little additional computational cost.

In the case of the flow cytometry data, an alternative to the identifying criterion of the component of interest would be to use the component containing a specific cell of known rare cell subtype. A natural extension to the weight functions used in this work stems from the fact that, in the original flow cytometry data, the identifying criterion for the component of interest is not defined on a fixed number of dimensions. Instead, it is defined as the set of markers which are significant in identifying the component in the region of low probability in sample space, which itself is unknown. In other words, the Gaussian mixture may be defined only on a subset of the p markers (unknown), such that we draw inferences about the parameters of the mixture p(θqX)xiRq corresponding to variable dimensions 1 : q, qp. The targeted learning about θq can be incorporated in the analysis such that, within the sequential design, the weight function w(x) [proportional, variant] N(x | m, S) is updated at each round of re-sampling both in terms of the mean m and covariance S of the Gaussian distribution, but also in terms of the markers over which the weight function is defined. In the case of flow cytometry data, this can be viewed as soft gating of cells into cell subtypes, based on both the values of the individual markers, but also the set of significant markers.

One of the main challenges in drawing inferences about targeted subsamples is constructing efficient proposals for the parameters of interest, as the convergence of the algorithms is influenced by several factors. The size of the targeted subsample in relation to the random subsample plays a significant role. This becomes especially important when the assumption of a large number of observations within the region of interest is breached, as this would lead to a likelihood used for the targeted subsample which deviates severely from the true likelihood because of sampling without replacement. The multiplicative elements of τ also play significant roles in constructing a weight function which is wide enough to not violate the infinite weights assumption, at the same time targeting the region of interest.

Furthermore, the size of the initial subsample affects the posterior variance of the estimators. In fact, the random subsample may contain no observations in the low probability region. The weight function can be treated as having a prior mean and variance, used if no more information is available in the random subsample. As more observations are drawn sequentially from the specific region, this weight function is updated to include further information. Although the posterior estimates will be comparable, a very small initial subsample will strongly affect the efficiency of the proposal kernels, and possibly require overall longer running times to obtain similar results.

Supplementary Material



Research was partially supported by grants to Duke University from the National Science Foundation (DMS-0342172) and the National Institutes of Health (grant P50-GM081883, P30-AI064518-0 and contract HHSN268200500019C). Aspects of the research were also partially supported by the NSF grant DMS-0635449 to the Statistical and Applied Mathematical Sciences Institute. CC and MW were also partially supported on this research by NIH grant RC1AI086032. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.


A: Flow cytometry data

In the case of the flow cytometry dataset under study, MCMC inference on the full dataset is feasible. The results are shown below in Figure 5, comparing the 3 relevant GMM structures in the case of (a) a random subsample (b) a random and a targeted subsample and (c) the full data. Here it's not possible to draw a direct comparison between the posterior distributions of the component k*, because the component structure in the random, targeted and full dataset case changes significantly (with a much larger number of components in the full dataset case, as is expected by the Dirichlet process prior). The comparison shows that, indeed, there are more components in the region of interest using the full data, as is also suggested by our targeted approach.

Figure 5
Samples of the Gaussian mixture model structure in the case of the (a) random subsample (b) random and targeted subsample and (c) full data. Red points represent randomly selected observations while blue represent targeted observations. Components are ...

B: Weight functions

In both the Markov chain Monte Carlo and Sequential Monte Carlo approaches described above, the targeted sample was weighted according to w(x) = N(x | m,Sτ), where m and S are estimates of the mean and covariance of the low-probability component k*, and Sτ = TST where T=diag(τ112,,τp12) is a p × p diagonal matrix based on a set of positive variance multipliers τj, (j = 1, . . . , p). The τj are tuning parameters. Larger values will allow for wider dispersal of the targeted sub-sample, accounting for uncertainty of the initial estimate of ϕk*. As elements τj decrease, the weights wi in the targeted sample become more concentrated and so favour fewer potential data points. One result of this is that it the assumption of an infinite number of points with non-negligible weight becomes invalid. If our initial estimate of (μk*, Σk*) is poor, small elements τj will restrict the targeted sample to a region away from the full low probability region of interest. Within the context of the Metropolis-Hastings updates, as elements τj increase, the acceptance rate for μ, Σ increases, since the targeted sample looks more like the random sample. In that case, the posterior distribution of ϕk* is not pulled too far from the proposed values. At the same time, as elements τj increase, acceptance rates for π decrease since the information about π given by the targeted sample becomes significant, and the proposed values (which are based only on the random subsample) become less acceptable.

Consider the one-dimensional case where Sτ = τS so that w(xi) [proportional, variant] N(xi | m, τS). Assume that μ, Σ, π are known and that there is an infinite number of data points. The weight function becomes w(xi) ≈ N(xi | μk*, τΣk*), and the coefficient τ (here scalar) may be chosen such that the probability of drawing data points from the low-probability component is maximized. An example is shown in Figure 6.

Figure 6
Example in one dimension, here the blue curve represents the mixture f(x | π,μ,Σ) and the red line the density of the low-probability component N(x | μk*k*). The black curve then represents the weight function ...

Considering the overlap between the distribution of the targeted subsample and the low-probability component, we plot the common area, denoted by A(τ), for varying τ, and obtain the graph shown in Figure 7. As is seen from Figure 7, in terms of maximizing the overlap between the low probability component and the targeted subsample, the optimum value of τ varies. Specifically, the closer the remaining components are to the component of interest (and similarly the higher their variance) yields a lower value for the optimum τ, and the same happens when the weight of the component of interest decreases.

Figure 7
Example of A(τ) for several values of (μk*, πk*), using a numerical approximation of the integral in order to calculate the common area.

Combining the above results with the fact that a large τ will improve the acceptance rate for μ, Σ but reduce the acceptance rate for π, and taking into account uncertainty on the S=Σ^k, it is apparent that the optimum coefficient τ is not uniquely 1, and plays a significant role which affects many levels of the analysis.

C: Software

Matlab code implementing the analysis described here, with examples, is available at


1Data from an NIAID/BD IntraCellular Staining Quality Assurance Panel (ICS QAP) kindly provided by the Duke Center for AIDS Research (CFAR) Immune Monitoring Core


  • Balakrishnan S, Madigan D. A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis. 2006;1(2):345–362.
  • Bayarri MJ, Berger J. Robust Bayesian analysis of selection models. Annals of Statistics. 1998;26(2):645–659.
  • Carvalho C, Johannes M, Lopes H, Polson N. Particle Learning and Smoothing. Statistical Science. 2010 To appear.
  • Chan C, Feng F, Ottinger J, Foster D, West M, Kepler T. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry A. 2008;73:693–701. [PMC free article] [PubMed]
  • Chopin N. A sequential particle filter method for static models. Biometrika. 2002;89(3):539–552.
  • Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo Methods in Practice. Springer-Verlag; New York: 2001.
  • Gordon NJ, Salmond DJ, Smith AFM. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings. 1993;140:107–113.
  • Heckman JJ. Sample selection bias as a specification error. Econometrica: Journal of the econometric society. 1979;47(1):153–161.
  • Ishwaran H, James L. Approximate Dirichlet process computing in finite normal mixtures: Smoothing and prior information. Journal of Computational and Graphical Statistics. 2002;11:508–532.
  • Liu J, West M. Combined parameter and state estimation in simulation-based filtering. In: Doucet A, De Freitas JFG, Gordon NJ, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag; New York: 2000. pp. 197–223.
  • MacEachern SN. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics. 1998;7(2):223–238.
  • Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83(1):67.
  • Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jagera PL, Mesirova JP. Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences. 2009;106(21):8519. [PubMed]
  • Ridgeway G, Madigan D. A sequential Monte Carlo method for Bayesian analysis of massive datasets. Data Mining and Knowledge Discovery. 2003;7(3):301–319. [PMC free article] [PubMed]
  • Seder R, Darrah P, Roederer M. T-cell quality in memory and protection: implications for vaccine design. Nature Reviews Immunology. 2008;8(4):247–258. [PubMed]
  • Suchard M, Wang Q, Chan C, Frelinger J, Cron A, West M. Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics. 2010;19:419–438. [PMC free article] [PubMed]
  • West M. Discovery sampling and selection models. In: Gupta SS, Berger JO, editors. Statistical Decision Theory and Related Topics. Springer-Verlag; New York: 1994. pp. 221–235.
  • West M. Inference in successive sampling discovery models. Journal of Econometrics. 1996;75(1):217–238.
  • West M, Harrison PJ. Bayesian Forecasting and Dynamic Models. 2nd edition Springer-Verlag; New York: 1997.