Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2705170

Formats

Article sections

- Abstract
- 1 Introduction
- 2 The Model
- 3 Bayesian Inference for the SLCA Model
- 4 Pollution Data Example
- 5 Discussion
- References

Authors

Related links

Comput Stat Data Anal. Author manuscript; available in PMC 2010 June 15.

Published in final edited form as:

Comput Stat Data Anal. 2009 June 15; 53(8): 3057–3069.

doi: 10.1016/j.csda.2008.07.037PMCID: PMC2705170

NIHMSID: NIHMS110869

University of Minnesota, Division of Biostatistics

See other articles in PMC that cite the published article.

A spatial latent class analysis model that extends the classic latent class analysis model by adding spatial structure to the latent class distribution through the use of the multinomial probit model is introduced. Linear combinations of independent Gaussian spatial processes are used to develop multivariate spatial processes that are underlying the categorical latent classes. This allows the latent class membership to be correlated across spatially distributed sites and it allows correlation between the probabilities of particular types of classes at any one site. The number of latent classes is assumed fixed but is chosen by model comparison via cross-validation. An application of the spatial latent class analysis model is shown using soil pollution samples where 8 heavy metals were measured to be above or below government pollution limits across a 25 square kilometer region. Estimation is performed within a Bayesian framework using MCMC and is implemented using the OpenBUGS software.

Multivariate binary measurements are very common in many research areas, such as sociology, psychology, and medical diagnotic tests where multiple questions or tests are given to subjects (or patients) each with a simple 0-1 response. Two basic but commonly employed methods for dealing with the multivariate nature of the data are to either analyze all the measurements separately or instead to create some overall summary index of the indicators and then analyze the index as the single observed measure. Both methods fail to take account of the fact that different indicators might be correlated. On the other hand, the latent class model is a multivariate model which has been widely used for this type of data when it is reasonable to hypothesize the measurements as representations of some underlying clusters in the study population. The basic idea of the latent class analysis (LCA) model is that the indicators are correlated because there are unobserved clusters (i.e. latent classes) in the population and responses to the indicators differ for different clusters thus creating correlation between the indicators when the clusters are unknown. The model assumes that within each latent class, the responses to the indicators are, in fact, independent.

Commonly the LCA models applied in the literature make the assumption that the indicators are correlated within the same subject, but are independent for different subjects. This is often a reasonable assumption in problems where measurements are taken on independent individuals, but may not be appropriate in cases such as multiple indicators collected repeatedly over time, or over different spatial locations. There are examples of the latent class model being used to model multivariate binary data collected across time. For example, Reboussin et al. (1999) considered a latent transition model for data consisting of repeated measures of multiple indicators of health. In contrast, the current paper considers the latent class model for multivariate binary data collected across geographic regions.

The motivating data of this paper (see Figure 1) includes binary indicators for eight different heavy metals found in the soil which indicate whether the level of the heavy metal is over the legal pollution threshold or not at 97 spatially referenced locations throughout a 25 square kilometer region. It is of interest to identify contaminated areas and specifically to examine whether there are parts of the region that tend to share high levels of certain combinations of heavy metals, which may then help identify pollution sources. An LCA type model can be useful for this data since there are multiple indicators for each location and the indicators are likely to be correlated due to some common unknown and unobserved pollution sources. The challenge of this data is that observations of each of the eight variables are likely correlated over different locations and different variables observed at different locations may also be cross-correlated.

Sampled locations and values of heavy metals in the soil data. Values of 1 and 0 indicate soil sample has above and below regulatory limit for the respective heavy metal. Plot on bottom right provides location labels

In this paper we introduce a spatial latent class analysis model. In the first part of this hierarchical model we define the relationship between the observed indicators and the latent classes the same way as in a classic LCA model, and in the second part of the model we use a multinomial probit to model the latent classes by introducing continuous random variables underlying them. The benefit of the multinomial probit model is that spatial covariance structures can be introduced that take account of the spatial correlations between the latent class variables at different locations.

In Section 2 we review the classic LCA model, and introduce the spatial latent class analysis (SLCA) model. In Section 3 we describe inference for the parameters in a fully Bayesian framework and propose to use cross-validation methods for model comparison and examining fit. In Section 4 we show the results of the SLCA model applied to the soil pollution example. It is found in the soil pollution data example that a 3-class model does not fit well in favor of the smaller 2-class model. In order to demonstrate that the SLCA can indeed fit a model with more than 2 classes, the SLCA model is also fit to a simulated data example in Section 4 where the truth is 3 underlying spatially distributed latent classes. Section 5 provides a summary discussion.

The latent class model was designed to explain relationships between variables in a multidimensional contingency table (Clogg and Goodman, 1984; McCutcheon, 1987). Consider a sample of size *N* where each individual has *P* binary indicator variables. Let **Y**_{i} = (*Y*_{i1},…,*Y*_{iP}) be the *P* × 1 vector of observed binary indicators for individual *i*. Each individual is in one of the *K* underlying classes and we let *C _{i}* represent his latent class. It is assumed that the classes form a complete partition of the population and that Σ

$$P({Y}_{\mathit{ij}}=1|{C}_{i}=k)={\pi}_{\mathit{jk}},\phantom{\rule{0.2em}{0ex}}i=1,\dots ,N;\phantom{\rule{0.2em}{0ex}}j=1,\dots ,P;\phantom{\rule{0.2em}{0ex}}k=1,\dots ,K.$$

(1)

Second, given the class membership, the individual’s responses are independent:

$$P({Y}_{i1}={y}_{i1},\dots ,{Y}_{\mathit{iP}}={y}_{\mathit{iP}}|{C}_{i}=k)=\prod _{j=1}^{P}P({Y}_{\mathit{ij}}={y}_{\mathit{ij}}|{C}_{i}=k).$$

(2)

Then (1) and (2) imply the marginal distribution of **Y**_{i}:

$$P({Y}_{i1}={y}_{i1},\dots ,{Y}_{\mathit{iP}}={y}_{\mathit{iP}})=\sum _{k=1}^{K}P({C}_{i}=k)\prod _{j=1}^{P}{\pi}_{\mathit{jk}}^{{y}_{\mathit{ij}}}{(1-{\pi}_{\mathit{jk}})}^{1-{y}_{\mathit{ij}}}.$$

(3)

In the classic LCA, individuals are considered iid with *P*(*C _{i}* =

$$P({\mathbf{Y}}_{1},\dots ,{\mathbf{Y}}_{N}|{\mathbf{X}}_{1},\dots ,{\mathbf{X}}_{N})=\prod _{i=1}^{N}[\sum _{k=1}^{K}{\eta}_{\mathit{ki}}\prod _{j=1}^{P}{\pi}_{\mathit{jk}}^{{y}_{\mathit{ij}}}{(1-{\pi}_{\mathit{jk}})}^{1-{y}_{\mathit{ij}}}].$$

Statistical inference and applications of this model have been well studied and are straightforward to implement. For example the Mplus software (Muthen and Muthen, 1998) can be used for inference using maximum likelihood implementing an EM algorithm and Winbugs can be used for a fully Bayesian approach, see e.g. the “Biops” example in the Winbugs example manual vol. 2 (Speigalhalter, et al., 2004).

The nature of the multivariate binary data in our pollution example and the appeal of the LCA model for modeling multiple binary indicators motivate us to consider the LCA model. But since the heavy metal samples are collected over certain spatial locations, it is necessary to further develop the LCA model to take account of the possible spatial correlations in the data.

We propose a Spatial Latent Class Analysis (SLCA) Model where the relationship between the binary indicators and latent classes is defined in the same way as (1) and (2), but the model will take account of the spatial correlation between observations at different sites by putting a spatial model on the underlying latent classes. This means that when forming the joint distribution of the **Y** from (3) we will model jointly the *P*(*C*_{1}, *C*_{2}, …, *C _{n}*) via a spatial process so that

Because in general the latent classes represent K unordered categories, it is necessary to build a spatial model that can handle multinomial responses. Spatial models for continuous responses are much more developed than those for categorical outcomes though there is a relatively large literature focused on spatial models for binary responses. The autologistic model (Besag, 1974) was developed for binary data and is commonly used but it is not easily extended to multinomial responses. Diggle et al. (1998) considered a binomial logit link model with a spatially correlated normal random effect. A straightforward extension of their approach to multinomial spatial data is to generalize the binomial logit link to the multinomial logit link. However, a limitation of the multinomial logit model is that it assumes the alternative categories are mutually independent which may not be reasonable (Alvarez and Nagler 2001; Lacy and Burden 1999; McFadden et al. 1981; Cheng and Long (2007)). In the spatial econometrics literature, the probit model has been used extensively for modeling spatially distributed binary data (McMillen 1992; Beron and Vijverberg 2004; Holloway et al. 2002; Smith and LeSage 2004). Moreover, the probit model is easily extended to the multinomial probit (Daganzo 1979) which allows for interdependencies between the different categories through correlations of an error term specified within the model (described below). The multinomial probit model is appealing for modeling spatially referenced multinomial data since continuous latent variables are introduced allowing common spatial covariance structures for continuous outcomes to be included (Bolduc 1992,1999, Bolduc et al. 1997; Schmidheiny 2003, Beron and Vijverberg 2004, Mohammadian et al. 2005). In this paper we use a multinomial probit model for modeling the spatial latent classes and incorporate spatial structure through a multivariate spatial process.

Now we define the multinomial probit model. Let *C _{i}* be the latent class variable with

$${C}_{i}=k\phantom{\rule{1em}{0ex}}\mathrm{if}\phantom{\rule{0.2em}{0ex}}{z}_{\mathit{ik}}=\mathit{max}({\mathbf{z}}_{i})$$

(4)

and

$${\mathbf{z}}_{i}={\mathit{\mu}}_{z}+{\mathbf{u}}_{i},$$

where * μ_{z}* is a

$${\mathbf{w}}_{i}={\mathit{\mu}}_{w}+{\mathit{\delta}}_{i},$$

(5)

where * δ_{i}* is a (

$${d}_{i}=\{\begin{array}{ll}0& \mathrm{if}\phantom{\rule{0.2em}{0ex}}\mathrm{max}({\mathbf{w}}_{i})<0\\ k& \mathrm{if}\phantom{\rule{0.2em}{0ex}}\mathrm{max}({\mathbf{w}}_{i})={w}_{\mathit{ik}}.\end{array}$$

(6)

Note *d _{i}* is a simple reparametrization of

A particularly flexible class of models for multivariate spatial data is developed by considering the elements of the multivariate spatial process as linear combinations of a set of independent univariate spatial processes. This type of model is often referred to as the linear model of coregionalization (LMC) in geostatistics (Grzebyk and Wackernagel 1994; Wackernagel 2003 Chapter 26). For lattice or areal data, the multivariate conditional autoregressive (MCAR) model (Gelfand and Vounatsou 2003) can similarly be described as a linear combination of independent univariate CAR models.

We consider a relatively simple example to illustrate this constructive modelling strategy. Suppose **Y**(**s*** _{i}*) is a

$$\mathbf{Y}({\mathbf{s}}_{i})=\mathbf{Av}({\mathbf{s}}_{i}),$$

where **A** is a *p* × *p* full rank matrix and the components of **v**(**s*** _{i}*),

$$\sum =\mathbf{C}\otimes \mathbf{T},$$

where **T** = **AA**^{T}. Note that this specification is equivalent to a separable covariance specification as in Mardia and Goodall (1993), where **T** is interpreted as the within-site covariance matrix between variables and **C** is interpreted as the across-site covariance matrix. In practice, **A** is commonly assumed to be lower or upper triangular without loss of generality, since for any positive definite matrix **T**, there is a unique lower or upper triangular matrix **A** such that **T** = **AA**^{T}. The model just introduced can be extended more generally if we allow the *p* independent processes *υ*_{1},…,*υ _{p}* to have different distributions with covariance matrices

$$\sum =\sum _{j=1}^{p}{\mathbf{C}}_{j}\otimes {\mathbf{T}}_{j},$$

where
${\mathbf{T}}_{j}={\mathbf{A}}_{j}\phantom{\rule{0.1em}{0ex}}{\mathbf{A}}_{j}^{T}$ and **A**_{j} is the *j ^{th}* column of

Returning to the development of the SLCA model, we specify the * δ_{i}* in (5) as a multivariate spatial process using the method of linear combinations of independent univariate spatial processes, i.e.,

$${\mathit{\delta}}_{i}\equiv \mathit{\delta}({\mathbf{s}}_{i})=\mathbf{Av}({\mathbf{s}}_{i}),\phantom{\rule{1em}{0ex}}(i=1,\dots ,n),$$

(7)

where **A** is a (*K* − 1)×(*K* − 1) lower triangular matrix and the *K* − 1 elements of **v**(**s*** _{i}*) are independent zero-mean Gaussian spatial processes. Let

In the soil heavy metal pollution example, the data were collected on a regular grid, so the conditional autoregressive (CAR) model is reasonable to consider. Thus for each spatial component, *υ _{k}*(

$$\mathbf{C}({\mathit{\alpha}}_{k})=\frac{1}{{\tau}_{k}}\mathbf{M}{({I}_{n}-{\rho}_{k}\phantom{\rule{0.1em}{0ex}}\mathbf{H})}^{-1}$$

(8)

where *ρ _{k}* is the “spatial association” parameter,

The SLCA model is then made up of (1) and (2) relating the observed variables to the latent classes, and a multinomial probit model with spatial correlation for *P*(*C*_{1}, *C*_{2}, … *C _{n}*) with (6) defining the class membership where the

In the relationship between the latent class *C _{i}* and latent vector

$$P({C}_{i}|{\mu}_{w},{\sum}_{w})=P({C}_{i}|s{\mu}_{w},{s}^{2}{\sum}_{w}).$$

Because of this identifiability problem of scale, it is common to fix the the (1, 1) element of **Σ**_{w} to 1, to achieve identification of the model. This constraint is accomplished in the spatial version of the model by appropriately constraining elements of **A** in (7) and constraining variance parameters in **v**(**s*** _{i}*). Specifically given a lower triangular matrix

Let **Φ** be the vector that contains all the unknown parameters in the model,

$$\mathbf{\Phi}=\left\{\left\{{\pi}_{\mathit{jk}},\phantom{\rule{0.1em}{0ex}}j=1,\dots ,P,k=1,\dots ,K\right\},{\mu}_{w},\mathbf{A},{\alpha}_{1},\dots ,{\alpha}_{K-1}\right\}.$$

Let *P*(**Φ**) be the joint prior for these parameters. We define *P*(**Φ**) as the product of the prior distributions of each parameter in **Φ**, i.e., we use independent priors for all parameters. For the conditional response probabilities *π _{jk}*,

Let **Y**_{i} be a vector of *P* binary indicators at location *i*, and
${\mathbf{Y}}^{T}=({\mathbf{Y}}_{1}^{T},\dots ,{\mathbf{Y}}_{n}^{T})$ is the *PN* × 1 vector containing all the data. Then based on the SLCA model, the joint posterior for all the unknown parameters **Φ** and all the latent class variables for each individual, i.e. **C** = (*C*_{1},…*C _{N}*) is

$$\begin{array}{lll}P(\mathbf{\Phi},\mathbf{C}|\mathbf{Y})& \propto & P({C}_{1},\dots ,{C}_{N})\prod _{i=1}^{N}\prod _{j=1}^{P}P({Y}_{\mathit{ij}}|{C}_{i})P(\mathbf{\Phi})\\ & \propto & \prod _{k=1}^{K-1}P({\nu}_{k}({\mathbf{s}}_{1})\dots {\nu}_{k}({\mathbf{s}}_{n})|{\alpha}_{k})\prod _{i=1}^{N}\prod _{j=1}^{P}P({Y}_{\mathit{ij}}|\left\{{\pi}_{\mathit{jk}}\right\},{\mathit{\mu}}_{w},\mathbf{A},\mathbf{v}({\mathbf{s}}_{i}))P(\mathbf{\Phi})\end{array}$$

(9)

Note *C _{i}* is a deterministic function of

A commonly used criteria for model comparison in hierarchical Bayesian models is the Deviance Information Criteria (DIC, Spiegelhalter, et al., 2002). But in the case of latent variable models, in particular mixture models such as the SLCA, the DIC is open to different variations depending on how the latent variables are considered. Celeux et al. (2006) present a careful examination of eight different variants of the DIC for mixture models, some which readily produce negative effective dimensions for no apparent reason. Because of uncertainty about the reliability of DIC as a model comparison method for mixture models, we considered cross-validation predictive distributions which have been advocated for mixture models (Dey et al. 1995)

In general, in cross-validation a validation set of observations **Y _{v}** is held to the side and the remaining data denoted

With multivariate outcome data as we are modeling with the SLCA model, where the P-vector **Y**_{i} is modeled across locations *i* = 1 … *N*, there are two ways to interpret what is meant by leave-one-out cross validation. We can leave-out the entire vector of p-observations at location *i*, i.e. **Y _{v}** =

Thirty-six heavy metals were measured in 97 soil samples collected on a 500 × 500 meter grid in the urban area of Slovenia’s Capital, Ljubljana (Komac and Sajn, 2001). Among these thirty-six heavy metals, eight are considered most toxic, i.e., Cadmium-Cd, Cobalt-Co, Cromium-Cr, Copper-Cu, Nickel-Ni, Lead-Pb, Zinc-Zn, and Mercury-Hg, and are of interest in this example. The eight variables have been dichotomized based on whether the measurement of the particular heavy metal is over the limit defined by the Slovenian legislation. Let *X _{ij}* be the measurement of heavy metal

Figure 1 presents the observed data (1’s and 0’s) for the 8 different heavy metals at the 97 spatial locations. The color shading in the images was drawn using the **interp** function in R and simply provides a smooth linear interpolation of the observed data. Cobalt (Co) and Lead (Pb) are seen to be above the limit the most commonly while Nickel (Ni) is the least common to be observed above the limit. There appears to be some spatial clustering of high values within the different maps and similarities across the different heavy metals in terms of where the high values are located. The SLCA model will be used to model both the relationship among the different heavy metals as well as their spatial similarity.

The plot on the lower right of Figure 1 shows the locations where the heavy metal samples were collected and provides numerical labels for those locations used in later Figures. The units of distance on the x and y axes are in kilometers so each square within the grid is 500 × 500 meters. As the samples were collected on a regular grid, we define two locations to be neighbors if they are adjacent by one step along the grid horizontally, vertically or diagonally. In addition, there were four sites (location 4, 22, 49, and 64) where a second observation was taken very nearby the regular grid locations and these sites are considered neighbors with the respective nearby grid location as well as with the adjacent grid sites.

The goal of this study is to examine whether the spatial similarity of the 8 different heavy metals can be explained by some shared underlying latent classes that are themselves spatially varying. The SLCA model can: identify the number of potential underlying classes, describe the relationship between the heavy metal indicators and the underlying classes (i.e. *π _{jk}*), and determine the probability that certain locations are of a particular latent class type

To determine the number of underlying classes and compare the performance of the classic LCA model and SLCA model, we consider four models. Model 1 and Model 2 have two latent classes, and Model 3 and Model 4 have three latent classes. Model 1 and Model 3 are classic LCA model, assuming the latent classes are independent across locations, and Model 2 and Model 4 are the SLCA model, assuming the latent classes at different locations are spatially correlated with a CAR covariance structure. More specifically:

**Model 1**: In Model 1, (1), (2), and (3) are assumed for*K*= 2 to model the relationship between the binary indicator**Y**_{i}with the underlying classes. We also assume the multinomial probit model (4) and (6) for*K*= 2 to model the underlying classes. Note that when*K*= 2,**w**_{i}is univariate, i.e.,*w*=_{i}*μ*+_{w}*δ*where_{i}*μ*is a scalar parameter and_{w}*δ*is a normal random variable, and (6) becomes_{i}$${d}_{i}=\{\begin{array}{cc}0& \mathrm{if}\phantom{\rule{0.2em}{0ex}}{w}_{i}<0\\ 1& \mathrm{if}\phantom{\rule{0.2em}{0ex}}{w}_{i}>0\end{array}.$$(10)We further assume*δ*,_{i}*i*= 1,…,*N*are i.i.d and normally distributed with mean 0 and precision*τ*, and_{δ}*τ*is fixed to be 1 to preserve the identifiability of the model as discussed in the Section 3.1._{δ}**Model 2**: Model 2 has all the same assumptions as Model 1, except that we assume*δ*,_{i}*i*= 1,…,*N*are spatially correlated over different locations and have a CAR distribution, i.e.,*δ*~*CAR*(*ρ*,1).**Model 3**: In Model 3, latent class model (1), (2) and (3), and multinomial logit model (4),and (6) are assumed for*K*= 3. Here**w**_{i}=+**μ**_{w}in (6) is a vector of length 2, and we assume**δ**_{i}is a bivariate normal random vector independent over locations**δ**_{i}*i*= 1,…,*N*, i.e., ${\mathit{\delta}}_{i}\stackrel{i.i.d.}{~}\phantom{\rule{0.2em}{0ex}}N\left(\mathbf{0},\mathbf{\Omega}\right),$ where**Ω**is a positive definite matrix of size 2, with the (1,1) element fixed to 1.**Model 4**: In Model 4, we assume (1), (2), (3), (4),and (6) for*K*= 3 as in Model 3. Also**w**_{i}=+**μ**_{w}in (6) is a vector of length 2, different from Model 3, we assume**δ**_{i}is a bivariate normal random vector that is spatially correlated over locations**δ**_{i}*i*= 1,…,*N*according to (7), i.e.,=**δ**_{i}**Av**(**s**), where_{i}**A**is a lower triangular matrix of size 2 with (1,1) element fixed to 1,**v**(**s**)_{i}^{T}= (*υ*_{1}(**s**),_{i}*υ*_{2}(**s**)) is a vector of two uncorrelated spatial process with_{i}*υ*_{1}~*CAR*(*ρ*_{1}, 1) and*υ*_{2}~*CAR*(*ρ*_{2}, 1).

Tables 1 and and22 show the posterior means and 95% credible intervals of the the probabilities of heavy meatal *j* being above the limit given the different class memberships *k*, i.e. the *π _{jk}* as defined in (1). In the last column for reference are the simple observed proportions of locations where the respective heavy metal is over the legal limit. Note from Table 1 that the two two-class models Model 1 and Model 2 have very similar posterior means and 95% credible intervals for

Posterior means and 95% credible intervals for *π*_{jk} from the heavy metal soil data - Model 1 and 2

Posterior means and 95% credible intervals for *π*_{jk} from the heavy metal soil data - Model 3 and 4

Examining the 3-class models, we see that Classes 0 and 1 are very similar to the those in the 2-class model and in Class 2 most of the posteriors for *π _{jk}* have hardly moved from the priors which were uniform (0,1). In all cases the Class 2 95% credible intervals overlap the other two intervals for the other latent classes. Hence it is not advised to interpret the values of the posterior means

Focusing on the 2-class models, it is of interest to examine the differences between Class 0 and Class 1 in terms of which heavy metals are more or less likely to be over the limit. Within Class 0, Co has the highest probability of being over the limit (0.84) and this is higher than when the location is of Class type 1 (0.64). On the other hand, all of the rest of the heavy metals have higher probability of being over the limit in Class type 1 than when the location is of Class type 0. Further we note that Zn is the element with the most distinct difference in the two classes, with it only being over the limit 3% of the time in Class 0 and 73% of the time in Class 1. Notice that for all three of Cu, Pb, and Zn that the 95% credible intervals do not overlap between the two classes suggesting that these are distinguished best by the two classes. On the other hand, Cr is the least distinguished between the classes with it having probability 0.19 in class 0 and only slightly higher at 0.25 in class 1 and its credible intervals greatly overlap.

In Figure 2, the upper left hand plot shows the CPO values for the spatial verses iid 2-class models. In sum, the CPOs are larger for the spatial model compared to the iid model (also see Table 1). Similary the 8-fold cross-validation summary (Table 1) supports the 2-class spatial model (SLCA) as having better cross-validation prediction than the 2-class iid model (LCA) for this data. In the 2-class SLCA model, *ρ* has a posterior mean of 0.72 with the 2.5% and 97.5% percentile 0.17 and 0.97 respectively. Though this posterior is quite variable, the addition of this vague spatial correlation has improved the predictability of the model which implies there is some information gained by incorporating neighboring points. We see in the upper right-hand plot of Figure 2 the trend that the CPO values get worse as the number of heavy metals over the limit at a particular site increases. This means that the SLCA model is not predicting as many heavy metals to be over the limit as there actually are. Recall though that it not the goal of the model to predict any particular heavy metal or the total number of heavy metals but to predict latent classes that result in regions with different distributions of heavy metals.

An image plot of the posterior probabilities that each location is in latent Class 1 is shown in the bottom left of Figure 2. We see a distinctive contiguous line of locations with high probability of being in this class running from the bottom of the region up and to the right. There is also a region in the upper center with higher probability of being from latent Class 1. The parameter *μ _{w}* in model 2 has posterior mean of 0.12 which translates on the probit scale to a probability of 0.55, which represents the overall marginal probability of a location being in class 0. This matches what is seen in the image plot where close to half of the region has a high probability of being in class 1. The bottom right plot in Figure 2 figure relates these latent class probabilities to a simple count of the number of heavy metals over the limit at each point. We see that as the total sum increases, the probability of that site being from Class 1 increases, specifically if the count is 5 or more out of 8 then the probability is effectively 1. On the other hand, if there are none or only 1 metal over the limit, the probability is that the site is from Class 0. The values of particular interest are those in the 2-4 range where depending on which heavy metals are present, the probability of class 1 is either high or low. We see for example that site 11 has 3 heavy metals over the limit (Ni, Cr, and Co, from Figure 1) but is predicted to be in the latent class 0. As was seen in Table 1, Ni and Cr provide little distinction in the two classes and Cr is found to be higher in Class 0 so it follows that this site would be from Class 0. If we examine site 89, it also has 3 heavy metals over the limit but they are Cu, Pb, and Zn the metals found to have much higher probability of being over the limit in Class 1 as compared to Class 0, and so this site has high probability of being from Class 1. The SLCA model does not just quantitatively describe the data in terms of the number of heavy metals over the limit but it provides some qualitative distinctions of the region.

To demonstrate the SLCA model can be used to identify more than 2 underlying latent classes, we provide a generated data example where the truth is 3 underlying spatially varying latent classes. The set up of the simulated example is the same as the heavy metals example with N=97 observation on the same grid as in the example and 8 different binary measured variables at each location. The true values for the *π _{jk}* are either 0.1 or 0.8 and are shown as x’s in the top 3 plots in Figure 3. The true values of the latent classes at each location are shown as numbers in the bottom 3 plots in Figure 3, and the other parameters governing the spatial process are shown in Table 3.

Results for parameters of distribution of underlying latent classes for simulated 3-class SLCA model

The four models considered in the previous subsection were fit to the simulated data and as expected the SLCA models had better cross-validation prediction than the non-spatial models and the 3-class SLCA model performed best in terms of the 8-fold CV summary. The resulting posterior estimates from the 3-class SLCA model are presented. In the top of Figure 3, the 95% credible intervals of the *π _{jk}* parameters are shown and in all but one case, (

Image plots of the posterior fitted probability that each region is from each of the 3 latent classes is shown in the bottom of Figure 3. There is very good agreement between the true value of the latent class (shown as numbers) and the posterior probability that the location is in the particular class (shown via the coloring with darker color indicating higher probability). In other words, with high probability, the model fits the the correct latent class at each location.

Table 3 shows the posterior estimates for the parameters governing the underlying spatial structure of the latent classes. Using vague priors as described in Section 3.2 for all of the parameters we find that the posterior for the parameters in **A** governing the covariance between the latent classes are quite wide. We replace *a*_{3} ~ *Uniform*(0, 1000) and *a*_{2} ~ *N*(0, 0.001) with *a*_{3} ~ *Uniform*(0, 10) and *a*_{2} ~ *N*(0, 0.1) and present those results also in Table 3 labeled as “informative priors for *a*_{2} and *a*_{3}”. Note, the well-behaved posterior intervals for the *π _{jk}* and the

In this paper, we developed a spatial latent class analysis model that extends the classic LCA model by adding spatial structure to the latent class distribution through the use of the multinomial probit model. The multinomial probit model transfers a difficult problem of modeling correlated multinomial data to a relative straightforward multivariate Gaussian process modeling problem.

A simple extension to the proposed SLCA model taking advantage of the multinomial probit is to replace *μ _{w}* by a regression component including observed covariates. Although the data was not available to us, it would be a great next step in the heavy metal soil pollution example to include as covariates, for example, observed distances to known pollution sources.

In this paper we presented 2 cross-validation methods useful for choosing a best model and hence choosing the value for *K*. While this method of model choice appeared to behave reasonably, it is computationally intensive and it is of interest to further develop useful fit criterion for mixture models like the SLCA. Moreover, an interesting extension would be to treat *K* as an unknown parameter directly in the model. While intuitively nice, this extension is expected to be computationally quite difficult as it requires sampling from multiple parameter spaces as *k* changes using a technique such as reversible jump (Green 1995).

Finally, in our data example, we chose to use a CAR model to account for the spatial similarity but it may have been more natural given that the spatial support was a continuous region to use a geostatistical model. For example, a covariance structure based on an exponential variogram may have worked well. While this may be desirable, the practical matter is that computationally the geostatistical models are computationally very slow to run in Winbugs. The algorithm is of order *N*^{3}. Hence for practical reasons we used the CAR model which has better computational stability. Furthermore, while we used linear combinations of univariate CAR models to build a multivariate spatial process, there are more general formulations of the multivariate conditional autoregressive (MCAR) model (Jin et al. 2005) which provide even more flexible modeling in cases where there is enough data to inform the parameter estimation.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

- Alvarez RM, Nagler J. Correlated disturbances in discrete choice models: a comparison of multinomial probit and logit models. Working Paper at California Institute of Technology, Division of the Humanities and Social Sciences 2001
- Bandeen-Roche K, Miglioretti DL, Zeger SL, Rathouz PJ. Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association. 1997;92:1375–1386.
- Beron K, Vijverberg WPM. Probit in a spatial context: A Monte Carlo analysis. In: Anselin L, Florax RJGM, Rey SJ, editors. Advances in Spatial Economectrics: Methodology Tools and Applications. Chapter 8. Springer; 2004. pp. 169–192.
- Besag J. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B. 1974;23:192–236.
- Bolduc D. Generalized autoregressive errors in the multinomial probit model. Transportation Research B. 1992;26(2):155–170.
- Bolduc D. A practical technique to estimate multinomial probit models in transportation. Transportation Research B. 1999;33:63–79.
- Bolduc D, Fortin B, Gordon S. Multinomial probit estimation of spatially interdependent choices: An empirical comparison of two new techniques. International Regional Science Review. 1997;20:77–101.
- Bunch D. Estimability in the multinomial probit model. Transportation Research B. 1991;25(1):1–12.
- Carlin BP, Louis TA. Bayesian Methods for Data Analysis. 3. Chapman and Hall/CRC Press; Boca Raton, Fl: 2008.
- Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models. Baysian Analysis. 2006;4(1):651–674.
- Cheng S, Long JS. Testing for IIA in the multinomial logit model. Sociological Methods and Research. 2007;35:583–600.
- Clogg CC, Goodman LA. Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association. 1984;79:762–771.
- Daganzo C. Multinomial Probit: The theory and its application to demand forecasting. Academic Press; New York: 1979.
- Dey DK, Kuo L, Sahu SK. A Bayesian predictive approach to determining the number of components in a mixture distribution. Statistics and Computing. 1995;5:297–305.
- Diggle PJ, Tawn JA, Moyeed RA. Model-based geostatistics (with discussion) Applied Statistics. 1998;47:299–350.
- Gelfand A, Vounatsou P. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics. 2003;4(1):11–25. [PubMed]
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711–732.
- Grzebyk M, Wackernagel H. Multivariate analysis and spatial/temporal scales: real and complex models. Proceedings of the XVIIth International Biometrics Conference; Hamilton, Ontario. 1994. pp. 19–33.
- Holloway G, Shankar B, Rahmanb S. Bayesian spatial probit estimation: A primer and an application to HYV rice adoption. Agricultural Economics. 2002;27(3):383–402.
- Jin X, Carlin BP, Banerjee S. Generalized hierarchical multivariate CAR models for areal data. Biometrics. 2005;61:950–961. [PubMed]
- Komac M, Sajn R. Polluted or nonpolluted-A fuzzy approach determining soil pollution. Proceeding of the Annual Conference of the International Association for Mathematical Geology 2001
- Lacy D, Burden BC. The vote-stealing and turnout effects of Ross Perot in the 1992 U.S. presidential election. American Journal of Political Science. 1999;43(1):233–55.
- Mardia KV, Goodall C. Spatial-temporal analysis of multivariate environmental monitoring data. In: Patil GP, Rao CR, editors. Multivariate Environmental Statistics. Elsevier; Amsterdam: 1993. pp. 347–386.
- McCutcheon AL. Latent Class Analysis. Sage Publications; Beverly Hills and London: 1987. Sage University Paper Series on Quantitative Applications in the Social Sciences.
- McFadden D, Train K, Tye W. An application of diagnostic tests for the independence from irrelevant alternatives property of the multinomial logit model. Transportation Research Board Record. 1981;637:39–46.
- McMillen DP. Probit with spatial autocorrelation. Journal of Regional Science. 1992;32:335–48.
- Mohammadian A, Haider M, Kanaroglou P. Incorporating observed spatial dependencies in random parameter discrete choice models. Proceedings of the 84th Annual Meeting of the Transportation Research Board; Washington D.C. 2005.
- Muthen LK, Muthen BO. Mplus User’s Guide. Muthen & Muthen; Los Angeles: 1998.
- Reboussin BA, Liang KY, Reboussin DM. Estimating equations for a latent transition model with multiple discrete indicators. Biometrics. 1999;55(3):839–845. [PubMed]
- Schmidheiny K. Income segregation and local progressive taxation: Empirical evidence from Switzerland. Working Paper, Diskussionsschriften dp0311, Universitaet Bern, Departement Volkswirtschaft 2003
- Smith TE, LeSage JP. A Bayesan probit model with spatial dependencies. In: LeSage JP, Pace RK, editors. Spatial and Spatio-Temporal Econometrics. Elsevier; Amsterdam: 2004.
- Spiegelhalter DJ, Best N, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion) J Roy Statist Soc, Ser B. 2002;64:583–639.
- Spiegelhalter DJ, Thomas A, Best N, Lunn D. Winbugs version 1.4.1 manual. 2004. http://www.mrc-bsu.cam.ac.uk/bugs.
- Wackernagel H. Multivariate Geostatistics - An introduction with Applications. 3. Springer-Verlag; New York: 2003.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |