Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2894541

Formats

Article sections

- Abstract
- 1. Introduction
- 2. The competitors
- 3. Simulation experiment
- 4. Example: Minnesota 3-cancer data
- 5. Discussion and future work
- Supplementary Material
- References

Authors

Related links

Ann Appl Stat. Author manuscript; available in PMC 2010 June 30.

Published in final edited form as:

Ann Appl Stat. 2009; 3(4): 1805–1830.

doi: 10.1214/09-AOAS267PMCID: PMC2894541

NIHMSID: NIHMS206720

Yufen Zhang, Novartis Pharmaceuticals, East Hanover, New Jersey 07936, USA;

Sudipto Banerjee: ude.nmu.tatsoib@potpidus

See other articles in PMC that cite the published article.

Rapid developments in geographical information systems (GIS) continue to generate interest in analyzing complex spatial datasets. One area of activity is in creating smoothed disease maps to describe the geographic variation of disease and generate hypotheses for apparent differences in risk. With multiple diseases, a multivariate conditionally autoregressive (MCAR) model is often used to smooth across space while accounting for associations between the diseases. The MCAR, however, imposes complex covariance structures that are difficult to interpret and estimate. This article develops a much simpler alternative approach building upon the techniques of smoothed ANOVA (SANOVA). Instead of simply shrinking effects without any structure, here we use SANOVA to smooth spatial random effects by taking advantage of the spatial structure. We extend SANOVA to cases in which one factor is a spatial lattice, which is smoothed using a CAR model, and a second factor is, for example, type of cancer. Datasets routinely lack enough information to identify the additional structure of MCAR. SANOVA offers a simpler and more intelligible structure than the MCAR while performing as well. We demonstrate our approach with simulation studies designed to compare SANOVA with different design matrices versus MCAR with different priors. Subsequently a cancer-surveillance dataset, describing incidence of 3-cancers in Minnesota’s 87 counties, is analyzed using both approaches, showing the competitiveness of the SANOVA approach.

Statistical modeling and analysis of spatially referenced data receive considerable interest due to the increasing availability of geographical information systems (GIS) and spatial databases. For data aggregated over geographic regions such as counties, census tracts or ZIP codes (often called *areal* data), with individual identifiers and precise locations removed, inferential objectives focus on models for spatial clustering and variation. Such models are often used in epidemiology and public health to understand geographical patterns in disease incidence and morbidity. Recent reviews of methods for such data include Lawson et al. (1999), Elliott et al. (2000), Waller and Gotway (2004) and Rue and Held (2005). Traditionally such data have been modeled using conditionally specified probability models that shrink or smooth spatial effects by borrowing strength from neighboring regions. Perhaps the most pervasive model is the conditionally autoregressive (CAR) family pioneered by Besag (1974), which has been widely investigated and applied to spatial epidemiological data [Wakefield (2007) gives an excellent review]. Recently the CAR has been extended to multivariate responses, building on multivariate conditional autoregressive (MCAR) models described by Mardia (1988). Gelfand and Vonatsou (2003) and Carlin and Banerjee (2003) discussed their use in hierarchical models, while Kim, Sun and Tsutakawa (2001) presented a different “twofold CAR” model for counts of two diseases in each areal unit. Other extensions allowing flexible modeling of cross-correlations include Sain and Cressie (2002), Jin, Carlin and Banerjee (2005) and Jin, Banerjee and Carlin (2007). The MCAR can be viewed as a conditionally specified probability model for interactions between space and an attribute of interest. For instance, in disease mapping interest often lies in modeling geographical patterns in disease rates or counts of several diseases. The MCAR acknowledges dependence between the diseases as well as dependence across space. However, practical difficulties arise from MCAR’s elaborate dependence structure: most interaction effects will be weakly identified by the data, so the dependence structure is poorly identified. In hierarchical models [e.g., Gelfand and Vonatsou (2003), Jin, Carlin and Banerjee (2005 Jin, Carlin and Banerjee (2007)], strong prior distributions may improve identifiability, but this is not uncontroversial, as inferences are sensitive to the prior and perhaps unreliable without genuine prior information. This article proposes a much simpler and more interpretable alternative to the MCAR, modeling multivariate spatial effects using smoothed analysis of variance (SANOVA) as developed by Hodges, Carlin and Fan (2007), henceforth HCSC. Unlike an ANOVA that is used to identify some interaction effects to retain and others to remove, SANOVA mostly retains effects that are large, mostly removes those that are small, and partially retains middling effects. (Loosely speaking, “large,” “middling” and “small” describe the size of the unsmoothed effects compared to their standard errors.) To accommodate rich dependence structures, MCAR introduces weakly identifiable parameters that complicate estimation. SANOVA, on the other hand, focuses instead on smoothing interactions to yield more stable and reliable results. Our intended contribution is to show how SANOVA can solve the multiple disease mapping problem while avoiding the dauntingly complex covariance structures imposed by MCAR and its generalizations. We demonstrate that SANOVA produces inference that is largely indistinguishable from MCAR, yet SANOVA is simpler, more explicit, easier to put priors on and easier to estimate. The rest of the article is as follows. Section 2 reviews SANOVA and MCAR, identifying SANOVA as a special case of MCAR. Section 3 is a “tournament” of simulation experiments comparing SANOVA with MCAR for normal and Poisson data, while Section 4 analyzes data describing the number of deaths from lung, larynx and esophagus cancer in Minnesota between 1990 and 2000. A summary and discussion of future research in Section 5 concludes the paper. Zhang, Hodges and Banerjee (2009) (Appendices) gives computational and technical details.

Consider a balanced, single-error-term analysis of variance, with *M*_{1} degrees of freedom for main effects and *M*_{2} degrees of freedom for interactions. Specify this ANOVA as a linear model: let *A*_{1} denote columns in the design matrix for main effects, and *A*_{2} denote columns in the design matrix for interactions. Assume the design has *c* cells and *n* observations per cell, giving *cn* observations in total. To simplify later calculations, normalize the columns of *A*_{1} and *A*_{2} so
${A}_{1}^{\prime}{A}_{1}={I}_{{M}_{1}}$ and
${A}_{2}^{\prime}{A}_{2}={I}_{{M}_{2}}$. (Note: HCSC normalized columns differently, fixing
${A}_{1}^{\prime}{A}_{1}=cn{I}_{{M}_{1}}$ and
${A}_{2}^{\prime}{A}_{2}=cn{I}_{{M}_{2}}$.) Then write the ANOVA as

$$\mathbf{y}=[{A}_{1}\mid {A}_{2}]\phantom{\rule{0.16667em}{0ex}}\left[\begin{array}{c}{\mathbf{\Theta}}_{1}\\ {\mathbf{\Theta}}_{2}\end{array}\right]+\mathit{\epsilon}={A}_{1}{\mathbf{\Theta}}_{1}+{A}_{2}{\mathbf{\Theta}}_{2}+\mathit{\epsilon},$$

(1)

where
$\mathit{\epsilon}\sim N(0,{\scriptstyle \frac{1}{{\eta}_{0}}}I)$ with *η*_{0} being a precision, **y** is *cn* × *A*_{1} is *cn* × *M*_{1}, *A*_{2} is *cn* × *M*_{2}, **Θ**_{1} is *M*_{1} × 1, **Θ**_{2} is *M*_{2} × 1, and ** ε** is

$${\mathbf{0}}_{{M}_{2}}=[{0}_{{M}_{2}\times {M}_{1}}\mid {I}_{{M}_{2}}]\phantom{\rule{0.16667em}{0ex}}\left[\begin{array}{c}{\mathbf{\Theta}}_{1}\\ {\mathbf{\Theta}}_{2}\end{array}\right]+\mathit{\delta},$$

(2)

where $\mathit{\delta}\sim N(0,\text{diag}({\scriptstyle \frac{1}{{\eta}_{j}}}))$, in the manner of Lee and Nelder (1996) and Hodges (1998). Combining (1) and (2), express this hierarchical model as a linear model:

$$\left[\begin{array}{c}\mathbf{y}\\ {\mathbf{0}}_{{M}_{2}}\end{array}\right]=\left[\begin{array}{cc}{A}_{1}& {A}_{2}\\ {0}_{{M}_{2}\times {M}_{1}}& {I}_{{M}_{2}}\end{array}\right]\phantom{\rule{0.16667em}{0ex}}\left[\begin{array}{c}{\mathbf{\Theta}}_{1}\\ {\mathbf{\Theta}}_{2}\end{array}\right]+\left[\begin{array}{c}\mathit{\epsilon}\\ \mathit{\delta}\end{array}\right].$$

(3)

More compactly, write

$$\mathbf{Y}=X\mathbf{\Theta}+\mathbf{e},$$

(4)

where **Y** has dimension (*cn* + *M*_{2}) × 1 and **e**’s covariance Γ is block diagonal with blocks
${\mathrm{\Gamma}}_{1}={\scriptstyle \frac{1}{{\eta}_{0}}}{I}_{cn}$ for the data cases (rows of *X* corresponding to the observation **y**) and Γ_{2} = diag(1/*η*_{1}, …, 1/*η*_{M2}) for the constraint cases (rows of *X* with error term ** δ**). For convenience, define the matrix

Suppose a map has *N* regions, each with an unknown quantity of interest * _{i}, i* = 1, …,

$${\phi}_{i}\mid {\phi}_{j},j\ne i,\sim N\left(\frac{\alpha}{{m}_{i}}\sum _{i\sim j}{\phi}_{j},\frac{1}{\tau {m}_{i}}\right),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i,j=1,\dots ,N,$$

(5)

where *i* ~ *j* denotes that region *j* is a *neighbor* of region *i* (typically defined as spatially adjacent), and *m _{i}* is the number of region

$$\begin{array}{l}p(\mathit{\phi}\mid \tau )\sim {\tau}^{{N}^{\ast}/2}exp\left(-\frac{\tau}{2}{\mathit{\phi}}^{\prime}Q\mathit{\phi}\right),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\text{with}\\ {N}^{\ast}=\{\begin{array}{ll}N,\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}\alpha \in (0,1),\hfill \\ N-G,\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}\alpha =1.\hfill \end{array}\end{array}$$

(6)

In (6) *τ* is the spatial precision parameter, *τQ* is the precision matrix in this multivariate normal distribution and *G* is the number of “islands” (disconnected parts) in the spatial map [Hodges, Carlin and Fan (2003)]. When *α* (0, 1), (6) is a proper multivariate normal distribution. When *α* = 1, *Q* is singular with *Q***1** = **0**; *Q* has rank *N* − *G* in a map with *G* islands, therefore, the exponent on *τ* becomes (*N* − *G*)/2. In hierarchical models, the CAR model is usually used as a prior on spatial random effects. For instance, let *Y _{i}* be the observed number of cases of a disease in region

$${Y}_{i}\stackrel{\text{ind}}{\sim}\text{Poisson}({E}_{i}{e}^{{\mu}_{i}}),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,\dots ,N,$$

(7)

where
${\mu}_{i}={\mathbf{x}}_{i}^{\prime}\mathit{\beta}+{\phi}_{i}$. The **x*** _{i}* are explanatory, region-specific regressors with coefficients

To use CAR in SANOVA, the key is re-expressing the improper CAR, that is, (6) with *α* = 1. Let *Q* have spectral decomposition *Q* = *VDV*′, where *V* is an orthogonal matrix with columns containing *Q*’s eigenvectors and *D* is diagonal with nonnegative diagonal entries. *D* has *G* zero diagonal entries, one of which corresponds to the eigenvector
${\scriptstyle \frac{1}{\sqrt{N}}}{\mathbf{1}}_{N}$, by convention the *N* th (right-most) column in *V*. Define a new parameter **Θ** = *V*′**, so ****Θ** has dimension *N* and precision matrix *τD*. Giving an *N* -vector **Θ** a normal prior with mean zero and precision *τD* is equivalent to giving ** = ***V***Θ** a CAR prior with precision *τQ*. **Θ** consists of
${\mathrm{\Theta}}_{N}={\scriptstyle \frac{1}{\sqrt{N}}}{\mathbf{1}}_{N}^{\prime}\mathit{\phi}=\sqrt{N}\overline{\mathit{\phi}}$, the scaled average of the * _{i}*, along with

$$\begin{array}{l}\mathit{\phi}={[{\phi}_{1},{\phi}_{2},\dots ,{\phi}_{N}]}^{\prime}\\ =V\mathbf{\Theta}\\ =[\begin{array}{cc}{V}^{(-)}& {\scriptstyle \frac{1}{\sqrt{N}}}{\mathbf{1}}_{N}\end{array}]\phantom{\rule{0.16667em}{0ex}}\left[\begin{array}{c}{\mathbf{\Theta}}_{\mathit{Reg}}\\ {\mathrm{\Theta}}_{GM}\end{array}\right].\end{array}$$

Giving ** a CAR prior is equivalent to giving ****Θ** a *N*(**0**, *τD*) prior; the latter are the “constraint cases” in HCSC’s SANOVA structure. The precision *D _{NN}* = 0 for the overall level is equivalent to a flat prior on Θ

With multiple diseases, we have unknown * _{ij}* corresponding to region

$${\mathit{\phi}}_{i}\mid {\{{\mathit{\phi}}_{{i}^{\prime}}\}}_{{i}^{\prime}\ne i}\sim \text{MVN}\left(\frac{\alpha}{{m}_{i}}\sum _{{i}^{\prime}\sim i}{\mathit{\phi}}_{{i}^{\prime}},\frac{1}{{m}_{i}}{\mathrm{\Omega}}^{-1}\right).$$

(8)

These conditional distributions yield a joint distribution for $\mathit{\phi}={({\mathit{\phi}}_{1}^{\prime},\dots ,{\mathit{\phi}}_{N}^{\prime})}^{\prime}$:

$$f(\mathit{\phi}\mid \mathrm{\Omega})\propto \phantom{\rule{0.16667em}{0ex}}{\left|\right|\mathrm{\Omega}\left|\right|}^{(N-G)/2}exp(-{\scriptstyle \frac{1}{2}}{\mathit{\phi}}^{\prime}(Q\otimes \mathrm{\Omega})\mathit{\phi}),$$

(9)

where *Q* is defined as in Section 2.1.2 and again (9) is an improper density when *α* = 1. However, as for the univariate CAR, this yields proper posteriors in conjunction with a proper likelihood. The specification above is a “separable” dispersion structure, that is, the covariances between the diseases are invariant across regions. This may seem restrictive, but relaxing this restriction gives even more complex dispersion structures [see Jin, Banerjee and Carlin (2007) and references therein]. As mentioned earlier, our focus is to retain the model’s simplicity without compromising the primary inferential goals. We propose to do this using SANOVA and will compare it with the separable MCAR only.

We now describe the SANOVA model using the Minnesota 3-cancer dataset. Consider the Minnesota map with *N* = 87 counties, and suppose each county has counts for *n* = 3 cancers. County *i* has an *n*-vector of parameters describing the *n* cancers, * _{i}* = (

$$\begin{array}{l}\mathit{\phi}={[{\mathit{\phi}}_{1}^{\prime},{\mathit{\phi}}_{2}^{\prime},\dots ,{\mathit{\phi}}_{N}^{\prime}]}^{\prime}=[{A}_{1}\mid {A}_{2}]\mathbf{\Theta}\\ =\left[\underset{\begin{array}{c}\text{Grand}\phantom{\rule{0.16667em}{0ex}}\text{mean}\\ Nn\times 1\end{array}}{\underbrace{{\scriptstyle \frac{1}{\sqrt{Nn}}}{\mathbf{1}}_{Nn}}}\underset{\begin{array}{c}\text{Cancer}\\ \text{main}\phantom{\rule{0.16667em}{0ex}}\text{effect}\\ Nn\times (n-1)\end{array}}{\underbrace{{\scriptstyle \frac{1}{\sqrt{N}}}{\mathbf{1}}_{N}\otimes {H}_{CA}}}|\underset{\begin{array}{c}\text{County}\\ \text{main}\phantom{\rule{0.16667em}{0ex}}\text{effect}\\ Nn\times (N-1)\end{array}}{\underbrace{{V}^{(-)}\otimes {\scriptstyle \frac{1}{\sqrt{n}}}{\mathbf{1}}_{n}}}\underset{\begin{array}{c}\text{Cancer}\times \text{County}\\ \text{interaction}\\ Nn\times (N-1)(n-1)\end{array}}{\underbrace{\begin{array}{ccc}{V}^{(-)}\otimes {H}_{CA}^{(1)}& \cdots & {V}^{(-)}\otimes {H}_{CA}^{(n-1)}\end{array}}}\right]\times \left[\begin{array}{c}{\mathrm{\Theta}}_{GM}\\ {\mathbf{\Theta}}_{CA}\\ {\mathbf{\Theta}}_{CO}\\ {\mathbf{\Theta}}_{CO\times CA}\end{array}\right],\end{array}$$

(10)

where *H _{CA}* is an

To compare this to the MCAR model, use SANOVA’s priors on **Θ** to produce a marginal prior for ** comparable to the MCAR’s prior on **** (Section 2.2.1); in other words, integrate ****Θ*** _{CO}* and

$$\left[K\phantom{\rule{0.38889em}{0ex}}\left(\begin{array}{c}{\mathbf{\Theta}}_{CO}\\ {\mathbf{\Theta}}_{CO\times CA}^{(1)}\\ \vdots \\ {\mathbf{\Theta}}_{CO\times CA}^{(n-1)}\end{array}\right)\right]$$

(11)

has precision
$Q\otimes ({H}_{A}^{+}\phantom{\rule{0.16667em}{0ex}}\text{diag}({\tau}_{j}){{H}_{A}^{(+)}}^{\prime})$, where *K* is the columns of the design matrix for the county main effects and cancer-by-county interactions—the right-most *n*(*N* − 1) columns in equation (10)’s design matrix—and
${H}_{A}^{(+)}=({\scriptstyle \frac{1}{\sqrt{n}}}{\mathbf{1}}_{n}\mid {H}_{CA})$ is an orthogonal matrix. Appendix A in Zhang, Hodges and Banerjee (2009) gives a proof.

Defining ** as in Sections 2.2.1 and 2.2.2, consider the MCAR prior for ****, with within-county precision matrix Ω. Let Ω have spectral decomposition
${V}_{\mathrm{\Omega}}{D}_{\mathrm{\Omega}}{V}_{\mathrm{\Omega}}^{\prime}$, where ***D*_{Ω} is *n* × *n* diagonal and *V*_{Ω} is *n* × *n* orthogonal. Then the prior precision of ** is
$Q\otimes ({V}_{\mathrm{\Omega}}{D}_{\mathrm{\Omega}}{V}_{\mathrm{\Omega}}^{\prime})$, where ***Q* is the known neighbor relations matrix and *V*_{Ω} and *D*_{Ω} are unknown. Comparing MCAR to SANOVA, the prior precision matrices for the vector ** are as in Figure 1. SANOVA is clearly a special case of MCAR in which
${H}_{A}^{(+)}$ is known. Also, as described so far,
${H}_{A}^{(+)}$ has one column proportional to ****1*** _{n}* with the other columns being contrasts, while MCAR avoids this restriction. MCAR is thus more flexible, while SANOVA is simpler, presumably making it better identified and easier to set priors for. MCAR should have its biggest advantage over SANOVA when the “true”

For the case of normal errors, based on equations (1) and (10), setting priors for **Θ**, *τ _{j}, η*

MCAR models were fitted in WinBUGS. For the normal-error case, we used this model and parameterization:

$$\begin{array}{l}{Y}_{ij}\sim N\left({\mu}_{ij},\frac{1}{{\eta}_{0}}\right),\\ {\mu}_{ij}={\beta}_{j}+{S}_{ij},\end{array}$$

(12)

*i* = 1, …, *N; j* = 1, …, *n*, where *η*_{0} has a gamma prior with mean 1 and variance 10 as for SANOVA. To satisfy WinBUGS’s constraint that Σ* _{i} S_{ij}* = 0, we add cancer-specific intercepts

$$\begin{array}{l}{Y}_{ij}\sim \text{Poisson}({\mu}_{ij}),\\ log({\mu}_{ij})=log({E}_{ij})+{\beta}_{j}+{S}_{ij},\end{array}$$

(13)

where *E _{ij}* is an offset. Prior settings for

For this simulation experiment, artificial data were simulated from the model used in SANOVA with a spatial factor, as described in Section 2.2.2. Three different types of Bayesian analysis were applied to the simulated data: SANOVA with the same ${H}_{A}^{(+)}$ used to generate the simulated data (called “SANOVA correct”); SANOVA with incorrect ${H}_{A}^{(+)}$; and MCAR. SANOVA correct is a theoretical best possible analysis in that it takes as known things that MCAR estimates, that is, it uses additional correct information. SANOVA correct cannot be used in practice, of course, because the true ${H}_{A}^{(+)}$ is not known. MCAR vs SANOVA with incorrect ${H}_{A}^{(+)}$ is the comparison relevant to practice, and comparing them to SANOVA correct shows how much each method pays for its “deficiency” relative to SANOVA correct. Obviously it is not enough to test the SANOVA model using only data generated from a similar SANOVA model. To avoid needless computing and facilitate comparisons, instead of generating data from an MCAR model and fitting a SANOVA model as specified above, we use a trick that is equivalent to this. Section 3.1.2 gives the details.

We simulated both normally-distributed and Poisson-distributed data. For both types of data, we considered two different true sets of smoothing parameters **r** = ** τ**/

To generate data from the SANOVA model, we need to define the true ${H}_{A}^{(+)}$. Let

$${HA}_{1}=\left(\begin{array}{ccc}1& -2& 0\\ 1& 1& -1\\ 1& 1& 1\end{array}\right)\phantom{\rule{0.16667em}{0ex}}\left(\begin{array}{ccc}{\scriptstyle \frac{1}{\sqrt{3}}}& 0& 0\\ 0& {\scriptstyle \frac{1}{\sqrt{6}}}& 0\\ 0& 0& {\scriptstyle \frac{1}{\sqrt{2}}}\end{array}\right).$$

We used *HA*_{1} as the correct
${H}_{A}^{(+)}$; its columns are scaled to have length 1. Given *V*^{(−)} and with
${H}_{A}^{(+)}$ known, one draw of **Θ** and *ε* produces a draw of *X _{D}*

For each simulated data set, we did a Bayesian analysis for each of six different models described in Table 2. The six models are: SANOVA with the correct
${H}_{A}^{(+)}$, *HA*_{1}; SANOVA with a somewhat incorrect
${H}_{A}^{(+)}$, *HA*_{2} given below; a variant SANOVA with a very incorrect
${H}_{A}^{(+)}$, *H _{AM}* given below; MCAR with

$$\begin{array}{l}{HA}_{2}=\left(\begin{array}{ccc}1& 1& 1\\ 1& -2& 0\\ 1& 1& -1\end{array}\right)\phantom{\rule{0.16667em}{0ex}}\left(\begin{array}{ccc}{\scriptstyle \frac{1}{\sqrt{3}}}& 0& 0\\ 0& {\scriptstyle \frac{1}{\sqrt{6}}}& 0\\ 0& 0& {\scriptstyle \frac{1}{\sqrt{2}}}\end{array}\right),\\ {H}_{AM}=\left(\begin{array}{ccc}0.56& -0.64& -0.52\\ -0.53& -0.77& 0.36\\ -0.63& 0.07& -0.77\end{array}\right).\end{array}$$

The incorrect *HA*_{2} has the same first column (grand mean) as the correct *HA*_{1}, so it differs from the correct *HA*_{1}, though less than it might. As noted above, we need to see how the SANOVA model performs for data generated from an MCAR model in which *V*_{Ω} from Figure 1 does not have a column proportional to **1*** _{n}*. To do this without needless computing, we used a trick: we used the data generated from a SANOVA model with

$${V}_{\mathrm{\Omega}}=\left[\begin{array}{ccc}0.43& -0.74& -0.52\\ -0.13& -0.63& 0.77\\ -0.89& -0.26& -0.37\end{array}\right]$$

(14)

(to 2 decimal places). For Poisson errors (Data5, Data6), the equivalence is no longer precise but the divergence of fitted SANOVA [using ${H}_{A}^{(+)}={H}_{AM}$] and

generated data [using
${H}_{A}^{(+)}={HA}_{1}$] is quite similar. Finally, we considered three priors for MCAR because little is known about how to set this prior and we did not want to hobble MCAR with an ill-chosen prior. For the SANOVA and variant SANOVA analyses, we gave *τ _{j}* a Γ (0.1, 0.1) prior with mean 1 and variance 10 for both the normal data and the Poisson data.

To compare the six different methods for normal and Poisson data, we consider three criteria. The first is average mean squared error (AMSE). For each of the 60 (*X _{D}*

$${\widehat{\mathit{AMSE}}}_{K}=\frac{1}{L}\sum _{d=1}^{L}\sum _{i=1}^{N}\sum _{j=1}^{n}{[{({X}_{D}\widehat{\mathbf{\Theta}})}_{ij}^{d}-({X}_{D}{\mathbf{\Theta}}_{ij}^{2})]}^{2}/Nn,$$

(15)

where *L* = 100, *N* = 20, *n* = 3, *K* = 1, …, 4 for Normal, *K* = 5, 6 for Poisson, **Θ** is the true value and is the posterior median of **Θ**. For each design cell (*K*), the Monte Carlo standard error for AMSE is (100)^{−0.5} times the standard deviation, across DataK’s 100 simulated data sets, of
${\sum}_{i=1}^{N}{\sum}_{j=1}^{n}{[{({X}_{D}\widehat{\mathbf{\Theta}})}_{ij}^{d}-{({X}_{D}\mathbf{\Theta})}_{ij}^{d}]}^{2}/Nn$. The second criterion is the bias of *X*_{D}**Θ**. For each of DataK’s 100 simulated data sets, first compute posterior medians of (*X _{D}*

$${\widehat{\mathit{MBIAS}}}_{K}=2.5\text{th},50\text{th},97.5\text{th}\phantom{\rule{0.16667em}{0ex}}\text{percentiles}\phantom{\rule{0.16667em}{0ex}}\text{of}\left(\frac{1}{L}\sum _{d=1}^{L}({X}_{D}{\widehat{\mathbf{\Theta}}}^{d}-{X}_{D}{\mathbf{\Theta}}^{d})\right).$$

(16)

Finally, the coverage rate of Bayesian 95% equal-tailed posterior intervals, “PI rate,” is the average coverage rate for the 60 individual (*X _{D}*

While the MCAR models were implemented in WinBUGS, our SANOVA implementations were coded in R and run on Unix. The different architectures do not permit a fair comparison between the run times of SANOVA and MCAR. However, the SANOVA models have lower computational complexity than the MCAR models: MCAR demands a spectral decomposition in every iteration, while SANOVA does not. For each of our models, we ran three parallel MCMC chains for 10,000 iterations. The CODA package in R (www.r-project.org) was used to diagnose convergence by monitoring mixing using Gelman–Rubin diagnostics and autocorrelations [e.g., Gelman et al. (2004), Section 11.6]. Sufficient mixing was seen within 500 iterations for the SANOVA models, while 200 iterations typically revealed the same for the MCAR models; we retained 8000 × 3 samples for the posterior analysis.

Table 3 and Figures 2 and and33 show the simulation experiment’s results. Table 3 shows AMSE; for all methods and design cells, the standard Monte Carlo errors of AMSE are small, less than 0.07, 0.005 and 0.025 for Data1/Data2, Data3/Data4 and Data5/Data6 respectively. Figure 2 shows MBIAS, where the middle symbols represent the median bias and the line segments represent the 2.5th and 97.5th percentiles. Figure 3 shows coverage of the 95% posterior intervals. Denote SANOVA with the correct
${H}_{A}^{(+)}$ (*HA*_{1}) as “SANOVA correct,” SANOVA with *HA*_{2} as “SANOVA incorrect,” the variant SANOVA with *H _{AM}* as “SANOVA variant,” MCAR with

For normal data, SANOVA correct has the smallest AMSE for all true *η*_{0} and ** τ** (Table 3). The advantage is larger in Data1 and Data2 where the error precision

Table 3 shows that, for normal data, both SANOVA incorrect and SANOVA variant have smaller AMSEs than MCAR200 and MCAR0.002, and AMSEs at worst close to MCAR1’s. For Poisson data, Table 3 shows that MCAR0.002 and MCAR1 do somewhat better than SANOVA incorrect and variant SANOVA. Considering MBIAS in normal data [Figure 2(a)], the width of the 95% MBIAS intervals for SANOVA incorrect are the same as or smaller than for all three MCAR procedures. Similarly, SANOVA variant has MBIAS intervals better than MCAR0.002 and MCAR200 and almost as good as MCAR1. Figure 2(b) for Poisson data shows SANOVA correct, MCAR0.002 and MCAR1 have similar MBIAS intervals. SANOVA variant in Data5 and SANOVA incorrect in Data6 show the worst performance for MBIAS apart from MCAR200, whose MBIAS interval is much the widest. Figure 3(a) shows that, for normal data, interval coverage for SANOVA incorrect and SANOVA variant is very close to nominal. It appears that the specific value of ${H}_{A}^{(+)}$ has little effect on PI coverage rate for the cases considered here. Apart from MCAR200 for Data1/Data2 and MCAR0.002 for Data1 through Data4, which show low coverage, all the other methods have coverage rates greater than 90% for normal data, most close to 95%. For Data3 and Data4, PI rates for MCAR200 reach above 99%. For Poisson data, the PI rates for SANOVA incorrect and SANOVA variant are close to nominal and better than MCAR0.002 and MCAR200. In particular, all SANOVAs have the closest to nominal coverage rates for both normal and Poisson data, which again shows the stability of SANOVA under different ${H}_{A}^{(+)}$ settings.

To fairly compare SANOVA and MCAR, we considered MCAR under three different prior settings. For normal data, MCAR1 has the smallest AMSEs and narrowest MBIAS intervals among the MCARs considered, while MCAR0.002 has the largest and widest, respectively. For Poisson data, however, MCAR0.002 has the best AMSE and MBIAS among the MCARs. MCAR200 performs poorly for both Normal and Poisson. The coverage rates in Figure 3 show similar comparisons. These results imply that the prior matters for MCAR: no single prior was always best. By comparison, SANOVA seems more robust, at least for the cases considered.

As expected, SANOVA correct had the best performance because it uses more correct information. For normal data, SANOVA incorrect and SANOVA variant had similar AMSEs, better than two of the three MCARs for the data sets considered. For Poisson data, SANOVA incorrect and SANOVA variant had AMSEs as good as those of MCAR0.002 and MCAR1 for Data5 and somewhat worse for Data6, while showing nearly nominal coverage rates in all cases and less tendency to bias than MCAR in most cases. Replacing the Γ (0.1, 0.1) prior for ** τ** with Γ (0.001, 0.001) left AMSE and MBIAS almost unchanged and coverage rates a bit worse (data not shown). MCAR, on the other hand, seems more sensitive to the prior on Ω. MCAR0.002 tends to smooth more than MCAR1, more so in normal models where the prior is more influential than in Poisson models. (The latter is true because data give more information about means than variances, and the Poisson model’s error variance is the same as its mean, while the normal model’s is not.) For the normal data, MCAR0.002’s tendency to extra shrinkage appears to make it oversmooth and perform poorly for Data2 and Data4, where the truth is least smooth. For the Poisson data, MCAR0.002 and MCAR1 give results similar to each other and somewhat better than the SANOVAs except for interval coverage. Therefore, SANOVA, with stable results under different
${H}_{A}^{(+)}$ and with parameters that are easier to understand and interpret, may be a good competitor to MCAR in multivariate spatial smoothing.

Researchers in different fields have illustrated that accounting for spatial correlation could provide insights that would have been overlooked otherwise, while failure to account for spatial association could potentially lead to spurious and sometimes misleading results [see, e.g., Turechek and Madden (2002), Ramsay, Burnett and Krewski (2003), Lichstein et al. (2002)]. Among the widely investigated diseases are the different types of cancers. We applied SANOVA and MCAR to a cancer-surveillance data set describing total incidence counts of 3 cancers (lung, larynx, esophagus) in Minnesota’s 87 counties for the years 1990 to 2000 inclusive. Minnesota’s geography and history make it plausible that disease incidence would show spatial association. Three major North American land forms meet in Minnesota: the Canadian Shield to the north, the Great Plains to the west, and the eastern mixed forest to the southeast. Each of these regions is distinctive in both its terrain and its predominant economic activity: mining and outdoors tourism in the mountainous north, highly mechanized crop cultivation in the west, and dairy farming in the southeast. The different regions were also settled by somewhat different groups of in-migrants, for example, disproportionately many Scandinavians in the north. These factors imply spatial association in occupational hazards as well as culture, weather, and access to health care especially in the thinly-populated north, which might be expected to produce spatial association in diseases. With multiple cancers one obvious option is to fit a separate univariate model for each cancer. But diseases may share the same spatially distributed risk factors, or the presence of one disease might encourage or inhibit the presence of another in a region, for example, larynx and esophagus cancer have been shown to be closely related spatially [Baron et al. (1993)]. Thus, we may need to account for dependence among the different cancers while maintaining spatial dependence between sites. Although the data set has counts broken out by age groups, for the present purpose we ignore age standardization and just consider total counts for each cancer. Age standardization would affect only the expected cancer counts *E _{ij},* while other covariates could be added to either SANOVA or MCAR as unsmoothed fixed effects (i.e., in the

$$\begin{array}{l}{y}_{ij}\mid {\mu}_{ij}\sim \text{Poisson}({\mu}_{ij}),\\ log({\mu}_{ij})=log({E}_{ij})+{({X}_{D}\mathbf{\Theta})}_{ij},\end{array}$$

(17)

where *X _{D}*

Comparing data and fitted values for each cancer. The “Data” panel shows the density for *y*_{ij}/E_{ij}, while the other three panels show the posterior median of *μ*_{ij}/E_{ij} for univariate CAR, SANOVA and MCAR.

To further examine the smoothing under SANOVA, Figure 8 shows separate maps for the county main effect and interactions from the SANOVA fit with *HA*_{1}. The upper left plot is the cancer main effect, the mean of the three cancers; the lower left plot is the comparison of lung versus average of larynx and esophagus; the lower right plot is the comparison of larynx versus esophagus. All values are on the same scale as *y _{ij}*/

We used SANOVA to do spatial smoothing and compared it with the much more complex MCAR model. For the cases considered here, we found SANOVA with spatial smoothing to be an excellent competitor to MCAR. It yielded essentially indistinguishable inference, while being easier to fit and interpret. In the SANOVA model,
${H}_{A}^{(+)}$ is assumed known. For most of the SANOVA models considered,
${H}_{A}^{(+)}$‘s first column was fixed to represent the average over diseases, while other columns were orthogonal to the first column. Alternatively,
${H}_{A}^{(+)}$ could be treated as unknown and estimated as part of the analysis. With this extension, SANOVA with spatial effects is a reparameterization of the MCAR model and gains the MCAR model’s flexibility at the price of increased complexity. This extension would be nontrivial, involving sampling from the space of orthogonal matrices while avoiding identification problems arising from, for example, permuting columns of
${H}_{A}^{(+)}$. Other covariates can be added to a spatial SANOVA. Although (10) is a saturated model, spatial smoothing “leaves room” for other covariates. Such models would suffer from collinearity of the CAR random effects and the fixed effects, as discussed by Reich, Hodges and Zadnik (2006), who gave a variant analysis that avoids the collinearity. For data sets with spatial and temporal aspects, for example, the 11 years in the Minnesota 3-cancer data, interest may lie in the counts’ spatial pattern and in their changes over time. By adding a time effect, SANOVA can be extended to a spatiotemporal model. Besides spatial and temporal main effects, their interactions can also be included and smoothed. There are many modeling choices; the simplest model is an additive model without space-time interactions, where the spatial effect has a CAR model and the time effect a random walk, which is a simple CAR. But many other choices are possible. We have examined intrinsic CAR models, where *Q _{ij}* = −1 if region

$$h({t}_{ij},{X}_{ij})={h}_{0}({t}_{ij})exp({X}_{ij}\beta ),$$

(18)

where *X* is the design matrix, which may include a spatial effect, a temporal effect, their interactions and other covariates. Banerjee, Carlin and Gelfand (2004) noted that in the CAR model, considering both spatial and nonspatial frailties, the frailties are identified only because of the prior, so the choice of priors for precisions is very important. Besides the above extensions, HCSC introduced tools for normal SANOVA models that can be extended to nonnormal SANOVA models. For example, HCSC defined the degrees of freedom in a fitted model as a function of the smoothing precisions. This can be used as a measure of the fit’s complexity, or a prior can be placed on the degrees of freedom as a way of inducing a prior on the unknowns in the variance structure. The latter is under development and will be presented soon.

The authors thank the referees, the Associate Editor and the Editor for valuable comments and suggestions.

Yufen Zhang, Novartis Pharmaceuticals, East Hanover, New Jersey 07936, USA.

James S. Hodges, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455, USA.

Sudipto Banerjee, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455, USA.

- Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC Press; Boca Raton, FL: 2004.
- Banerjee S, Wall MM, Carlin BP. Frailty modeling for spatially correlated survival data, with application to infant mortality in Minnesota. Biostatistics. 2003;4:123–142. [PubMed]
- Baron AE, Franceschi S, Barra S, Talamini R, La Vecchia C. Comparison of the joint effect of alcohol and smoking on the risk of cancer across sites in the upper aerodigestive tract. Cancer Epidemiol Biomarkers Prev. 1993;2:519–523. [PubMed]
- Besag J. Spatial interaction and the statistical analysis of lattice systems (with discussion) J Roy Statist Soc Ser B. 1974;36:192–236. MR0373208.
- Besag J, Green P, Higdon D, Mengersen K. Bayesian computation and stochastic systems (with discussion) Statist Science. 1995;10:3–66. MR1349818.
- Besag J, York JC, Mollié A. Bayesian image restoration, with two applications in spatial statistics (with discussion) Ann Inst of Statist Math. 1991;43:1–59. MR1105822.
- Carlin BP, Banerjee S. Hierarchical multivariate CAR models for spatiotemporally correlated survival data. In: Bernardo JM, et al., editors. Bayesian Statistics. Vol. 7. Oxford Univ. Press; Oxford: 2003. pp. 45–64. MR2003166.
- Elliott P, Wakefield JC, Best NG, Briggs DJ. Spatial Epidemiology: Methods and Applications. Oxford Univ. Press; Oxford: 2000.
- Gelfand AE, Vounatsou P. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics. 2003;4:11–25. [PubMed]
- Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2. Chapman and Hall/CRC Press; Boca Raton, FL: 2004.
- Hodges JS. Some algebra and geometry for hierarchical models. J Roy Statist Soc Ser B. 1998;60:497–536. MR1625954.
- Hodges JS, Carlin BP, Fan Q. On the precision of the conditionally autoregressive prior in spatial models. Biometrics. 2003;59:317–322. MR1987398. [PubMed]
- Hodges JS, Cui Y, Sargent DJ, Carlin BP. Smoothing balanced single-error-term analysis of variance. Technometrics. 2007;49:12–25. MR2345448.
- Jin X, Banerjee S, Carlin BP. Order-free coregionalized lattice models with application to multiple disease mapping. J Roy Statist Soc Ser B. 2007;69:817–838. MR2368572. [PMC free article] [PubMed]
- Jin X, Carlin BP, Banerjee S. Generalized hierarchical multivariate CAR models for areal data. Biometrics. 2005;61:950–961. MR2216188. [PubMed]
- Kim H, Sun D, Tsutakawa RK. A bivariate Bayes method for improving the estimates of mortality rates with a twofold conditional autoregressive model. J Amer Statist Assoc. 2001;96:1506–1521. MR1946594.
- Lawson AB, Biggeri AB, Bohning D, Lesaffre E, Viel JF, Bertollini R. Disease Mapping and Risk Assessment for Public Health. Wiley; New York: 1999.
- Lee Y, Nelder JA. Hierarchical generalized linear models (with discussion) J Roy Statist Soc Ser B. 1996;58:619–673. MR1410182.
- Li Y, Ryan L. Modeling spatial survival data using semi-parametric frailty models. Biometrics. 2002;58:287–297. MR1908168. [PubMed]
- Lichstein JW, Simons TR, Shriner SA, Franzreb KE. Spatial autocorrelation and autoregressive models in ecology. Ecological Monographs. 2002;72:445–463.
- Mardia KV. Multi-dimensional multivariate Gaussian Markov random fields with application to image processing. J Multivar Anal. 1988;24:265–284. MR0926357.
- Ramsay T, Burnett R, Krewski D. Exploring bias in a generalized additive model for spatial air pollution data. Environ Health Perspect. 2003;111:1283–1288. [PMC free article] [PubMed]
- Reich BJ, Hodges JS, Zadnik V. Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics. 2006;62:1197–1206. MR2307445. [PubMed]
- Rue H, Held L. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC; Boca Raton: 2005. MR2130347.
- Sain SR, Cressie N. Proceedings of ASA Section on Statistics and the Environment. Amer. Statist. Assoc; Alexandria, VA: 2002. Multivariate lattice models for spatial environmental data; pp. 2820–2825.
- Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion) J Roy Statist Soc Ser B. 2002;64:583–639.
- Turechek WW, Madden LV. A generalized linear modeling approach for characterizing disease incidence in spatial hierarchy. Phytopathology. 2002;93:458–466. [PubMed]
- Wakefield J. Disease mapping and spatial regression with count data. Biostatistics. 2007;8:158–183. [PubMed]
- Wall MM. A close look at the spatial structure implied by the CAR and SAR models. J Statist Plann Inference. 2004;121:311–324. MR2038824.
- Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. Wiley; New York: 2004. MR2075123.
- Zhang Y, Hodges JS, Banerjee S. Supplement to “Smoothed ANOVA with spatial effects as a competitor to MCAR in multivariate spatial smoothing” 2009 doi: 10.1214/09-AOAS267SUPP. [PMC free article] [PubMed] [Cross Ref]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |