Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2889169

Formats

Article sections

- SUMMARY
- 1. Introduction
- 2. Notation and Methods for Covariate Adjustment and Ranking
- 3. Example of Mortality from Heart Disease
- 4. Simulations and Analytical Studies to Compare Procedures for Selecting High Risk Groups
- 5. Discussion
- REFERENCES

Authors

Related links

Biometrics. Author manuscript; available in PMC 2010 July 2.

Published in final edited form as:

Published online 2009 June 9. doi: 10.1111/j.1541-0420.2009.01284.x

PMCID: PMC2889169

NIHMSID: NIHMS117728

Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Executive Plaza South, Room 8032, Bethesda, MD 20892-7244

The publisher's final edited version of this article is available at Biometrics

Identifying regions with the highest and lowest mortality rates and producing the corresponding color-coded maps help epidemiologists identify promising areas for analytic etiologic studies. Based on a two-stage Poisson-Gamma model with covariates, we use information on known risk factors, such as smoking prevalence, to adjust mortality rates and reveal residual variation in relative risks that may reflect previously masked etiologic associations. In addition to covariate adjustment, we study rankings based on standardized mortality ratios, empirical Bayes estimates, and a posterior percentile ranking method and indicate circumstances that warrant the more complex procedures in order to obtain a high probability of correctly classifying the regions with the upper 100γ% and lower 100γ% of relative risks for γ= 0.05, 0.1 and 0.2. We also give analytic approximations to the probabilities of correctly classifying regions in the upper 100γ% of relative risks for these three ranking methods. Using data on mortality from heart disease, we found that adjustment for smoking prevalence has an important impact on which regions are classified as high and low risk. With such a common disease, all three ranking methods performed comparably. However, for diseases with smaller event counts, such as cancers, and wide variation in event counts among regions, empirical Bayes and posterior percentile ranking methods outperform ranking based on standardized mortality ratios.

Epidemiologists study maps of disease incidence rates and mortality rates to find clues to disease etiology. Regions with the highest and lowest rates are of particular interest, because analytic epidemiologic studies can be conducted in such regions to identify etiologic factors. Standardized incidence ratios (SIRs) or standardized mortality ratios (SMRs), namely the ratio of the number of deaths, *O _{i}*, in region

Two important issues arise in identifying regions with the highest or lowest mortality rates. (Hereafter we refer to mortality rates, but the same ideas apply to incidence rates). One issue is whether the variation in SMRs is partly explained by known factors such as smoking prevalence. One would like to adjust for such factors to identify regions with the highest or lowest residual mortality rates where one might find new etiologic factors apart from e.g. smoking. A second issue is what procedures to use for ranking. The SMR is unstable in areas with small values of *E _{i}*, and a two-stage Poisson-Gamma model (Clayton and Kaldor, 1987; Pennello, Devesa, and Gail, 1999) can be adapted to produce smoother Bayes or empirical Bayes (EB) estimates for ranking the regions. Although EB estimates produce good estimates of the ranks overall (Laird and Louis, 1989), they are not optimal for identifying the regions with the upper 100γ% of rates (Lin et al., 2006). For that purpose, Lin et al. (2006) recommended ranking the posterior probability that a region’s rank will be in the top 100γ% of ranks, which we call fully Bayesian posterior percentile ranking (B-PPR). In this paper, we propose an empirically approximated posterior percentile ranking procedure (PPR) that implicitly assumes vague priors on certain parameters, and we compare PPR with B-PPR. We use the abbreviation PPR for ranking either in the upper 100γ%of ranks or lower 100γ% of ranks and the abbreviations PPHR and PPLR for the upper 100γ% and lower 100γ% of ranks respectively.

We describe methods for covariate adjustment for the SMR and for the EB and PPR approaches under the Poisson-Gamma model (Section 2) and compare these procedures in an analysis of heart disease mortality data (Section 3). In this example, we find that covariate adjustment has a greater impact than the choice of ranking method, which motivates simulations and analytic studies to determine under what circumstance the choice of ranking method has a material impact (Section 4). The choice of ranking method is only important when there is wide variation in *E _{i}*. In this circumstance, PPR outperforms EB slightly, and both can be noticeably better than SMR if the expected variance of the SMR is large. We present accurate analytic approximations to guide the choice of ranking procedure, both for the Poisson-Gamma and Gaussian-Gaussian models (Section 4). We conclude with a discussion in Section 5.

Consider *I* regions, *i* = 1, 2,, *I* with *J* age-groups *j* = 1,,*J* in each. Let *y _{ij}* be the number of person-years exposure and

$${E}_{ij}={y}_{ij}{\xi}_{j}\{{P}_{ij}r+(1-{P}_{ij})\},$$

(1)

where {ξ_{j}}, the age-specific mortality rates for non-smokers, and *r*, the relative risk associated with smoking, are to be estimated.

We would like to know the unmeasured random effect θ_{i} which is assumed to come from a gamma distribution Γ(α,1/α) with mean α(1/α)=1 and variance 1/α, where α is unknown. The quantity θ_{i} represents the intrinsic relative risk (or frailty) of region *i* apart from the measured factors encompassed in *E _{ij}*. Ideally we would be able to rank the {θ

$${O}_{ij}|{\theta}_{i}~\phantom{\rule{thinmathspace}{0ex}}\text{Poi}({E}_{ij}{\theta}_{i}).$$

(2)

We assume θ_{i} is independent of *E _{ij}*. Thus, unconditionally, E(

$${O}_{ij}~\text{NB}\{\text{success}=\alpha ,\phantom{\rule{thinmathspace}{0ex}}\text{prob}=\alpha /(\alpha +{E}_{ij})\},$$

(3)

with mean = success(1-peob)/prob and variance = success(1-prob)/prob^{2}. The marginal likelihood is the product over regions and age groups of (3), which can be maximized to yield maximum likelihood estimates ={, and _{j}=exp(_{j})}. The parameterization in terms of δ_{j} avoids numerical instabilities. Assuming ψ={α, *r* and ξ_{j}} are known, the posterior distribution of θ_{i} given *O _{i}*=∑

$${\theta}_{i}|{O}_{i}\stackrel{\text{iid}}{~}\mathrm{\Gamma}\{{O}_{i}+\alpha ,1/({E}_{i}+\alpha )\}.$$

(4)

Thus the EB estimate of θ_{i} is

$${{\widehat{\theta}}_{i}}^{\text{EB}}=({O}_{i}+\widehat{\alpha})/({\widehat{E}}_{i}+\widehat{\alpha}),$$

(5)

where *Ê _{i}* = ∑

The fully Bayesian posterior percentile ranking (B-PPR) method was first proposed in Lin et al. (2006). They assumed a Poisson-Gamma model with vague and independent priors on the unknown parameters. To implement their procedures, we use α ~ LN(3.69, 100), *r* ~ LN(0.69, 100), and $\xi \stackrel{\mathit{\text{iid}}}{~}\text{LN}(-6,100)$. Here LN(η, ^{2}) is the lognormal density of a random variable whose logarithm has mean η and variance ^{2}. The top 100γ% highest risk regions are identified by ranking the following posterior probabilities,

$$\mathrm{P}(\text{rank}({\theta}_{i})\ge (1-\gamma )I|O).$$

(6)

These posterior probabilities can be estimated by sampling from the joint posterior distribution of {θ_{i}} using a Gibbs or Metropolis-Hastings algorithm. In this study, the B-PPR analysis was conducted in WinBugs1.4.3 (Lunn et al.; 2000). Based on graphical assessment of convergence, the following procedure was used for each simulation. We discarded the initial 2500 samples and used every fifth sample thereafter to obtain 1000 samples of parameters.

In this paper, we propose an approximate posterior percentile ranking (PPR) method and compare it with B-PPR. For PPR, we approximate the joint posterior distribution of ψ={α, *r* and ξ_{j}} as a multi-normal distribution with means and with covariance given by the inverse of the observed Fisher information matrix from the marginal likelihood based on (3), and we obtain an approximate posterior sample of ψ, namely ψ^{*} from this multi-normal distribution. Given ψ^{*}, the approximate posterior $\{{\theta}_{i}^{*}\}$ are conditionally independent and can be independently generated from (4). In order to approximate (6), this sample of $\{{\theta}_{i}^{*}\}$ is ranked and the regions with ranks in the top 100γ% of ranks (and bottom 100γ% of ranks) are recorded. We repeated this approximated posterior sampling procedure *N* = 1000 times and selected as “high” the γ*I* regions that most often had among the top 100γ% of ranks. Likewise we determined which regions were “low” as those that most often had among the lowest 100γ% of ranks. To be precise, let {S_{i}} be a sample from the posterior distribution of $\{{\theta}_{i}^{*}\}$ and for a particular region *i* consider P{rank(*S _{i}*) > (1−γ)

We compare our PPR approximation with B-PPR using the heart disease example described in the following section, and we find that PPR performs as well as or better than B-PPR in ranking regions. Moreover, PPR takes much less computing time than B-PPR.

We studied mortality from heart disease (ICD codes I00-I09, I13, I20-I51 based on ICD-10, 1992) in white U.S. males for the period 1990–2005. The data were presented separately in 805 health service areas (HSAs), which are defined as a single county or cluster of contiguous counties that are relatively self-contained with respect to hospital care. Heart disease death counts and person-years for each HSA and 13 age intervals (25–29, 30–34, …, 80–84, 85+ years) were obtained from the National Cancer Institute’s Surveillance Epidemiology and End Results (SEER) Program data at http://seer.cancer.gov.

We estimated the prevalence of current smoking {*P _{ij}*} from aggregated 1995–2005 data from the Behavioral Risk Factor Surveillance System (BRFSS) (http://www.cdc.gov/brfss/technical_infodata/surveydata/1995.htm). Four HSAs in Alaska were eliminated because smoking data were too sparse. Twenty HSAs had total sample sizes less than 40; to estimate {

In this example, we use the ranking procedures presented in the last section to identify regions in the top 10% and bottom 10% of relative risks. To highlight the “high” and “low” HSAs, we plotted those in the top decile in black and those in the bottom decile in grey in Figure 1a, 1b, 1c and 1d. Other areas are shown in white, except the four excluded HSAs in Alaska, which are light grey.

Rankings of heart disease mortality relative risks in 805 Health Service Areas (HSAs) for white men, 1990–2005. Those HSAs in the *top* decile are shown in black; those in the *bottom* decile are shown in grey; others are in white. Four HSAs in Alaska **...**

Figure 1a depicts SMRs that are not adjusted for smoking. Adjustment for smoking (Figure 1b) produces noticeable changes in the pattern of “high” risk HSAs especially in and above Texas. Likewise areas of “low” risk are changed in California and elsewhere. Of those 81 HSAs classified as “high” risk in Figure 1b, only 60 were classified as high without adjustment for smoking (Figure 1a). Of those 80 classified as “low” risk in Figure 1b, only 62 were so classified in Figure 1a. Thus, adjustment for smoking changes the identification of high and low risk HSAs appreciably.

The choice of ranking procedure had less impact than adjustment for smoking. Figure 1b, 1c and 1d depict HSA deciles of risk based on the SMR, EB and PPR methods of ranking respectively, all adjusted for smoking. Of the 81 “high” risk HSAs identified by the PPR method, 80 were also ranked “high” by the EB method, and 73 by the SMR method. Likewise, of the 80 HSAs ranked “low” by the PPR method, 79 were also ranked low by the EB method and 70 by the SMR method. Thus the choice of ranking procedure had relatively little impact.

To study factors that might influence the relative performance of the ranking procedures and compare the performance of B-PPR and PPR methods, we conducted simulations based on the heart disease data. Using the estimates from the original data, =42.43, = 1.71, and =(0.281, 0.487, 0.987, 2.17, 4.65, 9.22, 15.94, 26.15, 43.01, 73.17, 114.48, 181.57, 303.08, 622.98)×10^{−4}, and *y _{ij}* and

Table 1 reports results for ranking procedures for smoking adjusted B-PPR, PPR, EB and SMR, and for SMR without smoking adjustment. Across different values of γ and *y _{ij}*, PPR performed slightly better than B-PPR. The B-PPR method took about 7 minutes per simulation, whereas PPR took less than one minute. We found that a slight change in the mean of the vague prior could change the ranking results for B-PPR slightly for a given data set. Because PPR was faster than B-PPR, performed at least as well as B-PPR and does not require special MCMC expertise, we focus on PPR hereafter.

Mean percentages (with standard error) of correctly classified HSAs in simulated cardiovascular mortality data^{a}.

In simulations reflecting the original heart disease mortality data (columns 2 and 3 in Table 1), the proportions of HSAs correctly classified as low or high risk were lower without smoking adjustment. The PPR and EB procedures performed comparably and very slightly better than SMR with adjustment for smoking. The advantage of the PPR and EB procedures was more visible, however, in the scenario in which person-years were reduced by a factor of ten. In this setting, PPR performed slightly better than EB, and both outperformed SMR noticeably. The performance of all the procedures was less satisfactory than with the original *y _{ij}*, because the variance of

The example in Section 3 indicates that the proportion of truly high (or low) regions correctly detected decreases as the expected conditional variance of *SMR _{i}* which is $\mathrm{E}({E}_{i}{\theta}_{i})/{E}_{i}^{2}=1/{E}_{i}$, increases. Unreported simulations as in Section 3 indicate that the superiority of PPR over EB increases with increasing variability of 1/

The Gaussian mean is normally distributed μ_{i} ~ N(μ_{0}, τ^{2}). Given μ_{i} and the conditional variance ${\sigma}_{i}^{2}$, we observe ${X}_{i}~\mathrm{N}({\mu}_{i},{\sigma}_{i}^{2})$. The conditional variance ${\sigma}_{i}^{2}$ might depend on the sample size in a survey in region *i*, for example. We assume that $\text{log}({\sigma}_{i}^{2})~\mathrm{N}(\eta ,{\phi}^{2})$ is independent of μ_{i}. We assume μ_{0}, τ^{2}, η and ^{2} are known. The joint density is

$$f({X}_{i},{\sigma}_{i}^{2},{\mu}_{i})=\varphi ({\mu}_{i};{\mu}_{0},{\tau}^{2})\text{LN}({\sigma}_{i}^{2};\eta ,{\phi}^{2})\phantom{\rule{thinmathspace}{0ex}}\varphi ({X}_{i};{\mu}_{i},{\sigma}_{i}^{2}),$$

(7)

where ϕ(μ_{i}; μ_{0}, τ^{2}) is a normal density with mean μ_{0} and variance τ^{2} and $\text{LN}({\sigma}_{i}^{2};\eta ,{\phi}^{2})$ is the log-normal density of a random variable whose logarithm has mean η and variance ^{2}. In a sample of *I* independent regions, we wish to identify those regions whose values μ_{i} have ranks [0.9*I*], [0.9*I*]+1, ……, *I*, where [ ] is the integer part of a real number.

It is possible to estimate the proportion of regions correctly classified as having μ_{i} among the top 10% of ranks by simulation. The MLE procedure (analogous to SMR) ranks the *X _{i}*. The empirical Bayes procedure ranks ${\widehat{\mu}}_{i}^{\text{EB}}=\mathrm{E}({\mu}_{i}|{X}_{i})={\rho}_{i}{\mu}_{0}+(1-{\rho}_{i}){X}_{i}$, where ${\rho}_{i}={\sigma}_{i}^{2}/({\sigma}_{i}^{2}+{\tau}^{2})$. The PPHR procedure is based on 1000 simulations. In each simulation, a sample

The left three graphs in Figure 2 show the simulation results of the percentages of correctly identified high regions for three procedures. Without loss of generality, we have set μ_{0} = 0 and τ^{2} = 1. We can see that the percentage correctly classified increases with $k={\tau}^{2}/\text{GM}({\sigma}_{i}^{2})$, which is a measure of signal to noise. $\text{GM}({\sigma}_{i}^{2})$ is the geometric mean of $\{{\sigma}_{i}^{2}\}$, which equals exp(η). For ^{2} = 0.1, there is little variation among the $\{{\sigma}_{i}^{2}\}$, and the PPHR, EB and MLE rankings perform similarly. With ^{2} = 1, PPHR and EB outperform MLE slightly. With ^{2} = 10, PPHR and EB are markedly superior to MLE, and PPHR outperforms EM slightly. Thus, when the variability in ${\sigma}_{i}^{2}$ is small, all ranking methods perform comparably, whereas when the variability in ${\sigma}_{i}^{2}$ is large, the PPHR method and EB are superior, and the more complex PPHR method is recommended when the variability of ${\sigma}_{i}^{2}$ is very large (cf. Lin et al., 2006).

Gaussian-Gaussian model: left hand panels plot percent of regions correctly classified as having among the top 10% of means versus the signal to noise ratio, *k*; right hand panels plot the percent correctly classified estimated from simulations with 1000 **...**

The following approximations are close to the results from simulations in Figure 2. For *X* ~ N(μ, σ^{2}), with μ ~ N(μ_{0}, τ^{2}) and log(σ^{2}) ~ N(η, ^{2}), and assuming μ_{0}, τ^{2}, η and ^{2} are known, the joint density *f* (*X*, σ^{2}, μ) is given by equation (6). Since the three ranking methods we consider are all monotone functions of *X*, we use *R*(*X*) to denote a generic statistic with distribution

$$\begin{array}{c}{G}_{\mathrm{R}}(t)=\mathrm{P}\{R(X)\le t\}=\mathrm{P}\{X\le {R}^{-1}(t)\}={\mathrm{E}}_{\mu ,{\sigma}^{2}}[\mathrm{P}\{X\le {R}^{-1}(t)|\mu ,{\sigma}^{2}\}]\hfill \\ ={\displaystyle \underset{{\sigma}^{2}=0}{\overset{\infty}{\int}}{\displaystyle \underset{\mu =-\infty}{\overset{\infty}{\int}}{\displaystyle \underset{x=-\infty}{\overset{{R}^{-1}(t)}{\int}}f(x,}}}{\sigma}^{2},\mu )dxd\mu d{\sigma}^{2},\hfill \end{array}$$

(8)

and with *p ^{th}* percentile ${\zeta}_{p}^{\mathrm{R}}={{G}_{\mathrm{R}}}^{-1}(p)$.

Suppose μ^{*} is drawn from a N(μ_{0}, τ^{2}) distribution but truncated to be at or above the *p ^{th}* percentile δ = μ

$$\begin{array}{c}\mathrm{P}\{R({X}^{*})>{\zeta}_{p}^{\mathrm{R}}\}=1-\mathrm{P}\{R({X}^{*})\le {\zeta}_{p}^{\mathrm{R}}\}=1-\mathrm{P}\{{X}^{*}\le {R}^{-1}({\zeta}_{p}^{\mathrm{R}})\}=1-{\mathrm{E}}_{{\mu}^{*},{\sigma}^{2}}[\mathrm{P}\{{X}^{*}\le {R}^{-1}({\zeta}_{p}^{\mathrm{R}})|{\mu}^{*},{\sigma}^{2}\}]\hfill \\ =1-\frac{1}{1-p}{\displaystyle \underset{{\sigma}^{2}=0}{\overset{\infty}{\int}}{\displaystyle \underset{{\mu}^{*}=\delta}{\overset{\infty}{\int}}{\displaystyle \underset{{x}^{*}=-\infty}{\overset{{R}^{-1}({\xi}_{p}^{\mathrm{R}})}{\int}}f({x}^{*},}}}{\sigma}^{2},{\mu}^{*})d{x}^{*}d{\mu}^{*}d{\sigma}^{2}.\hfill \end{array}$$

(9)

*f* (*X*^{*}, σ^{2}, μ^{*}) is given by equation (6) except that μ^{*} is from the truncated normal distribution. The specific functions *R*(*X*) are defined as follows. The MLE procedure ranks *R*_{MLE}(*X*) = *X*, the EB procedure ranks *R*_{EB}(*X*) = ρμ_{0} +(1−ρ)*X*, where ρ = σ^{2}/(σ^{2} + τ^{2}), and the PPHR procedure ranks ${R}_{\text{PPHR}}(X)=1-\mathrm{\Phi}[\{\delta -{\mu}_{0}\rho -(1-\rho )X\}/\sqrt{\kappa}]$, with κ = ρτ^{2}. As seen in the right three graphs in Figure 2, the approximation (8) gives results very close to the simulated values, as all pairs fall on the equiangular line by visual inspection.

We simplify the model in equations (1) and (2) to eliminate the covariate effect by setting *r* = 1 and by assuming ξ_{j} and hence *E _{ij}* and ${E}_{i}={\displaystyle {\sum}_{j=1}^{J}{E}_{ij}}$ are known. The underlying parameter of interest θ

$$g({O}_{i},{E}_{i},{\theta}_{i})=\mathrm{\Gamma}({\theta}_{i};\alpha ,1/\alpha )\text{LN(}{E}_{i};\eta ,{\phi}^{2})\phantom{\rule{thinmathspace}{0ex}}\text{Poi}({O}_{i};{E}_{i}{\theta}_{i}).$$

(10)

We assume α, η, ^{2}, and *E _{i}* are known. Selection of high risk regions is based on the

In the simulation, we fixed α = 42.43, which is the MLE estimate from the heart disease mortality data, and we used ${\phi}_{0}^{2}=1.55$, which was obtained by matching the first two moments of *E _{i}* from the heart disease mortality data to those of a lognormal distribution. The unconditional variance of

In Figure 3, the left three graphs illustrate the simulation results corresponding to three degrees of variability in 1/*E _{i}*, namely

Poisson-Gamma model: left hand panels plot percent of regions correctly classified as having among the top 10% of relative risks versus the signal to noise ratio, *k*; right hand panels plot the percent correctly classified estimated from simulations with **...**

For *O* ~ Poi(*E*θ), with θ ~ Γ(α,1/α) and log(*E*) ~ N(η, ^{2}), and assuming α, η and ^{2} are known, we approximate the performance of the three ranking methods, which are all monotone functions of *O*, denoted *R*(*O*) with

$$\begin{array}{c}{G}_{\mathrm{R}}(t)=\mathrm{P}(R(O)\le t)=\mathrm{P}(O\le {R}^{-1}(t))={\mathrm{E}}_{\theta ,E}\{\mathrm{P}(O\le {R}^{-1}(t)|\theta ,E)\}\hfill \\ ={\displaystyle \underset{E=0}{\overset{\infty}{\int}}{\displaystyle \underset{\theta =0}{\overset{\infty}{\int}}{\displaystyle \sum _{O=0}^{\{{R}^{-1}(t)\}}g(O,}\theta ,E)d\theta dE}},\hfill \end{array}$$

(11)

and with *p ^{th}* percentile ${\zeta}_{p}^{\mathrm{R}}={{G}_{\mathrm{R}}}^{-1}(p)$.

Suppose θ^{*} is drawn from a Γ(α,1/α) distribution but truncated to be at or above the *p ^{th}* percentile δ = Γ

$$\begin{array}{c}\mathrm{P}\{R({O}^{*})>{\zeta}_{p}^{\mathrm{R}}\}=1-\mathrm{P}\{R({O}^{*})\le {\zeta}_{p}^{\mathrm{R}}\}=1-\mathrm{P}\{{O}^{*}\le {R}^{-1}({\zeta}_{p}^{\mathrm{R}})\}\hfill \\ =1-{\mathrm{E}}_{{\theta}^{*},E}\{\mathrm{P}\{{O}^{*}\le {R}^{-1}({\zeta}_{p}^{\mathrm{R}})|{\theta}^{*},E\}\}=1-\frac{1}{1-p}{\displaystyle \underset{E=0}{\overset{\infty}{\int}}{\displaystyle \underset{{\theta}^{*}=\delta}{\overset{\infty}{\int}}{\displaystyle \sum _{{O}^{*}=0}^{\{{R}^{-1}({\zeta}_{p}^{\mathrm{R}})\}}g({O}^{*},{\theta}^{*},E)}d{O}^{*}d{\theta}^{*}dE.}}\hfill \end{array}$$

(12)

Specifically, the SMR procedure ranks *R*_{SMR} (*O*) = *O*/*E*, the EB procedure ranks *R*_{EB}(*O*) = (*O*+α)/(*E*+α), and the PPHR procedure ranks ${R}_{\text{PPHR}}(O)=1-{\displaystyle {\int}_{0}^{\delta}\mathrm{\Gamma}(s;O+\alpha ,E+\alpha )ds}$. As seen in the right three graphs in Figure 3, the approximation (12) gives results visually indistinguishable from the simulated data.

We investigated procedures for identifying the regions with the upper 100γ% and lower 100γ% of standardized mortality rates. In an example of heart disease mortality (Section 3), we found that adjusting for smoking prevalence produced notable changes in the regions identified as high and low risk. Once smoking prevalence was accounted for, however, ranking based on the SMR, on empirical Bayes estimates (EB), or on posterior percentile ranking (PPR) yielded very similar classifications (Figure 1b, 1c and 1d). The EB and PPR methods outperformed SMR, however, when the expected counts were reduced by a factor of 10, (Table 1). In scenarios representing less common diseases, PPR and EB outperformed SMR when expected counts varied over regions (Section 4), and when there was wide variation in expected counts, the PPR method was better than EB. We developed very accurate approximations for the probability of correct classification in the upper 100γ% of risk for Gaussian-Gaussian and Poisson-Gamma models (Section 4).

The Poisson-Gamma empirical Bayesian framework was used by Clayton and Kaldor (1987). The choice of Gamma distribution for the random effect θ_{i} is mathematically convenient, because it is the conjugate prior for the Poisson distribution, and the posterior of θ_{i} is also gamma distributed. Clayton and Kaldor (1987) and Christiansen and Morris (1997) modeled the log-mean of θ_{i} as a linear combination of covariates. We suggested incorporating the effects of a covariate like smoking prevalence as a mixture that represented smokers and non-smokers. The Poisson-Gamma model helps to smooth θ_{i} by borrowing strength from the other regions. One could achieve further smoothing by allowing for spatial correlations. However, the Gamma prior cannot be easily generalized to for this purpose. In our analysis of mortality from heart disease, we aggregated data from nearby counties, reducing the number of units from 3,141 counties to 805 health service areas, and achieving some local smoothing.

Lin et al. (2006) presented data from a Gaussian-Gaussian model (their Table 2) that indicated better performance for picking high and low regions with B-PPR than by classification based on their posterior expected ranks. Conlon and Louis (1999) and Shen and Louis (1998, 2000) discussed procedures that simultaneously gave good estimates of region-specific rates, good estimates of the ranks of the rates, and good estimates of the cumulative distribution of the rates (“triple goal estimates”). Stern and Cressie (1999) described a loss function to identify regions with extreme risks.

Although the original B-PPR method was based on a fully Bayesian analysis, simulated comparisons using heart disease data in Section 3 suggests that our approximate PPR performs as well as or better than B-PPR with reduced computation time and easier implementation. Perhaps the fully Bayesian approach would perform better with different choices of parameters for the vague prior distributions and Gibbs sampling regime.

We assumed that the prevalence of smoking was known without error. Further work is needed to evaluate the impact of error and the use of hierarchical models to improve prevalence estimates. More experience is needed to assess the usefulness of the covariate adjustment and ranking procedures in this paper. The present work illustrates, however, that it is computationally feasible to adjust for covariates and use a preferred ranking procedure, PPR, which may be particularly advantageous for less common diseases with wide variation in expected counts.

The authors thank Professor Thomas A. Louis for pointing out some key references and providing reprints and the reviewers for helpful suggestions. This work was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics of the National Cancer Institute.

- Christiansen CL, Morris CN. Hierarchical Poisson Regression Modeling. Journal of the American Statistical Association. 1997;92:618–632.
- Clayton DG, Kaldor J. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 1987;43:671–682. [PubMed]
- Conlon EM, Louis TA. Addressing multiple goals in evaluating region-specific risk using Bayesian methods. New York: Wiley; 1999.
- Devesa SS, G DG, Blot WJ, Pennello G, Hoover RN, Fraumeni JF., Jr Atlas of Cancer Mortality in the United States, 1950–94. Washington, DC: 1999. US Govt Print Off [NIH Publ No. (NIH) 99-4564]
- Laird NM, Louis TA. Empirical Bayes ranking methods. Journal of Educational Statistics. 1989;14:29–46.
- Lin R, Louis TA, Paddock SM, Ridgeway G. Loss function based ranking in two-stage, hierarchical models. Bayesian Analysis. 2006;1:915–946. [PMC free article] [PubMed]
- Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS -- a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing. 2000;10:325–337.
- Pennello GA, Devesa SS, Gail MH. Using a mixed effects model to estimate geographic variation in cancer rates. Biometrics. 1999;55:774–781. [PubMed]
- Shen W, Louis TA. Triple-goal estimates for disease mapping. Statistics in Medicine. 2000;19:2295–2308. [PubMed]
- Shen W, Louis TA. Triple-goal estimates in two-stage hierarchical models. Journal of the Royal Statistical Society Series B. 1998;60:445–471.
- Stern HS, Cressie N. In: Inference for Extremes in Disease Mapping and Risk Assessment for Public Health. Lawson A, Biggeri A, Bohning D, Lesaffre E, Viel J-F, Bertollini R, editors. Chichester: John Wiley and Sons; 1999. pp. 63–84.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |