Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2824447

Formats

Article sections

- Abstract
- 1 Introduction
- 2 Using Cumulative Residuals to Detect Clusters
- 3 Simulation Study
- 4 Home Allergens and Asthma Study Analysis
- 5 Discussion
- References

Authors

Related links

Biom J. Author manuscript; available in PMC 2010 October 1.

Published in final edited form as:

PMCID: PMC2824447

NIHMSID: NIHMS152043

Spatial cluster detection has become an important methodology in quantifying the effect of hazardous exposures. Previous methods have focused on cross-sectional outcomes that are binary or continuous. There are virtually no spatial cluster detection methods proposed for longitudinal outcomes. This paper proposes a new spatial cluster detection method for repeated outcomes using cumulative geographic residuals. A major advantage of this method is its ability to readily incorporate information on study participants relocation, which most cluster detection statistics cannot. Application of these methods will be illustrated by the Home Allergens and Asthma prospective cohort study analyzing the relationship between environmental exposures and repeated measured outcome, occurrence of wheeze in the last 6 months, while taking into account mobile locations.

The prevalence of allergic diseases in children have greatly increased in the last few decades (Akinbami, Centers for Disease Control, and Prevention National Center for Health Statistics, 2006). What influences the onset of allergic diseases such as asthma and wheeze has become an increasingly important public health question. The Home Allergens and Asthma Study is an ongoing prospective cohort study investigating environmental and socioeconomic (SES) risk factors leading to early childhood respiratory diseases, such as asthma and wheezing, in the Boston, MA metropolitan area (Celedon *et al*., 1999). Cross-sectional and longitudinal studies have tied home allergen levels (*e.g*. from cockroach and mouse), mold in the home, lower SES, and other individual or family-based measures of exposures to increased incidence or prevalence of wheeze, asthma, and allergic rhinitis (Brugge *et al*., 2003; Finkelstein *et al*., 2002). Fewer studies have focused on the larger area, or neighborhood, in which the individual resides as a source of environmental exposures that may influence the risk of allergic diseases (Litonjua *et al*., 2005).

The immune development of an individual depends upon a complex interaction of factors related to genetics and environmental exposures that may derive from the larger neighborhood as well as the individual home. These exposures may have differing effects according to the age within which they occur and it is likely that an individual’s immune development is influenced by their entire exposure history. Owing to this complexity, it is of substantial interest to detect spatial regions that have significantly higher odds of disease dependent on the age of the child. High-risk areas may indicate potential hazardous environmental sources (*e.g*. bus depots, poor housing, neighborhood waste sites, neighborhood violence).

To make conclusions about these questions of interest in the Home Allergens and Asthma Study, and other similar studies, there is a need to develop spatial cluster detection methods that handle longitudinal outcomes. Currently, numerous spatial cluster detection methods are available for the analysis of individual level data. For example, there are methods for binary outcomes assessing areas with elevated prevalence of disease and count outcomes evaluating excess rates of incidence or mortality (Kulldorff *et al*., 2006; Tango and Takahashi, 2005; Duczmal and Assunção, 2004; Patil and Taillie, 2004; Tango, 2000; Kulldorff, 1997; Turnball et al., 1990). There are even several methods for censored continuous outcomes exploring potential spatial clusters for detection of time to early event (Cook, Gold, and Li, 2007; Huang, Kulldorff, and Gregorio, 2007). However, there are no methods available for longitudinal outcomes.

The Home Allergens and Asthma Study has information about occurrence of wheeze in the last six months measured every six months from birth to age four. Previous analyses evaluated potential spatial clusters with three failure time outcomes: time to doctor diagnosed asthma or censoring, time to allergic rhinitis/hayfever or censoring, and time to eczema or censoring (Cook *et al*., 2007). For the two outcomes, asthma and allergic rhinitis/hayfever, a significant cluster was found, but in very different neighborhoods. Wheeze is a time-varying symptom. Factors influencing wheeze and its resolution or persistence vary with age. Thus, the influence of a single geographical residence may vary with age.

A further innovation of the proposed method is the ability to incorporate study participant relocation during the study. The Home Allergens and Asthma Study has still surveyed and conducted home visits of study participants who have moved, even outside of the predefined study area. It is crucial to include this information for analysis to reduce missingness in the analysis and potential bias. For the Home Allergens and Asthma Study we will analyze the data using the following three different spatial locations: (i) location at birth, (ii) location at age of repeated measure, and (iii) weighted cumulative location history at age of repeated measure.

The outline of this manuscript begins by presenting in Section 2 a new method for spatial cluster detection for repeated measured data. We then conduct a simulation study to assess type I error and power for numerous situations in Section 3. In Section 4 the results from the analysis of the Home Allergens and Asthma Study with outcome repeated wheeze is presented. We conclude with a general discussion in Section 5.

We propose a new method for spatial cluster detection of repeated measured outcomes using cumulative geographic residuals, which are correlated, and generalized estimating equations (GEE). Previous cluster detection methods using cumulative geographic residuals have been developed for failure time outcomes (Cook *et al*., 2007).

We exemplify the development of our test statistic in the framework of a binary repeated outcome, though the formulation may be easily generalized to any continuous/discrete data with proper link functions (e.g. Poisson data with a log link function). Suppose the outcome for individual *i*(*i* = 1,…,*n*), at occasion *k*(*k* = 1,…, *k _{i}*,

$$g({\mathrm{\mu}}_{\mathit{\text{ik}}})=\text{logit}({\mathrm{\mu}}_{\mathit{\text{ik}}})=\mathbf{\beta}{\mathbf{X}}_{\mathit{\text{ik}}},$$

(1)

and β is a 1 × *p* vector of regression parameters. The corresponding marginal variance, dependent on μ* _{ik}*, is Var(

Then define **R*** _{i}*(

$$\sum _{i=1}^{n}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{\epsilon}}_{i}=\mathbf{0},$$

(2)

where **ε*** _{i}* =

Under mild regularity conditions and under the null hypothesis of no disease clusters, has been shown to be consistent and asymptotically normal with covariance matrix

$${\left({\displaystyle \sum _{i=1}^{n}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\widehat{\mathbf{D}}}_{i}}\right)}^{-1}{{\displaystyle \sum _{i=1}^{n}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\mathbf{e}}_{i}{\mathbf{e}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\widehat{\mathbf{D}}}_{i}\left({\displaystyle \sum _{i=1}^{n}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\widehat{\mathbf{D}}}_{i}}\right)}}^{-1}$$

even if **R*** _{i}*(

These asymptotic results have formed the basis for checking whether the link function is correctly specified for a particular component of the covariate vector, such as, **X*** _{j}*, or several components of the covariate vector,

However, in our particular setting for spatial cluster detection we study patterns of residuals from a different perspective: instead of viewing the patterns dependent on covariates, we study whether such patterns vary by geographic locations. Presented patterns across regions may indicate excessive, or exiguous, numbers of cases within those areas. In the next section, we propose a use of cumulative residuals for cluster detection.

We begin by defining our cumulative geographic residuals, *W*_{loc}(*x*_{1}, *x*_{2}|*b*), as a stochastic process indexed by (*x*_{1}, *x*_{2}) for a fixed radius *b*, which takes the form,

$${W}_{\text{loc}}({x}_{1},{x}_{2}|b)=\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\mathbf{I}{({\mathbf{r}}_{i,}{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{e}}_{i}}$$

(3)

where **I**(**r*** _{i}*,

$${\widehat{W}}_{\text{loc}}({x}_{1},{x}_{2}|b)=\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\left\{\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{e}}_{i}.+{\mathrm{\nu}}^{\mathrm{T}}{({x}_{1},{x}_{2},b|\widehat{\mathbf{\beta}})({\displaystyle \sum _{j=1}^{n}{\widehat{\mathbf{D}}}_{j}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{j}^{-1}{\widehat{\mathbf{D}}}_{j}})}^{-1}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\mathbf{e}}_{i}\right\}}{G}_{i},$$

where

$$\mathrm{\nu}({x}_{1},{x}_{2},b|\widehat{\mathbf{\beta}})=-{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{k=1}^{{K}_{i}}\left[\mathit{I}[{({r}_{\mathit{\text{ik}}}-{x}_{1})}^{2}+{({s}_{\mathit{\text{ik}}}-{x}_{2})}^{2}\le {b}^{2}]{\mathbf{X}}_{\mathit{\text{ik}}}\frac{\text{exp}(\mathbf{\beta}{\mathbf{X}}_{\mathit{\text{ik}}})}{{[1+\text{exp}(\mathbf{\beta}{\mathbf{X}}_{\mathit{\text{ik}}})]}^{2}}\right]},}$$

and (*G*_{1},…, *G _{n}*) are independent variables from a unspecified distribution with mean 0 and variance 1 that are independent of (

Therefore, to approximate the null distribution of *W*_{loc}(·,·|*b*), one can simulate *N* realized paths of *Ŵ*_{loc}(·,·|*b*), *e.g*. ( *Ŵ*_{1,loc}(·,·|*b*),…, *Ŵ* _{N,loc}(·,·|*b*)), by repeatedly simulating (*G*_{1},…, *G _{n}*), while fixing the data (

To test for global spatial clusters of half edge length, *b*, would be to compute the probability of how extreme *S*_{loc,b} is under the null distribution that the residuals are not dependent on location. Formally testing this hypothesis would require calculating,

$${\widehat{S}}_{j,\text{loc},b}=\underset{{x}_{1},{x}_{2}}{\text{sup}}{\widehat{W}}_{j,\text{loc}}({x}_{1},{x}_{2}|b),$$

for *j* = 1,…, *N* simulated *Ŵ* _{loc}(*x*_{1}, *x*_{2}|*b*) and estimating the probability, under the null hypothesis that a simulated *Ŝ*_{loc,b} is equal to or more extreme than the observed *S*_{loc,b}, by the *p*-value $P={\displaystyle {\sum}_{j=1}^{N}\mathit{I}[{S}_{\text{loc},b}\le {\widehat{S}}_{j,\text{loc},b}]/N.}$

However, for spatial cluster detection it is particularly important to be able to range the values of *b* to allow the data to depict maximum cluster size. To extend the method to incorporate a finite range of edge lengths, define **b** = (*b*_{1},…,*b _{L}*) as a finite vector of potential

$${\widehat{S}}_{\text{loc}}=\text{sup}({\widehat{S}}_{\text{loc},{b}_{1}},\dots ,{\widehat{S}}_{\text{loc},{b}_{L}})$$

converges weakly to the same limiting distribution as

$${S}_{\text{loc}}=\text{sup}({S}_{\text{loc},{b}_{1}},\dots ,{S}_{\text{loc},{b}_{L}}).$$

To test for global clusters using a finite range of half edge length, **b**, would be to compute the *p*-value $P={\displaystyle {\sum}_{j=1}^{N}\mathit{I}[{S}_{\text{loc}}\le {\widehat{S}}_{j,\text{loc}}]/N.}$

This hypothesis test can still easily be inverted to form confidence bands around the stochastic process *W*_{loc}(*x*_{1}, *x*_{2}|*b*) to define values of (*x*_{1}, *x*_{2}) and *b* which have significantly higher cumulative residuals than expected under the null hypothesis of geographic independence. Explicitly, we can form the confidence band {(*x*_{1}, *x*_{2}, *b*) : *W*_{loc}(*x*_{1}, *x*_{2}|*b*) ≥ *Ŝ*_{(0:95N)}} where *Ŝ*_{(0:95N)} is the 95-th percentile of all *Ŝ*_{j,loc}.

By using cumulative geographic residuals, one would be able to locate significant clusters with corresponding edge length, 2*b*. Another advantage is the fact that location is not treated as fixed for an individual, but can change at each repeated time point. Therefore this method incorporates moving, which previous spatial cluster detection methods do not. So far the incorporation of moving has not taken into account moving history, but only current location of the individual. The handling of moving can be made even more flexible by incorporating a weighting structure on an individual location to handle moving history. Specifically, define the test statistic as

$${W}_{\text{loc},h}({x}_{1},{x}_{2}|b)=\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\mathbf{H}({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h}}|b,{x}_{1},{x}_{2}{)}^{\mathrm{T}}{\mathbf{e}}_{i},$$

(4)

where $\mathbf{H}({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h}|b,{x}_{1},{x}_{2})$ is a *K _{i}* × 1 vector with each row corresponding to a weight defined as ${\mathbf{H}}_{\mathit{\text{ik}}}({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h}|b,{x}_{1},{x}_{2})={\displaystyle {\sum}_{m=1}^{{M}_{i}}\mathit{I}({x}_{1}-b\le {r}_{\mathit{\text{im}}}^{h}<{x}_{1}+b,{x}_{2}-b\le {s}_{\mathit{\text{im}}}^{h}<{x}_{2}+b){w}_{\mathit{\text{ikm}}}\in [0,1],({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h})}$ is a vector of all address locations,

The proposed spatial cluster detection test using cumulative residuals will be able to find exact locations, and size, of significant clusters and simultaneously give a *p*-value for the global hypothesis test of existence of geographic clusters. It can flexibly handle moving locations by applying different weighting structures by not treating location as fixed and does not require model specification of the spatial surface. Section 3 will study the properties of this approach to check the type I error and power. This approach will be applied to the Home Allergens and Asthma study in Section 4, looking at the outcome repeated wheeze.

However, the above spatial cluster detection method cannot pinpoint the times at which the significant clusters occurred. Also, clusters may occur at different locations at different time points. The previous method would only be valid if one assumes that increased risk of disease from a location is constant over age/time. This scenario is not true for most public health outcomes and, in particular, for the outcome wheeze were protective/hazardous predictors in early age can become hazardous/protective in later ages. Therefore, in the next section we will present an alternative method that can detect the time and location of the clusters.

The previous section presented a global test statistic utilizing all of the repeated measured data to detect significant geographic clusters that occur throughout a study. However, often a cluster of outcomes may occur only during a certain time point of a study. For example, in the Home Allergens and Asthma Study one may hypothesize that important early in life geographic exposures are different then later in life exposures and therefore locations of significant clusters may change. To handle this important issue we present the following test statistics for each repeated time point *t* = *t*_{1}, *t*_{2},…, *t _{K}*,

$${W}_{\text{loc},h,t}({x}_{1},{x}_{2}|b)=\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\mathbf{H}{({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h},{\mathbf{t}}_{i}|b,{x}_{1},{x}_{2},t)}^{\mathrm{T}}{\mathbf{e}}_{i},}$$

where $\mathbf{H}({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h},{\mathbf{t}}_{i}|b,{x}_{1},{x}_{2},t)$ is a *K _{i}* × 1 vector with each row corresponding to a weight defined as ${H}_{\mathit{\text{ik}}}^{t}({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h},{\mathbf{t}}_{i}|b,{x}_{1},{x}_{2},t)=[{\displaystyle {\sum}_{m=1}^{{M}_{i}}\mathit{I}({x}_{1}-b\le {r}_{\mathit{\text{im}}}^{h}<{x}_{1}+b,{x}_{2}-b\le {s}_{\mathit{\text{im}}}^{h}<{x}_{2}+b){w}_{\mathit{\text{ikm}}}]\mathit{I}({t}_{\mathit{\text{ik}}}=t)\in [0,1]}$. Therefore,

$${\widehat{W}}_{\text{loc},h,t}({x}_{1},{x}_{2}|b)=\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\left\{\mathbf{H}{({\mathbf{r}}_{i}^{h},{\mathbf{s}}_{i}^{h},{\mathbf{w}}_{i}^{h},{\mathbf{t}}_{i}|b,{x}_{1},{x}_{2},t)}^{\mathrm{T}}{\mathbf{e}}_{i}.+{\mathrm{\nu}}^{\mathrm{T}}({x}_{1},{x}_{2},b|\widehat{\mathbf{\beta}}){\left({\displaystyle \sum _{j=1}^{n}{\widehat{\mathbf{D}}}_{j}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\widehat{\mathbf{D}}}_{j}}\right)}^{-1}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\mathbf{e}}_{i}\right\}}{G}_{i,}$$

where ν(*x*_{1}, *x*_{2}, *b*|**β**), **D**, and **V** are defined as in Section 2.2, but *I* [*x*_{1} − *b* ≤ *r _{ik}*<

To make conclusions for each time point, *t*, one would approximate the null distribution of *W*_{loc,h,t}(·,·|*b _{l}*) (

We conducted simulations calculating the type I error and power for the global cumulative geographic residual test. First, we will analyze the results for assessing type I error. Simulations were conducted by generating 1000 test studies where location of an individual was randomly assigned uniformly over an 8 × 8 grid. For our simulations we chose to treat locations of individuals as fixed over time and a finite range for *b* of 0.5 to 4 sequenced by 0.5, just to reduce computational complexity. We simulated a repeated measured data set with exchangeable correlation structure and overall probability of having the outcome to be approximately 0.2. The details of this simulation are presented in Appendix B.

By choosing this simulation setup the outcomes for the same individual are correlated and there is an effect of time. When running the simulation we assumed a profile analysis for the mean structure on time and an exchangeable correlation structure. The results for the type I error calculations are given in Table 1. We defined Type I error as the proportion of simulations that detect a significant (α = 0:05) cluster. The type I error converges to the α-level of 0.05 when the number of individuals and repeated measures increase. However, it is very low when there are only 100 individuals in the study.

Type I error and power calculations of Cumulative Geographic Residual Test for different sample sizes and number of repeated measured.

For the power calculations we simulated the repeated measured study population as described for the type I error. To create a single cluster we first considered an 8 × 8 unit-less area and divided the area into 16 equally sized squares of size 2 × 2 as depicted in Fig. 1. To create the cluster in consecutive grid areas 6 and 10, we gave a higher probability for individuals with more cases to be within the cluster area. First, define ${S}_{\mathit{\text{Yi}}}={\displaystyle {\sum}_{k=1}^{K}{Y}_{\mathit{\text{ik}}}}$ *Y _{ik}* where

Overall, the proposed cumulative geographic residual test statistic for repeated measures holds the type I error rate and has relatively high power of finding clusters. The power, sensitivity, and specificity increase as expected given more individuals and repeated measures in the study. The sensitivity is relatively low indicating that the spatial cluster, or clusters, detected tend to be larger than the actual cluster. The specificity is very high indicating whether the detected cluster does not include a given area, that area with high probability is not actually a cluster.

The next set of simulations evaluated the effect of covariate adjustment on cluster detection. In particular we are evaluating the assumption that the proposed spatial cluster detection method is valid even when there is dependence between the covariate and spatial location. We assumed 4 repeated measures and 300 observations for this simulation study. We first simulated the outcome data following the procedure described for type I error. For simplicity, we will assume only one covariate and that the covariate stays constant over time (*X _{ik}* =

Table 2 displays the results from the simulation study evaluating the influence of covariate adjustment. When there was no actual spatial cluster (*c* = 0) that existed independent of the relationships between the covariate, *X _{i}*, and outcome,

Evaluation of the effect of covariate adjustment on the properties of the cumulative geographic residual test for different degrees of dependence between outcome and covariate and between covariate and location.

Overall in this simulation section we have shown that the proposed cumulative geographic residual method holds the appropriate type I error with and without adjusting for covariates and power follows expected patterns as it increases with increased sample size, increased number of repeated measures, and increased dependence between outcome and spatial location. In the next section, the proposed cumulative geographic residual method will be applied to the repeated measured outcome wheeze.

We now apply the proposed method to the Home Allergens and Asthma prospective cohort study with the longitudinal outcome wheeze in the last 6 months. The study was designed to investigate potential environmental exposures and their relationship to childhood asthma and other respiratory outcomes. A total of 499 study participants were enrolled in the study after being born at Brigham and Women’s hospital in Boston, MA USA between September 1994 to June 1996. Details of the study design have been previously published by Celedon *et al*. (1999). Of those 499 study participants, only 478 were used for this analysis due to the inability to geocode the missing participants’ birth addresses. The investigators for this analysis were interested in areas of significant disease clusters for a range of outcomes. For this analysis we will study the clusters of the outcome repeated wheeze in the first four years of life. Therefore, the repeated measures will be observed at ages 6, 12, 18, 24, 30, 36, 42, and 48 months.

The study area is a diverse population with a range of SES levels. Figure 2 displays the median family income level in the study population. Previous analysis on the mothers of the infants screened for the study had found elevated IgE levels, an indicator of allergic response, in southern Boston, Chelsea, and Revere areas (Litonjua *et al*., 2007). These areas also correspond to lower median family areas indicating a relationship between disparity and allergic reaction.

Indicated areas of low, medium, and high median family income by US census tract in the study population area.

A spatial cluster detection analysis on the children up to age four in this study using censored outcomes asthma, allergic rhinitis/hayfever, and eczema found a significant cluster of the censored outcome asthma in southern Boston, Chelsea, Revere, and their neighboring towns, but for the censored outcome allergic rhinitis/hayfever the significant cluster resided in the western, more affluent, towns (Cook *et al*., 2007). It is of interest to display significant disease clusters for the outcome wheeze since it may be less vulnerable to underdiagnosis in lower SES areas compared with the previous outcomes, particularly hayfever (Strunk, Ford, and Taggart, 2000). We hypothesize that a cluster will exist in the southern less affluent Boston area early in life, similar to the asthma cluster found in Cook *et al*. (2007), since the area has higher IgE levels (Litonjua *et al*., 2005) and lower median family income (Fig. 2) and location of the cluster will change over time. One reason for the change in location over time could be due to the differential drop-out within lower SES and minority areas and therefore over time the cluster may move to more affluent areas. To infer whether the cluster location movement over time is due to exposure change, or loss to follow-up, we ran the analysis using all of the data (full data) and then checked for comparable results using only the observations of study participants with complete follow-up up to age four (complete follow-up). Note that we will present results only for the full data set since the complete data set results did not change results substantially, except *p*-values were higher since we have fewer subjects in the complete data set (Table 3).

Estimated probability wheeze per time period for all study participants and subset with complete follow-up.

First, we ran a GEE model without considering spatial clusters to assess change in percent wheeze by age. Owing to the exploratory nature of all analyses in this manuscript, we did not adjust for other predictors except age. We used a profile mean model on age and an unstructured correlation structure with robust standard errors from the sandwich estimator. Table 3 summarizes the results for two analyses that ran the GEE model for the full data set and the complete follow-up subset. Note that estimates, and corresponding 95% confidence intervals, do not change significantly depending on the full data set versus complete data set indicating missingness may be missing completely at random as assumed by the GEE. Overall, there is a definite change in probability of wheeze over time indicated by a significant drop in wheeze rates after age 30 months.

It is of interest to assess if the environmental exposure is influenced by earlier months, *i.e*. the birth location, current location, or complete location history is the important predictor. To answer this question we ran three analyses (i) keeping location constant as birth location, (ii) location as the current location at evaluation, and (iii) a weighted history of location with weights determined as length of time resided in a particular location.

Results for the three analyses are reported in Table 4 and Fig. 3, Fig. 4, and Fig. 5. The only significant (α-level = 0.05) spatial cluster detected was at 6 months in the urban coastal Boston area. This is in a similar area in which the censored outcome time to asthma and time to eczema found significant clusters. The spatial clusters moved over time, starting in the urban coastal Boston area, and slowly moving toward the southern, more suburban, study area. The movement of clusters is not statistically significant since there are no significant clusters found for any time points after 6 months. However, this could be due to power issues since the prevalence of wheeze decreases over time.

Summary of results from the cumulative geographic residual test by time point and location definition indicating probability of wheeze within and outside cluster at each time point.

It is interesting to note that the performance of the cumulative spatial cluster detection is dependent on the size of cluster detected and strength of cluster detected. The 6 months cluster comprises 31%(171/478) of the study population and was in an area with 32% prevalence of wheeze in the first 6 months compared with 17% prevalence in the rest of the study population (Table 3). At 12 months, the cluster detected, assuming birth address, comprises of only 10% of the study population, but had a much larger difference in prevalence of wheeze of 57% compared with 25% even though the *p*-value was 0.13. There is a trade off between prevalence, effect size, and size of cluster that needs to be further explored to assess the performance of this method, but this case example indicates that the cumulative geographic residual method has more power to detect clusters of larger size.

In this manuscript we have proposed a new spatial cluster detection method for repeated measured outcomes utilizing cumulative geographic residuals. Applying the new method, we detected a significant cluster of wheeze in urban Boston for age 6 months. Further research is being conducted to look into which exposures in urban Boston may be influencing this disease cluster, such as air pollution.

We also performed type I error and power calculations for the cumulative geographic residual method. Type I error was held at the appropriate α-level under the null of no spatial clusters. By increasing the number of individuals, and repeated measures, power was shown to increase substantially. Therefore, the method is performing as expected and is valid for spatial cluster detection of repeated measured outcomes. Future work exploring the properties of the cumulative residual method, particularly how power is effected by cluster size, effect size, and overall incidence rates would be of interest to assess the performance of the method.

The importance of using the time-dependent cumulative geographic residual method was presented as being able to pinpoint the location and time of significant clusters. This method can be used to explore hypotheses and assess changes of outcomes and exposures over time which virtually no previous spatial cluster detection methods have directly been able to assess.

Diane R. Gold was supported by NIH RO1 AI/EHS 35786 and Yi Li was supported by NIH RO1 CA95747.

Throughout this proof we assume that *g*(·) (1) is the correct link function between **Y*** _{i}* and

Consider the following one-term Taylor series expansion of *W*_{loc}(*x*_{1}, *x*_{2}|*b*) at β:

$$\begin{array}{cc}{W}_{\text{loc}}({x}_{1},{x}_{2}|b)\hfill & =\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{e}}_{i}}\hfill \\ \hfill & =\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{\epsilon}}_{i}}\hfill \\ \hfill & \text{\hspace{1em}}+\frac{1}{n}{\displaystyle \sum _{i=1}^{n}\mathbf{I}{({\mathbf{r}}_{i,}{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{D}}_{i}\sqrt{n}(\widehat{\mathbf{\beta}}-\mathbf{\beta})+{o}_{p}(1).}\hfill \end{array}$$

(5)

where ε* _{i}* =

It was shown in Section 2.1 that given the the conditional mean of *Y _{i}*,

$${\left({\displaystyle \sum _{i=1}^{n}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{D}}_{i}}\right)}^{-1}{{\displaystyle \sum _{i=1}^{n}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{\epsilon}}_{i}{\mathbf{\epsilon}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{D}}_{i}\left({\displaystyle \sum _{i=1}^{n}{\mathbf{D}}_{i}^{\mathbf{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{D}}_{i}}\right)}}^{-1}.$$

This implies that *W*_{loc}(*x*_{1}, *x*_{2}|*b*) is asymptotically equivalent to

$$\begin{array}{cc}{\tilde{W}}_{\text{loc}}({x}_{1},{x}_{2}|b)\hfill & =\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\mathbf{I}{({\mathbf{r}}_{i,}{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{\epsilon}}_{i}}\hfill \\ \hfill & \text{\hspace{1em}}+\frac{1}{n}{\mathrm{\nu}}^{\mathrm{T}}({x}_{1},{x}_{2},b|\mathbf{\beta}){\left(\frac{1}{n}{\displaystyle \sum _{j=1}^{n}{\mathbf{D}}_{j}^{\mathrm{T}}{\mathbf{V}}_{j}^{-1}{\mathbf{D}}_{j}}\right)}^{-1}\hfill \\ \hfill & \text{\hspace{1em}}\times \frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{\epsilon}}_{i}}\hfill \\ \hfill & \text{\hspace{1em}}=\frac{1}{\sqrt{n}}{\displaystyle \sum _{i=1}^{n}\left[\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}+\frac{1}{n}{\mathrm{\nu}}^{\mathrm{T}}({x}_{1},{x}_{2},b|\mathbf{\beta}){(\mathbf{B})}^{-1}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}\right]}{\mathbf{\epsilon}}_{i}.\hfill \end{array}$$

Where $\mathbf{B}={n}^{-1}{\displaystyle {\sum}_{j=1}^{\mathrm{n}}{\mathbf{D}}_{j}^{\mathrm{T}}{\mathbf{V}}_{j}^{-1}{\mathbf{D}}_{j}}$

We will first establish the tightness of _{loc}(*x*_{1}, *x*_{2}|*b*), which implies the tightness of *W*_{loc}(*x*_{1}, *x*_{2}*b*|**β**). By the law of large numbers **B** converges to a constant matrix as *n* → ∞. By the uniform law of large numbers *n*^{−1}ν (*x*_{1}, *x*_{2}, *b*|**β**) converges as *n* → ∞ uniformly in *x*_{1}, *x*_{2}, and *b*, to a nonrandom function. Therefore,

$$\frac{1}{\sqrt{n}}\left(\frac{1}{n}{\mathrm{\nu}}^{\mathrm{T}}({x}_{1},{x}_{2},b|\mathbf{\beta})\right){(\mathbf{B})}^{-1}{\displaystyle \sum _{i=1}^{n}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}{\mathbf{\epsilon}}_{i}}$$

converges as *n* → ∞ to Gaussian process and therefore is tight. Since $\sum}_{i=1}^{\mathrm{n}}\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1},{x}_{2})}^{\mathrm{T}}{\mathbf{\epsilon}}_{i$ is the sum of monotonic step functions and therefore manageable (Pollard, 1998) and by the functional central limit theorem it is tight. Hence, the entire process _{loc}(*x*_{1}, *x*_{2}|*b*) is tight yielding *W*_{loc}(*x*_{1}, *x*_{2}|*b*) is tight.

For fixed (*x*_{1}, *x*_{2}), _{loc}(*x*_{1}, *x*_{2}|*b*) is a sum of *n* independent and identically distributed zero-mean random vectors since *E*(**ε*** _{i}*) = 0. By the multivariate central limit theorem, the finite-dimensional distributions of

$$\begin{array}{c}\underset{n\to \infty}{\text{lim}}\frac{1}{n}{\displaystyle \sum _{i=1}^{n}\{[\mathbf{I}{({\mathbf{r}}_{i,}{\mathbf{s}}_{i}|b,{x}_{1a},{x}_{2a})}^{\mathrm{T}}+\frac{1}{n}{\mathrm{\nu}}^{\mathrm{T}}({x}_{1a},{x}_{2a},b|\mathbf{\beta}){(\mathbf{B})}^{-1}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}]{\mathbf{\epsilon}}_{i}\}}\hfill \\ \text{\hspace{1em}}\times \{[\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1b},{x}_{2b})}^{\mathrm{T}}+\frac{1}{n}{\mathrm{\nu}}^{\mathrm{T}}({x}_{1b},{x}_{2b},b|\mathbf{\beta}){(\mathbf{B})}^{-1}{\mathbf{D}}_{i}^{\mathrm{T}}{\mathbf{V}}_{i}^{-1}]{\mathbf{\epsilon}}_{i}\}\hfill \end{array}$$

Next we will establish the weak convergence distribution of *Ŵ*_{loc}(*x*_{1}, *x*_{2}|*b*). Conditional on the data (*Y _{ik}*,

The conditional covariance function of *Ŵ* _{loc}(*x*_{1}, *x*_{2}|*b*) at ((*x*_{1a}), (*x*_{2a}), (*x*_{1b}, *x*_{2b})) is,

$$\begin{array}{c}\frac{1}{n}{\displaystyle \sum _{i=1}^{n}\{\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1a},{x}_{2a})}^{\mathrm{T}}{\mathbf{e}}_{i}+{\mathrm{\nu}}^{\mathrm{T}}({x}_{1a},{x}_{2a},b|\widehat{\mathbf{\beta}}){(\widehat{\mathbf{B}})}^{-1}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\mathbf{e}}_{i}\}}\hfill \\ \text{\hspace{1em}}\times \{\mathbf{I}{({\mathbf{r}}_{i},{\mathbf{s}}_{i}|b,{x}_{1b},{x}_{2b})}^{\mathrm{T}}{\mathbf{e}}_{i}+{\mathrm{\nu}}^{\mathrm{T}}({x}_{1b},{x}_{2b},b|\widehat{\mathbf{\beta}}){(\widehat{\mathbf{B}})}^{-1}{\widehat{\mathbf{D}}}_{i}^{\mathrm{T}}{\widehat{\mathbf{V}}}_{i}^{-1}{\mathbf{e}}_{i}\}\hfill \end{array}$$

which converges as *n*→ ∞ to the same deterministic limit as the covariance function of *W*_{loc}(*x*_{1}, *x*_{2}|*b*). Therefore, *W*_{loc}(*x*_{1}, *x*_{2}|*b*) and *Ŵ*_{loc}(*x*_{1}, *x*_{2}|*b*) converge to the same limiting zero-mean Gaussian process as *n*→ ∞.

To simulate the repeated measured data under an exchangeable correlation structure we conducted the following simulation design for the different number of repeated observations holding the overall probability of being a case to be approximately 0.2.

Generate *n* multivariate normal outcome, **Z _{i}** ~

Generate *n* multivariate normal outcome, **Z*** _{i}* =

Generate *n* multivariate normal outcome, **Z***i* = *N*_{5}((−0.1, −0.05, 0, 0.05, 0.1)^{T},**V**), where **V** is a 5 × 5 matrix with diagonal elements 1 and off diagonal elements ρ = 0.2. Then define binary repeated measures outcome, *Y _{ij}* =

**Supplementary Material**

Software for the method can be found at http://faculty.washington.edu/acook developed in the R statistical package (R Development Core Team, 2009).

**Conflict of Interests Statement**

The authors have declared no conflict of interest.

- Centers for Disease Control and Prevention National Center for Health Statistics. Akinbami L. The state of childhood asthma 1980–2005. Advance Data. 2006;381:1–24. [PubMed]
- Brugge D, Vallarion J, Ascolillo L, Osgood ND, Steinbach S, Spengler J. Comparison of multiple environmental factors for asthmatic children in public housing. Indoor Air. 2003;13:18–27. [PubMed]
- Celedon J, Litonjua A, Weiss S, Gold D. Day care attendence in the first year of life and illnesses of the upper and lower respiratory tract in children with a familial history of atopy. Pediatrics. 1999;104:495–500. [PubMed]
- Cook AJ, Gold DR, Li Y. Spatial cluster detection for censored outcome data. Biometrics. 2007;63:540–549. [PubMed]
- Duczmal L, Assunção R. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics and Data Analysis. 2004;45:269–286.
- Finkelstein J, Fuhlbrigge A, Lozano P, Grant E, Shulruff R, Arduino K, Weiss K. Parent-reported environmental exposures and environmental control measures for children with asthma. Archives of Pediatrics and Adolescent Medicine. 2002;156:258–264. [PubMed]
- Huang L, Kulldorff M, Gregorio D. A spatial scan statistic for survival data. Biometrics. 2007;63:109–118. [PubMed]
- Kulldorff M. A spatial scan statistic. Communications in Statistics. 1997;26:1481–1496.
- Kulldorff M, Huang L, Pickle L, Duczmal L. An elliptic spatial scan statistic. Statistics in Medicine. 2006;25:3929–3943. [PubMed]
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Lin DY, Wei LJ, Ying Z. Model-checking techniques based on cumulative residuals. Biometrics. 2002;58:1–12. [PubMed]
- Litonjua AA, Celedón JC, Hausmann BA, Nikolov M, Sredl D, Ryan L, Platts-Mills TAE, Weiss ST, Gold DR. Variation in total and specific IgE: effects of ethnicity and socioeconomic status. The Journal of Allergy and Clinical Immunology. 2005;115:751–757. [PubMed]
- Patil GP, Taillie C. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 2004;11:183–197.
- Pollard D. Empirical Processes: Theory and Applications; NSF-CBMS Regional Conference Series in Probability and Statistics 2. Institute of Mathematical Sciences; Hayward, CA. 1998.
- R Development Core Team. R Foundation for Statistical Computing. Austria: Vienna; 2009. R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0 http://www.R-project.org.
- Strunk RC, Ford JG, Taggart V. Reducing disparities in asthma care: priorities for research - National Heart, Lung, and Blood Institute workshop report. The Journal of Allergy and Clinical Immunology. 2002;109:229–237. [PubMed]
- Stute W. Nonparametric model checks for regression. The Annals of Statistics. 1997;25:613–641.
- Su JQ, Wei LJ. A lack-of-fit test for the mean function in a generalized linear model. Journal of the American Statistical Association. 1991;86:420–426.
- Tango T. A test for spatial disease clustering adjusted for multiple testing. Statistics in Medicine. 2000;19:191–204. [PubMed]
- Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005;4:11. [PMC free article] [PubMed]
- Turnball GW, Iwano EJ, Burnett WS, Howe HL, Clark LC. Monitoring for clusters of disease: Application to leukemia incidence in upstate New York. American Journal of Epidemiology. 1990;132:136–143. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |