3.1 Health Outcome Model

A common approach (

Metzger et al., 2004;

Peel et al., 2005) to modeling daily event counts, such as ED visits, involves fitting a generalized linear model with a log-link and Poisson error (

McCullagh and Nelder, 1989) such as the following:

*R*_{ht} ~

*P*, where

Here

*R*_{ht} and

*n*_{ht} are the number of visits for a specific illness and total number of patient ED visits from subregion

*h* based on residential zip code on day

*t*, respectively.

*X*_{ht} is the true, unobserved ambient-level concentration of a specific pollutant at the centroid of subregion

*h* on day

*t*. While we use the notation

*X*_{ht} for simplicity, often the exposure of interest is the previous day’s concentration or possibly a moving-average of current and prior concentrations.

(

*t*;Ψ) can include covariates such as day-of-the-week and smooth functions of calendar time and weather to model long-term temporal trends, seasonality, and other relevant events such as influenza epidemics, and to account for any additional temporal correlation in the count time series (

Dominici et al., 2000;

Metzger et al., 2004;

Peel et al., 2005;

Peng et al., 2006). In

equation 1, β

_{0} +

(

*t*;Ψ) represents the log(baseline rate) for the outcome in the absence of exposure. β

_{1}, the parameter of interest, is the log relative risk associated with a unit increase in the daily ambient concentration of the pollutant.

Given the usual assumptions, the above model would be be reasonably straightforward except for the fact that

*X*_{ht} is unknown for all

*h* and

*t*. Ambient air quality measurements are not available for each subregion, but rather there are a limited number of monitors located throughout the region, each providing a daily measure of ambient concentration (

*Z*_{mt}) at that specific location. One might think of this as a classical missing data problem (

Little and Rubin, 2002), albeit an extreme example, however information provided by the correlation structure of the observed measurements can inform us about the unobserved observations. The problem can also be placed in a measurement error framework (

Carroll et al., 1995), by considering the set of daily monitor measurements

**Z**_{t} as a source of potential surrogates for the unknown exposure measurement

*X*_{ht}. We use the term surrogate as defined by

Carroll et al. (1995). That is, we assume

*f*(

**R**|

**Z**,

**X**,Ψ) =

*f*(

**R**|

**X**,Ψ), i.e., non-differential error whereby

**Z** offers no information about

**R** once

**X** is known.

3.2 Surrogate exposures for *X*_{ht}

In this section we investigate a conditional expectation approach designed to adjust for measurement error in ambient exposures due to spatial variability. Assuming known mean and variance, this method is equivalent to the use of the simple kriging predictor. See

Schabenberger and Gotway (2005) for details on kriging.

Wong et al. (2004) compared four different weighted average interpolation methods: spatial averaging, nearest neighbor, inverse distance weighting, and ordinary kriging, to assess each method’s influence on the estimated exposure measurement. Other commonly used surrogates are to utilize a single, centrally located monitor or the arithmetic average of all or a subset of exposure measurements. We take the comparison of surrogates a step further by assessing the impact of each surrogate measure on the estimation of the association between exposure and a health outcome of interest.

We propose the following method where we model the temporal and spatial variability exhibited by the observed ambient concentrations through the mean and covariance structures, respectively. Utilizing estimated parameters, one can then *interpolate* concentration levels at each location. This method is equivalent to kriging. Modeling the spatial variability is based on describing a region’s air pollution measurements as a spatial random field in which the spatial dependence of measurements at different locations can be expressed through the variance-covariance matrix.

More specifically, let

*t* = 1,…,

*T* denote the day of observation,

*m* = 1,…,

*M* the monitor,

*h* = 1,…,

*H* the subregion, and

*L* =

*M* +

*H*. We can then express

**Z**, the vector of observed concentrations, as:

and

**X**, the vector of unobserved ambient exposure levels, as:

We then assume the following model:

where

and e ~

(

**0**,∑).

ζ(**t**;Θ) is a function that defines the regional average exposure for each day. As in model 1, long-term temporal trends are accounted for via smooth functions of calendar time with pre-specified knots. Variables such as day-of-the-week, precipitation, wind direction and magnitude can be included to further account for weather, traffic patterns and any additional temporal correlation in the time series. Let Θ be the vector of parameters corresponding to the time-specific covariates included in the model.

Conditional on the fixed covariates indexed by Θ, we assume that each day’s set of pollutant measurements is an independent realization of the underlying spatial process of the pollutant measurements across a defined region. We seek to explore the impact of various surrogates for the unknown subregion ambient exposures. For this purpose take advantage of parametric isotropic models (

Waller and Gotway, 2004) to estimate the spatial covariance parameters. Conditional on the covariates, we assume the spatial correlation between exposure measurements is the same for all pairs of equally distant locations and is not dependent on direction,

*i.e*., we assume stationarity and isotropy, respectively.

Under the defined structure, ∑ is a block diagonal matrix where there are identical blocks of size

*L* =

*M* +

*H* for each day,

*t*. For example, if we assume a spatial exponential covariance structure with a nugget effect given 2 monitors (

*M* = 2) and a single subregion centroid (

*H* = 1), i.e.

*L* = 3, then for day

*t*,

This model gives rise to the following variance-covariance structure:

where, under an exponential model,

and

*d*_{ll}′ is the distance between location

*l* and

*l*′.

Once a structure is chosen for ∑, maximum- or restricted maximum-likelihood methods are available to estimate the covariance parameters. We take advantage of the

*MIXED* procedure’s (

SAS, 2004c) ability to fit models with spatial covariance structures. We emphasize that the estimation of the covariance parameters from a parametric model using likelihood methods with a handful of monitors can be challenging, and untenable without the assumption of independence across days. Nevertheless, the following analysis and simulation study illustrate that estimation of the covariance parameters is still possible given only a handful of monitors.

Under the model framework described above, utilization of multivariate normal (

) theory enables calculation of the conditional expectation (CE) of the unobserved exposure measurements (

**X**) given the observed exposure measurements (

**Z**):

where

**Z**_{a} = (

*z*_{11},…,

*z*_{1T},

*z*_{21},…,

*z*_{2T} ,…,

*z*_{M1},…,

*z*_{MT})′ is a vector of observed concentrations from the

*M* monitors on

*T* days, and

**X**_{a} = (

*x*_{11},…,

*x*_{1T},

*x*_{21},…,

*x*_{2T} ,…,

*x*_{H1},…,

*x*_{HT})′ is a vector of the unknown concentrations. In practice, unknown parameters are replaced by estimates; for example ζ

_{Za}(

**t**;Θ) and ζ

_{Xa}(

**t**;Θ) are both replaced by

_{Za}(

**t**;Θ), the vector containing estimates of the average exposure for each day across the region based on the observed exposures,

**z**_{a}, and day-specific covariates.

Liang and Liu (1991) suggest that Whittemore’s approach (

Whittemore, 1989), which replaces the true value

**X**_{a} by

*E*[

**X**_{a}|

**Z**_{a}], may offer a valid approximation in a non-linear setting if β

_{1} is small. They also suggest the method produces consistent estimates of the exposure effect β

_{1}, our primary focus of interest, in models with a log-link. The method does not ensure consistent estimation of the intercept β

_{0}.

To obtain the

*CE* estimates, we first estimate the parameters of the mean structure and covariance structure in 2 stages using the

*MIXED* procedure (

SAS, 2004c), utilizing only the observed measurements (

**Z**). The first stage uses restricted maximum likelihood (REML) to estimate the vector of parameters, Θ, defining the mean structure. In this first stage, we assume observations of the daily ambient concentration measurements are independent after removing the temporal trends.

Stage-1 Model:

where ϵ ~

(

**0**, σ

^{2}**I**). The fitted values,

=

_{z}(

*t*;

) and the residuals,

**r** =

**Z** −

, are obtained. The default optimization algorithm in

*MIXED* is ridge-stabilized Newton-Raphson. We utilized Fisher’s scoring up to the first 50 iterations. In the second stage, a no intercept model is fit to the residuals to obtain REML estimates of the covariance parameters.

Stage-2 Model:

where e ~

(

**0**,∑), and ∑ takes a spatial exponential form as described in

equation 3. Using the estimates of Θ and ∑, the conditional expectation estimates (

Eq. 4) are then calculated. See

appendix A for details.