|Home | About | Journals | Submit | Contact Us | Français|
Our research focuses on the association between exposure to an airborne pollutant and counts of emergency department visits attributed to a specific chronic illness. The motivating example for this analysis of measurement error in time series studies of air pollution and acute health outcomes was a study of emergency department visits from a 20-county Atlanta metropolitan statistical area from 1993–1999. The research presented illustrates the impact of using various surrogates for unobserved measurements of ambient concentrations at the zip code level. Simulation results indicate that the impact of measurement error on the association between pollutant exposure and a health outcome can be substantial. The proposed conditional expectation approach provided reliable estimates of the association and exhibited good confidence interval coverage for a variety of magnitudes of association. Use of a single-centrally located monitor, the arithmetic average, the nearest-neighbor monitor, and the inverse-distance weighted average surrogates resulted in biased estimates and poor coverage rates, especially for larger magnitudes of the association. A focus on obtaining reasonable exposure measurements within clearly defined subregions is important when the pollutant exposure of interest exhibits strong spatial variability.
Epidemiological studies of air pollution and health have a common goal of estimating the effect of exposure to specific airborne pollutants on specific health outcomes. These studies involve inherent measurement error issues, as true exposure, whether personal or ambient, is often not available. In many studies the only exposures available are ambient source concentrations from one or more monitoring stations, which often vary both temporally and spatially. The purpose of this paper is to explore and illustrate the impact of using various surrogates for unknown ambient exposures when estimating the association between local ambient exposure and a health outcome.
Errors-in-variables, or predictor measurement error, can arise when an exposure variable of interest can not be measured directly. Extensive coverage of conceptual and analytic issues related to measurement error can be found in the books by Fuller (1987) and Carroll et al. (1995), as well as in reviews such as those by Carroll (1989), Thomas et al. (1993) and Thurigen et al. (2000).
Dominici et al. (2003) provide details on various epidemiological study designs as well as statistical methods for studying health effects of air pollution. Several papers have explored the impact that exposure measurement error has on the estimation of the association between pollutant exposures and health outcomes (Burnett et al., 1994; Brown et al., 1994; Duddek et al., 1995; Zidek et al., 1998; Dominici et al., 2000; Berhane et al., 2004). Zeger et al. (2000) discuss the difference between the true ambient level and the measured ambient concentration, where the latter is the primary focus of the current study. They suggest that, if the Berkson-type error model holds true when relating ambient to personal exposure, this may have little impact on estimates of the association when the focus is on personal exposure. The Berkson-error model has the property that E[True|Observed] = Observed (Thomas et al., 1993). However, from a regulatory perspective, the error relating measurements of ambient concentration with their true spatially-resolved counterparts, may be of greater interest because it is ambient, rather than personal, pollutant levels that may be regulated.
Sheppard et al. (2005) discuss various study design factors that impact what parameter is being estimated in time-series and panel studies of air pollution effects. Their simulation studies suggest that using a single monitor when there is spatial variation in the concentrations results in small but noticeable attenuation of effect estimates, and that using the average of multiple ambient monitor exposures in time-series studies gives estimates that are less biased. Sheppard (2005) discussed further that when estimation of acute health effects is the goal, the use of ambient concentrations in time-series studies is quite adequate.
Carlin et al. (1998) developed spatio-temporal hierarchical models for the association between ozone and emergency department visits rates in Atlanta. They chose universal kriging methods to obtain exposure estimates and standard errors for each zip code centroid. Using measurements from 10 monitors across the Atlanta metropolitan area, Gelfand et al. (2001) interpolated ozone levels within each zip-code region. They consider a fixed spatial case, and then extend to a spatio-temporal model.
The motivating example for this analysis of measurement error in time series studies of air pollution and acute health outcomes was a study of emergency department (ED) visits from a 20-county Atlanta metropolitan statistical area from 1993–1999. Metzger et al. (2004) and Peel et al. (2005) analyzed the association between ambient concentrations of various pollutants and cardiovascular and respiratory outcomes, respectively. They developed single-pollutant models for particulate matter (PM10), ozone, nitrogen dioxide (NO2), carbon monoxide (CO), and sulfur dioxide (SO2). Health outcome data consisted of information on emergency department visits of residents of the Atlanta metropolitan statistical area for 31 area hospitals, including ICD-9 diagnostic codes. Counts of ED visits were aggregated across the region for each day and associated with ambient concentrations of single pollutants measured at a centrally located monitor.
For our illustrative analysis, we focus on the Atlanta area within 30 km of the central downtown monitor operated by the Aerosol Research and Inhalation Epidemiology Study (ARIES). We focus our attention upon the pollutant nitrogen dioxide (NO2) for the year 1999. NO2 is a primary pollutant that has been shown to exhibit marked spatial variability (Wade et al., 2006) and to be positively associated with various cardiovascular diseases (Metzger et al., 2004) and respiratory diseases (Peel et al., 2005). NO2 is measured at 4 locations in the above defined area. Data on particle composition and physical characteristics collected by the ARIES station are available from August 1, 1998 forward. ED visits from 94 zip codes having a centroid within this area were included in the analysis. Visits were determined to be associated with cardiovascular disease using reported ICD-9 codes.
The purpose of this analysis is to illustrate the impact of various surrogates for true ambient concentration on the estimation of the association between ambient exposure and ED visits related to a specific illness or disease of interest. In particular, we propose approaches designed to reduce bias due to spatial variability of the pollutant measurements. In the process, we also explore the feasibility of estimating spatial variability using likelihood-based methods when measurements are available from only a small number of locations.
A common approach (Metzger et al., 2004; Peel et al., 2005) to modeling daily event counts, such as ED visits, involves fitting a generalized linear model with a log-link and Poisson error (McCullagh and Nelder, 1989) such as the following: Rht ~ P, where
Here Rht and nht are the number of visits for a specific illness and total number of patient ED visits from subregion h based on residential zip code on day t, respectively. Xht is the true, unobserved ambient-level concentration of a specific pollutant at the centroid of subregion h on day t. While we use the notation Xht for simplicity, often the exposure of interest is the previous day’s concentration or possibly a moving-average of current and prior concentrations. (t;Ψ) can include covariates such as day-of-the-week and smooth functions of calendar time and weather to model long-term temporal trends, seasonality, and other relevant events such as influenza epidemics, and to account for any additional temporal correlation in the count time series (Dominici et al., 2000; Metzger et al., 2004; Peel et al., 2005; Peng et al., 2006). In equation 1, β0 + (t;Ψ) represents the log(baseline rate) for the outcome in the absence of exposure. β1, the parameter of interest, is the log relative risk associated with a unit increase in the daily ambient concentration of the pollutant.
Given the usual assumptions, the above model would be be reasonably straightforward except for the fact that Xht is unknown for all h and t. Ambient air quality measurements are not available for each subregion, but rather there are a limited number of monitors located throughout the region, each providing a daily measure of ambient concentration (Zmt) at that specific location. One might think of this as a classical missing data problem (Little and Rubin, 2002), albeit an extreme example, however information provided by the correlation structure of the observed measurements can inform us about the unobserved observations. The problem can also be placed in a measurement error framework (Carroll et al., 1995), by considering the set of daily monitor measurements Zt as a source of potential surrogates for the unknown exposure measurement Xht. We use the term surrogate as defined by Carroll et al. (1995). That is, we assume f(R|Z,X,Ψ) = f(R|X,Ψ), i.e., non-differential error whereby Z offers no information about R once X is known.
In this section we investigate a conditional expectation approach designed to adjust for measurement error in ambient exposures due to spatial variability. Assuming known mean and variance, this method is equivalent to the use of the simple kriging predictor. See Schabenberger and Gotway (2005) for details on kriging.
Currently published studies have utilized a variety of surrogate measures for ambient exposure. Metzger et al. (2004) and Peel et al. (2005) fit models using a single, centrally-located monitor. Zeger et al. (2000) and Wade et al. (2006) recommend averaging, possibly using a spatial average, over monitors within a specific region. Burnett et al. (1994) divided a geographical region of interest into subregions and assigned each subregion a single monitor.
Wong et al. (2004) compared four different weighted average interpolation methods: spatial averaging, nearest neighbor, inverse distance weighting, and ordinary kriging, to assess each method’s influence on the estimated exposure measurement. Other commonly used surrogates are to utilize a single, centrally located monitor or the arithmetic average of all or a subset of exposure measurements. We take the comparison of surrogates a step further by assessing the impact of each surrogate measure on the estimation of the association between exposure and a health outcome of interest.
We propose the following method where we model the temporal and spatial variability exhibited by the observed ambient concentrations through the mean and covariance structures, respectively. Utilizing estimated parameters, one can then interpolate concentration levels at each location. This method is equivalent to kriging. Modeling the spatial variability is based on describing a region’s air pollution measurements as a spatial random field in which the spatial dependence of measurements at different locations can be expressed through the variance-covariance matrix.
More specifically, let t = 1,…, T denote the day of observation, m = 1,…,M the monitor, h = 1,…,H the subregion, and L = M +H. We can then express Z, the vector of observed concentrations, as:
and X, the vector of unobserved ambient exposure levels, as:
We then assume the following model:
where and e ~ (0,∑).
ζ(t;Θ) is a function that defines the regional average exposure for each day. As in model 1, long-term temporal trends are accounted for via smooth functions of calendar time with pre-specified knots. Variables such as day-of-the-week, precipitation, wind direction and magnitude can be included to further account for weather, traffic patterns and any additional temporal correlation in the time series. Let Θ be the vector of parameters corresponding to the time-specific covariates included in the model.
Conditional on the fixed covariates indexed by Θ, we assume that each day’s set of pollutant measurements is an independent realization of the underlying spatial process of the pollutant measurements across a defined region. We seek to explore the impact of various surrogates for the unknown subregion ambient exposures. For this purpose take advantage of parametric isotropic models (Waller and Gotway, 2004) to estimate the spatial covariance parameters. Conditional on the covariates, we assume the spatial correlation between exposure measurements is the same for all pairs of equally distant locations and is not dependent on direction, i.e., we assume stationarity and isotropy, respectively.
Under the defined structure, ∑ is a block diagonal matrix where there are identical blocks of size L = M + H for each day, t. For example, if we assume a spatial exponential covariance structure with a nugget effect given 2 monitors (M = 2) and a single subregion centroid (H = 1), i.e. L = 3, then for day t,
This model gives rise to the following variance-covariance structure:
where, under an exponential model, and dll′ is the distance between location l and l′.
Once a structure is chosen for ∑, maximum- or restricted maximum-likelihood methods are available to estimate the covariance parameters. We take advantage of the MIXED procedure’s (SAS, 2004c) ability to fit models with spatial covariance structures. We emphasize that the estimation of the covariance parameters from a parametric model using likelihood methods with a handful of monitors can be challenging, and untenable without the assumption of independence across days. Nevertheless, the following analysis and simulation study illustrate that estimation of the covariance parameters is still possible given only a handful of monitors.
Under the model framework described above, utilization of multivariate normal () theory enables calculation of the conditional expectation (CE) of the unobserved exposure measurements (X) given the observed exposure measurements (Z):
where Za = (z11,…, z1T, z21,…, z2T ,…, zM1,…, zMT)′ is a vector of observed concentrations from the M monitors on T days, and Xa = (x11,…, x1T, x21,…, x2T ,…, xH1,…, xHT)′ is a vector of the unknown concentrations. In practice, unknown parameters are replaced by estimates; for example ζZa(t;Θ) and ζXa(t;Θ) are both replaced by Za(t;Θ), the vector containing estimates of the average exposure for each day across the region based on the observed exposures, za, and day-specific covariates. Liang and Liu (1991) suggest that Whittemore’s approach (Whittemore, 1989), which replaces the true value Xa by E[Xa|Za], may offer a valid approximation in a non-linear setting if β1 is small. They also suggest the method produces consistent estimates of the exposure effect β1, our primary focus of interest, in models with a log-link. The method does not ensure consistent estimation of the intercept β0.
To obtain the CE estimates, we first estimate the parameters of the mean structure and covariance structure in 2 stages using the MIXED procedure (SAS, 2004c), utilizing only the observed measurements (Z). The first stage uses restricted maximum likelihood (REML) to estimate the vector of parameters, Θ, defining the mean structure. In this first stage, we assume observations of the daily ambient concentration measurements are independent after removing the temporal trends.
where ϵ ~ (0, σ2I). The fitted values, = z(t; ) and the residuals, r = Z − , are obtained. The default optimization algorithm in MIXED is ridge-stabilized Newton-Raphson. We utilized Fisher’s scoring up to the first 50 iterations. In the second stage, a no intercept model is fit to the residuals to obtain REML estimates of the covariance parameters.
True ambient concentrations were simulated for each day of two calendar years for each of five monitored locations and 94 centroid locations, for a total of 99 locations. The actual locations of the 94 Atlanta area ZIP code centroids were used. The five monitor locations were distributed across the region such that one was at the center and the other four towards the outer perimeter. We used a publicly available Excel spreadsheet (Dutch, 2005) to convert monitor and ZIP code centroid locations from latitude-longitude to northing-easting scale based on the Universal Transverse Mercator (UTM) System. This allowed for a scale in kilometers (km) for graphing purposes as well as the unit of measure for the effective range.
Exposure data were simulated as the sum of three sine curves to loosely mimic the short-and long-term trends of NO2:
where for all monitor and centroid locations. We chose this method for defining the smooth trend of average daily exposure for its relative ease, as well as for the realism and assessment of using cubic splines to smooth temporal trends when the underlying functional forms of the temporal trends are unknown. Further day-to-day variability is incorporated via the error term, e. Here, e ~ (0,∑), where ∑ takes a spatial exponential form with a nugget effect as described in equation 3. Covariance parameter values were chosen loosely based on the 1999 NO2 data from Atlanta: partial sill , effective range (ρb) = 50 km, and nugget . The simulated pollutant measurements had a mean and standard deviation of approximately 50 and 17 ppb, respectively. Figure 1 shows a time-series plot of the simulated exposure measurements with corresponding cubic smoothing splines fit through the data for two of the monitors.
Approximately 1,000,000 ED visits were dispersed across days and ZIP codes for a 2-year period. Using the observed daily counts and total annual count of ED visits, a rate for each day of the year was calculated. Using the CALL RANTBL routine (SAS, 2004a) and the daily rates, each of the 500678 observed visits were randomly assigned to one of 365 days. Within each day, each visit was then assigned to one of the 94 zip codes using the same routine. Distribution of the probability mass across the 94 zip codes loosely replicated the observed; probabilities ranged from 0.01% to 2.6%. The 2-stage process was repeated using different seed values to obtain counts for a second year. These daily subregion counts of all ED visits, ranging from 0 to 65 with a median of 12, were fixed for all subsequent simulation runs. Daily counts of an event of interest, i.e. ED visits related to a specific cardiac or respiratory event, were simulated from a binomial distribution for each of the 94 subregions (i.e. ZIP codes) using the same day exposure of the true ambient concentrations. That is, Rht ~ B(nht, λht) where nht is the total number of ED visits from ZIP code h on day t. ln(λht), the natural log of the probability of the visit being attributed to the health outcome of interest, was defined by the following:
As some subregions had very small total numbers of ED visits, simulating from a binomial distribution insured that the number of events would always be less than or equal to the total number of ED visits for each day in any particular subregion.
The coefficient of the current-day’s exposure, β1, was set at: 0.001, 0.005, 0.01 and 0.05. We incorporate smooth functions of calendar time via cubic splines with knots on the 21st of March, June, September and December. See Green and Silverman (1994, Ch.2) or Seber (1977, Ch.8) for details. Indicator variables for day-of-week (DOW) and federal holiday (HOLIDAY) were included in an attempt to more accurately reflect the true nature of ED visits. All terms on the right side of the above equation, except β1Xht, were fixed across all simulations.
The monitor located near the center of the region was chosen as the central monitor. Utilizing all five monitors’ observed measurements, predictions using each of the four methods were calculated: daily average, inverse-distance weighted, nearest-neighbor, and the conditional expectation (CE) estimates.
Temporal trends were simulated using a mixture of several sine curves (see section 4.1), however in the stage-1 model we remove the long-term temporal trends by fitting smooth functions of calendar time via cubic splines with knots on the 21st day of each month. Indicator variables for day-of-the-week (DOW) were also included. In the stage-2 model we assume a spatial exponential structure with a nugget effect to estimate the spatial covariance parameters based on modeling the residuals from the stage-1 model. Using the estimated parameters of Θ and ∑, we then calculate the estimated conditional expectations for each day at each location.
Using each of the proposed surrogates, a time-series model was then fit using the GEN-MOD procedure (SAS, 2004b) under an assumed Poisson distribution with a log link as described in equation 1 with covariates HOLIDAY, DOW, and cubic spline with knots on the 21st day of each month. Invoking the Poisson approximation to a binomial (Dudewicz and Mishra, 1988), the Poisson model for event counts remains reasonable given that events are rare (i.e., small Rht). For comparison purposes in the simulation study, we also fit the health outcome model using the true CE, based on the known Θ and ∑, and the true centroid subregion measurements, X.
Table 1 gives the average and standard deviation of 1000 simulation runs for the estimated covariance parameters. Note the true parameters chosen for the simulation reflect both the spatial correlation as well as the large amount of unexplained variability, either due to unknown explanatory variables or to small-scale variation (e.g. local sources, instrument error) that exist in the real data. Even with only a few monitors, reliable estimation of these spatial covariance parameters is achieved due to the many independent replicates of the spatial process over time.
Table 2 gives the average and standard deviation of 1000 simulation runs for the estimated β1’s in the four different association scenarios, along with the average of the estimated standard errors. Coverage rates, average relative risk (RR) and average corresponding lower and upper confidence bounds are also given for a 20 ppb increase in exposure. Figure 2 and Figure 3 provide histograms of the 1000 estimates of β1 for a visual comparison of the methods’ performances for β1 = 0.001 and 0.05.
Table 2 indicates that even when β1 is small, there is considerable bias and sub-optimal coverage of β1 when using the central monitor and nearest neighbor surrogates. The arithmetic and inverse-distance weighted averages exhibit less bias than both the CM and NN; however, both are clearly more biased than the estimated CE approach. Results also show an increase in the amount of bias and a decrease in the coverage rates for the CM, NN, AVE, and IDW surrogates when the magnitude of the association increases, i.e., as β1 gets larger. In contrast, the CE approach accurately estimates β1 for all four magnitudes of β1, and provides near-optimal coverage for the three smallest magnitudes of association. The bias remains small for 20β1 = 1 (i.e., β1 = 0.05). However, the coverage rate suffers in this case, indicating the possible need for a variance adjustment when effect sizes are larger than those seen in real world ambient pollution and health studies. The performance of the estimated CE method is quite respectable, making it the favored approach.
To further illustrate the utility of the CE approach, we apply the method to 1-hour daily maximum NO2 data from the Atlanta metropolitan area from the year 1999.
Prior to estimating the spatial covariance parameters, we fit the following stage-1 model to ambient NO2 to remove the large scale trend as well as any temporal correlation:
where wj(t) = (t − τj)3 if t ≥ τj and wj(t) = 0 otherwise. ϵmt ~ (0, σ2) and Θ = (α0,…, α23,γe,1,…,γe,15).
The exposure model includes indicator variables for day-of-the-week (DOW) as well as same-day (RAIN0) and previous day (RAIN−1) precipitation. Also included are average wind direction (WINDDIR), average wind vector magnitude (WINDV EC) and the corresponding interaction between the two. Average wind direction was categorized into four 90° quadrants, measured from 0°N, and wind magnitude was categorized using the quartiles as cut-points. Weather variables (precipitation, wind vector and wind magnitude), measured at Hartsfield-Atlanta International Airport, were obtained from the National Climatic Data Center network.
Cubic splines were included to model the temporal trends with knots on the 21st of each month. We also included the previous day’s average exposure (t−1) to remove any residual autocorrelation not handled by the cubic splines. The residuals obtained from the above model were then modeled in turn, assuming a spatial exponential covariance structure with a nugget effect, to estimate the spatial covariance parameters.
Using the parameter estimates from the stage-1 and stage-2 exposure models above, the CE estimate was calculated at each centroid for each day. For comparisons, the outcome data was modeled in five ways: the ARIES monitor as the central monitor (CM), the nearest monitor’s exposure for each of the 94 ZIP code centroids, the inverse-distance weighted average, the arithmetic average, and the CE estimate. We assume the following health effects model:
The exposure of interest, is the previous day’s ambient concentration. Covariates included are federal holiday and day-of-the-week, as well as smooth functions of calendar time, g1(time,quarterly), with knots on the 21st of March, June, September and December and smooth functions for average daily temperature and dew point with knots fixed at the first, second and third quartiles.
Table 3 gives the results from the five models for the model of cardiovascular (CVD) disease visits, which for 1999 had an average daily observed rate of ≈ 0.021. The estimated covariance parameters for the 1999 NO2 measurements can be found in the footnote of Table 3. As in the simulation study, results show that the point estimates of the association between previous day’s ambient concentration of NO2 with cardiovascular disease using the conditional expectation surrogate are considerably larger than those using either the central monitor or arithmetic average surrogates, both of which ignore spatial variability. In contrast to the simulation trends, in this particular case the inverse-distance weighted average and the nearest-neighbor surrogate estimates are larger than the CE estimate. We note that in the simulation study where 20β1 = 0.02, ≈ 30% of the simulations resulted in NN and/or IDW estimates that were greater than the estimates from the CE approach. Because the estimation of β1 is dependent upon the choice of covariance structure, in practice we recommend a sensitivity analysis to assess the impact of different covariance structures.
Our results demonstrate that when exposure concentrations exhibit spatial variation across a defined region of interest, using a single centrally-located monitor, the nearest neighbor monitor, or the arithmetic or inverse-distance weighted average of several monitors can lead to substantial bias, generally, underestimation of the effect, even when the true association between exposure and outcome is small and the outcome event is rare. The impact of measurement error on the association between pollutant exposure and a health outcome can be substantial. A focus on obtaining reasonable exposure measurements within clearly defined subregions is important when the pollutant exposure of interest exhibits strong spatial variability. Although there are some limitations and computational pitfalls associated with staying within a likelihood framework, we have shown that having many independent replicates of a defined spatial process over time allows one to estimate spatial covariance parameters reasonably well. Having these estimates then allows one to impute subregion exposures that can be incorporated effectively into the health outcome model. This method provides reliable estimates of the association and exhibits good CI coverage for associations of typical magnitudes. With the real data, we estimated the increase in relative risk for a corresponding 20 ppb increase in exposure to be ≈ 1.06 for cardiovascular disease. Results for similar magnitudes of association from the simulation indicate that we can expect near-nominal CI coverage rates in such a setting.
In fitting the health outcome models discussed here, we assumed that the cubic splines adequately addressed the serial correlation. We refer to Metzger et al. (2004), where results of sensitivity analyses which fit GEE models with a stationary 4-dependent correlation structure indicated minimal serial correlation for various cardiovascular diseases. This assumption may not hold for some health outcomes of interest and thus one would need to explore models that account for the additional correlation, regardless of the surrogate used.
The focus of future work could involve incorporating temporal correlation into the spatial fields in an effort to bridge the likelihood approach presented here and the hierarchical Bayesian spatio-temporal models described in Banerjee et al. (2003).
R.H.L. was supported in part by and R01 from the National Institute of Environmental Health Sciences (ES012458), and by an Early Career Award from the American Chemistry Council and the American College of Epidemiology. This work was also supported by grants from: National Institute for Environmental Health Sciences (R01ES11294), U.S. Environmental Protection Agency (R82921301-0), and the Electric Power Research Institute (EP-P27723/C13172). We would like to thank Kristi Metzger, James A. Mulholland, Jennifer Peel, Stefanie E. Sarnat, and Jing Yu for all their helpful input and discussions.
To estimate the conditional expectation we first need to rearrange Z* (eq. 2) such that daily observed measurements are grouped within monitor, followed by the daily unobserved measurements grouped within centroid (see description following Eq. 4). That is,
ζ(t;Θ) and ∑ are then rearranged in the following way:
The rearrangement of ∑ results in a matrix that is no longer block diagonal, but rather a matrix that consists of L2 diagonal matrices of size T × T. The matrices on the diagonal of ∑a take the form IT, where the diagonal elements of each matrix correspond to The off-diagonal matrices of ∑a take the form IT, where the diagonal elements of each matrix correspond to For example, assuming the function f(·) takes a spatial-exponential structure, then for M=2, H=1, and T=2, and the components of ∑a can be expressed as:
If we assumed that ζXa(t;Θ) and ∑a were known, the conditional expectation under a multivariate normal framework, as defined above, would be equivalent to the simple kriging predictor psk(Za, s0) (Schabenberger and Gotway, 2005). That is, given the random field Z(s) : s D d, where Z(s) = [Z(s1),…,Z(sn)]′ is the vector of observed data at locations s1,…, sn, the simple kriging predictor, assuming µ(s) and ∑ are known, is: psk(Z; s0) = E[Z(s0)|Z(s)] = µ(s0)+σ′∑−1(Z(s)−µ(s)), at location s0. This optimal linear predictor is the best linear predictor under squared-error loss (Schabenberger and Gotway, 2005).
In what follows, we provide a transformation using Kronecker products that makes calculation of the conditional expectation feasible, even for the purpose of repeated simulation studies. First, recall that Let ζXa(t;Θ) = = (1, 2,…, H)′ be a vector of length H × T, where each h is a vector, length T, of the expected measurements for each day. And let zd = za −ζZa(t;Θ), where zd is a vector, length M × T, of each day’s difference between the observed measurement, zmt, on day t at monitor m and the expected measurement for each day, t. Then, . Using Kronecker (direct) product notation, ΣXaza=SXaza IT, where
and ΣZa = SZa IT, where
and IT is an identity matrix of size T×T. Using the properties of Kronecker products, it then follows that . Now, if we let where each Sh is a row vector of length M, then
To reduce the computation time for calculating the conditional expectation for each day at each location, we can iterate through each location one at a time by calculating E[Xa(h)|Za] = h + (Sh IT)zd. This enables efficient calculation of the CE estimates.