|Home | About | Journals | Submit | Contact Us | Français|
Lianfa Li: 0000-0001-8074-7099
Spatiotemporal models to estimate ambient exposures at high spatiotemporal resolutions are crucial in large-scale air pollution epidemiological studies that follow participants over extended periods. Previous models typically rely on central-site monitoring data and/or covered short periods, limiting their applications to long-term cohort studies. Here we developed a spatiotemporal model that can reliably predict nitrogen oxide concentrations with a high spatiotemporal resolution over a long time span (>20 years). Leveraging the spatially extensive highly clustered exposure data from short-term measurement campaigns across 1–2 years and long-term central site monitoring in 1992–2013, we developed an integrated mixed-effect model with uncertainty estimates. Our statistical model incorporated nonlinear and spatial effects to reduce bias. Identified important predictors included temporal basis predictors, traffic indicators, population density, and subcounty-level mean pollutant concentrations. Substantial spatial autocorrelation (11–13%) was observed between neighboring communities. Ensemble learning and constrained optimization were used to enhance reliability of estimation over a large metropolitan area and a long period. The ensemble predictions of biweekly concentrations resulted in an R2 of 0.85 (RMSE: 4.7 ppb) for NO2 and 0.86 (RMSE: 13.4 ppb) for NOx. Ensemble learning and constrained optimization generated stable time series, which notably improved the results compared with those from initial mixed-effects models.
Exposure to air pollution is associated with acute and chronic adverse health outcomes, such as respiratory and cardiovascular morbidity.1,2 Spatiotemporal models that optimally characterize the environment are crucial to estimate exposures to ambient air pollutants with high spatiotemporal resolutions for large-scale epidemiologic studies. However, obtaining reliable estimates of long-term exposures relies on spatiotemporal models that fully capture complex temporal structure (e.g., both short and long-term temporal trends) jointly with multiscale spatial variations (e.g., regional- and local-scale spatial variations). Developing such spatiotemporal models is challenging because measurement data are limited in space and time, and complex, and nonlinear associations exist between predictors (such as meteorological and traffic variables) and pollutant concentrations.3
Many earlier studies used land-use regression or conventional kriging approaches to develop individual spatial models of exposure that capture details across space but not time. Such conventional approaches tend to overfit irregularities of the training data relying on a set of assumptions,4,5 including having access to an unbiased sample of monitoring sites for the population and homogeneity of spatial variation for kriging. When the temporal variability of exposure is ignored in the modeling process, especially when statistical assumptions are violated, the resulting exposure estimates could be affected by large error causing significant biases or large variance.6,7
In more recent years,3,8–12 air pollution exposure modelers started to employ the variants of principal components called empirical orthogonal functions (EOFs)11 for spatiotemporal modeling of air pollutants. In the approach, the first and second principal components accounting for the dominant temporal structure often explain the majority of the long-term and seasonal but not the short-term temporal pollutant variation in the study region. Two of these studies parametrized the temporal basis functions by incorporating time-invariant spatial variables such as elevation, distance to the shorelines and meteorological factors.3,12 However, this two-stage approach generally assumes spatial-temporal independence and is limited in capturing short-term temporal variability in its estimates.
Other methods such as Bayesian maximum entropy have been used to estimate the spatiotemporal concentrations of particulate matter <10 µm (PM10),13 particulate matter <2.5 µm (PM2.5),14 and ozone.15,16 Hierarchical spatiotemporal models have also been developed for PM with a second-order stationary and isotropic assumptions of spatiotemporal covariance.8 However, these methods were based on simulated variograms derived from limited measurement data subject to overfitting biases, and both methods only incorporated spatiotemporal covariates predictive of the means structure to a limited extent.
Here, we developed a novel spatiotemporal modeling framework to estimate nitrogen dioxide (NO2) and nitrogen oxides (NOx) at a high spatiotemporal resolution over a period of 22 years (1992–2013) using extensive data from government routine monitoring networks and rich short-term field sampling campaigns. Routine monitoring data were used to construct the temporal basis functions and to capture long-term and seasonal temporal trends of pollutants in the study region. The short-term samples from three monitoring field campaigns provided neighborhood scale spatial data that better captured intraurban spatial variability and spatial autocorrelation than the models trained just using the routine monitoring data. The modeling framework consisted of three stages: a generalized additive mixed model to capture spatiotemporal variability and spatial autocorrelation at a high resolution, ensemble learning of the mixed models to reduce uncertainty and to better characterize variability in prediction, and constrained optimization to ensure physically- and chemically consistent prediction of concentrations.
This study region (Supporting Information (SI) Figure S1) covers the area of southern California south of the 35.6 degree latitude (~Bakersfield) and includes Los Angeles, Orange, Riverside, Ventura, Santa Barbara, Mohave, San Diego, Imperial Counties and most of San Luis Obispo, Kern, and San Bernardino Counties.
Routine measurements of hourly NO2 and NOx concentrations from 1992 to 2013, recorded at 51 stations, were retrieved from ambient air monitoring networks operated by the California Air Resources Board, South Coast Air Quality Management District (SCAQMD), San Diego Air Pollution Control District (APCD), Antelope Valley AQMD, Mojave Desert AQMD, Imperial County APCD, San Joaquin Valley APCD, San Luis Obispo County APCD, Santa Barbara County APCD, and Ventura County APCD. The concentrations at these stations were measured using Federal reference (chemiluminescence NO/NO2/NOx) methods.
Additional data were generated in intensive field measurement campaigns conducted by the University of Southern California (USC), University of California Los Angeles (UCLA), and University of California Irvine (UCI), respectively. Passive diffusion-based Ogawa samplers17 were used to measure NO2 and NOx, with integrated biweekly samples for the USC and UCLA data and integrated weekly samples for the UCI data. The USC data were collected as part of the Intra-Community Variation campaigns conducted in 12 Children’s Health Study (CHS) communities in 2005–200618 and eight CHS communities in 2008–200919 (in total, 2,542 biweekly samples from 1,104 locations). The UCLA data contain 161 samples collected in Los Angeles County with two biweekly measures (i.e., September 16 to October 1, 2006 and February 10–25, 2007).20 The UCI data contain 32 samples collected in south Los Angeles and Orange counties during 4 weeks (i.e., July 10–18, July 24 to August 1, November 13–21, and December 4–12 in 2009).21
Since most (about 97%) of the field measurements were integrated biweekly samples (mainly from USC and UCLA), we used biweekly averages as the temporal unit of estimation. For the routine measurements, biweekly average concentrations were calculated from hourly data using a 75% completeness criterion. For the field measurements, linear interpolation was used to derive biweekly averages from the UCI weekly data. For a site with a full temporal coverage, a total of 574 biweekly concentrations were calculated from January 1992 to December 2013.
Section 2 and Table S1 of SI provide more details about the measurements of NO2 and NOx and adjustment for the passive data from the field campaigns to minimize systematic bias. SI Figure S1 also shows the locations for the routine and USC sampling sites [the UCI and UCLA sampling locations concealed to comply with specific requirements by their Institutional Review Boards].
CALINE4 is a line source dispersion model that was used to assess the contribution of local motor vehicle emissions to ambient concentrations.22,23 We used CALINE4 to compute mean NOx concentrations from emissions respectively on freeways and nonfreeways. The time-varying NOx estimates by CALINE4 were derived using the quarterly average daily traffic volumes and EMFAC2011 (for 1992–2012)24 and EMFAC2014 (for 2013)25 (see SI Section 3.1 for details).
Traffic density represents distance-decayed annual average daily traffic (AADT) volume in both directions from all roads (freeways/highways and major surface streets) within a circular buffer. Its values were computed by the ESRI ArcGIS density function using a kernel with a 300 m search radius and a 5 m grid resolution. Due to covering a long time period, the traffic densities were scaled by the South Coast Air Basin (SoCAB) EMFAC2011 vehicle fleet average NOx emission factor for 50 mph and 6% heavy-duty vehicle fraction to reflect the composite trend in traffic volumes and emissions over time (see SI Section 3.2 of see SI Section 3.2 for details).
We calculated the distance from the sampling location to the centerline of the nearest roadway by road class based on the ERSI Premium Street Map road network data. The two directions of travel were represented as separate line segments for freeways and other moderate and high volume roads in this data set.
We calculated block group population in 300 m buffers based on the 1990, 2000, and 2010 census block data in ArcGIS and linearly interpolated or extrapolated annual population density for 1992–2013 at the sampling sites.
Meteorological covariates were derived from a high-resolution (4-km) gridded data set of surface daily meteorological variables that cover the contiguous United States from 1979 to 2013.26 Seven meteorological factors were extracted as predictive variables: minimum and maximum air temperatures (°C), specific humidity (grams of vapor per kilogram of air), precipitation (amount of rain per square meter in 1 h (millimeters, mm)), wind speed (meters/second), wind direction (degree), near-ultraviolet and near-infrared spectra (watt/meter2, w/m2).
We obtained high accuracy elevation (at 30 m resolution) data using GoogleMap API27 for each sampling location. We also calculated the shortest distance (meter) to the shoreline of the Pacific Ocean for each sampling location.
We designed a hierarchical modeling framework (Figure 1) with three stages: a mixed-effect spatiotemporal model, ensemble learning, and constrained optimization.
We designed the mixed-effect model that incorporated nonlinear relationships, fixed and random effects from multiple predictors, and spatial autocorrelation to characterize spatiotemporal variability of NO2 and NOx concentrations.
The spatiotemporal estimate (f(s,t)) of the concentrations of NO2 and NOx is quantified using the following formula:
where s refers to spatial location, t refers to temporal parameter, β0 represents the long-term mean concentration, f1(t) and f2(t) are temporal basis functions that represent long-term and seasonal trends, fr(s,t) represents annual regional variation in pollution where regions are defined as the Thiessen polygons derived from the government routine monitoring stations, xi(s,t) represents local variability explained by different local predictors (e.g., CALINE4 estimates, traffic density, population density and meteorological parameters), fs(rs) represents structured spatial effects (rs refers to the region where s is located), fre(rs) represents unstructured spatial effects, and ε represents the residuals.
The seasonal effects (f1(t) and f2(t)) reflect the long-term dominant temporal trend for the study region. We used the empirical orthogonal functions (EOFs) (a.k.a., independent temporal basis functions) and the long-term biweekly concentrations from the 51 routine monitoring stations to derive the dominant basis functions. EOFs were used to present leading modes of spatiotemporal variability of air pollution; their smoothed curves are often used to reduce noise due to random fluctuation.11
We used the yearly average pollutant concentrations for each Thiessen polygon (fr(s,t)) surrounding each routine monitoring station to reflect the regional yearly spatial variability (fixed effect in the model). Thiessen polygons are often used to determine density of point samples and to build meshes for space-discretized analyses.28 Spatiotemporal variability due to local effects (xi(s,t)) was modeled using variables that influence air pollutant dispersion or reflect the type and strength of emission sources, including meteorological parameters, traffic-related variables, population density etc. Traffic-related variables and population density capture the influence of on-road mobile and area emission sources, whereas meteorological parameters mainly influence the environmental processes involved in air pollutant transport, dispersion and removal.
For spatiotemporal factors, we adopted a nonparametric additive method to model nonlinear effects (see SI Section 4.1 for details). Degrees of freedom were limited to 10 to minimize overfitting.
Better characterization of spatial-effect terms in the model development is important to account for the influence of neighboring polygons (spatial autocorrelation). In this study, we used structured spatial random effects to account for spatial autocorrelation not explained by spatial covariates. Additive unstructured random effects29 were also included to account for spatial autocorrelation not fully explained by structured spatial random effects (e.g., other spatially distributed sources of pollutants besides traffic emission and population density). By estimating a structured component and an unstructured component, we can distinguish between the two sources of spatial autocorrelations.30 Our empirical results showed that adding unstructured random effects slightly improved model performance (measured by the deviance information criterion) compared to the one with only structured random effects.
Thiessen polygons were constructed around the central points of the monitoring locations within a certain buffer distance to simulate spatial effects. By sensitivity analysis, a buffer distance of 500 m was selected as an optimal aggregate option due to its good balance between accuracy and computing efficiency. In our model, spatial effects were treated as random variables at the polygon level and incorporated formally as a component of the nonparametric additive terms.
Restricted maximum likelihood was used to solve the geo-additive mixed-effect model. We used the packages of BayesXsrc and BayesX to solve the mixed model31,32 in the statistics software R (Version 3.3). SI Section 4.2 presents the formulas and details about modeling of spatial random effects.
As Stage 2 of the modeling framework, we designed an algorithm of weighted bootstrap aggregation33 for the spatiotemporal models to ensure stable prediction. This algorithm iteratively selected a random sample of size n (n = 18 096, size of the original training data set) with replacement, stratified by traffic index (traffic density and Caline4 estimated concentration), from the original data set, and 90% of the predictors for training. In each iteration, about 63% and 37% of the original data set were selected to train and test the model, respectively.34 So, the final result was equivalent to a 63–37% cross validation. We also conducted a sensitivity analysis where only 2/3 of the predictors were used each time and the resulting model performed slightly worse (R2 decreased by about 4%) than when using 90% of the predictors. The selected samples were used to train multiple different models. The number of iterations (from 10 to 1,210 by a step of 20) was determined using cross-validation to minimize the root-mean-square error (RMSE). The aggregated predictions (mean and standard deviation) are the weighted means of the outputs of all trained models, where the weighting is the square of each model’s R2 (see Section 4.3 for details).
Randomly sampling from both the training data set and the predictors was used to ensure independence between the training samples for different models. Given that each model was trained for different portions of the original data set, the variance in the predictions can be effectively decreased, as demonstrated in the literature of machine learning.7,33 Besides the weighted predicted mean of concentration, the weighted standard deviation (SI eq S6) can be obtained, as an uncertainty indicator to reflect the dispersion of the predicted value.
Stage 2 generated averaged point estimates for specific-location and -time for which the full set of predictors was available. However, the predictors, especially the time-varying covariates, may be temporally incomplete for the entire modeling period. For locations with a large portion of incomplete time-varying covariates, the predictions from Stage 2 might not fully capture the dominant seasonal trend. Thus, as Stage 3 of the modeling framework, we designed constrained optimization to derive optimal coefficients for the temporal basis functions [f1(t) and f2(t) in eq 1)]. While the temporal basis functions represented the principle components of temporal variability for the study region, their coefficients reflected spatial difference in the long term averages and seasonal variation between different locations. Using the basis functions with their coefficients, the full time series of concentrations covering the study period can be simulated for a target location in the study region. Then, the corresponding time-specific estimates could be extracted from the simulated series as adjusted values for the estimates of Stage 2. In constrained optimization, the point estimates from Stage 2 were employed to estimate the coefficients of the temporal basis functions (β0, β1, and β2). Such optimization was solved through quadratic programming.
The constrained optimization aims to minimize the difference between the target concentration to be adjusted and the prediction output from Stage 2 subject to certain constraint conditions.
where yst is the measurement and/or estimate derived from ensemble learning in Stage 2, f1(t) and f2(t) are the temporal basis functions, and β0, β1 and β2 are the coefficients of the temporal basis functions to construct the time series of the concentration over the entire study period.
The intervals (βi lower or βi upper) of the beta parameters were determined from the long-term time series of measurements at routine monitoring stations (using outer fence37 to filter the outliers) and used to constrain the target functions to get stable seasonal trends.
Prediction errors, R-square (R2), RMSE, relative RMSE [NRMSE = normalized RMSE = RMSE/(ymax − ymin), and CV RMSE = coefficient of variation of the RMSD= RMSE/ӯ] were used to evaluate the individual models. To assess prediction error, residual plots were also examined for evidence of over- /under-fitting and heteroscedasticity. Leave-one-subcounty-out cross validations were conducted. In this validation, all samples within one subcounty were held out as the validation data while keeping the remaining data from the other subcounties to train the model, and then this resulting model was used to make prediction for the held-out samples in the subcounties not used to train the model. The goal of leave-one-subcounty-out cross validation was to examine the model’s performance across different sub counties using independent training data set.
Since one important application of the model is to estimate NO2 and NOx exposure for the subjects residing in the CHS communities, we also conducted leave-one-community-out cross validation specifically for the CHS samples. The CHS monitoring locations were highly clustered in space within each community, which created challenges in reliably estimating concentrations at individual sites within certain communities. We also examined the model performance for each individual CHS community.
In ensemble learning, using bootstrap aggregation, we employed about 63.2% of the original data set to train the model to make predictions for the remaining 36.8% of the data set. Similar performance measures (R2, RMSE, relative RMSE) as used for the individual models were calculated using the output from ensemble learning for all samples and for field sampling campaign data separately.
For constrained optimization, the Pearson’s correlation between the adjusted biweekly estimates obtained by constrained optimization and the observed values were respectively computed over the long-term study period for each routine monitoring station. Correlations from all routine monitoring sites then were summarized.
We employed the trained spatiotemporal models to make predictions at the CHS subject locations across southern California. In total, we predicted 1 850 415 biweekly NO2 and NOx concentrations at 10 820 locations for 1992–2013, the time period covering the lifetime residential histories of CHS participants (5845 unique individuals in cohort E). Since one of the two major USC field campaigns occurred in 2005–2006, we obtained the averages for summer (June–August 2005) and winter (December 2005 to February 2006) of those years to create maps for visual checks of spatial and seasonal patterns of pollutant concentrations at subject locations.
Table 1 lists the summary statistics of the concentrations and the sampling locations from routine monitors as well as field campaigns conducted by USC, UCLA, and UCI. The histograms (SI Figure S2) show small skewness for NO2 and large skewness for NOx; thus, we log-transformed NOx to make its distribution more normal.
The temporal basis functions were used to capture the seasonal trend of pollutant concentrations for the study region (SI Figure S3). The first component of the temporal basis trend accounted for 59% of the variance for NO2 and 56% of the variance for NOx. The second temporal basis function explained a lower percentage of variance (about 9% for NO2 and 8% for NOx).
The selected local spatiotemporal variables made different contributions to the variance explained in the mixed model (SI Table S2 where the thresholds are also listed as the filter for the outliers). Among these factors, CALINE4 NOx on freeways and traffic density (300 m-5 km) each accounted for 9–13% of the variance, and population density accounted for 5–11% of the variance. Wind speed and minimum air temperature together account for about 7–8% of the variance. The additive mixed models captured nonlinear associations between predictive variables and pollutant concentrations (SI Figure S4). Generally, traffic density, CALINE4 output and population density were positively and nonlinearly associated with pollutant concentrations.
The Thiessen polygons generated with the optimal 500 m radius were selected for modeling spatial effects (SI Figure S5).
The individual models from Stage 1 achieved an R2 of 0.90 for NO2 and 0.91 for NOx (RMSE: 2.08 ppb for NO2; 10.02 ppb for NOx) with the leave-one-subcounty-out cross validation R2 of 0.83 (RMSE: 5.39 ppb) for NO2 and 0.88 for NOx (RMSE 12.42 ppb) (Table 2). The leave-one-community-out validation specifically for the USC data shows an R2 of 0.71 for NO2 and 0.80 for NOx with the RMSE of 4.51 ppb for NO2 and 9.37 ppb for NOx. The validation results for the CHS communities are presented in SI Table S3 and S4. While the total correlation between the predicted and observed values was 0.95 (RMSE: 2.54 for NO2; 5.43 for NOx), the model performance was not as good for CHS communities with the lowest NO2 and NOx concentrations: Lake Arrowhead and Santa Maria.
Through bootstrap aggregation, we obtained the optimal number (120) of individual mixed-effect models. For the total samples, validation results for the ensemble models showed similar accuracy as individual models (Table 2); for the field campaign samples, the ensemble learning generated better results, in particular showing considerable improvement (12% for NO2; 10% for NOx) for the USC samples, compared with the result of the leave-one-community-out cross validation (Table 2). The residual plots for ensemble predictions between the observed values vs residuals show slight overfitting and no heteroscedasticity (SI Figure S6). Figure 2 shows the residual plots of the ensemble predictions for the USC samples. We also examined R2 and RMSE for each community of the USC samples; worse model performance was observed for several communities, that is, the Lake Arrowhead and Santa Maria, and Anaheim. The data (SI Figure S7) show that Lake Arrowhead and Santa Maria had the lowest concentrations of NO2 and NOx and thus the model slightly overestimated their concentrations, while Anaheim had the highest NO2 concentrations and the model slightly underestimated NO2.
The validation for constrained optimization shows a strong correlation between the simulated time series and the observed values. The mean and median of Pearson’s correlations between the simulated time series and observed values for each routine monitoring station are respectively 0.94 for NO2 (0.96 for NOx) and 0.97 for NO2 (0.99 for NOx) (SI Figure S8 for the boxplot). Even for the sites with the lowest correlation (0.55 for NO2; 0.7 for NOx), the simulated temporal trends were basically consistent with the observed values (SI Figure S9).
Figure 3 presents the plots of observed vs predicted values with the simulated time series generated by constrained optimization for one typical monitoring station. Even for the sample locations with many missing observed values (e.g., with only 4–5 measurements available for the USC sample locations), our approach can capture the basic temporal trends over the long-term period. The summer (June to August of 2004) vs winter (December 2004 to February 2005) average concentration estimates highlight local scale spatial variations with a general declining trend further away from heavily traveled roads, and higher concentration in winter than in summer. Contrast and gradient variations were also observed within the communities (e.g., Anaheim for NO2 in Figure 3; and San Dimas for NOx in SI Figure S10).
In this study, we developed a novel hierarchical modeling framework to make robust predictions for spatiotemporal concentrations of NO2 and NOx over 22 years. Our spatiotemporal model improved over the previous two-stage model approaches3,9,12 that only separated the temporal variability (characterized by temporal basis functions) from the spatial variability (modeled exclusively by spatial variables), but did not make full use of the spatiotemporal variables (e.g., meteorological variables were averaged over the entire study period and treated solely as spatial variables). In the two-stage models, the first two temporal basis functions and their coefficients (estimated by spatial covariates) were used as the dominant predictors. In the two-stage framework, the model’s performance in prediction was limited by the total variance that can be accounted for by the selected temporal basis functions (e.g., only 68% for NO2 and 64% for NOx in the case of this study). Further, the two-stage models assumed that temporal and spatial variances are distinctly separable. In practice, it is often difficult to completely separate the two. Such separation may result in loss of information on temporal variability in predictors and consequently substantial uncertainty in prediction. In this study, we developed a flexible three-stage framework with multiple features to improve model prediction. First, a nonlinear mixed model was developed to best capture both regional and local, as well as long-term and short-term variability in pollutant concentrations in a single model. Then, ensemble learning and constrained optimization were implemented to reduce uncertainty, minimize variance in prediction, and generate stable predictions.
The mixed model has a flexible framework making it easy to incorporate multiple spatiotemporal predictors and spatial effects. For instance, the model incorporated long-term average concentrations (intercept), long-term seasonal trends (the first and second temporal basis functions), regional variation in concentration (subcounty-level yearly averages) and local-scale influential predictors. At this stage, the temporal basis functions and associated coefficients are not the sole basis for the model’s framework but rather used to represent long-term seasonal variation predictors for the study region.
Local variability in the environmental processes (e.g., emission, transport, dispersion, and removal), was represented in our air pollution model by spatiotemporal covariates including traffic indicators, population density, and meteorological factors. In terms of traffic indicators, the CALINE4 and traffic density predictors incorporated quarterly or annual variations in traffic volumes, emission, and wind, which was important to capture temporal variation and trends in local on-road vehicle emissions. These two predictors accounted for a significant proportion of the variances, illustrating influence of traffic emissions on concentrations of NO2 and NOx. Population density, an indirect measure of emissions, explained 5–11% variance. In comparison, the meteorological parameters together accounted for 7–8% of spatiotemporal variability, although CALINE4 also captured part of the meteorological impact.
Nonlinear models were fit to account for local variability. Such models captured the critical points where different trends occurred. For example, in SI Figure S4-a and b, the increase in concentrations with traffic density was more rapid for traffic density below 50 than that above 50. Comparisons between linear and nonlinear models show that the nonlinear model improved the variance explained by about 19–21%.
In this study, we used Thiessen polygons rather than point-based kriging to model spatial autocorrelation. An assumption of kriging is random and even distribution of sample points with homogeneity of spatial variation.38,39 For this study, many sampling points from USC were highly clustered and thus not applicable for kriging. Thiessen polygons remained relatively stable regardless of the density of samples and distribution of spatial variation, and are effective in capturing neighborhood scale spatial variability. Spatial autocorrelation accounted for a significant portion of the variance.
Most previous exposure models4,40,41 used single data sets to fit a single model that was then evaluated using cross validation. The primary drawback of the single learner model, like our mixed-effect model in Stage 1, is that the model may overfit the training data and be variable when applied to new locations and times with their predictors different from the primary range of the training data set. In comparison, the ensemble learner combines individual predictions from different models, and thus it minimizes variation in prediction.42,43 In this study, we trained different mixed models using multiple sets of samples obtained by bootstrap aggregation with different combinations of prediction variables. The final prediction was determined by weighting all outputs of individual models by accuracy, and the standard deviations of the predictions were derived as uncertainty indicators. This approach reduces variance and enhances the reliability of prediction.
The leave-one-community-out cross validation shows a good predictive performance overall. The results varied by individual community, with better performance in communities with moderate pollution levels and relatively poor performance in the mountain communities with lower pollution levels, such as Lake Arrowhead. As expected, the model is not capable of accurately predicting the cases having the lower observed concentration than that of the samples used to train the model.
The coupling of constrained optimization with the temporal basis functions is useful to simulate reliable long-term time series. The constrained optimization leveraged a priori and empirical knowledge (e.g., concentration of NO2 lower than that of NOx, declining trends for NO2 and NOx over the long period, and seasonal variation) and a limited number of point estimates to have an optimal estimation of the parameters for the temporal basis functions, thus extrapolating the concentrations far (1992–2013) from our denser measurement campaigns in 2005 or later. This approach is particularly useful for situations where long-term time series of exposures are needed but the subject locations have incomplete predictor variables (e.g., USC sample locations).
This study employed unbalanced sampling data that included routine measurement data and short-term field campaign measurements. Ideally, one would rely on high spatiotemporal resolution measurements (i.e., frequent measurements over the whole period of 22 years and across the entire study region). However, a number of our short-term samples were spatially clustered. To address this concern regarding clustered data, we made strict leave-one-subcounty-out and leave-one-community-out cross validations to test the model’s actual performance and the results were satisfactory (R2: 0.83–0.88). By subsequent ensemble learning and constrained optimization to decrease bias, the final predictions at CHS subject locations (Figure 4 and SI Figure S10) showed fine concentration gradients within each community, illustrating the model’s capability to estimate within-community variability.
The study has several limitations. First, this is a model of NO2/NOx from traffic pollution since the traffic-related predictors such as CALINE4 NOx and traffic density were used, not a model for prediction of airport or shipping or other stationary combustion sources. For the latter, we need the extra covariates to capture the corresponding sources in the model. Second, there is a potential overfitting problem in the nonparametric nonlinear model. We limited the degrees of freedom (10) for the explanatory variables to decrease overfitting in generalized additive models. Ensemble learning further reduced overfitting. Third, not all of the short-term samples were carefully selected and sited for exposure modeling purposes. Since the USC data were highly clustered, more Thiessen polygons were constructed at these locations with a denser sample, which may result in better spatial resolution for these locations than that for the sparse sample (e.g., overestimation in Lake Arrowhead). Although our method can be applied to different situations (sparse vs dense sampling), when additional samples become available, these can be used to update the model for continual improvement. Fifth, the study region is confined to southern California and 1992–2013 calendar years only, but the modeling approach is easily generalizable to other regions and periods.
The study was supported by the Southern California Environmental Health Sciences Center (National Institute of Environmental Health Sciences’ grant, P30ES007048), the Natural Science Foundation of China (41471376), National Institute of Health USC ECHO LA DREAMERS (UG3 OD023287-01), the Southern California Children’s Environmental Health Center (National Institute of Environmental Health Sciences and the Environmental Protection Agency’s grants, P01ES022845 and RD-83544101-0), the Hastings Foundation, and the National Institute of Environmental Health Sciences (R21ES016379 and R21ES022369), and California Air Resources Board (# 04-323).
The authors declare no competing financial interest.