An integrated exposure model was developed that estimates nitrogen dioxide (NO2) concentration at residences using geographic information systems (GIS) and variables derived within residential buffers representing traffic volume and landscape characteristics including land use, population density and elevation. Multiple measurements of NO2 taken outside of 985 residences in Connecticut were used to develop the model. A second set of 120 outdoor NO2 measurements as well as cross-validation were used to validate the model. The model suggests that approximately 67% of the variation in NO2 levels can be explained by: traffic and land use primarily within 2 km of a residence; population density; elevation; and time of year. Potential benefits of this model for health effects research include improved spatial estimations of traffic-related pollutant exposure and reduced need for extensive pollutant measurements. The model, which could be calibrated and applied in areas other than Connecticut, has importance as a tool for exposure estimation in epidemiological studies of traffic-related air pollution.
Outdoors, traffic is the primary source of nitrogen oxides (NOx) that convert to nitrogen dioxide (NO2). Traffic is believed to be responsible for at least half of NOx emissions in general and accounts for a higher proportion in urban areas (HEI, 2010). NO2 is also a precursor for several other air pollutants including nitric acid and ozone (WHO, 2003). Of particular relevance when assessing exposure to NO2 is the observation that levels can vary significantly over small distances (Jerrett et al., 2005; WHO, 2003).
Residential exposure to traffic or to specific traffic-related pollutants has been assessed using a variety of measurements including traffic proximity (self-reported (Duhme et al., 1996) or measured distance (Brunekreef et al., 1997)), traffic volume (Lin et al., 2002; Livingstone et al., 1996), traffic-related pollutant levels measured at central site monitors (Gent et al., 2009), and estimates from complex chemical transport models (Byun and Schere, 2006; Jerrett et al., 2005). Each method has limitations. Measures using distance (self-reported or measured), or traffic counts, for example, can result in exposure misclassification by failing to capture variability in traffic volume on road networks in the vicinity of a residence. Traffic volume is considered in some models that assume a pollutant dispersion function that is Gaussian (English et al., 1999; Pearson et al., 2000) or logarithmic (Pleijel et al., 2004). The most complicated chemical/dispersion models, e.g., CALINE4 (Bensen, 1989; US EPA, 2003), require detailed data inputs that are generally unavailable: pollution sources and emissions figures, fleet composition and fuel types, topographical information, atmospheric and meteorological data. The applicability of CALINE4 to locations and time periods other than California in the 1970s (where the model was developed and when the vehicle fleet had different pollution characteristics than the 2000s) is questionable.
Holford et al. (2010) developed a framework for estimating integrated exposure to traffic that uses publicly available data to estimate traffic volume in buffers surrounding a residence. No prior assumptions were made as to the pollutant dispersion functions, which were explored with step, polynomial, and spline models. Using a step function to predict NO2 measured outside of 120 homes in Connecticut, Holford et al. (2010) present models of NO2 as isotropic functions of traffic volume (i.e., independent of direction with respect to a residence) (R2 = 0.510) and anisotropic functions (traffic volume in buffers divided into four quadrants north, east, south, west of a residence) (R2 = 0.513). No additional covariates were considered. The model presented here extends these results by using many more NO2 samples taken over a larger area and longer time span and by including landscape characteristics (land use, elevation, population density) and season of sampling. We used NO2 measurements at 985 residences throughout the state of Connecticut combined with other readily available, geographically-based information to develop a model to predict residential outdoor NO2 concentration.
Geographic data used to create variables for model development (e.g., residential locations of nitrogen dioxide [NO2] measurements, traffic counts on interstate and numbered highway road segments, land use, population density and elevation) were processed with ESRI's ArcGIS 9.3.1 (ESRI, 2006), using Universal Transverse Mercator (UTM) Zone 18 North projection. Locations of NO2 samples used for model development were determined by taking Global Positioning System (GPS) measurements at study subjects' homes. Trained research assistants using GPS devices (Garmin® models eTrex Legend Cx and Nuvi 760) were instructed to stand at the front door (or nearby in the case of heavy ground cover at the front door, i.e., porch or trees), wait for an indication of a minimum of three satellites, then record the latitude and longitude. Addresses of NO2 sample sites used in model validation were geocoded to point locations using the ESRI StreetsUSA database (ESRI, 2003) to obtain latitude and longitude coordinates. Hawth's Tools extension for ArcGIS was used to derive land use covariates (Beyer, 2004).
NO2 levels were sampled outside of 985 residences of families participating in a year-long, follow-up study of air pollution and childhood asthma (Leaderer). The study took place in Connecticut, began in April, 2006, and concluded in August, 2009. NO2 samples, up to four at each residence (one per season) were taken using Palmes tubes (Palmes et al., 1976) placed outside the back door of the home and left in place for one month. Sampler analysis resulted in a mean daily NO2 concentration for the monitoring period. A total of 3140 samples were available for use in model development: 508 (52%) contributed 4 samples, 262 (27%) contributed 3 and 22% contributed 2 (N=107) or 1 (N=108). Similarly collected NO2 data from Connecticut residents participating in a study conducted in 1994 were used to validate the model and assess its accuracy (Beckett et al., 2006; Pettigrew et al., 2004; Triche et al., 2002; Triche et al., 2005; Triche et al., 2006). The 120 NO2 samples used here for model validation were used previously for our earlier model development (Holford et al., 2010). Sample collection date for each sample was defined as the mid-point of the monitoring period. This date was used in analyses involving season of sampling (see section 3, below).
In an approach described by Holford et al. (2010) and using traffic data for 2006 provided by the Connecticut Department of Transportation (CT DOT), a line file of interstates and numbered highways was divided into approximately 50-meter segments, each of which had an associated average daily traffic (ADT) count (CT DOT, 2006). The midpoint of each segment was defined and a measure of vehicle-distance traveled on that segment was created by assuming that the contribution of this point source to NO2 concentration was proportional to the product of segment length and ADT. Residential buffers were defined in 500 m steps within 1 km of each residence and in 1 km steps beyond 1 km (i.e., buffer 1 = 0 to 500 m, buffer 2 = 500 m to 1 km, buffer 3 = 1 km to 2 km, buffer 4 = 2 km to 3 km, …, buffer 11 = 9 km to 10 km). Traffic variables used in model development were calculated by summing the contributions of all point sources within a buffer and dividing by 10 000 (in order to produce similar orders of magnitude for all model inputs). This created a measure of vehicle-kilometers (unit = 10 000 vehicle-kilometers) traveled within the range of distances that defined each buffer. For example, a value of 4.5 in a buffer would indicate an average of 45 000 vehicle-kilometers traveled per day within that buffer. Fig. 1 illustrates the location of the ADT counts on road segments within a circular buffer surrounding a residence.
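The buffer summation described above can be sketched in a few lines of Python. This is an illustration only, not the ArcGIS workflow used in the study: the tuple layout for road segments, the function names, and the choice to assign a midpoint lying exactly on a buffer edge to the inner buffer are all assumptions.

```python
import math

# Outer edges (m) of the 11 concentric buffers described in the text:
# 0-500 m, 500 m-1 km, then 1 km steps out to 10 km.
BUFFER_EDGES_M = [500, 1000] + [1000 * k for k in range(2, 11)]


def buffer_index(distance_m):
    """Return the 0-based buffer index for a distance, or None beyond 10 km.

    Assumption: a point exactly on a buffer edge belongs to the inner buffer.
    """
    for i, outer in enumerate(BUFFER_EDGES_M):
        if distance_m <= outer:
            return i
    return None


def traffic_by_buffer(residence_xy, segments):
    """Sum (segment length in km) * ADT per buffer, in units of 10 000 vehicle-km.

    segments: iterable of (mid_x_m, mid_y_m, length_m, adt) tuples in projected
    coordinates (e.g. UTM metres); this layout is hypothetical.
    """
    totals = [0.0] * len(BUFFER_EDGES_M)
    rx, ry = residence_xy
    for mx, my, length_m, adt in segments:
        i = buffer_index(math.hypot(mx - rx, my - ry))
        if i is not None:
            totals[i] += (length_m / 1000.0) * adt
    return [t / 10_000 for t in totals]


# Two made-up 50 m segments: one 300 m away, one 1.5 km away.
example = traffic_by_buffer((0.0, 0.0), [(300, 0, 50, 90_000), (0, 1500, 50, 20_000)])
```

In this made-up example the first segment contributes 0.05 km × 90 000 = 4 500 vehicle-kilometers to buffer 1, and the second contributes 1 000 vehicle-kilometers to buffer 3.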
Land use data for Connecticut were obtained from the University of Connecticut's Center for Land Use Education and Research (CLEAR) (CLEAR, 2006). The most recent data available at the time of analysis showed land use in the year 2006, stored as a raster file with 30-meter pixels and classified into 12 categories of land use: developed, deciduous forest, coniferous forest, turf and grass, other grasses, agricultural field, water, non-forested wetland, forested wetland, tidal wetland, barren, utility rights-of-way (CLEAR, 2006).
Since all pixels in the file are of equal size, the proportion of a particular category of land use within a buffer is equal to the count of pixels of that land use category divided by the total pixels within a buffer. Land use measures were calculated using the “Thematic Raster Summary (by polygon)” tool in the Hawth's Tools ArcGIS extension (Beyer, 2004). The sum of pixels of each category of land use within each buffer was multiplied by 900 (area in m2 of a 30 m pixel) / 10 000 (area in m2 of a hectare) to give the area in hectares of each category of land use within a buffer. Fig. 1 illustrates “pixels” within a residential buffer where different colors represent different land use categories.
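The pixel arithmetic is simple enough to show as a worked sketch. The function name and the example counts below are hypothetical; only the 30 m pixel size and the conversion to hectares come from the text.

```python
PIXEL_SIDE_M = 30          # raster resolution stated in the text
M2_PER_HECTARE = 10_000


def land_use_summary(pixel_counts):
    """Convert per-category pixel counts for one buffer to (hectares, proportion).

    pixel_counts: dict mapping a land use category name to its pixel count.
    """
    total = sum(pixel_counts.values())
    hectares_per_pixel = PIXEL_SIDE_M ** 2 / M2_PER_HECTARE  # 900 / 10 000 = 0.09 ha
    return {
        category: (n * hectares_per_pixel, n / total if total else 0.0)
        for category, n in pixel_counts.items()
    }


# Made-up counts for a single buffer.
summary = land_use_summary({"developed": 500, "deciduous forest": 300, "water": 200})
```

Here 500 developed pixels correspond to 45 hectares and half of the buffer area.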
Population density (per square mile) for each residence was obtained from a polygon layer of Connecticut census tracts from 2005 in ArcGIS (ESRI, 2005) and assigned to each residence as the mid-year census tract population divided by the area of the census tract in square miles.
For the final model, residential buffers were divided into quadrants to reflect the cumulative prevailing wind during the course of the study (illustrated by the orange lines in Fig. 1). Meteorological data, including wind speed and direction, were obtained for four weather stations in Connecticut (Sikorsky Memorial Airport in Bridgeport, Municipal Airport in Danbury, Tweed Airport in New Haven and Bradley International Airport in Hartford) (NRCC, 2010). Using wind direction data from all four stations, a cumulative wind rose plot for the study period (April 2006–August 2009) was created by determining the distribution (as a percent of total study time) of prevailing wind from each of 16 directions (Lakes Environmental Software, 2008). Data for wind speed were also incorporated into the wind rose plot.
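The directional part of a wind rose amounts to binning observations into 16 sectors of 22.5° each. The sketch below assumes equally weighted observations given as the compass direction the wind blows from, in degrees; it ignores wind speed and is not the software used in the study.

```python
from collections import Counter

SECTORS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]


def sector(direction_deg):
    """Map a wind direction in degrees to one of 16 sectors centred on N, NNE, ..."""
    return SECTORS[int(((direction_deg % 360) + 11.25) // 22.5) % 16]


def wind_rose_percent(directions_deg):
    """Percent of observations falling in each of the 16 sectors."""
    counts = Counter(sector(d) for d in directions_deg)
    n = len(directions_deg)
    return {s: (100.0 * counts[s] / n if n else 0.0) for s in SECTORS}


# Made-up observations: two northerly, one southerly, one westerly.
rose = wind_rose_percent([0, 350, 180, 270])
```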
We developed a model for estimating exposure to residential NO2 by integrating over highways and other landscape features in buffers surrounding a residence using publicly available data. Traffic-related NO2 resulting from point (x,y) on a highway is proportional to traffic volume (ADT(x,y) at that point) and dispersion (f [d(x,y)]), which depends on distance (d(x,y)) to the residence. Traffic elements affecting level of NO2 at a residence (illustrated in Fig. 1) are cumulated using a line integral,
$$\int_C \mathrm{ADT}(x,y)\, f[d(x,y)]\, ds$$
where C represents all highways within a buffer (Holford et al., 2010). We modeled dispersion of NO2 levels using a step-function, and also included as regressors integrated land use within concentric buffers, elevation, population density and season of sampling using linear regression (PROC REG in SAS (SAS Institute Inc., 2004)). Normal QQ plots were used to examine the distribution of residuals. No transformation of either the dependent or independent variables had any material effect on goodness of fit (R2). Initial models contained variables for traffic volume and land use representing each of the buffers surrounding a residence, population density for the census tract containing the residence and elevation at the residence. Isotropic models (independent of direction) and anisotropic models (dependent on direction, i.e., dependent on effects of prevailing wind) were considered. For anisotropic models, covariates for traffic were calculated for each residential buffer and directional quadrant combination.
Variable selection for the final model followed a rule that no more distant buffer could be included without the inclusion of all buffers of the same variable type (traffic volume or land use category) closer to the residence. For example, traffic in the 3 to 4 km buffer could not be included in the model without also including the covariates for traffic for all buffers within 3 km (0 to 500 m, 500 m to 1 km, 1 km to 2 km, and 2 km to 3 km). Similarly, “developed” land use in the 1 km to 2 km buffer could not be included in the model without also including the covariates for “developed” land for all buffers within 1 km (0 to 500 m, and 500 m to 1 km). The number of covariates was reduced by working from outer buffers inward, comparing nested models with F-tests, then retaining all variables contributing significantly (at the 5% level) to model fit. The use of multiple covariates for traffic volume and land use created the possibility of multicollinearity. Elimination of non-significant covariates from the furthest point from the residence and working inward established an inferential hierarchy in which near points more plausibly affected the response than those farther away.
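Two pieces of this selection procedure lend themselves to a short sketch: the check that a set of buffers has no gaps from the residence outward, and the standard F statistic for comparing nested least-squares models. The function names and the numbers in the example are hypothetical, and the comparison against a critical value at the 5% level is left out.

```python
def respects_hierarchy(selected_buffers):
    """True if the 0-based buffer indices chosen for one variable type
    run from the residence outward with no gaps (e.g. {0, 1, 2})."""
    return sorted(selected_buffers) == list(range(len(selected_buffers)))


def nested_f_statistic(rss_reduced, rss_full, n_dropped, n_obs, n_params_full):
    """F statistic for testing whether the n_dropped extra covariates in the
    full model improve fit over the reduced (nested) model.

    rss_*: residual sums of squares; n_params_full counts the intercept.
    """
    numerator = (rss_reduced - rss_full) / n_dropped
    denominator = rss_full / (n_obs - n_params_full)
    return numerator / denominator


# Made-up residual sums of squares for illustration.
f_value = nested_f_statistic(rss_reduced=120.0, rss_full=100.0,
                             n_dropped=2, n_obs=103, n_params_full=3)
```

Under the rule described above, the outermost buffers would be tested first, and a buffer set such as {0, 2} would be rejected by `respects_hierarchy` because the 500 m to 1 km buffer is missing.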
To account for the effect of seasonality on level of NO2, a cubic B-spline function of date with six equally spaced knots was fit to the sampled NO2 (PROC TRANSREG, SAS (SAS Institute Inc., 2004)) and was included as the seasonality covariate in one version of the final model. Because the B-spline function is dependent on the specific time period of the NO2 sampling study, a more generalizable trigonometric (cosine and sine) and linear function of date was used to account for the effect of season in a second version of the final model, so that the model could be adapted to predict NO2 levels at times outside of this period.
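A trigonometric and linear function of date can be built from three covariates: a cosine term, a sine term and a linear trend. The sketch below assumes a one-year period of 365.25 days and an arbitrary time origin of January 1, 2006; neither value is stated in the text, and the regression coefficients applied to these terms would come from the fitted model.

```python
import math
from datetime import date


def season_covariates(sample_date, origin=date(2006, 1, 1), period_days=365.25):
    """Return (cosine term, sine term, linear term in years) for a sampling date.

    origin and period_days are assumptions made for this illustration.
    """
    t = (sample_date - origin).days
    angle = 2 * math.pi * t / period_days
    return math.cos(angle), math.sin(angle), t / period_days


cos_term, sin_term, years = season_covariates(date(2009, 2, 1))
```

Because the cosine and sine terms repeat every year, these covariates can be evaluated for dates outside the sampling period, which a spline fitted to specific dates cannot.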
By design, NO2 samples were collected at each residence four times, once in each season over the course of the year-long study, and over three-fourths of the residences contributed samples from 3 or 4 seasons. Prior to fitting the final model, we examined correlations among residuals for measurements made at different time points and found low, but significant correlations (ranging from 0.13 to 0.24). Therefore, the final model was fit using repeated measures regression (PROC GENMOD with GEE [generalized estimating equations] (SAS Institute Inc., 2004)) specifying an “exchangeable” correlation structure.
To assess the validity of the model as a predictive tool, output from the (generalizable) final model was compared to 120 NO2 measurements from a study conducted in Connecticut in 1994. Residential addresses were geocoded and land use and traffic covariates were developed using the methods described above. Parameter estimates used in the validation model were those obtained from the NO2 samples collected from 2006–2009 with the exception of the intercept, which was adjusted to reflect the difference between mean NO2 levels measured in each study. Differences between observed and predicted NO2 values were examined. In addition, we performed cross-validation using a strategy of fitting the model excluding NO2 observations for each residence in turn, then using the model to obtain predicted values for excluded observations.
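The cross-validation strategy, in which all observations from one residence are held out at a time, can be sketched generically. The record layout and the `fit` and `predict` callables are placeholders; the trivial mean model stands in for the regression model only so that the example is self-contained.

```python
import math


def leave_one_residence_out_rmse(records, fit, predict):
    """records: list of (residence_id, covariates, observed_no2).

    For each residence, fit on all other residences, predict the held-out
    observations, and return the RMSE over all held-out predictions.
    """
    squared_errors = []
    for rid in sorted({r[0] for r in records}):
        train = [r for r in records if r[0] != rid]
        held_out = [r for r in records if r[0] == rid]
        model = fit(train)
        squared_errors.extend((predict(model, x) - y) ** 2 for _, x, y in held_out)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder model: predict the mean of the training observations.
def fit_mean(train):
    return sum(y for _, _, y in train) / len(train)


def predict_mean(model, covariates):
    return model


# Made-up data: residence "a" has two samples, residence "b" has one.
data = [("a", None, 10.0), ("a", None, 12.0), ("b", None, 8.0)]
cv_rmse = leave_one_residence_out_rmse(data, fit_mean, predict_mean)
```

Holding out whole residences, rather than single samples, keeps repeated measurements from the same home out of the training data for that home.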
In order to illustrate NO2 levels predicted by the final model, we selected an area of the state with a range of traffic density and land use characteristics. Prediction maps of NO2 concentrations for two dates (February 1 and August 1, 2009) were created for a 616 km2 area around Hartford, Connecticut. A raster was created using 300 m × 300 m pixels, and the latitude and longitude coordinates were then calculated for the centroid of each pixel. Each of these 8030 points was treated as a “residence” and covariates necessary for the final model (i.e., traffic volume for each buffer/quadrant combination, land use for each buffer, and for each point population density, elevation and date of prediction) were developed using the methods described above. The final model including covariates for these locations was used to obtain predicted NO2 values. Predicted values were reassigned to the original raster grid, which was resampled to a resolution of 30 m and displayed using a color ramp to indicate NO2 concentration.
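Generating the prediction points amounts to computing the centroid of every cell in a regular grid. The sketch below works in projected coordinates (metres) and uses a hypothetical grid origin and size; conversion of the centroids to latitude and longitude is not shown.

```python
def pixel_centroids(x_min, y_min, n_cols, n_rows, cell_m=300.0):
    """Centroid (x, y) of each cell in a regular grid, row by row from the
    lower-left corner (x_min, y_min). Coordinates are in metres."""
    return [
        (x_min + (col + 0.5) * cell_m, y_min + (row + 0.5) * cell_m)
        for row in range(n_rows)
        for col in range(n_cols)
    ]


# A made-up 2 x 1 grid with its lower-left corner at the origin.
points = pixel_centroids(0.0, 0.0, n_cols=2, n_rows=1)
```

Each centroid would then be treated like a residence when covariates are derived.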
Summary statistics for all covariates used to develop the step-function, dispersion model of NO2 concentration in the state of Connecticut are shown in Table 1. The geographic distribution of average daily traffic (ADT) counts is displayed in Fig. 2. The unadjusted, isotropic dispersion function relating NO2 to traffic volume in concentric buffers out to 10 km from a residence (unadjusted R2 = 0.30) is shown in Fig. 3.
An examination of the wind rose representing the composite wind speed and direction for the study period reveals prevailing north/south winds (Fig. 4). Therefore, an anisotropic model was created by dividing buffers into four quadrants (north, east, south and west of the residence as indicated by the labeled quadrants in the compass inset in each panel of Fig. 5) and calculating traffic volume for each buffer/quadrant combination.
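Assigning a road segment to a quadrant requires only the bearing from the residence to the segment midpoint. The sketch below assumes projected coordinates and quadrants centred on the cardinal directions (north covering bearings from 315° to 45°, and so on); the exact quadrant boundaries used in the study are shown in the figures rather than stated in the text, so this boundary choice is an assumption.

```python
import math

QUADRANTS = ["north", "east", "south", "west"]


def bearing_deg(residence_xy, point_xy):
    """Compass bearing (0 = north, 90 = east) from the residence to a point."""
    dx = point_xy[0] - residence_xy[0]  # easting difference
    dy = point_xy[1] - residence_xy[1]  # northing difference
    return math.degrees(math.atan2(dx, dy)) % 360


def quadrant(residence_xy, point_xy):
    """Quadrant of a point relative to the residence, with each quadrant
    spanning 45 degrees either side of a cardinal direction (assumed)."""
    b = bearing_deg(residence_xy, point_xy)
    return QUADRANTS[int(((b + 45) % 360) // 90)]


home = (0.0, 0.0)
labels = [quadrant(home, p) for p in [(0, 100), (100, 0), (0, -100), (-100, 0)]]
```

Combined with a buffer index based on distance, this gives the buffer/quadrant combination for each segment.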
Prior to inclusion in the final model, categories of land use were collapsed from 12 to 3 according to their effect on NO2 level in a model that included only land use variables: increased NO2 levels were significantly associated with only the “developed” land category; decreased NO2 levels were significantly associated with “forest/grass” categories (including deciduous forest, coniferous forest, turf and grass and other grasses); and no significant associations were found between NO2 level and any other category (including agricultural field, water, non-forested wetland, forested wetland, tidal wetland, barren, utility rights-of-way). Geographic distributions of land use (“developed,” “forest/grass,” “other”), population density and elevation are displayed in Figs. 6, 7 and 8, respectively. Land use variables for “developed” and “forest/grass” land were calculated for each residential buffer prior to inclusion in the model. Model covariates for elevation and population density were calculated once for each residence.
Initial analyses using traffic volume variables calculated out to 10 km from the residence showed no significant improvement in model fit compared to using variables within 6 km; therefore, covariates for the most distant residential buffers were dropped from further consideration. After variable selection, following the rules described in section 3, above, the final model incorporated significant traffic buffer/quadrant combinations, significant buffers for “developed” and “forest/grass” land as well as population density, elevation and variables representing season (Table 2).
Fig. 9 displays NO2 concentrations measured over the course of the study as well as NO2 as predicted by two different functions of date: a third-degree B-spline with six knots (red line) and a trigonometric and linear function of date (green line), which were used to represent seasonality in model development and model prediction, respectively. The overall R2 (adjusted) for the final model was 0.6728 using the B-spline function of date to represent seasonality (Table 2, Model 1) and 0.6430 using the trigonometric and linear function of date (Table 2, Model 2). The relationship between traffic volume in the final adjusted model (Table 2, Model 1) and NO2 level is shown for each of the four quadrants surrounding a residence (Fig. 5).
Data used for validating the model included 120 NO2 samples taken outside of Connecticut residences in 1994, and ranged in value from 4.39 to 33.10 ppb (mean [SD] of 13.75 [5.35]). Using the trigonometric and linear function of date to adjust for seasonality, in addition to covariates from the final model including traffic, land use, population density, elevation, and adjusting the intercept to reflect the 3.75 ppb difference in means between the two sample sets, the correlation between observed and predicted NO2 was 0.68. The additional cross-validation analysis using the leave-one-residence-out strategy produced an RMSE of 2.40 compared to 2.38 for the final model (Model 2, Table 2) fitted using all observations. The similarity of RMSEs between the cross-validation sample and the final model indicates that the model performs well in estimating exposure given relevant data on traffic, land use and time.
A variogram was constructed (not shown) and revealed no evidence of spatial correlation for these data. There was, however, evidence of correlation among the repeated temporal measurements from each residence and this was incorporated into our final models (Models 1 and 2, Table 2).
The final model (with the trigonometric and linear function of date for seasonality) was used to predict NO2 levels in a 616 square km area around Hartford, Connecticut at two times in a year (Fig. 10). Orange areas indicate the highest predicted concentrations of NO2 and blue areas the lowest. Predicted values for February 1 ranged from 5.9 to 19.8 ppb (mean [SD] 10.4 [2.3]) (Fig. 10, left panel), and for August 1 ranged from 0.5 to 14.4 ppb (mean [SD] 5.0 [2.3]) (Fig. 10, right panel).
The model described here demonstrates significant relationships between traffic volume, land use, population density, elevation, season and NO2 concentration over a large geographic area. The final model incorporates methods used by Holford et al. (2010) and explains approximately two-thirds of observed variation in NO2 measurements (Table 2). This is an improvement over the isotropic step function or anisotropic cubic spline models reported by Holford et al. (2010) which were unadjusted for other covariates and were able to explain half of the variability. This in turn was an improvement on use of a Gaussian dispersion model that weighted the effect of average daily traffic (ADT) as a function of distance from residence (Holford et al., 2010) which was able to explain less than 10% of the variability in NO2 level. In the model we present here, inclusion of publicly available variables (namely, landscape characteristics including land use, population density and elevation) has resulted in a significant improvement in predicting NO2 concentration at a residence.
In the isotropic model unadjusted for any other covariates, the largest contribution to NO2 comes from traffic volume within 500 m of a residence, with significant (though small) contributions to NO2 from traffic volume as far as 6 km distant (Fig. 3). This is contrary to previous reports suggesting that NO2 is significantly elevated only close to roadways (Carr et al., 2002; Jerrett et al., 2007). Many previous models of NO2 have assumed a Gaussian dispersion function that precludes finding effects at greater distances, whereas the step function used here makes no such assumptions.
When an anisotropic model is considered, the impact of prevailing winds is easy to see. The wind rose summarizing prevailing winds during the three-year study (Fig. 4) suggests that the strongest effects should be observed for traffic volume in the northern, southern and western quadrants, with little effect accruing from traffic east of a residence, and this is exactly what we see in the model. Fig. 5 illustrates the traffic portion of the final adjusted model and shows a similarly-sized effect of traffic volume on NO2 levels for the quadrants north, south and west of a residence and a much smaller contribution to NO2 levels from traffic volume east of a residence (Fig. 5, Table 2). The strongest effect on NO2 level at a residence is from traffic volume within 4 km north, 4 km south and 2 km west of a residence (Fig. 5 and Table 2).
Our results suggest that developed land within 500 m of a residence is associated with increased levels of NO2, while forested and grass land within 2 km is associated with decreased NO2 (Table 2). Only interstates and numbered highways are included in state-collected traffic data, thus it is possible that a portion of the developed land category accounts for contributions to NO2 concentration coming from roadways not included in the state count. Developed land may also contribute to NO2 levels by accounting for the presence of other NO2 sources (e.g., non-traffic combustion processes related to industry, power generation or domestic heating). One limitation of our land use data source (CLEAR, 2006) was that there was only one category of “developed” land available. Additional categories such as “commercial,” “residential,” “industrial” which may be available from other land use data sources would be interesting to explore in future analyses. There is a roughly inverse relationship between the land use categories of developed and forest/grass since together these categories account for the majority of land use in any given buffer (Fig. 6). Thus, the negative effect on NO2 seen for forest/grass land may be an artifact of the absence of developed land. There is evidence, however, that certain types of land use have a true negative effect on NO2 levels, i.e., trees and plants may actually reduce levels of NO2 by a process of absorption (Atkins and Lee, 1995; Gesler et al., 2002; Hanson et al., 1989; Hargreaves et al., 1992), or by creating a physical barrier to NO2 dispersion from a roadway.
Interestingly, population density had a small, but significant positive effect on estimated NO2, while elevation had a small, but significant negative effect. In Connecticut, the highest population density (Fig. 7) and the highest percentage of developed land (Fig. 6) are along the shoreline (i.e., at sea level) and in the Connecticut River valley (Fig. 8). While these three covariates (population density, land use, elevation) are certainly related, they can also serve as proxies for unmeasured factors. For example, population density can represent factors associated with urban areas that may impact NO2 levels (e.g., volume of diesel truck traffic or emissions from fuel combustion for home heating). Elevation may represent topographic factors that affect meteorology, which in turn modifies levels of NO2.
The final model was developed using NO2 measurements from a recent 3.5-year period (2006–2009). The B-spline function developed to account for seasonality was tied to this time period and thus did not seem appropriate for representing seasonality at the time the validation samples were collected, more than 12 years earlier. Therefore, the more generalizable cosine, sine and linear function of date was used to represent seasonality. In addition, an adjustment was made to the intercept to account for the steady decrease in annual levels of NO2. In the state of Connecticut, the annual NO2 concentration has fallen from a mean (SD) of 24.6 (11.8) ppb in 1994 to 15.3 (8.5) ppb in 2006 (CT DEP, 2006; US EPA, 2010). The difference between the two sample sets was not as dramatic (13.75 (5.35) ppb in 1994 compared to 10.00 (3.99) ppb for the years of the study 2006–2009), but was significant (t-test, p < 0.0001). Therefore, the intercept in the validation model was adjusted by 3.75 ppb. The difference in means between the two studies suggests that the model should be calibrated before it is used to predict NO2 levels in different time periods.
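The intercept adjustment is a simple shift by the difference in sample means, which can be written out as a worked example. The two means are those reported in the text; the function names and the 9.0 ppb model output are hypothetical.

```python
def intercept_shift(mean_target_period, mean_development_period):
    """Amount (ppb) to add to the intercept when applying the model to a
    period whose mean NO2 differs from the development period."""
    return mean_target_period - mean_development_period


def recalibrated_prediction(prediction_ppb, shift_ppb):
    """Apply the intercept shift to a prediction from the unadjusted model."""
    return prediction_ppb + shift_ppb


shift = intercept_shift(13.75, 10.00)            # 1994 mean vs. 2006-2009 mean
adjusted = recalibrated_prediction(9.0, shift)   # 9.0 ppb is a made-up model output
```

Only the intercept changes under this adjustment; the coefficients for traffic, land use, population density, elevation and season are carried over unchanged.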
Fig. 10 shows an NO2 surface predicted by the final model for an area around Hartford, Connecticut. Areas with highest estimated values for NO2 correspond to areas with the busiest roadways, most developed land, highest population density and lowest elevation. Predicted NO2 levels are shown for a winter day (February 1, Fig. 10, left panel), and a summer day (August 1, Fig. 10, right panel). As would be expected from what was observed with the measured values, NO2 levels are predicted to be higher in winter than summer. This is reflected in the February map by the greater intensity and larger size of orange areas adjacent to the interstates passing through the center of Hartford, compared to the August map with its lower intensity and smaller area shaded orange in Hartford center and increased blue intensity in areas away from the city center.
One of the challenges to creating and implementing our model has to do with the temporal congruence, or lack thereof, of the covariate data sources. This was not a major issue for model development: NO2 measurements were from 2006 – 2009, traffic data from 2006, land use data from 2006, and population density data from 2005. Temporal mismatches were greater for the validation analysis where NO2 measurements were from 1994, which may explain some of the residual variance in the model. Although changes in land use and elevation may be negligible over this time period, better fits might have been obtained by using traffic and population density data from closer to the time of NO2 measurement.
The model we present is relatively simple, in that it considers only the main-effects of traffic and landscape characteristics. Although this enables straightforward interpretation of the covariates, it may fail to capture certain complexities that could be described using interaction terms. For instance, one might imagine that land use on the line between a road segment and the residence location might interact with the contribution of the segment, and this could be captured by a more complex model. As we have shown, there are clearly anisotropic effects of traffic due to meteorology and topography. We adjusted for the composite effect of prevailing wind over the period of the study. The temporal variability in the modifying effects of wind direction and wind speed could be captured and incorporated into the model in greater detail by calculating wind roses for shorter time periods and for more than four directional sectors.
The model presented here provides a method to estimate exposure to NO2, often used as a proxy for traffic-related pollution, without expensive monitoring efforts. This model is based on readily accessible information and permits a cost-effective assessment of residential exposures using a dispersion function for NO2 around roadways that also considers additional variables including land use, elevation, population density and season. The model, which could be calibrated and applied in areas other than Connecticut, has importance as a tool for exposure estimation in epidemiological studies of traffic-related air pollution.
This study was funded by the National Institutes of Health grants ES05410, ES011013, and ES017416.