To validate a commercial database of community-level physical activity facilities that can be used in future research examining associations between access to physical activity facilities and individual-level physical activity and obesity.
Physical activity facility characteristics and locations obtained from a commercial database were compared to a field census conducted in 80 Census block groups within two U.S. communities. Agreement statistics, agreement of administratively-defined neighborhoods, and distance between locations were used to quantify count, attribute, and positional error.
There was moderate agreement (concordance: non-urban: 0.39; urban: 0.46) of presence of any physical activity facility and poor to moderate agreement (kappa range: 0.14 to 0.76) of physical activity facility type. The mean Euclidean distance between commercial database versus field census locations was 757 and 35 meters in the non-urban and urban communities, respectively. However, 94 and 100% of non-urban and urban physical activity facilities, respectively, fell into the same 5-digit ZIP code, dropping to 92 and 98% in the same block group and 71% along the same street.
Our findings suggest that the commercial database of physical activity facilities may contain appreciable error, but patterns of error suggest that built environment-health associations are likely biased downward.
Research demonstrating statistical relationships, particularly in adolescents (1–3), between obesity and physical activity and aspects of the built environment, such as nearby retail destinations or parks and open space (4–7), has facilitated a growing interest in spatial analyses using geographic information system (GIS) technology.
GIS-derived measures of the built environment enable examination of built environment effects in large population studies because such measures do not rely on resource-intensive neighborhood audits or other forms of direct observation (8) and avoid limitations of perceived measures of the environment (6,9,10). However, large-scale GIS analyses rely on existing databases of built environment components such as retail food outlets and recreation facilities.
Improved understanding of the implications of such analyses requires quantification of the degree of error in GIS data and ultimately the potential bias that such error introduces to environment-health associations. Errors may include incomplete data (count error), inaccurate classification of facilities or characteristics (attribute error), or inaccurate geocoded location (positional error) (Figure 1).
Despite the recognition of these potential errors and considerable growth in the use of GIS technology in health research, existing validation studies generally focus on the positional error of geocoded residential addresses (11–16). Count and attribute error in databases of community resources, as opposed to residential addresses, that are managed in GIS programs have not been assessed.
In this study, we quantify the count, attribute, and positional error in a commercial database of physical activity facilities in two communities to be used in future research examining adolescent physical activity in relation to the built environment. While this validation study focuses on physical activity resources, it will inform future work with the analogous diet environment as well as broader built environment research using commercial databases.
The sample included 40 census block groups within each of two distinct geographic locations selected on the basis of urbanicity, representativeness, and safety concerns. The non-urban location was described by field staff as a “rural town,” with a population density of <2,000 persons/mi². In contrast, the urban location was extremely dense (average >50,000 persons/mi²), composed of block groups small in area, with a smaller proportion of racial/ethnic minorities and relatively low median income (Table 1).
The commercial database contained information on licensed businesses in the U.S. as of August 18, 2005, coded to eight-digit Standard Industrial Classification (SIC) codes. The vendor provided data (n=384 facilities) including businesses (1) with any of 6 SIC codes corresponding to a list of 169 SIC codes representing physical activity facilities or with facility name containing YMCA or YWCA text phrases; and (2) that fell within ZIP codes containing the 80 block groups described above. The list of SIC codes was generated collaboratively by the investigators and a vendor representative to ensure coverage of all physical activity resources and facilities. The vendor queries data by ZIP code, so facilities in ZIP codes corresponding to the 80 selected block groups were requested. Because park addresses in the commercial database correspond to their usually off-site administrative office or commercial entity contained in the park, parks data are more appropriately obtained from other databases and were excluded from this analysis.
Street address format was standardized by the commercial vendor, which facilitated the geocoding process. Preference was given to facility locations geocoded as a street segment match using ArcView 3.3 with default settings (spelling sensitivity=80, minimum match score=60, minimum score to be considered a candidate=30), 10-foot road offset, and Tele Atlas Dynamap/2000 for Download v15.1 (April 2005) street data (17), which is updated quarterly. An initial batch match was followed by an interactive session to resolve unmatched addresses, yielding 92.2% perfect matches and 2.6% matches with scores of 80–85 for an overall street segment match rate of 94.8% among the 384 facilities received from the vendor. Unmatched addresses resulted from street numbers outside the address range of the street segments, unmatched street name, mismatched street types, or similar issues. For the remaining 5.2% of facilities, latitude and longitude coordinates provided by the commercial vendor were used, yielding a total geocoding match rate of 100% for the records returned from the commercial vendor.
Following geocoding, a geoprocessing procedure was conducted to “clip” ZIP code polygon boundaries (geographic unit of commercial database) to block group boundaries (geographic unit of field census), resulting in 173 facilities within the study area. Facilities with duplicate identifiers or addresses were eliminated; for each duplicate set, facility type classification was identical and one record was retained, leaving a total of 161 facilities.
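The duplicate-removal step can be illustrated with a short sketch. This is not the procedure actually run in the GIS software; it is a minimal Python illustration that assumes hypothetical record fields (`facility_id`, `address`) and a simple address normalization, and it keeps the first record in each duplicate set.

```python
def normalize_address(address: str) -> str:
    """Collapse case and whitespace so trivially different spellings compare equal."""
    return " ".join(address.upper().split())


def drop_duplicates(records):
    """Keep the first record of each set sharing an identifier or an address.

    `records` is a list of dicts with hypothetical keys 'facility_id' and 'address'.
    """
    seen_ids, seen_addresses, kept = set(), set(), []
    for rec in records:
        addr = normalize_address(rec["address"])
        is_duplicate = rec["facility_id"] in seen_ids or addr in seen_addresses
        # Remember both keys even for dropped records so chains of duplicates are caught.
        seen_ids.add(rec["facility_id"])
        seen_addresses.add(addr)
        if not is_duplicate:
            kept.append(rec)
    return kept
```

Retaining the first record is safe only when the duplicates carry identical facility type classifications, as was the case in this study.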
A full field census of physical activity facilities in the two selected locations was performed from August to October 2005. The field team identified and classified each physical activity facility as one or more pre-specified types (Table 2), obtained curbside Global Positioning System (GPS) readings and photographs, and recorded other pertinent details. GPS measurements were obtained using a Leica GS20 with differential correction and manual corrections for offsets, yielding locations accurate to approximately one to two meters. The field census was treated as the criterion measure.
Count, attribute, and positional errors in the commercial database were quantified by comparing to the field census overall and by facility type.
Count error was assessed by classifying each facility according to validity in the field and inclusion in the commercial database. In the field, a facility was classified as an invalid physical activity resource for adolescents if it was not found, not judged to be a recreation facility (e.g., an administrative office), or relevant only for very young children (e.g., playgrounds or other facilities for young children). Concordance was calculated as the number of facilities in agreement divided by the number of facilities. Kappa, a chance-corrected measure of agreement (21), sensitivity, and specificity were calculated using agreement statistic functions in Stata. Because facility types were not mutually exclusive, type-specific statistics may include overlapping resources. Averages weighted by type-specific field census counts provided global measures of agreement. “Insufficient cell size” indicates fewer than two commercial database or field facilities indicated for any given facility type.
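For readers who want to reproduce these agreement statistics outside Stata, the calculations can be sketched from a 2×2 table. The following Python sketch is an illustration, not the analysis code used in this study; the cell labels (a, b, c, d) and the example counts are assumptions chosen for clarity, with the field census treated as the criterion.

```python
import math


def agreement_stats(a, b, c, d):
    """Agreement between a database and a field census (the criterion).

    a: in database and in field      b: in database, not in field
    c: not in database, in field     d: in neither
    """
    def ratio(num, den):
        # Return NaN instead of raising when a statistic is undefined.
        return num / den if den else math.nan

    n = a + b + c + d
    observed = ratio(a + d, n)  # concordance (observed agreement)
    # Agreement expected by chance, from the marginal totals.
    expected = ratio((a + b) * (a + c) + (c + d) * (b + d), n * n)
    return {
        "concordance": observed,
        "kappa": ratio(observed - expected, 1 - expected),
        "sensitivity": ratio(a, a + c),
        "specificity": ratio(d, b + d),
    }


# Made-up counts for one facility type (not study data).
example = agreement_stats(a=20, b=5, c=10, d=65)
```

With these illustrative counts, concordance is 0.85, the chance-expected agreement is 0.60, and kappa is (0.85 − 0.60)/(1 − 0.60) = 0.625.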
Positional error was assessed using Euclidean and network distances calculated based on Universal Transverse Mercator (UTM) coordinates; ArcGIS Network Analyst (22) was used to calculate network route differences, hereafter referred to as “network distances”. To assess the impact of positional error on physical activity facility counts within neighborhoods defined by census geographies or ZIP codes, US Census geography shapefiles (2005), ESRI Streetmap USA files (2004), ZIP code (2005) and physical activity facility GPS and geocoded locations were spatially matched to determine the 5-digit ZIP code, census tract, block group, and nearest street upon which the geocoded and GPS locations fell. Facility coordinates were re-projected from UTM to the geographic coordinate system to match the other GIS layers. The street upon which geocoded and GPS coordinates were located was considered to match if: (1) the street segment identifier or (2) the street name and type for the nearest street segment matched. Locations were compared for each facility and summarized descriptively.
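Because UTM coordinates are expressed in meters, the Euclidean component of positional error reduces to the Pythagorean distance between the geocoded and GPS points. The sketch below is illustrative only: the function names and coordinates are hypothetical, and it assumes both points lie in the same UTM zone. Network distances require a street network and are not sketched here.

```python
import math


def euclidean_distance_m(geocoded_xy, gps_xy):
    """Straight-line distance in meters between two (easting, northing) pairs."""
    dx = geocoded_xy[0] - gps_xy[0]
    dy = geocoded_xy[1] - gps_xy[1]
    return math.hypot(dx, dy)


def mean_positional_error_m(pairs):
    """Mean Euclidean distance over (geocoded_xy, gps_xy) pairs."""
    distances = [euclidean_distance_m(g, f) for g, f in pairs]
    return sum(distances) / len(distances) if distances else math.nan


# Made-up coordinates: the GPS point is 30 m east and 40 m north of the geocode.
geocoded = (500000.0, 4000000.0)
gps = (500030.0, 4000040.0)
distance = euclidean_distance_m(geocoded, gps)  # 50.0 meters
```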
The commercial database and field census showed moderate agreement of presence of any facility, indicated by an overall concordance of 0.42 (Table 3). Kappa and specificity were not valid for overall counts because counts of facilities found in neither the commercial database nor the field census were, by definition, zero. Disagreement resulted primarily from facilities identified in the field but not the commercial database (90 facilities) or facilities in both data sources but not judged as valid recreation facilities (40 facilities), as opposed to facilities in the commercial database not found in the field (16 facilities) (data not shown).
Agreement of facility presence by facility type was assessed by comparing the GIS versus field census facility type classifications. Agreement was high, with concordance and specificity 0.82 and 0.90 or greater, respectively. Sensitivity ranged from 0.10 to 0.79 for non-urban and urban facilities combined. In general, kappa statistics indicated slight to substantial agreement, ranging from 0.14 to 0.76. Agreement was generally higher for non-urban than urban facilities. The weighted averages (across facility types) of type-specific agreement statistics are listed in Table 3.
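The weighting of type-specific statistics by field census counts can be sketched as follows. This Python illustration is a sketch under stated assumptions, not the study's code: the facility type names and values are hypothetical, and types with an undefined statistic (for example, because of insufficient cell size) are assumed to be skipped.

```python
import math


def weighted_average(stats_by_type, field_counts):
    """Average a type-specific statistic, weighting by field census counts.

    Types whose statistic is missing (None or NaN) contribute nothing.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for facility_type, value in stats_by_type.items():
        if value is None or math.isnan(value):
            continue
        weight = field_counts[facility_type]
        weighted_sum += value * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else math.nan


# Made-up kappa values and field counts (not study data).
kappas = {"type_a": 0.2, "type_b": 0.6, "type_c": math.nan}
counts = {"type_a": 10, "type_b": 30, "type_c": 1}
overall_kappa = weighted_average(kappas, counts)
```

In this example the result is (0.2 × 10 + 0.6 × 30)/40, or about 0.5, so the more common facility type dominates the global measure.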
Euclidean distances between the commercial database (geocoded address) and field locations (GPS) were substantially larger for non-urban than urban facilities (mean Euclidean distance: non-urban, 757 meters vs. urban, 35 meters), and for large facilities such as golf courses and schools (e.g., mean Euclidean distance: outdoor non-urban, 3,235 meters vs. member non-urban, 302 meters) (Table 4). Network distances were larger but mirrored Euclidean distances (Table 4).
Because many studies use administrative boundaries such as ZIP code or Census tract to define neighborhoods, understanding the impact of positional errors on facility counts within these areas is important. In the non-urban and urban communities, 94 and 100% of geocoded and GPS locations, respectively, fell into the same 5-digit ZIP code and Census tract, dropping to 92 and 98% in the same block group (Table 5). Further, for 71% of facilities, geocoded and GPS locations fell on the same street (Table 5), suggesting that most positional error resulted from the interpolation of points along the street segment during the geocoding process.
While these results are promising, they include only those facilities identified in both the commercial database and field census. Observed count errors, however, indicate that the commercial database provided an undercount of facilities. Table 6 presents field counts within block groups containing the specified number of physical activity facilities. For example, block groups containing one physical activity facility according to the commercial database actually contained 2.6 and 2.0 facilities in the non-urban and urban community, respectively.
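A tabulation of this kind groups block groups by their commercial database count and averages the corresponding field counts. The sketch below illustrates the idea in Python; the function name and the input pairs are hypothetical, and the numbers are not study data.

```python
from collections import defaultdict
from statistics import mean


def mean_field_count_by_db_count(block_groups):
    """Mean field census count for each commercial database count.

    `block_groups` is an iterable of (database_count, field_count) pairs,
    one pair per block group.
    """
    grouped = defaultdict(list)
    for db_count, field_count in block_groups:
        grouped[db_count].append(field_count)
    return {db_count: mean(counts) for db_count, counts in sorted(grouped.items())}


# Made-up block groups: (database count, field count).
table = mean_field_count_by_db_count([(0, 1), (0, 0), (1, 2), (1, 3), (2, 4)])
```

If the mean field count rises with the database count, as it does in this toy input, the rank ordering of block groups is preserved even when the database undercounts.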
Despite rapidly growing use of GIS in health research and concerns about GIS accuracy, published validation studies of GIS resource data are scant. In this study, we compared a commercial database of physical activity facilities to a field census in two communities. We found moderate overall agreement, with the main sources of error resulting from facilities not included or misclassified as physical activity facilities in the commercial database. Classification of physical activity facility type was similar according to the commercial database and field team. While the positional error was large for some facilities, both locations fell into the same Census block group for 95% of facilities.
Discordance in the presence of physical activity facilities, regardless of type, resulted from three types of error: (1) facilities included in the commercial database but not found in the field; (2) facilities included in the commercial database but not identified as valid physical activity facilities in the field; and (3) facilities identified in the field but not in the commercial database. The latter two errors were far more common than the first. Because only records matching pre-specified definitions of physical activity facilities were obtained from the commercial vendor, we were not able to investigate if these facilities were contained in the database but not captured by the a priori classification scheme. Administrative offices for physical activity facilities, in the absence of the actual facility itself, were the primary contributors to the second error type. Refinement of the classification scheme for physical activity facilities could potentially reduce both types of error.
Positional error ranged from negligible (5 meters Euclidean distance) to exceptional (almost 20,000 meters, in the case of a golf course) and was larger than distances observed in residential address geocoding validation studies (11–15). However, this study examined physical activity facilities, which are often large in size and could have ambiguous point locations (e.g., golf courses, which cover a relatively large geographic area), rather than residences. Indeed, the facility with the largest positional error was a golf course with a long driveway. Positional error for smaller facilities such as YMCA and member facilities was more consistent with error of residential locations.
For any given facility, GPS and geocoded locations may fall far apart but within the same neighborhood. In studies that use administratively defined neighborhoods, this scenario would neither alter facility counts within neighborhoods nor introduce bias to associations with health-related outcomes. In this study, commercial database and field locations fell into the same block group for 95% of facilities.
The discrepancy between large distance measures and low count error is consistent with the nature of the block group boundaries. Block groups generally contain 300–3000 individuals, so block group size increases with decreasing population density. The largest distances were found in the non-urban community in which block groups averaged 3.7 square miles versus 0.03 square miles in the urban community, yet the larger block group areas increased the likelihood that both locations would fall into the same block group. Large distance measures may have more potential for bias when using buffer-defined neighborhoods with a large radius (e.g., 5-mile radius around each respondent’s residence); however, investigation into this issue requires respondent locations and is thus outside the scope of this study.
The assumption that address numbers are uniformly distributed along a street segment is a known source of error in the geocoding process. Further, street segments are shorter in urban areas than less populated areas. Indeed, the commercial database and field locations fell along the same street for 71% of facilities in both communities, while the distances are substantially larger in the non-urban than urban community. That much of the error occurs along the street segment is a promising finding because it reduces the potential network distance between geocoded versus actual location. It is also likely to be random with respect to resident characteristics, thus biasing built environment-health outcome associations toward the null.
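The uniform-distribution assumption can be made concrete with a simplified sketch of address interpolation. This is a generic illustration in Python, not the algorithm implemented in the geocoding software used here: it ignores the road offset and side-of-street parity, the function name and example values are hypothetical, and out-of-range street numbers are assumed to be treated as unmatched.

```python
def interpolate_address(house_number, from_number, to_number, start_xy, end_xy):
    """Place an address along a straight street segment by linear interpolation.

    Assumes address numbers are spread evenly between the segment's endpoints.
    """
    low, high = sorted((from_number, to_number))
    if not low <= house_number <= high:
        raise ValueError("street number outside the segment's address range")
    if from_number == to_number:
        fraction = 0.5  # single-address segment: use the midpoint
    else:
        fraction = (house_number - from_number) / (to_number - from_number)
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Made-up segment: 1,000 m long, address range 100 to 198.
point = interpolate_address(149, 100, 198, (0.0, 0.0), (1000.0, 0.0))  # (500.0, 0.0)
```

Under this model, a facility numbered 149 that actually sits near one end of the segment would be placed at the midpoint, so the error stays on the correct street and is bounded by the segment length, which is typically longer in less populated areas.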
Field counts calculated within a given commercial database count face the issues of completeness as well as positional accuracy. Our commercial database generally underestimated the number of physical activity facilities within a block group, particularly in non-urban communities. However, commercial database and field counts increased together, albeit not in parallel, so rank orders of physical activity counts would be preserved. This error is likely to result in nonsystematic misclassification and thus attenuate associations between physical activity facilities and health outcomes.
While some differences such as facility size and geocoded street segment length are characteristic of any non-urban versus urban area, other differences were imposed by the study design. Because our field team confined their census area to specified block groups encompassing a larger area in the non-urban community, the maximum distance between the field and geocoded locations was larger and the team was more likely to locate a facility listed in the commercial database in the non-urban relative to the urban community. These considerations are consistent with generally larger positional error but lower count error in non-urban than urban communities. However, few facilities contained in the commercial database were not located in the field, so the latter issue likely has minimal impact.
While including two dissimilar communities provides validity data for both an urban and non-urban area, these communities are not representative of the range of geographic areas in the U.S. and span a small range of urbanicity. The small number of facilities may have resulted in unstable estimates of error, but overall patterns were consistent, suggesting that our estimates are reasonably reliable and valid.
Errors observed in this study can be attributed to several sources. First, it is possible that facilities found in the field but not in the commercial database were new facilities not yet included in the database. We do not currently have access to more recent data to investigate this issue, but this source of error can be expected in any study using commercial data and is therefore important to capture in this validation study. Second, because facilities contained in the commercial database used in this study depended on both the source data and the investigator-supplied a priori classification strategy, it is difficult to determine the contribution of each to the observed errors. Third, disagreement of the street nearest to GPS versus geocoded locations can be attributed to error in the geocoded coordinates or in the street file. In general, errors observed in this study were reasonable, so these limitations should not reduce confidence in the utility of commercial databases for built environment research.
Geocoding of facility locations yielded a high match rate (94.8%), but for 5.2% of facilities, geocoded coordinates provided by the commercial vendor were used. However, comparison of positional error calculated from the two sources shows only a small (7.5 meters) difference in mean error, so use of two sources of geocodes is unlikely to impact our results.
Finally, this study assesses one potential GIS layer among many. Evaluation of other GIS data sources is required to more fully describe GIS data validity and potential bias in other GIS-derived measures of the built environment.
Despite these limitations, this study is the first to estimate completeness and attribute and positional accuracy of GIS physical activity facilities data. Criterion measures were provided by a detailed field census and GPS locations accurate to within one to two meters.
Overall, our validation study findings suggest that the commercial database of physical activity facilities may contain appreciable error, but the random nature of the sources of error and small error in neighborhood facilities counts suggest that bias to associations with health-related outcomes is probably small and toward the null. Our findings do not suggest that existing associations between aspects of the built environment and health related outcomes should be called into question.
An important next step is the application of validation findings to individual-level respondent locations and health data, allowing estimation, through simulation or empirical comparisons, of the magnitude of bias to the association between physical activity facilities and health outcomes introduced by this error.
The funding of this study comes from the National Institutes of Health NICHD (K01 HD044263-01), NIDDK (DK56350), NIEHS (P30ES10126) and NIA (P30AG024376). The authors would like to thank Allen Serkin, Peter Zambito, Evan Hammer, and Brian Frizzelle of the University of North Carolina, Carolina Population Center, Spatial Analysis Unit for collecting and managing the field census and commercial data, for their assistance and expertise regarding spatial data and analysis, and for their comments on the manuscript. The authors also thank Ms. Frances Dancy for her helpful administrative assistance. There were no potential or real conflicts of financial or personal interest with the financial sponsors of the scientific project.