Despite rapidly growing use of GIS in health research and concerns about GIS accuracy, published validation studies of GIS resource data are scant. In this study, we compared a commercial database of physical activity facilities to a field census in two communities. We found moderate overall agreement, with the main sources of error resulting from facilities not included or misclassified as physical activity facilities in the commercial database. Classification of physical activity facility type was similar according to the commercial database and field team. While the positional error was large for some facilities, both locations fell into the same Census block group for 95% of facilities.
Completeness and Attribute Accuracy
Discordance in the presence of physical activity facilities, regardless of type, resulted from three types of error: (1) facilities included in the commercial database but not found in the field; (2) facilities included in the commercial database but not identified as valid physical activity facilities in the field; and (3) facilities identified in the field but not in the commercial database. The latter two errors were far more common than the first. Because only records matching pre-specified definitions of physical activity facilities were obtained from the commercial vendor, we were not able to investigate if these facilities were contained in the database but not captured by the a priori classification scheme. Administrative offices for physical activity facilities, in the absence of the actual facility itself, were the primary contributors to the second error type. Refinement of the classification scheme for physical activity facilities could potentially reduce both types of error.
Positional error ranged from negligible (5 meters Euclidean distance) to exceptional (almost 20,000 meters, in the case of a golf course) and was larger than distances observed in residential address geocoding validation studies (11). However, this study examined physical activity facilities, which are often large in size and could have ambiguous point locations (e.g., golf courses, which cover a relatively large geographic area), rather than residences. Indeed, the facility with the largest positional error was a golf course with a long driveway. Positional error for smaller facilities such as YMCA and member facilities was more consistent with error of residential locations.
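As a minimal sketch of how positional error of this kind can be computed, the example below takes the Euclidean distance between a GPS location and a geocoded location. It assumes both points are already expressed in a projected coordinate system with units of meters; the function name, facility labels, and coordinates are hypothetical and are not data from this study.

```python
import math


def positional_error_m(gps_xy, geocoded_xy):
    """Euclidean distance (meters) between two points.

    Assumes both points are (x, y) pairs in the same projected
    coordinate system with units of meters (an assumption of this sketch).
    """
    dx = gps_xy[0] - geocoded_xy[0]
    dy = gps_xy[1] - geocoded_xy[1]
    return math.hypot(dx, dy)


# Hypothetical (GPS, geocoded) coordinate pairs, for illustration only.
facilities = {
    "small_gym": ((500000.0, 4000000.0), (500003.0, 4000004.0)),
    "golf_course": ((510000.0, 4010000.0), (522000.0, 4026000.0)),
}

errors = {
    name: positional_error_m(gps, geocoded)
    for name, (gps, geocoded) in facilities.items()
}

for name, error in errors.items():
    print(f"{name}: {error:.1f} m")
```

With geographic (latitude and longitude) coordinates, a projection step or a great-circle distance would be needed before a comparison of this kind.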
For any given facility, GPS and geocoded locations may fall far apart but within the same neighborhood. In studies that use administratively defined neighborhoods, this scenario would neither alter facility counts within neighborhoods nor introduce bias to associations with health-related outcomes. In this study, commercial database and field locations fell into the same block group for 95% of facilities.
The discrepancy between large distance measures and low count error is consistent with the nature of the block group boundaries. Block groups generally contain 300–3000 individuals, so block group size increases with decreasing population density. The largest distances were found in the non-urban community in which block groups averaged 3.7 square miles versus 0.03 square miles in the urban community, yet the larger block group areas increased the likelihood that both locations would fall into the same block group. Large distance measures may have more potential for bias when using buffer-defined neighborhoods with a large radius (e.g., 5-mile radius around each respondent’s residence); however, investigation into this issue requires respondent locations and is thus outside the scope of this study.
The assumption that address numbers are uniformly distributed along a street segment is a known source of error in the geocoding process. Further, street segments are shorter in urban areas than in less populated areas. Indeed, the commercial database and field locations fell along the same street for 71% of facilities in both communities, while the distances were substantially larger in the non-urban than in the urban community. That much of the error occurs along the street segment is a promising finding because it reduces the potential network distance between the geocoded and actual locations. It is also likely to be random with respect to resident characteristics, thus biasing built environment-health outcome associations toward the null.
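The uniform-distribution assumption can be illustrated with a minimal sketch of linear address interpolation, a common approach in street-segment geocoding. This is not necessarily the method used by the commercial vendor or by our geocoding software; the function name, address range, segment length, and true location below are hypothetical.

```python
def interpolate_address(house_number, from_number, to_number, start_xy, end_xy):
    """Place an address along a straight street segment by linear interpolation.

    Assumes address numbers are spread uniformly between the segment's
    endpoints, which is the assumption discussed in the text.
    """
    if to_number == from_number:
        fraction = 0.5  # degenerate address range: fall back to the midpoint
    else:
        fraction = (house_number - from_number) / (to_number - from_number)
    fraction = min(max(fraction, 0.0), 1.0)  # keep the point on the segment
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical 1,000 m segment running along the x-axis, address range 100 to 200.
estimated = interpolate_address(150, 100, 200, (0.0, 0.0), (1000.0, 0.0))

# Hypothetical true location: the facility actually sits 800 m along the segment.
true_xy = (800.0, 0.0)
along_street_error = abs(true_xy[0] - estimated[0])

print(estimated, along_street_error)
```

In this illustration the displacement lies entirely along the street, and the same proportional misplacement produces a larger distance on a longer segment, which is in line with the larger distances observed in the non-urban community.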
Field counts calculated within a given commercial database count face the issues of completeness as well as positional accuracy. Our commercial database generally underestimated the number of physical activity facilities within a block group, particularly in non-urban communities. However, commercial database and field counts increased together, albeit not in parallel, so rank orders of physical activity counts would be preserved. This error is likely to result in nonsystematic misclassification and thus attenuate associations between physical activity facilities and health outcomes.
While some differences such as facility size and geocoded street segment length are characteristic of any non-urban versus urban area, other differences were imposed by the study design. Because our field team confined their census area to specified block groups encompassing a larger area in the non-urban community, the maximum distance between the field and geocoded locations was larger and the team was more likely to locate a facility listed in the commercial database in the non-urban relative to the urban community. These considerations are consistent with generally larger positional error but lower count error in non-urban than urban communities. However, few facilities contained in the commercial database were not located in the field, so the latter issue likely has minimal impact.
Strengths & Limitations
While including two dissimilar communities provides validity data for both an urban and non-urban area, these communities are not representative of the range of geographic areas in the U.S. and span a small range of urbanicity. The small number of facilities may have resulted in unstable estimates of error, but overall patterns were consistent, suggesting that our estimates are reasonably reliable and valid.
Errors observed in this study can be attributed to several sources. First, it is possible that facilities found in the field but not in the commercial database were new facilities not yet included in the database. We do not currently have access to more recent data to investigate this issue, but this source of error can be expected in any study using commercial data and is therefore important to capture in this validation study. Second, because facilities contained in the commercial database used in this study depended on both the source data and the investigator-supplied a priori classification strategy, it is difficult to determine the contribution of each to the observed errors. Third, disagreement of the street nearest to GPS versus geocoded locations can be attributed to error in the geocoded coordinates or in the street file. In general, errors observed in this study were reasonable, so these limitations should not reduce confidence in the utility of commercial databases for built environment research.
Geocoding of facility locations yielded a high match rate (94.8%), but for 5.2% of facilities, geocoded coordinates provided by the commercial vendor were used. However, comparison of positional error calculated from the two sources shows only a small (7.5 meters) difference in mean error, so use of two sources of geocodes is unlikely to impact our results.
Finally, this study assesses one potential GIS layer among many. Evaluation of other GIS data sources is required to more fully describe GIS data validity and potential bias in other GIS-derived measures of the built environment.
Despite these limitations, this study is the first to estimate completeness and attribute and positional accuracy of GIS physical activity facilities data. Criterion measures were provided by a detailed field census and GPS locations accurate to within one to two meters.
Overall, our validation study findings suggest that the commercial database of physical activity facilities may contain appreciable error, but the random nature of the sources of error and small error in neighborhood facility counts suggest that bias to associations with health-related outcomes is probably small and toward the null. Our findings do not suggest that existing associations between aspects of the built environment and health-related outcomes should be called into question.
An important next step is the application of validation findings to individual-level respondent locations and health data, allowing estimation, through simulation or empirical comparisons, of the magnitude of bias to the association between physical activity facilities and health outcomes introduced by this error.