Additional detail concerning the survey and variables analyzed here are presented in Huang et al. 2009 [
34]. This study is based on a subset of data from the 2001 California Health Interview Survey (CHIS). This large (N = 55,428 households) random digit dial telephone survey in California is administered in seven languages (English, Spanish, Mandarin, Cantonese, Vietnamese, Korean and Khmer) and had a response rate, based on the American Association for Public Opinion Research equation RR4 [
35], of 43.3% with a cooperation rate of 63.7% (weighted to account for the sample design) and 77.1% (unweighted).
We studied residents of San Diego and Los Angeles counties where over 70% of survey respondents supplied the name of the nearest intersection to their residence (In LA County, 8728/12196 = 71.5%, and in SD County 1952/2672 = 73%). These addresses were geocoded to represent the location of each respondent for purposes of this analysis. After exclusion of respondents with missing or invalid data, 8506 respondents from LA and 1883 respondents from SD were used in the analysis. These two counties were the only ones with nearest intersection data available in CHIS 2001.
The paper has two sections. In the first, we characterize street connectivity based on GIS-derived measures from buffers around the nearest intersections to respondents homes. In the second section we used a combination of CHIS variables, Census data, and the street connectivity data in a model-based analysis to explore the relative contributions of street connectivity and other variables to active transportation (AT).
Contextual and connectivity variables
We compiled street connectivity and two density-related variables using circular buffers (areas around a point) of radius 0.5 km surrounding each respondent's location (nearest intersection to home). These buffers were defined using TIGER map files from the 2000 U.S. Census Bureau and implemented with GIS software (ArcView, ESRI, Inc.). Data concerning population and employment density and characteristics of the street network for each buffer were then calculated at the census tract or census block (administrative units that are nested within census tracts) level.
Population density within a buffer was generated by downloading US Census data at the census block level. Each half-kilometer buffer usually overlapped more than one census block. We assumed that population density is uniform within each census block and assigned a portion of the population within the census block to the buffer based on the area of the census block within the buffer. For example, if a buffer covers half of a census block, half of the census block's population is assigned to that buffer, in addition to the population in census blocks that were completely within the buffer. The total population in the buffer was then divided by the area (0.785 square kilometers). Employment density data were generated using data from the metropolitan planning organization for each area - the Southern California Association of Governments (SCAG) for Los Angeles and the San Diego Association of Governments (SANDAG) for San Diego. Each agency provided total employment data by census tract for the year 2000. The method to calculate employment density was identical to that of population density, except that because of census data availability, we used tracts instead of blocks. Therefore, the variance associated with population and employment densities are likely to differ in this study.
For our measures of street connectivity, we first extracted or calculated values for nine variables for each buffer. Later we used principal components analysis (see results) to reduce the number of variables used in our analysis of variance. Variables included: 1) Link/Node Ratio, the link/node ratio is the total number of links divided by the total number of nodes. All nodes are included, meaning intersections and the ends of cul de sacs and dead-end streets. A higher ratio = higher connectivity. Links are defined as street segments and nodes as intersections or dead ends. 2) Intersection Density, intersection density is the number of real nodes (nodes that are at 4-way or 3-way intersections, not the end of cul de sacs) divided by the buffer area (0.785 sq. km.). A higher density = higher connectivity. 3) Street Network Density, the street network density is calculated by summing the lengths of all the links within the buffer (the total network distance within the buffer, ignoring the number of lanes on a road) and dividing by the area of the buffer (0.785 sq. km.) (Note buffer size choice was based on our expert opinion, budget constraints precluded analysis of more buffer sizes). The portion of a street (link) that continued outside the buffer was not included. A higher density = higher connectivity. 4) Connected Node Ratio, connected node ratio (CNR) is the number of real nodes divided by the total number of all nodes. If all the nodes in a buffer were at 4-way or 3-way intersections, the CNR would be 1.0. A higher ratio = higher connectivity (maximum = 1.0). 5) Block Density, block density is the total number of Census blocks within a buffer divided by the area of the buffer (0.785 sq. km.). Census block boundaries generally coincide with streets and are consistent with a block defined by the area within connecting streets. If a portion of a block was outside a buffer, only the area of the block within the buffer was included. A higher density = higher connectivity. 6) Average Block Length, the average block length is the average length of the links that are completely or partially within the buffer. For links (blocks) that continue outside the buffer, the entire length of the link is included in the calculation. Truncating the link at the buffer boundary would have reduced the length of the block artificially. A higher average length = less connectivity. 7) Median Block Length, median block length was calculated in the same manner as average block length. A higher median length = less connectivity
The eighth variable was the Gamma index, the ratio of the number of links in the network to the maximum possible number of links between nodes. The maximum possible number of links is expressed as 3 * (# nodes - 2) because the network is abstracted as a planar graph. In a planar graph, no links intersect, except by nodes [
28]. Values for the gamma index range from 0 to 1 and are often expressed as a percentage of connectivity, e.g. a gamma index of 0.54 means that the network is 54 percent connected. Only links that are completely within the buffer were included. This was because every link must have a node on each end. If links were truncated at the buffer, there would be no node. In addition, only the nodes that intersect with these links were included. Gamma was only calculated for buffers with three or more nodes. All the locations with the number of nodes less than 3 were treated as missing (3 points in SD and 6 points in LA). A higher value = higher connectivity (maximum = 1.0).
The ninth variable was the Alpha index. The alpha index uses the concept of a circuit - a finite, closed path starting and ending at a single node. The alpha index is the ratio of the number of actual circuits to the maximum number of circuits and is equal to:
Values for the alpha index range from 0 to 1. As with gamma, only links that are completely within the buffer were included and only the nodes that intersect with these links were included. Alpha can not be calculated if the number of nodes in a buffer is less than three or the number of nodes is equal to or greater than the number of links. These cases were coded as missing data (98 points in SD and 128 points in LA). The second condition was violated more often than the first, because only links within a buffer be included. This was usually in more rural areas. A higher value = higher connectivity (maximum = 1.0)
Several of the above measures were highly correlated; 7 of the 36 possible pairs of the 9 variables had correlation coefficients above 80% (See below). Including highly correlated covariates in a regression model leads to instability of the model, so we used principal components (orthogonal rotation) and factor analysis to identify the main components of variance in this data set. This process constructed indices that explained most of the variance of the built environment across the locations and that could be used as independent predictors in the models. Similar principal components were derived from analyses considering LA and SD separately. These analyses were carried out in SAS JMP Version 8.0 (Cary, NC).
Active transportation, demographic, and anthropometric variables from CHIS
CHIS 2001 survey data included in this study were a measure of active transportation, and multiple relevant demographic and anthropometric variables. AT was measured by asking three short questions: 1) "Over the past 30 days, have you walked or bicycled to or from work, school, or to do errands?", 2) "How many times per day, per week or per month did you do this?" and 3) "And on average, about how many minutes did you walk or ride your bike each time?". AT was analyzed either as a measure of prevalence such as yes/no (any AT or none) from the answers to the first question, or as a measure of duration such as minutes per week among walkers/bicyclists derived from the answers to the second and third questions.
Demographic and socioeconomic status (SES) variables including age, gender, race, education, and income were also extracted from the CHIS survey resource for each respondent, as were self-reported health status, immigration status and employment status. For self related health status we chose an activity related variable based on responses to the query "How much does your health limit you when climbing several flights of stairs?". Responses were on a three part scale, "Limited a lot", "Limited a little", "Not limited at all". CHIS includes a variety of other variables related to diet, tobacco and alcohol use, cancer screening practices, health care coverage; we focused on variables commonly used in past studies of active transportation. For some analyses we also used self-reported data on height and weight to obtain body mass index [BMI = Height/Weight (kg)2], a measure of obesity.
Geographic identifiers included latitude and longitude rounded to 0.01 degrees and Zipcode of address. Data concerning bus stops and light rail were obtained from the Los Angeles and San Diego Public transit agencies coded as present or absent within a buffer (Thanks to R. Adamski). Presence or absence of a freeway within a buffer was obtained from Tiger Line files.
Statistical analysis
Preliminary analysis showed that the distribution of the number of minutes reported in AT was skewed and had a spike at zero, representing respondents who do not report any AT. A logarithmic transformation normalized the distribution of non-zero minutes. The importance of the potential explanatory variables was tested separately by a logistic model for the AT/no AT response and a lognormal model for the number of minutes reported by those with any AT [
36]. These fixed effects models included all main effects and all possible two-way interactions at first. Non-significant (p > 0.05) interactions and then main effects were removed by a stepwise procedure.
Once the initial subset of variables and their interactions were determined, the data were analyzed by a multivariate regression, with a binary component for whether a person reported any AT and a lognormal component for the number of minutes of AT. This approach has the advantage of increased power to detect significant effects that indicate a common association with both responses. For example, if older respondents were less likely to report any AT and those who did report any AT spent less time in AT, then the combined model could estimate a single parameter for the age effect, increasing the power over that from two separate models. Another advantage of the multivariate model is that it can measure any correlation between the propensity to report AT and the length of time spent in AT in geographic areas with multiple respondents. A disadvantage of this approach is that the more complex model is difficult to apply, requiring larger sample sizes and greater computational effort to estimate its parameters than either model component separately. These difficulties are compounded by the need to account for the correlation of responses among neighbors.
Methods have been developed to analyze data that result from a mixture of two different statistical distributions. Zero-inflated Poisson (ZIP) methods, introduced by Lambert in 1992 [
37], are regression models for count data with an excess number of zero responses. These models include a model component to represent the probability the dependent variable occurred in a subject. More recently, these zero-inflated mixture model methods have been extended to other types of data [
38]. For example, Tooze et al. proposed a mixture model that included random effects correlation among the repeated responses of individuals [
39]. This method has been applied successfully to 24-hour dietary recall data, with separate regression components for whether the respondent ate a particular food during that day and for their amount consumed of that food [
40]. The probability that a person ate the food is modeled by a logistic regression model and the usual amount consumed is modeled by a normal regression model, after a suitable normalizing data transformation. This model produces a direct estimate of the correlation between the two model components but does not allow estimation of spatial correlation of the respondents, an important goal of our CHIS analysis.
We used SAS PROC GLIMMIX to implement a multivariate model that is a mixture of logistic and lognormal regression components for the probability that a person reported any AT and the amount of AT, respectively, similar to the model for dietary intake described above [[
41] example 5]. Covariates found to be significant predictors of either outcome (any AT and amount of AT) were included and were initially allowed to vary by type of outcome. Those with non-significant effects, as measured by p-values of the Type III (partial) sums of squares F test greater than 0.05, were removed. If there was no significant difference in an effect between the two model components, the two parameters were replaced by a single common parameter for that effect. Covariates that were significant predictors for only one of the two counties were retained in both county models for comparability of effects. Covariates indicating gender, race and age were retained regardless of significance in order to compare effects across models and counties.
Each of the two regression components could include correlation among persons living in the same small geographic area, i.e., AT habits could be similar in small neighborhoods. Failure to account for this correlation, if it exists, violates the assumption of independent residual errors in standard regression analyses and can lead to mis-specification of the variances and covariances of model parameters, which in turn leads to mis-specification of the corresponding statistical significance. The spatial correlation in the original data can be accounted for by model covariates that explain the spatial patterns or by use of a spatial error structure for the variance/covariance matrix of a model random effect (a hierarchical analysis) or of the model residuals. For this analysis, we attempted to include covariates that would explain most of the underlying spatial pattern in AT behavior but also included a random effect to account for any remaining spatial correlation.
We did not assume that the degree of spatial correlation was identical for the two types of responses. Spatial correlation was assessed in two ways: by an exponential decay function where correlation decreased with increasing distance between respondents' addresses, and by a threshold function where responses of persons who lived within a defined neighborhood had a constant correlation but were not correlated at all with responses from outside that neighborhood. Spatial correlation for each county was assessed by using a spline approximation on a 30 × 30 cell grid, corresponding to neighborhoods approximately 2.3 miles square; smaller neighborhoods had too few respondents for stable assessment of the correlation. The threshold model was repeated with neighborhood defined by the respondents' postal zip codes.
No single statistic is available to assess how well mixed effects models fit because of the complexity of the likelihood in the presence of random effects. We compared values of the generalized chi-square statistic for goodness-of-fit and checked the final models by rerunning their fixed effects equivalents separately to calculate the Hosmer-Lemeshow statistic [
42] for the logistic component and the likelihood ratio statistic for the lognormal component. Residuals were examined and variograms were plotted and compared for the original and residual data. Distances for the variogram calculations were Great Circle distances based on the geocoded locations.
The spatial and non-spatial models cannot be compared directly because of the default likelihood approximation used by SAS PROC GLIMMIX for random (spatial) effects models. Therefore we attempted to rerun the final models on a more powerful LINUX PC to obtain exact likelihood results. The local neighborhood spatial models did not converge, required more computer memory than was available or produced an invalid variance/covariance matrix. The zip code threshold models did converge using adaptive quadrature integral approximation methods. Because of the computational difficulties in optimizing the exact likelihoods, particularly for the larger LA sample, the results in this paper are the pseudo-likelihood (Restricted Maximum Likelihood) results, unless otherwise specified.
The computational difficulties involved in estimating parameters in models where the variances/covariances are unknown, as is the case for spatial models, are well documented [[
43], Chapter 9]. Inclusion of random effects or the need to estimate the covariance parameters requires use of an iterative estimation procedure, i.e., there is no exact solution to the optimization equations. Assessment of convergence, as reported above, is essential for any of these models, as it gives some assurance that the results are reliable. We addressed this problem by using a well-tested commercial software program [
34] for the iterative parameter estimation process and by screening covariates and their interactions carefully to develop a parsimonious model to improve model stability. Finally, we compared results for several types of models (fixed and random effects, separate and joint propensity and duration models), with several subsets of covariates and at different geographic scales, looking for consistent effects.
The joint model of propensity and duration is complex but allows information about one type of outcome (propensity or duration) to aid in predicting the other, in theory providing a more robust approach than analyses treating propensity and duration separately or simply using logistic regression with zero or zero + low levels of activity as one of the categories in the dependent variable.