The four communities included in HCHS/SOL are located in the Bronx, Chicago, Miami, and San Diego. The sampled area in each community was defined by a group of neighboring census tracts to provide geographical balance and diversity with respect to Hispanic/Latino background. Each community’s field center purposively selected its targeted tracts based on their proximity to the clinic, tract-level demographic distributions available from the 2000 decennial census, and local information about neighborhoods. The target population in HCHS/SOL corresponds to all non-institutionalized Hispanic/Latino adults aged 18-74 years residing in the four sampled areas. Probability sampling within these areas is employed to ensure a broad representation of the target population and to minimize the various sources of bias that may otherwise enter into the cohort selection and recruitment process.
The Need for Probability Sampling
The design of a population-based sample must accommodate the specific informational needs of the study. For HCHS/SOL, the selected sample should be broadly representative of the target population in that the sample mirrors the full range of possible values for key outcome variables while also providing adequate representation of important combinations of predictor and outcome variables (1). Probability sampling provides a means for achieving such balanced representation. Probability sampling also provides a basis for making unbiased inference to target population characteristics of interest. These advantages, however, come at a cost. Probability sampling requires the exclusive use of random selection so that the statistical probability of choosing each sample member can be calculated (2). Random selection requires enumeration of members of the target population, or well-defined subsets thereof, and can be costly to implement. As a result, study designs incorporating more convenient methods of selection are often utilized for population-based cohort studies.
Any sample design that utilizes non-random selection (e.g., a convenience sample) produces a non-probability sample. Quasi-probability samples that combine random and non-random methods of selection (e.g., allowing interviewers to subjectively select a quota sample of households within a random sample of neighborhoods) are also non-probability samples (3). Accompanying the simplicity and lower cost associated with non-probability sampling are two problems. First, there is no direct theoretical basis for making estimates of population characteristics from the sample (5). Instead, one must either defend a model to explain the generation of the sample data from some underlying distribution or assume that the variability of sample-based estimates is similar to that associated with simple random sampling. Both assumptions are difficult to verify. Second, non-probability samples typically offer a skewed reflection of the sampled population due to diminished participation by population sectors (1). Self-selected samples exclude those who are more reluctant to volunteer or less accessible; allowing interviewers to decide who is selected can also exclude those not meeting personal preferences, leading to potentially biased study results. The magnitude of this bias is directly related to the extent of under-representation in the sample and the degree to which key study measurements on those included differ from those not included. While it is true that sources of error unrelated to sample selection (e.g., non-response) can bias the analysis of data from probability samples, non-probability samples are subject to these same non-sampling errors, producing estimates with bias due to both non-random selection and non-sampling sources (6).
To illustrate the potential for bias in non-probability samples, we compared health outcomes estimated from a national probability sample to outcomes estimated from a simulated clinic-users sample using data from the 2005 Medical Expenditure Panel Survey (MEPS). MEPS utilizes a national probability sample of all civilian, non-institutionalized U.S. residents (7). The subset of MEPS respondents reporting one or more physician visits in the past year (“clinic users”) mimics a convenience sample selected through physician practices alone. Estimates of the number of chronic conditions, average cost of physician visits, obesity prevalence, and number of work days missed due to illness/injury are provided in for the full sample and the clinic-users sample, both overall and by race/ethnicity. Sampling weights are incorporated in the analysis to account for disproportionate sampling of population subgroups in the MEPS study design. Estimates of the clinic-users sample standardized by age, race/ethnicity, and gender to the MEPS target population are also provided in , in an attempt to adjust for skewness in the convenience sample. The findings reveal that estimates from the clinic-users sample are consistently higher than those from the probability sample. This selection bias is not unexpected due to the association between the criteria for selecting the clinic-users sample and the outcome measures (i.e., clinic users are more likely to have health problems than the population as a whole). The fact that standardizing estimates from the clinic-users sample does not consistently compensate for the skewed representation due to non-probability sampling, however, is unexpected. Standardization appears to offset the effect of selection bias for one outcome (chronic conditions), partially compensate for the effect in another (health care costs), and exacerbate the effect in the remaining two measures (obesity and days lost). For subgroup comparisons, the white-nonwhite difference in obesity prevalence for the clinic-users sample overstates the actual difference, and standardization exacerbates this overstatement. While this example highlights the pitfalls of just one form of non-random selection, similar results would be expected for other forms of convenience sampling.
Results of a Simulated Comparison of Probability and Non-Probability Samples Using Data from the 2005 Medical Expenditure Panel Survey (MEPS)
Rationale for Key Sample Design Features
A probability-based sampling strategy was chosen for HCHS/SOL, with specific features dictated by the goals and overall design of the study. First, the decision to identify Hispanics/Latinos from the general residential population made controlling the cost of face-to-face recruitment a priority. The mode of recruitment and data collection is an important cost factor in population-based studies, and while mail and web-based methods are inexpensive, non-sampling errors due to incomplete frame coverage and non-response can occur. Telephone screening is also relatively inexpensive, but its exclusive use was impractical for HCHS/SOL due to the declining use of telephone landlines and the fact that an extensive clinic visit is a key component of data collection. Face-to-face sample recruitment was seen as the only real option for HCHS/SOL; consequently, steps to control the associated higher costs were needed. One obvious cost-saving measure was to sample geographic clusters of households (i.e., census block groups) at the first stage of a multi-stage sample in order to reduce the cost of return visits to neighboring households. More substantial cost savings were realized through over-sampling of both clusters and households within clusters most likely to be Hispanic/Latino, thereby reducing the number of sampled households that must be screened to achieve the study’s sample size goals. Geographic clusters were stratified by the proportion of the population found to be Hispanic/Latino in the 2000 decennial census, and clusters in the “high concentration” stratum were selected at a higher rate than clusters in the “low concentration” stratum at the first stage of sample selection. An optimal delineation point between high and low concentration was determined for each field center using Cochran’s cumulative √f rule (8). Similarly, household addresses within clusters were divided into two strata, those associated with Hispanic/Latino surnames versus all others. Hispanic/Latino surname addresses were selected at a higher rate than other addresses at the second stage. Over-sampling in multiple stages of the selection process in this way provides efficiencies in sample identification while still retaining the advantages of random selection.
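As a rough illustration of how a cumulative √f rule can locate a delineation point, the sketch below bins clusters by percent Hispanic/Latino, accumulates the square roots of the bin frequencies, and places the boundary at the bin edge whose cumulative value is closest to an equal share of the total. The function name, the bin edges, and the frequencies are hypothetical and chosen only for illustration; the source does not give the binning or the data actually used at each field center.

```python
import math


def cum_sqrt_f_boundaries(bin_upper_edges, frequencies, n_strata=2):
    """Pick stratum boundaries with the cumulative sqrt(f) rule.

    bin_upper_edges: upper edge of each bin of the stratification variable
    frequencies:     number of clusters falling in each bin
    Returns the n_strata - 1 bin edges that split the cumulative sqrt(f)
    scale into (approximately) equal intervals.
    """
    cumulative = []
    running = 0.0
    for f in frequencies:
        running += math.sqrt(f)
        cumulative.append(running)
    total = cumulative[-1]

    boundaries = []
    for k in range(1, n_strata):
        target = k * total / n_strata
        # The last bin's upper edge cannot serve as an interior boundary.
        idx = min(range(len(cumulative) - 1),
                  key=lambda i: abs(cumulative[i] - target))
        boundaries.append(bin_upper_edges[idx])
    return boundaries


# Hypothetical counts of block groups by percent Hispanic/Latino (10-point bins).
edges = [10, 20, 30, 40, 50, 60, 70, 80]
counts = [25, 16, 9, 4, 4, 4, 1, 1]
high_low_cut = cum_sqrt_f_boundaries(edges, counts)
```

With these made-up counts the cumulative √f values are 5, 9, 12, 14, 16, 18, 19, 20, so the midpoint target is 10 and the boundary falls at the 20 percent edge. Real applications usually use many narrow bins so that the chosen edge sits close to the target.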
Meeting the HCHS/SOL objectives requires adequate representation of the socio-economic status (SES) distribution of residents of the defined community areas. Although SES is an individual- or household-level characteristic, it is rarely possible to stratify a sample of households by a direct measure of SES. The next best option is to use census measures such as educational attainment or household income as a practical proxy indicator (9). To this end, geographic clusters were stratified by the proportion of residents aged 25 years or older with at least a high school education based on the 2000 census. The high and low SES delineation point was defined as the median value of the distribution across clusters, and the first-stage sample was allocated proportionately across strata to ensure broad SES representation.
To meet the HCHS/SOL objective of identifying predictors of disease outcomes including cardiovascular events, a target sample size of 10,000 persons aged 45-74 years (62.5% of the full cohort) was set. Over-representation of this age group required sub-sampling households or persons within households according to the household’s age distribution. Such a procedure is best applied during screening, with the intention of retaining a higher proportion of discovered older Hispanics/Latinos than would occur if persons were chosen at random. Sub-sampling according to age was accomplished in one of two ways. Method 1 was designed to keep all households intact, with no sub-sampling at the person level, and was adopted at study start. With this method, households in which the Hispanic/Latino adults are all aged 45-74 years are selected with certainty (probability of selection = 1) within the first-stage cluster, and all other households are sub-sampled with probability < 1. Method 2 involves dividing each household into two sub-clusters, Hispanics/Latinos aged 45-74 years and Hispanics/Latinos aged 18-44 years. The 45-74 year sub-clusters are selected with certainty (probability = 1), while the 18-44 year sub-clusters are selected with probability < 1. This method involves sub-sampling persons within a household rather than keeping households intact, but can result in fewer households needing to be screened and was adopted after study start for efficiency.
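The two age sub-sampling methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the screening software used in the study: the function names and the retention rates passed in are hypothetical, and `ages` is taken to list only the Hispanic/Latino members of a screened household.

```python
import random


def method1_select(ages, rate_other_households, rng):
    """Method 1: households are kept intact.

    Households whose eligible adults are all aged 45-74 are retained with
    certainty; every other household is retained, as a whole, with
    probability rate_other_households (< 1 in practice).
    """
    eligible = [a for a in ages if 18 <= a <= 74]
    if not eligible:
        return []
    all_older = all(45 <= a <= 74 for a in eligible)
    if all_older or rng.random() < rate_other_households:
        return eligible
    return []


def method2_select(ages, rate_younger_subcluster, rng):
    """Method 2: each household is split into two age sub-clusters.

    The 45-74 sub-cluster is retained with certainty; the 18-44
    sub-cluster is retained with probability rate_younger_subcluster.
    """
    older = [a for a in ages if 45 <= a <= 74]
    younger = [a for a in ages if 18 <= a <= 44]
    selected = list(older)
    if younger and rng.random() < rate_younger_subcluster:
        selected.extend(younger)
    return selected


rng = random.Random(2008)           # fixed seed, for reproducibility only
household = [30, 52]                # hypothetical mixed-age household
kept_m1 = method1_select(household, rate_other_households=0.4, rng=rng)
kept_m2 = method2_select(household, rate_younger_subcluster=0.4, rng=rng)
```

Under Method 1 the 52-year-old in this mixed household is lost whenever the household is not retained, whereas Method 2 always keeps that person, which is consistent with the text's remark that Method 2 can reduce the number of households that need to be screened.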
The final design consideration was the need to compare health characteristics by Hispanic/Latino background among the four field centers. Valid comparisons require comparability across sites in cohort recruitment, but not necessarily identical probability sample designs. Indeed, the same sample design structure was used (i.e., two-stage stratified sampling of households with the same sampling units and stratification variables in each stage), with some allowance for how the strata were defined and the sample allocated among the centers.
A stratified two-stage area probability sample of household addresses was selected in each of the four HCHS/SOL field centers. A summary of the center-specific designs is presented in . At the first stage, a stratified simple random sample of census block groups (BGs), which served as primary sampling units (PSUs), was selected in each field center. PSU sampling strata were defined by the cross-classification of (i) high and low Hispanic/Latino concentration and (ii) high and low SES, defined above. The distribution of BGs across strata and the over-sampling ratios for high and low Hispanic concentration strata are presented in . Special strata were created as needed to target specific neighborhoods. In the Bronx, a fifth stratum was defined as a portion of a high-rise housing complex (named Co-op City) in order to provide additional income diversity, and two additional strata were appended after study start to increase coverage. In Miami, a fifth stratum was defined with high expected concentrations of Central and South Americans, and a sixth stratum corresponding to an area with a high concentration of Cuban residents was appended after study start. All BGs within these special strata were selected. Overall, 632 (73%) of the 871 BGs in the target areas were selected for the PSU sample.
Summary of HCHS/SOL Sample Design Features
Design Characteristics of the HCHS/SOL Sample
Separate stratified second stage samples of household addresses were selected within each sample PSU. Address listings came from the Delivery Sequence File (DSF) available from the US Postal Service and obtained through MSG-Genesys of Ft. Washington, PA. The DSF addresses within each sample BG were cross-referenced with telephone and commercial mailing lists, and surname and telephone number were appended where available. provides the second-stage over-sampling ratios for the Hispanic/Latino surname strata used to achieve the final sample of 123,213 addresses.
The sample addresses in each field center were randomly sub-sampled to form three waves corresponding to the three years of recruitment. Thus, the yearly sample for each field center was representative of the target community area, thereby minimizing bias due to temporal trends.
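One simple way to form such waves is to shuffle the sampled addresses and deal them out in turn, so that each wave is itself a random sub-sample of the full address sample. The sketch below assumes a flat list of addresses for one field center and a hypothetical function name; the source does not say whether the split was carried out within sampling strata, which a production implementation would probably do.

```python
import random


def assign_waves(addresses, n_waves=3, seed=None):
    """Randomly partition sampled addresses into n_waves recruitment waves."""
    rng = random.Random(seed)
    shuffled = list(addresses)
    rng.shuffle(shuffled)
    # Deal the shuffled addresses out round-robin so wave sizes differ by at most one.
    return [shuffled[i::n_waves] for i in range(n_waves)]


# Hypothetical address identifiers for one field center.
waves = assign_waves(range(1, 11), n_waves=3, seed=1)
```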
A key feature of the HCHS/SOL sample design is the ability to modify components in order to adapt to recruitment experiences. The modifications made to date include the designation of a sixth stratum in the Miami field center to append certain block groups in the Hialeah neighborhood for increased coverage of the Cuban population, and the designation of sixth and seventh strata in the Bronx to capture a neighborhood adjoining the original target area, thereby increasing coverage of the Bronx Hispanic/Latino community.
Approximately six months into recruitment, a decision was made to apply Method 2 for over-sampling adults aged 45-74 years in lieu of Method 1, based on the need to accept a higher proportion of households into the sample and reduce recruitment time. The selection probabilities for both methods of over-sampling 45-74 year-olds during household screening were initially based on 2005 American Community Survey data for the geographic region of each field center. The sample age distribution is monitored continually as data on HCHS/SOL households accumulate, and the selection probabilities are adjusted as needed. provides the sub-sampling rates for each method applied to each field center.
Sample Size and Data Analysis
Each field center will enroll 4,000 Hispanics/Latinos with the prescribed age distribution, namely 2,500 aged 45-74 years and 1,500 aged 18-44 years. In terms of Hispanic/Latino background, the Bronx field center sample is predominantly Puerto Rican and Dominican, while the majority of participants in the San Diego site are Mexican in origin. Study participants in the Miami field center are Cuban and Central/South American, and participants in the Chicago field center are Mexican, Puerto Rican, and Central/South American. A minimum of 2,000 participants in each of the pre-specified Hispanic/Latino groups (Mexican, Puerto Rican, Cuban, and Central/South American) is required to support the analysis objectives, and sample sizes are monitored continuously to determine if adjustments to the sampling strategy are needed.
The HCHS/SOL sample size will support a broad range of analyses planned for the study. As an example, consider the possible association of an exposure variable with incident disease. The range of hazard ratios detectable with approximately 90 percent power is provided in by event rate and the relative sample sizes of low to high risk groups. The estimates incorporate a design effect of 1.25 to account for clustering in the sample, based on an average cluster size (persons per block group) of 24 and an intra-class correlation for incident disease of 0.01. Based on the entire study cohort of 16,000, a hazard ratio of 1.6 would be detectable for an event occurring at the rate of 4 per 1,000 person years of follow-up with equally sized low and high risk groups, e.g., for a continuous exposure variable dichotomized at the median value. With a population subgroup of size 4,000 (e.g., a single site or Hispanic/Latino subgroup), the hazard ratio detectable in the same circumstances is 2.25. For higher levels of intra-class correlation, power for both comparisons would decrease.
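The design effect quoted above is in line with the usual approximation for cluster samples, deff = 1 + (m - 1)ρ, where m is the average cluster size and ρ the intra-class correlation. The sketch below applies that approximation to the stated values; the function names are illustrative, and the text does not say how the authors moved from the computed 1.23 to the 1.25 they report, so the rounding is treated here as an assumption.

```python
def design_effect(avg_cluster_size, icc):
    """Approximate design effect for clustering: 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n, deff):
    """Sample size of a simple random sample with equivalent precision."""
    return n / deff


deff_computed = design_effect(24, 0.01)               # 1.23 from the stated inputs
n_eff_cohort = effective_sample_size(16000, 1.25)     # 12800.0 using the reported deff
n_eff_subgroup = effective_sample_size(4000, 1.25)    # 3200.0 using the reported deff
```

The same function shows why power falls as the intra-class correlation rises: with ρ = 0.05 and the same cluster size, the approximation gives a design effect of 2.15, which would roughly halve the effective sample size.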
Hazard Ratios Detected with Approximately 90% Power by Event Rate and Low-to-High Risk Group Ratio
The use of multi-stage or clustered sampling creates complexity in data analyses due to correlations among sample units at the various stages of selection: here, correlations among households within the same block group and correlations among individuals in the same household. Similarly, over-sampling through the use of differential probabilities of selection requires the use of sampling weights for unbiased estimation of population characteristics. While clustering and unequal probabilities of selection tend to increase the variability of population estimates and reduce the power available for testing associations, stratification at one or more stages of sample selection has the reverse effect. Ensuring accurate estimation of variances and valid statistical tests of hypotheses therefore requires appropriately accounting for the HCHS/SOL sample design during data analysis. Initial sampling weights will correspond to the inverse probability of selection for each participant. Non-response adjustments and calibration to known population totals (from the 2010 decennial Census, when available) will be applied. Final sampling weights, stratification variables, and cluster identifiers will be available for design specification during data analysis. A variety of statistical methods that account for multi-stage sampling are available (see, e.g., 10), and most standard statistical software packages are able to accommodate probability sample designs (e.g., SAS, Stata). Special purpose software (e.g., SUDAAN) for complex sample designs is also available.
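To make the role of the initial sampling weights concrete, the sketch below computes a base weight as the inverse of the product of stage-wise selection probabilities and uses it in a weighted mean. The stage probabilities and outcome values are hypothetical, and the sketch covers point estimation only: it omits the non-response adjustment, the calibration step, and the design-based variance estimation that survey software such as that named above provides.

```python
def base_weight(stage_probabilities):
    """Inverse of the overall selection probability across sampling stages."""
    overall = 1.0
    for p in stage_probabilities:
        overall *= p
    return 1.0 / overall


def weighted_mean(values, weights):
    """Weighted estimate of a population mean (or a prevalence for 0/1 values)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical participants: (block group, address, age sub-sampling) probabilities.
w_oversampled = base_weight([0.5, 0.5, 1.0])      # 4.0
w_undersampled = base_weight([0.25, 0.25, 0.5])   # 32.0

# Hypothetical 0/1 outcome: present for the over-sampled participant only.
prevalence = weighted_mean([1, 0], [w_oversampled, w_undersampled])
```

In this toy case the unweighted prevalence would be 0.5, while the weighted estimate is 4/36, or about 0.11, because the participant with the outcome came from a heavily over-sampled part of the population and therefore represents fewer people.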
Successful implementation of probability sampling requires a systematic approach to recruitment in order to realize the benefits of the sample design. If subjective factors such as interviewer preference enter into the recruitment process, then the objectivity associated with random selection will not be achieved. The goals of HCHS/SOL recruitment are to optimize the ability to establish contact with, determine eligibility of, and actively engage households at every sample address, regardless of the neighborhood or living conditions encountered in the field. Recruitment teams inform potential participants of the study objectives and associated benefits of their participation. The research nature of the study is emphasized, including the information it is designed to provide and the impact the study results may have on policy making and health care for future generations of US Hispanics/Latinos. Extensive community engagement efforts provide the context for this information exchange, including collaborations with community-based organizations and targeted media campaigns.
The recruitment protocol consists of three steps: (i) initial mailings to sample addresses describing the study; (ii) optional telephone contacts for households with telephone numbers available; and (iii) in-person contacts. Once contact is established, a brief household screener is administered via a digital hand-held device to determine eligibility and implement the age sub-sampling procedure (12). Upon obtaining agreement to participate, a roster of household members is created, and individual eligibility is confirmed. Persons who are on active duty military service, not currently living at home, planning to move from the area in the next six months, or physically unable to attend the clinic examination are considered ineligible.
Household- and individual-level screening and eligibility rates and clinic participation rates are monitored continuously, and adjustments to selection parameters (for age) or fielding of sample addresses (for SES and background) are made as needed. At the conclusion of HCHS/SOL recruitment, final household- and individual-level participation rates will be computed among those eligible for the study. A goal of 60% participation was set at the onset of recruitment.