Consistent with previous research on the dimensionality of other CAHPS® measures (Reise, Morizot, and Hays 2007), our results suggest that the CG-CAHPS® measures can be modeled either as a single-factor construct or as a multidimensional construct. For applications of IRT using the scores, such as the ordinal logistic regression/IRT DIF algorithm employed here, these findings suggest that a unidimensional IRT model may be appropriate. However, when measurement precision is paramount, such as in a computerized adaptive test, a scoring strategy incorporating the bifactor structure may be more appropriate. Although the distribution of scores was approximately normal (Appendix C), the test characteristic and item information curves indicate that the measure cannot discriminate respondents’ overall experiences of care at the upper end of the scale. The inability to distinguish patient scores at the upper end of the performance continuum, for example, the 80th percentile from the 95th, indicates that small differences at the top of the physician performance distribution may be measured with low precision. The extent to which physician performance comparisons at the upper end of the scale can be made reliably remains unclear. Given the use of the CG-CAHPS® in pay-for-performance strategies (Pearson et al. 2008; Damberg et al. 2009; Rodriguez et al. 2009a), this topic seems especially important for subsequent analyses.
The CG-CAHPS® survey does not function in systematically different ways for the racial and ethnic minority groups examined. Consequently, previously documented racial and ethnic group differences on the CG-CAHPS® measures likely reflect true differences rather than measurement bias. We evaluated CG-CAHPS® items for DIF with respect to a large number of covariates and found a few items with nonuniform DIF, which was not surprising given the reliance on statistical significance for detecting nonuniform DIF and the large sample size. When we accounted for all sources of DIF, we found negligible DIF impact. Previous studies demonstrate that respondents from some ethnic minority groups have extreme reporting tendencies on rating scales (a higher probability of using the high and low ends of the rating scale versus the middle; Weech-Maldonado et al. 2008). Our results suggest that experience-based reports may be less vulnerable to DIF by design. For example, rather than asking patients to provide a rating (“On a scale from 0 to 10, how would you rate …”), the report items that comprise the CG-CAHPS® survey ask about specific patient experiences (“how often …”). Compared with ratings, reports may be less subject to bias arising from norms or standards that vary with cultural factors (Harris-Kojetin et al. 1999; Schnaier et al. 1999).
The one important exception to the general DIF pattern was Q11 (wait in the office), where nonuniform DIF was found by primary language spoken at home for Latinos, the duration of the physician–patient relationship, the number of physician visits made in the prior year, and self-rated physical and mental health. The DIF may stem from the fact that the question uses a concrete time interval (15 minutes) for respondents to consider rather than a qualitative anchor, that is, the extent to which the physician “listens carefully.” Different expectations about waits may result in DIF. Previous research suggests that Latinos have a higher tolerance for waits and that worse wait experiences are not as strongly correlated with overall impressions of care for Latinos compared with whites (Wilkins et al. 2011). Our results indicate that scale developers should follow up with cognitive interviews to examine the sources of DIF for experiences of care that focus on time and/or waits.
Our study results should be viewed in light of important limitations. First, although the respondent sample is large and diverse, all patients are commercially insured and report an established relationship with a primary care physician. This sample is considerably more educated and less diverse than the overall primary care patient population in southern California. Different expectations of care among uninsured or Medicaid-insured patients might be associated with DIF with respect to insurance status, and we are unable to assess these effects with the available data. In addition, DIF with respect to CG-CAHPS® survey language (Bann, Iannacchione, and Sekscenski 2005) was not examined, and the equivalence of the scale across survey languages, for example, Spanish versus English versions, should be clarified. Second, the survey response rate (39 percent) was modest, and differential patient nonresponse might introduce bias. Because the data were collected for quality improvement purposes, limited information is available on the characteristics of the outgoing survey patient sample. Patient characteristics used for DIF assessment were all self-reported and ascertained in the survey, so we are unable to assess sociodemographic or health status differences between respondents and nonrespondents. Vulnerable patients are less likely to respond to mailed surveys than other patients (Zaslavsky, Zaborski, and Cleary 2002), so the racial/ethnic and primary language subgroup comparisons were conducted with a favorable selection of patient samples across subgroups. Previous studies, however, underscore the appropriateness of DIF detection for small or limited respondent samples (Morales, Reise, and Hays 2000; Lai, Teresi, and Gershon 2005), indicating that these methods are appropriate for the study data. Future research should clarify the extent to which sample representativeness affects the measurement of DIF impact on patient experience measures.
Third, all surveys were completed by mail, and therefore assessment of DIF by survey mode was not possible. Finally, a unidimensional logistic regression/IRT approach was used to identify DIF items even though the fit statistics were better for the bifactor model. The loadings on the primary factor of the single-factor model are not very different from those of the bifactor model, however, indicating that the scale is sufficiently unidimensional to employ a single-factor IRT model. The bifactor model findings are useful because they facilitate the use of an extensive framework for DIF analyses, and they serve as important intermediate results because they affirm a key assumption (a sufficiently unidimensional scale) made by our analyses. As DIF detection procedures are developed for bifactor models and other structures, it will be interesting and important to repeat these analyses to ensure that the findings are robust. At present, especially for the multiple-covariate case considered here, procedures for analyzing and accounting for DIF with bifactor structures are not yet widely available. The current analyses represent the state of the art and, to our knowledge, are the first attempt to apply DIF analyses to the CG-CAHPS® measures.
In conclusion, the English version of the CG-CAHPS® survey functions similarly across commercially insured respondents from diverse backgrounds. As a result, the racial and ethnic differences previously documented on the CG-CAHPS® measures (Rodriguez et al. 2008) likely represent “true” differences rather than DIF. Future research, however, should examine whether the measures function differently by patient insurance status, as experiences of uninsurance might affect respondents’ expectations of care and may be associated with DIF. Importantly, the CG-CAHPS® test characteristic and information curves raise concerns about the use of standard scores from the instrument to measure patients’ experiences over time, because standard scores have a nonlinear relationship with the underlying trait level measured by the test (Crane et al. 2008a). Furthermore, it will be important for researchers to clarify the extent to which physicians with performance at the top end of the scale can be reliably differentiated from one another.