Our objective was to develop a model for case-mix adjustment of Consumer Assessment of Healthcare Providers and Systems (CAHPS®) Hospital survey responses, and to assess the impact of adjustment on comparisons of hospital quality.
The data source was a survey of 19,720 patients discharged from 132 hospitals.
We analyzed CAHPS Hospital survey data to assess the extent to which patient characteristics predict patient ratings (“predictive power”) and the heterogeneity of the characteristics across hospitals. We combined the measures to estimate the impact of each predictor (“impact factor”) and selected high impact variables for adjusting ratings from the CAHPS Hospital survey.
The most important case-mix variables are: hospital service (surgery, obstetric, medical), age, race (non-Hispanic black), education, general health status (GHS), speaking Spanish at home, having a circulatory disorder, and interactions of each of these variables with service. Adjustment for GHS and education affected scores in each of the three services, while age and being non-Hispanic black had important impacts for those receiving surgery or medical services. Circulatory disorder, Spanish language, and Hispanic affected scores for those treated in the surgery, obstetric, and medical services, respectively. Of the 20 medical conditions we tested, only circulatory problems had an important impact within any of the services. Results were consistent for the overall ratings of nurse, doctor, and hospital. Although the overall impact of case-mix adjustment is modest, the rankings of some hospitals may be substantially affected.
Case-mix adjustment has a small impact on hospital ratings, but can lead to important reductions in the bias in comparisons between hospitals.
The Consumer Assessment of Healthcare Providers and Systems (CAHPS®) Hospital project is an extension of the CAHPS project, in which the Agency for Healthcare Research and Quality (AHRQ) funded a consortium of investigators to develop patient surveys to assess consumer experiences of health care (Homer et al. 1999; Hargraves, Hays, and Cleary 2003; Daniels et al. 2004; Landon et al. 2004). The CAHPS Hospital project developed surveys to assess the experiences of patients recently discharged from acute care hospitals.
Results of the CAHPS Hospital surveys will be used to compare quality among hospitals, to support decision making by patients, physicians, and payers, and to facilitate quality improvement in hospitals. When making such comparisons, there are at least two reasons why it might be desirable to adjust CAHPS Hospital scores (Zaslavsky et al. 2001). First, some processes of care are likely to vary with patient characteristics. For example, it might be more difficult to communicate clearly with less educated patients or patients who take more medication (Hargraves et al. 2001; Zaslavsky et al. 2001). Varying distributions of these characteristics across hospitals might affect the rate of problems with care. Second, patients' characteristics can influence how they respond to survey questions. For example, a younger patient might be more sensitive to waiting time and thus give lower scores than an older patient with fewer time constraints.
Without adjustment for case-mix, reports and ratings of hospital care may be misleading. Furthermore, hospitals would have an incentive to attract patients likely to give higher ratings and avoid those most likely to report problems. Case-mix adjustment uses statistical models to predict what each hospital's ratings would have been for a standard patient or population, thereby removing from comparisons the predictable effects of differences in patient characteristics that are consistent across hospitals.
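In code, this standardization amounts to subtracting each hospital's predicted case-mix shift, relative to the standard population, from its raw mean. A minimal sketch with illustrative coefficients and covariate means, not fitted values from the study (all names are ours):

```python
def adjusted_mean(raw_mean, coefs, hosp_means, pop_means):
    """Adjust a hospital's raw mean rating to a standard population.

    Subtracts the predicted effect of the hospital's case mix relative
    to the overall population: raw - sum_k beta_k * (xbar_hk - xbar_k).
    """
    shift = sum(b * (h - p) for b, h, p in zip(coefs, hosp_means, pop_means))
    return raw_mean - shift

# A hospital whose patients are older and healthier than average: if age
# and GHS both predict higher ratings, its raw mean is adjusted downward.
print(adjusted_mean(9.0, coefs=[0.1, 0.2], hosp_means=[5.0, 3.5], pop_means=[4.5, 3.0]))
```

Because the shift depends only on coefficients that are constant across hospitals, the adjustment removes predictable case-mix effects without absorbing genuine between-hospital quality differences.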
Age and self-rated general health status (GHS) typically have the strongest and most consistent associations with patient-reported problems, with greater satisfaction among older patients and those with better self-perceived health (Cleary and McNeil 1988; Cleary et al. 1989; Ware and Berwick 1990; Ehnfors and Smedby 1993; Charles et al. 1994; Arnetz and Arnetz 1996; Rosenheck, Wilson, and Meterko 1997; Woodbury, Tracy, and McKnight 1998; Hoff et al. 1999; Hargraves et al. 2001; McNeill et al. 2001; Jenkinson, Coulter, and Bruster 2002; Thi et al. 2002; Wilson et al. 2002). Similar predictors are important for evaluations of health plans (Zaslavsky 1998; Elliott et al. 2001; Zaslavsky et al. 2001).
There is some evidence that other characteristics, such as education, marital status, income, and sex are related to survey responses about health care (Ehnfors and Smedby 1993; Charles et al. 1994; Rosenheck, Wilson, and Meterko 1997; Hoff et al. 1999; Thi et al. 2002) but those results are not consistent (Cleary and McNeil 1988). Lengths of stay and readmission have also been associated with hospital ratings and reports of care; however, these are not appropriate case-mix adjustors because they could be affected by the quality of care. In this paper, we assess which patient characteristics should be used in a model for adjusting CAHPS Hospital scores when making hospital comparisons.
The CAHPS Hospital pilot survey included 33 questions about patients' experiences with various aspects of care (e.g., “During this hospital stay, how often did you have to ask for pain medication?”) and three questions that elicit overall ratings of the hospital, doctors, and nurses, as well as a question about whether the patient would recommend the hospital to others (Elliott et al. 2005). There also are 13 questions about patient characteristics.
Patients were selected at each hospital using random sampling within service (medical, surgical, obstetric). Eligible patients were adult (aged 18 years or older) medical, surgical, and obstetric patients who had an overnight stay and were discharged between December 2002 and January 2003. Patients were excluded from the study if they had a psychiatric diagnosis, were under age 18 at the time of their admission to the hospital, were not discharged to home, or were missing data needed for identification and surveying. Sampling fractions were calculated to yield equal numbers of patients from each service, although this was not always possible (e.g., some hospitals did not provide obstetric services).
CAHPS Hospital survey questionnaires were mailed to all sampled patients. Telephone follow-up, or mailing of replacement questionnaires, began about 4 weeks after the mailing of the survey.
After excluding patients who had an undetermined service or hospital affiliation, the sample comprised 19,720 patients discharged from 132 study hospitals (Goldstein et al. 2005). We removed from analysis a single hospital with only eight responses and a hospital that had no medical service responses. We confined our analysis to respondents who received medical (37 percent), surgical (40 percent), or obstetric services (23 percent) at one of the 130 remaining hospitals, leaving a final sample of 19,683 respondents. Twenty-eight of those hospitals had no respondents receiving obstetric services. The number of respondents per hospital ranged from 28 to 512. Only the most recent hospital stay was retained for patients who had multiple hospital stays.
To identify potential case-mix adjustors, we analyzed the extent to which patient characteristics predict overall ratings of “nurse,” “doctor,” and “hospital.” These outcomes were chosen because they are regarded as the patient's summary of the more topic-specific report items and because they are more subjective and therefore likely to be sensitive to reporting effects (Hargraves et al. 2001; Kim, Zaslavsky, and Cleary 2005). For each outcome the hospital is the unit of analysis. Analyses of the nurse and doctor items enable us to learn about the corresponding aspects of hospitals' performance, which are not necessarily captured by the hospital rating, and therefore may reveal important case-mix effects that would otherwise be missed.
The variables from the CAHPS Hospital pilot survey available as case-mix adjustors are: hospital service, self-reported general health status (GHS), self-reported mental health status (MHS), age, gender, education, whether Spanish is spoken at home, if a proxy helped complete the questionnaire, race, and the patient's Diagnosis-Related Group (DRG) code assigned by the hospital. The service variable has three categories indicating if the hospital stay was for surgery, obstetrics, or other medical services. Service was available both from patient self-report and from hospital records. We used DRG codes in these analyses because they were available for all patients and were more accurate.
GHS and MHS had a 5-point response scale (excellent, very good, good, fair, poor). Age had eight categories, mostly 10-year intervals, from 18 to greater than 80 years of age. Education was a 6-category ordinal variable (eighth grade or less; some high school, but did not graduate; high school graduate; some college or 2-year degree; 4-year college graduate; more than 4-year college degree).
Race/ethnicity was represented by separate indicator variables for white, black, Hispanic, Asian, and Native American or Hawaiian. Respondents could check multiple race categories, so we assigned them to a group using the following order of priority: Hispanic, black, Native American or Hawaiian, Asian, and white; thus, a respondent who checked both Hispanic and black was categorized as Hispanic. Spanish language indicates whether Spanish is the language mainly spoken at home; about half of self-reported Hispanics endorsed this item.
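The priority rule can be sketched as a first-match lookup over the ordered categories (the function name is ours; the labels follow the paper):

```python
# Priority order for resolving multiple checked race/ethnicity boxes.
PRIORITY = ["Hispanic", "black", "Native American or Hawaiian", "Asian", "white"]

def assign_race(checked):
    """Return the single highest-priority category among those checked."""
    for category in PRIORITY:
        if category in checked:
            return category
    return None  # no race/ethnicity reported

# A respondent checking both Hispanic and black is categorized as Hispanic.
print(assign_race({"black", "Hispanic"}))  # prints "Hispanic"
```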
Proxy help and proxy answer indicate if the patient required help completing the questionnaire or had the questions answered for them, respectively. Finally, the DRGs assigned at admission to the hospital were grouped in 20 Major Diagnostic Categories (MDC), providing a profile of the patient's condition specific to the inpatient stay in question. We did not consider variables for case-mix adjustment that were characteristics of the hospital or determined by the hospital's actions (e.g., length of stay), as adjusting for such variables might obscure real differences in quality between hospitals.
Our criterion for selection of case-mix adjustors is the “impact factor,” which is the product of two measures: predictive power (the strength of the relationship between the candidate adjustor and the outcome variable at the individual level) and heterogeneity factor (the amount of variation among hospitals in the adjustor variable) (Zaslavsky 1998). Predictive power quantifies the improvement in model fit (R2) attributable to a variable; unlike tests of statistical significance, it does not depend on sample size. The heterogeneity factor measures the extent to which the characteristic is unevenly distributed across hospitals and therefore potentially a source of bias in comparisons. A variable, such as gender, could be highly predictive of responses but have little impact on case-mix adjustment because its distribution is relatively homogeneous across hospitals. Conversely, a variable could have quite different distributions in different hospitals but be unrelated to the rating. By combining both predictive power and heterogeneity into a single measure, the impact factor is more informative than purely predictive measures such as R2; it approximates the magnitude of the incremental adjustments due to adding a variable to the case-mix model.
To select a core set of predictor variables we screened potential adjustors using stepwise regression; this exploratory technique is appropriate because we seek to identify a nonredundant set of variables that predict ratings of hospitals, not to test hypotheses about predictors. To select a parsimonious model, the inclusion and exclusion p-value criteria were set at 0.005, and to check that no important variables were omitted we compared the model against alternatives generated by an all-subsets regression. A further validation of the model was performed by randomly partitioning the data set into halves, refitting the model on one half and predicting the ratings in the other half, and comparing the accuracy of the predictions to those when the model is fit to the full data set. Nine separate regression models were run, for the three overall ratings in each of the three services. Variables selected in any of the models formed a core set eligible for final selection.
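The split-half validation step can be sketched in code. This shows only the mechanics, using a single predictor and ordinary least squares; the paper's models contain many covariates and hospital dummy variables, and all names here are ours:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def split_half_mse(xs, ys, seed=0):
    """Fit on a random half; return mean squared prediction error on the other half."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return sum((ys[i] - (a + b * xs[i])) ** 2 for i in test) / len(test)
```

A held-out error close to the full-sample error suggests the selected model is not overfit to the half used for fitting.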
We estimated the predictive power, heterogeneity factor, and impact factor across all services for each case-mix variable on each CAHPS Hospital score. Interactions between service and the other case-mix variables were also assessed. When interactions were included as case-mix predictors, the corresponding main effects were also included. We also performed separate analyses for each service.
We measured predictive power by the incremental amount of variance explained by the predictor (represented as the partial r2 × 1,000) in a linear regression analysis given the other variables already in the baseline model, including dummy variables for each hospital. We measured the heterogeneity of the predictor variable across hospitals as the ratio of between-hospital to within-hospital variance of the residuals when the variable is regressed on the same baseline variables. The product of the predictive power and heterogeneity factor is proportional to the impact factor, used to assess which variables are both important predictors of CAHPS Hospital ratings and are sufficiently variable across hospitals to warrant case-mix adjustment (Zaslavsky et al. 2001), as described above. We required a minimum impact factor of 1 for a variable to be included.
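The two components of the impact factor follow directly from these definitions. A minimal sketch with illustrative sums of squares and residuals, not the study's data (the heterogeneity computation assumes a simple weighted between-group variance; the study's exact estimator may differ):

```python
def predictive_power(sse_baseline, sse_with_var, sst):
    """Incremental variance explained by the candidate adjustor: partial r^2 x 1,000."""
    return 1000.0 * (sse_baseline - sse_with_var) / sst

def heterogeneity_factor(residuals_by_hospital):
    """Between-hospital over within-hospital variance of the adjustor's residuals."""
    all_res = [r for rs in residuals_by_hospital for r in rs]
    grand = sum(all_res) / len(all_res)
    between = sum(len(rs) * ((sum(rs) / len(rs)) - grand) ** 2
                  for rs in residuals_by_hospital) / len(all_res)
    within = sum((r - sum(rs) / len(rs)) ** 2
                 for rs in residuals_by_hospital for r in rs) / len(all_res)
    return between / within

def impact_factor(pp, het):
    """Impact is proportional to predictive power times heterogeneity."""
    return pp * het
```

A variable passes the screen only when both factors are nontrivial: strong prediction with a homogeneous distribution, or strong heterogeneity with no predictive relationship, yields a small product.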
For the impact analysis, we treated ordinal variables (such as age, health status, and education) as linear effects. This assumes that the effect on ratings of a change between consecutive categories is uniform across the scale (e.g., the difference between ratings from those in poor versus fair health is the same as that between those in good and very good health). This approximation is convenient for calculating the impact of ordinal variables, but might not be the optimal specification if the uniformity assumption is incorrect. For each ordinal variable in the baseline model, we tested the uniformity assumption by comparing (with an F-test) the baseline model with one that recoded the linear variable as a set of dummy variables. Unless the categorical specification significantly improves on the linear specification, the latter can be used with no detectable loss of accuracy. After identifying the case-mix predictors we then tested interactions of adjustors with service, to determine which coefficients differed significantly across services, to arrive at our final model.
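The comparison of the linear and categorical codings is a standard extra-sum-of-squares F-test, which can be sketched as follows (function and argument names are ours):

```python
def linearity_f_stat(sse_linear, sse_categorical, n_categories, resid_df_categorical):
    """F statistic comparing a linear coding of an ordinal variable (1 df)
    against a full set of category dummies (n_categories - 1 df).
    """
    df_num = (n_categories - 1) - 1  # extra parameters spent by the dummy coding
    num = (sse_linear - sse_categorical) / df_num
    den = sse_categorical / resid_df_categorical
    return num / den
```

A small F statistic means the dummy coding buys little fit beyond the linear term, so the more parsimonious linear specification can be kept.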
We used the CAHPS macro (AHCPR 1999) to compute mean nurse, doctor, and hospital ratings for each hospital adjusted for the various sets of predictors in these final models. These are predicted mean ratings for each hospital if they all had the same case-mix. By examining the changes in the predicted values for each hospital across models, we can determine how much each model adjusts for the relevant inter-hospital differences in case-mix.
To evaluate the overall impact of case-mix adjustment on each CAHPS Hospital score we compared the unadjusted scores to scores adjusted for variables selected for two of the three ratings for each service. We used two measures of the importance of adjustments to any rating variable: the ratio of the standard deviation of adjustments to the unadjusted standard deviations of the hospital means, and Kendall's τ correlation between the adjusted and unadjusted hospital rankings of the scores. Larger standard deviation ratios reflect greater impact. Kendall's τ is directly related to the proportion of pairs of hospitals that switched ordering as a consequence of case-mix adjustment.
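Both summary measures are easy to compute. A sketch of Kendall's τ (the tau-a form, ignoring ties) and its link to the proportion of hospital pairs whose ordering switches:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over all pairs (no tie correction)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

def share_of_switched_pairs(tau):
    """Proportion of pairs whose ordering flips: (1 - tau) / 2, absent ties."""
    return (1 - tau) / 2
```

When there are no ties, (1 − τ)/2 equals exactly the share of discordant pairs, so, for example, τ = 0.82 between adjusted and unadjusted rankings corresponds to 9 percent of hospital pairs switching order.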
The standard case-mix adjustment model relies on the assumption that the adjustors do not interact with hospital. If this assumption does not hold, the choice of covariate values will affect comparisons between hospitals (e.g., the ranking of hospitals), and therefore different reports may be needed for different types of patients. We used F-tests to evaluate whether there is significant heterogeneity in the case-mix coefficients across hospitals.
For analyses of the nurse, doctor, and hospital ratings there were 16,745, 16,744, and 16,840 observations respectively with complete data for the dependent and all independent variables; 368 additional observations with missing values only for proxy help or gender were added to analyses of “doctor” when these variables were eliminated from models. Many of the missing values arose because the final two facing pages (containing 10 items) were left blank by 1,053 respondents. Because missingness of this block was not associated with the other ratings or concentrated in particular hospitals, we treat these data as missing completely at random and removed the corresponding cases from the analysis.
The distribution of ratings was concentrated at the high end of the scale, with 38 percent, 48 percent, and 36 percent of patients providing ratings of 10 for the nurse, doctor, and hospital items respectively; 60 percent of ratings on each item were 9 or higher. Consequently, the distribution of hospital mean ratings is concentrated toward the high end of the scale (65–99 percent of hospital means exceeded 8 across the nine rating-service combinations).
The main effect of service in the pooled model indicates that compared to surgical patients, medical patients gave lower overall doctor ratings but relatively similar nurse and hospital ratings, whereas obstetric patients gave more positive ratings for nurse and hospital but similar ratings for doctor. Service, GHS, MHS, age, education, being non-Hispanic black, and Spanish language are highly predictive for each of the nurse, doctor, and hospital ratings (Table 1). Male and proxy help also met the p <.005 threshold for the nurse and hospital ratings, but not for doctor. Hispanic, Asian, Native American, and proxy answer were not predictive of any rating.
In models fit separately by service (data not shown), medical patients gave lower doctor ratings than surgical patients but relatively similar nurse and hospital ratings, whereas obstetric patients gave more positive nurse and hospital ratings but similar doctor ratings. Healthier (in both general and mental health), older, less educated, non-Hispanic black, and Spanish-speaking patients tended to give higher ratings. Males gave significantly more positive nurse and hospital ratings than females, but not doctor ratings. Proxy-help respondents gave lower ratings for nurse and hospital than patient respondents; this effect was attenuated for the doctor rating.
Analysis of variance calculations for the ordinal GHS, MHS, age, and education variables indicate that with the exception of MHS for the hospital rating, the linear specification accounts for at least 88 percent of the variation explained by the categorical specification, so these variables may be adequately represented on a linear scale. The categorical specification is significantly (p <.005) better than the linear version only for age and education for some ratings, reflecting small departures from linearity.
The most pronounced interactions between the case-mix variables and service involve age for the doctor rating, non-Hispanic black for the hospital rating, and education for the nurse rating (Table 2). The regression coefficients indicate that the ratings for obstetric and medical patients increase more with age than those for surgical patients. No other interaction effects are significant at p <.005. However, the interactions of service with age for the nurse rating and with MHS and non-Hispanic black for the doctor rating are significant at the 0.05 level, suggesting that additional interactions with service may exist.
Because of the interaction of some case-mix variables with service and because some of the case-mix variables do not apply to certain services (e.g., only females receive obstetric services and some of the DRG-based groups are only relevant to particular services), subsequent analyses are stratified by service. We also assume linear specifications of the ordinal case-mix variables (including age and education).
Of the 20 MDCs examined, only five applied to more than 5 percent of the sample: circulatory disorder, digestive disorder, muscle disorder, female reproductive disorder, and respiratory disorder. The prevalence of the other conditions was so low that their impact on case-mix adjustment would be minimal, even if they were predictive of the ratings. Therefore, we tested only the above five conditions (Table 3).
For surgery patients, having a circulatory disorder was an important predictor of higher ratings for nurse and hospital, while having a female reproductive disorder was a significant positive predictor of the rating for doctor. Due to very low prevalence, none of the medical conditions were predictive of the ratings for obstetric patients. For patients attending the hospital for general medical services, muscle disorder was negatively associated with all three ratings, while circulatory and respiratory disorders had modest (positive) associations (p-values between .01 and .1).
The directions and significance levels of effects of GHS, MHS, age, education, non-Hispanic black, and gender (not shown) are similar across services, and largely in agreement with the results in Table 1. Hispanic was a strong predictor of ratings for general medical services whereas Spanish language was a strong predictor of ratings for obstetrics. Hispanic and Spanish language were never both in a model, since they largely explained the same ethnic variation.
Table 4 presents the predictive power, heterogeneity, and impact for the predictor variables that met the 0.005 threshold in the model for at least two of the specific service models. Hispanic was excluded as it did not meet this criterion.
The variables that have the greatest overall impact on one or more ratings are: age, non-Hispanic black, education, Spanish language, service, MHS, and GHS. Male and proxy help had relatively small impact. The results were consistent across ratings, although the impact factors for the doctor rating were typically larger than those for the other ratings.
The variables with the greatest predictive power do not necessarily have the greatest impact on the adjustment. For instance, GHS is highly predictive of each rating but, because its distribution is relatively homogeneous across hospitals, has a smaller impact than its predictive power alone would suggest. Conversely, the most heterogeneous variable, non-Hispanic black, has modest predictive power and is the second or third most important in terms of impact.
GHS and education were the only variables that had substantial impact for each service (Table 5). Although age had very high impact for both surgery and medical, it was not sufficiently predictive even to be considered as a case-mix adjustor for obstetrics. Non-Hispanic black had significant impact in surgery and medical, but not in obstetrics, whereas Spanish language and MHS had major impacts only on obstetrics. Circulatory disorder (surgery only) was the only MDC to substantially impact case-mix adjustment; although muscle disorder was highly predictive, the homogeneity of its distribution across hospitals meant it had modest impact. Male (only for surgery) and proxy help (only for medical) had modest impact.
We quantify the overall impact of case-mix adjustment on hospital-level ratings by the ratio of the standard deviation of the adjustment to the standard deviation of the means and by Kendall's τ. The standard deviation ratios ranged from 0.17 to 0.28, indicating that the adjustments were modest but not negligible compared to the differences among hospitals. Furthermore, ratios of the maximum adjustments to the standard deviations of the means ranged from 0.47 to 1.09, suggesting that although the effect of the adjustment was small for most hospitals, it was important for some. Kendall's τ is between 0.82 and 0.88, meaning that the percentage of hospital pairs whose ordering would be changed by case-mix adjustment is between 5.9 percent and 9 percent across all services and ratings; surgery was most affected and obstetrics least.
Although the impact of the case-mix adjustment on between-hospital comparisons is of most interest, the amount of within-hospital variation explained by the case-mix model is a useful summary of model fit. The within-hospital R2 was between 5.8 percent and 7.5 percent for the overall model and between 4.4 percent and 8.3 percent for the service-specific models; slightly more variation was explained for surgery than for obstetrics or medical.
The slopes for service and GHS in the overall adjustment model varied significantly (p <.005), as did the slope for GHS in the service-specific models. The slopes of several other case-mix variables varied significantly across hospitals in the overall model but not in the service-specific models.
Case-mix adjustment is a widely used method for making comparisons among health care providers fairer. Careful adjustment may assuage hospitals' concerns that they may be disadvantaged in comparative ratings by an unfavorable patient population, contributing to acceptance of quality measures and making them more effective drivers of quality improvement.
In this study of patients discharged from hospitals in three states, service (surgery, obstetrics, medical) had a strong relationship to the ratings, and the proportion of patients in each service varies across hospitals, substantially affecting comparisons of hospitals. Service interacts with several other case-mix variables. Notably, age has a large impact on ratings for surgery and medical patients but little impact on obstetric patients, presumably because the ages of obstetric patients are much more homogeneous.
To accommodate multiple interactions with service, we recommend fitting separate case-mix models for surgery, obstetric, and medical services. If a single report is to be made combining all services, the case-mix model should include interactions of each variable with service.
Besides service, age, non-Hispanic black, education, GHS, Spanish language, and circulatory disorder (in surgery only) appear to be the other most important case-mix adjustors. Adjustment for GHS and education affected scores in each of the three services, while age and non-Hispanic black had important impacts for surgery and medical. Circulatory disorder, Spanish language, and Hispanic affected scores for surgery, obstetrics, and medical respectively. The signs of the associations between the case-mix variables and the quality ratings cohere with previously reported results.
The limited impact of diagnostic categories is probably due to the small proportion of patients with each condition and thus the low variation between hospitals. For example, if the prevalence of a condition in a population is only 1 percent, then even if the proportion of patients with the condition varies substantially between hospitals the impact on case-mix adjustment will be minimal. The one exception is circulatory disorders, for which there is a relatively large number of cases. We do not know what characteristics of these patients or their experiences cause them to report more favorably. Because of the additional difficulty of collecting and coding diagnostic data, and because circulatory disorders only impact patients having surgery, we suggest more research in larger, more representative samples before recommending this variable or other diagnostic groups for case-mix adjustment.
The case-mix models explained a modest percentage of within-hospital variation, consistent with previous results which found that a similar set of variables only explained between 3 percent and 8 percent of the variation in ratings about hospital care (Hargraves et al. 2001). However, the overall impact of the case-mix adjustments reported here, measured by the ratio of the standard deviation of the adjustment to the unadjusted ratings and Kendall's τ, exceeded the impact of a similar set of predictors on ratings of health plans (Zaslavsky et al. 2001).
Our case-mix models assume that the case-mix coefficients do not vary across hospitals. If they do, comparative inferences such as the ranking of hospitals could depend on the “standard” patient or population used to standardize the CAHPS Hospital scores; one hospital might perform better than another for some types of patients but worse for others. We found some evidence of hospital by case-mix interactions but when we fitted separate models to each service the only significant interaction was with age. However, the variation in the slope of age is relatively minor compared to the main effect of age. A previous study of case-mix adjustment for health plans suggested that variation in case-mix coefficients had little importance for adjustment of plan means but might indicate that comparisons of plans could be substantially different depending on the characteristics of the individual patient (Zaslavsky, Zaborski, and Cleary 2000). The heterogeneity of the age effect across hospitals should be evaluated again when larger datasets involving more hospitals are available.
The CAHPS Hospital pilot survey data represented only three states, so the effects of Spanish language or race may be different in other geographic areas. For instance, the relationship between Spanish language and reported experiences might be affected by the local concentration of Spanish speakers or by the specific Hispanic ethnicity (e.g., Mexican American, Cuban American) in the area. We tested whether the case-mix variables interacted with the region where a patient lives but did not find significant results. The consequences of such interactions would depend on whether the data were to be used primarily for local or national comparisons. Finally, we did not test whether survey mode (phone versus mail) is needed for case-mix adjustment. Although there are differences by mode (Elliott et al. 2005), they are confounded with initial nonresponse, and patient experiences might be related to the mode by which patients responded rather than to whether they initially responded.
CAHPS Hospital scores will be used by patients and their providers to select hospitals, by hospitals to focus and monitor quality improvement efforts, and by policy makers to monitor and promote high-quality care. Because case-mix adjustment has the potential to prevent patient characteristics from confounding comparisons between hospitals, using adjustment models, such as the one specified here, is of crucial importance.
The following supplementary material for this article is available online:
Procedure for Categorical Predictors.
The CAHPS II project is funded by the Agency for Healthcare Research and Quality (AHRQ) and the Centers for Medicare and Medicaid Services through cooperative agreements with Harvard Medical School, RAND, and the Research Triangle Institute. User support is provided through a contract with Westat. Additional information about the study can be obtained by calling the AHRQ Clearinghouse at 800-358-9295. The authors thank project officers Chris Crofton, Chuck Darby, Beth Kosiak, and MaryBeth Farquahr for their active participation and helpful suggestions throughout the project and members of the CAHPS consortium for their role in the design and implementation of the data collection activities and helpful comments on an earlier draft of this manuscript.