To create a patient-reported, multidimensional physician/patient interpersonal processes of care (IPC) instrument appropriate for patients from diverse racial/ethnic groups that allows reliable, valid, and unbiased comparisons across these groups.
Data were collected by telephone interview. The survey was administered in English and Spanish to adult general medicine patients, stratified by race/ethnicity and language (African Americans, English-speaking Latinos, Spanish-speaking Latinos, non-Latino whites) (N = 1,664).
In this cross-sectional study, items were designed to be appropriate for diverse ethnic groups based on focus groups, our prior framework, literature, and cognitive interviews. Multitrait scaling and confirmatory factor analysis were used to examine measurement invariance; we identified scales that allowed meaningful quantitative comparisons across four race/ethnic/language groups.
The final instrument assesses several subdomains of communication, patient-centered decision making, and interpersonal style. It includes 29 items representing 12 first-order and seven second-order factors with equivalent meaning (metric invariance) across groups; 18 items (seven factors) allowed unbiased mean comparison across groups (scalar invariance). Final scales exhibited moderate to high reliability.
The IPC survey can be used to describe disparities in interpersonal care, predict patient outcomes, and examine outcomes of quality improvement efforts to reduce health care disparities.
Evidence of racial/ethnic disparities in quality of care is accumulating rapidly. Widespread disparities have been observed in many components of quality such as effectiveness of care for chronic and acute health conditions, patient safety, timeliness, communication, respectfulness, and discrimination (Cooper-Patrick et al. 1999; Collins et al. 2002; Agency for Healthcare Research and Quality 2003, 2004; Smedley, Stith, and Nelson 2003; Weech-Maldonado et al. 2003; Johnson et al. 2004). Because many of these quality of care indicators have been linked to poorer health, they may partially account for race/ethnic disparities in health status (Smedley, Stith, and Nelson 2003). Within Donabedian's structure-process-outcomes quality of care paradigm (Donabedian 1968), most quality indicators in these studies pertain to technical processes of care; only a small proportion address interpersonal aspects of care. Interpersonal care may be a critical pathway to optimal health outcomes for minority, lower socioeconomic status (SES), or limited English proficiency (LEP) patient subgroups considered as priority populations (Foundation for Accountability: FACCT 1997; Clancy and Chesley 2003). Interpersonal quality of care may be as important as technical quality in determining health outcomes (Fung et al. 2005). Thus, research on the role of interpersonal aspects of care is needed, especially among diverse ethnic groups.
Such research has been hampered in several ways by a lack of adequate measures that allow for valid comparisons across groups. First, research on disparities in interpersonal aspects of care has been heavily based on coding audio or videotapes of encounters (Cooper-Patrick et al. 1999; Gallagher, Hartung, and Gregory 2001; Eide et al. 2004), precluding large-scale investigations. Second, although interpersonal processes are multidimensional (Stewart et al. 1999), most patient report measures assess one or two domains such as decision making (Kaplan et al. 1995). In two widely used multidimensional quality-of-care instruments—the Consumer Assessment of Health Plans® (CAHPS) and the Primary Care Assessment Survey (PCAS) (Safran et al. 1998)—interpersonal care comprises only a small portion. The CAHPS 2.0 assesses only provider communication and staff helpfulness (Hargraves, Hays, and Cleary 2003) and the PCAS assesses communication and interpersonal treatment (Safran et al. 1998). Third, most patient-reported research on disparities in interpersonal processes is based on single items (Collins et al. 2002; Agency for Healthcare Research and Quality 2003, 2004); although practical, single items are limited in scope and reliability, have questionable validity, and may result in biased estimates of group differences. Fourth, concepts and measures of interpersonal processes must reflect adequately the concerns of minority, lower SES, or LEP subgroups. Most existing measures were not designed with these groups in mind, and may miss relevant dimensions. Last, measures should have equivalent psychometric properties—factorial invariance—across groups, which allow for meaningful quantitative group comparisons (Meredith 1993; Gregorich 2006).
This article presents a multidimensional, patient-reported interpersonal processes of care (IPC) instrument, designed to be appropriate for four diverse groups (African Americans, English- and Spanish-speaking Latinos, and non-Latino whites).
Items were developed based on: (1) 19 focus groups stratified by race/ethnicity (African American, Latino, non-Latino white) and language (Spanish, English) (Nápoles-Springer et al. 2005); (2) our original conceptual framework and items (Stewart et al. 1999), (3) literature on quality of care and physician–patient communication, and (4) cognitive interviews with adults representing the same four ethnic/language groups (Nápoles-Springer, Santoyo, O'Brien, and Stewart 2006). Items were developed simultaneously in Spanish and English, aiming for semantic equivalence (Marín and Marín 1991). The item development process yielded 85 items. The measurement model included three broad domains (communication, decision making, and interpersonal style); each had several subdomains (Table 1). Respondents reported on the care they had received from their doctors over the past 12 months. For each item, they were asked how often that type of care had been provided using a five-point scale (1, never; 2, rarely; 3, sometimes; 4, usually; 5, always).
Adult patients with at least one visit in the prior 12 months were sampled from a patient database of adult general medicine practices at an academic health center. We recruited approximately 400 patients within each of four groups: African Americans, English-speaking Latinos, Spanish-speaking Latinos, and non-Latino whites. Recruitment and sampling procedures are described elsewhere (Nápoles-Springer, Santoyo, and Stewart 2005). Telephone interviews (conducted October 2001 through January 2002) lasted about 30 minutes. All procedures were approved by the academic health center's Institutional Review Board.
Items within subdomains were hypothesized to be unidimensional. Associations between items and subdomains represented the first-order structure of the measurement model. Subdomains, in turn, were hypothesized to be unidimensional indicators of the associated three domains, representing the second-order structure. In common factor analysis parlance, subdomains and domains represented first- and second-order common factors; common factors are latent constructs hypothesized to be indirectly observed via responses to items.
The measurement model was assessed in four stages. Initially, within each group, a confirmatory factor analysis (CFA) tested the hypothesized first-order common factor model, with 85 observed items, and 15 common factors (subdomains) in Table 1. Common factors were identified by their respective items (no cross-loadings), common factor variances and covariances were freely estimated, and all item residuals were constrained to be uncorrelated. To identify the model, the factor loading of a single item for each common factor was fixed to unity. These models fit poorly within each group, and estimated interfactor correlations were very high; we therefore rejected the hypothesized measurement model and searched for a more appropriate empirical model.
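The stage 1 specification can be checked by counting parameters. The sketch below is illustrative only: the function name is ours, and it assumes the model is fit to the covariance matrix alone (no mean structure). Under those assumptions, 85 items and 15 factors give 3,380 degrees of freedom, which agrees with the degrees of freedom reported with the stage 1 χ2 statistics.

```python
def cfa_degrees_of_freedom(n_items: int, n_factors: int) -> int:
    """Degrees of freedom for a simple-structure first-order CFA.

    Assumptions (taken from the stage 1 description): no cross-loadings,
    one loading per factor fixed to 1, free factor variances and
    covariances, uncorrelated item residuals, covariance structure only.
    """
    # Distinct variances and covariances among the observed items
    moments = n_items * (n_items + 1) // 2
    # One loading per factor is fixed for identification
    free_loadings = n_items - n_factors
    factor_variances = n_factors
    factor_covariances = n_factors * (n_factors - 1) // 2
    residual_variances = n_items
    free_parameters = (free_loadings + factor_variances
                       + factor_covariances + residual_variances)
    return moments - free_parameters


print(cfa_degrees_of_freedom(85, 15))  # 3380
```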
In the second stage, we used multitrait scaling analysis to assess whether each item in a subdomain was linearly related (r≥0.30, corrected for overlap) to the total score for that subdomain (item convergence), and whether each item correlated at least two standard errors higher with its hypothesized subdomain than with other subdomains (item discrimination) (Hays and Hayashi 1990; Stewart, Hays, and Ware 1992). We analyzed the hypothesized scales within domains—communication, decision making, and interpersonal style—separately for each group. We first eliminated items not meeting the item convergence criterion in all four race/ethnic groups, followed by elimination of nonconvergent items in at least three groups. We then eliminated items not meeting the item discrimination criterion using the same approach. This left 56 items.
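The two multitrait scaling criteria can be sketched as follows. This is a minimal illustration, not the program used in the study: the function names and data layout are hypothetical, scales are scored as simple item sums, and the standard error of a correlation is approximated as 1/sqrt(n), which is an assumption on our part.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scale_total(items, skip=None):
    """Sum item columns per respondent, optionally leaving one item out."""
    cols = [col for name, col in items.items() if name != skip]
    return [sum(vals) for vals in zip(*cols)]


def multitrait_check(item, own_scale, other_scales, min_r=0.30):
    """Return (r_own, converges, discriminates) for one item.

    own_scale and each entry of other_scales map item names to lists of
    complete responses; own_scale must contain at least two items.
    """
    scores = own_scale[item]
    n = len(scores)
    # Convergence: correlation with own scale, corrected for overlap
    r_own = pearson(scores, scale_total(own_scale, skip=item))
    # ASSUMPTION: standard error of a correlation approximated by 1/sqrt(n)
    se = 1 / sqrt(n)
    r_others = [pearson(scores, scale_total(s)) for s in other_scales]
    converges = r_own >= min_r
    discriminates = all(r_own - r >= 2 * se for r in r_others)
    return r_own, converges, discriminates
```

In the analysis described above, such a check would be run separately within each of the four groups, and an item would be dropped according to the number of groups in which it failed.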
In the third stage, data from these 56 items were modeled with multiple-group CFA. These analyses assessed the unidimensionality of items within each revised subdomain across the four groups. Frequently, item sets hypothesized to represent a single subdomain were found to represent multiple highly correlated subdomains, suggesting a higher-order factor structure. In addition, items were dropped via a process of backward elimination, either because they did not have salient loadings (<0.40) on the hypothesized factor, or loaded on more than one factor. The modified measurement models from each domain were then combined to form a single, empirical measurement model: a second-order factor model with 29 items, 12 first-order common factors, and seven second-order common factors (Figure 1). Detailed results from analysis stages 2 and 3 are not reported.
In the fourth analysis stage, a series of nested multiple-group factor models, based on the empirical model derived from the third stage, were fit to test the invariance of corresponding model parameters across the four groups (Meredith 1993). First- and second-order factors were identified by their respective items and first-order factors. For each common factor, the loading of a single item or first-order factor, as appropriate, was fixed at unity and the corresponding intercept was fixed at 0 to identify the model. All residual variances were constrained to be uncorrelated. Four basic hypotheses were tested via multi-group models: (1) invariance of the item/factor configuration, (2) invariance of first- and second-order factor loading estimates, commonly known as factor pattern or metric invariance (evidence of equivalent factor meaning across groups), (3) invariance of estimated item and first-order factor intercepts, known as strong factorial or scalar invariance (evidence that comparisons of observed means across groups are unbiased), and (4) invariance of estimated item and first-order factor residual variances, known as strict factorial invariance. These nested models were tested sequentially, first for the first-order and then for the second-order factor structure.
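The four nested hypotheses can be written compactly in common factor notation. The symbols below are our own shorthand and do not appear in the original analysis: for group g, x is the vector of item responses, ξ the first-order factors, and η the second-order factors.

```latex
% Notation assumed for illustration; g indexes the four groups.
\begin{align*}
  x^{(g)}   &= \tau^{(g)} + \Lambda^{(g)} \xi^{(g)} + \delta^{(g)},
  \qquad \operatorname{Cov}\bigl(\delta^{(g)}\bigr) = \Theta^{(g)} \ \text{(diagonal)} \\
  \xi^{(g)} &= \alpha^{(g)} + \Gamma^{(g)} \eta^{(g)} + \zeta^{(g)}
  \qquad \text{(second-order structure)}
\end{align*}
% Nested hypotheses, shown for the first-order level:
\begin{align*}
  \text{(1) configural:} &\quad \Lambda^{(g)} \text{ has the same pattern of fixed and free loadings in every group} \\
  \text{(2) metric:}     &\quad \Lambda^{(g)} = \Lambda \\
  \text{(3) scalar:}     &\quad \Lambda^{(g)} = \Lambda,\ \tau^{(g)} = \tau \\
  \text{(4) strict:}     &\quad \Lambda^{(g)} = \Lambda,\ \tau^{(g)} = \tau,\ \Theta^{(g)} = \Theta
\end{align*}
% Consequence of (3) for two groups g and g':
\begin{equation*}
  \operatorname{E}\bigl[x^{(g)}\bigr] - \operatorname{E}\bigl[x^{(g')}\bigr]
  = \Lambda \Bigl( \operatorname{E}\bigl[\xi^{(g)}\bigr] - \operatorname{E}\bigl[\xi^{(g')}\bigr] \Bigr)
\end{equation*}
```

Under scalar invariance the intercepts cancel, so differences in observed item means reflect only differences in factor means. The same sequence applies at the second-order level, with Γ and α taking the roles of Λ and τ.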
In analysis stages 1, 3, and 4, models were fit to the data with LISREL 8.54 using maximum likelihood estimation (Jöreskog and Sörbom 1998). Goodness of fit was assessed by examining model χ2 and degrees of freedom, the root mean square error of approximation (RMSEA, Steiger 1990), and the comparative fit index (CFI, Bentler 1990). Comparisons between models were aided by the expected cross-validation index (ECVI, Browne and Cudeck 1993). Generally, significant χ2 tests indicate lack of “exact fit.” RMSEA values below 0.05 or 0.06 and CFI values above 0.95 suggest approximate model fit (Browne and Cudeck 1993; Hu and Bentler 1999). In a series of nested models, ECVI reaches a relative minimum value for models with higher expectation of cross validation in independent samples of the same size. Point estimates of RMSEA and ECVI were augmented with 90 percent confidence intervals (Browne and Cudeck 1993). In stage 4, empirical model modifications were guided by LISREL's modification indices. Cross-group equality constraints on parameter estimates that contributed the most to lack of fit were subsequently freed and the model reestimated. In deciding when to stop empirical model modifications, we paid particular attention to the RMSEA and ECVI, which adjust for the number of estimated parameters.
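For readers unfamiliar with these indices, the sketch below shows the textbook single-group forms of RMSEA and CFI. It is not a reproduction of LISREL's output: multiple-group programs may scale RMSEA differently, the function names are ours, and the input values (including the baseline model χ2 needed for CFI) are hypothetical.

```python
from math import sqrt


def rmsea(chi2: float, df: int, n: int) -> float:
    """Single-group RMSEA point estimate (textbook form)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model: float, df_model: int,
        chi2_baseline: float, df_baseline: int) -> float:
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_baseline - df_baseline, d_model)
    # If neither model shows misfit beyond its df, report perfect fit
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


# Hypothetical values, chosen only to illustrate the arithmetic
print(round(rmsea(150.0, 100, 401), 4))        # 0.0354
print(round(cfi(150.0, 100, 2120.0, 120), 3))  # 0.975
```

Both indices discount the model χ2 by its degrees of freedom, which is why a model can fail the test of exact fit and still show acceptable approximate fit in a large sample.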
Because of nonnormal distributions, data were pooled across groups and transformed to normal scores before analysis (Blom 1958). As analyzed in stage 4, the transformed data had median values of absolute skewness and kurtosis equal to 1.2 and 1.3, respectively. Even with ordinal variables, the observed nonnormality will not affect parameter estimates, but can result in overestimation of χ2 test statistics and underestimation of parameter standard errors (Browne 1984; Muthén and Kaplan 1992). No correction was made to the χ2 tests because they were expected to be conservatively biased. However, for each model, standard errors were estimated from 200 bootstrap samples. To deal with missing data (<5 percent of all data points), CFA models were fit to covariance matrices and mean vectors estimated by the expectation–maximization (EM) algorithm (Little and Rubin 2002). The EM covariance matrices partialed the effects of respondent age, gender, and education.
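A normal scores transformation in the style of Blom (1958) maps each rank r among n values to the standard normal quantile at (r − 3/8)/(n + 1/4). The sketch below is illustrative only: the function name is ours, tied values are given their average rank (an assumption, since the handling of ties is not described above), and missing values are assumed to have been removed beforehand.

```python
from statistics import NormalDist


def blom_normal_scores(values):
    """Return Blom normal scores for a list of complete numeric values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # ASSUMPTION: ties share the average of their 1-based ranks
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


scores = blom_normal_scores([1, 2, 3, 4, 5])
print([round(s, 2) for s in scores])
```

With five-point items many respondents share each response value, so the transformed variable still takes only a few distinct values; this is consistent with the residual skewness and kurtosis noted above.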
Once final scales were selected, we calculated scale scores by averaging nonmissing items; scores ranged from 1 to 5 and a higher score indicated higher frequency of the construct (e.g., higher discrimination scores indicate more discrimination and higher compassion scores indicate more compassion). We calculated internal-consistency reliabilities, the pooled within-groups interscale correlation matrix, and mean scale score differences across the four groups. Readability of item stems and instructions was summarized using the Flesch–Kincaid formula.
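The scoring rule and a common internal-consistency coefficient can be sketched as follows. The function names are ours, missing responses are represented as None, and Cronbach's alpha is used here as an assumed example of an internal-consistency reliability; the specific coefficient is not named above.

```python
from statistics import variance


def scale_score(responses):
    """Mean of nonmissing item responses; None if every item is missing."""
    answered = [r for r in responses if r is not None]
    return sum(answered) / len(answered) if answered else None


def cronbach_alpha(rows):
    """Cronbach's alpha from complete-case rows.

    rows holds one list of item responses per respondent, with the same
    items in the same order and no missing values.
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# A respondent who skipped the second of three items
print(scale_score([4, None, 5]))  # 4.5
```

Averaging rather than summing keeps every scale on the original 1 to 5 response metric regardless of how many items it contains or how many were answered.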
Of those contacted and eligible, 70 percent responded (N = 1,664). This represented 42 percent of the sampling frame (Nápoles-Springer, Santoyo, and Stewart 2005). Men, those aged 18–39 and 75 years and older, and non-Latino whites were slightly underrepresented. A broad age range was obtained and the majority was female (Table 2). Spanish-speaking Latinos were the oldest (mean = 62), had the lowest SES, and were most likely to report fair or poor health (56 percent). English-speaking Latinos were the youngest (mean = 43). The mean number of visits during the prior year was 7 (SD = 8.5). Most respondents reported receiving a prescribed medication (84 percent) and a medical test/procedure (89 percent) in the prior year.
The hypothesized first-order factor model did not fit well in any group: χ2(3,380, n = 421) = 6,949.77; χ2(3,380, n = 421) = 6,881.23; χ2(3,380, n = 364) = 7,298.25; and χ2(3,380, n = 415) = 7,251.31, with all p-values <.001, for the African American, Latino/English, Latino/Spanish, and white groups. The corresponding RMSEA values ranged from 0.053 to 0.059 suggesting approximate fit, whereas CFI values ranged from 0.780 to 0.839, suggesting poor fit. The hypothesized model was abandoned in favor of the empirical measurement model developed in stages 2 and 3.
Multi-sample CFA models tested the invariance of the 29-item, second-order empirical factor model across the four groups (Figure 1). Although the χ2 tests of “exact fit” were significant for each of the models, the other indices suggested approximate fit: all RMSEA values were below 0.04 and CFI values were greater than 0.96.
A series of first-order CFA factor models were fit (Models 1–3f, Table 3). Model 1 allowed all parameter estimates to be freely estimated across groups to test first-order configural invariance. The contribution of each group to the overall χ2 was nearly equal and the RMSEA and CFI values provided evidence of the same item clusterings in each group. Model 2 constrained corresponding first-order factor loadings to be equal across groups, testing metric invariance. Based upon approximate fit indices, this model fit as well as or better than the configural invariance model and suggested that the 12 first-order factors had the same meanings across all four groups. Model 3 further constrained corresponding item intercepts to be equal across groups, testing scalar invariance. This model resulted in a worsening of fit, suggesting that some intercept values were not equivalent across groups. Modification indices suggested that model 3 imposed six equality constraints on item intercepts that should be freed: three each for the Latino/Spanish (items q18, q17, and q4) and white groups (items q3, q12, and q17) (Table 4). Models 3a–3f represented partial scalar invariance models: each freely estimated one additional item intercept parameter (Byrne, Shavelson, and Muthén 1989; Steenkamp and Baumgartner 1998).
For model 3f, the values of RMSEA (0.034) and CFI (0.970) suggested good approximate fit. Further, the point estimate of ECVI approached that for model 2, which was arguably the best-fitting model considered. Model 3f suggested that unbiased group comparisons of means were possible if analyses were restricted to those items demonstrating invariant factor loadings and item intercepts (Steenkamp and Baumgartner 1998).
On the left-hand side of Figure 1, items with invariant intercept estimates have arrows with solid lines. A further model (not shown) tested the invariance of item residual variances across groups—a test of partial strict factorial invariance. Although indices suggested approximate model fit, their point values were well outside of the 90 percent confidence intervals of the previous models (RMSEA = 0.048; ECVI = 2.23); thus, the model was rejected. Further empirical modifications were considered to identify a well-fitting partial strict first-order factorial invariance model, but after 10 such modifications (freeing 10 cross-group equality constraints on residual variance estimates) the resulting RMSEA and ECVI values still suggested worse fit than model 3f (i.e., RMSEA = 0.038; ECVI = 1.94). Therefore, attempts to specify a model with invariant item residual variances were abandoned.
Next, a series of second-order CFA models were fit. With model 3f defining the first-order structure of the measurement model, configural, metric, and scalar invariance of the second-order factor structure were investigated. Model 4 assessed second-order configural invariance. Associated approximate fit indices suggested that each group had the same clusterings of first- and second-order factors. Model 5 tested the invariance of corresponding second-order factor loadings across groups (i.e., second-order metric invariance). Relative to model 4, the fit of this model was poor and model modifications were considered. Based upon modification indices, a set of models relaxed some cross-group equality constraints to test partial second-order metric invariance. The last of these models, 5d, freed four second-order factor loading estimates: the loadings for discriminated due to race/ethnicity in the white and Latino/Spanish groups (models 5a and 5d); assumed SES in the Latino/English group (5b); and hurried/distracted in the Latino/Spanish group (5c; Table 3). The approximate fit of model 5d was similar to the well fitting, but less parsimonious model 4. Model 6 constrained all corresponding first-order factor intercepts to be equal across groups, except for those associated with the four second-order factor loadings freely estimated in model 5d. This model assessed partial second-order scalar invariance. The associated RMSEA, ECVI, and CFI values suggested a worsening of fit relative to model 5d. Again, modification indices guided model modifications. The final model, 6c, freely estimated three additional first-order factor intercepts, for explained medications, hurried/distracted, and asked patient for whites. The fit of this most parsimonious model suggested a reasonable approximation and compared well with other models.
Common metric standardized factor loadings and intercept estimates from model 6c are summarized in Table 4. Cross-loadings of explained medications on patient-centered decision making, and respectful on discriminated equaled 0.38 and −0.47, respectively (not tabled).
Eighteen items representing seven constructs met the scalar invariance criterion, either at the first- or second-order level (underlined items in Table 4). The fit of model 6c suggested that group mean comparisons based upon these 18 items would be approximately unbiased. Final items for the full 29-item survey and the 18-item “short form” are presented in Table 5.
For six of the seven short-form scales, internal-consistency reliability was above 0.70 in the total sample (range 0.65–0.90); the reliability for lack of clarity was 0.65 (Table 6). Within all four groups, reliabilities were also generally high (range 0.61–0.91); of 28 coefficients, 24 were above 0.70. The Flesch–Kincaid grade level of the item stems and instructions for the 29-item survey was 8.6; for the 18-item short form it was 5.8.
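The Flesch–Kincaid grade level depends only on average sentence length and average syllables per word. The sketch below applies the standard formula to counts supplied by the caller; the function name and the example counts are hypothetical, and counting syllables is left out because it requires a dictionary or a heuristic.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw text counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: 100 words in 10 sentences with 150 syllables
print(round(flesch_kincaid_grade(100, 10, 150), 2))  # 6.01
```

Shorter sentences and shorter words both lower the grade level.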
We observed significant mean group differences on all seven short-form scales, although no group consistently had the lowest or highest scores. Overall, the worst scores were observed for decided together with a total sample mean of 3.13; Spanish-speaking Latinos had the lowest scores (2.84) and whites the highest (3.31; Table 6). Mean scores on discriminated due to race/ethnicity were near 1.0 (less discrimination) for all groups; African Americans experienced the most discrimination followed by English-speaking Latinos. Latinos (Spanish- and English-speaking) scored the lowest on all three communication scales.
Research on race/ethnic disparities in interpersonal aspects of care has been limited by a lack of measures that reflect the multidimensional nature of these processes and allow valid, unbiased comparisons across diverse groups. This study helps fill this gap by conceptualizing and operationalizing interpersonal processes as multidimensional. We provide a patient-reported survey developed through a sequence of qualitative and quantitative studies, with Spanish and English versions developed in parallel, with evidence of reliability and validity, and demonstrating scalar invariance of a subset of items in each domain. The final empirically based framework shared many features of the hypothesized model, which in turn was based on previous work done in this area. However, it had fewer subdomains, in part because hypothesized constructs were interrelated in more complex ways than originally thought.
The IPC Survey should facilitate research to explore how specific aspects of interpersonal care affect various health outcomes and whether interpersonal care explains disparities in such outcomes. The 29-item survey performed well within each group and can be used for within-group studies. The 18-item short form can be used to make unbiased mean comparisons across the four groups represented in this sample. Analyses of determinants and outcomes of IPC using these measures are forthcoming.
One notable finding was the relatively high scores overall. Relatively good processes could be the true state in these practices located in a major medical teaching university serving a highly diverse population. Also, many patients had attended these clinics for over a decade and may have found providers with whom they were comfortable. Despite the relatively high scores, there is room for improvement, particularly with respect to patient-centered decision making.
Although there were significant group differences in interpersonal processes, these differences did not consistently favor any group. Our finding that African Americans obtained the best scores on two communication scales is consistent with results on provider communication from a CAHPS Medicaid Managed Care survey (Weech-Maldonado et al. 2003), and our finding that whites had the highest scores on decided together is consistent with one other study (Cooper-Patrick et al. 1999).
We envision four broad applications of the IPC Survey. First, the short form facilitates comparative population- or clinic-based studies of disparities in interpersonal processes (e.g., by race/ethnicity). A second application is to determine if the measures predict technical processes (e.g., procedures or tests) or patient outcomes (e.g., patient adherence or satisfaction) (Stewart 1995; Blanchard and Lurie 2004; Fung et al. 2005).
A third application is to use the IPC Survey as an outcome of quality improvement policies (e.g., provider training). There is evidence that race/ethnic concordance of physicians and patients is associated with better communication (Saha et al. 1999) and more participatory decision making (Cooper-Patrick et al. 1999). If such findings are replicated with these measures, systems of care might be more likely to diversify their health care professional staff and develop targeted interventions to improve specific aspects of interpersonal processes (Cegala, Post, and McClure 2001). Finally, the IPC Survey could be useful to administrators as measures of outcomes of continuous quality improvement (e.g., to monitor disparities or provide feedback to physicians on interpersonal care).
We recommend continued validation research on the IPC Survey across a range of groups and settings. We empirically eliminated several conceptually relevant subdomains such as empowerment and cultural sensitivity that warrant continued measurement efforts. Because power differentials between patients and physicians place vulnerable patients at a unique disadvantage, empowerment may be an outcome of quality care. Cultural sensitivity was found to be multidimensional in our qualitative analyses (Nápoles-Springer et al. 2005), possibly explaining why our efforts to derive a unidimensional scale were unsuccessful. Cultural sensitivity may be difficult to measure because it is manifested through a broad spectrum of behaviors and attitudes (e.g., respect, compassion) toward nonwhite patients (Clancy and Stryer 2001).
Our results should be interpreted in light of several limitations. The study was conducted within a single university-based system of care in a geographic area known for diversity. We did not include Asian Americans or other ethnic subgroups. The measurement models were tested and modified using a single data set, thus results are provisional, conditional on future replication in independent samples. To achieve invariance across four groups, we eliminated items that worked well within some of the groups (e.g., were culture-specific). Future studies might supplement invariant scales with ethnic-specific scales when evidence suggests that a construct may help explain disparities in that group. Because we used telephone administration to accommodate persons with limited literacy or English proficiency, we do not know how well self-administration would work.
Numerous reports and policy statements call for quality measures that are relevant, valid, and unbiased across ethnic and linguistic groups to assess possible quality-of-care disparities (Bethell et al. 2003; Fortier and Bishop 2003; Beach et al. 2004). Although we demonstrated the methodological complexities associated with doing so, the IPC Survey should help to fill this gap, and may prove useful in assessing quality-of-care disparities in other settings and ethnic groups.
This study was supported by grant R01 HS10599 from the Agency for Healthcare Research and Quality; it was also supported in part by grant P30-AG15272 from the National Institute on Aging, the National Institute of Nursing Research, and The National Center on Minority Health and Health Disparities, National Institutes of Health. We appreciate the care with which the Public Research Institute designed and conducted the computer-assisted telephone interview survey. We are indebted to Dr. A. Eugene Washington who provided the initial spark and unfailing enthusiasm. We thank Dr. Eliseo Pérez-Stable for his support and guidance along the way.