To assess the reliability of survey measures of organizational characteristics based on reports of single and multiple informants.
Survey of 330 informants in 91 medical clinics providing care to HIV-infected persons under Title III of the Ryan White CARE Act.
Surveys of clinicians and medical directors measured the implementation of quality improvement initiatives, priorities assigned to aspects of HIV care, barriers to providing high-quality HIV care, and quality improvement activities. Reliability of measures was assessed using generalizability coefficients. Components of variance and clinician–director differences were estimated using hierarchical regression models with survey items and informants nested within organizations.
There is substantial item- and informant-related variability in clinic assessments that results in modest or low clinic-level reliability for many measures. Directors occasionally gave more optimistic assessments of clinics than did clinicians.
For most measures studied, obtaining adequate reliability requires multiple informants. Using multiple-item scales or multiple informants can improve the psychometric performance of measures of organizational characteristics. Studies of such characteristics should report the organizational level reliability of the measures used.
Rising concern about the quality of medical care and preventable medical errors has increased interest in how systems of care operate. Health care organizations can shape the quality of care through the selection of clinical staff or educational programs for patients. Influencing clinician behavior, however, is arguably the most important way in which organizations affect care (Flood 1994; Landon, Wilson, and Cleary 1998). Organizations can influence clinicians using financial incentives, management strategies (e.g., utilization review, guidelines, profiling), structural arrangements (e.g., presence of particular facilities or domains of expertise, governance structures), and normative practice styles or organizational cultures.
Studies of organizational influences on the quality of care require measures of organizational characteristics that are rarely, if ever, recorded in a standardized way. Organizational data are commonly collected by surveying informants about their organizations. Surveys often ask for factual data such as the number of FTE medical staff or whether professionals with particular specialties are on site. They can also ask about subjective phenomena, such as an organization's culture or mission. Recent examples include Kralewski et al. (2000), who gathered data on revenue sources and methods of physician compensation from clinic medical directors or administrators, and Meterko, Mohr, and Young (2004), who measured hospital culture by surveying hospital employees.
Lazarsfeld and Menzel (1980) distinguish “global” and “analytical” organizational survey measures. Global measures refer to organization-level properties such as size or centralization of decision making. “Analytical” measures are organization-level averages of respondent-level data, such as the proportion of clinicians who are board certified in infectious diseases.
High reliability is necessary but not sufficient for the validity of measurement (Bohrnstedt 1983). Imprecise measurement (low reliability) will sometimes lead investigators to incorrect conclusions about relationships between an organizational factor and outcome measures of interest. Nonetheless, few organizational studies examine the reliability of informant reports.
If informant reliability is low, relying on a single informant per organization may be unwise. Just as using multiple-item scales can improve respondent-level survey measures, combining reports from multiple informants may raise reliability for organizational measurements. Assessing measure reliability can offer guidance about the number of informants needed to adequately measure different organizational properties.
When organizations are the objects of measurement, studies usually can select among several possible informants, so researchers must decide which informants to approach. Standard advice is to seek out informants who are knowledgeable, motivated, and unbiased (Huber and Power 1985). Managerial or administrative informants are often chosen on the assumption that they have good access to information. Such informants, however, also may tend to present the organization positively (Seidler 1974). Studies rarely examine differences in descriptions of an organization between types of informants (e.g., medical directors and physicians).
This article addresses issues of measure reliability and differences across informant types using data from a national study of medical clinics, the Evaluation of Quality Improvement for HIV (EQHIV) study. That study gathered data about clinic characteristics from the clinic director and several clinicians in each practice studied. It asked about implementation and assessment of improvement initiatives, HIV care priorities, and barriers to improvement. We examine the reliability of single-informant organizational measures based on individual survey items as well as multiple-item scales, and how reliability can be improved by using multiple informants. We also calculate the number of informants required to obtain reliable organization-level measures, and assess clinician–director differences in descriptions of a clinic.
Several health care studies have used surveys or interviews with informants to measure organizational characteristics. Studies relying on data from a single informant per organization have examined effects of group practice and payment methods on costs of care (Kralewski et al. 2000) and effects of care management processes on the quality of care (Casalino et al. 2003). Other studies used multiple informants in assessing organizational characteristics and performance in intensive care units (Shortell et al. 1991), long-term care teams (Temkin-Greener et al. 2004), and hospitals (Shortell et al. 1995; Aiken and Sloane 1997; Aiken and Patrician 2000; Meterko et al. 2004). Studies have used both single-item organizational measures (e.g., Aiken and Sloane 1997) and multiple-item scales (e.g., Shortell et al. 1991).
Multiple-informant studies often present one-way analyses of variance (ANOVA) of informant reports classified by organization to support combining informant assessments into organization-level measures. A statistically significant F ratio in such an analysis indicates a nonrandom resemblance in reports by informants within a given organization, but does not directly measure the extent of resemblance. The F ratio is sometimes supplemented by the correlation ratio η², equivalent to the coefficient of determination (R²) for regressing an informant report on a set of indicator variables for organizational differences. Like R², η² can be misleadingly large when there are many indicator variables relative to the total number of reports.
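This inflation of η² can be seen in a small simulation. The sketch below (with assumed values of 91 organizations and 4 informants each, not the EQHIV design) draws pure noise, so there are no organizational differences at all, yet η² still lands near (J − 1)/(N − 1) rather than near zero:

```python
import numpy as np

def oneway_anova(groups):
    """Return the F ratio and correlation ratio (eta squared) from a
    one-way ANOVA of informant reports classified by organization."""
    all_obs = np.concatenate(groups)
    grand_mean = all_obs.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    J, N = len(groups), len(all_obs)
    F = (ss_between / (J - 1)) / (ss_within / (N - J))
    eta_sq = ss_between / (ss_between + ss_within)
    return F, eta_sq

# Pure noise: informants' reports share nothing within organizations,
# yet eta squared still hovers near (J - 1)/(N - 1), not near zero.
rng = np.random.default_rng(0)
groups = [rng.normal(0.0, 1.0, size=4) for _ in range(91)]
F, eta_sq = oneway_anova(groups)
print(F, eta_sq, (91 - 1) / (91 * 4 - 1))
```

With 91 groups of 4, the baseline (J − 1)/(N − 1) is about 0.25, so even a completely uninformative measure produces a seemingly respectable η².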
Bohrnstedt (1983) generically defines the reliability of a measure as the ratio of true-score variance to total variance, or alternately as the complement of the ratio of error to total variance:

ρ = σ²_T/σ²_X = σ²_T/(σ²_T + σ²_E) = 1 − σ²_E/σ²_X    (1)
The last expression in (1) shows that reliability is low when error variance is large relative to total variance. The next-to-last expression shows that reliability is also low if variation in a phenomenon is limited within a given study population.
When several informant reports are available, it is common to use their average

r̄_j = (1/n_j) Σ_{h=1}^{n_j} r_jh    (2)

as an organization-level measure. In (2), r_jh is the report of informant h about organization j and n_j is the number of informants for organization j; J is the number of organizations and N = Σ_j n_j is the total number of informants. If n_j = 1, (2) is the report r_jh of a single informant. The measurement r_jh may be a scale averaging K items x_kjh; if K = 1, r_jh is a single item.
When r_jh is a scale score, two potential sources of error variation in (2) are distinguishable: measurement error in the item responses composing r_jh, and differences among informants. Since the object of measurement is the organization, (2) is reliable when organizational variability is high relative to these sources of error. Likewise, the informant-level measure r_jh is affected by organizational and informant differences as well as errors of measurement. Assuming that these sources are independent, the variance of r_jh is

var(r_jh) = σ²_o + σ²_i + σ²_e/K    (3)

where σ²_o, σ²_i, and σ²_e refer, respectively, to organizational, informants-within-organizations, and error components of variance. The variance of the organizational measure r̄_j is then

var(r̄_j) = σ²_o + σ²_i/n_j + σ²_e/(n_j K)    (4)

The latter two components of (4) reflect error in r̄_j, while σ²_o is reliable organizational variance. Expressing σ²_o as a fraction of var(r̄_j) yields a generalizability coefficient (O'Brien 1990; Shavelson and Webb 1991) measuring the reliability of r̄_j:

ρ_r(n_j) = σ²_o / [σ²_o + σ²_i/n_j + σ²_e/(n_j K)]    (5)
Measure (5) gives the fraction of variance in the organizational measure attributable to systematic organizational differences rather than informant variations or measurement error.
If r_jh is a single item, informant and error variance are indistinguishable; σ²_i and σ²_e then combine into a single “error” variance component σ²_δ, and the reliability of r̄_j becomes

ρ_r(n_j) = σ²_o / (σ²_o + σ²_δ/n_j)    (6)
If, moreover, organizations are measured using a single informant (n_j = 1), (6) simplifies further to

ρ_x = σ²_o / (σ²_o + σ²_δ)    (7)
a quantity known as the intraclass correlation (see, e.g., Scheffé 1959, p. 223).
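The reliability formulas (5)–(7) amount to a few lines of arithmetic. The sketch below uses illustrative variance components, not EQHIV estimates:

```python
def generalizability(var_org, var_inf, var_err, n_informants=1, k_items=1):
    """Generalizability coefficient in the spirit of (5): the share of
    variance in an organization-level mean attributable to systematic
    organizational differences. For a single item, pass the combined
    informant/error variance as var_err with var_inf = 0; then with
    n_informants = 1 this reduces to the intraclass correlation (7)."""
    error = var_inf / n_informants + var_err / (n_informants * k_items)
    return var_org / (var_org + error)

# Illustrative single item: true-score variance 0.2, combined
# informant/error variance 0.8 (unit total variance).
print(round(generalizability(0.2, 0.0, 0.8), 2))                  # 0.2, the ICC
print(round(generalizability(0.2, 0.0, 0.8, n_informants=5), 2))  # 0.56
```

Averaging five informants more than doubles the reliability of the organization-level mean because the informant/error term is divided by n_j.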
Title III of the Ryan White Comprehensive AIDS Resources Emergency (CARE) Act administered by the HIV/AIDS Bureau of the Health Resources and Services Administration (HRSA) supports comprehensive primary health care for HIV-infected persons. In 1999, HRSA required that clinical sites newly awarded funding under Title III participate in a quality improvement collaborative conducted by the Institute for Healthcare Improvement (IHI), and invited other Title III clinics to participate. The EQHIV study (Landon et al. 2004) conducted pre- and postintervention surveys of clinicians and medical directors in the participating clinics and a matched set of comparison clinics. Here we examine data from the preintervention surveys conducted between August 2000 and January 2001.
Of the 200 Title III sites in the continental United States in May 2000, we excluded 16 reporting HIV caseloads lower than 100 per year, 12 that initially enrolled in the collaborative but did not participate, and one that lost CARE Act funding shortly before the collaborative began. Of the remaining 171 sites, 62 participated in the collaborative, and 54 of those participated in the study and provided survey data. Control sites were matched with intervention sites on type (community health center, community-based organization, health department, hospital, or university medical center), location (rural, urban), number of locations delivering care, region, and number of active HIV cases. Of 40 control sites, 37 participated in the study and provided survey data. The Committee on Human Studies of Harvard Medical School approved the study protocol.
EQHIV surveyed clinic directors and clinicians to assess clinic and clinician characteristics. Surveys were mailed to the medical director and random samples of up to five clinicians who had primary responsibility for HIV patients. If a site had five or fewer clinicians, all were selected. Completed surveys were returned by 79 medical directors (87 percent response rate) and 300 clinicians (89 percent response rate). At 49 sites, the medical director was also a sampled clinician, and completed both instruments, so there were 330 distinct informants.
Survey instruments asked about clinic characteristics such as leadership commitment to quality, quality improvement initiatives, teamwork, patient care priorities, clinic priorities and limitations, and use of computers, as well as individual characteristics including formal education and training, HIV care experience, HIV knowledge, and basic demographic information. We constructed eight scales including items with common substantive content, using guidance from factor analyses. The longest scale (seven items) assessed the organization's openness to quality improvement. Others measured HIV knowledge (six items), research emphasis (three items), clinician autonomy (three items), emphasis on helping patients (three items), stress on guidelines (two items), barriers to quality improvement (five items), and a clinician's patient load (three items).
The director and clinician surveys had 15 identical items.1 As we are concerned with the reliability of measures across multiple informants within organizations, we examined the items answered by clinicians, including responses by directors to identical items.
Assessing the reliability of organization-level measures via (5), (6), or (7) requires estimates of variance components. Estimates were obtained by maximum likelihood using Stata (StataCorp 2003) and GLLAMM (Rabe-Hesketh, Pickles, and Skrondal 2001).2
For K-item scales, we estimated three-level mixed-effects regressions for items nested in informants nested in organizations, including fixed effects for differences in item means:

x_kjh = μ_K + Σ_{k=1}^{K−1} β_k z_kjh + υ_j + η_jh + ε_kjh    (8)

where μ_K is the mean for the last item in a scale, z_kjh is an indicator variable identifying observations on item k, β_k is the difference in means between item k and item K, υ_j is a random organization effect, η_jh is a random effect for informant h within organization j, and ε_kjh is a residual term for item-level error. Estimates for σ²_o, σ²_i, and σ²_e in (5) are variances of the random effects υ_j, η_jh, and ε_kjh, respectively.
For single items, we estimated random-effects regressions for informants nested in organizations:

x_kjh = μ_k + υ_j + δ_kjh    (9)

where μ_k is the mean for item k and δ_kjh is a residual combining item- and informant-level error. We calculated reliabilities in (6) and (7) using the estimated variances σ²_o and σ²_δ of the random effects υ_j and δ_kjh, respectively.
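As a sketch of how such variance components can be recovered, the simulation below uses the classical one-way ANOVA (method-of-moments) estimator rather than the maximum likelihood estimation performed here in Stata and GLLAMM; the clinic counts and variance components are assumed for illustration, and for balanced data the two estimators agree closely:

```python
import numpy as np

def variance_components(groups):
    """One-way ANOVA (method-of-moments) estimates of the organizational
    and combined informant/error variance components for a single item."""
    J = len(groups)
    sizes = np.array([len(g) for g in groups])
    N = sizes.sum()
    grand = np.concatenate(groups).mean()
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - J)
    ms_between = sum(n * (g.mean() - grand) ** 2
                     for n, g in zip(sizes, groups)) / (J - 1)
    n0 = (N - (sizes ** 2).sum() / N) / (J - 1)  # effective informants/org
    var_org = max((ms_between - ms_within) / n0, 0.0)
    return var_org, ms_within

# Simulated data: 91 clinics, 4 informants each, true sigma2_o = 0.25 and
# sigma2_delta = 1, so the true intraclass correlation (7) is 0.20.
rng = np.random.default_rng(1)
groups = [mu + rng.normal(0.0, 1.0, size=4)
          for mu in rng.normal(0.0, 0.5, size=91)]
var_org, var_delta = variance_components(groups)
print(var_org / (var_org + var_delta))  # estimate of the ICC
```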
With estimates of the variance components, we can calculate the implied number of informants n* required to measure an organizational characteristic at any criterion level of reliability ρ*. We set reliability in (5) or (6) at the conventional threshold of 0.70 (Nunnally 1978; Shortell et al. 1991) and solve for n_j. For single items, this leads to3

n* = [ρ*/(1 − ρ*)] (σ²_δ/σ²_o)    (10)
The necessary number of informants increases with informant/error variance and the criterion level of reliability, and declines with organization-level variance. For a K-item scale, similar manipulation of (5) yields

n* = [ρ*/(1 − ρ*)] (σ²_i + σ²_e/K)/σ²_o    (11)
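These required-informant calculations can be sketched directly; the input shown is the median single-item intraclass correlation of 0.18 reported below for Table 1, rescaled to unit total variance:

```python
import math

def informants_needed(var_org, var_inf, var_err, k_items=1, target=0.70):
    """Informants per organization needed for the clinic mean to reach the
    target reliability. For single items, pass the combined informant/error
    variance as var_err with var_inf = 0."""
    if var_org <= 0:
        return math.inf  # no organizational variance: no n is enough
    n = (target / (1.0 - target)) * (var_inf + var_err / k_items) / var_org
    return math.ceil(n)

# Median Table 1 intraclass correlation: sigma2_o = 0.18, sigma2_delta = 0.82
# on a unit-variance scale.
print(informants_needed(0.18, 0.0, 0.82))  # 11
```

The answer, 11 informants, matches the median n* of about 10.5 reported for the single-item global measures.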
The EQHIV study sites were representative of Title III clinics nationally (Landon et al. 2004). Differences between intervention and control sites in terms of location (rural/urban, regional), site type, and clinic status (general medicine versus specialized HIV practice) were statistically insignificant. Just over three-quarters of the informants were clinicians rather than directors or clinician–directors, 51 percent were male, and 71 percent were physicians. Clinicians and clinician–directors had a mean age of 42. The mean number of informants per clinic was 3.4 for items on the clinician survey only, and 3.6 for those on both the director and clinician surveys.
Table 1 presents estimated reliabilities for 26 single-item global measures that ask informants to report organization-level features. The first column presents the intraclass correlation ρx, interpretable here as the reliability of a single informant report. The second column presents the multiple-informant reliability evaluated at the mean number of informants per organization. The implied number of informants n* required to reach 0.70 reliability appears in column 3; columns 4–6 give the numbers of informants and clinics for each item, the correlation ratio η², and the F ratio from one-way ANOVA.
Most estimated one-informant reliabilities ρx are small; the median intraclass correlation is 0.18 for the 26 measures. An exception is the priority placed on research, with estimated reliability over 0.60. The remaining 25 estimates of ρx range between 0.04 (funding limitations as a barrier to improvement) and 0.36 (whether a computer is available for patient care).
The estimated reliabilities for clinic means are higher than the intraclass correlations for individual items because averaging across multiple informants lowers error variance. Nonetheless, with the number of informants per organization in the EQHIV study (around 3.3 for most items after deletion of informants with missing values), only the organization-level mean for research emphasis has an estimated reliability greater than 0.70. Other estimates range from 0.12 (funding limitations) to 0.66 (computer availability); the median multiple-informant reliability in Table 1 is 0.43. Other clinic-level measures that approach 0.70 reliability include the priority assigned to outreach/prevention activities, limited visit time as a barrier to improvement (0.59), and collaboration among clinical staff (0.60).
Given informant variations and item-level measurement errors, a substantial number of informants would be required to obtain reliable measures of many global organizational features. Values of n* range from 1.5 (research emphasis) to over 60 (limited funding), with a median of 10.5. While appreciably higher than the number of informants per organization in EQHIV, n* for these single-item measures is usually lower than the number of informants per organization in other multiple-informant studies in health care settings. Both the Shortell et al. (1991) and Temkin-Greener et al. (2004) studies, for instance, had over 40 informants per organization.
All but three F ratios from ANOVAs for the global items are significant at the 0.05 level, even though the corresponding single-informant reliabilities are mostly low. Thus, finding significant organizational differences does not imply high reliability. Values of the correlation ratio η² range from 0.30 (limited funding) to 0.72 (research emphasis). Because of the relatively small number of informants per organization, values of η² are high by comparison with the intraclass correlations ρx.4
Analytical organizational characteristics such as a clinic's specialty composition can be measured using means of individual characteristics reported by sampled respondents within an organization. For such measures, respondent-level variance reflects heterogeneity rather than disagreement. Such heterogeneity nonetheless reduces the reliability of an analytical measure.
Table 2 evaluates 30 one-item analytical measures. The estimated reliabilities vary widely, although F ratios indicate organizational differences on most measures (p<.05 for 24 of 30). No organizational commonalities are evident for some, including mean hours devoted to administrative work and mean frequency of discussing guidelines. Other clinic means are relatively reliable, however, even with the limited number of informants in this study. The proportion of physicians who are board certified in infectious diseases, for example, has an estimated organization-level reliability of 0.76. Clinic means on measures of patient load—outpatients per week, percentage of outpatients with HIV, number of HIV patients in a clinician's panel—have estimated reliabilities of 0.86, 0.82, and 0.78, respectively. Across the 30 measures in Table 2, the median value of n* needed to obtain clinic-level reliability of 0.70 is just over 12.
Multiple-item scales can yield more reliable organizational measures than single items, as item-level errors tend to cancel out when items are combined. Table 3 assesses the reliability of organization-level scale means in the EQHIV study. Scales include measures of both global and analytical properties.
The first column of Table 3 presents ρr, i.e., (5) evaluated assuming one informant per organization. The second column presents estimated reliabilities for scale means, i.e., (5) evaluated at the mean number of informants per organization in EQHIV. Column 3 gives the implied number of informants per organization needed to obtain a mean with 0.70 reliability, and column 4 gives the p level for testing the hypothesis of no organizational variance using a likelihood-ratio statistic.
To highlight differences between reliability assessments taking organizational and informant standpoints, column 5 presents an informant-level reliability measure

ρr(i) = (σ²_o + σ²_i) / (σ²_o + σ²_i + σ²_e/K)    (12)

Measure (12) treats both organizational and informant variance as reliable; only item-level variation is regarded as erroneous. Values of ρr(i) are comparable with those of Cronbach's α presented in column 6. Contrasting ρr(i) and α with the organization-level reliabilities in columns 1 and 2 illustrates differences in scale reliability at organizational and informant levels.
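The contrast between informant-level and organization-level reliability can be illustrated numerically; the variance components below are hypothetical, chosen to mimic a scale on which informant variance dwarfs the organizational component:

```python
def informant_level_reliability(var_org, var_inf, var_err, k_items):
    """Measure (12): organizational and informant variance both count as
    reliable; only item-level error counts against the scale."""
    return (var_org + var_inf) / (var_org + var_inf + var_err / k_items)

def organization_level_reliability(var_org, var_inf, var_err, k_items,
                                   n_informants):
    """Measure (5): only organizational variance counts as reliable."""
    noise = var_inf / n_informants + var_err / (n_informants * k_items)
    return var_org / (var_org + noise)

# Hypothetical five-item scale: small organizational variance, large
# informant variance, moderate item-level error.
v_org, v_inf, v_err = 0.1, 0.5, 0.9
print(round(informant_level_reliability(v_org, v_inf, v_err, 5), 2))        # 0.77
print(round(organization_level_reliability(v_org, v_inf, v_err, 5, 1), 2))  # 0.13
```

The same scale is internally consistent within informants (0.77) yet nearly useless as a one-informant organizational measure (0.13), the pattern described for the barriers-to-improvement scale below.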
Significant (p<.05) organization-level variance is present for all eight scales. For two, reliable organizational differences can be detected using the number of informants in EQHIV. The one-informant reliability ρr is almost 0.70 for the research emphasis scale, and clinic means on this scale have a reliability of nearly 0.90 at 3.6 informants per organization. Likewise, the multiple-informant reliability of the seven-item scale measuring openness to quality improvement is 0.67. Other scales perform less well. The results imply that reliable organizational measures could be obtained with fewer than 10 informants for most scales.
Even though these scales are relatively short, their estimated within-informant reliabilities often approach or exceed 0.7. Estimates of ρr(i) and α range from 0.50 (patient load scale) to 0.80 (openness to QI scale). A scale can be reliable at the informant level and yet be a weak organization-level measure if informant-level variance is large. For example, informants answer the five items on barriers to improvement consistently (ρr(i)=α=0.65), but appreciable informant differences produce ρr of only 0.12, and an estimated reliability for organization means (at 3.6 informants) of 0.32. Informant variations are much smaller for openness and research emphasis, so their internal consistency and organizational reliability are both high.
Multiple-item scales clearly can improve organizational measurement, but informant differences limit the improvements possible through adding scale items. Assuming one informant and arbitrarily many items, organizational reliability in (5) cannot exceed σ²_o/(σ²_o + σ²_i). For the EQHIV data, this upper bound on the organizational reliability of a scale ranges from 0.18 for the barriers to improvement scale, where the informant-level variance is over four times the organizational variance, to 1.0 for research emphasis, which had an estimated informant variance of 0. Further improvements in reliability would require multiple informants.
We compared the responses of clinicians with those of directors (including clinician–directors) on all items and scales in Tables 1, 2, and 3. Differences significant at or below the 0.10 level are displayed in Table 4. The first column gives the clinician–director difference using the units of measure in the EQHIV surveys; the second column uses standard deviation units.
Directors and clinicians assessed a few organizational characteristics differently. Significant differences, ranging between a quarter and a third of a standard deviation, were found for five of the 26 global organizational indicators from Table 1. Clinicians characterized their clinics as placing a lower priority on clinical care and a higher priority on research than did directors. Clinicians rated the education of HIV clinical staff somewhat higher than directors did, reported less decentralization, and were less likely to report a recent QI initiative.
Clinicians and directors differed on three of eight scales. Clinicians reported more emphasis on guidelines, somewhat less autonomy, and scored lower than clinician-directors on the HIV knowledge scale. There were several clinician–director differences on the individual characteristics from Table 2, most of which reflect factual rather than perceptual differences.
This study found that survey measures of organizational properties for Title III HIV clinics had low to modest reliability. Reports reflect common organizational phenomena, but vary substantially among informants within organizations. This can reflect perceptual differences, different interpretations of questions, and other measurement errors. Multiple-item scales can improve organizational measures, but scale scores also vary substantially within organizations. Our analyses suggest that obtaining reliable organizational measurements usually requires aggregation of reports across multiple informants.
The relatively low reliabilities for organizational means reflect a limited number of informants per organization, rather than especially low informant-level agreement. Informants can be familiar with the full organization in the relatively small EQHIV sites. One would expect lower concordance in studies of larger health care organizations such as hospitals.
The EQHIV intraclass correlations are high relative to those we calculated from other multiple-informant health care organization studies. Approximate intraclass correlations for constructs in a study of PACE teams (Temkin-Greener et al. 2004) range between 0.06 (conflict management) and 0.07 (perceived team effectiveness).5 Teams there were assessed, on average, by over 40 informants, so organization-level means have relatively high reliability; we calculate a range from 0.72 (conflict management) to 0.76 (effectiveness).
Reducing the informant and error components of variance in (4) can increase measure reliability. Pretesting, clarifications in item wording, and specific probes (Casalino et al. 2003) can reduce item-level error. Ensuring that the object of measurement (e.g., a clinic rather than a floor or team) is salient to informants also can reduce informant variations. Adding both scale items and informants can improve reliability. Additional items raise reliability by reducing item/error variance, while additional informants lower both informant and item/error variance. Improvements in reliability from adding informants are potentially greater than those from adding items. Recruiting new informants is, however, more expensive than lengthening a scale.
Directors occasionally gave more optimistic assessments than did clinicians. Such differences occurred only slightly more often than expected by chance, though, and were relatively small. Other informant differences also may influence assessments, however. Temkin-Greener et al. (2004) found that professionals assessed teams more positively than did paraprofessionals.
Our reliability estimates reflect variation in phenomena within the EQHIV study population as well as agreement among informants. If true variation is limited, a measure will have low reliability even with only modest informant disagreement. Agreement coefficients (James, Demaree, and Wolf 1984; LeBreton, James, and Lindell 2005) assess agreement per se by comparing observed disagreement with a conceivable level calculated using a null (e.g., uniform) distribution, rather than with observed variation within a study population. As variation in several EQHIV measures is highly restricted, agreement coefficients are much higher than reliabilities for these. For example, the priority assigned to high-quality HIV care is high and varies little across organizations; the mean priority on a 1–5 scale is 4.75, with a standard deviation of 0.54. The pooled agreement coefficient (LeBreton, James, and Lindell 2005) is 0.87 for high-quality care, while the single-informant reliability in Table 1 is only 0.11. Leadership responsiveness is another example of high agreement but low reliability. These comparisons suggest that our measures might be more reliable if assessed using more heterogeneous organizations. While agreement coefficients are generally higher than the corresponding reliabilities, agreement levels are low for many EQHIV measures; examples include limited staff as a barrier to improvement, decentralization, and presence of a recent QI initiative.
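A minimal sketch of such an agreement coefficient, assuming the single-item uniform-null version of the index (the pooled coefficient reported above will differ slightly):

```python
def rwg_uniform(observed_var, n_categories):
    """Within-group agreement index in the style of James, Demaree, and
    Wolf (1984): observed variance is compared with the variance of a
    uniform null distribution over the response categories, not with
    observed variation across the study population."""
    null_var = (n_categories ** 2 - 1) / 12.0  # variance of a discrete uniform
    return 1.0 - observed_var / null_var

# Priority placed on high-quality HIV care: 1-5 scale, SD 0.54 in EQHIV.
# Agreement is high even though limited between-clinic variation makes
# the item's single-informant reliability only 0.11.
print(round(rwg_uniform(0.54 ** 2, 5), 2))  # 0.85, near the pooled 0.87
```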
Another limitation of this study is that its findings for Title III clinics may not generalize to other health care organizations. As well, the clinician survey included many indicators prone to subjective interpretation. It is likely that informant reliability is higher for objective features such as the size of the medical staff or total clinic caseload. EQHIV assembled such information in a single-informant site survey, so we were unable to assess the reliability of such data.
Multiple-informant organizational measures are usually constructed by taking a mean across several reports. Informant variation reduces the reliability of such measures, but it also can be of substantive interest. Temkin-Greener et al. (2004), for example, use an ethnic diversity index to predict team performance. Our study did not attempt to assess the reliability of measures of organizational diversity or variation.
Surprisingly few studies of clinic or hospital characteristics report the organization-level reliability of their measures. Many that do rely on statistics, such as the F ratio or the correlation ratio, that do not adequately describe unit-level reliability. Some studies report the informant-level internal consistency of scales, but a scale can be internally consistent within informants yet be unreliable as an organization-level measure. This study found substantial item and respondent variability in clinic assessments, and modest or low clinic-level reliability for many measures. We suggest that studies of organizational characteristics should report the organization-level reliability of the measures used, if possible.
Supported by a grant (R-01 HS10227) from the Agency for Healthcare Research and Quality (AHRQ). Carol Cosenza, MSW, and Patricia Gallagher, PhD, of the Center for Survey Research assisted with instrument development and survey administration. We also thank colleagues at the Health Resources and Services Administration and at the Institute for Healthcare Improvement who participated in and facilitated the EQHIV study, and two anonymous reviewers for helpful comments on a previous version of this article.
1Informants who responded to both the director and clinician questionnaires answered the 15 overlapping items twice. Paired t-tests detected significant differences between the “director” and “clinician” responses on two items: informants gave significantly higher assessments of the priority placed on community outreach activities (p = .044) and the barriers to improvement posed by limited funding (p = .033) when responding as directors rather than clinicians. We used the “director” responses of these informants on the 15 overlapping items.
2Most indicators in the EQHIV surveys are ordered and dichotomous measures. We follow typical practice by assigning equally spaced scores to these and treating them as quantitative variables. We reached similar conclusions about reliability using logit and ordinal logit models that treat the indicators as discrete variables (Snijders and Bosker 1999).
3It is possible for n* to exceed the number of eligible respondents in some organizations, since n* rises with both error and informant variance. Large values of n* reflect low reliability.
4The expected value of the between-group sum of squares in ANOVA (the numerator of η²) depends on both the within-group variance and the between-group variance (Searle, Casella, and McCulloch 1992), so η² is positive even with no between-group variance. When σ²_o = 0, the expected value of η² is (J−1)/(N−1); this ratio is substantial, 0.274, for illustrative values of J = 91 and N = 330 from EQHIV.
5Our calculations assume that the number of informants n is the same in all organizations. F statistics then imply intraclass correlations ρx = (F − 1)/(F − 1 + n), and organization-level reliabilities (F − 1)/F. If the number of informants differs across organizations, reliabilities are higher than calculated, but only slightly so unless the variation in informants is very large.
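The footnote's conversions from an F ratio to reliabilities are simple to compute; the F value below is an assumed illustration, chosen to be consistent with the PACE figures quoted earlier (roughly 40 informants per team, ICCs near 0.06, mean reliabilities near 0.72):

```python
def icc_from_F(F, n_per_org):
    """Intraclass correlation implied by a one-way ANOVA F ratio when
    every organization has the same number of informants."""
    return (F - 1.0) / (F - 1.0 + n_per_org)

def org_reliability_from_F(F):
    """Reliability of the organization-level mean implied by F: 1 - 1/F."""
    return (F - 1.0) / F

F = 3.6  # assumed, illustrative F ratio
print(round(icc_from_F(F, 40), 2))          # 0.06
print(round(org_reliability_from_F(F), 2))  # 0.72
```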