Mini-CEX scores assess resident competence. Rater training might improve mini-CEX score interrater reliability, but evidence is lacking.
Evaluate a rater training workshop using interrater reliability and accuracy.
Randomized trial (immediate versus delayed workshop) and single-group pre/post study (randomized groups combined).
Academic medical center.
Fifty-two internal medicine clinic preceptors (31 randomized and 21 additional workshop attendees).
The workshop included rater error training, performance dimension training, behavioral observation training, and frame of reference training using lecture, video, and facilitated discussion. Delayed group received no intervention until after posttest.
Mini-CEX ratings at baseline (just before workshop for workshop group), and four weeks later using videotaped resident–patient encounters; mini-CEX ratings of live resident–patient encounters one year preceding and one year following the workshop; rater confidence using mini-CEX.
Among 31 randomized participants, interrater reliabilities in the delayed group (baseline intraclass correlation coefficient [ICC] 0.43, follow-up 0.53) and workshop group (baseline 0.40, follow-up 0.43) were not significantly different (p = 0.19). Mean ratings were similar at baseline (delayed 4.9 [95% confidence interval 4.6–5.2], workshop 4.8 [4.5–5.1]) and follow-up (delayed 5.4 [5.0–5.7], workshop 5.3 [5.0–5.6]; p = 0.88 for interaction). For the entire cohort, rater confidence (1=not confident, 6=very confident) improved from mean (SD) 3.8 (1.4) to 4.4 (1.0), p = 0.018. Interrater reliability for ratings of live encounters (entire cohort) was higher after the workshop (ICC 0.34) than before (ICC 0.18) but the standard error of measurement was similar for both periods.
Rater training did not improve interrater reliability or accuracy of mini-CEX scores.
clinicaltrials.gov identifier NCT00667940
The online version of this article (doi:10.1007/s11606-008-0842-3) contains supplementary material, which is available to authorized users.
Residency program directors must certify the competence of graduating residents. Valid assessments of resident performance are needed.1–3 However, obtaining valid assessments of clinical skills is challenging.1,2,4 Research found scores from the American Board of Internal Medicine (ABIM) “long case” clinical evaluation exercise (CEX) to be unreliable due to large inter-case and inter-rater variance.5–9 Consequently, the ABIM now endorses the mini-CEX,10,11 in which faculty observe residents during multiple brief clinical encounters and rate performance overall and in six specific domains (interviewing, physical examination, humanistic qualities/professionalism, clinical judgment, counseling, and organization/efficiency) using a nine-point scale.
Scores from one long case5,7,8 and from one mini-CEX encounter2,3,10,12,13 have similar reliability. However, the mini-CEX's brevity facilitates multiple observations, which collectively demonstrate acceptable reliability2,10 (i.e., consistency of scores across measurement replications).14 Improving interrater reliability could reduce the number of resident–patient encounters required for reliable scores and improve the validity of mini-CEX score interpretations.
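The link between single-encounter reliability and the number of encounters needed can be illustrated with the Spearman-Brown prophecy formula. This formula is not named in the article; the sketch below assumes that encounters behave as parallel measurements, and the function names and the input values used in any call are illustrative only.

```python
import math


def composite_reliability(single_reliability: float, n_encounters: int) -> float:
    """Spearman-Brown: reliability of the mean of n parallel measurements."""
    r = single_reliability
    return n_encounters * r / (1 + (n_encounters - 1) * r)


def encounters_needed(single_reliability: float, target: float) -> int:
    """Smallest number of encounters whose mean score reaches the target reliability.

    Assumes 0 < single_reliability < 1 and 0 < target < 1.
    """
    r = single_reliability
    n = target * (1 - r) / (r * (1 - target))
    # Subtract a tiny tolerance so floating-point noise does not round 6.0 up to 7.
    return max(1, math.ceil(n - 1e-9))
```

Under these assumptions, raising the single-encounter reliability lowers the value returned by `encounters_needed`, which is the sense in which better interrater reliability would reduce the number of observations required.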
Although faculty training could improve rater performance, few studies have examined faculty development in clinical performance assessment.15 One study found that neither 30-minute nor two-hour training sessions improved score reliability, but this study was limited by sample size.16 Another study found that a 15-minute instructional videotape did not affect rating accuracy.9 More recently, a study using the mini-CEX found that following a one-day “direct observation of competence training” course participants’ mini-CEX ratings were lower, but not necessarily more accurate, than nonparticipants’ ratings.17 However, research suggests that rater training can improve score reliability in clinical assessments.18–20 Moreover, a meta-analysis of non-medical education studies suggested that rater training workshops can improve score accuracy.21 Given the potential benefits of rater training, further investigation in medical education appears warranted.
We sought to evaluate the effect of a rater training workshop, in comparison to no intervention, on the interrater reliability and accuracy of mini-CEX scores from resident continuity clinic preceptors. We hypothesized that the workshop would improve both interrater reliability and accuracy.
We conducted a randomized trial comparing a rater training workshop (held in May 2006) to no intervention. Rater reliability and accuracy were measured at baseline and four weeks later using videotaped resident–patient encounters (Fig. 1). A single-group pre/post comparison using ratings of live resident–patient encounters was also performed. Our Institutional Review Board exempted this study.
All internal medicine residency continuity clinic preceptors were invited to participate in the workshop and in the study. Study participation was voluntary, and all participants consented. A subset agreed to participate in a randomized trial. Author DAC used MINIM (version 1.5, London Hospital Medical College, London) to randomly assign these participants to immediate workshop (intervention) or delayed workshop (control), with stratification by continuity clinic firm.
The half-day workshop used rater training methods similar to those used by Holmboe et al.1 and shown to be effective in other settings.15,16,21 One of the workshop facilitators (author DMD) had attended the rater training course described by Holmboe1 and was thus familiar with these methods. Our workshop focused on the need for accurate and reliable resident assessment; potential errors and biases in rating systems (rater error training); facilitated discussion of the domains of the mini-CEX (performance dimension training); methods to improve observation (behavioral observation training); and facilitated discussion in large and small groups of videotaped interactions between residents and standardized patients (frame of reference training). Since one meta-analysis found that frame-of-reference training improves rater accuracy more than other training methods,21 we used this method for more than half the workshop.
Delayed workshop participants received no specific intervention until after rating the second set of videotaped encounters (posttest). They then attended the workshop.
The primary outcome was interrater reliability for mini-CEX scores of videotaped resident–patient encounters. Secondary outcomes included score accuracy, mean scores and interrater reliability for mini-CEX scores of real resident–patient encounters, halo effect, and preceptor perceptions of the mini-CEX.
Participants in the immediate workshop group rated 16 videotaped encounters between residents and standardized patients just before the workshop (pretest) and a second set of 16 encounters four weeks later (posttest), while those in the delayed workshop group rated the first set of cases four weeks prior to the workshop (pretest) and rated the second set immediately before the workshop (posttest). Raters were instructed to rate whatever competence domain(s) they could appropriately assess and provide an overall rating for each case.
Nine cases came from the Yale Rater Training Program and nine cases from the ABIM-NBME Faculty Development Program. These cases, developed for and used in previous studies of the mini-CEX,1,17 were scripted to portray unsatisfactory, satisfactory, and superior performance for interviewing, examination, and counseling skills. To achieve the number of cases required by our power analysis, we developed an additional fourteen cases using standardized patients to portray seven patient scenarios. We coached internal medicine residents to demonstrate varying levels of competence, but performance was not scripted. These 32 cases were divided into two case sets (pretest and posttest), each consisting of nine scripted cases (one at each level of performance for each skill domain) and seven local cases (one from each patient scenario). Case sequence within each set was randomized. All participants completed the same pretest and the same posttest.
We also obtained ratings from all resident–patient encounters observed in residents’ outpatient clinic by workshop participants. Faculty rated these encounters using a modified mini-CEX with a five-point scale where 1=“needs improvement,” 2–4=average, and 5=“top 10%.”22 Observations for the 12 months preceding and following the workshop were compared.
Prior to the pretest or starting the workshop, each workshop participant completed a questionnaire ascertaining demographic information. Confidence in the use of, and perceived accuracy and usefulness of, the mini-CEX were assessed before and after the course using a six-point Likert scale.
Interrater reliability for scores from videotaped encounters was measured with intraclass correlation coefficients (ICC), calculated using variance components estimated via maximum likelihood. ICC confidence intervals were obtained via a profile likelihood approach, and comparisons among ICCs were made using a likelihood ratio test accounting for repeated assessments made by each rater.23 A sample of 15 preceptors per group and 16 ratings per preceptor was estimated to provide 80% power to detect a change from an estimated pre-workshop ICC of 0.6 to a post-workshop ICC of 0.8.24
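As a rough illustration of how an ICC follows from variance components, the sketch below computes a one-way random-effects ICC for a complete cases-by-raters table. It is a simplification and not the study's method: it uses ANOVA (method-of-moments) estimates rather than maximum likelihood, ignores rater effects and repeated assessments, and the function name and data layout are assumptions.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC for a single rating.

    ratings: list of rows, one row per case, each row holding the same
    number (k >= 2) of ratings from different raters.
    """
    n = len(ratings)       # number of cases
    k = len(ratings[0])    # ratings per case
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]

    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    # Variance components: between-case ("true") variance and error variance.
    var_case = (ms_between - ms_within) / k
    var_error = ms_within
    return var_case / (var_case + var_error)
```

The returned value is the share of total variance attributable to differences between cases; it can be negative when raters disagree more within a case than cases differ from one another.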
Accuracy analyses used only scripted encounters. Discrimination was evaluated by comparing mean ratings across the three levels of competence using mixed linear models with repeated measures on subjects. The frequency with which actual ratings matched scripted performance was compared between groups using a logistic regression model that included repeated measures across subjects via generalized estimating equations.25 Our experience with the scripted cases indicated that experts had legitimate disagreements in performance ratings, suggesting some degree of error in the scripted performance levels. Hence, ICC (which is similar to weighted kappa26) was calculated as a measure of chance-corrected agreement with scripted performance level. We calculated ICCs for each rater individually and compared these between groups using mixed linear models with repeated measures on subjects.
The halo effect was evaluated by averaging across domains for each set of ratings (i.e., the average of interviewing, exam, efficiency, etc. from one rater on one case) and comparing the resultant standard deviations between groups27 using mixed linear models with repeated measures on subjects. In this analysis, lower standard deviations indicate greater halo (less variability among domains).
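The halo measure described above can be sketched as the standard deviation of one rater's domain ratings on one case. The code below covers only that descriptive step; the mixed linear model used for the between-group comparison is not reproduced, and the function names and the handling of unrated domains (passed as `None`) are assumptions.

```python
from statistics import mean, stdev


def domain_spread(domain_ratings):
    """SD across the domains one rater scored for one case.

    Unrated domains are passed as None and skipped. Returns None when
    fewer than two domains were rated, because an SD is then undefined.
    """
    rated = [r for r in domain_ratings if r is not None]
    if len(rated) < 2:
        return None
    return stdev(rated)


def mean_domain_spread(rating_sets):
    """Average spread over many rater-by-case rating sets (lower = more halo)."""
    spreads = [s for s in map(domain_spread, rating_sets) if s is not None]
    return mean(spreads)
```

A rater who gives the same score to every domain yields a spread of zero, the strongest possible halo under this definition.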
Raters occasionally omitted overall ratings for some videotaped cases. When this occurred, the average of all rated domains was used as the overall rating (supplemented scores). To verify the appropriateness of this method, the strength of the association between overall scores and domain scores was computed as the square of Pearson’s r (R², or percent of variance explained). Domain scores accounted for >61% of the variance in overall scores (R² range 0.61–0.84). Because of this, and because analyses based on the subset of raters recording overall ratings yielded similar results, only analyses with supplemented scores are reported.
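A minimal sketch of the supplementation rule and the variance-explained check follows. The function names and the use of `None` for an omitted rating are assumptions made for illustration; the study's own computation was done in SAS and S-Plus.

```python
from statistics import mean


def supplemented_overall(overall, domain_scores):
    """Return the recorded overall rating, or the mean of rated domains if it is missing."""
    if overall is not None:
        return overall
    rated = [s for s in domain_scores if s is not None]
    return mean(rated)


def r_squared(xs, ys):
    """Square of Pearson's r between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```

Here `r_squared` would be applied to paired overall scores and domain scores from the cases where raters recorded both.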
For scores from observations of live resident–patient encounters, mean ratings before and after the workshop were compared using mixed linear models with repeated measures on raters. To estimate the reliability of these scores, random effects variance components were used to calculate the standard error of measurement (SEM)28,29 and the dependability index (phi).29 Residents with fewer than three observed encounters were excluded from reliability analyses.
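The relationship between these two indices can be sketched under a simple generalizability-theory assumption: one variance component for residents (the object of measurement) and one pooled absolute-error component. The actual design and variance components are in the Appendix and are not reproduced here; the function names are hypothetical and any numbers passed in are placeholders.

```python
from math import sqrt


def standard_error_of_measurement(var_abs_error: float) -> float:
    """SEM: square root of the absolute error variance."""
    return sqrt(var_abs_error)


def dependability_phi(var_resident: float, var_abs_error: float) -> float:
    """Phi: share of observed-score variance attributable to true resident differences."""
    return var_resident / (var_resident + var_abs_error)
```

Because phi depends on resident variance as well as error variance, two periods can share the same SEM yet differ in phi whenever the spread of resident performance differs between them.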
Raters’ beliefs about the mini-CEX were compared pre-workshop and post-workshop and between groups using the Wilcoxon signed rank and rank sum tests.
All analyses were conducted with SAS 9.1 and S-Plus 8.0.1 using a two-sided alpha of 0.05. All participants were analyzed in their assigned group.
Fifty-two of 54 eligible faculty members participated in the workshop and consented to their data being reported (Fig. 1). Mean (SD) age was 45.3 (11.2) years. Fifteen (29%) were women, six (12%) were full professors, 18/50 (36%) had been precepting for more than 12 years, and 7/51 (14%) had prior rater training. There were no significant demographic differences between the 31 who participated in the randomized trial and the 21 who did not (p ≥ 0.18); see Appendix for details.
Differences in interrater reliability for overall ratings among the four study conditions (immediate or delayed workshop, and pretest or posttest; p = 0.19) were small relative to the variability in the estimates (Table 1). For overall ratings the difference in ICC change of −0.12 favored the control group, with an approximate 95% confidence interval (CI) −0.50 to 0.26. Variance components are reported in the Appendix.
Accuracy was measured in three ways: mean ratings (discrimination), the frequency with which ratings matched scripted performance (percent agreement), and chance-corrected agreement using ICC. Ratings (Table 2) improved significantly as scripted performance rose (mean [SD] 3.5 [1.4] for unsatisfactory, 5.3 [1.5] for satisfactory, and 6.4 [1.4] for superior performance, p < 0.0001 for each pair-wise comparison). Overall ratings for delayed (5.0 [1.8]) and immediate groups (5.1 [1.9], CI for difference −0.2 to 0.4, p = 0.68) were similar when averaged across pre- and posttests. However, posttest ratings were higher (5.3 [1.9]) than pretest ratings (4.8 [1.8], CI for difference 0.2 to 0.8, p = 0.002) when averaging across groups. The interaction between group and testing period was not significant (p = 0.88), suggesting that the workshop did not affect mean ratings. The interaction between performance, group, and testing period was also not statistically significant (p = 0.94), suggesting that results were similar for each performance level. Similar results were found for analyses of interviewing, exam, and counseling domains (see Appendix).
Percent agreement (Table 2) was similar for delayed and immediate groups when averaging across tests (odds ratio [OR] for overall ratings 1.08 [CI 0.64–1.81] for delayed versus immediate [referent], p = 0.78) and between pretest and posttest when averaging across groups (OR 1.04 [0.66–1.65] for posttest versus pretest, p = 0.85). When assessing whether percent agreement for overall ratings changed to a greater extent from pretest to posttest in the immediate group relative to the delayed group, the gains were only 1.03-fold higher (CI 0.57–1.88, p = 0.92) in the immediate group. Differences for interviewing, exam, and counseling domains were likewise not statistically significant (p ≥ 0.08; see Appendix).
For overall ratings, accuracy ICCs were similar between groups at pretest (0.56 [0.28] delayed group, 0.54 [0.29] immediate group; CI for difference −0.19 to 0.22, p = 0.86) and again following the intervention (0.58 [0.26] delayed, 0.56 [0.21] immediate; CI for difference −0.15 to 0.19, p = 0.80). The interaction between group and testing period was not statistically significant (p = 0.49). Similar results were found for analyses of interviewing, exam, and counseling domain ratings.
Halo describes the degree to which overall impressions influence domain-specific ratings. In our analysis, lower standard deviations indicate greater halo. However, these values did not vary significantly between groups or testing periods (p ≥ 0.25), nor was the interaction significant (p = 0.85).
Over two years, 45 workshop participants assessed 177 different residents in outpatient clinic using the mini-CEX. On average, preceptors rated each of 116 residents 2.6 (1.6) times in the 12 months preceding the workshop, and each of 138 residents 3.7 (2.0) times in the 12 months following. Mean overall scores (using the 1 to 5 scale described above) were similar before (3.77 [0.55]) and after (3.78 [0.56]) the workshop (CI for difference, −0.08 to +0.07, p = 0.80).
The reproducibility coefficient (phi) was somewhat higher after the workshop (0.34) than before (0.18). However, the SEM was similar for both periods (0.51 before, 0.47 after). Variance components are reported in the Appendix.
Among 45 respondents, baseline ratings were 3.8 (1.4) for “confidence in your ability to use the mini-CEX,” and 4.1 (1.2) for accuracy and 4.6 (1.2) for usefulness of mini-CEX scores, using a scale where 1=Not confident/accurate/useful at all and 6=Very confident/accurate/useful. Following the workshop, confidence among 50 respondents (43 of whom provided pre-workshop responses) increased to 4.4 (1.0) (change 0.5 [0.1 to 0.9], p = 0.018). However, accuracy (3.9 [1.3]) and usefulness (4.7 [1.2]) ratings were not significantly changed (accuracy change −0.1 [−0.5 to +0.3], p = 0.53; usefulness change 0.0 [−0.3 to +0.3], p = 0.83). Ratings were not significantly different between the two randomized groups (p ≥ 0.058) or between randomized and nonrandomized participants (p ≥ 0.085).
This half-day rater training workshop did not appear to significantly affect the reliability, accuracy, or rater halo of mini-CEX scores. These findings were consistent in both a randomized comparison using videotaped encounters as the object of assessment, and in pre/post comparisons of scores from actual resident–patient encounters. Although participants’ confidence in using the mini-CEX improved, beliefs about the accuracy and usefulness of the mini-CEX changed little.
Several reasons might explain why rater training did not significantly affect accuracy and interrater reliability. First, we selected the sample size to detect effects larger than were observed. While the power to detect small effects was inadequate, the point estimate for reliability change favored the control group and the upper confidence limit (0.26) only slightly exceeded the a priori level of educational significance (0.2). These findings and the relatively small magnitude of other observed differences suggest that this workshop would be unlikely to produce desired gains in reliability and accuracy. Second, the workshop format might have been ineffective. Although we used instructional methods similar to a previous study of rater training,1 our workshop was somewhat shorter. Nevertheless, it was longer than the training provided in other studies of rater training.9,16 Moreover, these previous workshops did not clearly improve ratings although one workshop resulted in more stringent ratings.1 Thus a third explanation could be, as Williams et al.15 speculated, that “physician raters are impervious to training.” A fourth explanation is that frame of reference training is limited by case specificity.30 Workshop participants noted that consensus reached on one case (for example, examining a patient with dyspnea) may have little relevance to another case (examining a patient with knee pain). This failure of cases to generalize suggests that effective training would require detailed discussion of numerous cases, which may be infeasible in a typical workshop. A final explanation is that different raters observe and value different things.31 Participants often disagreed regarding what constituted unsatisfactory, satisfactory, or superior performance, and even prolonged discussion did not guarantee consensus.
While inconsistent ratings will lower interrater reliability and accuracy, differences of opinion among faculty may actually enhance formative feedback by illustrating alternative approaches to specific tasks. For summative assessments mini-CEX scores are limited by modest reliability.2 Yet the mini-CEX's more important role may be to encourage faculty to “get in the room” with residents, observe specific skills, and provide formative feedback. Despite its importance in professional development, feedback is insubstantial and infrequently given.7,32–34 Perhaps feedback should be the focus of future rater training.35
The workshop did enhance confidence in using the mini-CEX. The post-workshop increase in mini-CEX observations of working residents may also reflect a benefit from the workshop.
This study has several limitations. We felt that a 0.2-point increase in reliability would be educationally significant since Cohen classified correlation coefficient changes <0.3 as small36 and Landis and Koch's classification of rater agreement37 used blocks 0.2 points wide, but smaller changes could also be significant. The randomized trial used videotaped cases of standardized patients as the outcome, and these may not accurately reflect performance in practice. However, pre-post data from actual encounters demonstrated concordant results. Despite admirable validation efforts,17 the scripted competence classifications are undoubtedly imprecise, which may lead to underestimates of accuracy. The study took place at a single institution and all participants were general internists.
This study also has several strengths. We examined both reliability and accuracy, evaluated workshop outcomes using both videotaped and real-life applications of the mini-CEX, and achieved the target sample size for the randomized trial.
Future research will need to clarify the purpose of the mini-CEX (formative versus summative assessment) and then develop effective methods for training raters to achieve this purpose. Behavioral observation training and performance dimension training might be more realistic than trying to improve accuracy and reliability across domains. Alternatively, extending the training period using methods such as Web-based learning may help.38
We thank E.S. Holmboe for use of scripted cases, G.R. Norman for advice on psychometric analyses, and Mayo Internal Medicine faculty for their participation in this study. Funding was provided by internal sources (the Mayo Education Innovation Program). Study sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; or preparation, review, or approval of the manuscript. Dr. Cook had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Conflict of Interest None disclosed.