Background: From 2010 to 2011, more than 70% of the clinical rotation competency evaluations for nephrology fellows in our program were rated “superior” using a 9-point Likert scale, suggesting some degree of “grade inflation.”
Objective: We sought to assess the efficacy of a 5-point centered rotation evaluation in reducing grade inflation.
Methods: This retrospective cohort study assessed the impact of faculty education and a 5-point rotation evaluation on grade inflation, measured by superior item rating frequency and the proportion of evaluations without superior ratings. The 5-point evaluation centered performance at the level expected for stage of training. Faculty education began in 2011–2012. The 5-point centered evaluation was introduced in 2012–2013 and used exclusively thereafter. A total of 68 evaluations, using the 9-point Likert scale, and 63 evaluations, using the 5-point centered scale, were performed after first-year fellow clinical rotations. Nine to 12 faculty members participated yearly.
Results: Faculty education alone was associated with fewer superior ratings from 2010–2011 to 2011–2012 (70.5% versus 48.3%, P=.001), declining further with 5-point centered scale introduction (2012–2013; 48.3% versus 35.6%; P=.012). Superior ratings declined with 5-point centered versus 9-point Likert scales (37.3% versus 59.3%, P=.001), specifically for medical knowledge, patient care, practice-based learning and improvement, and professionalism. On logistic regression, evaluations without superior scores were more likely for 5-point centered versus 9-point Likert scales (adjusted odds ratio [aOR]=8.26; 95% CI 1.53–44.64; P=.014) and associated with faculty identifier (aOR=1.18; 95% CI 1.03–1.35; P=.013), but not fellow identifier or training year quarter.
Conclusions: Grade inflation was reduced with faculty education and the 5-point centered evaluation scale.
Grade inflation, with a majority of learners being given “superior” ratings, is common in graduate medical education.
A 5-point centered scale and faculty education reduced the percentage of superior ratings.
Single site, single specialty study limits generalizability; dual intervention makes attribution of effect complex.
Faculty development and use of a 5-point centered scale reduced grade inflation.
Editor's Note: The online version of this article contains the evaluation form used in the study.
Many internal medicine subspecialty programs use end-of-rotation evaluations based on the recently modified American Board of Internal Medicine (ABIM) FasTrack 9-point Likert scale to assess trainee performance in the 6 Accreditation Council for Graduate Medical Education (ACGME) competencies: medical knowledge (MK), patient care (PC), interpersonal and communication skills (ICS), professionalism (PROF), systems-based practice (SBP), and practice-based learning and improvement (PBLI). This ordinal item rating scale defines “superior” as 7 to 9, “satisfactory” as 4 to 6, and “unsatisfactory” as 1 to 3.1 Validity may be reduced by grade inflation and poor interrater reliability.2 Validity and reliability improve by employing an optimal number of response categories (4 to 7 for a Likert-type scale, with larger numbers adding little value) and “anchoring” descriptions for each response category.3
The ACGME Milestone Project requires that rotation evaluations meaningfully assess whether trainees are progressively improving and meeting competency milestones. Assessment is not peer comparison, but demonstrates individual objective milestone attainment.4 In 2015, the ABIM introduced the ACGME Milestone reporting worksheet, which uses a 9-point Likert scale with anchoring descriptions of milestone progress as the annual trainee assessment.5 This evaluation schema has not yet been validated.
In 2010, clinical rotation assessments for nephrology fellows at the Walter Reed National Military Medical Center demonstrated grade inflation. In the academic year (AY) 2010–2011, 70.5% of item assessments in the 6 competencies, using a 9-point Likert scale, were superior ratings (ie, in the 7 to 9 range). Fellows expected “superior” ratings.
To address this problem, we conducted faculty education regarding grade inflation. We also developed a 5-point rotation assessment anchored for each response category, which centers trainees performing at the level expected for their stage of training (ie, meeting milestones) at response category 3. This is 1 of the measurement tools used in our curricular milestone schema and informs clinical competency committee decisions regarding milestone achievement.6,7
Before AY 2012–2013, our clinical rotation evaluation was based on the ABIM FasTrack 9-point Likert scale, with anchor descriptions at the lowest and highest scale categories, and a scale item for each of the 6 competencies (14 additional items assessed nephrology-specific performance in physical examination, transplant management, renal replacement therapy, outpatient clinic, transitions of care, nephrology procedures, etc). The MK item is shown in figure 1A. The ordinal rating scale for each item defined superior as 7 to 9 (“far exceeds reasonable expectations”), satisfactory as 4 to 6 (“always meets and occasionally exceeds reasonable expectations”), and unsatisfactory as 1 to 3 (“consistently falls short of reasonable expectations and does not show progress”).1 This rating scale will be referred to as the “9-point Likert” scale.
In AY 2011–2012, clinical faculty received education from the program director regarding end-of-rotation grade inflation. Faculty raters were encouraged to give 4 to 5 ratings in each competency as baseline satisfactory performance early in training. As fellows advanced through training, this baseline would allow higher scores to be used to indicate progressive improvement and milestone achievement.
In July 2012, we adopted a 5-point rating scale, centered at 3, defined as satisfactory performance for level of training. Ratings 4 and 5 indicated performance above level of training. Unsatisfactory ratings (1 and 2) and ratings of 5 required written explanations. The MK item is shown in figure 1B, and the entire evaluation form is provided as online supplemental material. In addition to the 6 competency items, 5 items for assessment of transitions of care, outpatient clinic, transplantation, renal replacement therapy, and nephrology-related procedures were included. This rating scale will be referred to as the “5-point centered” scale.
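The written-explanation rule for the 5-point centered scale can be expressed as a simple form check. The sketch below is a hypothetical illustration only and is not the evaluation software configuration used in the study; the function name, the competency abbreviations used as keys, and the sample comment are assumptions made for the example.

```python
# Hypothetical check of one 5-point centered evaluation (illustration only).
# Per the text, ratings of 1, 2, or 5 require a written explanation.
NEEDS_EXPLANATION = {1, 2, 5}

def missing_explanations(ratings, comments):
    """Return the items whose rating requires a comment that is absent.

    ratings  -- dict mapping item name to an integer rating from 1 to 5
    comments -- dict mapping item name to free-text explanation
    """
    problems = []
    for item, rating in ratings.items():
        if rating not in range(1, 6):
            raise ValueError(f"{item}: rating {rating} is outside 1-5")
        if rating in NEEDS_EXPLANATION and not comments.get(item, "").strip():
            problems.append(item)
    return problems

# Made-up ratings for the 6 competency items.
ratings = {"MK": 3, "PC": 4, "ICS": 5, "PROF": 3, "SBP": 2, "PBLI": 3}
comments = {"ICS": "Led family meetings independently."}
print(missing_explanations(ratings, comments))  # ['SBP']
```

In this made-up case the ICS rating of 5 is accepted because it carries a comment, while the SBP rating of 2 is flagged.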
At twice-yearly formative evaluations, fellows were assured that a 3 rating represented satisfactory performance for level of training (ie, milestones were being met), and that absence of superior ratings did not indicate poor performance. Faculty were informed that a 3 should be the most frequent rating given to a successful fellow, that 4 indicated that milestones were being met earlier than expected, and that a rating of 5 required explanation. Evaluations were not to be referenced to peer performance, but to individual progress in meeting milestones. Thus, a graduating fellow “ready for unsupervised practice” in a given competency would have the same rating (3) as a successful fellow in the first month of clinical training.
Evaluations were programmed and distributed using medical evaluation software (E*Value, Advanced Informatics, Minneapolis, MN). Scores were accessed using the Trainee Reports/Aggregate Performance search feature, filtered by start date, end date, type of rotation (activity), primary training site, faculty evaluator, evaluation type, and trainee cohort. All faculty evaluations of first-year fellows during inpatient and outpatient rotations at the primary training site were reviewed between AY 2010–2011 and AY 2013–2014. Item ratings in each competency based on academic year, fellow cohort, rotation block, rotation type, faculty, and type of evaluation (9-point Likert versus 5-point centered) were entered in a Microsoft Excel (Microsoft Corp, Redmond, WA) spreadsheet.
The study was determined exempt from Institutional Review Board review and approved by the Walter Reed National Military Medical Center Department of Research Protections. The manuscript was approved by the Walter Reed National Military Medical Center Department of Research Programs and Office of Public Affairs.
For each evaluation, 6 items were evaluated, 1 for each competency. AY 2011–2012 was the reference year when the project began and the 9-point Likert evaluation was in use. Data are presented as absolute numbers, proportions, or percentages. Descriptive statistics were performed in Microsoft Excel. Comparisons were made as appropriate using Fisher exact test (QuickCalcs, GraphPad Software Inc, La Jolla, CA). P<.05 was considered significant.
Logistic regression was performed in Stata SE 12.1 (StataCorp LP, College Station, TX), with a binary outcome variable of “no superior score” versus “≥ 1 superior score” in the 6 competency items for each evaluation. There were a total of 131 evaluations (observations). Superior scores were defined as 7 to 9 on the 9-point Likert scale and 4 to 5 on the 5-point centered scale; all lower ratings were nonsuperior. Independent variables (covariates) were attending faculty anonymous identifier, fellow anonymous identifier, AY quarter (with the first quarter as the reference), and type of evaluation (9-point Likert or 5-point centered). Of 131 evaluations, 32 (24.4%) did not have a superior score in any of the 6 competency items. Ninety-nine (75.6%) had 1 or more superior scores in the 6 competency items.
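The binary outcome described above can be coded per evaluation as follows. This is a minimal sketch, not the authors' data preparation; the scale labels `likert9` and `centered5`, the function name, and the sample ratings are assumptions introduced for illustration.

```python
# Lowest rating counted as superior on each scale, as defined in the text:
# 7 to 9 on the 9-point Likert scale, 4 to 5 on the 5-point centered scale.
SUPERIOR_CUTOFF = {"likert9": 7, "centered5": 4}

def no_superior_score(item_ratings, scale):
    """Return 1 if none of the competency items is superior, else 0."""
    cutoff = SUPERIOR_CUTOFF[scale]
    return int(all(r < cutoff for r in item_ratings))

# Made-up ratings for the 6 competency items (MK, PC, ICS, PROF, SBP, PBLI).
print(no_superior_score([6, 6, 5, 6, 6, 5], "likert9"))    # 1
print(no_superior_score([6, 7, 5, 6, 6, 5], "likert9"))    # 0
print(no_superior_score([3, 3, 4, 3, 3, 3], "centered5"))  # 0
```

A single superior item is enough to place an evaluation in the “≥ 1 superior score” group, which is why the second and third examples return 0.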
Fellow performance was independently assessed between AY 2010–2011 and 2011–2012 versus AY 2012–2013 and 2013–2014, by determining the percentage of first-year fellow chart audit deficiencies for each time period, as previously described.8
Table 1 shows the number of first-year fellow evaluations by academic year at the primary training site, the number of faculty, the number of entering first-year fellows, and yearly project events. Seven faculty did evaluations in all 4 training years. There was minimal overlap between the evaluation types. Before AY 2012–2013, all evaluations were 9-point Likert. Three 9-point Likert evaluations were done early in AY 2012–2013. Subsequently, all evaluations were 5-point centered. Rating distribution for each evaluation type is shown in figure 2. The item rating distribution for the 9-point Likert scale in 2010–2011 and 2011–2012 was confined to a 6-point spread (4 to 9) centered at 7 (figure 2A).
Faculty education alone was associated with fewer superior ratings from 2010–2011 to 2011–2012 (70.5% versus 48.3%, P=.001), declining further after 5-point centered scale introduction (2012–2013; 48.3% versus 35.6%; P=.012; figure 3). There were 68 nine-point Likert evaluations (408 items total), and 63 five-point centered evaluations (378 items total). A total of 242 (59.3%) 9-point Likert scale evaluation items were rated superior versus 141 (37.3%) with the 5-point centered scale (P=.001). The proportion of evaluations without superior scores increased significantly in AY 2012–2013, with the introduction of the 5-point centered scale (figure 3). Among 68 nine-point Likert evaluations, only 7 (10.3%) had no superior score in any of the 6 competency items, while 25 of 63 (39.7%) 5-point centered evaluations had no superior score (P=.001). Three of 7 faculty who did evaluations in all 4 study years were the authors. There was no difference in percentage of evaluations without superior scores between authors (n=3, 7 of 47, 14.9%) versus non-authors (n=4, 14 of 53, 26.4%; P=.22).
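The 2-by-2 comparisons in this paragraph can be rechecked from the reported counts. The study used an online calculator for the Fisher exact test; the sketch below is a standard-library reimplementation for illustration, and the function name is an assumption. The two-sided P value is taken as the sum of the probabilities of all tables with the same margins that are no more likely than the observed one, which is one common convention and may differ slightly from other software.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-7))

# Counts from the text; the second entry in each row is total minus first.
p_items = fisher_exact_two_sided(242, 408 - 242, 141, 378 - 141)  # superior items
p_evals = fisher_exact_two_sided(7, 68 - 7, 25, 63 - 25)    # no-superior evaluations
p_authors = fisher_exact_two_sided(7, 47 - 7, 14, 53 - 14)  # authors vs non-authors
print(f"items: {p_items:.2g}, evaluations: {p_evals:.2g}, authors: {p_authors:.2g}")
```

The first two comparisons should come out well below .05 and the author comparison above it, in line with the significance pattern reported here.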
The percentage of superior ratings in MK, PC, PBLI, and PROF significantly declined in association with the 5-point centered scale (table 2). This was most marked for MK, where superior ratings decreased from 42.6% to 14.5% (P=.001).
The percentage of first-year fellow chart audit deficiencies did not differ between the 2010–2011 and 2011–2012 period and the 2012–2013 and 2013–2014 period (15.1% versus 14.5%, P=.62), suggesting that the decline in superior scores was unrelated to performance differences between the 2 cohorts.
On logistic regression, evaluations without superior scores in any competency were significantly associated with the 5-point centered versus the 9-point Likert scale (adjusted odds ratio [aOR]=8.26; 95% CI 1.53–44.64; P=.014). Evaluations without superior scores were also associated with a faculty identifier (aOR=1.18; 95% CI 1.03–1.35; P=.013), but not with a fellow identifier or academic year quarter.
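For orientation, a crude (unadjusted) odds ratio can be computed directly from the evaluation counts reported earlier: 25 of 63 five-point centered evaluations and 7 of 68 nine-point Likert evaluations had no superior score. The sketch below uses the usual Wald (Woolf) interval on the log odds ratio; the function name is an assumption. Because it ignores faculty identifier, fellow identifier, and academic year quarter, its result is not expected to match the adjusted odds ratio from the regression model.

```python
from math import exp, log, sqrt

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d) with a Wald (Woolf) 95% confidence interval."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# Evaluations without / with any superior score, from the counts in the text:
# 5-point centered: 25 without, 63 - 25 = 38 with
# 9-point Likert:    7 without, 68 - 7  = 61 with
or_, low, high = odds_ratio_wald_ci(25, 38, 7, 61)
print(f"unadjusted OR = {or_:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The crude estimate is roughly 5.7, the same direction as the adjusted estimate of 8.26, with a similarly wide interval that reflects the small number of evaluations.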
Faculty education and the introduction of a 5-point centered end-of-rotation evaluation were associated with significant declines in superior scores, specifically for MK, PC, PBLI, and PROF. It is not possible to differentiate the relative contributions of faculty education and the 5-point centered evaluation. Education alone was associated with an absolute reduction of 22 percentage points in superior item ratings (70.5% to 48.3%), but a significant increase in evaluations without superior scores occurred only after 5-point centered evaluation introduction.
Evaluations without superior scores were significantly associated with the 5-point centered scale on logistic regression. However, superior scores did not decline significantly for SBP and ICS, and 37% of items received superior scores after full implementation of education and the 5-point centered evaluation. Regardless, these interventions were associated with significant reductions in grade inflation and could be applied to other evaluation scenarios, such as mini-clinical evaluation exercises.9
The 5-point centered anchoring statements were designed to reassure faculty and trainees that response category 3 describes completely satisfactory performance for stage of training, while not falsely suggesting attainment of “ready for unsupervised practice” or aspirational milestones.4 Receiving this type of rating may make trainees more prepared to focus on deficiencies. The requirement that extreme superior ratings (response category 5) have written explanations served as a disincentive. On logistic regression, evaluations without superior scores were associated with faculty identifier, indicating differences in rating strictness or leniency.10 Fellow identifier was not a significant predictor for evaluations without superior scores, suggesting that the decline in superior scores was not due to poor fellow performance. This is supported by unchanged first-year fellow outpatient chart audit deficiencies before and after 5-point evaluation introduction.8
Grade inflation is a tenacious problem for medical educators, due to leniency bias, halo effect, desire to reward well-liked trainees, and unpleasant message avoidance.10–12 Fifty-five percent of internal medicine clerkship directors reported difficulty with grade inflation.11 The standardized letter of recommendation for residency applicants to emergency medicine programs placed 40.1% of potential trainees in the top 10%.13 Varney et al2 devised a criteria-based, anchored evaluation system for their internal medicine residency program, resulting in a significant decline in superior ratings. Before intervention, their residents expected scores of 8 to 9 on the 9-point Likert scale.
The 9-point Likert scale is associated with grade inflation for residents and practicing physicians.10 Our most frequent item score was 7, the mean rating given by program directors to graduating nephrology fellows who passed the ABIM nephrology examination.14 The 9-point Likert scale may have too many categories.3,10 Four to 7 categories are considered optimal, and extreme categories may complicate ratings. This is demonstrated by the frequency distribution of our 9-point Likert item scores, essentially confined to item scores of 5 to 9 (figure 2A).
Limitations of our study include its retrospective design, a single training program, a small number of trainees and faculty, and a relatively short study period, all of which limit the ability to generalize our findings. Given time, grade inflation may recur despite faculty education and 5-point centered scale implementation.
Future investigation should be directed to prospectively evaluating the individual effects of faculty education and the 5-point centered scale on grade inflation, and the introduction of an integrated set of evaluation tools to sufficiently assess clinical competency and milestone achievement, above and beyond traditional end-of-rotation appraisals.7
Faculty education and the introduction of a 5-point centered end-of-rotation evaluation were associated with significant declines in superior scores. End-of-rotation evaluations by faculty should not be used as the sole determinant of milestone achievement due to many inherent biases, including grade inflation, which are independent of instrument design and resistant to faculty training.10 Faculty education may moderate these biases, but cannot completely remove them.