Attending evaluations are commonly used to evaluate residents.
To evaluate the quality of written feedback given to internal medicine residents.
Internal medicine residents and faculty at the Medical College of Wisconsin from 2004 to 2012.
From monthly evaluations of residents by attendings, a randomly selected sample of 500 written comments by attendings was qualitatively coded and rated as high-, moderate-, or low-quality feedback by two independent coders with good inter-rater reliability (kappa: 0.94). In small group exercises, residents and attendings also coded the utterances as high, moderate, or low quality and developed criteria for this categorization. In-service examination scores were correlated with written feedback.
There were 228 internal medicine residents who had 6,603 evaluations by 334 attendings. Among 500 randomly selected written comments, there were 2,056 unique utterances: 29 % were coded as nonspecific statements, 20 % were comments about resident personality, 16 % about patient care, 14 % interpersonal communication, 7 % medical knowledge, 6 % professionalism, and 4 % each on practice-based learning and systems-based practice. Based on criteria developed by group exercises, the majority of written comments were rated as moderate quality (65 %); 22 % were rated as high quality and 13 % as low quality. Attendings who provided high-quality feedback rated residents significantly lower in all six of the Accreditation Council for Graduate Medical Education (ACGME) competencies (p <0.0005 for all), and had a greater range of scores. Negative comments on medical knowledge were associated with lower in-service examination scores.
Most attending written evaluation was of moderate or low quality. Attendings who provided high-quality feedback appeared to be more discriminating, providing significantly lower ratings of residents in all six ACGME core competencies, and across a greater range. Attendings’ negative written comments on medical knowledge correlated with lower in-service training scores.
An important obligation of program directors and attendings in medical education programs is to provide feedback to their learners.1–3 Feedback is “specific information about the comparison between a trainee’s observed performance and a standard, given with the intent to improve trainee’s performance,”4 and is an essential component for the growth of trainees.2,5 Unfortunately, despite considerable information on the subject, the quality of oral and written feedback is often low.3 Previous studies have shown that feedback tends to be nonspecific, is not provided in a timely manner, and does not provide learners with sufficient information to improve their performance.6–9 Residents and attendings frequently disagree on the quality and quantity of feedback provided,10–15 with the result that feedback is commonly cited as needing improvement.16,17
Several studies have examined feedback. Frye and colleagues found that feedback varied widely in its organization, level of interaction, and depth.18 Kogan found that feedback was complex, that there was considerable variability in feedback techniques, and that many factors affected how staff felt about delivering feedback.19 Delva found that feedback was affected by four factors: learning culture, relationships, purpose of feedback, and emotional responses to feedback.20 Ende found that feedback was often implicit and inferential rather than explicit, and consequently was frequently misunderstood by residents.21 Several papers have provided opinions on improving feedback quality.2,4,11,22,23 For example, Skeff characterized high-quality feedback as specific, emphasizing behavior, frequent, selective, timely, balanced, tailored to the learning climate, interactive, labeled as feedback, and resulting in an action plan for improving performance.24 However, few studies have directly observed and evaluated feedback quality; most rely on resident and attending surveys of their opinions about the quality of feedback delivered. No previous study has developed criteria for assessing written feedback quality. The objectives of our study were to 1) describe the characteristics of written feedback, 2) correlate written feedback with ratings of residents by their attendings and with scores on the in-service training examination, 3) develop criteria for assessing feedback quality, and 4) use that schema to rate the quality of written feedback.
Subjects for this retrospective analysis were Medical College of Wisconsin (MCW) internal medicine residents, across all training levels, who completed residency from 2004 to 2012. Residents were evaluated at least monthly by their attendings as they moved through various inpatient and ambulatory rotations and at least semiannually by their continuity clinic preceptors. These evaluations rated resident performance in six domains (patient care, medical knowledge, interpersonal communication, professionalism, practice-based learning and improvement, and systems-based practice),25 and were rated on a scale from 1 through 9, anchored as 1 (unsatisfactory), 5 (satisfactory), and 9 (superior). Attendings provided an “overall” rating of residents on a scale from 1 through 9, and were also asked to provide written comments on their residents. Five hundred attending evaluations that included written feedback were randomly selected from among the 6,603 available evaluations. Randomization was achieved by assigning each attending evaluation a unique number and then randomly selecting, without replacement, 500 numbers between 1 and 6,603 for inclusion. Randomization and all calculations were performed using STATA software (v. 13.1; StataCorp LP, College Station, TX, USA).
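As an illustration of the selection step described above, drawing 500 evaluation numbers without replacement can be sketched as follows. This is a minimal Python sketch, not the Stata code used in the study; the function name `select_evaluations` and the seed value are assumptions introduced here so that the draw is reproducible.

```python
import random


def select_evaluations(n_total=6603, n_sample=500, seed=2015):
    """Draw n_sample distinct evaluation numbers from 1..n_total.

    The seed is arbitrary (an assumption for this sketch); fixing it
    only makes the same sample come back on every run.
    """
    rng = random.Random(seed)
    # random.sample draws without replacement, so no number repeats.
    return sorted(rng.sample(range(1, n_total + 1), n_sample))


selected = select_evaluations()
print(len(selected), "evaluations selected")
```

Sampling without replacement guarantees that no evaluation is counted twice in the coded sample.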
Among these 500 resident evaluations, attending written comments were coded independently by two coders (JLJ, CK) with good inter-rater reliability (ICC: 0.85). Each statement that provided a single feedback item was coded as a unique utterance. For example, a statement that the “resident was reliable and very well organized” would be coded as two utterances (reliable, well organized). Utterances were secondarily coded, when possible, into one of the six ACGME core competencies (patient care, medical knowledge, interpersonal communication, professionalism, practice-based learning and improvement, and systems-based practice). Statements that were generic, such as “this was a good resident,” were coded as nonspecific. Statements about personality characteristics, such as “X was enthusiastic,” were coded as personality characteristics. Secondary coding included whether the utterance was positive, negative, or neutral.
We led a series of small group exercises of medicine residents and medicine attendings. Attending written feedback statements were de-identified and placed on 4 × 6 index cards. The groups were asked to sort the statements into three categories of high-, moderate-, and low-quality feedback, and to discuss these decisions aloud, including the criteria used to determine the rating. One of the group members served as secretary, keeping track of the criteria on a flip chart. Field notes were recorded by at least two observers (JLJ, CK, or WJ). In addition, the sessions were audiotaped, and de-identified transcripts were reviewed to confirm our notes and that all quality characteristics mentioned had been captured. The attendees were not provided a list of potential feedback characteristics, but were asked to discuss each attending utterance and specify how they would label the feedback. At the end of the exercise, participants formally developed criteria that they used to rate the feedback as high, moderate, or low quality. All discussion group members provided informed consent and received no compensation for participation.
In addition to coding the transcripts for content, our two coders, informed by the criteria proposed by the small groups, coded the transcripts as high-, moderate-, or low-quality feedback (Table 1). Feedback that met none of these criteria was rated as low quality. Moderate-quality feedback met at least one quality criterion. To be considered of high quality, feedback had to meet two or more of the above-mentioned quality criteria.
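The rating rule described above reduces to counting how many quality criteria a comment meets. A minimal sketch follows, assuming the criteria are represented as short labels; the label strings and the function name `rate_feedback` are hypothetical, and in the study the coders applied the criteria by judgment rather than by software.

```python
# Criteria proposed by the small groups; the label strings are
# illustrative shorthand, not the authors' wording.
QUALITY_CRITERIA = frozenset({
    "quantifiable", "specific", "actionable", "balanced",
    "objective", "goal-based", "behavioral",
})


def rate_feedback(criteria_met):
    """Return 'low', 'moderate', or 'high' from the criteria a comment meets."""
    met = set(criteria_met)
    unknown = met - QUALITY_CRITERIA
    if unknown:
        raise ValueError(f"unrecognized criteria: {sorted(unknown)}")
    if len(met) == 0:
        return "low"       # meets none of the criteria
    if len(met) == 1:
        return "moderate"  # meets exactly one criterion
    return "high"          # meets two or more criteria


print(rate_feedback([]))
print(rate_feedback(["specific"]))
print(rate_feedback(["specific", "actionable"]))
```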
In-service training examinations were conducted each year during the study period, and we had at least one in-service training examination score for all residents. There was very high correlation between in-service examination scores.26 In cases where more than one score was available, we used the average score. We examined the relationships between in-service scores and the quality of feedback and between the polarity (positive, negative, neutral) of feedback in the seven domains and in-service examination scores using analysis of variance. We used quadratic kappas and intraclass correlation coefficients to assess inter-rater reliability between the different group classifications of the quality of feedback as well as the coders. This study was approved by our institution’s institutional review board.
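For readers unfamiliar with the quadratic-weighted kappa, the statistic penalizes each disagreement by the squared distance between the ordered categories, so with three categories a low-versus-high disagreement counts four times as much as a low-versus-moderate one. A minimal pure-Python sketch follows; the function name and the example ratings are hypothetical and are not study data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with quadratic weights for two raters.

    `categories` lists the ordered rating levels, lowest first.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed agreement table: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


LEVELS = ["low", "moderate", "high"]
# Hypothetical ratings of six comments by two coders.
coder_1 = ["low", "moderate", "moderate", "high", "high", "moderate"]
coder_2 = ["low", "moderate", "high", "high", "high", "low"]
print(round(quadratic_weighted_kappa(coder_1, coder_2, LEVELS), 2))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.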
There were 228 internal medicine residents, with a total of 6,603 evaluations by 334 attendings; 1,387 (21 %) had no written feedback. Among 500 randomly selected written comments, there were 2,056 unique utterances (mean 4.1, range 1–8). The 500 randomly selected comments were equally distributed among the 8 years comprising the sample time frame (p=0.87) as well as among interns and second- and third-year residents (p=0.63). The majority of utterances were from inpatient rotations (n=1,826, 88 %) and consultation rotations (n=148, 7 %); a smaller number (n=82, 4 %) were from continuity experiences. Continuity written feedback had slightly more utterances than inpatient or other ambulatory rotations (5.1 vs. 3.9 vs. 4.1, p=0.002).
Of unique utterances, the most common type was nonspecific (29 %, n=600); 20 % (n=415) of the comments were about resident personality, 16 % (n=324) about patient care, 14 % (n=292) interpersonal communication, 7 % (n=146) medical knowledge, 6 % (n=117) professionalism, and 4 % each on practice-based learning (n=89) and systems-based practice (n=73) (Table 2). The majority of written feedback comments were positive (n=1,813, 88 %); 8 % (n=155) were negative, and 4 % (n=88) were neutral (Table 3). Nonspecific comments and comments on a resident’s attitude or personality were less likely to be negative than the other domains (nonspecific, OR: 0.22, 95 % CI: 0.13–0.39; attitude/personality, OR: 0.53, 95 % CI: 0.34–0.82). Three ACGME competencies were more likely to include negative comments: medical knowledge (OR: 3.5, 95 % CI: 2.2–5.6), practice-based learning (OR: 2.5, 95 % CI: 1.3–4.8), and systems-based practice (OR: 4.6, 95 % CI: 2.5–8.3).
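The percentages reported above follow directly from the utterance counts. A short check, using only the counts given in the text (the variable names are assumptions introduced here):

```python
# Utterance counts by type, as reported in the text.
type_counts = {
    "nonspecific": 600,
    "personality": 415,
    "patient care": 324,
    "interpersonal communication": 292,
    "medical knowledge": 146,
    "professionalism": 117,
    "practice-based learning": 89,
    "systems-based practice": 73,
}
# Utterance counts by polarity, as reported in the text.
polarity_counts = {"positive": 1813, "negative": 155, "neutral": 88}


def to_percentages(counts):
    """Convert raw counts to whole-number percentages of their total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_utterances = sum(type_counts.values())
type_pct = to_percentages(type_counts)
polarity_pct = to_percentages(polarity_counts)

print(total_utterances)
for name, pct in type_pct.items():
    print(f"{name}: {pct} %")
```

Both breakdowns sum to the same total of unique utterances, which is a quick consistency check on the reported figures.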
The distribution of utterance types differed significantly among inpatient, ambulatory, and continuity experiences (p=0.001). Ambulatory preceptors were similar to inpatient preceptors except that they were less likely to comment on resident communication skills (OR: 0.42, 95 % CI: 0.22–0.80; Table 4). Continuity preceptors were less likely to comment on the resident's personality characteristics (OR: 0.26, 95 % CI: 0.12–0.56), and were more likely to make negative comments (OR: 2.8, 95 % CI: 1.2–4.3) and to comment on the resident’s systems-based practice (OR: 2.3, 95 % CI: 1.1–4.9) and professionalism (OR: 2.0, 95 % CI: 1.2–3.4).
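The text does not state how the odds ratios were estimated. As a hedged illustration only, the sketch below computes an odds ratio and a Wald-type (Woolf) 95 % confidence interval from a 2 × 2 table; the counts are hypothetical, and the function name `odds_ratio_ci` is an assumption introduced here.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2 x 2 table.

    a: group 1 with the feature     b: group 1 without the feature
    c: group 2 with the feature     d: group 2 without the feature
    z=1.96 gives an approximate 95 % interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: negative vs. non-negative utterances in two settings.
or_value, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
print(f"OR: {or_value:.2f}, 95 % CI: {ci_low:.2f}-{ci_high:.2f}")
```

Because the interval is symmetric on the log scale, the point estimate is the geometric mean of the two bounds.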
We conducted 10 small group sessions, with a total of 31 participants; 12 were faculty and 19 were medicine residents. The small groups identified several characteristics of higher-quality written feedback, which included the following: quantifiable, specific, actionable, balanced, objective, based on goals, and behavioral/not personal (Table 1). The groups uniformly proposed that written feedback that included none of these characteristics should be rated as low quality, that feedback meeting at least one of these criteria was moderate, and that meeting more than one of these criteria was high-quality feedback. While all of the groups proposed the same criteria for judging feedback quality as low, moderate, or high, the inter-rater reliability among groups was low (quadratic kappa ranging from 0.22 to 0.28).
Two coders (JLJ, CK) independently applied these criteria, with good inter-rater reliability (quadratic kappa: 0.87). Based on the criteria, the majority of attendings' written comments were rated as moderate in quality (65 %, n=322); 22 % were rated as high quality (n=111) and 13 % low (n=65). None of the written feedback from continuity preceptors was rated as low quality, though rates of moderate- (61 %) and high-quality feedback (39 %) were similar to non-continuity rotations (p=0.36). There was a stepwise increase in the number of written comments as the feedback rating increased from low to moderate to high quality (average: 2.3 vs. 4.4 vs. 4.6, p <0.0001). Attendings who were rated as having high-quality written comments rated residents significantly lower and had greater spread of ratings in all six of the ACGME competencies as well as on their overall performance (Table 5).
There was no relationship between in-service training examination scores and the quality (p=0.18) or polarity of feedback (positive, negative, neutral, p=0.32). However, residents who received negative attending comments regarding their knowledge had lower in-service training scores (53.6 vs. 57.5, p=0.009).
Attending written feedback was generally limited by several factors. First, 21 % of evaluations had no written comments at all. While the online evaluation system could require some kind of written comment, it is likely that attendings mandated to enter comments would not provide thoughtful or meaningful ones. Moreover, even when there were comments, only 22 % of evaluations were considered high quality. As might have been expected, the more comments that were provided, the more likely that the evaluation would meet criteria for meaningful feedback. While each evaluation had an average of four comments, the fact that only one-fifth had two or more meaningful comments (meeting criteria for high quality) suggests that most of the comments were not helpful.
Almost all comments were positive. Negative comments were mostly related to the medical knowledge, practice-based learning, and systems-based practice competencies. However, comments on practice-based learning and systems-based practice were rare (each only 4 % of the total) such that the benefit of these was quite limited. While it is difficult to correlate negative comments in these two competencies with outcomes, negative comments in the medical knowledge competency correlated with poorer scores on the in-training examination.
While our coders achieved very high reliability in coding utterances and applying the criteria to categorize written feedback quality as high, moderate, or low, our small groups had low inter-rater reliability. This is interesting given that all of the small groups came up with similar criteria for rating the quality of the feedback. Field notes indicate a considerable discrepancy between groups in determining when statements were sufficiently specific; some groups were more liberal and others stricter. A second area of disagreement was in deciding whether statements counted as actionable feedback.
Characteristics of higher-quality written feedback included being quantifiable, specific, actionable, balanced, objective, goal-based, and behavioral rather than personal. We found two characteristics in particular where faculty commonly fail when providing feedback: 29 % of comments were nonspecific, and another 20 % were based on the resident’s personality rather than behavior-based. Addressing these two factors alone could significantly improve the quality in half of the feedback comments provided by faculty.
Several barriers to providing high-quality feedback have been identified in the literature. A common one is inadequate time to evaluate the resident. This could explain why there were no examples of low-quality feedback from continuity preceptors, who evaluate every 6 months based on a longer exposure period. Other barriers include concern about damaging the relationship with the resident and the tendency for negative feedback to elicit emotional responses.3 A recent challenge is the “millennial generational issue,” which suggests that the current generation of residents was raised in an environment in which feedback from their mentors led them to feel that they were special, and they are consequently now poor at self-assessment27 and lack the reflective skills to incorporate feedback.28
Some aspects of our work are similar to previous findings; studies have found that written comments are often sparse29–31 and nonspecific,8,32 and fail to distinguish among competence levels of residents.33 In addition, resident evaluations commonly suffer from both grade inflation and range restriction.34 Faculty who put the time and thought into providing more meaningful comments may also be more accurately assessing the performance level of the resident.
There are a few notable limitations to this study. First, it was at a single site involving a single specialty. While other studies have suggested that poor feedback is a common problem, generalizing our results to other specialties or sites should be done with caution. Second, we had in-training examination scores for all participants rather than the more important American Board of Internal Medicine (ABIM) scores, and did not have other objective outcomes by the residents for comparison. However, we have previously shown that in-training exam scores correlate significantly with ABIM exam scores.26 Third, the inter-rater reliability among the groups for rating feedback was low. The groups were consistent in developing the characteristics comprising higher-quality feedback, but differed in their decisions whether specific statements met those criteria. Fortunately, our coders, trained to the same standard for determining when statements met criteria for higher-quality feedback (specific, balanced, actionable, etc.), had very good inter-rater reliability. Strengths of this study include the large number of evaluations that were analyzed, the use of discussion groups and standardized criteria for assessing quality, and the fact that the evaluations were completed before the study was planned, so that there is no Hawthorne effect of faculty filling out evaluations differently because they knew that they would be studied. A final limitation was that this study was based on the prior version of the ABIM/ACGME evaluation tool. We had previously shown that both of the two immediately preceding versions of the medicine resident evaluation forms had poor validity and reliability.35 Whether assessments based on the new ACGME Internal Medicine Milestones36 will truly improve the evaluation process remains to be seen.
Most clinical teaching is performed by clinicians who have no formal training in medical education, and this is likely why there has been a lag in the translation of the considerable theoretical and practical knowledge regarding feedback to medical education settings.3 Fortunately, studies have found that faculty development can modestly improve the quality of written and oral feedback.8,32,37 Several specific recommendations emerge from this study that can help guide faculty development in providing feedback. First, faculty should understand the value of providing written comments that are multiple in number and scope. Second, comments should be specific, focusing on elements of the resident’s performance in the assessed competencies, and not just generalized comments on the resident overall. Third, comments should address behaviors in the resident's performance, and not personality or personal characteristics. The use of specific incidents as examples may help in this regard. Fourth, feedback should be balanced, providing both positive comments to reinforce good behaviors and constructive comments with action items and goals to address deficiencies. Formal mechanisms for providing feedback such as field notes have been shown to improve feedback quality.38 Interventions to improve feedback optimally need to occur at the individual, collective, and institutional cultural levels.39 Further research should evaluate the effectiveness of specific interventions to improve the quality of feedback to residents, with the ultimate outcome of improved resident performance.
The authors have no conflicts of interest related to this article.
All opinions expressed in this manuscript represent those of the authors and should not be construed to reflect, in any way, those of the Department of Veterans Affairs or the U.S. government.