Although a variety of validity evidence should be utilized when evaluating assessment tools, a review of teaching assessments suggested that authors pursue a limited range of validity evidence.
To develop a method for rating validity evidence and to quantify the evidence supporting scores from existing clinical teaching assessment instruments.
A comprehensive search yielded 22 articles on clinical teaching assessments. Using standards outlined by the American Psychological and Education Research Associations, we developed a method for rating the 5 categories of validity evidence reported in each article. We then quantified the validity evidence by summing the ratings for each category. We also calculated weighted κ coefficients to determine interrater reliabilities for each category of validity evidence.
Content and Internal Structure evidence received the highest ratings (27 and 32, respectively, of 44 possible). Relation to Other Variables, Consequences, and Response Process received the lowest ratings (9, 2, and 2, respectively). Interrater reliability was good for Content, Internal Structure, and Relation to Other Variables (κ range 0.52 to 0.96, all P values <.01), but poor for Consequences and Response Process.
Content and Internal Structure evidence is well represented among published assessments of clinical teaching. Evidence for Relation to Other Variables, Consequences, and Response Process receives little attention, and future research should emphasize these categories. The low interrater reliability for Response Process and Consequences likely reflects the scarcity of reported evidence. With further development, our method for rating the validity evidence should prove useful in various settings.
Experts stress the need for reliable and valid teaching assessments.1,2 Despite this, medical educators have not used consistent validity criteria when developing and evaluating instruments to assess clinical teaching. For example, we recently reviewed the literature on the psychometric characteristics of instruments for assessing clinical teachers.3 In our analysis of these studies,4–25 we found that authors usually pursue a limited variety of validity evidence, and authors' interpretations of validity substantially differ in terms of the relative importance placed on different categories of validity evidence.
These findings led us to reflect more deeply on the definition and measurement of validity in assessing clinical teaching. According to modern theory, validity is a hypothesis, and all sources of validity evidence contribute to accepting or rejecting this hypothesis.1 For this reason, scores from teaching assessment instruments should be supported by a variety of validity evidence. The American Psychological and Education Research Associations published standards that identify 5 sources of validity evidence: (1) Content, (2) Response Process, (3) Internal Structure, (4) Relation to Other Variables, and (5) Consequences26 (see Table 1). Notably, this 5-category validity framework, articulated by Messick27 over 10 years ago, has been increasingly regarded by education and psychology researchers as the most comprehensive conceptualization of validity. Moreover, experts emphasize the importance of incorporating these sources of evidence into clinical teaching assessments.1,3 Experts also assert that validity is a property of scores and score interpretations, and not a property of the instrument itself.1,26,27
In the present study, we critically evaluated the published literature for evidence supporting the validity of clinical teaching assessments. Our aims were: (1) develop a reliable and systematic method by which medical educators can rate the validity of scores from teaching assessment instruments, (2) evaluate the quantity and quality of validity evidence for scores from published teaching assessment instruments, and (3) identify areas warranting further research.
Our method for identifying studies on the assessment of clinical teaching is described in detail elsewhere.3 Electronic databases, including MEDLINE, EMBASE, PsycINFO, ERIC, and the Social Science Citation/Science Citation indices, were searched for English-language articles published between 1966 and July 2004 using the terms validity, medical faculty, medical education, evaluation studies, instrument, and the text word reliability. This search yielded over 330 articles. Review articles, editorials, qualitative studies, and case discussions were excluded. Additional articles were found by reviewing the bibliographies of retrieved articles and by consulting colleagues with expertise in medical education. After reviewing all titles and abstracts, we identified 22 relevant studies describing instruments designed for assessing clinical faculty by learners.
We agreed upon operational definitions of the 5 sources of validity evidence based on Standards26 published by the American Psychological and Education Research Associations and interpretations by another author.1 We then developed the following rating scale: N=no discussion of this source of validity evidence and no data presented; 0=discussion of this source of validity evidence, but no data presented, or data failed to support the validity of instrument scores; 1=data for this source weakly support the validity of score interpretations; and 2=data for this source strongly support the validity of score interpretations. We adopted this scale after considering multiple alternatives because it awarded numerical points (score=1 or 2) only for articles that provide data supporting the validity of score interpretations, distinguished articles that discussed a category of validity evidence (score=0) from those failing to address a category of validity evidence (score=N), and avoided unnecessary complexity. See Table 2 for specific rating criteria for each category of validity evidence. Authors T.J.B. and D.A.C. independently analyzed the 22 studies using this rating scale. After calculating interrater reliability, authors T.J.B. and D.A.C. discussed their individually assigned ratings for each article until they reached a consensus on the final ratings. These ratings were then summed across all studies to yield a total score for each evidence category.
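The rating-and-summing step can be sketched as follows. The two articles and their ratings below are hypothetical, not data from the review; note that a category's maximum sum score equals 2 points times the number of articles rated (44 for the 22 articles reviewed).

```python
# Sketch of the rating-and-summing procedure described above (hypothetical data).
# "N" and "0" earn no points; only ratings of 1 or 2 contribute to a sum score.

CATEGORIES = ["Content", "Response Process", "Internal Structure",
              "Relation to Other Variables", "Consequences"]

POINTS = {"N": 0, "0": 0, "1": 1, "2": 2}  # numerical points per rating

def sum_scores(article_ratings):
    """article_ratings: one dict per article, mapping category -> rating."""
    totals = {c: 0 for c in CATEGORIES}
    for ratings in article_ratings:
        for category, rating in ratings.items():
            totals[category] += POINTS[rating]
    return totals

# Two hypothetical articles (maximum possible per category here: 2 * 2 = 4):
articles = [
    {"Content": "2", "Response Process": "N", "Internal Structure": "2",
     "Relation to Other Variables": "0", "Consequences": "N"},
    {"Content": "1", "Response Process": "0", "Internal Structure": "2",
     "Relation to Other Variables": "1", "Consequences": "N"},
]
```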
For the purpose of determining interrater reliability, the categories of N, 0, 1, and 2 were converted to the numbers 1, 2, 3, and 4, respectively. Weighted κ coefficients and P values were calculated for each category of validity evidence using the weighting scheme proposed by Fleiss and Cohen.28,29 Kappa values were interpreted according to Landis and Koch's30 guidelines, where κ values under 0.4 represent poor agreement, values from 0.4 to 0.75 represent fair to good agreement, and values of 0.75 and over represent excellent agreement. For all analyses, P values ≤.05 were the criterion for concluding that there was significant agreement between observers.
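As a rough illustration of this analysis, weighted κ with quadratic (Fleiss–Cohen) disagreement weights can be computed from two raters' ordinal ratings as below. This is a minimal sketch, not the authors' actual computation; the N/0/1/2-to-1/2/3/4 conversion is implicit in the ordering of the category list.

```python
# Sketch: Cohen's weighted kappa with quadratic (Fleiss-Cohen) weights.

def weighted_kappa(rater1, rater2, categories):
    """Weighted kappa using quadratic disagreement weights (i - j)^2."""
    idx = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater1)
    obs = [[0.0] * k for _ in range(k)]                 # observed joint counts
    for a, b in zip(rater1, rater2):
        obs[idx[a]][idx[b]] += 1
    row = [sum(obs[i]) for i in range(k)]               # rater 1 marginals
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            d = (i - j) ** 2                            # quadratic disagreement
            num += d * obs[i][j]                        # observed disagreement
            den += d * row[i] * col[j] / n              # chance-expected disagreement
    return 1.0 - num / den if den else 1.0
```

Perfect agreement yields κ = 1; near-misses on this ordinal scale are penalized less than distant disagreements, which is the rationale for weighting.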
The highest possible score for each category of validity evidence, when summing scores for the 22 articles, is 44. The highest sum scores were given for the categories of Content and Internal Structure (scores 27 and 32, respectively) (Table 3). With the exception of 1 article in the category of Relation to Other Variables, Content and Internal Structure were the only categories with articles having evidence that strongly supported validity (score of 2) (Table 4).
The lowest sum scores were given for the evidence categories of Relation to Other Variables, Consequences, and Response Process (scores 9, 2, and 2, respectively) (Table 3). Within the categories of Response Process and Consequences, the vast majority of articles provided no or insufficient data to support validity. Table 3 summarizes the scores for each category of validity evidence and Table 4 summarizes validity evidence scores for individual studies of clinical teaching.
Regarding interrater reliability, weighted κ scores were good to excellent for the categories of Content, Internal Structure, and Relation to Other Variables (κ range 0.52 to 0.96, all P values ≤.01), but weighted κ scores were poor for Consequences and Response Process. Table 3 summarizes weighted κ scores and corresponding P values.
Based on our prior observations,3 we anticipated that evidence of Content and Internal Structure validity would be strongly represented in the literature on clinical teaching assessments. After developing and implementing an objective method for scoring the validity evidence, we confirmed that the highest scoring categories are Content and Internal Structure, and the lowest scoring categories are Relation to Other Variables, Consequences, and Response Process. These findings suggest that, contrary to requests for validity evidence from a variety of sources,1,3,26,27 most authors report only a limited subset of validity evidence. This raises questions about the validity of interpretations that can be derived from existing teaching assessments. To illustrate how future research could improve upon the past, we will review the sources of validity evidence found among assessments of clinical teaching and provide examples of articles that utilize validity evidence effectively.
Internal Structure evidence refers to the degree to which individual items fit the underlying construct of interest, and is most often reported as measures of factor analysis or internal consistency reliability.26 Some experts include all reliability in this category.1,27 We found that evidence of Internal Structure is commonly demonstrated among studies on clinical teaching assessments, perhaps because such evidence can be sought without prior planning. In other words, performing statistical analyses (e.g., reliability, factor analysis) on preexisting data can generate evidence of Internal Structure. The majority of studies in our review reported at least 1 type of reliability. We are concerned, however, that the types of reliability reported may not be the most important. Interrater reliability is the favored type of reliability when assessing clinical performance 31; yet, our review on the psychometric characteristics of clinical teaching assessments revealed that less than half of published studies report this reliability measure.3
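Internal-consistency reliability of the kind this paragraph describes is commonly summarized with Cronbach's α, which, as noted, can be computed from preexisting item-level data. The sketch below is illustrative only; the scores are made up and do not come from any reviewed study.

```python
# Sketch: Cronbach's alpha from item-level scores (hypothetical data).
# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)

def cronbach_alpha(items):
    """items: one list of respondent scores per instrument item."""
    k = len(items)                               # number of items
    n = len(items[0])                            # number of respondents

    def variance(xs):                            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(variance(item) for item in items)
    totals = [sum(item[r] for item in items) for r in range(n)]  # per-respondent totals
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))
```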
As evidence for Internal Structure is common, we were able to recognize patterns that facilitated our rating of this category. Notably, all articles utilized factor analysis and/or some measure of reliability. A noteworthy example of Internal Structure evidence comes from a study by Irby and Rakestraw.12 In this study, factor analysis with orthogonal rotation revealed 4 distinct factors and satisfactory interitem correlations. Additionally, estimates of interrater reliability for 20 ratings, calculated by the Spearman-Brown prophecy formula, were excellent.
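The Spearman-Brown prophecy formula used in the Irby and Rakestraw study projects the reliability of an average of k ratings from the reliability of a single rating. A minimal sketch follows; the single-rating reliability of 0.3 in the test case is hypothetical, not a value from that study.

```python
# Sketch: Spearman-Brown prophecy formula.
# reliability_k = k * r / (1 + (k - 1) * r), where r is the reliability of a
# single rating and k is the number of ratings averaged together.

def spearman_brown(r_single, k):
    return k * r_single / (1 + (k - 1) * r_single)
```

The formula shows why averaging many raters' assessments (e.g., 20 learner ratings per teacher) can yield excellent composite reliability even when any single rating is only modestly reliable.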
Content evidence refers to themes, wording, and format of items, and includes expert review and other systematic item development strategies. In this study, evidence of Content among clinical teaching assessments ranked second only to Internal Structure. When scoring Content evidence, we utilized 2 criteria: (1) items should represent the construct(s) they intend to measure, that is, the items should have a convincing theoretical basis and (2) instrument development strategies, including expert review, should be clearly described. Perhaps the best example of content evidence is a study by Guyatt et al.10 The authors initially defined assessment criteria (domains) using input from faculty, residents, and a careful literature review. Items were then created to represent the defined domains. Finally, the authors articulated a method by which physicians reviewed and modified the ultimate set of items.
Relation to Other Variables evidence refers to relationships (convergent or discriminating) between scores and other variables relevant to the construct being measured. Relation to Other Variables is a powerful yet underutilized source of evidence among assessments of clinical teaching. We suspect that Relation to Other Variables is underutilized because, in most cases, studies must be specifically designed to evaluate predicted associations or hypotheses. As discussed below, it seems that studies of clinical teaching assessments are often not hypothesis-driven or designed to confirm anticipated associations. Examples of meaningful relationships between faculty assessments and other variables could include correlations (either positive or negative as predicted by theory) between scores and outcomes such as teaching awards or academic promotion, and demonstrating predicted correlations between scores from 2 different instruments designed to assess the same teaching behaviors.22 Unfortunately, we identified only 1 study with data strongly supporting Relation to Other Variables. James and Osborne 13 showed that student-on-teacher assessment scores (a proxy for instructional quality) were significantly related to certain theoretically predicted outcomes, such as clerkship grades and choice of medical specialty, but were not related to other outcomes, such as National Board of Medical Examiners scores.
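Convergent evidence of the kind described here, such as a theoretically predicted association between scores from 2 instruments assessing the same teaching behaviors, is typically quantified with a correlation coefficient. The sketch below uses hypothetical scores, not data from any reviewed study.

```python
# Sketch: Pearson correlation between two sets of scores, e.g., two
# instruments rating the same teachers (hypothetical data).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))   # co-deviation
    sx = sum((a - mx) ** 2 for a in x) ** 0.5              # spread of x
    sy = sum((b - my) ** 2 for b in y) ** 0.5              # spread of y
    return cov / (sx * sy)
```

A strong correlation in the theoretically predicted direction (positive or negative) would count toward Relation to Other Variables evidence; the absence of a predicted correlation would count against it.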
The categories of Consequences and Response Process were the least represented sources of validity evidence among published clinical teaching assessments, and the most difficult to rate, as demonstrated by the low κ scores. Just as the prevalence of Internal Structure evidence facilitated our ability to assign accurate ratings, we suspect that the dearth of examples of Consequences and Response Process evidence challenged our ability to recognize patterns, and thus to formalize criteria for rating these categories.
Assessments are intended to have some desired effect (consequence), but may also have unintended effects. Therefore, analyzing Consequences of assessments may support the validity of score interpretations, or reveal unrecognized threats to validity. Yet, simply demonstrating consequences, even significant and impressive consequences, does not constitute evidence for validity unless investigators explicitly demonstrate that these consequences have an impact on score interpretations (validity).26 We identified only 2 studies whose data weakly supported evidence of Consequences validity. In a study by Cohen et al.,6 teachers were given assessment scores as formative feedback. All but one of the teachers with low scores improved, as compared with moderate to outstanding teachers, whose scores generally remained the same. Although these data do not show causality, they imply that the assessment process may have influenced teaching behaviors, which in turn constitutes evidence of Consequences. In another study, anecdotal reports indicated that an assessment raised awareness of effective clinical teaching behaviors.7 Again, this observation implies that the consequences of faculty assessment may affect the validity of assessment interpretations.
Examining the reasoning and thought processes of learners, or systems that reduce the likelihood of response error, can provide evidence of Response Process. We identified 2 articles with Response Process data that weakly supported validity. McLeod et al.17 showed that students and resident physicians, utilizing the same forms, provide significantly different scores, although it was not entirely clear whether students and residents were assessing the same teachers. Risucci et al.19 used interitem correlations to estimate halo error, and found that more advanced trainees have lower interitem correlations (lower halo error). Findings from both these studies imply that assessments completed by different levels of learners could affect the validity of score interpretations. Unfortunately, neither group of authors discussed this implication or the reasons why different levels of learners provide different scores.
We recognize limitations to our study method. As noted above, developing and applying rating methods for Consequences and Response Process was challenging, likely because of the scarcity of evidence in the literature. Another source of variance when assigning ratings was the multitude of criteria contributing to each evidence category. For example, factor analysis, differential item functioning, and various types of reliability all provide evidence for Internal Structure validity.26 Nevertheless, operational definitions became clearer as our review progressed, and by the conclusion of our rating process, we had arrived at a system that accommodated these sources of variance. An additional limitation of our scoring system was that each article was given equal weighting, despite significant heterogeneity of study methods. For example, not all articles disclosed important details regarding the assessment setting or utilized methods designed to discover anticipated outcomes. Therefore, there remains a need to evaluate the quality of the methods used in studies on the assessment of clinical teaching, separate from the evidence supporting the validity of these assessments.
Another potential limitation of our study is that ratings were assigned to categories of validity evidence on the assumption that each category has equal importance in the assessment of clinical teaching. One might argue that the lesser-used categories of validity evidence (e.g., Relation to Other Variables, Consequences, and Response Process) are underrepresented because they are less important in clinical teaching assessment. However, we are inclined to believe that these categories are underrepresented because seeking evidence from these categories requires studies designed to demonstrate these sources of evidence, and also because these categories tend to be misunderstood. Additional challenges include identifying meaningful correlates and outcomes (for evidence of Relation to Other Variables and Consequences, respectively), and utilizing samples sufficiently large to demonstrate such correlates and outcomes. While evidence for Content and Internal Structure may have primary importance during the initial development of an instrument, we urge authors to seek additional evidence of Relation to Other Variables, Consequences, and Response Process in subsequent studies. Finally, we acknowledge that accurately rating validity evidence from categories of Content, Consequences, and Response Process can be especially challenging because data from these categories are often qualitative.
Our findings have important implications for research in assessing clinical teaching in particular, and for research involving psychometric instruments in general. First, future studies should seek a broader variety of validity evidence with greater attention to the categories of Relation to Other Variables, Consequences, and Response Process. All the same, when reviewing the scores assigned to individual articles in the current study, readers are cautioned against comparing articles solely based on the sum of their evidence scores (indeed, we did not report sum scores for individual articles for this reason). Such scores are only crude estimates of validity. The study by James and Osborne,13 for example, did not receive the highest overall score; yet, this was the only study to receive a perfect score in the category of Relation to Other Variables.
A second implication of our findings for future research is that authors of instruments using multiple observers should more frequently report interrater reliability.3 A third implication is that future studies should clearly state hypothesized outcomes and the theoretical bases for these outcomes. Many investigators seemed to analyze existing data, and then attempt to explain the results a posteriori. Of course, not all studies of clinical teaching assessments need to be prospective. Medical educators have been encouraged to use traditional epidemiologic approaches, including retrospective cohort and case-control studies.32,33 Nonetheless, we stress that even retrospective studies should be theory-based and hypothesis-driven. Lastly, our study provides a reliable method for rating validity evidence from the categories of Content, Internal Structure, and Relation to Other Variables. Rating the categories of Consequences and Response Process, while not reliable in this study, may improve as authors become more comfortable with these categories and as evidence is presented more frequently in the literature. Our method would be best utilized (as we have performed in this article) to identify weaknesses in a body of literature. We encourage educators to expand upon our method for rating the validity evidence, and to apply this or similar methods when analyzing the validity of their assessments. We anticipate that systematic methods for rating the validity evidence will benefit researchers and critical appraisers of the literature in all fields where validity evidence is used.