Based on our prior observations,3 we anticipated that evidence of Content and Internal Structure validity would be strongly represented in the literature on clinical teaching assessments. After developing and implementing an objective method for scoring the validity evidence, we confirmed that the highest scoring categories are Content and Internal Structure, and the lowest scoring categories are Relation to Other Variables, Consequences, and Response Process. These findings suggest that, contrary to requests for validity evidence from a variety of sources,1,3,26,27 most authors report only a limited subset of validity evidence. This raises questions about the validity of interpretations that can be derived from existing teaching assessments. To illustrate how future research could improve upon the past, we will review the sources of validity evidence found among assessments of clinical teaching and provide examples of articles that utilize validity evidence effectively.
Internal Structure evidence refers to the degree to which individual items fit the underlying construct of interest, and is most often reported as measures of factor analysis or internal consistency reliability.26 Some experts include all reliability in this category.1,27
We found that evidence of Internal Structure is commonly demonstrated among studies on clinical teaching assessments, perhaps because such evidence can be sought without prior planning. In other words, performing statistical analyses (e.g., reliability, factor analysis) on preexisting data can generate evidence of Internal Structure. The majority of studies in our review reported at least 1 type of reliability. We are concerned, however, that the types of reliability reported may not be the most important. Interrater reliability is the favored type of reliability when assessing clinical performance31; yet, our review of the psychometric characteristics of clinical teaching assessments revealed that fewer than half of published studies report this reliability measure.3
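To illustrate how internal consistency evidence can be generated from preexisting rating data, the following minimal sketch computes Cronbach's alpha for a raters-by-items score matrix. The data are invented for illustration and do not come from any study in our review:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a raters-by-items score matrix (list of rows)."""
    k = len(scores[0])                 # number of items on the form
    items = list(zip(*scores))         # columns: one tuple of scores per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical ratings: rows = completed forms, columns = assessment items
ratings = [
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
    [4, 4, 3, 4],
]
print(round(cronbach_alpha(ratings), 3))  # → 0.976
```

Note that alpha computed this way speaks only to internal consistency; it says nothing about interrater reliability, which requires multiple raters scoring the same teachers.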
As evidence for Internal Structure is common, we were able to recognize patterns that facilitated our rating of this category. Notably, all articles utilized factor analysis and/or some measure of reliability. A noteworthy example of Internal Structure evidence comes from a study by Irby and Rakestraw.12
In this study, factor analysis with orthogonal rotation revealed 4 distinct factors and satisfactory interitem correlations. Additionally, estimates of interrater reliability for 20 ratings, calculated by the Spearman-Brown prophecy formula, were excellent.
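The interrater reliability projection used by Irby and Rakestraw can be expressed with the Spearman-Brown prophecy formula, which estimates the reliability of the mean of k ratings, r_k, from the single-rating reliability r:

```latex
r_k = \frac{k\,r}{1 + (k - 1)\,r}
```

For illustration only (this single-rating value is assumed, not taken from the study), a modest single-rater reliability of r = 0.30 projects to 20 x 0.30 / (1 + 19 x 0.30) = 6.0/6.7, or approximately 0.90, across 20 ratings, showing how averaging many ratings can yield excellent composite reliability.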
Content evidence refers to themes, wording, and format of items, and includes expert review and other systematic item development strategies. In this study, evidence of Content among clinical teaching assessments ranked second only to Internal Structure. When scoring Content evidence, we utilized 2 criteria: (1) items should represent the construct(s) they intend to measure, that is, the items should have a convincing theoretical basis; and (2) instrument development strategies, including expert review, should be clearly described. Perhaps the best example of Content evidence is a study by Guyatt et al.10
The authors initially defined assessment criteria (domains) using input from faculty, residents, and a careful literature review. Items were then created to represent the defined domains. Finally, the authors articulated a method by which physicians reviewed and modified the ultimate set of items.
Relation to Other Variables evidence refers to relationships (convergent or discriminant) between scores and other variables relevant to the construct being measured. Relation to Other Variables is a powerful yet underutilized source of evidence among assessments of clinical teaching. We suspect that it is underutilized because, in most cases, studies must be specifically designed to evaluate predicted associations or hypotheses. As discussed below, studies of clinical teaching assessments are often not hypothesis-driven or designed to confirm anticipated associations. Examples of meaningful relationships between faculty assessments and other variables could include correlations (either positive or negative, as predicted by theory) between scores and outcomes such as teaching awards or academic promotion, and predicted correlations between scores from 2 different instruments designed to assess the same teaching behaviors.22
Unfortunately, we identified only 1 study with data strongly supporting Relation to Other Variables. James and Osborne13 showed that student-on-teacher assessment scores (a proxy for instructional quality) were significantly related to certain theoretically predicted outcomes, such as clerkship grades and choice of medical specialty, but were not related to other outcomes, such as National Board of Medical Examiners scores.
The categories of Consequences and Response Process were the least represented sources of validity evidence among published clinical teaching assessments, and the most difficult to rate, as demonstrated by the low κ scores. Just as the prevalence of Internal Structure evidence facilitated our ability to assign accurate ratings, we suspect that the dearth of examples of Consequences and Response Process evidence challenged our ability to recognize patterns, and thus to formalize criteria for rating these categories.
Assessments are intended to have some desired effect (consequence), but may also have unintended effects. Therefore, analyzing Consequences of assessments may support the validity of score interpretations, or reveal unrecognized threats to validity. Yet, simply demonstrating consequences, even significant and impressive consequences, does not constitute evidence for validity unless investigators explicitly demonstrate that these consequences have an impact on score interpretations (validity).26
We identified only 2 studies whose data weakly supported evidence of Consequences validity. In a study by Cohen et al.,6 teachers were given assessment scores as formative feedback. All but one of the teachers with low scores improved, whereas the scores of moderate to outstanding teachers generally remained the same. Although these data do not show causality, they imply that the assessment process may have influenced teaching behaviors, which in turn constitutes evidence of Consequences. In another study, anecdotal reports indicated that an assessment raised awareness of effective clinical teaching behaviors.7 Again, this observation implies that the consequences of faculty assessment may affect the validity of assessment interpretations.
Examining the reasoning and thought processes of learners, or systems that reduce the likelihood of response error, can provide evidence of Response Process. We identified 2 articles with Response Process data that weakly supported validity. McLeod et al.17 showed that students and resident physicians, utilizing the same forms, provide significantly different scores, although it was not entirely clear whether the students and residents were assessing the same teachers. Risucci et al.19 used interitem correlations to estimate halo error, and found that more advanced trainees have lower interitem correlations (less halo error). Findings from both studies imply that the learner level of those completing assessments could affect the validity of score interpretations. Unfortunately, neither study discussed this implication, or the reasons why different levels of learners provide different scores.
We recognize limitations to our study method. As noted above, developing and applying rating methods for Consequences and Response Process was challenging, likely because of the scarcity of evidence in the literature. Another source of variance when assigning ratings was the multitude of criteria contributing to each evidence category. For example, factor analysis, differential item functioning, and various types of reliability all provide evidence for Internal Structure validity.26
Nevertheless, operational definitions became clearer as our review progressed, and by the conclusion of our rating process, we had arrived at a system that accommodated these sources of variance. An additional limitation of our scoring system was that each article was given equal weighting, despite significant heterogeneity of study methods. For example, not all articles disclosed important details regarding the assessment setting or utilized methods designed to discover anticipated outcomes. Therefore, there remains a need to evaluate the quality of the methods used in studies on the assessment of clinical teaching, separate from the evidence supporting the validity of these assessments.
Another potential limitation of our study is that ratings were assigned to categories of validity evidence on the assumption that each category has equal importance in the assessment of clinical teaching. One might argue that the lesser-used categories of validity evidence (e.g., Relation to Other Variables, Consequences, and Response Process) are underrepresented because they are less important in clinical teaching assessment. However, we are inclined to believe that these categories are underrepresented because seeking such evidence requires purpose-designed studies, and because these categories tend to be misunderstood. Additional challenges include identifying meaningful correlates and outcomes (for evidence of Relation to Other Variables and Consequences, respectively), and utilizing samples sufficiently large to demonstrate such correlates and outcomes. While evidence for Content and Internal Structure may have primary importance during the initial development of an instrument, we urge authors to seek additional evidence of Relation to Other Variables, Consequences, and Response Process in subsequent studies. Finally, we acknowledge that accurately rating validity evidence from the categories of Content, Consequences, and Response Process can be especially challenging because data from these categories are often qualitative.
Our findings have important implications for research in assessing clinical teaching in particular, and for research involving psychometric instruments in general. First, future studies should seek a broader variety of validity evidence, with greater attention to the categories of Relation to Other Variables, Consequences, and Response Process. All the same, when reviewing the scores assigned to individual articles in the current study, readers are cautioned against comparing articles solely based on the sum of their evidence scores (indeed, we did not report sum scores for individual articles for this reason). Such scores are only crude estimates of validity. The study by James and Osborne,13 for example, did not receive the highest overall score; yet, it was the only study to receive a perfect score in the category of Relation to Other Variables.
A second implication of our findings for future research is that authors of instruments using multiple observers should more frequently report interrater reliability.3
A third implication is that future studies should clearly state hypothesized outcomes and the theoretical bases for these outcomes. Many investigators seemed to analyze existing data, and then attempt to explain the results a posteriori. Of course, not all studies of clinical teaching assessments need to be prospective. Medical educators have been encouraged to use traditional epidemiologic approaches, including retrospective cohort and case-control studies.32,33
Nonetheless, we stress that even retrospective studies should be theory-based and hypothesis-driven. Lastly, our study provides a reliable method for rating validity evidence from the categories of Content, Internal Structure, and Relation to Other Variables. Rating the categories of Consequences and Response Process, while not reliable in this study, may improve as authors become more comfortable with these categories and as such evidence appears more frequently in the literature. Our method is best utilized (as we have done in this article) to identify weaknesses in a body of literature. We encourage educators to expand upon our method for rating validity evidence, and to apply this or similar methods when analyzing the validity of their own assessments. We anticipate that systematic methods for rating validity evidence will benefit researchers and critical appraisers of the literature in all fields concerned with validity.