The assumption built into the first analytic strategy, that subjects make internally consistent but personally idiosyncratic interpretations of confidence, generates what may be termed upper-bound estimates of the alignment between confidence and correctness. Under the assumptions embedded in this analysis, the results of this study indicate that the correctness of clinicians' diagnoses and their perceptions of that correctness are, at most, moderately aligned. The correctness and confidence of faculty physicians and senior medical residents were aligned about two thirds of the time; in cases where correctness and confidence were not aligned, these subjects were more likely to be underconfident than overconfident. Faculty subjects demonstrated tendencies toward greater alignment and less frequent overconfidence than residents, but these differences were not statistically significant. Students' results differed substantially from those of their more experienced colleagues: their confidence and correctness were aligned about four fifths of the time and, when nonaligned, were even more strongly skewed toward underconfidence. Within groups and for all subjects, the alignment between "being correct" and "being confident" would be qualitatively characterized as "fair," as indicated by κ coefficients in the range .2 to .4.27
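To illustrate how a κ in the "fair" range arises from a 2×2 correctness-by-confidence table, the following sketch computes Cohen's κ from hypothetical counts. The counts are illustrative only, not the study's data; they are chosen so that observed agreement is roughly two thirds and κ falls between .2 and .4.

```python
def cohens_kappa(both, confident_only, correct_only, neither):
    """Cohen's kappa for a 2x2 agreement table (correct vs. confident)."""
    n = both + confident_only + correct_only + neither
    p_observed = (both + neither) / n            # cases where the two judgments agree
    # Marginal proportions of "correct" and "confident".
    p_correct = (both + correct_only) / n
    p_confident = (both + confident_only) / n
    # Agreement expected by chance from the marginals alone.
    p_chance = p_correct * p_confident + (1 - p_correct) * (1 - p_confident)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts: observed agreement = 68%, kappa = 0.36 ("fair").
kappa = cohens_kappa(both=300, confident_only=120, correct_only=200, neither=380)
```

Note that κ discounts the agreement attributable to the marginals alone, which is why observed agreement of two thirds can correspond to only "fair" chance-corrected alignment.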
The more conservative second mode of analysis yielded weaker relationships between correctness and confidence, as seen in the τb coefficient for all subjects, which is smaller by a factor of three. For the residents, the relationship between correctness and confidence, computed without thresholding, does not exceed chance expectations. Comparison across experience levels reveals the same trend seen in the primary analysis, with students displaying the highest level of alignment.
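For reference, Kendall's τb relates binary correctness to raw, unthresholded confidence ratings, correcting the denominator for ties in each variable. The sketch below uses hypothetical data (1 = correct diagnosis, confidence on a 1-to-4 scale) purely to show the calculation:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    n0 = n * (n - 1) // 2  # total number of pairs

    def tie_pairs(values):
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        return sum(c * (c - 1) // 2 for c in counts.values())

    # Subtract tied pairs in each variable from the denominator.
    return (concordant - discordant) / sqrt((n0 - tie_pairs(x)) * (n0 - tie_pairs(y)))

# Hypothetical data, not the study's: binary correctness vs. 1-4 confidence.
correct = [1, 1, 0, 0, 1, 0]
confidence = [4, 2, 2, 1, 3, 3]
tau_b = kendall_tau_b(correct, confidence)
```

Because τb is computed on the raw ratings, it makes no allowance for each subject's idiosyncratic use of the scale, which is why it yields the more conservative estimate.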
The greater apparent alignment for the students, under both analytic approaches, may be explained by the difficulty of the cases. The students were probably overmatched by many of these cases, perhaps guessing at diagnoses, and were almost certainly aware that they were overmatched. This is seen in their low proportions of correct diagnoses and low levels of expressed confidence. These skewed distributions would generate alignment between correctness and confidence of 67% by chance alone. While the students' alignment exceeded this chance expectation, a better estimate of their concordance between confidence and correctness might be obtained by challenging them with less difficult cases, making the diagnostic task as difficult for them as the cases employed in this study were for the faculty and residents. We do not believe it is valid to conclude from these results that the students are "more aware" than experienced clinicians of when they are right and wrong.
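The 67% chance-agreement figure follows directly from the marginal proportions of the two binary judgments: when both "correct" and "confident" are rare, agreement is dominated by the large "incorrect and not confident" cell. A sketch with illustrative marginals (not the study's exact figures) makes this concrete:

```python
def chance_agreement(p_correct, p_confident):
    """Agreement between two binary judgments expected from marginals alone."""
    # Chance of agreeing on "yes" plus chance of agreeing on "no".
    return p_correct * p_confident + (1 - p_correct) * (1 - p_confident)

# Illustrative skewed marginals: correct on ~22% of cases, confident on ~20%.
p = chance_agreement(0.22, 0.20)  # 0.668, i.e., roughly 67% by chance alone
```

With balanced marginals (both near 50%), chance agreement drops to 50%, which is why the students' skewed distributions inflate raw alignment relative to the other groups.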
By contrast, residents and faculty correctly diagnosed 44% and 50% of these difficult cases, respectively, and generated distributions of confidence ratings that were less skewed than those of the students. In cases for which these clinicians' correctness and confidence were not aligned, both faculty and residents showed an overall tendency toward underconfidence in their diagnoses. Despite the general tendency toward underconfidence, residents and faculty in this study were overconfident, placing credence in a diagnosis that was in fact incorrect, in 15% (98/938) and 12% (80/928) of cases, respectively. Because these two more experienced groups are directly responsible for patient care, and offered much more accurate diagnoses for these difficult cases, findings for these groups take on a different interpretation and perhaps greater potential significance.
In designing the study, we approached the measurement of “confidence” by grounding it in hypothetical clinical behavior. Rather than asking subjects directly to estimate their confidence levels in either probabilistic or qualitative terms, we asked them for the likelihood of their seeking help in reaching a diagnosis for each case. We considered this measure to be a proxy for “confidence.” Because our intent was to inform the design of decision support systems and medical error reduction efforts generally, we believe that this behavioral approach to assessment of confidence lends validity to our conclusions.
Limitations of this study include restriction of the task to diagnosis. Different results may be seen in clinical tasks other than diagnosis, such as determination of appropriate therapy for a problem already diagnosed. The cases, chosen to be very difficult and with definitive findings excluded, certainly generated lower rates of accurate diagnoses than are typically seen in routine clinical practice. Had the cases been more routine, the measured levels of alignment between confidence and correctness might have differed. In addition, this study was conducted in a laboratory setting, using written case synopses, to provide experimental precision and control. While the case synopses contained very large amounts of clinical information, the task environment for these subjects was not the task environment of routine patient care. Clinicians might have been more, or less, confident in their assessments had the cases been real patients for whom they were responsible; and in actual practice, physicians may be more likely to consult on difficult cases regardless of their confidence level. Finally, while we employed volunteer subjects, the samples of residents and faculty at each institution were large relative to their respective populations and thus unlikely to be skewed by sampling bias.
The relationships of "fair" magnitude between correctness and confidence were seen only after adjusting each subject's confidence ratings to reflect differing interpretations of the confidence scale. The secondary analytic approach, which does not correct individuals' judgments against their own optimal thresholds, yields smaller observed relationships between correctness and confidence. Under either set of assumptions, the relationship between confidence and correctness is such that designers of clinical decision support systems cannot assume that clinicians accurately assess when they do and do not require assistance from external knowledge resources.