This study explores the alignment between physicians' confidence in their diagnoses and the “correctness” of these diagnoses, as a function of clinical experience, and whether subjects were prone to over- or underconfidence.
Prospective, counterbalanced experimental design.
Laboratory study conducted under controlled conditions at three academic medical centers.
Seventy-two senior medical students, 72 senior medical residents, and 72 faculty internists.
We created highly detailed, 2- to 4-page synopses of 36 diagnostically challenging medical cases, each with a definitive correct diagnosis. Subjects generated a differential diagnosis for each of 9 assigned cases, and indicated their level of confidence in each diagnosis.
A differential was considered “correct” if the clinically true diagnosis was listed in that subject's hypothesis list. To assess confidence, subjects rated the likelihood that they would, at the time they generated the differential, seek assistance in reaching a diagnosis. Subjects' confidence and correctness were “mildly” aligned (κ=.314 for all subjects, .285 for faculty, .227 for residents, and .349 for students). Residents were overconfident in 41% of cases where their confidence and correctness were not aligned, whereas faculty were overconfident in 36% of such cases and students in 25%.
Even experienced clinicians may be unaware of the correctness of their diagnoses at the time they make them. Medical decision support systems, and other interventions designed to reduce medical errors, cannot rely exclusively on clinicians' perceptions of their needs for such support.
When making a diagnosis, clinicians combine what they personally know and remember with what they can access or look up. While many decisions will be made based on a clinician's own personal knowledge, others will be informed by knowledge that derives from a range of external sources including printed books and journals, communications with professional colleagues, and, increasingly, a range of computer-based knowledge resources.1 In general, the more routine or familiar the problem, the more likely it is that an experienced clinician can “solve it” and decide what to do based on personal knowledge only. This method of decision making uses a minimum of time, which is a scarce and precious resource in health care practice.
Every practitioner's personal knowledge is, however, incomplete in various ways, and decisions based on incorrect, partial, or outdated personal knowledge can result in errors. A recent landmark study2 has documented that medical errors are a significant cause of morbidity and mortality in the United States. Although these errors have a wide range of origins,3 many are caused by a lack of information or knowledge necessary to appropriately diagnose and treat.4 The exponential growth of biomedical knowledge and shortening half-life of any single item of knowledge both suggest that modern medicine will increasingly depend on external knowledge to support practice and reduce errors.5
Still, the advent of modern information technology has not changed the fundamental nature of human problem solving. Diagnostic and therapeutic decisions, for the foreseeable future, will be made by human clinicians, not machines. What has changed in recent years is the potential for computer-based decision support systems (DSSs) to provide relevant and patient-specific external knowledge at the point of care, assembling this knowledge in a way that complements and enhances what the clinician decision maker already knows.6,7 DSSs can function in many ways, ranging from the generation of alerts and reminders to the critiquing of management plans.8–14 Some DSSs “push” information and advice to clinicians whether they request it or not; others offer no advice until it is specifically requested.
The decision support process presupposes the clinician's openness to the knowledge or advice being offered. Clinicians who believe they are correct, or believe they know all they need to know to reach a decision, will be unmotivated to seek additional knowledge and unreceptive to any knowledge or suggestions a DSS presents to them. The literatures of psychology and medical decision making15–18 address the relationship between these subjective beliefs and objective reality. The well-established psychological bias of “anchoring”15 stipulates that all human decision makers are more loyal to their current ideas, and resistant to changing them, than they objectively should be in light of compelling external evidence.
This study addresses a question central to the potential utility and success of clinical decision support. If clinicians' openness to external advice hinges on their confidence in their assessments based on personal knowledge, how valid are these perceptions? Conceptually, there are 4 possible combinations of objective “correctness” of a diagnosis and subjective confidence in it: 2 in which confidence and correctness are aligned and 2 in which they are not. The ideal condition is an alignment of high confidence in a correct diagnosis. Confidence and correctness can also be aligned in the opposing sense: low confidence in a diagnosis that is incorrect. In this state, clinicians are likely to be open to advice and disposed to consult an external knowledge resource. In the “underconfident” state of nonalignment, a clinician with low confidence in a correct diagnosis will be motivated to seek information that will likely confirm an intent to act correctly. However, it is also possible that a consultation with an external resource can talk a clinician out of a correct assessment.19 The other nonaligned state, of greater concern for quality of care, is high confidence in an incorrect diagnosis. In this “overconfident” state, clinicians may not be open or motivated to seek information that could point to a correct assessment.
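This work addresses the following specific questions:
1. How closely are clinicians' confidence in their diagnoses and the correctness of those diagnoses related?
2. Does the alignment between confidence and correctness vary with level of training?
3. In cases where confidence and correctness are not aligned, are clinicians more likely to be overconfident or underconfident, and does this tendency vary with level of training?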
One study similar to this one in design and intent,20 but limited to medical students as subjects, found that students were frequently unconfident about correct diagnostic judgments when classifying abnormal heart rhythms. Our preliminary study of this question has found the relationship between correctness and confidence, across a range of training levels, to be modest at best.21
To address these questions, we employed a large dataset originally collected for a study of the impact of decision support systems on the accuracy of clinician diagnoses.19 We developed for this study detailed written synopses of 36 diagnostically challenging cases from patient records at the University of Illinois at Chicago, the University of Michigan, and the University of North Carolina. Each institution contributed 12 cases, each with a firmly established final diagnosis. The 2- to 4-page case synopses were created by three coauthors who are experienced academic internists (PSH, PSF, TMM). The synopses contained comprehensive historical, examination, and diagnostic test information. They did not, however, contain results of definitive tests that would have made the correct diagnosis obvious to most or all clinicians. The cases were divided into 4 approximately equivalent sets balanced by institution, pathophysiology, organ systems, and rated difficulty. Each set, with all patient- and institution-identifying information removed, therefore contained 9 cases, with 3 from each institution.
We then recruited to the study 216 volunteer subjects from these same institutions: 72 fourth-year medical students, 72 second- and third-year internal medicine residents, and 72 general internists with faculty appointments and at least 2 years of postresidency experience (mean, 11 years). Recruitment was balanced so that each institution contributed 24 subjects at each experience level. Each subject was randomly assigned to work the 9 cases comprising 1 of the 4 case sets. Each subject then worked through each of the assigned cases first without, and then with, assistance from an assigned computer-based decision support system. On the first pass through each case, subjects generated a diagnostic hypothesis set with up to 6 items. After generating their diagnostic hypotheses, subjects indicated their perceived confidence in their diagnosis in a manner described below. On the second pass through the case, subjects employed a decision support system to generate diagnostic advice, and again offered a differential diagnosis and confidence ratings. After deleting cases with missing data, the final dataset for this work consisted of 1,911 cases completed by 215 subjects.
Results reported elsewhere19 indicated that the computer-based decision support systems engendered modest but statistically significant improvements in the accuracy of diagnostic hypotheses (overall effect size of .32). The questions addressed by this study, emphasizing the concordance between confidence and correctness under conditions of uncertainty, focus on the first pass through each case where the subjects applied only their personal knowledge to the diagnostic task.
To assess the correctness of each clinician's diagnostic hypothesis set for each case, we employed a binary score (correct or incorrect). We scored a case as correct if the established diagnosis for that case, or a very closely related disease, appeared anywhere in the subject's hypothesis set. Final scoring decisions, to determine whether a closely related disease should be counted as correct, were made by a panel composed of coauthors PSF, PSH, and TMM. The measure of clinician confidence was the response to the specific question: “How likely is it that you would seek assistance in establishing a diagnosis for this case?” “Assistance” was not limited to that which might be provided by a computer. After generating their diagnostic hypotheses for each case, subjects responded to this question using an ordinal 1 to 4 response scale with anchor points of 1 representing “unlikely” (indicative of high confidence in their diagnosis) and 4 representing “likely” (indicative of low confidence). Because subjects did not receive feedback, they offered their confidence judgments for each case without any definitive knowledge of whether their diagnoses were, in fact, correct. Because they reflect only the subjects' first pass through each case, these confidence judgments were not confounded by any advice subjects might later have received from the decision support systems.
In this study, each data point pairs a subjective confidence assessment on a 4-level ordinal scale with a binary objective correctness score. The structure of this experiment and the resulting data suggested two approaches to analyzing the results. Given that each subject in this study worked 9 cases, and offered confidence ratings on a 1 to 4 scale for each case, interpretations of the meanings of these scale points might be highly consistent for each subject but highly variable across subjects. Our first analytic approach therefore sought to identify an optimal threshold for each subject to distinguish subjective states of “confident” and “unconfident.” This approach addresses the “pooling” problem, identified by Swets and Pickett,22 that would tend to underestimate the magnitude of the relationship between confidence and correctness. Our second analytical approach took the assumption that all subjects made the same subjective interpretation of the confidence scale. This second approach entails a direct analysis of the 2-level by 4-level data with no within-subject thresholding. Qualitatively, the first approach approximates the upper bound on the relationship between confidence and correctness, while the second approach approximates the lower bound.
To implement the first approach, we identified, for each subject, the threshold value along the 1 to 4 scale that maximized the proportion of cases where confidence and correctness were aligned. With reference to Table 1, we sought to find the threshold value that maximized the numbers of cases in the on-diagonal cells. For 58 subjects (27%), we found that maximum alignment was achieved by classifying only ratings of 1 as confident and all other ratings as unconfident; for 105 subjects (49%), maximum alignment was achieved by classifying ratings of 1 or 2 as confident; and for the remaining 52 subjects (24%), maximum alignment was achieved by classifying ratings of 1, 2, or 3 as confident. This finding validated our assumption that subjects varied in their interpretations of the scale points. We then created a dataset for further analysis that consisted, for each case worked by each subject, of a binary correctness score and a binary confidence score calculated using each subject's optimal threshold.
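The per-subject thresholding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `best_threshold`, the example ratings, and the tie-breaking rule (the lowest threshold wins) are all assumptions, since the text does not say how ties between thresholds were resolved.

```python
def best_threshold(ratings, correct):
    """Pick the cut point on the 1-4 scale that maximizes alignment.

    ratings: help-seeking ratings, 1 = "unlikely" (high confidence),
             4 = "likely" (low confidence).
    correct: booleans, True if the case was diagnosed correctly.
    A rating <= threshold is classified as "confident".
    Returns (threshold, number_of_aligned_cases).
    """
    best = None
    # Only thresholds 1, 2, and 3 are reported in the text.
    for threshold in (1, 2, 3):
        aligned = sum(
            (rating <= threshold) == is_correct
            for rating, is_correct in zip(ratings, correct)
        )
        # Assumption: on ties, keep the lowest threshold (strict ">").
        if best is None or aligned > best[1]:
            best = (threshold, aligned)
    return best


# Hypothetical subject with 9 cases (illustrative data, not from the study).
example_ratings = [1, 2, 2, 3, 4, 1, 3, 4, 2]
example_correct = [True, True, False, False, False, True, True, False, True]

print(best_threshold(example_ratings, example_correct))
```

For this hypothetical subject, thresholds 2 and 3 both align 7 of 9 cases, so the assumed tie-breaking rule selects 2. A different tie-breaking rule would shift the reported split of subjects across thresholds without changing the maximum alignment itself.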
To address the first research question with the first approach, we computed Kendall's τb and κ coefficients to characterize the relationship between subjects' correctness and confidence levels. We then modeled statistically the proportions of cases correctly diagnosed, as a function of confidence (computed as a binary variable as described above), subjects' level of training (faculty, resident, student), and the interaction of confidence and training level. To address the second question, we modeled the proportions of cases in which confidence and correctness were aligned, as a function of training level. To address the third research question, we focused only on those cases in which confidence and correctness were not aligned. We modeled the proportions of cases in which subjects were overconfident (high confidence in an incorrect diagnosis) as a function of training level.
All statistical models used the Generalized Linear Model (GzdLM) procedure23 assuming diagnostic correctness, alignment, and overconfidence to be distributed as Bernoulli variables with a logit link and used naive empirical covariance estimates24 for the model effects to account for the clustering of cases within subjects. Wald statistics were employed to test the observed results against the null condition. Ninety-five percent confidence intervals were calculated by transforming logit scale Wald intervals using naive empirical standard error estimates into percent scale intervals.25 The SPSS for Windows (SPSS Inc., Chicago, IL) and SAS26 Proc GENMOD (SAS Institute Inc., Cary, NC) software were employed for statistical modeling and data analyses.
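The back-transformation of a logit-scale Wald interval into a percent-scale interval can be illustrated as follows. This is a simplified sketch under a stated assumption: it uses the ordinary model-based standard error of a single proportion, which ignores the clustering of cases within subjects. The study's intervals instead used empirical (robust) standard errors from the fitted models, so the numbers produced here will not reproduce the published intervals. The function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1 - p))


def expit(x):
    """Inverse of logit: maps log-odds back to a proportion."""
    return 1 / (1 + math.exp(-x))


def wald_ci_percent(successes, n, z=1.959964):
    """95% Wald interval built on the logit scale, returned in percent.

    Assumption: SE of logit(p) is sqrt(1 / (n * p * (1 - p))), the
    independent-observations formula, NOT the cluster-robust estimate
    used in the study.
    """
    p = successes / n
    se_logit = math.sqrt(1 / (n * p * (1 - p)))
    lower = expit(logit(p) - z * se_logit)
    upper = expit(logit(p) + z * se_logit)
    return 100 * lower, 100 * upper


# Overall proportion of correctly diagnosed cases quoted in the text.
lower, upper = wald_ci_percent(760, 1911)
print(f"{lower:.1f}% to {upper:.1f}%")
```

Building the interval on the logit scale and transforming back keeps both limits inside 0% to 100% and makes the interval asymmetric around the point estimate, which a percent-scale Wald interval would not do.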
Our second approach offers a contrasting strategy to address the first and second research questions. To this end, we computed nonparametric correlation coefficients (Kendall's τb) between the 2-level variable of correctness and the 4 levels of confidence from the original data, without thresholding. We computed separate τb coefficients for subjects at each experience level, and for the sample as a whole. Correlations were computed with case as the unit of analysis after exploratory analyses correcting for the nesting of cases within subjects led to negligible changes in the results.
The power of the inferential statistics employed in this analysis was based on the two-tailed t test, as the tests we performed are analogous to testing differences in means on a logit scale. Because our tests are based on a priori unknown marginal cell counts, we halved the sample size to estimate power. For the analyses addressing research question 1, which use all cases, power is greater than .96 to detect a small to moderate effect of .3 standard deviations at an α level of .05. For analyses addressing research questions 2 and 3, analyses that are based on subsets of cases, the analogous statistical power estimate is greater than .81.
Table 1 displays the crosstabulation of correctness of diagnosis and binary levels of confidence (with 95% confidence interval) for all subjects and separately for each clinical experience level, using each subject's optimal threshold to dichotomize the confidence scale. The difficulty of these cases is evident from Table 1, as 760 of 1,911 (40%) were correctly diagnosed by the full set of subjects. Diagnostic accuracy increased monotonically with subjects' clinical experience. The difficulty of the cases is also reflected in the distribution of the confidence ratings, with subjects classified as confident for 583 (31%) of 1,911 cases, after adjustment for varying interpretations of the scale. These confidence levels revealed the same general monotonic relationship with clinical experience. Across the entire sample of subjects, confidence and correctness were aligned for 1,308 of 1,911 cases (68%), corresponding to Kendall's τb=.321 (P <.0001) and a κ value of .314. Alignment was seen in 64% of cases for faculty (τb=.291 [P <.0001]; κ=.285), 63% for residents (τb=.230 [P <.0001]; κ=.227), and 78% for students (τb=.369 [P <.0001]; κ=.349).
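As a check on the all-subjects figures, the agreement statistics can be recomputed from the totals quoted in this paragraph. The sketch below is illustrative only: the four cell counts are not copied from Table 1 but back-calculated from the quoted totals (1,911 cases, 760 correct, 583 confident, 1,308 aligned), and the function names are hypothetical. For a 2 × 2 table, Kendall's τb reduces to the closed form used here.

```python
import math

# Totals quoted in the text for all subjects.
n_total, n_correct, n_confident, n_aligned = 1911, 760, 583, 1308

# Back-calculate the 2x2 cells (assumption: derived, not read from Table 1).
# confident + correct = 2a + nonaligned, so a = (confident + correct - nonaligned) / 2
n_nonaligned = n_total - n_aligned
a = (n_confident + n_correct - n_nonaligned) // 2  # confident and correct
b = n_confident - a                                # confident, incorrect (overconfident)
c = n_correct - a                                  # unconfident, correct (underconfident)
d = n_total - a - b - c                            # unconfident and incorrect


def cohen_kappa(a, b, c, d):
    """Cohen's kappa: chance-corrected agreement for a 2x2 table."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


def kendall_tau_b_2x2(a, b, c, d):
    """Kendall's tau-b for a 2x2 table (equal to the phi coefficient)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


print("cells:", a, b, c, d)
print("kappa:", round(cohen_kappa(a, b, c, d), 3))
print("tau-b:", round(kendall_tau_b_2x2(a, b, c, d), 3))
```

Run on these totals, the sketch gives κ = .314 and τb = .321, matching the values reported above for all subjects.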
Figure 1 offers a graphical portrayal, for each experience level, of the proportions of correct diagnoses as a function of confidence, with 95% confidence intervals. The relationship between correctness and confidence, at each level, is seen in the differences between these proportions.
Wald statistics generated by the statistical model reveal a significant alignment between diagnostic correctness and confidence across all subjects (χ2=199.64, df=1, P <.0001). Significant relationships are also seen between correctness and training level (χ2=20.40, df=2, P <.0001) and in the interaction between confidence and training level (χ2=17.00, df=2, P <.0002). Alignment levels for faculty and residents differ from those of the students (P <.05); and from inspection of Figure 1 it is evident that students' alignment levels are higher than those of faculty or residents.
With reference to the third research question, Table 2 summarizes the case frequencies for which clinicians at each level were correctly confident—where confidence was aligned with correctness—as well as frequencies for the “nonaligned” cases where they were overconfident and underconfident. Students were overconfident in 25% of nonaligned cases, corresponding to 5% of cases they completed. Residents were overconfident in 41% of nonaligned cases, and 15% of cases overall. Faculty physicians were overconfident in 36% of nonaligned cases, and 13% of cases overall.
All subjects were more likely to be underconfident than overconfident (χ2=29.05, P <.0001). Students were found to be more underconfident than residents (Wald statistics: χ2=6.19, df=2, P <.05). All other differences between subjects' experience levels were not significant.
The second approach to analysis yielded Kendall τb measures of association between the binary measure of correctness and the 4-level measure of confidence, computed directly from the study data, without any corrections. For all subjects and cases, we observed τb=.106 (N =1,911 cases; P <.0001). Separately for each level of training, Kendall coefficients are: faculty τb=.103 (n =628; P <.005), residents τb=.041 (n =638; NS), and students τb=.121 (n =645 cases; P <.001). The polarity of the relationship is as would be expected, associating correctness of diagnosis with higher confidence levels. The τb values reported here can be compared with their counterparts, reported above, for the analyses that included threshold correction.
The assumption built into the first analytic strategy, that subjects make internally consistent but personally idiosyncratic interpretations of confidence, generates what may be termed upper-bound estimates of alignment between confidence and correctness. Under the assumptions embedded in this analysis, the results of this study indicate that the correctness of clinicians' diagnoses and their perceptions of the correctness of these diagnoses are, at most, moderately aligned. The correctness and confidence of faculty physicians and senior medical residents were aligned about two thirds of the time—and in cases where correctness and confidence were not aligned, these subjects were more likely to be underconfident than overconfident. While faculty subjects demonstrated tendencies toward greater alignment and less frequent overconfidence than residents, these differences were not statistically significant. Students' results were substantially different from those of their more experienced colleagues, as their confidence and correctness were aligned about four fifths of the time and more highly skewed, when nonaligned, toward underconfidence. The alignment between “being correct” and “being confident”—within groups and for all subjects—would be qualitatively characterized as “fair,” as seen by κ coefficients in the range .2 to .4.27
The more conservative second mode of analysis yielded smaller relationships between correctness and confidence, as seen in the τb coefficient for all subjects, which is smaller by a factor of three. For the residents, the relationship between correctness and confidence does not exceed chance expectations when computed without thresholding. Comparison across experience levels reveals the same trend seen in the primary analysis, with students displaying the highest level of alignment.
The greater apparent alignment for the students, under both analytic approaches, may be explained by the difficulty of the cases. The students were probably overmatched by many of these cases, perhaps guessing at diagnoses, and were almost certainly aware that they were overmatched. This is seen in the low proportions of correct diagnoses for students and the low levels of expressed confidence. These skewed distributions would generate alignment between correctness and confidence of 67% by chance alone. While students' alignments exceeded these chance expectations, a better estimate of their concordance between confidence and correctness might be obtained by challenging the students with less difficult cases, making the diagnostic task as difficult for them as it was for the faculty and residents with the cases employed in this study. We do not believe it is valid to conclude from these results that the students are “more aware” than experienced clinicians of when they are right and wrong.
By contrast, residents and faculty correctly diagnosed 44% and 50% of these difficult cases, respectively, and generated distributions of confidence ratings that were less skewed than those of the students. In cases for which these clinicians' correctness and confidence were not aligned, both faculty and residents showed an overall tendency toward underconfidence in their diagnoses. Despite the general tendency toward underconfidence, residents and faculty in this study were overconfident, placing credence in a diagnosis that was in fact incorrect, in 15% (98/638) and 13% (80/628) of cases, respectively. Because these two more experienced groups are directly responsible for patient care, and offered much more accurate diagnoses for these difficult cases, findings for these groups take on a different interpretation and perhaps greater potential significance.
In designing the study, we approached the measurement of “confidence” by grounding it in hypothetical clinical behavior. Rather than asking subjects directly to estimate their confidence levels in either probabilistic or qualitative terms, we asked them for the likelihood of their seeking help in reaching a diagnosis for each case. We considered this measure to be a proxy for “confidence.” Because our intent was to inform the design of decision support systems and medical error reduction efforts generally, we believe that this behavioral approach to assessment of confidence lends validity to our conclusions.
Limitations of this study include restriction of the task to diagnosis. Differences in results may be seen in clinical tasks other than diagnosis, such as determination of appropriate therapy for a problem already diagnosed. The cases, chosen to be very difficult and with definitive findings excluded, certainly generated lower rates of accurate diagnoses than are typically seen in routine clinical practice. Were the cases in this study more routine, this may have affected the measured levels of alignment between confidence and correctness. In addition, this study was conducted in a laboratory setting, using written case synopses, to provide experimental precision and control. While the case synopses contained very large amounts of clinical information, the task environment for these subjects was not the task environment of routine patient care. Clinicians might have been more, or less, confident in their assessments had the cases used in the study been real patients for whom these clinicians were responsible; and in actual practice, physicians may be more likely to consult on difficult cases regardless of their confidence level. While we employed volunteer subjects in this study, the sample sizes at each institution for the resident and faculty groups were large relative to the sizes at each institution of their respective populations, and thus unlikely to be skewed by sampling bias.
The relationships, of “fair” magnitude, between correctness and confidence were seen only after adjusting each subject's confidence ratings to reflect differing interpretations of the confidence scale. The secondary analytic approach, which does not correct individuals' judgments against their own optimal thresholds, results in observed relationships between correctness and confidence that are smaller. Under either set of assumptions, the relationship between confidence and correctness is such that designers of clinical decision support systems cannot assume clinicians to be accurate in their own assessments of when they do and do not require assistance from external knowledge resources.
This work was supported by grant R01-LM-05630 from the National Library of Medicine.