This study examined the efficacy of a new cognitive debiasing intervention in reducing decision-making errors in the assessment of pediatric bipolar disorder (PBD).
The study was a randomized controlled trial using case vignette methodology. Participants were 137 mental health professionals working in different regions of the US (M=8.6±7.5 years of experience). Participants were randomly assigned to (1) a brief overview of PBD (control condition) or (2) the same brief overview plus a cognitive debiasing intervention (treatment condition) that educated participants about common cognitive pitfalls (e.g., base-rate neglect; search satisficing) and taught corrective strategies (e.g., mnemonics, Bayesian tools). Both groups evaluated four identical case vignettes. Primary outcome measures were clinicians’ diagnoses and treatment decisions. The vignette characters’ race/ethnicity was experimentally manipulated.
Participants in the treatment group showed better overall judgment accuracy, p < .001, and committed significantly fewer decision-making errors, p < .001. Inaccurate and somewhat accurate diagnostic decisions were significantly associated with different treatment and clinical recommendations, particularly in cases where participants missed comorbid conditions, failed to detect the possibility of hypomania or mania in depressed youths, and misdiagnosed classic manic symptoms. In contrast, effects of patient race were negligible.
The cognitive debiasing intervention outperformed the control condition. Examining specific heuristics in cases of PBD may identify especially problematic mismatches between typical habits of thought and characteristics of the disorder. The debiasing intervention was brief and delivered via the Web; it has the potential to generalize and extend to other diagnoses as well as to various practice and training settings.
Pediatric bipolar disorder is an extremely challenging diagnosis (Leibenluft, 2011; Marchand, Wirth, & Simon, 2006; Rettew, Lynch, Achenbach, Dumenci, & Ivanova, 2009). There often are long delays in diagnosing youth with PBD, whose symptoms are frequently misdiagnosed as something else (Marchand et al., 2006). Paradoxically, bipolar also is often a “false positive” diagnosis, clinically assigned and treated when the youth actually has a different disorder (Leibenluft, 2011; Rettew, Lynch, Achenbach, Dumenci, & Ivanova, 2009). In short, clinicians miss true cases of bipolar disorder and diagnose PBD in many cases where the youth does not actually have the disorder. Thus, the risk for cognitive error is especially high in clinicians’ assessment of bipolar symptomatology. The medical decision-making literature has identified over thirty faulty heuristics, biases, and cognitive distortions (Croskerry, 2005; Elstein & Schwartz, 2002). Examining specific heuristics in cases of PBD may identify especially problematic mismatches between typical habits of thought and characteristics of the disorder. Teaching clinicians cognitive debiasing strategies may improve their decision-making and help reduce the frequency of misdiagnosis around PBD.
Several factors challenge diagnostic decisions and reliability of clinical impressions for PBD. First, case presentations differ depending on whether a client is experiencing florid mania, severe depression, a mixed state, or euthymia (Youngstrom, 2005). Consequently, clinicians can fail to detect hypomania or mania in youth with depression or youth experiencing a euthymic state, neglecting the possibility of PBD in their diagnoses, case conceptualizations, and treatment plans. When this happens, clinicians commit base-rate neglect--failing to adequately take into account the prevalence of a particular disease. Epidemiological studies suggest that bipolar is more common than previously thought (Hasin, Goodwin, Stinson, & Grant, 2005; Merikangas, et al., 2007; Merikangas & Pato, 2009), especially in clinical settings (Biederman, et al., 1996; Blader & Carlson, 2007; Geller, et al., 2002; Youngstrom, Youngstrom, & Starr, 2005). Further, a higher proportion of people with depression will actually have bipolar disorder, particularly in child populations (Angst, Sellaro, Stassen, & Gamma, 2005). Both of these trends worsen the impact of base-rate neglect.
Second, symptom overlap makes it difficult to discern symptoms of bipolar from other likely suspects in pediatric populations. For example, irritability can be a symptom of bipolar, ADHD, unipolar depression, and ODD. Clinicians are vulnerable to diagnosis momentum (also known as diagnostic creep): the tendency for a particular initial or provisional diagnosis to become established without adequate evidence. Research indicates clinicians overestimate the likelihood of bipolar without adequate evidence (Dubicka, Carlson, Vail, & Harrington, 2008; Jenkins, et al., 2011). They may frame this judgment quickly on the basis of a few features and then fail to check whether a sufficient number of criteria are met. Similarly, once the bipolar hypothesis gains momentum, disconfirming evidence may be discounted or ignored. Clinicians, like many people, are prone to resist modifying or discarding ideas despite further evidence to the contrary (Dawes, Faust, & Meehl, 1989; Meehl, 1954).
Third, due to high comorbidity, clinicians may not see a “typical” uncomplicated bipolar presentation (Biederman, Klein, Pine, & Klein, 1998; Youngstrom, Findling, et al., 2005). For example, individuals with bipolar have high rates of ADHD (Galanter & Leibenluft, 2008; Kessler, et al., 2005; Lewinsohn, Klein, & Seeley, 1995; Singh, DelBello, Kowatch, & Strakowski, 2006). As a result of overlapping symptomatology and high comorbidity rates, clinicians are prone to commit search satisficing--when one calls off a search once something is found (i.e., finding something may be satisfactory but not finding everything is suboptimal (see Croskerry, 2002)). In the case of PBD and ADHD, clinicians may “call off the search” when they diagnose one of these disorders, causing them to miss the other disorder for youths who actually meet DSM criteria for both conditions. Clinical “diagnosis as usual” detects one less diagnosis per case on average than found using a more structured interview approach with the same cases (Jensen-Doss, Youngstrom, Youngstrom, Feeny, & Findling, 2014; Rettew, Lynch, Achenbach, Dumenci, & Ivanova, 2009).
Fourth, given the evidence suggesting frequent overdiagnosis in the U.S. (Dubicka, Carlson, Vail, & Harrington, 2008; Soutullo et al., 2005), clinicians may show exaggerated certainty in their decision-making. This tendency is consistent with overconfidence bias, or thinking one knows more than one does. Overconfidence bias can lead to delayed or missed diagnosis and unwarranted interventions (Croskerry, 2002).
Finally, another challenge is the frequent misdiagnosis of minority populations (Garland et al., 2005). For example, African American pediatric populations are more likely to receive diagnoses of conduct disorder or schizophrenia (DelBello, Lopez-Larson, Soutullo, & Strakowski, 2001; Fabrega, Ulrich, & Mezzich, 1993; Kilgus, Pumariega, & Cuffe, 1995), although currently no research indicates that the risk of developing bipolar disorder varies by race/ethnicity (American Psychiatric Association, 2001; Van Meter, Moreira, & Youngstrom, 2011). Race/ethnicity bias is associated with misdiagnosis, delay in diagnosis and treatment, and stigma.
At present, there is little empirical information about cognitive debiasing strategies in mental health (Galanter & Patel, 2005). Although these strategies are relatively innovative for psychology in the sense that they are not formally organized or taught as they are presently in other medical areas (Croskerry, 2003), experts have repeatedly recommended training mental health professionals in decision-making (Arkes, 1991; Baker-Ericzen, Jenkins, Park, & Garland, 2014). Judgment does not necessarily improve with additional experience (Lueger, 2002; Spengler et al., 2009), and more experienced clinicians may be at greater risk for biased judgment than novice clinicians (Strohmer & Leierer, 2000). Overall, research needs to identify effective ways to teach mental health professionals how to avoid cognitive pitfalls (Regehr & Norman, 1996). Simply telling individuals to avoid particular biases is not sufficient to change behavior (Arkes, 1991; Kurtz & Garfield, 1978).
Several cognitive debiasing strategies significantly reduce diagnostic error in medical settings (Croskerry, 2003; Gigerenzer & Goldstein, 1996). These strategies could potentially help with PBD. Developing insight and awareness involves providing detailed descriptions of known cognitive biases, together with multiple clinical examples illustrating their adverse effects on diagnosis (Croskerry, 2003). In the case of PBD, this would involve education on key heuristics and biases (e.g., base-rate neglect, search satisficing, diagnosis momentum, overconfidence bias and race/ethnicity bias), and using varied clinical scenarios to highlight possible outcomes (e.g., misdiagnosis, delay in diagnosis). Another key strategy, making the task easier, entails providing more information about the specific problem to reduce task difficulty and ambiguity (Croskerry, 2003). Here, mood graphs can illustrate different mood states of bipolar disorder and highlight the diagnostic challenges associated with the heterogeneity of the condition. For example, low or depressed mood can lead to base-rate neglect if clinicians fail to assess for past episodes of hypomania and mania. Mood graphs (e.g., Bieling et al., 2003) help both clinicians and patients develop insight and awareness, as well as communicate more clearly.
Considering alternatives is the procedure of establishing forced consideration of alternative possibilities (Croskerry, 2003). Symptom checklists can help clinicians consider alternative diagnoses by decreasing reliance on intuition in clinical problem solving (Ely, Graber, & Croskerry, 2011). For PBD, symptom checklists can guard against search satisficing, helping clinicians assess for bipolar disorder in cases of comorbid ADHD and vice versa.
Decreasing reliance on memory entails improving the accuracy of judgments through cognitive aids such as mnemonics (Croskerry, 2003). GRAPES is a helpful mnemonic acronym for symptoms more specific to bipolar disorder, standing for Grandiosity, Racing thoughts, changes in Activity, Pressured speech, Elated mood, and decreased need for Sleep. Decreasing reliance on memory protects against several cognitive errors, particularly diagnosis momentum and overconfidence bias (Arkes, 1981).
Cognitive forcing strategies help avoid predictable bias in particular clinical situations (Croskerry, 2003). This strategy may protect against race/ethnicity bias. Having clinicians routinely assess for mood symptoms in suspected cases of schizophrenia and conduct disorder could decrease misdiagnosis of African Americans.
Statistical methods also can decrease judgment biases and errors (Aegisdottir et al., 2006; Croskerry, 2003; Grove, Zald, Lebow, Snitz, & Nelson, 2000). In fact, “actuarial” statistical prediction methods that use Bayes’ theorem to estimate the probability of a diagnosis based on test findings or clinical observations have been shown to be more accurate and efficient for diagnosing PBD (Jenkins, Youngstrom, Washburn, & Youngstrom, 2011; Jenkins, Youngstrom, Youngstrom, Feeny, & Findling, 2011). Clinicians can implement these methods in ways that require no mathematical computation on the part of the clinician (see Jenkins et al., 2011). Unfortunately, these methods are not routinely taught or used in psychological practice, and even these methods require clinical judgment to maximize their clinical utility (Elstein & Schwartz, 2002; Youngstrom, Freeman, & Jenkins, 2009).
More generally, experts recommend opportunities for metacognition and simulation (Croskerry, 2003). In short, decision-making benefits from stepping back from the current task and examining one's thinking process (i.e., reflecting on one's approach before diagnosing PBD or a frequently co-occurring condition) as well as engaging in mental rehearsals. The latter involves practice with different clinical scenarios involving PBD and also experience using actuarial methods.
The present study tested the efficacy of a new intervention designed to improve clinical judgment in the assessment of PBD by educating participants about common cognitive pitfalls and training them in recommended debiasing strategies. We hypothesized that participants’ diagnostic judgments in the treatment group would show higher overall decision accuracy than participants’ judgments in the control condition. Additionally, we hypothesized participants in the treatment group would commit significantly fewer diagnostic errors and show significantly higher diagnostic accuracy than participants in the control group on individual vignettes constructed to test for specific cognitive errors (e.g., base rate neglect, search satisficing, diagnosis momentum). A secondary aim explored the impact of diagnostic decisions on treatment decisions. Specifically, we predicted that employing a faulty heuristic (as demonstrated by suboptimal diagnostic decisions) would affect clinicians’ decisions regarding next clinical action, suggesting that heuristics change treatment as well as assessment formulations. An exploratory goal of the study investigated the role of race/ethnicity bias in clinicians’ diagnostic decisions, as the vignette design offered an opportunity to change race while holding all other features constant.
The Institutional Review Board at the University of North Carolina approved this study.
Participants were 137 mental health providers: 56 in the treatment group and 81 in the control group (M = 8.6±7.8 years of practice experience). Eligible participants were licensed clinicians or currently supervised by a licensed mental health professional and had experience treating pediatric patient populations for psychiatric issues. Table 1 provides information about participant demographics. Table 2 provides details about participants’ professional backgrounds. Twenty-three participants reported attending a previous training in the assessment of PBD (e.g., continuing education seminar). Distribution of these participants between the control (n = 13) and treatment (n = 10) conditions was balanced, p > .05. Recruitment took place between 2011 and 2013, ending when the recruitment goal was met. Participants were recruited across the US using study fliers, listserv announcements, and by word-of-mouth. Figure 1 depicts the Trial Flow Diagram.
Four case vignettes were used. Each vignette had features designed to test for a specific cognitive error associated with the assessment of PBD. Vignette characters were experimentally manipulated to test possible race/ethnicity bias; vignettes described youths who were African American half of the time and Caucasian half of the time, with all other details the same.
This vignette described a 14-year-old female with current symptoms characteristic of a major depressive episode. Responses that included depression and inquired about hypo/mania were coded as accurate, whereas responses that included depression but missed the possibility of mania/hypomania were coded as “somewhat accurate” because they missed the recommendation to systematically check for history of hypomania or mania when working with depression, particularly early onset depression (American Psychiatric Association, 2013; McClellan, Kowatch, & Findling, 2007). Inaccurate responses missed depression as well as any consideration of mania symptoms.
This vignette described a 7-year-old male with mania and ADHD symptoms. Responses that considered both ADHD and a bipolar disorder were coded as accurate, versus responses that included only ADHD or a bipolar disorder, which were coded as somewhat accurate. Inaccurate responses missed ADHD and bipolar disorder.
This vignette portrayed a 10-year-old male with bipolar symptoms, a moderately elevated test score on a widely used assessment instrument, and a family history of bipolar disorder. After providing a probable diagnosis and recommending next clinical actions, participants rated the probability of a bipolar diagnosis (from 0 to 100). In an actuarial approach, the combination of assessment findings yields a moderate Bayesian probability of a bipolar diagnosis (i.e., a point estimate of 27% probability).
This vignette focused on an 11-year-old female with classic symptoms of mania including grandiosity, hypersexuality as an example of disinhibited and risky behavior, psychomotor agitation, and distractibility – meeting criteria for at least four “B Criterion” symptoms in addition to the episodic disturbance of mood (per DSM-IV-TR) (American Psychiatric Association, 2001). Consistent with the episodic nature of mood disorders, her symptoms were described as intermittent. This youth met duration criteria for a manic episode (e.g., 1 week or longer), and experienced impairment as a result of her symptoms. This vignette was similar to a vignette used in Dubicka, Carlson, Vail, & Harrington (2008).
Participants reported the youth's probable diagnoses after reading each case vignette. These diagnoses were scored on a 3-point Likert scale (1 = inaccurate diagnosis, 2 = somewhat accurate, 3 = accurate diagnosis) using a priori criteria established by expert diagnosticians coding features of the vignette against DSM-IV criteria. Participants could also indicate other diagnoses that they were considering.
“Inaccurate” and “somewhat accurate” diagnoses counted as committing a decision-making error; “accurate” responses did not.
For the vignette evaluating “diagnosis momentum,” participants’ estimates were compared to an actuarial estimate: the Bayesian posterior probability of the likelihood of bipolar disorder (Jenkins et al., 2011; Youngstrom & Duax, 2005; Youngstrom et al., 2009). The Bayesian estimate was 27% (±5%) based on: clinical setting (e.g., the youth was seen in outpatient clinic, so the recommended starting base rate of PBD is 6%); family history information (e.g., the youth had a second degree relative with bipolar disorder, increasing the odds of a bipolar diagnosis by a factor of 2.5), and a test score of 35 on the Parent General Behavior Inventory (PGBI; which is associated with a diagnostic likelihood ratio of 2.3) (Youngstrom et al., 2004). Estimates of 27 or estimates that fell within ±5% of the Bayesian optimal estimate (from 22 to 32) (Sedlmeier & Gigerenzer, 2001) indicated no decision-making error. Estimates that fell outside of a ±5% range (less than 22 or greater than 32) were classified as committing a decision-making error. For diagnostic accuracy, a 3-point response was 27%, a 2-point response was between 22% and 32%, and a 1-point response was less than 22% or greater than 32%.
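The arithmetic behind the 27% benchmark can be sketched in a few lines of Python using the odds form of Bayes' theorem; this is a minimal illustration with the values reported above (the function name is ours, not from the study materials):

```python
def update_probability(prior_prob, *likelihood_ratios):
    """Combine a prior probability with diagnostic likelihood ratios
    via the odds form of Bayes' theorem."""
    odds = prior_prob / (1 - prior_prob)  # convert probability to odds
    for lr in likelihood_ratios:
        odds *= lr                        # each ratio multiplies the odds
    return odds / (1 + odds)              # convert odds back to probability

# Values from the vignette: 6% outpatient base rate, family-history
# factor of 2.5, and a PGBI score carrying a likelihood ratio of 2.3.
posterior = update_probability(0.06, 2.5, 2.3)
print(round(posterior * 100))  # -> 27
```

Chaining the likelihood ratios in odds form is what lets clinicians apply the same update with a probability nomogram rather than by hand calculation.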
Overall diagnostic accuracy was a composite measure, averaging diagnostic accuracy and risk estimate accuracy across all case vignettes.
Study administration was web-based, combining Qualtrics (Qualtrics, 2014) and an automated PowerPoint presentation with narration. Clinicians accessed the study through a private Web address and password. Qualtrics randomly assigned participants to the control condition (brief presentation on PBD) or treatment condition (the brief presentation on PBD + the cognitive debiasing intervention). The intervention consisted of two main parts: education on common cognitive pitfalls in the assessment of PBD (e.g., base-rate neglect; search satisficing; diagnosis momentum; race/ethnicity bias); and training and tools to avoid cognitive pitfalls when assessing for PBD. The training and tools consisted of reminders to consider alternatives (e.g., symptom checklists, routinely asking “what else might this be”), metacognition (e.g., examining one's own decision-making during and after assessment, prior to making a diagnosis), decreased reliance on memory (e.g., mnemonics to remember bipolar symptoms), actuarial approaches (e.g., diagnostic likelihood ratios and Bayesian reasoning), and simulation (e.g., opportunities to practice decision-making with case information).
After watching their respective presentations, participants completed four case vignette exercises. Qualtrics randomized the presentation order of case vignettes. Participants were instructed to (i) decide the probable diagnosis from a pull-down menu of 26 diagnosis options and (ii) recommend the next clinical action (e.g., more assessment, psychotherapy, medication, no treatment, other). Participants could select one or two initial treatment methods (Currin, Schmidt, & Waller, 2007). After clinicians selected a probable diagnosis, they were asked if they were considering any additional diagnoses. This provided participants an opportunity to report any comorbid disorders and provide differential diagnosis. Additional diagnoses were incorporated in coding diagnostic accuracy (as described above).
Data were screened to ensure quality and to check standard statistical assumptions. The amount of missing data was small (<2%), and data were excluded listwise (Allison, 2002). Multiple linear regression models tested treatment condition, controlling for years of clinical experience, as predictors of participants’ overall performance both in terms of the total number of decision-making errors and the accuracy of diagnostic decisions across all vignettes. Chi-square examined: (1) the association between vignette characters’ race/ethnicity and participants’ diagnostic accuracy, and (2) associations between decision-making error and practice implications (regardless of condition status). For individual vignettes, logistic regression tested whether treatment status changed the probability of decision-making errors for each vignette. Polytomous Universal Model (PLUM) regression modeled diagnostic accuracy, with years of clinical experience, professional title, and condition status as the IVs. ANCOVA compared risk estimates between treatment groups, controlling for clinical experience. Sensitivity analyses examined whether the results changed if not controlling for years of experience, or also controlling for type of professional.
There was adequate power to detect effects for all primary analyses. A sensitivity analysis using G*Power (Faul, Erdfelder, Buchner, & Lang, 2009) indicated 80% power to detect effect sizes of f2 = .07 or larger (Cohen, 1988) for the given sample size (N = 137) and alpha = .05, 2-tailed. Cohen described effect sizes of f2 = .02 as “small,” .15 as “medium,” and .35 as “large.” Statistical analyses were performed using IBM Statistical Package for the Social Sciences (SPSS) Version 21.0. We reported effect sizes as phi for χ2 with 1 df, Nagelkerke R2 for multiple df logistic and polytomous regressions, R2 for multiple regressions, and Cohen's d when comparing means.
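The f2 benchmarks above relate to R2 through the standard conversion f2 = R2 / (1 − R2); a short sketch (the helper name is ours) shows how an observed R2 for a regression model maps onto Cohen's conventions:

```python
def cohens_f2(r_squared):
    """Cohen's f^2 effect size for a regression model: f^2 = R^2 / (1 - R^2)."""
    return r_squared / (1 - r_squared)

# The minimum detectable effect here (f^2 = .07) sits between Cohen's
# "small" (.02) and "medium" (.15) benchmarks. For comparison, an R^2 of
# .36 (the size reported for the primary models in the Results) converts to:
print(round(cohens_f2(0.36), 2))  # -> 0.56, above the "large" benchmark of .35
```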
The treatment and control groups showed no significant differences, p > .05, on any demographic or professional variables with the exception of their primary professional activities, χ2 (5) = 12.29, p = .03. See Tables 1 and 2.
Clinicians in the intervention condition made significantly fewer errors across all vignettes (M=1.5±1.2 versus M=3.0±.09), with a large effect size, R2 = .36, p < .001 and Cohen's d = 1.67. Intervention remained significant when controlling for experience; experience and professional type did not contribute to prediction of errors.
For diagnostic accuracy, assignment to the training condition made a significant unique contribution even after controlling for years of experience, R2 = .36, p < .001. Average accuracy was 9.9±1.7 in the training condition versus 7.6±1.2 in the control, d = 1.08, p < .0005. Years of clinical experience or professional type did not make a significant unique contribution, ps > .18.
Using a stringent definition, 23% in the treatment group committed a decision-making error, versus 69% in the control group, χ2 (1) = 29.11, phi = .46, p < .001. In terms of diagnostic accuracy, the overall chi-square for the PLUM regression was significant, χ2 (3) = 34.04, p < .001, with a Nagelkerke R2 = .28. Treatment significantly improved diagnostic accuracy in the expected direction, B = −2.23 (1), p < .001. Years of clinical experience and professional type did not make significant unique contributions, ps > .05.
For next action, 82% of participants chose assessment, 69% therapy, 26% medication, and 3% no further clinical action necessary. When participants considered the possibility of hypo/mania in their diagnostic reasoning, they were significantly more likely to recommend further assessment, χ2 (1) = 8.04, phi = .24, p < .05. There were no significant associations between decision-making error status and participants’ likelihood of recommending therapy, p > .05, or medication, p > .05. Chi-square analysis revealed no significant association between decision-making error and antidepressant medication, p = .08.
In the treatment group, 29% committed a decision-making error, versus 59% in the control group, χ2 (1) = 12.83, phi = .31, p < .001. The overall chi-square for the PLUM regression was significant, χ2 (3) = 11.66, p = .009, with a Nagelkerke R2 = .11. Debiasing treatment significantly improved diagnostic accuracy, B = −1.26 (1), p = .001; years of clinical experience did not make a significant unique contribution, p > .05, nor did professional title, p > .05.
For this vignette, 82% of participants selected more assessment, 45% medication, and 38% therapy. When participants failed to consider a comorbid diagnosis of bipolar disorder or ADHD, they were significantly more likely to recommend medication or a medication referral, χ2 (1) = 13.09, phi = .31, p < .001. When participants detected the possibility of a comorbid condition, they were less likely to recommend therapy, χ2 (1) = 4.08, phi = .17, p < .05, and more likely to recommend additional assessment, χ2 (1) = 17.08, phi = .35, p < .001.
Participants in the treatment condition showed significantly lower diagnosis momentum, p < .001, even when controlling for years of experience, F(2,128) = 8.39, p < .001, R2 = .12. In the control group, participants’ estimates of the probability of a bipolar diagnosis (M = 46.4, SD = 20.7) were significantly higher than 27% for this case (t(80) = 8.44, p < .001). Control group participants’ estimates also were significantly higher than the “near miss” upper bound of 32% (27%+5%), p < .001.
The treatment group's estimates (M = 34.2, SD = 15.4) tended to be higher than the Bayesian estimate as well, t(55) = 3.50, p < .05. However, the treatment group's average was not significantly different from the “close” limit, t(55) = 1.07, p = .29. The cognitive debiasing intervention led to less cognitive error in synthesizing the case information, χ2 (1) = 49.36, phi = .60, p < .001, based on logistic regression predicting estimates in the “close” range. There was no significant association between decision-making error and clinical actions, p > .05; 93% of all participants chose therapy, 48% assessment, and 28% medication.
The intervention reduced decision-making errors (61% versus 82% in the controls), χ2 (1) = 7.17, phi = .23, p = .007. When controlling for years of experience and professional type, the overall model was significant, χ2 (3) = 8.58, p = .035, Nagelkerke R2 = .07, and the intervention still predicted fewer errors, B = −.98 (1), p = .005.
For next clinical action, 87% selected more assessment, 70% therapy, and 26% medication. When participants failed to detect bipolar disorder they were less likely to recommend medication, χ2 (1) = 11.09, phi = .28, p < .005, and more likely to recommend therapy, χ2 (1) = 14.07, phi = .32, p < .001.
The vignette characters’ race/ethnicity was not associated with participants’ diagnostic accuracy, largest χ2 = .28, all p >.05. Effect sizes for the race/ethnicity variable were all small, d = .01 to .28, where .2 is conventionally considered “small.”
The goal of the present study was to see whether a brief training intervention could yield measurable improvement in the accuracy of clinical diagnostic assessment in practicing clinicians working with a standardized set of clinical vignettes. The study used a randomized trial design to test the effects of cognitive debiasing training on clinical decision-making about PBD and frequently co-occurring conditions. We selected pediatric bipolar disorder as the focus because (a) the diagnosis has been controversial, (b) the research base has changed rapidly in this area (e.g., more than 90% of 7700+ hits in PubMed were published in the last 15 years), (c) there are big differences in training around the topic (Dubicka et al., 2008), leading to pockets of very high or low diagnosis rates, and low overall accuracy comparing clinical diagnoses to research interviews (Jensen-Doss, Youngstrom, Youngstrom, Feeny, & Findling, 2014; Rettew et al., 2009) and (d) we could identify multiple cognitive heuristics likely to play a role in evaluating youths with mood and behavioral issues. Randomization assigned participating mental health professionals in a balanced manner in terms of professional training and experience, and the training intervention produced statistically significant and large effect sizes improving most outcome measures.
We measured overall decision accuracy in two ways: total decision-making errors and diagnostic accuracy ratings across all four vignettes. In both cases, the treatment group performed significantly better than the control group. Participants who received the cognitive debiasing intervention committed fewer decision-making errors and generated more accurate diagnoses for multiple vignettes with various case presentations. Along with providing empirical support for cognitive debiasing interventions in mental health, our findings highlight the complex task clinicians regularly face when presented with new case information. Decades of research document problems associated with clinical judgment (Aegisdottir et al., 2006; Grove et al., 2000; Spengler et al., 2009); the present study offers a promising new approach for improving clinical judgment and equipping clinicians with relatively simple yet effective strategies to decrease the likelihood of cognitive error.
Despite the superiority of the cognitive debiasing intervention, participants in both conditions appeared to struggle the most with the vignette depicting a youth with classic mania. This outcome is surprising given that this vignette, in many ways, was the most straightforward: The portrayed youth met all DSM-IV diagnostic criteria. Also, all participants were primed for PBD just prior to responding to case vignettes by virtue of the automated PowerPoint presentation. It is possible clinicians are prone to seeking more benign diagnoses or explanations for certain symptoms and behaviors (e.g., adjustment disorder), perhaps hesitant to diagnose mania in younger patient populations (i.e., the vignette character was 11 years old). This is concerning given the substantial delay in the community between onset of symptoms and diagnosis for individuals with bipolar illness (Marchand et al., 2006). More research is needed to better understand how the intervention can be tailored to further reduce diagnostic error around more classic presentations of bipolar illness.
A second goal of the present study involved testing the potential influence of decision accuracy on clinicians' recommended treatment approaches. Similar to many psychiatric disorders, accurate diagnosis of PBD is important before starting treatment (Kowatch et al., 2005; Weller, Danielyan, & Weller, 2004). Suboptimal diagnostic decisions were significantly associated with different treatment and clinical recommendations, particularly when participants missed comorbid conditions, failed to detect the possibility of hypomania or mania in depressed youths, and misdiagnosed classic manic symptoms. The vignette depicting more classic mania symptoms was diagnostically challenging for clinicians regardless of intervention condition status; but participants were more likely to recommend medication when they detected bipolar disorder. In this case, an accurate diagnosis aligned with recommended treatment guidelines in the literature (McClellan, Kowatch, Findling, & Work Group on Quality, 2007). Further, for the youth presenting with bipolar and ADHD symptoms, participants who failed to consider a comorbid condition were more likely to recommend medication. In instances where clinicians detected the ADHD but not the bipolar and recommended medication, the treatment plan would be at odds with the recommendations of current practice parameters, which suggest focusing on stabilization of mood first, followed by augmentation with medication addressing residual ADHD symptoms if necessary (McClellan et al., 2007).
In contrast, participants who accurately detected both conditions were more likely to recommend additional assessment, which would be helpful in determining the optimal sequence of interventions – including, in this case, key decisions about whether to focus first on ADHD symptoms or mood issues. Interestingly, cognitive error, particularly search satisficing, appeared to be associated with premature treatment decisions, whereas more accurate judgment led to a more conservative and methodical approach. In other words, when one identifies “warning signs” of a co-occurring condition (e.g., sleep disturbance or episodic mood and energy disturbance in a case of ADHD), it may appropriately slow down the diagnostic process, allowing for more comprehensive evaluation of alternative or comorbid conditions prior to initiating active treatment (e.g., medication) (Youngstrom, Choukas-Bradley, Calhoun, & Jensen-Doss, 2014). Taken together, faulty heuristics appear to affect clinicians’ treatment formulations and may have important implications for patient care, especially in light of research indicating diagnostic accuracy may be an important precursor to successful treatment (Jensen-Doss & Weisz, 2008).
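The Bayesian tools taught in the debiasing intervention are designed to counter base-rate neglect by anchoring diagnostic judgments to the prevalence of a disorder before weighing new evidence. A minimal sketch of this kind of tool is shown below; the base-rate and likelihood-ratio values are purely illustrative and are not taken from the study.

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Update a diagnostic probability with new evidence via Bayes' rule.

    Works in odds space: posterior odds = prior odds * likelihood ratio,
    then converts back to a probability.
    """
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

# Hypothetical example: a 5% clinic base rate for PBD combined with a
# positive screening result carrying a likelihood ratio of 9 yields a
# revised probability of about 32% -- elevated, but far from certain,
# arguing for further assessment rather than immediate treatment.
revised = posttest_probability(0.05, 9.0)
```

The point of such a tool is not the arithmetic itself but that it forces the base rate into the judgment explicitly, so a striking presenting symptom cannot dominate the clinician's impression on its own.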
Contrary to study hypotheses, participants’ diagnoses were not affected by the race/ethnicity of the vignette character. Specifically, clinicians’ rates of bipolar diagnoses did not vary as a function of the youth being African American or Caucasian, with all other case information held equal by design. It is worth noting that in addition to describing the youth's race/ethnicity in the case vignettes, we included a picture to make race/ethnicity more salient to study participants.
Our non-significant race/ethnicity findings are intriguing given the tendency in the community for African American youths to be misdiagnosed with conduct disorder or schizophrenia (DelBello et al., 2001; Fabrega et al., 1993; Kilgus et al., 1995). It is possible that the misdiagnosis of African American youth is more complex than the mechanisms probed in the vignette, which manipulated the racial/ethnic label and appearance of the case. Recent ethnographic studies reveal differences in how minority populations describe mood symptoms, often emphasizing behavioral issues versus emotional problems (Carpenter-Song, 2009). Parental beliefs about causes of children's behavior also vary as a function of race/ethnicity, influencing how parents conceptualize and describe their children's problems to clinicians (Yeh et al., 2005). For example, when a family's description of the presenting problem focuses on behavior, clinicians may formulate a hypothesis focused on a disruptive behavior disorder and fail to explore mood issues systematically as a competing hypothesis. Thus, cultural differences in description of the presenting problem may trigger availability heuristics, confirmation bias, and search satisficing to produce the discrepant assessment and treatment choice outcomes. The mere experimental manipulation of youths as Caucasian or African American in the present study likely did not capture this phenomenon, although it does suggest that perceptual biases or implicit racism are unlikely to be major factors in misdiagnosis. Additional mixed-methods research could elucidate the role, forms, and effects of bias (Carpenter-Song, Whitley, Lawson, Quimby, & Drake, 2011).
Several limitations warrant attention. First, external validity of studies with case vignettes can be limited because vignettes may not fully reflect complex interpersonal situations and contextual pressures present in real-life diagnosis. At the same time, case vignettes made it possible to sample a broader group of clinicians and to experimentally manipulate both the intervention condition and the case's race/ethnicity while holding other clinical features constant. A second limitation was the restricted number of cognitive errors and debiasing techniques the study could feasibly investigate (cf. Croskerry, 2002). We compensated for this limitation by carefully selecting heuristics based on a thorough literature review and by identifying which heuristics were likely to be commonplace (Galanter & Patel, 2005). By focusing on heuristics likely to operate in more cases, it is possible to achieve good coverage with a smaller number of targets (i.e., Pareto's 80:20 Rule, the “Law of the Vital Few”). Additional research needs to examine other possible heuristics as well as to test a broader array of debiasing strategies. Third, randomization resulted in an unequal number of participants in the treatment and control conditions. Future investigations could benefit from urn randomization or other blocking methods that would promote balanced assignment between arms or within strata (e.g., Project MATCH Research Group, 1997). Additionally, a substantial number of individuals were excluded from participation, with 17 declining and 107 initially viewing the website but not continuing in the study. Due to the online data collection methods employed, it was not possible to examine the impact of study attrition and potential selection bias, such as characteristics of excluded participants. These individuals showed initial interest by viewing the study website but provided limited or no demographic or professional characteristic information.
An unknown portion also may have experienced technical difficulties if they were using an older internet browser or version of Adobe Flash Player™. Thus, we know very little about the subset of individuals who chose not to participate.
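The blocking methods recommended above can be illustrated with a minimal sketch of permuted-block randomization, one common alternative to simple randomization. The function name and parameters are illustrative, not part of the study's procedures; the key property is that group sizes can never drift apart by more than half a block.

```python
import random

def permuted_block_randomization(n_participants, block_size=4, seed=None):
    """Assign participants to a two-arm trial using permuted blocks.

    Each block contains an equal number of 'treatment' and 'control'
    slots in a shuffled order, so the two arms stay nearly balanced
    at every point in enrollment.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for a two-arm trial")
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_participants]
```

With simple (coin-flip) randomization, a sample of 137 participants could easily end up split 60/77 or worse; with blocks of four, the arms can differ by at most two at any point in enrollment.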
Further, the possibilities of self-selection bias and priming warrant attention. Regarding the former, participants may reflect a subset of clinicians particularly eager to learn new assessment methods, potentially decreasing generalizability of study findings. However, the same mechanism may also have resulted in diminished effect sizes if the participants in the control condition were more assessment-savvy than the average clinician. Similarly, all participants were primed for childhood mood disorders regardless of group assignment, potentially increasing participants’ sensitivity to bipolar disorder in their diagnostic decisions. Again, the effect of this priming would be to increase the sensitivity of control participants to bipolar features, akin to enhancing the rate of placebo response in a treatment study, thus attenuating treatment effects. The fact that the debiasing strategies continued to show statistically significant and large treatment effects attests to the promise of the approach. It will be important for future work to examine the extent to which initial treatment effects are retained over time (e.g., recency effects and forgetting). These research questions are intriguing and also crucial in future development, testing, and dissemination of debiasing interventions for mental health professionals.
Additional limitations in terms of generalizability include the small number of male and African American participants, the low representation of clinicians from the Northeast, and a high percentage of participants indicating a cognitive behavioral therapy theoretical orientation, all of which may differ from the general population of practicing clinicians.
Overall, findings showed that a brief (<30 minute) intervention could be delivered via the Web to practicing clinicians. The debiasing intervention produced medium to large gains in the accuracy of diagnostic and treatment decisions across multiple vignettes. Results also were impressive for showing improvement around a complicated and contentious diagnosis – pediatric bipolar disorder.
Future work should focus on cognitive debiasing for improving mental health decision-making in general. Research needs to examine the effectiveness of online teaching resources for mental health practitioners as well as the best media for presenting information to audiences. Several models for Web-based learning courses (e.g., Trauma Focused Cognitive Behavioral Therapy Web, Medical University of South Carolina, http://tfcbt.musc.edu/) are paving the way for scaling up and disseminating mental health interventions (Kazdin & Blase, 2011). When evaluating the effectiveness of web-based interventions, it is important to measure clinicians’ adoption of strategies into practice. The present study demonstrated this with vignettes, but future studies should examine whether this translates into change with subsequent cases. Also, it may be beneficial to explore more deeply the reasons behind error-prone behavior in clinical diagnosis. This line of research could inform future theory regarding decision-making and mental health practice. Strategies provided in the cognitive debiasing intervention are likely to generalize to other aspects of patient care, including the assessment of other mental disorders and treatment decisions. Thus, although the current study concentrated on PBD, these strategies may increase clinicians’ awareness of common cognitive-based errors and teach them new ways of thinking. Further examination of generalizability to other challenging psychiatric diagnostic and treatment decisions will be important.
Present findings are necessarily preliminary, as they are based on a single sample, a limited set of vignettes, and a small range of cognitive heuristics, as well as the absence of data about long-term changes in behavior. However, these findings are similar to results of other debiasing interventions in medicine (e.g., Croskerry, 2013), and the effect sizes were medium or large under what could be considered the assessment analogy of an “efficacy” study with highly controlled conditions. The approach illustrates the potential value of practicing application of the strategies to cases, even vignettes, rather than reading about heuristics and debiasing as abstract principles (Davidow & Levenson, 1993; Galanter & Patel, 2005). This is consistent with current thinking in active learning (Tomcho & Foels, 2012). It suggests that teaching these methods in graduate assessment classes, and then creating opportunities to continue honing them in group supervision, case conferences, workshops, online continuing education exercises, or other venues that foster repeated application and discussion of principles and methods, may be worth exploring as ways of extending dissemination, implementation, and retention of gains in practice (Collins, Leffingwell, & Belar, 2007).
Study findings increase understanding of clinicians’ cognitive vulnerabilities, and they model a new approach for improving clinical decision-making. Greater awareness of faulty heuristics, combined with the use of cognitive debiasing strategies, improved clinicians’ diagnostic reasoning and resulted in more accurate treatment decisions.
Conflict of Interest Disclosures:
Dr. E. Youngstrom has received travel support from Bristol-Myers Squibb and consulted with Lundbeck, Otsuka, Western Psychological Services, and Pearson about assessment.
Dr. Jenkins has no conflicts of interest to disclose.
Role of Funding Source: This material is based upon work supported by the North Carolina Translational and Clinical Sciences Institute: NIH grant#: 1UL1TR001111. The funding source did not play a role in the study design, in the collection, analysis, and interpretation of data, in the writing of the manuscript, or in the decision to submit the paper for publication.