Several studies suggest that a prototype matching approach yields diagnoses of comparable validity to the more complex diagnostic algorithms outlined in DSM-IV. Furthermore, clinicians prefer prototype diagnosis of personality disorders (PDs) to the current categorical diagnostic system or alternative dimensional methods. An important extension of this work is to investigate the degree to which clinicians are able to make prototype diagnoses reliably. The aim of this study is to assess the inter-rater reliability of a prototype matching approach to personality diagnosis in clinical practice. Using prototypes derived empirically in prior research, outpatient clinicians diagnosed patients’ personality after an initial evaluation period. External evaluators independently diagnosed the same patients after watching videotapes of the same clinical hours. Inter-rater reliability for prototype diagnosis was high, with a median r = .72. Cross-correlations between disorders were low, with a median r = .01. Clinicians and clinically trained independent observers can assess complex personality constellations with high reliability using a simple prototype matching procedure, even with prototypes that are relatively unfamiliar to them. In light of its demonstrated reliability, efficiency, and versatility, prototype diagnosis appears to be a viable system for DSM-V and ICD-11 with exceptional utility for research and clinical practice.
Despite considerable efforts towards developing and refining diagnostic categories and criteria for the DSM and ICD diagnostic systems, very little research has focused on how best to implement those criteria for making diagnoses in clinical practice. The architects of DSM-III abandoned the approach taken in DSM-I and -II (descriptive paragraphs defining disorders, which clinicians diagnosed as present or absent), which lacked empirically derived diagnostic criteria, reliability across clinicians and sites, and formal decision rules for applying the diagnostic categories to individual patients. After it became clear that most diagnoses are not “classical” categories, in which category membership requires all members of a category (in this case, patients) to share a fixed set of defining features, the architects of subsequent editions of the manual switched to the familiar “polythetic” criterion approach, in which a patient can receive a diagnosis by crossing a threshold of features (e.g., 5 of 9 for Borderline Personality Disorder) that are neither necessary nor sufficient for diagnosis (except for some Axis I disorders, such as post-traumatic stress disorder (PTSD), which require certain features to be present before considering other criteria).
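The polythetic rule can be made concrete with a minimal Python sketch that applies a 5-of-9 threshold to two hypothetical patients. The function name, the patient data, and the boolean encoding of criteria are illustrative assumptions; they are not part of any DSM procedure or of this study.

```python
def meets_polythetic_threshold(criteria_met, threshold=5):
    """Return True when the number of criteria met reaches the threshold."""
    return sum(criteria_met) >= threshold


# Hypothetical patients rated on nine criteria (True = criterion met).
patient_a = [True] * 5 + [False] * 4   # meets criteria 1-5
patient_b = [False] * 4 + [True] * 5   # meets criteria 5-9

# Criteria the two patients have in common (1-indexed).
shared = [i + 1 for i, (a, b) in enumerate(zip(patient_a, patient_b)) if a and b]

print(meets_polythetic_threshold(patient_a))  # True
print(meets_polythetic_threshold(patient_b))  # True
print(shared)                                 # [5]
```

Both hypothetical patients cross the threshold while sharing only one criterion, which illustrates why no single feature is necessary or sufficient under a polythetic rule.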
The diagnostic procedures used since DSM-III have improved research diagnosis (using structured interviews) and made possible the explosion of research since 1980. However, they remain mostly untested against alternative approaches, particularly in clinical practice, and their problems have gradually become clear (for a summary, see Ortigo, Bradley, & Westen, 2010; Westen, Heim, Morrison, Patterson, & Campbell, 2002). For example, diagnostic overlap produces spuriously high estimates of comorbidity (with most patients who receive one PD diagnosis by structured interview receiving as many as four to six), and for PDs, as with most other disorders, more patients receive not otherwise specified (NOS) diagnoses. A lack of diagnostic specificity hinders both research and practice. The problem of comorbidity is related to the proliferation of NOS and new diagnoses, because every time researchers place parameters on a category (e.g., number of criteria required for diagnosis), subthreshold or otherwise not-quite-present variants are identified. Equally problematic, the shift to clear diagnostic criteria and cutoffs since DSM-III has not improved inter-rater agreement (reliability) in clinical practice or field trials (Zimmerman, 1994), in part because disorders defined as lists of distinct criteria are difficult to remember, require dichotomous (and unreliable) judgments about each criterion, are cumbersome in clinical practice (and hence tend not to be used), and have not proven useful to practitioners, for whom the question of whether a patient meets four or five criteria for BPD is not particularly relevant (Rottman, Ahn, Sanislow, & Kim, 2009). 
Further, and related to all of these problems, is the now overwhelming evidence that most disorders are distributed continuously rather than categorically in nature, suggesting the importance of considering dimensional approaches to diagnosis for both Axis I and Axis II disorders (e.g., Brown, Chorpita, & Barlow, 1998; Krueger, et al., 2002; Widiger & Clark, 1999). Indeed, dimensional diagnosis has been targeted as one of the major research priorities for DSM-V, starting with Axis II (see Kupfer, First, & Regier, 2002; Rounsaville, et al., 2002), and the use of some form of dimensional system appears almost certain (Skodol & Bender, 2009).
An important question, however, is how to implement dimensional diagnosis. One possibility which has become the norm in PD research is simply to sum the number of diagnostic criteria met for each disorder. The advantage of dimensionalizing current criteria is continuity with the current diagnostic approach. The disadvantage is that clinicians find DSM diagnosis cumbersome already (e.g., Jampala, Sierles, & Taylor, 1988). Expecting them to count criteria across dozens of dimensions is thus unrealistic. In fact, a growing body of literature suggests that clinicians find both categorical diagnosis and a dimensionalized symptom counting approach to be clinically unworkable across several dimensions of clinical utility (Rottman, et al., 2009; Spitzer, Shedler, Westen, & Skodol, 2008; Westen, Shedler, & Bradley, 2006).
Elsewhere we have proposed a prototype matching approach for diagnosis designed to maximize diagnostic accuracy while taking into consideration the cognitive characteristics of human clinicians (Westen & Bradley, 2005; Westen, et al., 2002; Westen & Shedler, 2000). Using this procedure, clinicians rate the overall similarity or “match” between a patient and the prototype using a 5-point scale, taking the prototype as a whole rather than counting individual symptoms (see Figure 1). Prototypes consist of paragraph-long descriptions of each disorder rather than lists of circumscribed criteria. This format permits inclusion of more and richer diagnostic criteria and allows for organization of criteria in ways that facilitate memory. Rather than memorize symptom lists with arbitrary and variable cutoffs across disorders, diagnosticians can form mental representations of coherent syndromes, in which signs and symptoms may be linked by meaningful functional relations (Ahn, 1999).
Prototype diagnosis has several advantages. For example, it generates both categorical and dimensional diagnoses, overcoming a significant limitation of many forms of dimensional diagnosis (Rounsaville, et al., 2002): For purposes of communication, ratings of 4 or 5 denote a categorical diagnosis (“caseness”), and “3” translates to “features” or subthreshold pathology. The method parallels diagnosis in many areas of medicine, where variables such as blood pressure are measured on a continuum but physicians refer to certain ranges as “borderline” or “high.” In addition, a prototype matching method more closely resembles how the brain actually works. Cognitive science research on classification processes indicates that human thinking naturally relies on forms of cognitive prototype matching (Cantor & Genero, 1986; Horowitz, Post, de Sales French, Wallis, & Siegelman, 1981; Horowitz, Wright, Lowenstein, & Parad, 1981; Kim & Ahn, 2002).
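The translation from a dimensional rating to a categorical label can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scoring software: the function name, the label strings, the label used for ratings of 1 or 2 (which the text does not specify), and the example profile are all assumptions.

```python
def categorical_label(rating):
    """Translate a 1-5 prototype match rating into a communication label."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    if rating >= 4:
        return "diagnosis"      # ratings of 4 or 5 denote "caseness"
    if rating == 3:
        return "features"       # subthreshold pathology
    return "no diagnosis"       # assumed label for ratings of 1 or 2


# Hypothetical diagnostic profile: one rating per prototype.
profile = {"narcissistic": 4, "histrionic": 3, "paranoid": 1}
labels = {name: categorical_label(r) for name, r in profile.items()}
print(labels)
```

The same profile thus carries both kinds of information: the raw ratings serve as the dimensional diagnosis, and the derived labels serve as the categorical one.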
In a series of recently completed studies of personality, mood, anxiety, eating, and adolescent diagnoses, prototype diagnoses correlated highly with, and have similar correlates to, both categorical and dimensional diagnoses obtained by summing DSM-IV criteria or using self-report measures of specific syndromes (Ortigo, et al., 2010; Westen, et al., 2006). With respect to PDs, across several studies from three different research teams, clinicians who applied different diagnostic approaches to a real patient in their care rated prototype diagnosis substantially more useful, comprehensive, and clinically efficient than DSM-IV diagnosis and various other dimensional alternatives (Rottman, et al., 2009; Spitzer, et al., 2008; Westen, et al., 2006).
An important question, however, is whether clinicians can make prototype diagnoses reliably, particularly in everyday practice, where reliability of diagnosis remains poor. The present study was designed to address this question.
Participants (N = 65) were nonpsychotic patients seeking outpatient treatment from a community-based clinic (Hilsenroth, 2007). Patients were 80.0% female, with a mean age of 29.4 (SD 11.6) and GAF of 59.3 (SD 5.4). 75.4% were single, with the remainder married, divorced, or widowed. All had at least one Axis I diagnosis (M = 1.69), the most common being Mood (44.6%), Anxiety (23.1%), and Adjustment (10.8%) Disorders. The majority of patients had a PD diagnosis (63.1%), approximately half with Cluster B and half with either Cluster A or C diagnoses. After the study had been fully described to the participants, written informed consent was obtained.
The clinician-raters who conducted the psychological assessment, feedback sessions, and prototype ratings were advanced doctoral students enrolled in an accredited clinical psychology Ph.D. program. Each clinician-rater received a minimum of 3.5 hours of supervision per week (1.5 hours individual, 2 hours group) from a licensed Clinical Psychologist on the therapeutic assessment model/process, scoring/interpretation of assessment measures, presentation/organization of collaborative feedback, clinical interventions (for the clinicians who conducted the assessment procedure, who in all cases were the patient’s psychotherapist), and review of videotaped case material.
Each patient was assigned to a member of a psychotherapy treatment team in an ecologically valid manner based on clinician availability and caseload. Patient evaluations were conducted using standard clinical interviewing methods traditionally used in private practice, but included three meetings totaling approximately 4.5 hours (as well as one independent patient appointment to complete a battery of self-report measures). The procedures were standardized more than is typically the case in clinical practice given that this is both a training clinic and a research clinic focusing on naturalistic psychotherapy research. The assessment procedure was videotaped and included both systematic clinical interviewing about the patient’s life history and symptoms (Westen & Muderrisoglu, 2003, 2006) as well as collaborative feedback (Finn & Tonsager, 1992; Finn & Tonsager, 1997; Fischer, 1994). Further details of the measures, methodology and procedures utilized in this assessment process are described more fully elsewhere (Hilsenroth, 2007; Peters, Hilsenroth, Eudell-Simmons, Blagys, & Handler, 2006).
For the present study, treating clinicians diagnosed each patient’s personality pathology at the end of this assessment procedure. External raters were drawn from the same pool of clinicians and, in some cases, included the study supervisor.
Clinicians and independent evaluators diagnosed the patient using a version of the PD prototype rating system depicted in Figure 1. Prototypes were empirically derived from data provided by a large national sample of experienced clinicians who used a 200-item Q-sort instrument to describe a specific PD patient in their care, by applying a statistical procedure (Q factor analysis) to the data to identify empirically distinct diagnostic groupings (Westen & Shedler, 1999). The procedure generated 7 primary diagnoses: dysphoric, antisocial, schizoid, paranoid, histrionic, obsessional, and narcissistic. The dysphoric diagnosis included five subtypes: avoidant, high-functioning depressive, emotionally dysregulated (borderline), and hostile-oppositional (a variant of passive-aggressive PD).
We constructed paragraph-long prototypes of each diagnosis by weaving together the items (criteria) most empirically descriptive of each into paragraph form (Westen, et al., 2006), grouping together functionally or thematically similar items for ease of clinical use (see Figure 1). The advantage of using these empirically derived diagnostic prototypes for the present purposes is that they were designed to be nonredundant; hence, poor discriminability among the prototypes would not be attributable to comorbidity inherent in the current axis II criterion sets.
Table 1 presents the correlations between clinician and external evaluator prototype ratings for the primary empirically derived disorders, with the exception of Dysphoric (whose five subtypes were rated separately as prototypes). Table 2 presents correlations for the five Dysphoric subtypes. (The rating system at the time of this study had 7 scale points rather than 5; however, for the present study we collapsed the data to 5 scale points for continuity with other research. The findings were essentially identical when we analyzed them using all 7 scale points.)
As can be seen from the two tables, clinicians were able to make highly reliable and discriminating judgments, with median inter-rater reliability for the primary PDs and subtypes of r = .72 and .74, respectively. Median correlations off the diagonal, which represent correlations between disorders designed to be relatively low in overlap, were .17 and .11, respectively. In other words, two independent clinicians tended to see patients much the same way, agreeing on the extent to which they matched the same prototypes and producing low correlations between unrelated diagnoses.
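The logic of comparing on-diagonal with off-diagonal correlations can be illustrated with a minimal Python sketch. It assumes each rater's ratings are stored per prototype in the same patient order; the function names are hypothetical and the ratings are invented toy values, not study data. It is not the analysis code used for Tables 1 and 2.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def reliability_summary(clinician, evaluator):
    """Return (median on-diagonal r, median off-diagonal r).

    On-diagonal: both raters rate the same prototype (agreement).
    Off-diagonal: one rater's prototype i against the other's prototype j.
    """
    names = list(clinician)
    on_diag, off_diag = [], []
    for i in names:
        for j in names:
            r = pearson(clinician[i], evaluator[j])
            (on_diag if i == j else off_diag).append(r)
    return median(on_diag), median(off_diag)


# Invented toy ratings for five patients on three prototypes (1-5 scale).
clinician = {
    "antisocial": [1, 2, 5, 1, 3],
    "schizoid": [4, 1, 1, 2, 1],
    "paranoid": [1, 3, 1, 5, 2],
}
evaluator = {
    "antisocial": [1, 2, 4, 1, 3],
    "schizoid": [5, 1, 1, 2, 1],
    "paranoid": [1, 3, 2, 4, 2],
}

median_agreement, median_cross = reliability_summary(clinician, evaluator)
print(round(median_agreement, 2), round(median_cross, 2))
```

A high median on the diagonal together with a low median off the diagonal is the pattern described above: raters agree on the same prototype and do not confuse it with unrelated ones.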
The data provide strong evidence for the inter-rater reliability of prototype diagnosis of PDs in clinical practice. Whereas field trials and inter-interview studies comparing diagnoses made by different structured interviews and questionnaires for the same PDs administered days or a few weeks apart have shown low correlations and even lower kappa coefficients indicating convergent diagnoses (Clark, Livesley, & Morey, 1997; Pilkonis, et al., 1995; Skodol, Oldham, Rosnick, Kellman, & Hyler, 1991), we found high correlations in the range of r = .70 between two assessments made by independent assessors from naturalistic clinical hours. Prototype diagnoses also demonstrated strong differentiation across disorders (what might be called discriminant inter-rater reliability), something not seen in previous research. Although elsewhere we have discussed whether prototype matching might be useful for research diagnosis or for diagnosis of axis I disorders as well (Westen & Bradley, 2005; Westen, et al., 2002), what the data here suggest is that prototype diagnosis offers a promising alternative method for personality diagnosis in clinical practice.
The major limitation of the study is that the sample was relatively small and the clinicians were relatively inexperienced and drawn from the same clinical training pool. Two considerations, however, mitigate these limitations. First, the limitations would favor null findings. For example, inexperienced clinicians would likely have more difficulty using diagnoses other than the more familiar DSM-IV categories as well as the DSM-IV diagnostic approach, and their limited clinical experience would render them less likely to converge on diagnostic impressions following an interviewing procedure that is relatively open-ended, focusing on the patient’s life history as a way of exploring ongoing and enduring personality dynamics. Second, the effects were large and significant, even with this sample size, and the small magnitude of the correlations off the diagonal (i.e., between unrelated or minimally related diagnoses) relative to those on the diagonal (showing diagnostic agreement) clearly demonstrated that even inexperienced clinicians could make highly specific diagnostic judgments when evaluating the same patient using the kinds of data experienced clinicians collect over the course of initial interviews in clinical practice (which the assessment procedure was intended to simulate and standardize).
A second limitation is that clinicians were rating empirically derived diagnoses rather than prototypes of the current axis II disorders. Three considerations, however, limit this concern. First, as with the first limitation, this one would reduce inter-clinician agreement, rendering the findings more conservative, given that clinicians were matching patients to diagnoses with which they were unfamiliar. Second, prior research has found that the four empirically derived disorders tested here that resemble Cluster B disorders have similar correlates as the DSM-IV Cluster B PDs and hence are likely to be a reasonable proxy for them.
Third, although DSM-V appears increasingly likely to include prototype matching as an approach to dimensionalizing PD diagnosis, it is unlikely to use the current diagnoses precisely as they are configured now, given their many limitations, such as comorbidity (Skodol & Bender, 2009). Thus, the approach tested here, with rich, empirically derived prototypes, is at least as likely as a version of the current diagnoses woven into prototype form to resemble prototype diagnosis in DSM-V. Indeed, the prototypes tested here provide a model of the kind of prototypes that might be useful in DSM-V, in that they are empirically derived rather than clinically constructed by committee. Spitzer and colleagues (Spitzer, et al., 2008) used these empirically derived PD prototypes in their comparative study of clinical utility and found that clinicians preferred even these unfamiliar prototypes over prototypes derived from the current axis II disorders because of their clinical richness of description. Westen and colleagues (Westen, et al., 2006) compared the four of these prototypes most comparable to the DSM-IV Cluster B disorders (antisocial-psychopathic, emotionally dysregulated, histrionic, and narcissistic) as well as prototypes of the four Cluster B disorders to DSM-IV diagnoses and found similar results. One of the advantages of prototype diagnosis is that it can also include more, and clinically richer, criteria than the eight to nine criteria per disorder included in the current diagnostic system, because clinicians do not have to make independent judgments on each criterion; rather, they make a single prototypicality judgment on each diagnosis taken as a gestalt.
Finally, a major question often raised about a prototype matching approach to diagnosis is whether it is a “throwback” to DSM-II (i.e., a return to paragraph-length diagnoses). That question, however, misses the point. Prototype diagnosis has the parsimony of DSM-II diagnosis but lacks its disadvantages. Although the format of the diagnostic prototypes may superficially resemble the format of the diagnostic paragraphs in the first two editions of the DSM, this approach to diagnosis differs from the early diagnostic manuals in several key respects: (1) the diagnostic criteria (and in this case, the diagnoses themselves) are entirely empirically derived, not rationally or clinically derived, as in DSM-I and -II; (2) the diagnoses are not laden with causal clinical hypotheses of the 1930s and 1940s; and (3) most importantly, clinicians are not making idiosyncratic dichotomous characterizations of patients as either having or not having a disorder, which would likely be as unreliable as dichotomous judgments about prototypes. Rather, clinicians are taking into account all available data and making a judgment of the extent to which the patient matches an empirically derived prototype.
Clinicians are reluctant to implement the existing axis II diagnostic system with its laundry list of symptoms, cumbersome algorithms, overlapping criteria, and descriptive vagaries. Prototype matching, on the other hand, allows for rich descriptions of personality constructs without an exorbitant clinical effort. Using a prototype system, clinicians could briefly and efficiently (within one or two minutes) make an axis II diagnosis, generating a diagnostic profile that indicates for each disorder both the extent to which the patient resembles the prototype and whether the patient matches the prototype strongly enough to receive a categorical diagnosis. Empirically, the results of this study generated extremely high estimates of cross-clinician reliability and lower cross-correlations with unrelated disorders than we have seen in any PD study to date. The prototype diagnostic system utilized in this study offers clinically-rich diagnostic descriptions which are not only reliably observable across clinicians but also highly discriminative. Narrative diagnostic descriptions allow for improved treatment planning and clinical training while reliable and distinctive diagnoses increase the efficiency of clinical communication and co-ordination across providers. Indeed, clinicians find prototype diagnosis preferable to alternative approaches across a range of clinical utility variables including comprehensiveness, ease of implementation, enhancement of treatment planning, and clarity of communication with mental health providers as well as patients (Rottman, et al., 2009; Spitzer, First, & Skodol, 2006).
At this point, given the consistent evidence of the validity, clinical utility (First, et al., 2004), and now inter-clinician reliability of prototype diagnosis for PDs, we would recommend that DSM-V incorporate prototype matching as the primary method of diagnosing personality constellations, given the likelihood that the Work Group appears headed toward maintaining a constellational approach that is likely to be supplemented by other approaches, such as trait diagnosis (Skodol & Bender, 2009). We would also recommend, based on these and other data on Axis I disorders (Ortigo, et al., 2010), that the ICD-11 consider prototype diagnosis for all disorders for clinical practice, and that the architects of both the ICD-11 and DSM-V undertake research to test whether prototype matching may be a workable approach for clinical practice and research for all clinical disorders, not only PDs.
Preparation of this article was supported in part by NIMH grants MH62377, MH62378, and MH078100.
1. Jonathan Shedler and Drew Westen are the copyright holders of the Shedler-Westen Assessment Procedure (SWAP-II).
DREW WESTEN is a clinical, personality, and political psychologist and neuroscientist, and Professor in the Departments of Psychology and Psychiatry at Emory University. He completed his PhD in clinical psychology at the University of Michigan and formerly taught at the University of Michigan, Harvard Medical School, and Boston University. His areas of specialization are in the field of personality disorder diagnosis and treatment, psychopathology, and political decision making.
JARED A. DeFIFE is a clinical psychology research scientist at Emory University and Associate Director of the Laboratory of Personality and Psychopathology. He earned his master’s and PhD degrees in clinical psychology at Adelphi University and was a clinical fellow at Harvard Medical School. He specializes in the study of personality, mood disorders, and psychotherapy.
BEKH BRADLEY is director of the Trauma Recovery Program at the Atlanta Veterans Affairs Medical Center and an Assistant Professor in the Department of Psychiatry at Emory University. He earned his PhD in clinical community psychology at the University of South Carolina and specializes in the study of PTSD, trauma, and personality.
MARK J. HILSENROTH received his PhD in Clinical Psychology from the University of Tennessee and a Diplomate from the American Board of Assessment Psychology (ABAP). He is currently Professor of Psychology at the Derner Institute of Advanced Psychological Studies, Adelphi University. His areas of research interest are personality assessment, training/supervision, psychotherapy process, and treatment outcomes.
Publisher's Disclaimer: The following manuscript is the final accepted manuscript. It has not been subjected to the final copyediting, fact-checking, and proofreading required for formal publication. It is not the definitive, publisher-authenticated version. The American Psychological Association and its Council of Editors disclaim any responsibility or liabilities for errors or omissions of this manuscript version, any version derived from this manuscript by NIH, or other third parties. The published version is available at www.apa.org/pubs/journals/pro
Drew Westen, Departments of Psychology and Psychiatry and Behavioral Sciences, Emory University.
Jared A. DeFife, Department of Psychology, Emory University.
Bekh Bradley, Department of Psychiatry, Emory University, Atlanta Veterans Administration Hospital.
Mark J. Hilsenroth, Derner Institute of Advanced Psychological Studies, Adelphi University.