In psychotherapy, the patient-provider interaction contains the treatment’s active ingredients. However, the technology for analyzing the content of this interaction has not fundamentally changed in decades, limiting both the scale and specificity of psychotherapy research. New methods are required in order to “scale up” to larger evaluation tasks and “drill down” into the raw linguistic data of patient-therapist interactions. In the current paper we demonstrate the utility of statistical text analysis models called topic models for discovering the underlying linguistic structure in psychotherapy. Topic models identify semantic themes (or topics) in a collection of documents (here, transcripts). We used topic models to summarize and visualize 1,553 psychotherapy and drug therapy (i.e., medication management) transcripts. Results showed that topic models identified clinically relevant content, including affective, content, and intervention related topics. In addition, topic models learned to identify specific types of therapist statements associated with treatment related codes (e.g., different treatment approaches, patient-therapist discussions about the therapeutic relationship). Visualizations of semantic similarity across sessions indicate that topic models identify content that discriminates between broad classes of therapy (e.g., cognitive behavioral therapy vs. psychodynamic therapy). Finally, predictive modeling demonstrated that topic model derived features can classify therapy type with a high degree of accuracy. Computational psychotherapy research has the potential to scale up the study of psychotherapy to thousands of sessions at a time, and we conclude by discussing the implications of computational methods such as topic models for the future of psychotherapy research and practice.
“I believe that some aspects of psychoanalytic theory are not presently researchable because the intermediate technology required … does not exist. I mean auxiliaries and methods such as a souped-up, highly developed science of psycholinguistics, and the kind of mathematics that is needed to conduct a rigorous but clinically sensitive and psychoanalytically realistic job of theme tracing in the analytic protocol” (Meehl, 1978, p. 830).
Advances in technology have revolutionized research in much of psychology and healthcare, including major developments in pharmacology, neuroscience, and genetics. Yet, the science of patient-therapist interactions – the core of psychotherapy process research – has remained fundamentally unchanged for 70 years. Patients fill out surveys, or human coders rate some aspect of the interaction. Thus, while psychiatric and psychological guidelines recommend psychotherapy as a first line treatment for a number of mental disorders (APA, 2006), we still know relatively little about how psychotherapy works. As Meehl noted, existing research methods remain limited in their ability to explore the structure of verbal exchanges that are the essence of most psychotherapy. In the current paper, we move towards an answer to Meehl’s request for a “souped up mathematics” to mine the raw linguistic data of psychotherapy interactions. In traditional research on psychotherapy, human judgment and related behavioral coding are the rate-limiting factor. In this paper, we introduce a computational approach to psychotherapy research that is informed by traditional methods (e.g., behavioral coding) but does not rely on them as the primary data source. The key innovation in this computational approach is drawing on methods from computer science and machine learning that allow the direct, statistical analysis of session content, scaling up research to thousands of sessions.
Some estimates indicate that there are over 400 different name brand psychotherapies (Lambert, 2013), and each treatment offers a different approach to helping patients with psychological distress. While the clinical rationales and approaches differ, it is not clear that actual practices of these psychotherapies are meaningfully distinct. Potential differences in the process and outcome of psychotherapies have been a focus of psychotherapy science for over a century. As a comparison, there are many different drug therapies. However, the unique ingredients of treatments are chemical (and patentable). Thus, the actual distinctiveness of treatments is known, even if the specific mechanism of action or relative efficacy is not. In psychotherapy, the treatment consists primarily of words, and although cognitive behavioral (CBT) oriented treatments might focus strongly on patient behavior, the treatment is still verbally mediated (Wampold, 2007). Accordingly, scientific classification of treatments is more nebulous. What is considered a ‘taxon’ of cognitive behavioral therapy may vary widely across experts and practitioners, with some definitions so broad as to include any scientifically justifiable intervention and others restricted to very specific psychological mechanisms (see Baardseth et al., 2013). This ambiguity is quite old, reaching back to debates between Freud and his early followers, and can be found in current research comparing various cognitive behavioral psychotherapies and modern variants of psychoanalysis (e.g., psychodynamic psychotherapy; Leichsenring et al., 2013).
Some have argued that differences between psychotherapies are cosmetic (like the difference between generic ibuprofen and Advil) and that the underlying mechanisms of action are common across different approaches (Wampold, 2001). Meta-analyses generally suggest that most treatment approaches are of comparable efficacy (e.g., Benish, Imel, & Wampold, 2008; Imel, Wampold, Miller, & Fleming, 2008), and process studies cast doubt on the relationship between treatment-specific therapist behaviors and patient outcomes (Webb, DeRubeis, & Barber, 2010). One leading addiction researcher commented that, “… there is little evidence that treatments work as purported, suggesting that as of yet, we don’t know much about how brand name therapies work” (Morgenstern & McKay, 2007, p. 87S). Are the 400 psychotherapies we have today unique, medical treatments? Or, are the different psychotherapies largely similar, distinguished by packaging that obscures what are mostly common components?
Given that psychotherapy is a conversation between patient and provider, the distinctiveness of a therapy approach should be found in the words patients and therapists use during their sessions. Yet, this is precisely where we find a fundamental methodological gap in psychotherapy research. The source data and information are linguistic and semantic, but the available tools used to study psychotherapy are not. Research on the active ingredients of psychotherapy has primarily relied on patient or therapist self-report measures (e.g., see reviews of empathy and alliance literature; Elliott, Bohart, Watson, & Greenberg, 2011; Horvath, Del Re, Flückiger, & Symonds, 2011) or on behavioral coding systems, wherein human “coders” make ratings from audio or video recordings of the intervention session according to a priori theory-specific criteria (Crits-Christoph, Gibbons, & Mukherjee, 2013).
Attempts at behavioral coding have varied in their depth from general, topographical assessments of the session such as those used in many Cognitive Behavioral Treatments (e.g., did the therapist ask about homework or set an agenda?) to highly detailed utterance level coding systems (e.g., verbal response modes; Stiles, Shapiro, & Firth-Cozens, 1988; Motivational Interviewing Skills Code; Moyers, Miller, & Hendrickson, 2005). However, behavioral coding as a technology has not fundamentally changed since Carl Rogers first recorded a psychotherapy session in the 1940s (Kirschenbaum, 2004), and coding carries a number of disadvantages. It is extremely time consuming, and reliability can be problematic to establish and maintain. In addition, there is no potential for human coding to scale up to larger applications (i.e., coding 1000 sessions takes 1000 times longer than coding 1 session, thus monitoring the quality of psychotherapy in a large scale naturalistic setting is not feasible over time). There is little flexibility – coding systems only code what they code. They must be developed a priori and cannot discover new meaning not specified in advance by the researcher. More substantively, coding systems are by nature extremely reductionistic – reducing the highly complex structure of natural human dialogue to a small number of behavioral codes.
Given these limitations, it is not surprising that the vast majority of raw data from psychotherapy is never analyzed and questions central to psychotherapy science remain either unanswered or impractical to address. Most content analyses of what patients and therapists actually discuss in psychotherapy are restricted to qualitative efforts that can be rich in content but by their nature are small in scope (e.g., Greenberg & Newman, 1996). While qualitative work remains important, the labor intensiveness of closely reading session content means that the vast majority of psychotherapy data is never analyzed. Consequently, the majority of psychotherapy studies are published without any detail as to what the specific conversations between patients and therapists actually entailed. Beyond the general theoretical description of the treatment outlined in manuals, what did the patients and therapists actually say? Are the different psychotherapies we have today linguistically unique? Or, do therapists who provide different name brand therapies say largely similar things? What specific therapist interventions, and in what combination are most predictive of good vs. bad outcomes? These basic questions form the backdrop of every therapist’s work, but have been impractical to consider given the current technology of behavioral coding and qualitative analysis.
A critical task for the next generation of psychotherapy research is to move beyond the use of behavioral coding to mine the raw verbal exchanges that are the core of psychotherapy, including acoustic and semantic content of what is said by patients and therapists. The use of discovery-oriented machine learning procedures offers new ways of exploring and categorizing psychotherapies based on the actual text of the patient and therapist speech.
The amount of data generated every day (e.g., digitized books, email, video, newspapers, blog posts, Twitter, electronic medical records, cell phone calls) has expanded exponentially in the last decade with implications for business, government, science, and the humanities (Hilbert & Lopez, 2011). Developments in data-mining procedures have revolutionized our ability to analyze and understand this vast amount of information, particularly in the area of text – sometimes called “computational linguistics” or “statistical text classification” (Manning & Schütze, 1999). The Google Books “n-gram” server (https://books.google.com/ngrams) allows for the evaluation of trends in single words (i.e., unigrams) or word combinations (bigrams, trigrams) in books. A recent paper analyzed words in 4% of all books (5,195,769 volumes), showing that patterns of emotion word use tracked in expected directions with major historical events (e.g., a sad peak during World War II; Acerbi, Lampos, Garnett, & Bentley, 2013).
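To make the unigram and bigram terminology concrete, the following is a minimal sketch of n-gram counting. It is not the method used by the n-gram server or by any study cited here; the helper name `ngrams` and the toy sentence are hypothetical, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy sentence, not drawn from any corpus discussed in the text.
text = "i feel sad and i feel tired"
tokens = text.split()  # naive whitespace tokenization (an assumption)

unigram_counts = Counter(ngrams(tokens, 1))  # single words
bigram_counts = Counter(ngrams(tokens, 2))   # adjacent word pairs
trigrams = ngrams(tokens, 3)                 # adjacent word triples
```

Tracking a trend over time would then amount to computing such counts separately for each time period and comparing their relative frequencies.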
There is a small literature that demonstrates the utility of computational linguistic approaches for the analysis of psychotherapy data. The majority of these studies rely on human defined computerized dictionaries in which a software program classifies words or sets of words into predefined categories. In an early study Reynes, Martindale, and Dahl (1984) found that “linguistic diversity” was higher in more productive sessions. In addition, Mergenthaler and his colleagues have published several studies demonstrating that emotion and abstraction word usage discriminates between improved and un-improved cases (e.g., Mergenthaler, 2008; see also Anderson, Bein, Pinnell, & Strupp, 1999). Studies that have used dictionary-based strategies hold promise, but also have important limitations. First, perhaps because large corpora of psychotherapy transcripts are hard to find, these studies have generally been limited in scope (n < 100), reducing the value added of a computerized technology that can evaluate a large set of sessions (i.e., 1,000 or 10,000) in a short amount of time. Second, computerized dictionaries are limited by the categories created by humans – the computer cannot ‘learn’ new categories. Finally, dictionaries cannot generally accommodate the effect of context on semantic meaning (e.g., “dark” may reference a mood or the sky at night).
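The dictionary-based strategy and its context limitation can be sketched in a few lines. This is an illustration only, not any published dictionary or software: the category names and word lists in `CATEGORY_WORDS` are hypothetical, and real dictionaries are far larger.

```python
from collections import Counter

# Hypothetical mini-dictionary; the categories are fixed in advance by a human.
CATEGORY_WORDS = {
    "negative_emotion": {"sad", "dark", "hopeless"},
    "positive_emotion": {"happy", "calm"},
}

def category_counts(text):
    """Count how many tokens in the text fall into each predefined category."""
    tokens = text.lower().split()
    counts = Counter()
    for category, words in CATEGORY_WORDS.items():
        counts[category] = sum(1 for token in tokens if token in words)
    return counts

# "dark" is scored as negative emotion in both sentences, because the
# lookup ignores the surrounding words.
mood = category_counts("my mood has been dark and hopeless")
sky = category_counts("it was dark when i drove home")
```

Both sentences receive a negative-emotion count for "dark", although only the first refers to mood, and no new category can emerge from the data.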
One specific text-mining approach that holds promise for psychotherapy transcript data is the topic model (also called Latent Dirichlet Allocation; Blei, Ng, & Jordan, 2003). Topic models are data-driven, machine learning procedures that seek to identify semantic similarity among groups of words. Similar to factor analysis in which observed item values are functions of underlying dimensions, topic models view the observed words in a passage of text as a mixture of underlying semantic topics. An advantage of topic models is that they construct a linguistic structure from a set of documents inductively, requiring no external input, but can also be utilized in a supervised fashion to learn semantic content associated with particular codes or metadata (where metadata is any data outside of the text itself; Steyvers & Griffiths, 2007). There is recent work using these models to explore the structure of National Institutes of Health grant applications (Talley et al., 2011), publications from the Proceedings of the National Academy of Sciences (Griffiths & Steyvers, 2004), articles from the New York Times (Rubin, Chambers, Smyth, & Steyvers, 2011), and the identity of scientific authors (Rosen-Zvi, Chemudugunta, & Griffiths, 2010). Perhaps more strikingly, topic models have been used in the humanities to facilitate “distant reading” in comparative literature such that hypotheses in literary criticism can be tested vis-à-vis the entire corpus of relevant work (e.g., exploring stylistic similarities in poems; see Kao & Jurafsky, 2012; Kaplan & Blei, 2007).
With a few exceptions, topic models have yet to be applied to psychotherapy data (see Atkins et al., 2012 and also Salvatore et al., 2012, who used a derivative of latent semantic analysis, a forerunner to topic models; Landauer & Dumais, 1997). However, similar to the news articles, novels, and poems noted above, the words used during psychotherapy sessions by patients and therapists can be viewed as a large collection of text with a complex topical structure. The number of words generated during psychotherapy is quite large. A brief course of psychotherapy for a given patient may consist of 5-10 hours of unstructured dialogue including 12,000-15,000 words per hour (approximately 60,000 to 150,000 words; longer courses of treatment can exceed 1 million words). In 2011, a PubMed search revealed 932 citations for psychotherapy clinical trials (out of 10,698 across all years). As a conservative estimate, if we consider 500 studies per year, 50 participants per study, 5 sessions per participant, and 10,000 words per session, this leads to an estimate of 1.25 billion words of psychotherapy text per year from clinical trials alone (500 × 50 × 5 × 10,000). Regardless of the specific estimate, it is clear that a huge amount of psychotherapy data is generated every year and that this number is likely to increase. The use of discovery oriented text mining procedures such as topic models could facilitate new ways of exploring and categorizing psychotherapies based on the actual content of the patient and therapist speech (rather than labels established by schools of psychotherapy).
To evaluate the potential of topic models to “learn” the language of psychotherapy, we applied two different types of topic models to transcripts from 1,553 psychotherapy and psychiatric medication management sessions. Our first goal was to verify that topic models would estimate clinically relevant semantic content in our corpus of therapy transcripts. Second, we determined if semi-supervised models could identify semantically distinctive content from different treatment approaches and interventions (e.g., therapist “here and now” process comments about the therapeutic relationship within a session). A third aim was to explore the overall linguistic similarity and distinctiveness of sessions from different treatment types (e.g., psychodynamic vs. humanistic/experiential). Our final goal was to classify treatment types of new psychotherapy sessions automatically, using only the words used during the session.
The data for the current study come from two different sources: 1) a large, general psychotherapy corpus that includes sessions from a diverse array of therapies, and 2) a set of transcripts focused on Motivational Interviewing, a specific form of cognitive behavioral psychotherapy for alcohol and substance abuse.
The general corpus holds 1,398 psychotherapy and drug therapy (i.e., medication management) transcripts (approximately 2.0 million talk turns, 8.3 million word tokens including punctuation) pulled from multiple theoretical approaches (e.g., Cognitive Behavioral; Psychoanalysis; Motivational Interviewing; Brief Relational Therapy). The corpus is maintained and updated by the ‘Alexander Street Press’ (http://alexanderstreet.com/) and made available via library subscription. In addition to transcripts, there is associated metadata such as patient ID, therapist ID, limited demographics, session numbers when there was more than a single session, therapeutic approach, patient’s primary symptoms, and a list of subjects discussed in the session.
The list of symptoms and subjects was assigned by publication staff to each transcript, and no inter-rater reliability statistics were available. All labels were derived from the DSM-IV and other primary psychology/psychiatry texts. Many sessions were conducted by prominent psychotherapists who developed particular treatment approaches (e.g., James Bugental, Existential; Albert Ellis, Rational Emotive; Carl Rogers, Person-Centered; William Miller, Motivational Interviewing), and hence may serve as exemplars of these treatment approaches. To facilitate analysis we categorized each psychotherapy session into 1 of 5 treatment categories: 1) Psychodynamic (e.g., psychoanalysis, brief relational therapy, psychoanalytic psychotherapy), 2) Cognitive Behavioral Therapy (e.g., Rational Emotive Behavior Therapy, Motivational Interviewing, Relaxation Training, etc.), 3) Experiential/Humanistic (e.g., Person Centered, Existential), 4) Other (e.g., Adlerian, Reality Therapy, Solution Focused, as well as group, family, and marital therapies), and finally 5) Drug therapy or medication management. In some cases, when a label was missing or more than one treatment label was assigned to a session, collateral information in the metadata was used to assign a single specific treatment label (i.e., a well known therapist associated with a specific intervention, reported use of specific interventions, and/or inspection of the raw transcript). If there was no collateral information or an appropriate label could not be determined, the first listed intervention was chosen as the treatment name, or the treatment label and category were left missing. In addition to treatment category, analyses used one subject label, “counselor-client relations”. This session-level label (i.e., applied to an entire session) was assigned to a transcript when there was a discussion about the patient-therapist relationship or interaction during the therapy.
We supplemented the general corpus above with a set of MI sessions (n = 148, 30,000 talk turns, 1.0 million word tokens). Transcripts are a subset of sessions from five randomized trials of MI for drug or alcohol problems, including: problematic drinking in college freshmen (Tollison, Lee, Neighbors, & Neil, 2008), 21st birthdays and spring break (Neighbors et al., 2012), problematic marijuana use (Lee et al., 2014), and drug use in a public safety-net hospital (Krupski, Joesch, & Dunn, 2012). Each study involved one or more in-person treatment arms that received a single session of MI. Sessions were transcribed as part of ongoing research focused on applying text-mining and speech signal processing methods to MI sessions (see, e.g., Atkins et al., 2014).
The linguistic representation in our analysis consisted of the set of words in each talk turn. A part-of-speech tagger (Toutanova, Klein, & Manning, 2003) was used to analyze the types of words in each talk turn. We kept all nouns, adjectives and verbs and filtered out a number of word classes such as determiners and conjunctions (e.g., “the”, “a”) as well as pronouns. This filtering dramatically reduced the size of the corpus to 1.2M individual words across 223K talk turns. We applied a topic model with 200 topics to this data set, treating each talk turn (either patient or therapist) as a “document.” In the topic modeling literature, the document defines the level at which words with similar themes are grouped together in the raw data. We could define documents in a number of ways (e.g., all words in the session or all words from a specific person), but we have found in previous research within clinical psychology (Atkins et al., 2012) that defining documents by talk turns enhances the interpretability of the resulting topics. In a topic model, each topic is modeled as a probability distribution over words and each document (talk turn) is treated as a mixture over topics. Each topic tends to cluster together words with similar meaning and usage patterns across talk turns. The probability distribution over topics in each talk turn gives an indication of which semantic themes are most prevalent in the talk turn. For further details on topic models, see Atkins et al. (2012).
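The mixture assumption described above can be illustrated with a toy calculation: the probability of a word in a talk turn is the sum, over topics, of the word's probability under the topic times the topic's weight in that talk turn. The sketch below is not the authors' model or software and does no estimation; the two topics, their word probabilities, and the talk-turn mixture are hypothetical numbers chosen only to show the arithmetic.

```python
# Hypothetical topics: each is a probability distribution over words.
topics = {
    "depression": {"sad": 0.5, "tired": 0.3, "sleep": 0.2},
    "medication": {"dose": 0.6, "sleep": 0.4},
}

# Hypothetical talk turn: a mixture over the topics above.
talk_turn_mixture = {"depression": 0.75, "medication": 0.25}

def word_probability(word, mixture, topics):
    """P(word | talk turn) = sum over topics of P(word | topic) * P(topic | talk turn)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in mixture.items()
    )

# "sleep" receives probability from both topics; "dose" from only one.
p_sleep = word_probability("sleep", talk_turn_mixture, topics)
p_dose = word_probability("dose", talk_turn_mixture, topics)

# The word probabilities for a talk turn sum to 1 across the vocabulary.
vocabulary = {word for dist in topics.values() for word in dist}
total = sum(word_probability(w, talk_turn_mixture, topics) for w in vocabulary)
```

Fitting a topic model runs this logic in reverse: given only the observed words, it estimates the topic distributions and the per-document mixtures that make those words most probable.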
First, we used topic models to explore what therapists and patients talk about. As noted earlier, topic models estimate underlying dimensions in text, which ideally capture semantically similar content (i.e., the underlying “topics”). Thus, in applying topic models to psychotherapy transcripts, an initial question is whether the models extract relevant semantic content. Table 1 presents 20 selected topics (of 200 total) from an unsupervised topic model applied to all session transcripts (i.e., these topics were generated inductively without any input from the researchers). It is clear that the words in each topic provide semantically related content and capture aspects of the clinical encounter that we might expect therapists and patients to discuss. We have organized topics into four areas: 1) Emotions/Symptoms, 2) Relationships, 3) Treatment, and 4) Miscellaneous. Similar to factor analysis, all labels were supplied by the current authors; the model itself simply numbers them. The top 10 most probable words for each topic are provided along with author generated topic labels to aid interpretation. For example, the emotion category includes several symptom relevant topics. Topic 15 (Depression) includes many of the specific symptom criteria for depression (e.g., sadness, energy, hopelessness; the word “depression” is the 16th most probable word), and topic 149 (Anxiety) includes words relevant to the discussion of a panic attack.
The relationship category illustrates how a topic model can handle differences in meaning depending on context. Topic 146 (Sex) and 60 (Intimacy) include derivatives of the words relationship and sex. In Topic 60, these words occur in the context of words such as closeness, intimacy, connection, and open, suggesting these words had different implications than when they occur in Topic 146, which includes words such as desire, enjoy, and satisfied. The basic topic model can infer differing meaning of identical words (e.g., play used in reference to theater vs. children) as long as the documents that the words occur in have additional semantic information that would inform the distinction (Griffiths & Steyvers, 2007). In the treatment category, topic 196 includes a number of medication names and is clearly related to discussions of psychopharmacological treatment. Topic 198 (Behavior Patterns) includes words that might be typical in the examination of behavior/thought patterns (e.g., irrational, pattern, behavior, identify). We considered labeling this topic “CBT” given words that might be found in an examination of thoughts in cognitive therapy. However, we found that this topic was actually more prevalent in psychodynamic sessions as compared to CBT sessions. This finding highlights the complexities of topic models. While the model returns a cluster of words, the researcher must infer what the cluster means.
To demonstrate the utility of a topic model in the discovery of language specific to different approaches to psychotherapy, we utilized a ‘labelled’ topic model (Rubin et al., 2011) wherein the model learns language that is associated with a particular label – in the present case a session-level label that identifies the type of psychotherapy (e.g., CBT vs. Psychodynamic). We used the output from this model to identify specific therapist talk turns that were statistically representative of a given label. In the general psychotherapy corpus, there were no labels or codes for talk turns, only for the session as a whole. Given the labels for each session and the heterogeneity of word usage across sessions, the model ‘learns’ which talk turns were most likely to give rise to a particular label for the entire session.
In Table 2, we provide four highly probable talk turns for six different treatments. The depicted statements are what might be considered prototypic therapist utterances for each treatment. Client-centered talk turns appear to be reflective in nature, while utterances in rational emotive behavior therapy have a quality of identifying irrational thought patterns. Brief relational interventions focus on here and now experiences, and the selected talk turns for MI were those typical for the brief structured feedback session that therapists were trained to provide in several of the MI clinical trials included in the corpus.
Table 2 presents results from a labeled topic model using psychotherapy type as the label categorizing a session. We explored whether the model could learn more nuanced, psychological labels, focusing on “client-counselor relations” – a code that was used to label sessions that included discussions between client and therapist about their relationship/interaction. As with the identification of therapist talk turns, the client-counselor relations code was assigned to an entire transcript. Consequently, the model must learn to discriminate between language in these sessions that is irrelevant to the label (e.g., general questions, scheduling, pleasantries, other interventions, etc.) and language that involves the client and therapist talking about their relationship. Table 3 provides the five most probable therapist talk turns associated with the client-counselor relations label. Each talk turn is clearly related to a therapist making a comment about the patient-therapist interaction.
In addition to low-level identification of therapist statements, we used topic models to make high-level comparisons related to the linguistic similarity of sessions. How similar are sessions, given the semantic content identified by the topic model? We used the output from the unsupervised topic model to explore the semantic similarity of 1,318 sessions across 4 treatment categories (i.e., Medication Management, Psychodynamic, CBT, Humanistic/Existential). Specifically, it is possible to assign individual words within sessions to one of the 200 topics. The sum of the words in each topic for each session provides a session-level summary of the session’s semantic content – a model-based score on each of 200 topics for each of the 1,318 sessions.1 Given these semantic summaries of each session, we then computed a correlation matrix of each session with every other session. A high correlation between two sessions indicates similar semantic content, defined by the 200 topics of the topic model. Because a 1,318 × 1,318 matrix of correlations would be utterly unreadable, we present the correlation matrix visually using color-encoded values for the correlations.
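The two steps described above, summing word-level topic assignments into session-level scores and then correlating sessions, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: Pearson correlation is assumed (the text does not name the coefficient), the session names and topic assignments are hypothetical, and 4 topics stand in for the 200 used in the analysis.

```python
from math import sqrt

def topic_counts(word_topics, n_topics):
    """Sum the words assigned to each topic to get one score per topic."""
    counts = [0] * n_topics
    for topic_index in word_topics:
        counts[topic_index] += 1
    return counts

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical topic assignment for each word in three toy sessions.
N_TOPICS = 4
sessions = {
    "med_mgmt_1": [0, 0, 0, 1, 0],
    "med_mgmt_2": [0, 0, 1, 0, 0, 0],
    "experiential_1": [2, 3, 3, 2, 3, 1],
}

names = list(sessions)
scores = [topic_counts(sessions[name], N_TOPICS) for name in names]

# Session-by-session correlation matrix (3 x 3 here; 1,318 x 1,318 in the text).
corr = [[pearson(a, b) for b in scores] for a in scores]
```

In this toy example the two medication management sessions load on the same topic and correlate strongly, while the experiential session correlates negatively with both; a heatmap simply maps each cell of `corr` to a color.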
This style of visualization is referred to as a heatmap, as the initial versions often used red to yellow coloring to note the intensity of the numeric values. In Figure 2 the color scale on the right shows how correlation values are mapped to specific colors: Orange and red pixels represent highly correlated sessions, and blue and green pixels indicate little correlation in topic frequencies. The correlation matrix was purposefully organized by treatment category. We have highlighted several highly correlated blocks of sessions that represent (a) a set of highly structured motivational interviewing feedback sessions from a clinical trial, (b) a large number of sessions from a single case of psychoanalysis, and (c) several sessions from a single case of client-centered therapy. Sessions within treatment category are generally more correlated than outside of category (e.g., medication management sessions generally have similar topic loadings that are heavily driven by drug names, dosing schedules, etc.). However, correlations across psychodynamic and humanistic/experiential sessions were often moderate such that it is difficult to separate them from visual inspection of the plot. In addition, there are pockets of sessions that are correlated across categories. For example, the zoomed in portion of the heat map depicted in the lower right portion of Figure 2 highlights several psychodynamic and cognitive-behavioral sessions that had very similar topic loadings. Interestingly, several of these sessions had both CBT and brief relational therapy labels, suggesting that the model was sensitive to potential overlap in content that was identified by the human raters who created the database.
Figure 3 is an alternative visual representation that highlights the semantic similarities and differences across sessions, called a multidimensional scaling (MDS; Cox & Cox, 2000) plot. Using the same session-level topic scores from the correlation matrix above, MDS treats each session’s 200 values as a set of coordinates (in a 200-dimensional mathematical space). Thus, the topic model-based semantic scoring can be used to define distance values of each session from every other session within a 200-dimensional semantic space. Somewhat similar to factor analysis, MDS finds an optimal, lower dimensional space that best represents the overall distance matrix; Figure 3 plots the results of the MDS. Each color-coded dot represents a single session. There was separation between treatment types such that treatment classes were broadly grouped together. However, there was variability within treatment approaches. For example, one set of CBT sessions (denoted in red) is notably different from other sessions. These are the structured MI sessions that all focus on drug or alcohol problems. Other CBT sessions are much more similar to other treatment approaches, and interestingly, appear to lie in between the highly structured medication management sessions and much less structured experiential sessions. In addition, we highlighted one medication management session that was distinct from the other medication management sessions, located much closer to experiential psychotherapy sessions. An inspection of this transcript revealed that there was no direct discussion of medications or dosage, potentially indicating a medication provider who focused on providing psychotherapy rather than checking medication dosage and side effects.
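One common way to formalize what MDS optimizes is given below as a sketch. The text does not state which MDS variant or distance measure was used, so the Euclidean distance, the two-dimensional target space, and the least-squares "stress" criterion are all assumptions, and the symbols (s_i for a session's 200 topic scores, x_i for its plotted position, n for the number of sessions) are introduced here for illustration.

```latex
% Distance between sessions i and j in the 200-dimensional topic space
d_{ij} = \lVert s_i - s_j \rVert_2, \qquad s_i \in \mathbb{R}^{200}

% Low-dimensional positions x_1, \dots, x_n are chosen to minimize the stress
\operatorname{Stress}(x_1, \dots, x_n)
  = \sum_{i < j} \bigl( d_{ij} - \lVert x_i - x_j \rVert_2 \bigr)^2,
  \qquad x_i \in \mathbb{R}^{2}
```

Under this formulation, two sessions plotted close together are sessions whose topic scores were close in the original space, to the extent that a two-dimensional layout can preserve all pairwise distances at once.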
The previous results are exploratory visualizations demonstrating how semantic content from a topic model could distinguish categories of psychotherapy. Our final analysis examined how accurately the 200 topics could discriminate these four classes of psychotherapy sessions, a classification task comparable to multinomial logistic regression. We used a machine learning model called a random forest, with the 200 topics as predictors (Breiman, 2001). Random forest models are a type of ensemble learner, in which many tree-based models are fit to resampled versions of the data and then aggregated into a single, overall prediction model.2 The prediction accuracy of the model is tested using sessions that were not used during the training phase. This is a type of cross-validation in which the prediction accuracy of a model is tested on data points that were not included in the model creation. The overall, cross-validated classification error rate was 13.3%, showing strong predictive ability of the topic model-based predictors. As we saw in the earlier visualizations, the semantic information identified by the topic model is highly discriminative of the classes of psychotherapy. Table 4 shows the specific types of errors that the model makes (called a confusion matrix). The rows contain the true psychotherapy categories, and the columns contain the model predictions. The counts along the main diagonal indicate correct classifications by the model, and off-diagonal elements are errors. Not surprisingly, the model is most accurate at identifying medication management sessions but is also quite accurate with experiential psychotherapy. It is less accurate with CBT and Psychodynamic sessions, which are more likely to be confused with experiential psychotherapy. This makes clinical sense, as the hallmarks of good experiential psychotherapy are reflective listening skills, which are common to (though not as strongly emphasized in) CBT and Psychodynamic treatments.
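The layout of a confusion matrix and its relation to the overall error rate can be illustrated with a short Python sketch. The ten labeled sessions below are invented for illustration and are not the values in Table 4; the function names are likewise hypothetical.

```python
from collections import Counter

# The four session classes discussed in the text.
CLASSES = ["CBT", "Psychodynamic", "Experiential", "Medication"]


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes; columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def error_rate(matrix):
    """Share of sessions that fall off the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total


# Invented held-out sessions (not data from the study).
true_labels = ["CBT", "CBT", "CBT", "Psychodynamic", "Psychodynamic",
               "Experiential", "Experiential",
               "Medication", "Medication", "Medication"]
predicted = ["CBT", "CBT", "Experiential", "Psychodynamic", "Experiential",
             "Experiential", "Experiential",
             "Medication", "Medication", "Medication"]

matrix = confusion_matrix(true_labels, predicted, CLASSES)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>13}: {row}")
print(f"error rate = {error_rate(matrix):.1%}")
```

In this invented example both errors fall in the experiential column, mirroring the pattern described above in which CBT and Psychodynamic sessions are mistaken for experiential psychotherapy.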
We used a specific computational method, topic models, to explore the linguistic structure of psychotherapy. Without any user input, these models discovered sensible topics representing the issues that therapists and patients discuss, and facilitated a high-level representation of the linguistic similarity of sessions wherein we could identify specific cases, potentially overlapping content across treatment approaches, and outlier sessions. By including human-generated session labels, topic models learned therapist statements associated with different treatment approaches and interventions, including therapist comments about the therapeutic relationship, which are often considered among the more complex interventions in the therapist’s repertoire. Using only the words spoken by patients and therapists, the topic model classified treatment sessions with a high degree of accuracy.
While the present study represents what we believe is the largest comparative study of linguistic content from psychotherapy ever conducted, there are important limitations that we will discuss prior to highlighting potential implications. First, in terms of the data, the combined general psychotherapy and MI corpus is very heterogeneous along several dimensions (e.g., treatment approach, topics of discussion, etc.), but it is certainly not a random sample of general psychotherapy, and the sessions were not necessarily collected for research purposes. While the diversity of the corpus facilitates the examination of differences between approaches, the database is also highly unbalanced. There is an over-representation of select cases (over 200 sessions from 1 case), and relatively few sessions from many approaches. For example, CBT is under-represented relative to its empirical standing in modern psychotherapy research, and many of the CBT sessions are Motivational Interviewing sessions that may not be representative of other more modal CBT interventions (e.g., Prolonged Exposure, Cognitive Therapy for depression). As a result, linguistic differences between treatments may be confounded with other differences in the selected sessions not related to approach (e.g., therapists, symptoms, idiosyncratic patient factors, etc.). The labeling of sessions was not done with standard adherence manuals, such that no estimates of reliability are possible. There is no symptom severity or diagnostic data beyond session-level labels that indicate that depression was discussed in a session. There is no audio, which is clearly important to the evaluation of psychotherapy.
The model itself contains a number of important limitations. First, the topic model we used did not include information regarding the temporal ordering of words and talk turns. This is common to most topic models, which make a “bag of words” assumption that word order is not critical. For most prior applications (e.g., news articles and scientific abstracts), this may be a reasonable assumption, but for spoken language it is clearly quite tenuous. In addition, while the removal of specific words like pronouns reduces the complexity of the data, it is likely that these words are quite important in psychotherapy and general human interactions (Williams-Baucom, Atkins, Sevier, Eldridge, & Christensen, 2010). The model was also restricted to text and did not have access to the acoustic aspects of these treatment interactions, which are also important (Imel et al., 2014). Future studies should incorporate the above features.
Transcription is a limitation to expanding this work. To use these methods researchers would be required to transcribe thousands of sessions from clinical trials. While this is an important practical limitation, we believe the primary reason that transcription remains uncommon is that the methods available to analyze transcript data in psychotherapy are labor intensive. In comparison to the cost of a clinical trial, the cost of basic transcription is minimal, and transcription could proceed in parallel with the clinical trial. Thus, while transcription would add costs to clinical trials, the costs would be trivial compared to the potential long-term scientific impact of retaining the raw ingredients that were involved in the change process. It is also important to note that automated speech recognition (ASR) techniques continue to improve, and may someday eliminate the need for human transcription entirely.
The primary implications of the topic model and other associated machine learning approaches will be in (1) targeted evaluation of questions in clinical trials that compare specific therapies, and (2) exploration of very large scale naturalistic datasets that capture variability in psychotherapy as actually practiced.
First, consider a recent large (n = 495) clinical trial comparing psychodynamic psychotherapy to CBT for social anxiety disorder (Leichsenring et al., 2013). Both treatments were better than wait-list. Between-treatment comparisons were generally equivocal (e.g., CBT had somewhat larger remission rates, but response rates were not significantly different, and no differences met clinically significant benchmarks set a priori). Differences between therapists (5-7% of variance in outcomes) were larger than treatment effects (1-3% of variance in outcomes). As is typical with large-scale psychotherapy clinical trials, there have already been published comments (Clark, 2013) and rejoinders (Leichsenring & Salzer, 2013) on possible explanations for the findings, wherein Clark raised questions about the implementation of the CBT and Leichsenring reported that the competence of psychodynamic therapists may not have been ideal. In addition, Leichsenring and Salzer (2013) noted that CBT therapists used more dynamic interventions than dynamic therapists used CBT-related interventions, raising questions about the internal validity of the trial. It is also possible that types of statements not specific to either intervention were responsible for between-therapist differences in outcomes.
As with other large psychotherapy clinical trials (e.g., Elkin, 1989), the debate will likely continue. However, a fundamental problem remains. While all treatment sessions were recorded, comparisons of adherence and competence were based on a total of 50 sessions (Leichsenring & Salzer, 2013). As the mean number of sessions for a patient was 25, and 416 patients received either CBT or psychodynamic treatment, the trial consisted of over 10,000 sessions (nearly 7 times the number of sessions included in this paper). Analyses of what actually happened in this trial are driven by roughly one half of 1% of all available sessions. This sample size is typical and understandable given the labor intensiveness of behavioral coding. However, given the centrality of treatment mechanism questions to the field of psychotherapy, we look forward to more thorough analyses of process questions with computational methods. For example, researchers could conduct original human coding of subsets of sessions and use these data to train topic models that might examine a larger collection of sessions. This research may ultimately lead to more definitive answers regarding what actually happens during patient-therapist interactions and what specific therapist behaviors predict treatment outcomes within and across specific treatments.
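The session counts in the preceding paragraph follow from simple arithmetic, reproduced here in Python for transparency. The inputs are the figures quoted above plus the 1,553 transcripts analyzed in this paper; the variable names are illustrative.

```python
patients = 416               # patients who received CBT or psychodynamic treatment
mean_sessions = 25           # mean number of sessions per patient
coded_sessions = 50          # sessions used for adherence and competence ratings
sessions_in_this_paper = 1553

total_sessions = patients * mean_sessions               # 10400
share_coded = coded_sessions / total_sessions           # fraction of sessions coded
relative_size = total_sessions / sessions_in_this_paper # trial size vs. this corpus

print(f"total sessions in the trial: {total_sessions}")
print(f"share of sessions coded: {share_coded:.2%}")
print(f"size relative to this paper's corpus: {relative_size:.1f}x")
```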
Funding agencies may consider requiring archives of audio and transcripts for sessions in clinical trials such that they can be used in later research. While there are privacy concerns that would need to be addressed in such a procedure, there is simply no other way for researchers to adequately evaluate what happened in the treatment. While manuals exist, these prescriptive books are not sufficient to capture the complexity of what happens during the clinical encounter. To truly understand the mechanisms of psychotherapy we must begin to contend with the sheer complexity and volume of linguistic data that is created during our work.
More practically, topic models could be used as adjuncts to training and fidelity monitoring in clinical trials or naturalistic settings, automatically highlighting outlier sessions or noting particular therapist interventions that were inconsistent with the specified treatment approach. In naturalistic settings, topic models could be used as a quantitatively derived aid to the traditional qualitative, report-based models of supervision. In combination with speech recognition and selective human coding, one could imagine extremely large psychotherapy process studies (e.g., 100,000 sessions) that avoid confidentiality concerns by evaluating session content without requiring humans to listen directly to all sessions. Studies of this size could be positioned to discover specific processes that are involved in successful vs. unsuccessful cases.
We design treatments, package them in books, and hope that trained providers implement them in a way that is faithful to the theory and makes sense for a given patient. This implementation often involves many hours of emotional, unstructured dialogue. Specifically, the patient-provider interaction contains much of the treatment’s active ingredients. The conversation is not simply a means of developing rapport and conducting an assessment to yield a diagnosis; it is the treatment. As a result, the questions of interest to psychotherapy researchers are complex and embedded in extremely large speech corpora. Research questions may include understanding the unfolding of intricate psychoanalytic concepts over a large number of sessions, the cultivation of accurate empathy, or the competent use of cognitive restructuring to examine an accurately identified irrational thought. Moreover, there is continued hope that a grand rapprochement may be possible wherein more general theories of psychotherapy process can replace and improve upon the traditional encampments that have characterized the scope of psychotherapy research for two generations.
Despite the fundamentally linguistic nature of these questions, most of the raw data in psychotherapy is never subjected to empirical scrutiny. The bulk of psychotherapy process research utilizes patient self-report or observer ratings of provider behavior. These methods have been available for decades and have yielded important insights about the nature of psychotherapy. However, existing methods are simply not sufficient to analyze data of this size and complexity, limiting both the nuance and scale of questions that psychotherapy researchers can address. There remains an almost lawful tension between the scope and the richness of our research. One can do a very large psychotherapy study, but the data will be restricted to utilization counts and self-report measures of treatment process and clinical outcomes. Alternatively, one can do detailed behavioral coding of sessions to evaluate therapist adherence, or qualitative work to extract themes, but the size of these studies is necessarily limited due to the labor intensiveness of the work. Machine learning procedures such as the topic models used in the current study offer an opportunity to strike a balance between these poles, extracting complex information (e.g., discussions of the therapeutic relationship) on a large scale.
Most thinking about how technology will revolutionize psychotherapy focuses on the digitization of treatment itself (i.e., computer-based treatments, mobile apps; see Silverman, 2013). Many worry about how the ‘low tech’ field of psychotherapy will adjust to this world, while more optimistic commentaries expect that the technological mediation of human interaction will simply provide more grist for the mill, albeit in a different form (Tao, 2014). However, we are poised for a parallel technological revolution in psychotherapy, where advanced computational methods like the machine learning approach described in this article may ultimately support, query, and expand the complex, messy beauty of a therapist and patient talking.
Funding for the preparation of this manuscript was provided by the National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award number R34/DA034860, the National Institute on Alcohol Abuse and Alcoholism (NIAAA) under award number R01/AA018673, and a special initiative grant from the College of Education, University of Utah. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the University of Utah. The authors would also like to thank Alexander Street Press for consultation related to the analysis of psychotherapy transcripts.
1. The scores were also divided by the total number of words per session so that sessions with different lengths did not skew the results.
2. For the present analyses, we created 2,000 new datasets, each with 1,318 sessions sampled with replacement from the original sessions. Next, a classification and regression tree model was fit to each of the 2,000 samples, using only a subset of the total predictors. Thirty predictors were selected randomly within each bootstrap-generated dataset. This process results in 2,000 sets of tree model results, which are then combined into an overall prediction equation.
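The resampling step described in footnote 2 can be sketched in Python as follows. This illustrates only the bootstrap and predictor-subset draws, not the tree fitting or the aggregation of predictions, and it is not the software used in the study. The function name `draw_resample` is hypothetical, and the demonstration draws 100 resamples rather than 2,000 to keep the run time short.

```python
import random

N_SESSIONS = 1318            # sessions per bootstrap dataset (footnote 2)
N_TOPICS = 200               # total number of topic predictors
PREDICTORS_PER_SAMPLE = 30   # predictors drawn for each bootstrap dataset


def draw_resample(rng):
    """Draw one bootstrap dataset and one random subset of predictors."""
    # Sample session indices with replacement.
    rows = [rng.randrange(N_SESSIONS) for _ in range(N_SESSIONS)]
    # Sample predictor indices without replacement.
    predictors = sorted(rng.sample(range(N_TOPICS), PREDICTORS_PER_SAMPLE))
    # Sessions never drawn are "out of bag" and can serve as test cases.
    out_of_bag = set(range(N_SESSIONS)) - set(rows)
    return rows, predictors, out_of_bag


rng = random.Random(0)
n_resamples = 100  # assumption for this demo; the footnote reports 2,000
resamples = [draw_resample(rng) for _ in range(n_resamples)]

mean_oob = sum(len(oob) for _, _, oob in resamples) / (n_resamples * N_SESSIONS)
print(f"average share of sessions left out of a resample: {mean_oob:.3f}")
```

Because sampling is done with replacement, a little over a third of the sessions are left out of any given resample; these held-out sessions are the kind of data points "not used during the training phase" on which prediction accuracy can be tested.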