Suicide is the second leading cause of death among 25–34 year olds and the third leading cause of death among 15–25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient’s thoughts, as represented by suicide notes. We focus on developing methods of natural language processing that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data comprise suicide notes from 33 suicide completers, matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide if a note was genuine or elicited. Their decisions were compared to those of nine different machine-learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time. This is an important step in developing an evidence-based predictor of repeated suicide attempts because it shows that natural language processing can aid in distinguishing between classes of suicide notes.
It is estimated that each year 800,000 people die by suicide worldwide.1 In the United States, suicide ranks as the second leading cause of death among 25–34-year olds and the third leading cause of death among 15–25-year olds.1 The challenge in a clinical setting is to predict the likelihood of a serious repeated attempt.2 This challenge is exacerbated by the heterogeneity of patients and clinical judgment. Two evidence-based risk assessment tools have shown conceptual success, but to our knowledge neither has been translated into standard medical practice.3,4 The long-term goal of our research is to develop and implement an evidence-based tool for measuring the likelihood of repeated suicide attempts.
To gain insight into the suicidal frame of mind, researchers have suggested analyzing national mortality statistics, psychological autopsies, nonfatal suicide attempts and documents such as suicide notes.5 Early research6,7 on suicide notes usually used an anecdotal approach incorporating descriptive information.8 Subsequent methods, based on Frederick’s analytical approach, have used content, classification, and theoretical-conceptual analysis. Content analysis extracts explicit information from a suicide note, e.g. length of the message, words, and parts of speech. Classification schemes, on the other hand, use data such as age, sex, marital status, educational level, employment status and mental disorder.9,10,11–13 It has been suggested that simple classification analysis has its limitations,14 but comparison of note-writers with non-note-writers has consistently found no differences.15
Only a very few studies have used theoretical-conceptual analysis8 despite the assertion in the first formal study of suicide notes that such an analysis has much promise.5 To address this paucity, Leenaars introduced a method that permits a theoretical analysis of suicide notes, increases the effectiveness of controls, and fosters development of some theoretical insights into the problem of suicide.13,16–18 He developed a cross-cultural model that consists of intrapsychic and interpersonal cluster themes. The intrapsychic cluster includes unbearable psychological pain (UP); cognitive constriction (CC); indirect expressions (IE), e.g. ambivalence and unconscious processes; and inability to adjust (IA), or psychopathology. The interpersonal cluster includes disturbed interpersonal relations (IR); rejection-aggression (RA); and identification-egression (IEG), or escape.19 Subsequent research on suicide notes has supported the utility of such research, indicating that both content and psychological processes are critical to prediction.20,21
Using computational methods to study suicide notes is not new,14 but applying advanced algorithms to clinical care of suicidal patients is. Recent computer analysis compared structural characteristics (average sentence length, parts of speech) with content variables (length of communication, instructions, active state, explanation provided, locus of control) in their predictive values.22 Another approach focused on semantic content of words used in suicide notes by grouping words into linguistic variables (e.g. positive, negative emotions, hearing, references to people, time, religion).23
Content, classification and theoretical-conceptual analyses have discovered many features that can be used to assess suicide risk. Yet few features are consistent among research protocols. Most of them overlap in meaning, and some are contradictory. These earlier studies, however, were unable to take advantage of current machine learning methods for feature extraction. Our preliminary studies show that it is possible to create machine-learning models that mix content and theoretical-conceptual features and classify suicide notes with higher accuracy than mental health professionals.24 In this study we examine whether this trend holds when psychiatric physician trainees participate.
This section describes the study’s methods. It has the following components: experimental design, data, feature selection, expert classification, word mending, annotation, and machine learning.
This is a cross-sectional design to test the hypothesis that machine learning algorithms can classify suicide notes as well as or better than practicing mental health professionals and psychiatry physician trainees. This study is approved by our Institutional Review Board (#2009–0664).
Completer and elicited notes were transcribed from Clues to Suicide.5 The transcribed data were then reviewed for errors and omissions. Sixty-six notes were divided into two groups: 33 completers and 33 elicitors. To create an elicitor note, Shneidman asked individuals to write a note as if they were going to commit suicide. The groups were matched by gender (male), race (white), religion (Protestant) and nationality (United States citizens); ages ranged between 25 and 59 years. Anyone suspected of having a personality disorder or a tendency toward morbid thoughts was asked to write about the happiest day of their life. These notes were then discarded and the individual was not enrolled in the study.
Each completer note was paired with its elicited counterpart. The paired notes were then randomly ordered and presented to 11 mental health professionals (psychiatrists, emergency room physicians with mental health training, and psychiatric social workers), 30 psychiatry trainees who had between one and four years of post-MD training, and one psychiatric fellow (five years of training). All were asked to classify each note as either genuine or elicited. Their results were compared to the learning models described below.
Feature selection, also called variable selection, is a data reduction technique for selecting the most relevant features for a learning model. As irrelevant and redundant features are removed, the model’s accuracy increases. Multiple methods for feature selection were tested: bag-of-words, latent semantic analysis and heterogeneous selection. Ultimately, heterogeneous selection was used. To reduce co-linearity, highly correlated features were removed. To increase the certainty that a feature was not randomly selected, that feature had to appear in at least 10% of the documents.25 Finally, after preparing each bootstrap sample, only the 66 features with the highest information gain were selected. Information gain can intuitively be interpreted as measuring the reduction of uncertainty.26,27 Table 1 shows the feature selection and reduction processes. An initial feature space of 1,063 variables was reduced to 66. Thus, the final matrix contains 66 documents and 66 features.
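The information gain ranking described above can be sketched as follows. This is a minimal illustration, not the Weka procedure used in the study: it assumes each feature has already been reduced to discrete values (for example, word present or absent), and the function names and the toy data are ours.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by knowing a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Names of the k feature columns with the highest information gain."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data (hypothetical): four notes, two binary word features.
labels = ["genuine", "genuine", "elicited", "elicited"]
columns = {"love": [1, 1, 0, 0], "the": [1, 0, 1, 0]}
best = top_k_features(columns, labels, 1)
```

In this toy example the first feature separates the two classes perfectly and the second carries no information, so the ranking keeps the first.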
Tokenization is the first step in Natural Language Processing (NLP) analysis. It identifies those basic units which need not be decomposed in subsequent analysis and prepares for analysis like word checking, ambiguity checking and disambiguation.2 This was done using an internally developed Perl program. Next, using the Penn Treebank tag set and the Lingua-EN-Tagger-0.13 (2004) module, 18 part-of-speech tags were added to the feature space. This tagging is necessary to establish the relationship of a particular word to a particular concept.
The Flesch and Kincaid readability scores produced a high information gain and were included in the feature space. These scores are designed to indicate comprehension difficulty. They include an ease of reading and text-grade level calculation.28,29 Computation of the Flesch and Kincaid indexes was completed by adding the Lingua::EN::Fathom module to our Perl program.
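For readers unfamiliar with these indexes, the sketch below computes both from the standard published formulas. It is not the Lingua::EN::Fathom implementation: the syllable counter is a crude vowel-group heuristic that we assume only for illustration, and the function names are ours.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


ease, grade = readability("The cat sat. The dog ran.")
```

Short sentences made of one-syllable words give a high ease score and a low grade level, as in the example call.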
Each suicide note was annotated with emotional concepts. Developing an ontology to organize these concepts required both PubMed queries and expert literature reviews. Using the PubMed queries, a frequency analysis of the keywords in 2,000 suicide-related manuscripts was conducted. Expert review of those keywords yielded 166 suicide-related manuscripts that contained suicide emotional concepts. These emotional concepts were allocated to 19 different classes. Three mental health professionals then reviewed each of the 66 notes and assigned the emotional concepts found in those notes to the appropriate classes. For example, the emotional concept of guilt was assigned to the class of emotional states.
There are multiple general types of machine learning: unsupervised, semi-supervised and supervised. Semi-supervised methods use both labeled and unlabeled data and are efficient when labeling data is expensive, which leads to small data sets. For this research the semi-supervised approach was selected mainly because the labeled data set is small. Using the Waikato Environment for Knowledge Analysis (Weka) collection of data mining algorithms, we compared a number of machine learning methods.30 Those germane to this research were organized into five categories:
Some features have been used in previous studies.22,23,31 This study extends the previous work by creating a heterogeneous, multidimensional feature space. To do so, the following algorithms were used to extract and quantify the relevant content features:
Features came from different sources, which led to their numeric values being in different ranges. To remedy this, feature values were normalized to a maximum value of one. In the end, a matrix with 66 documents and 49 features whose values ranged between 0 and 1 was created. Since there are fewer features than documents, additional feature selection was not applicable.
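One simple way to normalize a feature to a maximum value of one is to divide each value by the column maximum. The sketch below assumes non-negative feature values (such as counts or frequencies); the function name is ours and the exact scaling used in the study is not specified.

```python
def scale_to_unit_max(column):
    """Divide every value by the column maximum so the largest becomes 1."""
    peak = max(column)
    if peak == 0:
        # An all-zero column carries no information; leave it at zero.
        return [0.0 for _ in column]
    return [value / peak for value in column]


scaled = scale_to_unit_max([2, 4, 8])
```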
A decision tree classifier is represented as a tree. Every node of the tree is represented by a list of possible decisions. The decision about the next branch is based on a single feature response. Leaves of the tree are represented by the decisions about which class should be assigned to a single document. For decision tree analysis the following algorithms are used:
The classifier algorithm is represented by a set of logical implications. If a condition for a document is true, then a class is assigned. Conditions are composed of a set of feature responses OR-ed or AND-ed together. These rules can also be viewed as a simplified representation of a decision tree. For classification analysis the following algorithms are used:
Classifiers in this category can be written down as mathematical equations; decision trees and rules cannot. There are two classifiers in this category. For functional models the following algorithms are used:
Classifiers in this category do no real work until classification time. Classification is done by reviewing every instance in the training set separately. Only one algorithm is used in this category:
Classifiers in this category use Bayes’ theorem and the assumption of independence of features. Only one algorithm is used in this category:
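As a hedged illustration of the decision rule such a classifier applies, write c for a class (genuine or elicited) and x_1, ..., x_n for the feature values of a note. These symbols are ours and do not appear in the original text. Under the independence assumption, the predicted class is:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```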
Bootstrapping is used to estimate classifier performance.35 Bootstrapping has been shown to provide stable estimates.36 It is the practice of estimating properties of a classifier by measuring them when sampling from an approximating distribution. The advantage of bootstrapping over other analytical methods is its simplicity. Derivation of standard error estimates and confidence intervals for complex estimators of complex parameters is straightforward. The disadvantage of bootstrapping is that while (under some conditions) it is asymptotically consistent, it does not provide general finite sample guarantees and it has a tendency to be overly optimistic. The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples), whereas these would be more formally stated in other approaches.37
In the case of the mental health professionals’ and psychiatry trainees’ ratings, the result is a simple weighted average and so bootstrapping is not necessary. In the case of machine-categorization estimation, a random sample of 66 documents is drawn with replacement to create a training set. After that the model is tested against all 66 documents. The .632+ method is used for bias correction.35 Each bootstrap estimate had 100 samples redrawn. The procedure was repeated 25 times to calculate the stability of the estimate.36 Thus, a total of 25 × 100 samples were drawn with replacement.
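The resampling and bias-correction steps can be sketched as below. This is an illustration under stated assumptions rather than the study's code: it follows the commonly stated form of the .632+ estimator, which combines the training (resubstitution) error with the out-of-bag error and a no-information error rate gamma, whereas the text above describes testing against all 66 documents. The function and parameter names are ours.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def err_632_plus(err_train, err_oob, gamma):
    """Combine training and out-of-bag error rates in the .632+ form.

    gamma is the no-information error rate (the error expected if features
    and labels were independent); 0.5 is assumed for balanced classes.
    """
    err_oob = min(err_oob, gamma)  # cap the out-of-bag error at gamma
    if err_oob > err_train and gamma > err_train:
        overfit = (err_oob - err_train) / (gamma - err_train)
    else:
        overfit = 0.0
    weight = 0.632 / (1.0 - 0.368 * overfit)
    return (1.0 - weight) * err_train + weight * err_oob


in_bag, out_of_bag = bootstrap_split(66, random.Random(0))
estimate = err_632_plus(err_train=0.1, err_oob=0.3, gamma=0.5)
```

When the out-of-bag error equals the no-information rate, the weight reaches one and the estimate falls back entirely on the out-of-bag error; with no apparent overfitting it reduces to the plain .632 weighting.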
Previous computer analyses of suicide notes have used t-test, chi-square or ANOVA statistics to identify the features that best discriminate between two categories. In our case, the Kolmogorov-Smirnov test showed that not all features are normally distributed. Hence, a Wilcoxon test is used to assess the shift in a feature’s distribution between elicited and genuine notes.
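A two-sample Wilcoxon (rank-sum) comparison of one feature across the two groups can be sketched as follows. The sketch assumes the large-sample normal approximation without a tie correction to the variance, which may differ from the software actually used in the study; the function names are ours.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Return (U statistic for x, z score, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p


u, z, p = wilcoxon_rank_sum([1, 2, 3], [4, 5, 6])
```

A feature whose values in one group all fall below those in the other gives U = 0, the most extreme shift possible for those sample sizes.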
Overall, the features most often selected by the machine algorithms included:
Words: am, and, are, Betty, but, could, did, do, everything, for, good, goodbye, had, have, he, her, I, in, is, it, Jones, leave, life, longer, love, Mary, more, mother, my, n’t, now, Smith, so, that, the, things, this, to, Tom, was, with, and you. While it is reasonable to suggest that the anonymized proper nouns like Jones, Mary, Tom and Smith should not be part of the feature space, they were included because they can act as proxies for individual names;
Part of speech tagging: Cardinal number (CD), Determiner (DET), Preposition or subordinating conjunction (IN), Adjective (JJ), Adjective superlative (JJS), Modal (MD), Noun, singular or mass (NN), Proper noun, singular (NNP), Noun plural (NNS), Personal pronoun (PP), Prepositional phrase (PP), Personal pronoun (PRP), Possessive pronoun (PRPS), Adverb (RB), Verb, base form (VB), Verb, past participle (VBN), Verb, non-3rd person singular present (VBP), and Verb, 3rd person singular present (VBZ);
Readability: the Flesch Reading Ease score is a 100-point scale, with higher scores easier to read. The Flesch-Kincaid Grade Level is a number that corresponds with grade level; and
Emotions: giving things away, hopeless, regret and sorrow.
The human raters relied on the ontology shown in Figure 1: Suicide Ontology. This ontology is much more extensive than the four emotions selected by the information gain function.
Feature selection and data reduction are listed in Table 1: Feature Selection Process and Results. Information gain was calculated only for the training data in each bootstrapped sample. From the initial 1,063 possible features, 66 were ultimately selected based on information gain and frequency. They included words, parts of speech, concepts and reading scores.
Table 2: Genuine and Elicited Notes Descriptive Statistics provides the mean and standard deviation of a number of note characteristics. It shows the fifteen features with the smallest p-values in two-sample Wilcoxon tests. Hypothesis testing is only one of many feature selection methods and may not always describe the data accurately.38 Some machine learning algorithms, like LMT, have feature selection embedded. It is worth examining whether different feature selection algorithms give the same results. This is possible only for some simple problems.
Table 3: Human & Machine Raters after 25 × bootstraps shows that mental health providers perform better than psychiatry trainees, but not as well as the best machine learning algorithms. For the psychiatry trainees, overall categorization was roughly equal to the flip of a coin. Mental health providers were significantly better than trainees. They accurately classified notes about 63% of the time.
Table 3 compares the machine classification algorithms with psychiatry trainees and mental health providers. On average, the best machine algorithm (Logistic Model Trees) performed significantly better than the mental health providers. All the algorithms did significantly better than the psychiatry trainees. Nine of ten machine algorithms performed significantly better than mental health providers.
Performance of different machine learning algorithms is complementary. J48/PART (0.640/0.645) suggests that a tree representation is mediocre for the data. On the other hand Linear SMO (SVM)/LMT (0.705/0.744) suggests that there is some linear separability of the two categories. In addition, a logistic regression outperformed linear support vector machines, i.e. LMT (0.744).
Table 4: Logistic Model Tree. When LMT is trained on the entire data set, using all features and all suicide notes, there is only one leaf with two linear functions that categorize suicide notes. There are only three features shared by the Wilcoxon test and LMT (maximal frequency of a word, Flesch-Kincaid grade level, cardinal number frequency). Two of the equations in Table 4 misclassified only four documents. Features selected by LMT describe sentences (number of words, depth of the parsed tree), whereas hypothesis testing selected features that describe different aspects of the notes.
Table 4 can be difficult to read, so we offer the following example as an explanation. Figure 2: Hyperspace Definition shows a three-dimensional cube, or a hyperspace with three features. Axis z represents the Flesch-Kincaid reading score, axis y represents the MLS method, and axis x represents the MDS method. In this case, the difference each method computes creates a hyperplane. This hyperplane is shown in the center of the defined hypercube. Those features above the hyperplane are labeled with a “+”. In our case this represents genuine notes. Those features below the hyperplane are labeled “−”. In our case, this represents elicited notes.
The purpose of this research was to understand how well different machine learning algorithms performed compared to humans who were asked to distinguish between elicited and genuine suicide notes. We confirmed that, at least in part, machine algorithms could do as well as humans. We speculate here on the possible reasons.
One possible explanation can be found in psychological phenomenology. Psychological phenomenology focuses on the experience of the subject. True, the term experience is a complex concept, but in principle an experience is not directly observable by an external observer:39 a mental health provider cannot truly observe the internal pain of a suicidal patient. What then gives insight into how the mental health professionals, psychiatry trainees and machine algorithms experience the act of classifying genuine and elicited suicide notes? We propose that it is the features. Figure 1 shows what features the humans used for classification. There are four classes and 40 emotional concepts. The machine algorithms include four emotional concepts, 42 specific words (none of them emotional), and readability scores. Considering these selections, it is reasonable to conclude that the human raters focused on content, while the machine algorithms focused on structure.
The results of this research have a number of potential applications. First, using machine algorithms to discriminate between genuine and elicited suicide notes has important clinical and forensic implications, especially as it relates to advanced decision support. The findings also suggest that algorithms such as the one used in this study may have applications for the prospective clinical assessment of psychiatric patients suffering not only from suicidal ideation or intent, but also from homicidal impulses, which are vital to predict. Finally, this study can have relevant applications for distinguishing malingerers who feign psychiatric illness for ulterior motives.
Addressing one item, the sample size, would strengthen the study’s generalizability. We recognize that a sample of 66 notes is small. To our knowledge, however, this is the only data set that lends itself to this type of research.
We acknowledge E.S. Shneidman for access to the data and for comment and guidance. We acknowledge the divisions of Biomedical Informatics, Emergency Medicine and Psychiatry at Cincinnati Children’s Hospital Medical Center, University of Cincinnati and the Ohio Third Frontier program for their generous support of this work. This work is funded from the Ohio Third Frontier program.
This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.