Search tips
Search criteria 


Logo of biiLibertas AcademicaJournal Home
Biomed Inform Insights. 2012; 5(Suppl 1): 51–60.
Published online Jan 30, 2012. doi:  10.4137/BII.S8972
PMCID: PMC3409489
Emotion Detection in Suicide Notes using Maximum Entropy Classification
Richard Wicentowski1 and Matthew R. Sydes2
1Swarthmore College, Computer Science Department, Swarthmore, PA, USA
2Medical Research Council Clinical Trials Unit, London, UK
Corresponding author email: richardw/at/
An ensemble of supervised maximum entropy classifiers can accurately detect and identify sentiments expressed in suicide notes. Using lexical and syntactic features extracted from a training set of externally annotated suicide notes, we trained separate classifiers for each of fifteen pre-specified emotions. This formed part of the 2011 i2b2 NLP Shared Task, Track 2. The precision and recall of these classifiers related strongly with the number of occurrences of each emotion in the training data. Evaluating on previously unseen test data, our best system achieved an F1 score of 0.534.
Keywords: natural language processing, text analysis, emotion classification, suicide notes
The goal of this study was to identify emotional content expressed in suicide notes. This was undertaken as part of the 2011 i2b2 NLP Shared Task,1 Track 2. The organizers of the shared task created a fixed inventory of fifteen emotions (Table 1).
Table 1.
Table 1.
List of the annotated emotions in the dataset, as well as the frequency of occurrence of each of the emotions in the training set and the test set and the annotation guidelines. The log of the test/train ratio illustrates whether the training set represented (more ...)
An external team first collated the notes and subsequently anonymized them so that no personal or identifiable data remained; names and addresses were substituted with a small selection of alternatives. Subsequently, volunteers who had a strong emotional connection to someone who had committed suicide provided a sentence-level annotation of the suicide notes such that every sentence was either assigned one or more emotions or, more often, was left unlabeled. Sentences were labeled with a particular emotion only when a sufficient number of annotators agreed upon the annotation; hence, an unlabeled sentence does not necessarily indicate a complete agreement that these emotions were absent from the sentence nor does a labeled sentence necessarily indicate incontrovertible evidence that an emotion was present in the sentence.
Participants in the shared task were provided with 600 annotated notes as training data. The task organizers asked participants to optimize their systems according to F1 score: the harmonic mean of precision and recall.
Two of our systems attempted to achieve maximal F1 score, as instructed, by balancing precision and recall. Our third system attempted to achieve higher precision at the expense of lower recall. In the sections below, we describe our activities with the annotated training data before focusing on our performance with the test data.
In the task presented here, each sentence of a suicide note was classified as exhibiting zero or more emotions, annotated from a pre-defined set of 15 “emotions”. The emotions are shown in Table 1 and include two non-emotional labels: “information” and “instructions”. In addition, the label we refer to as “happiness” included both “happiness” and “peacefulness”. Many well-studied natural language processing (NLP) classification tasks (part-of-speech tagging, word-sense disambiguation, spelling correction, named entity detection, sentiment analysis and spam filtering, to name a few) also draw their labels from a pre-defined label inventory. However, the classifier in those tasks must assign exactly one label to each information unit whereas here we must provide zero or more labels. While this difference is worth noting, it does not preclude us from investigating off-the-shelf implementations of classifiers used for other NLP tasks. It simply requires recasting into the more familiar problem where the classifier assigns exactly one label to each sentence. We accomplish this by training 15 different classifiers, one for each emotion.
Each classifier performs a binary labeling for that emotion (estimating it as present or absent). We then transform the output of each classifier so that the labeling from classifier e is either the set {e}, if the emotion is present, or the null set, if the emotion is not present. Doing so allows us to form the final classification for each sentence by taking the union of the sets assigned by each individual classifier. If the union of these sets for a particular sentence is the null set, we say the sentence is unlabeled.
In Table 1, we show the number and percentage of sentences that were annotated with each label (“Labeled sentences”). Since each sentence could have been labeled with up to 15 labels, we may wish to consider the number of total labels assigned (“Labels assigned”). This value exceeds the number of labeled sentences since 14% of labeled sentences had multiple labels. There were 4,633 sentences in the training set, each of which could have received up to 15 different labels. Of the potential 69,495 labels that could have been assigned, 2,522 (3.63%) were assigned.
Maximum entropy classifier and features
For each emotion, we trained a maximum entropy classifier2 using only the training data supplied as part of the shared task, then we applied the trained classifier to the test data. Maximum entropy classifiers have been widely used in NLP classification tasks, for example in part-of-speech tagging3 and in named-entity recognition.4 In this work, we made use of the freely available Stanford Classifier.5 In addition to using words (unigrams) as features, we experimented with a wide variety of additional classifier options, preprocessing options, and feature types:
  • Sigma: The classifier uses sigma as a parameter to specify the strength of the prior. By default, this value is 1.0. Values less than 1.0 indicate a stronger prior. We experimented with values between 0.25 and 2.0. All results presented below use a sigma value of 0.5, which was set based on evidence from cross-validation.
  • Word shape: One of the features that can be automatically generated by the Stanford Classifier is word shape. This feature is used to conflate words that “look” the same. For example, all 4-letter words beginning with a capital letter could both be represented as “Xxxx” and all two-digit monetary amounts could be represented as “$dd”. Based on the recommendation found in the classifier’s documentation, we used the “chris4” shape algorithm in our experiments. Details of the “chris4” algorithm can be found in the source code for the classifier.
  • Contraction normalization: A large number of common contractions were present in the dataset with different spellings. For example, the word “can’t” was present as “can’t”, “cant”, “can’t”, “can’t” and “cann*t”. The asterisk was a common substitute for an apostrophe in many cases found in the training data (eg, “doesn*t”, “don*t”, “who*s”, etc). We developed rules to cover a range of possible misspellings for common contractions following the misspelled examples we found in the training data in order to accommodate misspellings which might be expected in the test data.
  • Spelling correction: For any word not found in a large, clean wordlist, our spelling correction algorithm chose the most frequently occurring alternative as ranked by the unigram counts derived from Google’s index.6 Alternatives were generated by applying single character insertions, deletions, substitutions and transpositions. We included “space” as a valid insertion character given the large number of tokens in the dataset that were the result of two words joined together without a space (eg, “thepapers”, “knowit”). We added special rules to handle a few common, and uncommon, errors (eg, “ys” → “ies”, “x” → “ct”) after examining the results of this algorithm on the training data.
  • Morphological features: We experimented with using both character n-grams and word-stemming algorithms to complement our feature set. Character n-grams included only prefix and suffix n-grams ranging from 2-grams to 6-grams. These character n-grams were used to capture simple morphological processes. We also used the NLTK7 implementation of the Snowball stemmer* to conflate morphologically related words.
  • Word bigrams and trigrams: In addition to unigram features, we also used word bigrams and word trigrams as features. More specifically, token bigrams and trigrams, as we did not distinguish between words and punctuation in creating these n-grams.
  • Dependency relations: Although the use of dependency relations is not new in this field,8 their use is far less common than the other features described above. We were motivated to include dependency relations based on our understanding of the guidelines provided to annotators (Table 1). For example, the guidelines for “forgiveness” state that the author of the note should be forgiving someone, not asking for forgiveness. Therefore, it is relevant to identify dependencies such as “nsubj(i, forgive)” as positive examples of the forgiveness label, whereas “dobj(forgive, me)” would be a negative example. We used the Stanford Parser9 to generate these dependencies, extracting “ collapsed dependencies”. Collapsed dependencies combine dependency relations so that the resulting relations only contain content words, omitting prepositions and conjunctions, for example. Note that when spelling correction and contraction normalization are used as features, the parser receives this corrected and normalized text as input; otherwise, the original notes are used.
  • Variable dependency relations: We also experimented with substituting individual words with variables in these dependency relations, expecting to conflate uncommon patterns in order to decrease data sparseness. For example, this method would conflate “dobj(love, him)”, “dobj(love, her)” and “dobj(love, Jane”) into “dobj(love, x)”. We did not perform any entity detection on the arguments in the dependency relations, so relations such as “dobj(love, life)” were also conflated into “dobj(love, x)”.
The Results section presents the performance of these features using five-fold cross-validation on the training data. In explaining the features used in this presentation, we use the key presented in Table 2.
Table 2.
Table 2.
Key to the features used in the classifiers. For example, the notation “W1–1/C0–0/L-/M-/D-/O-” indicates a feature set using only unigrams as features and “W1–2/C0–0/L+/M+/Dd/Oe” indicates (more ...)
  • Confidence tuning: In a binary classification setting, the Stanford Classifier yields results ranging from 0.0 (very strong likelihood sentence is unlabeled) to 1.0 (very strong likelihood sentence is labeled), using 0.5 as the dichotomous split point on whether or not to label a sentence. Using cross-validation, we were able to fine-tune this split point to optimize F1 (always by increasing recall at the expense of precision). In our first submitted system, we fine-tuned a single split point value for use with all emotion classifiers. In the other two systems, we individualized the split point for each emotion classifier.
  • Minimal annotation: Inspection of the training data revealed that there were very few notes that had no annotations at all, although most contained sentences that had no annotations. Therefore, we ensured that in each of our submitted systems, every note had at least one annotated sentence. If the classifiers yielded a note with no annotations, we found the sentence-emotion pair with the highest confidence and then labeled that sentence with the emotion.
  • Skipping hard-to-predict emotions: On cross-validation (see Results for details), our F1 performance was hurt by low precision and low recall on the emotions which occurred infrequently in the training data (see Discussion for comment). Therefore, in each of our submitted systems we did not include the results of the classifiers associated with the nine lower frequency emotions. For these systems, we estimated only the higher frequency emotions which had performed well in cross-validation experiments: “instructions”, “hopelessness”, “love”, “information”, “guilt” and “thankfulness”.
  • Emotionless classification: Utilizing the same feature set described above, we trained a binary classifier to identify sentences that were labeled with any emotion, regardless of the actual emotion with which they were annotated. We call this the emotionless classifier, since it differentiates between sentences “with emotion” and “emotionless” sentences. We combined the output of the emotionless classifier with each of the individual emotion classifiers described above. Given a sentence that was identified as having some emotion by the emotionless classifier, we slightly increased the confidence of each of the individual emotion classifiers for that sentence. We used cross-validation to fine tune the increase amount.
  • Memorize labels: The training data contained sentences that were identical to sentences in the test data. When using this heuristic, we copied the label from the identical training data sentence onto the matching test sentence.
  • Classifier combinations: We trained individual classifiers based on each of the 48 combinations of the features listed above. We then attempted to use logistic regression to identify small sets of combination-classifiers that were most orthogonal; we combined these classifiers to try to improve on the performance of the standalone classifiers. The combination was done by linearly combining the confidence values of each classifier on a sentence-by-sentence basis. Each classifier was equally weighted in this linear combination, so the final confidence value was always equal to the mean of the confidence scores of the individual classifiers. This method failed to achieve adequate results to be included in any of our submissions.
Systems used for submission
As part of the evaluation exercise, participants were allowed to submit the results from up to three classifiers, with the understanding that the single best classifier, as determined by F1 score, would be the system used to provide a ranking of all participants. Below is a description of the three systems that we developed and submitted:
  • System 1: Our first system, S1, used W1–2/C0–0/L+/M+/Dd, the feature set that performed best on cross-validation, with sigma set to 0.5, confidence tuned to 0.198, skipping hard-to-predict emotions.
  • System 2: Our second system, S2, used the same feature set as S1, with sigma set to 0.5 and skipping hard-to-predict emotions. But this system used split points tuned specifically for each emotion and ensured at least one annotation per note. Sentences that were labeled by an emotionless classifier (same features, sigma = 0.5) had the confidence of their labelings increased by 0.05 for all emotions. We then applied the Memorize labels heuristic.
  • System 3: Our final system, S3, was identical to S2 except that the split point was the untuned value of 0.5.
In this section, we present two sets of results. We detail the performance of each feature used by our classifiers using 5-fold cross-validation methods on the training data. Then, we present the performance of our three submitted systems on the test data and compare that to system performance on cross-validated training data.
Cross-validated results and feature selection
Confidence tuning has a pronounced effect on the results regardless of the feature set used; therefore, we begin by presenting results using our default 0.5 split point and the tuned split point found using cross-validation on two representative feature sets. Table 3 sets out the true and false positives (TP and FP) and false negatives (FN) for these models, together with the precision (P), recall (R) and F1 score. True negatives are not shown in any of the models as they do not impact the F1 score. Comparing the first and third rows, we can see how introducing bigrams, spelling correction, stemming and dependencies greatly reduced the false positives but at the cost of fewer true positives and more false negatives. The impact on F1 was still beneficial. The second and fourth rows show how fine-tuning the split point could further improve the functioning of these classifiers.
Table 3.
Table 3.
For every feature set, fine tuning a single split point for all 15 emotion classifiers yielded a decrease in precision, an increase in recall and a marked increase in F1 in the cross-validated training set. Two representative feature sets are illustrated (more ...)
Given the across-the-board improvement due to split point optimization, Table 4 presents the results of using each of the features individually (relative to using only using word unigrams) using only this optimization. Table 5 shows the performance of the single best classifier using cross-validation. In Table 6, we present the mean change in precision, recall and F1 observed when each feature was added to our classifier, averaging across all permutations of the other features. Since the overall F1 score is the mean across all possible other sets of features, one can view the values in Table 6 as an approximation of the additive amount of increase in F1 that can be obtained by adding each feature to the standard classifier. Table 7 presents the performance for each of our three submitted systems on both the training and test data.
Table 4.
Table 4.
Relative performance difference in the cross-validated training set of including a single new feature relative to the baseline system that used only unigrams, evaluated using the “Oa” optimization, sorted by F1. See Table 2 for an explanation (more ...)
Table 5.
Table 5.
Performance of the best single classifier, W1–2/C0–0/L+/M+/Dd, using 5-fold cross-validation on the training data, with a single optimized split point = 0.198 and sigma = 0.5, sorted by F1 compared with the performance of the same classifier (more ...)
Table 6.
Table 6.
Average relative performance difference on cross-validated training data between classifiers trained with and without the feature listed, sorted from least effective to most effective. Positive values indicate that, on average, the classifier improved (more ...)
Table 7.
Table 7.
Performance of the three classifiers used for official submissions. Sorted by F1 on the test data, results are presented for both using 5-fold cross-validation on the training data and on the test data. Note that the Memorize labels heuristic, used in (more ...)
All of the systems we submitted achieved results in line with the performance we obtained using cross-validation on the training set. Since, for testing, we could train on the full data set (instead of 80% as in the 5-fold cross-validation setup), we expected a small improvement in our F1 score on the test data; this was realized for our first (S1) and third (S3) systems, but not in our second (S2) system. One possibility is that the individualized optimization of split points on a per-emotion basis used in S2 led to overtraining.
We trained our classifiers to function well specifically in terms of the F1 score as this was the scoring goal of the i2b2 competition. The F1 score is the harmonic mean of precision and recall. In other settings, precision is known as sensitivity; it captures the proportion of annotated emotions that the classifier detects. Recall also uses the correct positive annotations, focusing on the proportion of positive estimates which were correct; this is known in many other settings, notably health research, as positive predictive value. In those same settings, precision (sensitivity) is supplemented, not by recall but by specificity which focuses on the non-annotation of a label. Here, specificity would be the proportion of sentences that should not be annotated with a given emotional label correctly not having that label applied. The use of F1 score as a summary in this setting completely ignores the true negatives (ie, those that were correctly not labeled with a given emotion). We believe that these true negatives were actually important. With this exercise we were effectively running a series of yes/no annotation exercises on the same dataset. The clear majority of sentences were not labelled with each given emotion: annotated sentences for each given emotion were quite rare in both the training and test datasets. The correct answer was to not label with each particular emotion in most instances. It is interesting to note that the specificity was extremely high (and, therefore, very encouraging) for all of the classifiers that we developed.
The annotated datasets were the gold standard in these exercises and reflected the time and effort of many people. Each sentence was annotated by at least three annotators, assigning annotations to sentences only when two or more annotators agreed. However, there were many instances in both the training and test dataset where we found that emotions were inappropriately applied or omitted by the annotators, as detected by our classifiers. For example, the guidelines state that a sentence should be annotated with “forgiveness” if the author is forgiving someone, not if the author is asking for forgiveness. Yet, the sentence “Forgive me for this rash act but I alone did it.” was wrongly annotated with “forgiveness” in the training set by the annotators.
In addition, there were instances where exactly same sentence did not attract the same emotion from the annotators. In the training data, some sentences appeared multiple times across notes. For example, the sentence “I love you.” appeared in 7 training sentences: 5 times annotated with “love” and 2 left unannotated. Our Memorize Labels heuristic relied on the fact that sentences appearing multiple times across the test and training data would be labeled consistently. However, there were at least 15 sentences in the test data that appeared in the training data with different annotations. Although the particular context of the sentence could affect the labeling, there were two pairs of notes appearing in both the test and training data that were nearly identical. These pairs of notes were the result of a single author writing two notes to different people. One note was in the training data, one in the test data. Even here we find inconsistencies. In one pair, 17 annotations were made to the note found in the training data, yet only 7 annotations were made to the note in the test data. For example, “I want to wear my red and black dress [at my funeral]” was annotated as “instructions” in the training data and was left unannotated in the test data.
The suicide notes provided in the data set were transcriptions of hand-written notes. These notes contained many spelling errors and tokenization inconsistencies. It was unclear where these errors originate, but we suspect some are genuine errors from the author and others were transcription errors in preparing the data sets eg, ”3333 Burnet Ave” is sometimes ”3333 Burent Ave”; other such errors could have been introduced. Our spelling correction algorithm fixed minor errors (eg, “sufering” → “suffering”; “attemp” → “attempt”; “beond” → “beyond”) but failed to correct more complex errors that involved more than a single substitution, transposition, insertion or deletion. For example, “capsuls” was amended by our system as “capsule” instead of “capsules”. Additionally, many spelling errors involved words (often spelled somewhat phonetically) that could not be corrected at all by this simple method: “hemorige” (“hemorrhage”), “disponded” (“despondent), “rearenge” (“rearrange”). Some mispelled words were “corrected” erroneously. There were also instances where the algorithm corrected words that weren’t incorrect. For example, changing the abbreviations “appt” (appointment) → “apt” and “tel” (telephone) → “tell”. Some of these errors may have been addressed more accurately by using an n-gram language model to estimate the best possible correction.10 For example, the phrase “get bettery charged” should have been corrected as “get battery charged” but was actually changed to “get better charged”.
Introducing dependency relations (Dd) into the model provided a large boost to the overall system performance: the second largest increase in precision, the second largest increase in recall, and the second largest increase in overall F1. The variable dependencies feature (Dv) conflated dependencies such as “dobj(blame, John)” and “dobj(blame, Mary)” into “dobj(blame, x)”. We expected that this could help with data sparsity issues and we demonstrated large gains in recall when using this feature. Unfortunately, this also introduced a large drop in precision. These variable dependencies introduced much noise, possibly because we were not differentiating between the arguments of the dependencies we were conflating. For example, the annotation guidelines for the blame label state that the author of the note should have been blaming someone. However, conflating “dobj(blame, money)” and “dobj(blame, weight)” with “dobj(blame, Mary)” is unhelpful given these guidelines. Had we used entity detection to determine that both Mary and John were people, we could have constructed “dobj(blame, PERSON)”, separating those examples from “dobj(blame, THING)” and potentially improving on our performance.
As one might expect, each classifier did particularly poorly on the emotions that occurred infrequently in the training data. Indeed, the performance of the classifiers for these emotions was so poor that we had better results simply ignoring these emotions rather than include them in our final labeling. Improving our performance on these emotions should be the focus of the continuing development of this work; we suspect that additional training data would have aided.
Our attempts to use classifier combinations were only partially successful. We have demonstrated that introducing the emotionless classifier to boost the confidence of our labelings provided a large increase in recall; however, this yielded a large decrease in precision, with the F1 score remaining largely unchanged. We also explored using logistic regression to select a panel of orthogonal classifiers with combinations of features that might better balance precision and recall. The efforts to select small panels from 48 combinations of features using regression models were not feasible in the time available to us but may warrant further investigation. An exhaustive evaluation of all pairs and triples of classifier combinations found that no combination of two or three classifiers outperformed the best standalone classifier.
In developing our classifiers, we tried to consider the practical applications of the findings from this exercise.11 The loss of any life is sad, and the early termination of one’s own life particularly so. There is no doubt in our minds that the sentiments expressed in the suicide notes must have been present prior to the actual time of suicide. We consider that there may have been previous efforts to express these emotions to other people. We anticipate that one might consider employing an automatic detection algorithm on social networking platforms. This could review posts and activate access to support networks. However, only systems with very high precision would be of any practical value: high precision is more important than high recall because we would not wish to propose interventions unless we were extremely confident in our predictions.
Towards that goal, our final system, S3, attempted to achieve high precision at the expense of recall, while minimizing the impact on F1 score. Potentially, we could have looked to further improve precision with detrimental effects on recall and F1 score, though it is encouraging that we achieved high precision while maintaining an F1 score similar to the mean of all systems submitted to this shared task. Further research might be required to consider whether the sentiments expressed on suicide notes are truly expressed previously. Further information on the age, gender and physical and psychiatric health of these people may be of value.
Previous work in suicide note authorship detection used structural and grammatical features such as the number of paragraphs in the note, the number of misspellings, and the depth of parse tree.12 Our intuition was that these features would not have been useful here, though we corrected spelling errors and used dependency relations from a parser. Structural and grammatical features should be investigated further.
We have shown that it is possible to construct classifiers that can perform well on this task and we have shown that dependency relations can be used effectively as features in these classifiers. We presented a classifier that performs at high levels of precision in order to be useful as a mechanism to propose intervention. We are confident that further improvements could be achieved using off-the-shelf components as done here.
We acknowledge the efforts of the i2b2 organizers and all of the brave efforts of the volunteers who kindly annotated the extensive and sensitive dataset.
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
See the Dependencies Manual accompanying the Stanford Parser for more details.
1. Pestian JP, Matykiewicz P, Linn-Gust M, et al. Sentiment analysis of suicide notes: A shared task. Biomedical Informatics Insights. 2012;5(Suppl. 1):3–16. [PMC free article] [PubMed]
2. Nigam K, Lafferty J, McCallum A. Using maximum entropy for text classification. IJCAI-99 Workshop on Machine Learning for Information Filtering. 1999:61–7.
3. Ratnaparkhi A. A maximum entropy model for part-of-speech tagging. Proceedings of the First Conference on Empirical Methods in Natural Language Processing; 1996.
4. Chieu HL, Ng HT. Named entity recognition with a maximum entropy approach. Proceedings of CoNLL-2003; 2003. pp. 160–3.
5. Manning CD, Klein D. Optimization, maxent models, and conditional estimation without magic. HLT-NAACL Tutorial. 2003.
6. Brants T, Franz A. Web 1T 5-gram, ver. 1. LDC2006T13, Linguistic Data Consortium. Philadelphia: 2006.
7. Bird S, Loper E, Klein E. O’Reilly Media Inc. Natural Language Processing with Python. 2009
8. Nastase V. Unsupervised all-words word sense disambiguation with grammatical dependenices. IJCNLP. 2008:757–62.
9. de Marneffe MC, MacCartney B, Manning CD. Generating typed dependency parses from phrase structure parses. LREC. 2006 2006.
10. Brill E, Moore RC. An improved error model for noisy channel spelling correction. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics; 2000. pp. 286–93.
11. Foster T. Suicide note themes and suicide prevention. International journal of psychiatry in medicine. 2003;33(4):323–31. [PubMed]
12. Pestian J, Nasrallah H, Matykiewicz P, Bennett A, Leenaars A. Suicide note classification using natural language processing: A content analysis. Biomedical Informatics Insights. 2010;2010(3):19–28. [PMC free article] [PubMed]
Articles from Biomedical Informatics Insights are provided here courtesy of
Libertas Academica