We describe and evaluate an automated approach used as part of the i2b2 2011 challenge to identify and categorise statements in suicide notes into one of 15 topics, including Love, Guilt, Thankfulness, Hopelessness and Instructions. The approach combines a set of lexico-syntactic rules with a set of models derived by machine learning from a training dataset. The machine learning models rely on named entities, lexical, lexico-semantic and presentation features, as well as the rules that are applicable to a given statement. On a testing set of 300 suicide notes, the approach showed the overall best micro F-measure of up to 53.36%. The best precision achieved was 67.17% when only rules are used, whereas the best recall of 50.47% was achieved with integrated rules and machine learning. While some topics (eg, Sorrow, Anger, Blame) prove challenging, the performance for relatively frequent (eg, Love) and well-scoped categories (eg, Thankfulness) was comparatively higher (precision between 68% and 79%), suggesting that automated text mining approaches can be effective in topic categorisation of suicide notes.
Automated processing and categorisation of subjective and affective statements (eg, in blogs, tweets, suicide notes) have been both a challenging and hot topic in the humanities and text mining communities in the last decade, in particular with the development of Web 2.0 technologies. Several methods have been developed to automatically identify main messages, sentiments and opinions presented in such environments, across different domains and communities.1–4
The need for quantitative and computational processing of suicide notes in particular has been highlighted as a way to identify any risks of (repeat) attempts as presented in Web 2.0 sources.5,6 The aim of the i2b2 Medical NLP challenge 2011 (Track II) was to classify statements in de-identified suicide notes into 15 different topics (see the challenge description7 for a detailed description of the task). These topics included eleven emotions (Hopelessness, Love, Guilt, Thankfulness, Anger, Sorrow, Hopefulness, Happiness_peacefulness, Fear, Pride, Forgiveness) and four other categories (Instructions, Information, Blame and Abuse). The task was to categorise each line (roughly corresponding to a sentence) into one or more of these topics, or to tag it as referring to none of these topics.
Previous work on computational analysis of suicide notes has been concerned with content analysis (eg, distribution of positive, negative and emotion words).6,8,9 Various discriminative features (eg, emotional concepts, part-of-speech tags (POS), readability scores, etc.) have been considered.6,8 A comparative study between a set of automatic classification algorithms and human counterparts to distinguish genuine suicide notes from simulated ones showed promising results, as nine out of ten machine classification algorithms outperformed the human counterparts (a team of 11 mental health professionals) in distinguishing genuine notes from elicited ones.8
Our approach to the task was a hybrid method that integrates rule-based and machine learning (ML) predictions into a topic-categorisation module. Rule-based predictions combined lexical and syntactic patterns with common expressions empirically associated with a given category. The machine learning module consists of a set of classifiers built using a set of features to provide sentence-level predictions. The two prediction modules were combined using three different approaches, which corresponded to the three runs submitted to the challenge. Our best run combined predictions based on rules and all ML scores, resulting in a (micro) F-measure of 53.36%, with 50.47% recall and 56.61% precision. The following sections describe the method in more detail and provide discussions of the results.
An analysis of a set of 300 suicide notes that had been provided by the organisers of the i2b2 2011 challenge7 as the training data revealed that most lines consisted of single sentences, but that there were cases where several sentences appeared in the same line. We further noted that lines that had several topic categories attached to them were likely to have either multi-focal sentences (ie, sentences that contain several statements, either about related or unrelated issues) or indeed several separate sentences. Therefore, the general idea underlying our approach was to determine topic categories for each of the sentences in a note and then integrate sentence-level predictions at the line level.
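The idea of predicting at the sentence level and then integrating at the line level can be sketched as follows. This is a minimal illustration only: the regular-expression sentence splitter and the keyword-based `toy_classifier` are hypothetical stand-ins, not the pre-processing or the rules and models used in the system.

```python
import re

def split_sentences(line):
    # Naive splitter (an assumption for illustration): break after '.', '!'
    # or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return [p for p in parts if p]

def line_labels(line, classify_sentence):
    # Line-level labels are the union of the labels of all sentences in the line.
    labels = set()
    for sentence in split_sentences(line):
        labels |= set(classify_sentence(sentence))
    return labels

def toy_classifier(sentence):
    # Hypothetical keyword lookup standing in for the real rules and ML models.
    text = sentence.lower()
    found = set()
    if "love" in text:
        found.add("Love")
    if "thank" in text:
        found.add("Thankfulness")
    return found

print(line_labels("Thank you for everything. I love you all.", toy_classifier))
```

A line with two sentences therefore receives both labels even though each sentence contributes only one.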
The system developed for the topic-categorisation consists of four major modules: (A) pre-processing, (B) rule-based predictions, (C) ML-driven predictions and (D) result integration module. Figure 1 provides a detailed system architecture diagram.
In Run 1, the goal was to optimise recall. For a given line, we therefore collected all topic categories returned by all the rules that fired for any of its sentences and all the predictions returned by applying the ML models to them. Based on empirical evidence from the training data, we decided not to use the ML model built for the Information category, given its relatively poor performance (ie, no improvements) compared to other “large” categories (see Table 4). Similarly, we decided not to use the rule-based features in the ML models, again based on the results of experimentation on the training data (the best recall was achieved when the rule-based features were omitted; data not shown).
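A minimal sketch of the Run 1 integration is given below, assuming that each sentence result carries the set of categories returned by the fired rules and, for each ML model, the predicted label (either the model's own category or Other). The data layout, the function name `run1_line_labels` and the list of ML categories in use (inferred from the "large" categories named in the text, minus Information) are assumptions made for illustration.

```python
# ML models assumed to be in use: the "large" categories except Information.
ML_CATEGORIES_USED = ("Instructions", "Hopelessness", "Love", "Guilt")

def run1_line_labels(sentence_results):
    # Union of every rule label and every positive ML prediction
    # over all sentences in the line (recall-oriented).
    labels = set()
    for result in sentence_results:
        labels |= set(result["rules"])
        for category, predicted in result["ml"].items():
            if category in ML_CATEGORIES_USED and predicted == category:
                labels.add(category)
    return labels

# Hypothetical line with two sentences.
line = [
    {"rules": {"Love"},
     "ml": {"Love": "Love", "Guilt": "Other", "Information": "Information"}},
    {"rules": set(),
     "ml": {"Hopelessness": "Hopelessness", "Instructions": "Other"}},
]
print(run1_line_labels(line))
```

In this sketch the positive prediction of the Information model is ignored, mirroring the decision to leave that model out.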
In Run 2, we considered only predictions returned by the rule-based module. The goal for this run was to optimise precision. Final predictions for a given line included all categories returned by the rules from each sentence in that line.
In Run 3 we used all the initial results from the rule-based module (as in Run 2) and a single best prediction from the four ML models (again, the Information category was omitted). In this run, the ML models included the rule-based features (as opposed to Run 1). Overall, we aimed at optimising the F-measure with this run. Predicted labels of the ML models were ranked by prediction confidence as provided by RapidMiner, and the label with the highest confidence was used. These values were comparable since all of the ML models were based on the same approach (Naive Bayes).
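The selection step in Run 3 can be sketched as below. Two assumptions are made that the text does not settle: the single best ML prediction is chosen per sentence, and only positive predictions (not Other) are ranked. The function name and data layout are hypothetical, and the confidence values are invented for the example rather than taken from RapidMiner.

```python
# ML models assumed to be in use: the "large" categories except Information.
ML_CATEGORIES_USED = ("Instructions", "Hopelessness", "Love", "Guilt")

def run3_sentence_labels(rule_labels, ml_results):
    # Keep every rule label, then add at most one ML label:
    # the positive prediction with the highest confidence.
    labels = set(rule_labels)
    positives = [
        (confidence, category)
        for category, (predicted, confidence) in ml_results.items()
        if category in ML_CATEGORIES_USED and predicted == category
    ]
    if positives:
        labels.add(max(positives)[1])
    return labels

# Hypothetical sentence: no rule fired, two ML models gave positive predictions.
ml_results = {
    "Hopelessness": ("Hopelessness", 0.70),
    "Guilt": ("Guilt", 0.60),
    "Love": ("Other", 0.95),
    "Instructions": ("Other", 0.80),
}
print(run3_sentence_labels(set(), ml_results))
```

Compared with taking all positive predictions, this keeps only Hopelessness and drops the less confident Guilt label, which is the kind of exclusion that can lower recall.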
The task was evaluated on a test dataset containing another set of 300 suicide notes as provided by the organisers. The “gold” annotation topic categories were manually provided for each line by three annotators. The organisers estimated the inter-annotator agreement as 0.546 (using Krippendorff’s alpha coefficient).7
The system performance was primarily estimated using the micro-averaged F-measure, which averages the results across all annotations (line-level). The test results of our system are given in Table 2. Run 1 gave the best results, with the highest F-measure (53.36%) and the highest recall (50.47%). As expected, the best precision (67.17%) was achieved in Run 2, with all predictions coming from the rule-based module. Run 3 was an attempt to compromise between the first two runs, as reflected by the results (but it failed to get the best F-score).
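As a sketch of how the micro-averaged scores are obtained, the code below pools true positives, predicted labels and gold labels over all lines before computing precision, recall and their harmonic mean. The function names and the toy annotations are illustrative assumptions; the official evaluation script of the challenge is not reproduced here. The last line checks that the reported precision (56.61%) and recall (50.47%) are consistent with the reported F-measure (53.36%).

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_prf(gold, predicted):
    # gold and predicted are lists of label sets, one set per line.
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall, f_measure(precision, recall)

# Toy annotations (invented for illustration).
gold = [{"Love"}, {"Guilt", "Love"}, set()]
predicted = [{"Love"}, {"Guilt"}, {"Information"}]
print(micro_prf(gold, predicted))

# Consistency check on the figures reported for the best run.
print(round(f_measure(56.61, 50.47), 2))
```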
Category-specific results are given in Table 3. We note that the “large” categories (such as Instructions, Hopelessness, Love and Guilt) have reasonably high and comparable performance, with Love consistently showing the best results (F-measure of 67.34%). The exception is the Information category (F-measure of 29.73%), probably due to the very broad scope of this topic. The results for mid- and low-frequency categories relied on rules only, and typically showed poor performance, with notable exceptions of Thankfulness (F-measure of 72.53%) and Happiness_peacefulness (F-measure of 53.85%). Still, the rules (Run 2) overall provided relatively high precision (67.17%).
Run 3 attempted to optimise F-measure, but the drop in recall was significant probably due to (1) excluding the less confident predictions from the ML models, and (2) using the ML models with rule-based features, which proved to increase precision but have the reverse effect on recall (data not shown). Table 3 also shows the macro-averaged results (averaged over topic categories), which were significantly lower than the micro-averaged ones, given that there were categories (eg, Sorrow and Abuse) with no correct predictions.
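The gap between the two averages can be illustrated with a small sketch. It assumes that the macro average is the unweighted mean of per-category F-measures (the text does not state whether F or precision and recall were averaged), and the toy data are invented: one frequent category predicted perfectly and one rare category with no correct predictions.

```python
def f_measure(tp, n_predicted, n_gold):
    # F-measure from counts; defined as 0 when there are no true positives.
    if tp == 0:
        return 0.0
    precision = tp / n_predicted
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)

def category_counts(gold, predicted, category):
    tp = sum(1 for g, p in zip(gold, predicted) if category in g and category in p)
    n_predicted = sum(1 for p in predicted if category in p)
    n_gold = sum(1 for g in gold if category in g)
    return tp, n_predicted, n_gold

def macro_f(gold, predicted, categories):
    # Unweighted mean of per-category F-measures.
    scores = [f_measure(*category_counts(gold, predicted, c)) for c in categories]
    return sum(scores) / len(scores)

def micro_f(gold, predicted, categories):
    # Counts are pooled over categories before computing F.
    counts = [category_counts(gold, predicted, c) for c in categories]
    tp, n_predicted, n_gold = (sum(col) for col in zip(*counts))
    return f_measure(tp, n_predicted, n_gold)

categories = ["Love", "Sorrow"]
gold = [{"Love"}, {"Love"}, {"Love"}, {"Sorrow"}]
predicted = [{"Love"}, {"Love"}, {"Love"}, set()]
print(macro_f(gold, predicted, categories), micro_f(gold, predicted, categories))
```

Here the rare category contributes an F-measure of zero with full weight to the macro average, while it barely affects the pooled micro counts.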
When compared to the results on the training dataset (see Table 4), there are drops in the overall micro F-measure of between 6.38 and 7.86 percentage points. There were differences in the performance drops for specific categories: while Love performed mostly consistently (drop of 3%–5%), performance for the Information category dropped between 14 and 19 percentage points, indicating again the wider scope of this category that has not been captured by rules or ML approaches (likely due to lexical variability and limitations of our topic dictionaries). There were also significant drops in performance for Guilt, in particular in the runs that included ML-based predictions, indicating again that the models have not generalised well (see Table 5 for FP and FN examples).
While the rules (Run 2) did not fail for some of the “large” categories (Hopelessness, Love and Guilt), there were significant drops for Instructions (a large category) and Information (a wide scope) when compared to the training data. As expected, the rules developed for the mid- and low-frequency categories in principle did not show consistent performance. Notable exceptions are Thankfulness (one of the “easiest” categories to predict) and Happiness_peacefulness, both of which provided even better performance on the test dataset than on the training data.
We also note that the overall drop in precision for Run 2 (rules only) between the two datasets was significant and even larger than the (expected) drop in recall, indicating some confusion between categories (eg, between Instructions and Information; see Table 6). In many cases, the difference between Instructions and Information is very subtle and requires sophisticated processing (eg, ‘you will find my body’). Information additionally showed a high degree of lexical variability, which was difficult to “capture” with rules or with the ML models. Instructions did show more syntactic constraints, which resulted in reasonable performance overall.
Another example where the rule-based approach showed a significant drop in precision (from 81% to 24%) was the Blame category (see Table 7 for examples). An inherent limitation of our rule-based approach was reliance on topic-specific dictionaries mainly derived from the dataset. Our manual analysis for Blame did not come up with any specific lexical constraints, which made the rules less productive. In addition, a number of FP cases were due to confusion with Guilt (see Tables 5 and 7 for some examples); as with Information and Instructions, the differences can be very subtle.
Tables 3 and 4 show that our approach could profile the Thankfulness and Love categories relatively well, whereas Sorrow and Anger, as well as Abuse, proved to be challenging, with virtually no or very few correct predictions in the test dataset. In addition to the training data and examples being scarce for these categories (very few rules and basically no category-specific dictionary, see Table 1), it also seems that wider and deeper affective processing is needed to identify the subtle lexical expression of grief, sadness, disappointment, anger etc. (see Table 8 for some examples). Of course, the task proved to be challenging even for human annotators (Krippendorff’s alpha coefficient of 0.546), with many gold standard annotations that could be considered questionable or at least inconsistent. This is particularly the case with multi-focal sentences, where many labels seem to be missing (for example, ‘My mind seems to have goen a blank, Forgive me. I love you all. so much.’ is not labelled as Love; ‘(signed) John My wisfe is Mary Jane Johnson 3333 Burnet Ave. Cincinnati, Ohio OH-636-2051 Call her first’ was annotated only as Instructions, but not as Information).
In the current approach, we did not try to split individual multi-focal sentences apart and process the parts individually (of course, all sentences in a given line were processed separately). Instead, we hypothesised that we could collect the results from each of the separate ML models and all of the triggered rules at the sentence level, and thus produce multi-label annotations (both at the sentence and consequently at the line level). For example, the sentence ‘Wonderful woman, I love you but can’t take this any longer.’ triggered two rules (one for Love and one for Hopelessness); the ML models for those two classes also gave positive predictions, while the other two ML models predicted the Other label. This resulted in the final prediction for the sentence consisting of both the Love and Hopelessness labels. Still, future work may explore whether splitting multi-focal sentences would provide better precision, given that some weak evidence in separate parts of a multi-focal sentence could be combined by an ML model to provide (incorrect) higher confidence and thus result in an FP. However, the experiments on both the training and testing data have shown that there was no “over-generation” of labels. The rules were built to have high precision, so in most cases only one rule fired per sentence, and cases with more than two fired rules were very rare. An analysis of the ML results revealed that in the majority of cases only one of the four ML models predicted its respective category for a given sentence. Cases where more than one ML prediction was made seem to be related to multi-focal sentences, and our best results were achieved with all ML predictions taken into account (Run 1).
Identification of topics expressed in suicide notes proved to be a challenging task for both manual and automated analyses. Our approach to the prediction of topic categories relied on combining hand-crafted rules (which included lexical, syntactic and lexico-semantic components) and various features used in the ML models (which included lexical, lexico-semantic and presentation features, as well as named entities and rules that were linked to corresponding sentences). The results showed reasonable performance for frequent and relatively well-scoped topics (eg, Thankfulness, Love, Instructions), whereas infrequent and non-focused categories (eg, Sorrow, Anger, Blame, Information) proved to be challenging. Future work will need to be informed by a detailed error analysis and in particular by further investigation of prediction confusions between various topic categories. The effects of particular features (eg, presentation, named entities, etc.) on performance will also need to be further explored. Still, the current approach not only indicates the limits of the component technologies, but also demonstrates the potential of combining or selecting different approaches for different topic categories.
This work was partially supported by a PhD scholarship from the School of Computer Science, University of Manchester and EPSRC (to AD) and the Serbian Ministry of Education and Science (projects III47003, III44006, to AK, GN).
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.