The task was evaluated on a test dataset containing another set of 300 suicide notes as provided by the organisers. The “gold” annotation topic categories were manually provided for each line by three annotators. The organisers estimated the inter-annotation agreement as 0.546 (using Krippendorff’s alpha coefficient).
7The system performance was primarily estimated using the micro-averaged F-measure, which averages the results across all annotations (line-level). The test results of our system are given in . Run 1 gave the best results, with the highest F-measure (53.36%) and the highest recall (50.47%). As expected, the best precision (67.17%) was achieved in Run 2, with all predictions coming from the rule-based module. Run 3 was an attempt to compromise between the first two runs, as reflected by the results (but it failed to get the best F-score).
| Table 2.Micro-averaged results on the test data. |
Category-specific results are given in . We note that the “large” categories (such as Instructions, Hopelessness, Love and Guilt) have reasonably high and comparable performance, with Love consistently showing the best results (F-measure of 67.34%). The exception is the Information category (F-measure of 29.73%), probably due to the very broad scope of this topic. The results for mid- and low-frequency categories relied on rules only, and typically showed poor performance, with notable exceptions of Thankfulness (F-measure of 72.53%) and Happiness_ peacefulness (F-measure of 53.85%). Still, the rules (Run 2) overall provided relatively high precision (67.17%).
| Table 3.Per-category performance on the test data. |
Run 3 attempted to optimise F-measure, but the drop in recall was significant probably due to (1) excluding the less confident predictions from the ML models, and (2) using the ML models with rule-based features, which proved to increase precision but have the reverse effect on recall (data not shown). also shows the macro-averaged results (averaged over topic categories), which were significantly lower than the micro-averaged ones, given that there were categories (eg, Sorrow and Abuse) with no correct predictions.
When compared to the results on the training dataset (see ), there are drops in the overall micro F-measure of between 6.38 and 7.86 percentage points. There were differences in the performance drops for specific categories: while Love performed mostly consistently (drop of 3%–5%), performance for the Information category dropped between 14 and 19 percentage points, indicating again the wider scope of this category that has not been captured by rules or ML approaches (likely due to lexical variability and limitations of our topic dictionaries). There were also significant drops in performance for Guilt, in particular in the runs that included ML-based predictions, indicating again that the models have not generalised well (see for FP and FN examples).
| Table 5.Examples of FPs and FNs for Guilt. |
While the rules (Run 2) did not fail for some of the “large” categories (Hopelessness, Love and Guilt), there were significant drops for Instructions (a large category) and Information (a wide scope) when compared to the training data. As expected, the rules developed for the mid- and low-frequency categories in principle did not show consistent performance. Notable exceptions are Thankfulness (one of the “easiest” categories to predict) and Happiness_ peacefulness, both of which provided even better performance on the test dataset than on the training data.
We also note that the overall drop in precision for Run 2 (rules only) between the two datasets was significant and even larger than (expected) drop in recall, indicating some confusion between categories (eg, between Instructions and Information; see ). In many cases, the difference between an Instruction and Information is very subtle and requires sophisticated processing (eg, ‘you will find my body’). Information additionally showed a high degree of lexical variability, which was difficult to “capture” with rules or with the ML models. Instructions did show more syntactic constraints, which resulted in reasonable performance overall.
| Table 6.Examples of confusion between Instructions and Information. |
Another example where the rule-based approach showed a significant drop in precision (from 81% to 24%) was the Blame category (see for examples). An inherent limitation of our rule-based approach was reliance on topic-specific dictionaries mainly derived from the dataset. Our manual analysis for Blame did not come up with any specific lexical constraints, which made the rules less productive. In addition, a number of FP cases were due to confusion with Guilt (see and for some examples) as with Information and Instructions, the differences can be very subtle.
| Table 7.Example FPs and FNs for the Blame category. |
and show that our approach could profile the Thankfulness and Love categories relatively well, whereas Sorrow and Anger, as well as Abuse proved to be challenging, with virtually no or very few correct predictions in the test dataset. In addition to the training data and examples being scarce for these categories (very few rules and basically no category-specific dictionary, see ), it also seems that wider and deeper affective processing is needed to identify the subtle lexical expression of grief, sadness, disappointment, anger etc. (see for some examples). Of course, the task proved to be challenging even for human annotators (Krippendorff’s alpha coefficient of 0.546), with many gold standard annotations that could be considered as questionable or at least inconsistent. This is particularly the case with muti-focal sentences, where many labels seems to be missing (for example, ‘My mind seems to have goen a blank, Forgive me. I love you all. so much.’ is not labelled as Love; ‘(signed) John My wisfe is Mary Jane Johnson 3333 Burnet Ave. Cincinnati, Ohio OH-636-2051 Call her first’ was annotated only as Instructions, but not as Information).
| Table 8.Example of FPs and FNs for the Sorrow and Anger categories. |
In the current approach, we did not try to split individual multi-focal sentences apart and process the parts individually (of course, all sentences in a given line were processed separately). Instead, we hypothesised that we could collect the results from each of the separate ML models and all of the triggered rules at the sentence level, and thus produce multi-label annotations (both at the sentence and consequently at the line level). For example, the sentence ‘Wonderful woman, I love you but can’t take this any longer.’ triggered two rules (one for Love and one for Hopelessness); the ML models for those two classes also gave positive predictions, while the other two ML models predicted the Other label. This resulted in the final prediction for the sentence consisted of both Love and Hopelessness labels. Still, future work may explore if splitting multi-focal sentences would provide better precision, given that some weak evidence in separate parts of the multi-focal sentence could be combined by an ML model to provide (incorrect) higher confidence and thus result in an FP. However, the experiments on both the training and testing data have shown that there was no “over-generation” of labels. The rules were built to have high precision, so in most cases only one rule fired per sentence and cases with more then two fired rules were very rare. An analysis of the ML results revealed that in the majority of cases only one of the four ML models predicted their respective categories for a given sentence. Cases where more than one ML predictions were made seem to be related to multi-focal sentences, and our best results were achieved with all ML predictions taken into account (run 1).