shows the outcome of three different approaches each with three different feature configurations (all features, ngrams, and all features without negation) on the training data. We obtained the best F-measure from the ‘Binary All’ approach, which took the union of individual binary SVM and CRF classifiers trained with all features. From the multi-class classifier we can see that CRF performed better than SVM, which suggests that the sequence of sentences and categories does play a role in emotion detection in the suicide notes.
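As a minimal sketch of the union step behind ‘Binary All’, assume that each classifier returns one set of predicted categories per sentence. The function name, the data layout and the example labels below are illustrative assumptions, not the implementation used in this work.

```python
from typing import List, Set

# Assumed layout: one set of predicted categories per sentence.
Predictions = List[Set[str]]


def union_predictions(svm_preds: Predictions, crf_preds: Predictions) -> Predictions:
    """Label each sentence with every category that either classifier predicted."""
    if len(svm_preds) != len(crf_preds):
        raise ValueError("both classifiers must label the same sentences")
    return [svm | crf for svm, crf in zip(svm_preds, crf_preds)]


# Hypothetical outputs for three sentences.
svm_preds = [{"instructions"}, set(), {"hopelessness"}]
crf_preds = [{"instructions", "love"}, {"guilt"}, set()]
combined = union_predictions(svm_preds, crf_preds)
```

Taking the union can only add labels to a sentence, so it tends to favour recall over precision.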
We also analysed the per-category performance of the ‘Binary All’ approach, both when all features were used and when only a single feature was present (GR (grammatical triple), Subject, Verb, or Negation). Tables 3 and 4 show the results of the ‘Binary All’ classification for each category, alone and combined with manual rules, respectively. For all major categories (instructions to thankfulness), the best results were obtained with the combination of all features, but the difference between ngrams and all features is small, ranging from 1% to 4%. For rare categories (anger to forgiveness), the combination with manual rules outperforms the classifier-only approach. However, with the all, ngram and unigram feature sets, the combination with manual rules reduces overall performance, because it increases false positives (FP) rather than reducing false negatives (FN). With the GR, Subject, Verb and Negation features, where the ML classifiers alone produced no results, performance for rare categories increased.
Table 3. Binary classification result for each class.
Table 4. Binary classification result combined with manual rules for each class.
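The effect of false positives and false negatives on the F-measure can be illustrated with a short calculation. The counts below are invented for illustration and are not taken from the tables; the sketch only shows why adding many FP while recovering few FN lowers the score, whereas a category with no classifier predictions can only gain from the rules.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (assumptions, not results from this work).
# Frequent category: the rules recover 5 FN but add 30 FP.
ml_only = f_measure(tp=50, fp=30, fn=50)
with_rules = f_measure(tp=55, fp=60, fn=45)

# Rare category: the classifier predicts nothing, the rules find some cases.
rare_ml_only = f_measure(tp=0, fp=0, fn=10)
rare_with_rules = f_measure(tp=4, fp=6, fn=6)
```

With these counts the frequent category drops from about 0.56 to about 0.51, while the rare category rises from 0 to 0.4.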
We believe that we could improve our results by finding better ways of combining classifiers, perhaps through stacking or joint inference, techniques which achieved the highest results in BioNLP 2011 [15]. Judging from the manual rule-only results for the rare categories, we also believe that a hybrid system, which combines machine learning predictions for the major categories with manual rule-only results for the rare categories, could boost recognition performance.
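One way such a hybrid could route predictions by category is sketched below. This is a hypothetical illustration of the proposal, not a system that was built or evaluated here; the rare-category set is a partial, assumed list containing only the two rare categories named above.

```python
from typing import FrozenSet, Set

# Assumption: partial list for illustration; the full rare set is not reproduced here.
RARE_CATEGORIES: FrozenSet[str] = frozenset({"anger", "forgiveness"})


def hybrid_labels(ml_labels: Set[str], rule_labels: Set[str],
                  rare: FrozenSet[str] = RARE_CATEGORIES) -> Set[str]:
    """Trust the classifier for major categories and the rules for rare ones."""
    major_from_ml = {c for c in ml_labels if c not in rare}
    rare_from_rules = {c for c in rule_labels if c in rare}
    return major_from_ml | rare_from_rules


# The classifier's 'anger' and the rules' 'instructions' are both discarded.
labels = hybrid_labels({"instructions", "anger"}, {"forgiveness", "instructions"})
```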
For the test data submission, we first chose the best feature model from each category of classifiers (ie, multi-class, binary and hybrid binary) and applied it to the test data. We separately applied the manual rules to the test data. We then combined the output of each model with the manual rules so that the manual rules applied only to sentences where the classifiers had made no predictions. The results we obtained from the scoring website are in .
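The combination step described above, in which manual rules apply only to sentences the classifier left unlabelled, can be sketched as follows. The function name and data layout are assumptions made for illustration.

```python
from typing import List, Set


def apply_rule_fallback(classifier_preds: List[Set[str]],
                        rule_preds: List[Set[str]]) -> List[Set[str]]:
    """Keep classifier labels where present; otherwise fall back to rule labels."""
    if len(classifier_preds) != len(rule_preds):
        raise ValueError("both inputs must cover the same sentences")
    merged = []
    for clf_labels, rule_labels in zip(classifier_preds, rule_preds):
        # Rules are consulted only when the classifier predicted nothing.
        merged.append(set(clf_labels) if clf_labels else set(rule_labels))
    return merged


# Hypothetical predictions for three sentences.
classifier_preds = [{"instructions"}, set(), set()]
rule_preds = [{"anger"}, {"forgiveness"}, set()]
final = apply_rule_fallback(classifier_preds, rule_preds)
```

In this example the rule label for the first sentence is ignored because the classifier already labelled it.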
Data inconsistency
We observed several inconsistencies in the training data, which we believe reduced the performance of both the machine learning and the rule-based approaches. Structurally, the “sentence-level” annotation often spans more than one sentence. For example, “The cards were just stacked against me. Honey Get insurance on furniture soon as you can” is included as one sentence, although it clearly should have been split into two. This “sentence” received a double annotation (hopelessness and instructions); the two annotations would have been separate had the sentences been properly split. The converse also appears.
There were also inconsistencies in the assignment of categories to sentences. We observed significant ambiguities between the two most frequent, non-emotional categories, ‘information’ and ‘instructions’. Some sentences were annotated with both categories, such as “John my books are up under the cash register”. This sentence does contain information, but to the casual reader it is not obvious what is instructional in it. Conversely, “In case anything happens please call my attorney—John Johnson—9999 3333 Burnet Ave” is annotated with both but appears solely instructional. Other sentences annotated with only one of the two appeared to us to have been given the wrong category, while some sentences appeared to contain information or instructions but were left un-annotated.
Within the emotional categories, several very similar sentences were annotated inconsistently. For example, the phrase “God forgive me” was annotated as ‘guilt’ (sometimes combined with ‘hopelessness’) in several sentences, including “My God forgive me for all of my mistakes”, but it appears once with no annotation, once (in a separate note) as ‘instructions’ in “May God forgive me. Take care of them”, and once as ‘hopefulness’ in “May God forgive me, and I pray that I mite be with my wife for ever when we leave this earth”.
Annotation guidelines
Sentence-level annotation of classification categories in free text is an intrinsically difficult task, and the quality of the annotations needs to be ensured by interannotator agreement values. An ideal corpus for machine learning requires consistent annotation, yet emotion language is deeply ambiguous and open to diverse interpretations. Furthermore, a sentence might express that the writer was feeling a certain way when they wrote the text, although this is not in itself explicit in the text. Many of the sentences labelled with anger were labelled as such because the tone of the sentence seemed angry, not because anger was explicitly mentioned: the word “angry” does not appear even once in the corpus of 69 annotated sentences. On the other hand, the statement “I was always afraid to …” directly expresses fear, although that might not be what the author was experiencing at the time that they wrote the sentence. To achieve consistency, annotation guidelines should clarify the intended scenarios for the different categories. A relevant project in this area is the emotion ontology, which is being developed to facilitate annotation of emotions in text [18]. Such an ontology is not an annotation scheme in itself, but provides definitions which can be used to disambiguate between similar categories, such as ‘blame’ and ‘anger’. An ontology specifically for suicide note annotation is proposed and used in [3]. It includes some of the same emotion categories used for annotation in this challenge, although it is more extensive, including categories such as ‘self aggression’ and ‘helplessness’. However, it is not clear in [3] whether the ontology terms are accompanied by disambiguating definitions. Annotation guidelines should also clarify the objective of the natural language processing. On the one hand, if the purpose is to obtain the best performance from an NLP system for emotion identification in itself, the emotions with low prevalence can be regarded as essentially irrelevant. On the other hand, if the objective of the task is to study emotions in the context of suicide, even low-prevalence emotions may be of scientific interest.
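Interannotator agreement of the kind mentioned at the start of this section is often summarised with Cohen's kappa, which corrects observed agreement for agreement expected by chance. The sketch below is a generic illustration for two annotators making one decision per sentence; the labels are invented, and no agreement figures for the challenge corpus are implied.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who each give one label per item."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-sentence decisions on whether 'anger' applies.
annotator_1 = ["anger", "none", "none", "anger", "none", "none"]
annotator_2 = ["anger", "none", "anger", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on four of six sentences, but kappa is only 0.25 because much of that agreement is expected by chance when ‘none’ dominates.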
It is of general interest that the principal emotion found in this suicide note corpus is hopelessness. This can be compared to the result of [3], who find that the most relevant emotion categories for detecting genuine notes are giving things away, hopeless, regret and sorrow. However, detecting emotions such as hopelessness in human text is inherently complicated by the flexibility of words such as “hope” and “wish”. Both [3] and [4] find a surprising role in real suicide notes for structural features, which are not obviously emotional in nature. A parallel in the current task is that the most prevalent category in the notes is instructions. It would be surprising, however, if the same features worked equally well for such non-emotional content as for detecting the emotional sentences.