This paper describes a system for automatic emotion classification, developed for the 2011 i2b2 Natural Language Processing Challenge, Track 2. The objective of the shared task was to label suicide notes with 15 relevant emotions at the sentence level. Our system uses 15 SVM models (one per emotion), each trained with the combination of features that was found to perform best for that emotion. Features included lemma and trigram bag-of-words features, and information from semantic resources such as WordNet, SentiWordNet and subjectivity clues. The best-performing system labeled 7 of the 15 emotions and achieved an F-score of 53.31% on the test data.
The second track of the 2011 i2b2 Natural Language Processing Challenge1 is a shared task on emotion classification. Its aim is to automatically annotate suicide notes with a set of emotions. The data used for the challenge consisted of a training set of 600 suicide notes, and a test set of 300 notes.
The notes were annotated by three human annotators, who were asked to assign any of 15 relevant labels to each sentence in a note. As a result, each sentence could be annotated with none, one or more of the following labels: abuse, anger, blame, fear, forgiveness, guilt, happiness, hopefulness, hopelessness, information, instructions, love, pride, sorrow, thankfulness. An annotation was retained if at least two annotators agreed on it. Inter-annotator agreement, measured with Krippendorff’s α coefficient using Dice’s coincidence index, was 0.546 at the sentence level.1
On average, notes were 7.7 sentences long and contained 132.5 tokens (17.2 tokens per sentence) in the training set, and 7.0 sentences long with 121.5 tokens (17.5 tokens per sentence) in the test set. The distribution of the labels in both sets is presented in Table 1.
The operational unit of the task is the sentence. A successful system would accurately predict for each sentence which emotions, if any, are present. Because the 15 emotion labels are not mutually exclusive, there are 2^15 = 32,768 possible label combinations. We therefore decided to use 15 binary classifiers that each determined whether or not to assign a specific emotion, and combined their outputs.
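This one-vs-rest decomposition can be sketched as follows, using scikit-learn's LinearSVC as a stand-in for SVM-Light; the toy sentences, their labels, and the three-emotion subset are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

EMOTIONS = ["anger", "guilt", "love"]  # 3 of the 15 labels, for brevity

# Toy training data: each sentence may carry any subset of the labels.
sentences = ["I am so angry at myself", "Forgive me, I love you", "This is all my fault"]
labels = [{"anger"}, {"love"}, {"guilt"}]

vec = CountVectorizer()
X = vec.fit_transform(sentences)

# One independent binary SVM per emotion.
models = {}
for emo in EMOTIONS:
    y = [1 if emo in s else 0 for s in labels]
    models[emo] = LinearSVC().fit(X, y)

# Prediction: combine the outputs of all per-emotion classifiers.
test = vec.transform(["I love you and I am sorry"])
predicted = {emo for emo, m in models.items() if m.predict(test)[0] == 1}
```

Each classifier sees the same feature vectors but its own binary target, so the per-sentence label set is simply the union of the positive decisions.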
Shallow inspection of the data showed that most emotions were strongly lexicalized. We hypothesized that classifiers would perform adequately with a feature set that generalized lexical information and included subjectivity information from external resources.
The data was first preprocessed with the Memory-Based Shallow Parser (MBSP) for Python v1.4,2 which provided lemmas and part-of-speech tags. The following features were extracted from the training data: lemma and trigram bag-of-words features, and subjectivity information from external resources (WordNet, SentiWordNet and subjectivity clues).
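The lexical part of this feature set can be sketched in a few lines of Python; the tokens and lemmas below are hypothetical stand-ins for MBSP output:

```python
from collections import Counter

def bow_features(tokens, lemmas):
    """Lemma unigram plus token-trigram bag-of-words features."""
    feats = Counter(("lemma", l) for l in lemmas)
    feats.update(("trigram", tuple(tokens[i:i + 3]))
                 for i in range(len(tokens) - 2))
    return feats

# Hypothetical sentence, tokenized and lemmatized:
tokens = ["please", "forgive", "me", "for", "everything"]
lemmas = ["please", "forgive", "I", "for", "everything"]
feats = bow_features(tokens, lemmas)
```

Keying each feature by its type ("lemma" vs. "trigram") keeps the two bags distinct when they are merged into one sparse vector.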
All experiments were done with Support Vector Machine (SVM) classifiers. A standard SVM is a supervised learning classifier for binary classification. It learns from the training instances by mapping them to a high-dimensional feature space using a kernel function, and constructing a hyperplane (the decision boundary) along which they can be separated into the two classes. Unseen instances are mapped to the same feature space and labeled depending on their position with respect to the decision boundary. The perpendicular distance from an instance to the hyperplane can be used as a measure of classification certainty.
SVM-Light5 was used in our experiments, through the pysvmlight Python binding. SVM-Light outputs a floating point number for unseen instances: its sign designates the position, its absolute value the distance relative to the decision boundary. Bootstrap resampling6 was used to determine for each classifier which decision threshold maximized F-score.
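The threshold search can be sketched as follows. This is a simplification that picks the threshold on a single sample of decision values rather than over bootstrap replicates, and the scores and gold labels are hypothetical:

```python
def f_score(gold, pred):
    """F-score for a single binary label."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, gold):
    """Pick the decision threshold on SVM outputs that maximizes F-score.

    The default SVM boundary is 0; shifting it trades precision for recall.
    """
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return max(candidates,
               key=lambda t: f_score(gold, [s >= t for s in scores]))

# Hypothetical decision values from SVM-Light, with gold labels:
scores = [-1.2, -0.4, -0.1, 0.3, 0.8]
gold = [False, True, True, True, True]
t = best_threshold(scores, gold)
# Here t = -0.4: all four positives are caught with no false positives.
```

Only the observed decision values (plus one value above the maximum, for "assign nothing") need to be tried as candidate thresholds, since F-score only changes where a prediction flips.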
We experimentally determined the best-performing combination of features for each of the 15 emotions. For the majority of the emotions, lemma and trigram bag of words proved to be indispensable features. For 6 emotions, these features alone yield the best results, while for another 7 emotions, classifiers achieve the best scores with the addition of subjectivity clues. Only 2 emotions benefit from WordNet and Senti-WordNet information.
Data sparsity is a problem for some emotions. This has a direct influence on classifier performance, given that the classifiers use supervised learning and rely on positive examples of a class to learn from. All the best classifiers for emotions with an incidence below 20‰ (average number of annotations per 1000 sentences in the training data) have an F-score below 21.0, and all the best classifiers for emotions with an incidence above 20‰ score above 28.0. Emotions with an incidence of over 40‰ all score above 40.0, along with thankfulness, which proves easily learnable despite a low incidence of 20.3‰. It is likely that classifier performance for the low-incidence emotions would rise considerably if more training data were obtained, without the need for new features.
In order to produce the final system output, the outputs of the individual emotion classifiers are combined into one output file. Global system performance is calculated in terms of micro-averaged F-score, which is computed globally over all annotations; a macro-averaged F-score, by contrast, would be computed per emotion first and then averaged over the 15 emotions.
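The difference between the two averages can be made concrete with a short sketch; the per-emotion counts below are hypothetical:

```python
def prf(tp, fp, fn):
    """F-score from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts per emotion: (true positives, false positives, false negatives)
counts = {"information": (90, 20, 30), "love": (40, 10, 10), "pride": (1, 5, 9)}

# Micro-average: pool all counts, then compute one F-score.
TP = sum(c[0] for c in counts.values())
FP = sum(c[1] for c in counts.values())
FN = sum(c[2] for c in counts.values())
micro_f = prf(TP, FP, FN)

# Macro-average: compute F per emotion, then average.
macro_f = sum(prf(*c) for c in counts.values()) / len(counts)
```

With these counts the micro-average (about 0.76) exceeds the macro-average (about 0.57), because the poorly-predicted rare emotion drags the macro figure down while contributing few annotations to the pooled counts.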
Because micro-averaged F-score gives equal weight to each annotation, good performance on majority classes is important, because they have a larger number of annotations and therefore influence the global F-score more. Similarly, rare emotions, if predicted correctly, only bring a small positive contribution to overall F-score. However, if there is a lot of noise in the predictions due to high recall with low precision, minority classes can have a substantial negative influence on global F-score.
For this reason, we tried leaving out annotations of rare emotions, on which our classifiers performed poorly, and determined experimentally which pruned set of emotions yielded the best overall result on the training data. This was achieved by leaving out emotions with a frequency of less than 2% in the training set, resulting in output containing only 7 emotions: blame, guilt, hopelessness, information, instructions, love and thankfulness. The test data was then processed with classifiers trained on all the training data, using the appropriate feature set and threshold per emotion. Two versions of the output were submitted: one containing all emotions, the other containing only the pruned set of emotions.
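The pruning step amounts to a simple filter over the predicted annotations. The retained set mirrors the seven emotions named above; the per-sentence label sets are a hypothetical representation of classifier output:

```python
KEPT = {"blame", "guilt", "hopelessness", "information",
        "instructions", "love", "thankfulness"}

def prune(predictions):
    """Drop predicted annotations for emotions outside the retained set."""
    return [labels & KEPT for labels in predictions]

# Hypothetical classifier output for three sentences:
predicted = [{"guilt", "pride"}, {"love"}, {"abuse", "instructions"}]
pruned = prune(predicted)
# → [{"guilt"}, {"love"}, {"instructions"}]
```

Filtering only the output means the underlying classifiers are unchanged; the same models can produce both submitted runs.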
Table 2 presents the overall F-scores for all emotions and for the best-performing pruned set of emotions, both on the training data and on the test data. Pruning the output resulted in an increase in micro-averaged F-score of 1.93 percentage points on the training data, and 2.12 percentage points on the test data.
The scores on the test data are 2.08 and 2.27 percentage points higher than the scores on the training data, for all emotions and for the pruned set, respectively. These increases indicate that the classifiers did not overfit the training data.
This paper described experiments with lexico-semantic features for emotion classification in suicide notes. The results suggested that such features perform well, but suffer from data sparseness. This could be remedied by collecting more training examples for rare emotions.
An avenue for future work would be to investigate the effect of applying spelling correction as a preprocessing step. Given the number of spelling errors in the data, and the dependence of our classifiers on lexical features such as lemmas and trigrams, data sparsity could be significantly reduced by correcting spelling mistakes.
Deeper semantic analysis of suicide notes could also yield informative features for emotion classification. Furthermore, classifiers might benefit from features that model negation and modality. Simple bag-of-words features alone do not capture such modifications, which may flip the meaning of significant word sequences.
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.