We study the discrimination of emotions annotated in free texts at the sentence level: a sentence can either be associated with no emotion (neutral) or multiple labels of emotion. The proposed system relies on three characteristics. We implement an early fusion of grams of increasing orders transposing an approach successfully employed in the related task of opinion mining. We apply a filtering process that consists in extracting frequent n-grams and making use of the Shannon’s entropy measure to respectively maintain dictionaries at balanced sizes and keep emotion specific features. Finally the overall system is implemented as a 2-step decision process: a first classifier discriminates between neutral and emotion bearing sentences, then one classifier per emotion is applied on emotion bearing sentences. The final decision is given by the classifier holding the maximum confidence. Results obtained on the testing set are promising.
While opinion mining (the study of opinions as positive, negative or neutral in free texts) has received great attention over the past few years,1 less work has been performed in the field of emotion mining, which aims at identifying emotion labels such as “anger”, “love” or “hate”. The lack of consensus on emotion models, the difficulty of annotating data sets and the complexity of analyzing emotion expressions in free texts all contribute to this situation. The success of opinion mining can be explained by the availability of Internet user ratings as well as the simplicity of opinion representations: opinion classification is often tackled as a classical binary classification task.
Track 2 of the I2B2 challenge consists in learning to discriminate emotion labels in free texts.2 To this aim, participants are provided with a training set made of 600 suicide notes annotated at the sentence level according to M = 15 predefined emotion labels (see Table 1 for a complete list). Sentences in the training set are associated with zero or more emotion labels; Table 1 gives the distribution of the labels over the whole dataset. We observe that sentences labeled with more than one emotion represent approximately 7% of the whole dataset, and that a sentence carries at most 5 emotion labels. The micro averaged F1 score is employed to evaluate submitted systems over a testing set composed of 300 notes. To our knowledge, it is the first challenge on emotion classification particularly focused on machine learning; SemEval 2007 proposed a track (task 14)3 consisting in classifying news headlines for several emotions, but due to the small size of the training set, purely linguistic approaches were strongly favored.
We propose a system based on the early fusion of n-grams of increasing orders for representing sentences. Early fusion is the process of merging information from different sources in the input examples. In other words, it is the process of taking into account features from different sources at the vector level. Fusion performed at the classifier level is called late fusion; fusion performed at the similarity function level is called intermediate fusion.4 Here, each order, i.e. each value of n, defines a specific representation of a sentence; a decision surface is then learned in the space made of the concatenation of these representations.
The motivation behind the use of grams of higher orders is to mix features with increasing lengths for representing expressions of emotions. While unigrams are widely employed for representing documents in the classical text classification task, they do not seem to provide enough description in the case of sentiment analysis. By fusing grams of increasing orders, one is able to make use of richer features to describe naturally complex and subtle expressions of emotions. An interesting example is the negation which plays an important role in the detection of emotions’ patterns. For instance, given the unigram “bad”, the change in polarity held by the expression “not bad” is captured by bigrams. More subtle constructs like “not really bad” are represented by trigrams and higher orders can capture even more complex and subtle expressions.
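As a minimal sketch of how grams of increasing orders capture the negation example above, the snippet below extracts unigrams, bigrams and trigrams from an already tokenized sentence. The function name `ngrams` and the toy sentence are illustrative assumptions, not taken from the system described here.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical lemmatized sentence used only for illustration.
tokens = ["it", "be", "not", "really", "bad"]

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

Only the trigram list contains the full construct ("not", "really", "bad"), while the unigram list contains "bad" without its negation.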
Given a specific gram order n, we refer to the set of all unique n-grams in the training set as a dictionary Dn. We must note that the higher the order, the more likely features are to appear only once in the dataset, and the larger the resulting dictionary. When performing early fusion based on increasing gram orders, one must therefore consider a feature selection process in order to maintain the different dictionaries at balanced sizes. In this paper we make use of two criteria: we extract frequent n-grams, i.e. those that occur more often than a given threshold, and we select emotion specific features among these frequent n-grams according to their Shannon’s entropy measures.
The rest of this paper is organized as follows. Related work is presented in Section 1. We then describe our system: sentences are first lemmatized (to this aim we employ TreeTagger)5 then represented as binary feature vectors made of the fusion of increasing gram orders (in the vector, 1 indicates the presence of a feature, 0 indicates its absence). In Section 2 we introduce a method for filtering frequent n-grams based on Shannon’s entropy measure, leading to dictionaries specific to each emotion label and each gram order. The learning of the models is described in Section 3. The decision process is implemented as a 2-step algorithm: a neutral vs. emotion classifier is applied to the pre-processed sentences, and sentences recognized as bearing emotions are further run through M different classifiers, one for each emotion (we adopt the classical one vs. all strategy). Finally, we present the results obtained on the testing set composed of 300 notes in Section 4. Conclusion and perspectives of this work are given in Section 5.
Internet user reviews have been extensively studied in the task of opinion mining. Authors tackle the task of sentiment analysis as a binary classification task (positive vs. negative opinions). An early work shows that learning binary vectors of unigrams with linear SVMs produces the most accurate classifiers6. In their experiments, the authors find that adding bigrams for representing texts leads to a drop in performance. A second study7 shows that concatenating bigrams and trigrams to the unigrams vector of representation does improve performance on the condition that the number of unigrams, bigrams and trigrams are maintained at balanced sizes. The authors make use of the weighted likelihood ratio in order to select the best k features from each bigrams and trigrams dictionary. By concatenating the filtered bigrams and trigrams to the original unigrams the authors achieve results competing with state of the art methods in opinion mining. Another study8 shows that on very large datasets, making use of n-grams up to n = 6 while keeping dictionaries at balanced sizes does improve performance.
In this paper, we propose to transpose this approach by studying its efficiency in the task of emotion mining. Emotions are more complex, and their expressions in text are more subtle, than opinions.9 As argued by psychologists,10 emotions can be divided into positive and negative emotions; emotion mining may then be regarded as a refinement of opinion mining.
In our setting, sentences are represented as binary feature vectors made of the relevant n-grams of the training set. In this section, we describe a method for extracting and filtering n-grams based on both their frequency in the training set and their discriminative value for each emotion.
The total number of orders employed in the proposed approach is limited by one factor: above a given n value, high orders can become less useful, depending on the dataset. Indeed, an n-gram representation with high n may end up using full sentences as features for describing the dataset. The resulting features then suffer from a lack of representativeness in the whole dataset. An extreme scenario would be a dictionary whose entries correspond to every unique sentence in the dataset. In the experiments described in Section 4, where we compute n-gram representations up to trigrams, we observe that from n = 3, performance is no longer improved in a cross-validation setting.
As the size of the dictionary drastically increases with the gram order, we remove every n-gram occurring rarely in the whole dataset. The initial 3 dictionaries D1, D2 and D3, composed respectively of all unique unigrams, bigrams and trigrams, are therefore cleaned so as to keep only entries occurring in 3 sentences or more in the training set.
The cleaned dictionaries still contain many entries. Many of them correspond to noise (for example the unigrams “the”, “a”, or “and”) and many are simply not good at discriminating the sentences over the emotion labels. To deal with noisy features one usually employs a “stop word list” whose role is to clean common words out of the dictionaries. While this approach is well suited to unigram representations, it does not cope with grams of higher orders: defining stop lists for n-grams with n > 1 is far from intuitive.
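The cleaning step can be sketched as follows, assuming that sentences are already lemmatized and tokenized and that an n-gram is counted once per sentence. The helper names and the toy sentences are hypothetical; only the threshold of 3 sentences comes from the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(sentences, n, min_sentences=3):
    """Keep the n-grams that occur in at least `min_sentences` sentences."""
    counts = Counter()
    for tokens in sentences:
        # set() ensures each n-gram is counted at most once per sentence
        counts.update(set(ngrams(tokens, n)))
    return {gram for gram, count in counts.items() if count >= min_sentences}


# Toy training set, for illustration only.
sentences = [
    ["i", "love", "you"],
    ["i", "love", "him"],
    ["we", "love", "you"],
    ["i", "be", "sorry"],
]

d1 = build_dictionary(sentences, 1)
d2 = build_dictionary(sentences, 2)
print(sorted(d1), sorted(d2))
```

On this toy data no bigram reaches the threshold, which illustrates why dictionaries of higher orders shrink faster than the unigram dictionary once rare entries are removed.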
Instead, we make use of an information measure: while weighted log-likelihood7 ratios or χ² scores8 have been studied in this context, we propose a method based on Shannon’s entropy measure to filter grams of any order while keeping emotion discriminative features. Formally, let p be the frequency of occurrence of feature f in sentences labeled with emotion e. Shannon’s entropy measures f’s ambiguity with respect to e as:

$$H_e(f) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
It can be observed that H_e reaches its maximum if f is uniformly distributed, i.e. p = (1 – p) = 0.5, in which case f contributes equally to e and to the other emotions. It reaches its minimum if f is unambiguous, i.e. p is close to 0 or 1, in which case f contributes specifically either to the emotion e or to the other emotions.
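The behaviour described above can be checked numerically with a small sketch. It assumes the standard binary form of Shannon's entropy with a base-2 logarithm, H = -p log2(p) - (1 - p) log2(1 - p), and it assumes that p is estimated as the fraction of the sentences containing f that are labeled with e. Both the estimate and the function names are illustrative assumptions.

```python
import math


def binary_entropy(p):
    """Binary Shannon entropy in bits; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def feature_entropy(count_with_emotion, count_total):
    """Entropy of a feature, with p estimated as count_with_emotion / count_total.

    This estimate of p is an assumption made for the sketch.
    """
    return binary_entropy(count_with_emotion / count_total)


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```

The measure is symmetric around p = 0.5, so a feature seen almost only with e and a feature seen almost never with e receive the same low entropy.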
For each of the 3 cleaned dictionaries, we propose to build one new dictionary Dn(e) per emotion label e. Taking the neutral label into account, the resulting 3(M + 1) new dictionaries are made of the features whose Shannon’s entropy measure is higher than a threshold εn(e). Each of them is specialized for one emotion label and one gram order.
We manually estimate εn(e), based on the performance of the corresponding classifiers as described in Section 4. In our experiments, we find that, depending on the dictionary, threshold values between 0.8 and 1 give the best results. It must be noted that dictionaries based on unigrams and on rare emotion labels are associated with threshold values closer to 1. These results are compatible with the intuition that unigrams are less specific than grams of higher orders. They also suggest that rare emotion labels do not possess a specific set of features.
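Given a sentence s, we apply the early fusion strategy: we compute M + 1 different binary feature vectors, one per label e (the M emotion labels plus the neutral label). Each of them is a representation of s specific to one label, obtained by concatenating, over the gram orders n, the presence indicators of the entries of Dn(e) in s.
As presented in Section 3, each classifier, for emotion e, is trained on its associated representation.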
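A minimal sketch of early fusion at the vector level is given below: for one label, the binary presence vectors computed over each order's dictionary are concatenated into a single representation. The dictionaries, their ordering and the function names are hypothetical and chosen only for illustration.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fused_vector(tokens, dictionaries):
    """Concatenate binary presence vectors, one block per gram order.

    `dictionaries` is a list of (n, entries) pairs, where `entries` is an
    ordered list of n-grams so that feature positions are stable.
    """
    vector = []
    for n, entries in dictionaries:
        present = set(ngrams(tokens, n))
        vector.extend(1 if gram in present else 0 for gram in entries)
    return vector


# Hypothetical dictionaries for a single emotion label.
d1 = [("bad",), ("good",), ("not",)]
d2 = [("not", "bad"), ("very", "good")]

x = fused_vector(["not", "bad", "at", "all"], [(1, d1), (2, d2)])
print(x)  # [1, 0, 1, 1, 0]
```

The first three positions come from the unigram dictionary and the last two from the bigram dictionary, so a linear classifier trained on this vector can weight "bad" and "not bad" independently.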
As illustrated in Figure 1, the classification of new sentences is viewed as a 2-step process involving M + 1 classifiers. First, one binary classifier discriminates between neutral and emotion bearing sentences. Sentences bearing emotions are then further processed: adopting the classical one vs. all strategy, M classifiers each discriminate one emotion label against all other emotion labels. Finally, the classifier holding the highest confidence wins over the others (confidence is measured as the distance to the separating hyperplane). It must be noted that the proposed system only produces one emotion label per sentence even though the training set contains multi-labeled sentences.a
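The 2-step decision can be sketched with plain linear models, each stored as a (weights, bias) pair. The sketch assumes that the neutral class is the positive side of the first classifier and that confidence is the signed distance to the hyperplane. The model values, label names and function names are hypothetical.

```python
import math


def signed_distance(weights, bias, x):
    """Signed distance from x to the hyperplane w.x + b = 0 (w must be non-zero)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return score / norm


def classify(vectors, neutral_model, emotion_models):
    """Return None for a neutral sentence, else the most confident emotion.

    `vectors` maps each label (including "neutral") to the representation
    of the sentence that is specific to that label.
    """
    weights, bias = neutral_model
    if signed_distance(weights, bias, vectors["neutral"]) > 0:
        return None  # step 1: the sentence is judged neutral
    # step 2: one vs. all classifiers, the highest confidence wins
    return max(
        emotion_models,
        key=lambda e: signed_distance(*emotion_models[e], vectors[e]),
    )


# Toy models, for illustration only.
neutral_model = ([-1.0, -1.0], 0.5)
emotion_models = {"love": ([2.0, 0.0], -1.0), "anger": ([0.0, 2.0], -1.0)}

x1 = {"neutral": [1, 0], "love": [1, 0], "anger": [1, 0]}
x2 = {"neutral": [0, 0], "love": [0, 0], "anger": [0, 0]}
print(classify(x1, neutral_model, emotion_models))
print(classify(x2, neutral_model, emotion_models))
```

Because the maximum is taken over all emotion classifiers, exactly one label is returned for every sentence that passes the first step, even when no emotion classifier has a positive score.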
We use linear SVMs to learn the classifiers, employing the LIBLINEAR implementation that solves the L1 regularized SVMs problem in the dual.11 Linear SVMs compute a separating hyperplane based on the scalar product similarity function; they have been shown to be state of the art in traditional text classification and to perform well on the related task of opinion mining. In the neutral vs. emotion setting, sentences associated with no emotion are the positive examples. In the M other settings, sentences with no emotion are removed from the training set and the target emotion label is the positive class, all the other emotion labels forming the negative class. The soft margin parameter C, which influences the trade-off between generalization and accuracy, is tuned by grid search: the consecutive powers of 2 are considered from 0 to 10. Depending on the frequency of the positive class in the training set, a 10-fold cross-validation or a 3-fold cross-validation (for infrequent classes) is performed. Also, to deal with imbalanced classes, different costs are introduced for the two classes by weighting the supplied C parameter with the corresponding class frequency. In the end, the M + 1 classifiers holding the best averaged F1 score are re-trained on the whole dataset in order to produce the final classifiers.
In this section we first present and discuss the results obtained by the best performing individual classifiers: we first consider each gram order independently, then we consider their fusion. Finally, we give the results achieved by the final system on the testing set composed of 300 notes.
Tables 5–7 display the averaged F1 score, precision and recall for the best classifiers (maximizing the F1 score), trained separately on each emotion on respectively unigram, bigram and trigram representations. We must note that in the final system, as a 2-step process is performed, the performances on the emotion labels must be bounded by the performance of the “no emotion” classifier.
We observe that, trained separately, grams of lower orders perform better than grams of higher orders. This matches our intuition that grams of high orders are more specific and that representations relying solely on them do not provide enough coverage. Moreover, precision tends to increase on bigrams while recall tends to decrease. However, the gain in precision does not allow the classifiers based on grams of higher orders to discriminate correctly between positive and negative examples. This is especially noticeable for the 3 rarest emotions: “Pride” (15 sentences), “Abuse” (9 sentences) and “Forgiveness” (6 sentences). Due to the extreme rarity of these labels in the training set, and in spite of the weighting strategy described in Section 3, the bigram and trigram representations alone cannot be exploited to learn an effective classifier (N/A’s in the tables indicate that the SVMs learned a majority vote classifier). Nevertheless, we notice that in some cases grams of high orders provide the best description: for example, trigrams provide a far better representation than unigrams and bigrams for the emotion “Sorrow” and, to a lesser extent, for the emotion “Hopelessness”.
Despite a general drop in performance over infrequent emotions, we notice that some emotions seem naturally inclined to separate from the others: for example, the emotions “Love” and “Thankfulness” do not occur much in the training set, yet they achieve good performance on both unigrams and bigrams. This suggests that for some emotions there may exist a specific vocabulary which is easier to identify.
While bigrams capture enriched features at the expense of coverage (higher precision and lower recall), unigrams capture simple and generic features (lower precision and higher recall). The combination of the two representations may therefore lead to a better compromise between precision and recall. Because the success of the complex constructs captured by trigrams proves to be dependent on the emotion, we ran further experiments (not reported in this paper) in which trigrams were added to the combination of unigrams and bigrams at the vector level. We observed that on average this did not significantly improve the performance of the classifiers. We therefore decided to only consider the combination of unigrams and bigrams. Table 2 displays the averaged F1 score, precision and recall for the best classifiers trained on the fusion of unigrams and bigrams. Again, in the final system, the performances on the emotion labels must be bounded by the performance of the “no emotion” classifier.
On average, the combination of unigrams and bigrams performs better than each representation taken separately. As expected, the resulting classifiers reach a better compromise between precision and recall. A good example is the emotion “Love”, for which the fusion of unigrams and bigrams improves both precision and recall, leading to a better F1 score. We must note that some emotions, like “Instruction”, do not benefit from the fusion. For this particular emotion, bigrams prove less precise than unigrams, so the combination of unigrams and bigrams could not benefit from them. Generally speaking, it seems that for the fusion to perform better, the two representations need to provide different strong points, either in terms of recall or precision.
In order to gain further insight into the final individual classifiers, we examine the features with the highest weights in the SVM models. Table 3 gives the best features of the classifiers holding F1 scores higher than 0.3. In the table, we report the 7 top ranked unigrams as well as the 7 top ranked bigrams. While features like “love” and “thank” are naturally highly rated by linear SVMs, more complex patterns emerge: for instance, the unigram “.” ending a sentence combined with the unigram “too” is more relevant to the emotion Thankfulness than the two of them taken separately. We also notice that while identical unigram features can be shared between different classifiers, bigram features remain specific to each emotion.
The final system we prepare for evaluation relies on the fusion of unigrams and bigrams. Test sentences are first pre-processed, then labeled using the 2-step decision process described in Section 3.
It must be noted that the proposed system does not output multiple emotions: for emotion bearing sentences, the classifier with highest confidence wins.
As presented in Table 4, on the testing set composed of 300 notes, it obtains 0.47 on micro averaged F1 score, 0.49 on precision and 0.46 on recall. Among all systems submitted to the I2B2 challenge, the worst micro averaged F1 score is 0.30 while the best is 0.61. The average performance is 0.49 ± 0.07.
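For reference, a micro averaged F1 score pools true positives, false positives and false negatives over all labels before computing precision and recall. The sketch below assumes that each sentence is given as a set of labels and that a neutral sentence is an empty set. The function name and the toy data are illustrative and do not reproduce the challenge's evaluation script.

```python
def micro_f1(gold, predicted):
    """Return (precision, recall, F1), micro averaged over all labels.

    `gold` and `predicted` are lists of label sets, one set per sentence.
    """
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy annotations, for illustration only.
gold = [{"love"}, {"love", "guilt"}, set(), {"anger"}]
predicted = [{"love"}, {"love"}, {"anger"}, set()]

precision, recall, f1 = micro_f1(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In the second toy sentence the single predicted label leaves one gold label unmatched, which shows how a system that outputs at most one label per sentence loses recall on multi-labeled sentences.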
In this paper, we presented a system for classifying the emotional content of sentences relying on 3 characteristics: the early fusion of grams of increasing orders, a method for filtering the grams based on Shannon’s entropy and a 2-step decision process for dealing with neutral sentences. We showed that unigrams alone are not sufficient for describing expressions of emotions, which are naturally complex and subtle. By adding bigram features at the vector level, we trained classifiers that perform better on average than those trained on each representation separately. In this setting, unigrams seem to boost the recall while bigrams seem to boost the precision of the resulting classifiers. We also showed that, by modeling complex constructs, grams of higher orders like trigrams can provide a better description for discriminating some emotions. An interesting development of this work would be to investigate further types of fusion: we believe that combining low level features with external knowledge is relevant for discriminating emotions. In this setting, intermediate fusion makes it possible to combine different similarity functions, each specific to one source of information. Another perspective of this work is to study the problem of multi-labeling, for instance by considering aggregation functions other than max. Finally, since grams of high order perform better for certain emotions, it may be of interest to adopt emotion dependent representations.
This work was supported by the CAP DIGITAL project DOXA funded by DGCIS (N°. DGE 08-2-93-0888).
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
a. In the training set, 7% of the sentences are multi-labeled.