We experimentally determined the best-performing combination of features for each of the 15 emotions. For the majority of the emotions, lemma and trigram bag-of-words features proved indispensable. For 6 emotions, these features alone yield the best results, while for another 7 emotions, classifiers achieve the best scores with the addition of subjectivity clues. Only 2 emotions benefit from WordNet and SentiWordNet information.
Data sparsity is a problem for some emotions, and it has a direct influence on classifier performance, since the classifiers use supervised learning and rely on positive examples of a class to learn from. All the best classifiers for emotions with an incidence of less than 20% (average number of annotations per 1000 sentences in the training data) have an F-score below 21.0, whereas all the best classifiers for emotions with an incidence above 20% score above 28.0. Emotions with an incidence of over 40% all score above 40.0, along with thankfulness, which proves easily learnable despite a low incidence of 20.3%. It is likely that classifier performance for the low-incidence emotions would rise considerably if more training data were obtained, without the need for new features.
To produce the final system output, the output of each emotion’s classifier is combined into a single output file. Global system performance is measured as micro-averaged F-score, which is computed globally over all annotations; a macro-averaged F-score, by contrast, would first be computed per emotion and then averaged over the 15 emotions.
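The difference between the two averages can be made concrete with a small sketch in pure Python. The per-emotion counts below are hypothetical and chosen only to illustrate the contrast; the emotion labels are taken from the task's label set.

```python
from collections import namedtuple

# Per-emotion prediction counts (hypothetical numbers for illustration).
Counts = namedtuple("Counts", "tp fp fn")

def f1(tp, fp, fn):
    """F-score as the harmonic mean of precision and recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(per_emotion):
    """Pool the counts over all annotations, then compute one F-score."""
    tp = sum(c.tp for c in per_emotion.values())
    fp = sum(c.fp for c in per_emotion.values())
    fn = sum(c.fn for c in per_emotion.values())
    return f1(tp, fp, fn)

def macro_f1(per_emotion):
    """Compute an F-score per emotion first, then average over emotions."""
    scores = [f1(c.tp, c.fp, c.fn) for c in per_emotion.values()]
    return sum(scores) / len(scores)

counts = {
    "love":         Counts(tp=120, fp=40, fn=30),  # frequent, well predicted
    "thankfulness": Counts(tp=50,  fp=20, fn=15),
    "forgiveness":  Counts(tp=2,   fp=1,  fn=8),   # rare, poorly predicted
}

print(f"micro: {micro_f1(counts):.3f}")  # pooled over annotations
print(f"macro: {macro_f1(counts):.3f}")  # averaged over emotions
```

With these counts the micro-average (about 0.75) exceeds the macro-average (about 0.61), because the rare, poorly predicted emotion contributes few annotations to the pooled counts but a full third of the macro-average.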
Because micro-averaged F-score gives equal weight to each annotation, good performance on majority classes matters most: they contribute more annotations and therefore influence the global F-score more. Conversely, rare emotions, even when predicted correctly, make only a small positive contribution to the overall F-score; but if their predictions are noisy (high recall at low precision), minority classes can have a substantial negative influence on the global F-score.
For this reason, we experimented with leaving out annotations of rare emotions, on which our classifiers performed poorly, and determined which pruned set of emotions yielded the best overall result on the training data set. The best configuration left out emotions with a frequency of less than 2% in the training set, resulting in output containing only 7 emotions: blame, guilt, hopelessness, information, instructions, love and thankfulness. The test data was then processed with classifiers trained on all the training data, using the appropriate feature set and threshold per emotion. Two versions of the output were submitted: one containing all emotions, the other containing only the pruned set of emotions.
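The pruning procedure can be sketched as follows. The annotations, frequencies and candidate cutoffs below are invented for illustration; only the selection logic (drop emotions below a frequency cutoff, keep the cutoff that maximises micro-averaged F on the training data) follows the text.

```python
def micro_f1(preds, gold):
    """Micro-averaged F over sets of (sentence_id, emotion) annotations."""
    tp = len(preds & gold)
    fp = len(preds - gold)
    fn = len(gold - preds)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def prune(preds, gold, frequencies, cutoffs):
    """Try each frequency cutoff on the training data and keep the
    emotion set that maximises the micro-averaged F-score."""
    best = None
    for cutoff in cutoffs:
        kept = {e for e, freq in frequencies.items() if freq >= cutoff}
        filtered = {(sid, e) for sid, e in preds if e in kept}
        score = micro_f1(filtered, gold)
        if best is None or score > best[0]:
            best = (score, cutoff, kept)
    return best

# Toy training-set annotations: a noisy rare emotion ("forgiveness",
# with hypothetical counts) adds false positives but no true positives.
gold = {(1, "love"), (2, "guilt"), (3, "love"), (4, "thankfulness")}
preds = gold | {(1, "forgiveness"), (2, "forgiveness"), (3, "forgiveness")}
frequencies = {"love": 25.0, "guilt": 10.0, "thankfulness": 20.3,
               "forgiveness": 0.5}  # % frequency in the training set

score, cutoff, kept = prune(preds, gold, frequencies, [0.0, 2.0])
print(f"best cutoff: {cutoff}%, micro-F: {score:.3f}, kept: {sorted(kept)}")
```

In this toy setting, raising the cutoff to 2% removes the noisy rare emotion and its false positives, so the pruned output scores higher, which mirrors why pruning helped the submitted system.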
The table below presents the overall F-scores for all emotions and for the best-performing pruned set of emotions, both on the training data and on the test data. Pruning the output increased the micro-averaged F-score by 1.93 percentage points on the training data, and by 2.12 percentage points on the test data.
Micro-averaged F-scores on the training and test set for all emotions, and the 7 best-performing emotions (pruned).
The scores on the test data are 2.08 and 2.27 percentage points higher than those on the training data, for all emotions and for the pruned set, respectively. These increases indicate that the classifiers did not overfit the training data.