The table below shows a confusion matrix for our rule-based system. It was created by assigning partial credit when a sentence had multiple emotion labels in the gold standard but the system did not predict all of them. The totals were rounded to improve the readability of the table. Note that this confusion matrix does not include counts of cases where either the gold standard or the system assigned no-emotion while the other assigned an emotion. Thus, the matrix only represents cases where both the gold standard and the system assign an emotion.
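To make this counting scheme concrete, the following is a minimal sketch of how such a matrix could be built. It assumes one plausible reading of the partial-credit rule (each sentence's unit of credit is split evenly across its gold/predicted label pairs); the `pairs` representation and the weighting are illustrative assumptions, not the code actually used here.

```python
from collections import defaultdict

def confusion_matrix(pairs):
    """Build a confusion matrix with partial credit for multi-label gold
    standards. `pairs` is a list of (gold_labels, predicted_labels) tuples,
    one per sentence, where each element is a set of emotion labels."""
    matrix = defaultdict(float)
    for gold, pred in pairs:
        if not gold or not pred:
            continue  # only count cases where both sides assign an emotion
        # Assumed partial-credit scheme: the sentence's unit of credit is
        # shared across all (gold, predicted) pairs, so a sentence with two
        # gold labels and one matching prediction adds 0.5 to the diagonal.
        weight = 1.0 / (len(gold) * len(pred))
        for g in gold:
            for p in pred:
                matrix[(g, p)] += weight
    return matrix

# Hypothetical sentence: gold assigns guilt and sorrow, system predicts sorrow.
pairs = [({"guilt", "sorrow"}, {"sorrow"})]
for (g, p), credit in sorted(confusion_matrix(pairs).items()):
    print(f"gold={g:6s} predicted={p:6s} credit={credit:.2f}")
```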
Confusion matrix—manual rule-based system—excludes no-emotion.
The diagonal total represents true positives (tp) and is equal to 564. The total number of cases where an emotion is assigned by both the gold standard and the system is 793, so accuracy in this case is 71%. This suggests that the dominant problem faced by our rule-based system is handling the no-emotion case: either falsely predicting an emotion when none should be assigned (a false positive), or failing to predict an emotion when one should be assigned (a false negative). The confusion matrix shows that most errors revolve around the instructions emotion. This is not surprising, since it is the most common emotion and also perhaps one of the most general and ambiguous. It is also interesting to note that, for example, guilt and sorrow appear to be frequently confused, as do instructions and information.
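The accuracy figure follows directly from the matrix: the diagonal mass divided by the total mass. Continuing the hypothetical sketch above:

```python
def diagonal_accuracy(matrix):
    """Accuracy over the cases the matrix covers: true positives on the
    diagonal divided by the total mass of the matrix."""
    tp = sum(c for (g, p), c in matrix.items() if g == p)
    return tp / sum(matrix.values())

# Reproducing the arithmetic reported above for the first matrix:
print(564 / 793)  # ~0.711, i.e. the 71% accuracy quoted in the text
```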
The table below shows a confusion matrix where the no-emotion case is included. It was created by explicitly adding a no-emotion label to both the gold standard data and the system output. Here the diagonal total for true positives rises to 1246, but the total number of labels assigned is now 2283, so accuracy falls to approximately 54.5%.
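One way to realize the explicit no-emotion label described here, again as a sketch reusing the hypothetical `pairs` representation from above:

```python
NO_EMOTION = "no-emotion"

def add_no_emotion(pairs):
    """Replace empty label sets with an explicit no-emotion label, so that
    agreements and disagreements about 'no emotion' enter the matrix."""
    return [(gold or {NO_EMOTION}, pred or {NO_EMOTION}) for gold, pred in pairs]

# With no-emotion included, the same diagonal-over-total computation gives:
print(1246 / 2283)  # ~0.546, the ~54.5% accuracy reported above
```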
Confusion matrix—manual rule-based system—includes no-emotion.
In general, our rule-based system did reasonably well at identifying no-emotion sentences. Of the 988 sentences in the test data with no-emotion assigned, it correctly identified 709, for an accuracy of 71% on that class. The effect of this can be seen when scoring the rule-based system against the gold standard using the official scoring program (with no-emotion tags included). In that case the results were F1 = 0.54175, Precision = 0.53126, Recall = 0.55265, and N = 2351.
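The official scoring program is not reproduced here, but a standard micro-averaged precision/recall/F1 computation behaves consistently with the numbers reported. The function below is an assumed stand-in for the scorer, not its actual code:

```python
def micro_prf(pairs):
    """Micro-averaged precision, recall, and F1 over label assignments.
    A predicted label counts as a true positive only if the gold standard
    assigns the same label to that sentence (an assumed stand-in for the
    official scorer, not its actual logic)."""
    tp = fp = fn = 0
    for gold, pred in pairs:
        tp += len(gold & pred)   # labels both sides agree on
        fp += len(pred - gold)   # predicted labels the gold standard lacks
        fn += len(gold - pred)   # gold labels the system missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Sanity check: the reported precision and recall imply the reported F1.
p, r = 0.53126, 0.55265
print(2 * p * r / (p + r))  # ~0.54175
```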
However, what the confusion matrix shows very clearly is a significant number of cases where no-emotion is not assigned when it should be, and where it is assigned when an emotion is present. This is clearly the dominant factor in determining the rule-based system's level of performance.
Note that while no-emotion was not included as a tag in the gold standard data, it exists implicitly and contributes to the error rate in cases of disagreement. The scoring software assigns a false positive when an emotion is assigned where one should not be, and a false negative when no-emotion is assigned where an emotion should be. The confusion matrices above show that these cases dominate the totals of false positives and false negatives, far more than errors caused by assigning the wrong emotion (which the scoring software also counts as false positives).
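These error-assignment rules can be summarized in a small decision function; this is a paraphrase of the rules as described above, not the scoring software itself:

```python
def error_type(gold, pred, no_emotion="no-emotion"):
    """Classify a single (gold, predicted) label pair according to the
    scoring rules described above (a sketch of those rules, not the scorer)."""
    if gold == no_emotion and pred != no_emotion:
        return "false positive"  # an emotion assigned when one should not be
    if gold != no_emotion and pred == no_emotion:
        return "false negative"  # no-emotion assigned when one should be
    if gold != pred:
        return "false positive"  # wrong emotion, counted as a false positive
    return "true positive"

print(error_type("no-emotion", "guilt"))   # false positive
print(error_type("sorrow", "no-emotion"))  # false negative
```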