shows the test performance of different classifiers trained on the cleaned sentences. Overall, the machine learning approach using both character 8-grams and dependency relations produced a higher F-score than either the ad-hoc rule-based approach or a fusion of the ad-hoc rule-based approach and a machine learning approach. Character-based n-grams outperformed word-token n-grams, POS features, and dependency relations. Replacing names, addresses, and dates with single-character placeholders only slightly improved performance.
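The character 8-gram features mentioned above can be sketched with a few lines of code (the function name is ours; the paper does not describe its actual feature-extraction implementation):

```python
def char_ngrams(text, n=8):
    """Extract overlapping character n-grams; the best-performing
    features in these experiments were character 8-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("I regret", 8))               # ['I regret']
print(len(char_ngrams("I have no regrets", 8))) # 10 overlapping 8-grams
```

Unlike word tokens, such n-grams cross word boundaries and capture fragments of morphology and punctuation, which may explain their advantage over word-token features here.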
Performance of different classifiers trained on re-segmented training data.
The results show that the ad-hoc rule-based classifier alone was the least effective, with an F-score as low as 0.31. The reason for the low F-score is that emotions have to be classified at a very fine-grained level; classification therefore requires contextual information, which cannot be captured by matching words or patterns within the target sentence alone. For example, if “regret” occurs in the context of “I regret that”, it is an indicator for class GUILTY, but not if it appears in “I have no regrets”; “forgive” in “I forgive what you did” is an indicator for class FORGIVENESS, but is an indicator for class GUILTY in “Please forgive me.”
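A minimal sketch illustrates why context-free keyword matching fails on such examples (the rules and function name below are hypothetical, not the system's actual rule set):

```python
def rule_based_label(sentence):
    """Hypothetical keyword rules with no access to context."""
    s = sentence.lower()
    if "regret" in s:
        return "GUILTY"
    if "forgive" in s:
        return "FORGIVENESS"
    return "OTHER"

# Both sentences match "regret", but only the first expresses guilt.
print(rule_based_label("I regret that I was not there"))  # GUILTY (correct)
print(rule_based_label("I have no regrets"))              # GUILTY (wrong)
# "Please forgive me" expresses guilt, yet the keyword rule fires FORGIVENESS.
print(rule_based_label("Please forgive me"))              # FORGIVENESS (wrong)
```

Negation and speaker perspective flip the emotion class while leaving the keyword unchanged, which is exactly the information a per-sentence pattern match cannot see.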
The machine learning classifiers reported in were all trained with 16 classes including OTHER, using the re-segmented training data. For each example to be classified, a conditional probability score P(Class|Input) is returned for each class. Different thresholds for this score were examined, with higher thresholds producing better precision and lower thresholds better recall. The best F-score was achieved by setting the threshold to 0.55. As shown in , character-based n-grams outperformed the ad-hoc rule-based classifier by 32% in F-score.
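The thresholding step can be sketched as follows (function and variable names are ours; the classifier's actual output format is not specified in the text):

```python
def predict_with_threshold(scores, threshold=0.55):
    """Return the best class if its score P(Class|Input) reaches the
    threshold, otherwise abstain (return None). Raising the threshold
    trades recall for precision; 0.55 gave the best F-score here."""
    best_class = max(scores, key=scores.get)
    if scores[best_class] >= threshold:
        return best_class
    return None

scores = {"GUILTY": 0.61, "FORGIVENESS": 0.25, "OTHER": 0.14}
print(predict_with_threshold(scores))        # GUILTY
print(predict_with_threshold(scores, 0.7))   # None (abstains)
```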
Character-based n-grams and the dependency relation pairs achieved similar performance, and a fusion of both result sets yielded the best performance in our experiments, an F-score of 0.42. The highest precision, 0.57, was reached by character-based n-grams (with and without placeholders). The highest recall, 0.62, was reached by a fusion of ad-hoc rules and character-based n-grams.
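The text does not specify the exact fusion scheme; one plausible reading, consistent with fusion raising recall, is a union of the two classifiers' result sets (names and data below are illustrative only):

```python
def fuse(preds_a, preds_b):
    """Union-style fusion: a (sentence_id, label) pair is accepted
    if either classifier predicts it. This is one possible scheme,
    not necessarily the one used in the paper."""
    return preds_a | preds_b

ngram_preds = {(1, "GUILTY"), (2, "OTHER")}
dep_preds = {(2, "FORGIVENESS"), (3, "GUILTY")}
print(sorted(fuse(ngram_preds, dep_preds)))
```

A union can only add predictions, which explains why fusion tends to improve recall at some cost in precision.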
We also trained the machine learning classifiers with the original training data, which has fewer training examples but more words per example than the re-segmented training data. Surprisingly, performance was slightly better, as shown in . Because of time constraints, we could not re-segment the test data, which probably introduced a discrepancy between training and test data in the experiment using the re-segmented training data.
Performance of 8-gram character-based classifiers trained on original and re-segmented training data.
We also investigated the influence of using the OTHER label. With 2,460 examples, OTHER is the majority class, which, in the machine learning approaches, biased the classifiers toward labeling unknown examples as OTHER. To avoid this bias, we trained the classifiers using only the 15 emotion classes, without OTHER, and then assigned the label OTHER whenever the best prediction for an unknown example had a probability score lower than 0.9. The results in show that training with the OTHER label yields higher precision but lower recall, so the overall F-score is lower than that of the experiment that introduces OTHER after classification.
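The post-hoc relabeling step can be sketched as follows (function name is ours; the score dictionary is abbreviated to two of the 15 emotion classes for brevity):

```python
def relabel_other(scores, cutoff=0.9):
    """Classify with 15 emotion classes only; after classification,
    fall back to OTHER when the best score is below the cutoff
    (0.9 in these experiments)."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= cutoff else "OTHER"

print(relabel_other({"GUILTY": 0.95, "FORGIVENESS": 0.05}))  # GUILTY
print(relabel_other({"GUILTY": 0.55, "FORGIVENESS": 0.45}))  # OTHER
```

Because OTHER never competes during training, the majority-class bias disappears, and it re-enters only as a confidence-based fallback.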
Performance of 8-gram character-based classifiers with and without the OTHER class.