In this section, we discuss the progress made by re-annotating the training data and evaluate the calibration and thresholding phases. We conclude this section by discussing the results obtained in the official test runs. We report on micro F1 scores as well as precision and recall, as these are the main evaluation metrics used in Track 2 of the 2011 Medical NLP Challenge.
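Micro-averaged F1 pools true positives, predicted labels, and gold labels over all sentences and labels before computing precision and recall, so frequent classes weigh more heavily than rare ones. A minimal sketch of the metric (the label sets in the docstring are illustrative):

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over multi-label
    annotations. `gold` and `pred` are parallel lists of label sets,
    one set per sentence; counts are pooled over all sentences."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))      # correct label assignments
    n_pred = sum(len(p) for p in pred)                    # all predicted labels
    n_gold = sum(len(g) for g in gold)                    # all gold labels
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, predicting `{"guilt"}` where the gold annotation is `{"guilt", "blame"}` contributes one true positive, one predicted label, and two gold labels to the pooled counts.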
As a sanity check, we tested whether our decision to re-annotate complex sentences in the training set—i.e., sentences annotated with multiple emotions—was a good one. shows classification performance before and after re-annotation in an emotion detection experiment with token unigrams, under the emotion/no-emotion classification scheme. Note that these scores were achieved with an optimized SVM classifier (see below for an account of our parameter optimization results). The results show that using the re-annotated data increased accuracy as well as micro F1, so we used the re-annotated training set for all subsequent steps in the construction of our classification systems.
The effect of re-annotating data on emotion detection performance on training data (optimized scores).
Evaluation of calibration phase
During the calibration phase, our aim was to find optimal feature types and to select the best performing parameters for our SVM classifier. Note that during this phase, we accept the label returned by LibSVM—i.e., the one with highest class probability estimate. Only during the second phase will we apply thresholding on the probability estimates and—potentially—assign multiple labels.
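The calibration-phase decision rule is simply the argmax over LibSVM's class probability estimates; a one-line sketch:

```python
def single_label(probs):
    """Calibration-phase decision: accept the single label with the
    highest class probability estimate (no thresholding yet).
    `probs` maps each label to its probability estimate."""
    return max(probs, key=probs.get)
```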
As described in the Methodology section, we experimented with token unigrams, but also consulted lists of emotion-related keywords (+emo) and tried including context features (+context) in order to check whether these high-level feature types were helpful for emotion detection. shows micro F1 scores with each of these feature types. Although lists of emotion-related keywords are heavily used in emotion detection (see, for instance, Kim et al., 2010 and Ghazi et al., 2010),12,17 adding them to the token unigrams did not improve classification performance. Including the emotions occurring in the context of the current sentence did not lead to satisfying results either, even though there was some evidence of clusters of same-type emotions (cf. Methodology section).
Classification performance (in micro F1) on training data by feature type, before and after optimization.
For each of the feature types mentioned above, we performed a grid search on LibSVM’s C (cost) and G (gamma parameter in the radial basis function) parameters in order to optimize the classifier’s performance. The overall best scores were obtained after optimization on token unigrams, as shown in . We thus decided to continue working with token unigrams in the thresholding phase.
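Such a grid search can be sketched with scikit-learn, whose `SVC` wraps LibSVM; the helper name, the grids, and the cross-validation setup below are illustrative assumptions, not the exact configuration used in the experiments:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def best_c_gamma(X, y, c_grid, g_grid, cv=3):
    """Exhaustive grid search over the SVM cost (C) and RBF gamma (G)
    parameters, scoring each combination with micro-averaged F1 under
    cross-validation. Returns the best parameter pair and its score."""
    search = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": c_grid, "gamma": g_grid},
        scoring="f1_micro",
        cv=cv,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

With grids that include C = 8 and G = 8, the search would recover the optimum reported above for token unigrams.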
Visualized in is the grid search applied to token unigrams. While varying the C (on the y-axis) and G (on the x-axis) parameters, we checked whether micro F1 scores increased. The darker colored regions indicate low performance, while the lighter colored regions indicate higher performance. The best performing combination is C = 8 and G = 8, shown on the right-hand side of .
Visualization of the effect of C and G parameter tuning on micro F1, using token unigrams.
Evaluation of thresholding phase
After having determined the best learner parameters and features for single-label classification, we experimented with ways to assign multiple labels to a single sentence. LibSVM provides probability estimates for each of the classes present in training, so instead of simply assigning the most probable emotion label to the test sentence, we can use probability thresholds: any emotion with a probability that exceeds a given threshold will be assigned to the sentence. Where these thresholds lie, however, is dependent on the classification scheme (cf. Methodology section).
The emotion/no-emotion scheme requires two thresholds: one for the emotion classes, and one for the “no emotion” class. shows a surface plot of the micro F1 score for different values of emotion and no-emotion thresholds. The optimal values were 0.19 for the emotion threshold and 0.80 for the no-emotion threshold.
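One plausible reading of this two-threshold scheme is sketched below; the fallback to the single most probable class when nothing exceeds its threshold is our assumption and is not specified above:

```python
def assign_labels(probs, emo_threshold=0.19, no_emo_threshold=0.80):
    """Turn class probability estimates into a multi-label decision
    under the emotion/no-emotion scheme: every emotion whose estimate
    exceeds the emotion threshold is assigned; "no emotion" is assigned
    only when no emotion fires and its own, stricter threshold is met.
    `probs` maps each label (including "no emotion") to its estimate."""
    labels = {lab for lab, p in probs.items()
              if lab != "no emotion" and p > emo_threshold}
    if not labels and probs.get("no emotion", 0.0) > no_emo_threshold:
        labels.add("no emotion")
    if not labels:
        # assumption: fall back to the single most probable class
        labels.add(max(probs, key=probs.get))
    return labels
```

The asymmetry of the two optimal thresholds (0.19 vs. 0.80) means the classifier assigns emotions liberally but must be quite confident before declaring a sentence emotion-free.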
Effect of emotion and no-emotion probability thresholds on training performance (micro F1).
The emotion-only experiment—where the classifier was trained on emotion-carrying sentences only—requires only one threshold. shows the effect of different emotion probability thresholds on precision, recall, and micro F1 scores using this classification scheme. The optimal value for the emotion threshold was 0.49.
Effect of emotion probability thresholds on training performance.
shows precision, recall, and micro F1 scores for the two classification schemes with thresholds, as well as for a naïve variation on the emotion/no-emotion experiment where only the most probable label was assigned (no thresholding). The emotion/no-emotion system trained on both emotion-carrying and “no emotion” sentences performs best, with a micro F1 score of 0.4954.
Multi-label classification performance (in micro F1) on training data, using token unigrams.
Evaluation on test data
For the official test run, we submitted the three systems developed during the second phase. The test set contained 300 suicide notes, annotated in the same fashion as the original training set, but previously unseen by our systems. shows performance of the three systems in terms of precision, recall, and micro F1 score. The n value in the last column indicates the number of emotions predicted by the system.
Multi-label classification performance on test data.
The results shown in are in line with results during development, even to the point where they improve over results obtained during training (cf. ). It is clear that the second system—trained only on the fifteen emotion classes, enhanced with a single emotion threshold—detects more emotions than the last one and considerably more than our first, naïve system. In terms of micro F1, however, the last system—trained on emotions as well as “no emotion” cases—clearly outperformed the other two. This result is higher than the mean performance (0.4875) of the groups participating in the challenge, and only just below the median of 0.5027. The top scoring team’s result was 0.6139.
All of the systems shown in started from the training data we re-annotated. In hindsight, had we continued to work with the original data, in which the same sentence can carry multiple emotion labels, performance on the test data would have been better than that of the systems we actually submitted: the micro F1 score of 0.5230 (precision = 0.5373; recall = 0.5094; n = 1206) achieved without re-annotation is an improvement over our best system. Our decision to continue working with re-annotated sentences was supported by the results during development, but worked to our disadvantage during the test phase.
Error analysis on the multi-label predictions made by the emotion/no-emotion system in reveals that multiple labels were predicted for 45 sentences (15% of all sentences in the test set). 80% of these decisions were partially correct, meaning that one of the predicted labels was the same as (one of) the gold standard labels. The system achieved a correct multi-label decision in 16% of these cases. The emotion-only system hardly produced any multi-label annotations.
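The breakdown of multi-label decisions can be sketched as follows; since it is ambiguous above whether "partially correct" subsumes fully correct decisions, this sketch counts the two categories separately:

```python
def multilabel_breakdown(gold, pred):
    """Categorize each multi-label prediction (two or more labels) as
    fully correct (predicted set equals the gold set), partially
    correct (sets overlap but differ), or wrong (no overlap).
    `gold` and `pred` are parallel lists of label sets."""
    full = partial = wrong = 0
    for g, p in zip(gold, pred):
        if len(p) < 2:
            continue  # only inspect genuine multi-label decisions
        if p == g:
            full += 1
        elif p & g:
            partial += 1
        else:
            wrong += 1
    return full, partial, wrong
```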
Some of the single labels most confused by the emotion/no-emotion system are “instructions” and “information”, and “instructions” and a “no emotion” prediction. The feeling of “hopelessness” is often given the “no emotion” label, which is unfortunate considering the possible applications of emotion detection, more specifically the prevention of (repeated) suicide attempts. The other two systems, however, perform worse in this respect: they fail to identify any emotion in the test sentences annotated with the “hopelessness” label.