This paper describes the Duluth systems that participated in the Sentiment Analysis track of the i2b2/VA/Cincinnati Children’s 2011 Challenge. The top Duluth system was a rule-based approach derived through manual corpus analysis and the use of measures of association to identify significant ngrams. This performed in the median range of systems, attaining an F-measure of 0.45. The second system was automatically derived from the most frequent bigrams unique to one or two emotions. It achieved an F-measure of 0.36. The third system was the union of the first two, and reached an F-measure of 0.44.
The task in the Sentiment Analysis track of the i2b2/VA/Cincinnati Children’s 2011 Challenge1 was to assign zero or more emotions to each sentence found in a collection of suicide notes. There were 15 possible emotions to choose from, as well as the option of not assigning an emotion. The emotions were (in frequency order): instructions, hopelessness, love, information, guilt, blame, thankfulness, anger, sorrow, hopefulness, happiness/peacefulness, fear, pride, abuse, and forgiveness.
The decision to pursue rule-based methods was made after observing the following about the training data.
Given these circumstances, it seemed that machine learning approaches would do reasonably well on the more frequent emotions, but would probably be unable to handle less common emotions. Also, in inspecting the training data we noticed that there are certain discriminating phrases that are fairly easy to pick out (e.g., can't go on, associated with hopelessness). As a preliminary experiment we decided to manually build a few simple rules to see how well those would perform. Initially these rules focused on instructions and hopelessness, and were quickly able to reach F-measures in the 30s on the training data. As more rules were added performance continued to improve, so development continued until reaching an F1 of 49% on the training data, which we were unable to surpass. The resulting system was applied to the test data and achieved an F1 of 45%. In some sense, then, the creation of a rule-based system can be seen as an exercise in evaluating our intuitions about the data.
This paper continues with a discussion of our rule-based method, and then describes our simple attempt to automate the creation of that rule-based system. We include an extensive analysis of the results obtained by these two systems, plus the effect of combining them in various ways.
The rule-based system consisted of a set of regular expressions associated with each emotion. The regular expressions represented the occurrence of words or short phrases (bigrams or trigrams) in the sentence. Each sentence was matched against these regular expressions in the frequency order of the emotions, with the most frequent emotion checked first. If a sentence matched a regular expression for an emotion, that emotion was assigned and the next emotion was considered. A limit of two emotions per sentence was imposed based on the observation that in the training data most sentences have zero or one emotion assigned, and very few have more than two.
In effect this rule-based system acts much like a series of decision lists. For example, if any of the regular expressions associated with the most frequent emotion instructions occurred, then the sentence would be assigned that emotion, and then go on to check if any of the regular expressions associated with hopelessness (the second most frequent emotion) occurred. We keep doing this until all the regular expressions have been considered or the limit on the number of emotions has been reached. If no regular expression matches the sentence, then no-emotion is assigned.
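This decision-list style of matching can be sketched as follows. The rules shown here are illustrative placeholders only; the actual regular expressions used in our system are not reproduced.

```python
import re

# Hypothetical placeholder rules, listed in training-data frequency order
# (most frequent emotion first); not the actual rules from our system.
RULES = [
    ("instructions", [re.compile(r"\bgive\b.*\bto\b"), re.compile(r"\bnotify\b")]),
    ("hopelessness", [re.compile(r"\bcan'?t go on\b")]),
    ("love",         [re.compile(r"\blove you\b")]),
]

MAX_EMOTIONS = 2  # most training sentences had 0 or 1 labels; very few had more than 2

def assign_emotions(sentence):
    """Check each emotion's regexes in frequency order, like a decision list."""
    labels = []
    text = sentence.lower()
    for emotion, patterns in RULES:
        if any(p.search(text) for p in patterns):
            labels.append(emotion)
            if len(labels) == MAX_EMOTIONS:
                break
    return labels  # an empty list corresponds to no-emotion
```

Unlike a strict decision list, matching does not stop at the first hit; it continues down the frequency-ordered emotions until the two-emotion limit is reached or the rules are exhausted.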
In constructing the rules, the training examples were manually studied, and phrases that seemed associated with particular emotions were noted both via intuition and via measures of association from the Ngram Statistics Package.3 We manually studied the most associated ngrams according to multiple measures of association, and then determined which of these appeared (according to our intuition) to be unique to or particularly indicative of a specific emotion. We focused on single words (based on frequency), bigrams, and trigrams. For bigrams and trigrams we allowed intervening words. After adding a few rules to the system, we would evaluate it on the training data to see whether those rules helped or degraded performance. We iteratively constructed our rule-based system in this fashion over a period of five days (approximately 40 hours were spent analyzing the training data and developing the system).
Below we summarize the key categories of regular expressions used to identify each emotion. The emotions are listed in their frequency order as found in the training data, which is the order in which they are considered by the system. Note that there are more categories of regular expressions in the more frequent emotions since there was more data from which to draw them. While we added some regular expressions based on our intuition and some simple expansions of synonyms based on WordNet, those appeared to have very minimal impact. Note that these are not the actual phrases in the regular expressions, but rather a gisting or generalization of them. The program that implements these rules is available from the author.a
Each emotion had a relatively small number of regular expressions associated with it: the more common emotions had at most approximately 30, whereas the less frequent ones had just a few.
After developing the rule-based system described above, we observed that the resulting regular expressions generally represented multi-word expressions (bigrams or trigrams) that appeared to be relatively unique to a particular emotion. We therefore developed our own very simple method of supervision that approximated the process we had followed manually. The training data was divided by emotion, and the most frequent bigrams that were unique to one or two emotions were selected as features. These bigrams allowed for a single intervening word, meaning that a bigram could occur in a window of size three. We found that larger window sizes degraded precision rather dramatically. Maintaining a window size of two gave very good results on the training data (approximately 60% F1), but this was very clearly over-fitted. When we increased the window size to three, precision on the training data fell to approximately 52%, but we felt these rules would generalize more readily since they allowed for more flexible formulations of the features (the words in a bigram could occur together, or with one intervening word). Note that bigrams made up entirely of stop words, or that occurred only once, were automatically excluded from being features.
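The feature selection just described can be approximated by the following sketch. The stop list is an illustrative assumption, and for simplicity only adjacent word pairs are counted; the exclusion of stop-word-only and frequency-1 bigrams, and the restriction to bigrams seen with at most two emotions, follow the description above.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "to", "of", "and", "i", "my"}  # illustrative stop list

def select_bigram_features(labeled_sentences, max_emotions_per_bigram=2):
    """labeled_sentences: (emotion, sentence) pairs, one entry per assigned label."""
    counts = defaultdict(Counter)   # emotion -> bigram frequency counts
    seen_in = defaultdict(set)      # bigram -> set of emotions it occurs with
    for emotion, sentence in labeled_sentences:
        words = sentence.lower().split()
        for bigram in zip(words, words[1:]):
            counts[emotion][bigram] += 1
            seen_in[bigram].add(emotion)
    features = defaultdict(list)
    for emotion, ctr in counts.items():
        for bigram, freq in ctr.most_common():
            if freq < 2:
                continue  # exclude bigrams occurring only once
            if all(w in STOP_WORDS for w in bigram):
                continue  # exclude bigrams made up entirely of stop words
            if len(seen_in[bigram]) <= max_emotions_per_bigram:
                features[emotion].append(bigram)  # unique to one or two emotions
    return features
```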
The bigrams identified as features were converted into regular expressions that allowed for matching with one intervening word. Thereafter the system performs exactly like our rule-based system: it considers rules in frequency order and assigns up to two emotions per sentence. In our submitted system we did not attempt to assign the five least common emotions (shown with * below). The number of rules discovered per emotion was significantly greater than in our rule-based system: Instructions (1276 rules), Hopelessness (417 rules), Information (258 rules), Love (158 rules), Guilt (136 rules), Thankful (80 rules), Blame (51 rules), Anger (26 rules), Hopeful (14 rules), Sorrow (9 rules), *Happiness (5 rules), *Fear (4 rules), *Forgive (2 rules), *Pride (0 rules), *Abuse (0 rules).
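Converting a bigram into a regular expression that tolerates one intervening word might look like the following sketch; the exact expressions used in our system may differ.

```python
import re

def bigram_to_regex(w1, w2):
    """Match w1 followed by w2, optionally with one intervening word (window of 3)."""
    return re.compile(r"\b%s\b(?:\s+\S+)?\s+\b%s\b" % (re.escape(w1), re.escape(w2)))
```

The optional `(?:\s+\S+)?` group is what permits a single word between the two members of the bigram, while two or more intervening words still fail to match.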
This lightly supervised method was developed in approximately 10 hours.
The held-out test data consisted of 300 suicide notes, which included 2,086 sentences. In the gold standard tagging of this data (released after systems submitted results), 1,272 emotions were assigned to 1,098 sentences, leaving 988 sentences with no-emotion (47%).
Our first system (Manual Rule-Based) achieved F1 = 0.45269, Precision = 0.45985, Recall = 0.44575, with N = 1,233. This represented a slight decline from the results with the training data (49%) but in general we felt this system generalized well. When compared to other systems in the Challenge (see Table 1) we can see that it (Rule) is slightly lower than the mean system performance (of 31 systems) but is within a standard deviation of that mean.
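As a check on the reported figures, the F-measure can be recovered from the reported precision and recall as their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall.
p, r = 0.45985, 0.44575  # reported precision and recall for the rule-based system
f1 = 2 * p * r / (p + r)
print(round(f1, 5))  # 0.45269, matching the reported F-measure
```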
The second system (Lightly Supervised) achieved F1 = 0.36455, Precision = 0.33644, Recall = 0.39780, with N = 1,504. This represented a significant decline from performance on the training data (which was at F1 52%). From this we concluded that our method was significantly over-trained, and that our manually developed system (which had many fewer rules) was able to capture more essential information that generalized much better.
The third system (Union) simply took the union of the first and second, in the hopes that the two systems would prove to be complementary (since one was manually developed and the other automatically). Unfortunately this did not prove to be the case. It achieved F1 = 0.44305, Precision = 0.34833, Recall = 0.60849, with N = 2,222.
Table 2 shows the distribution of emotions in the gold standard data versus the rule-based system, the lightly supervised system, and the union of those two systems. As a point of comparison, it also shows the result of taking the intersection of the rule-based and lightly supervised system.
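The union and intersection combinations referred to above amount to simple set operations on each sentence's predicted label set, for example:

```python
def combine(labels_a, labels_b, mode="union"):
    """Combine two systems' emotion labels for one sentence."""
    a, b = set(labels_a), set(labels_b)
    return a | b if mode == "union" else a & b
```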
We note that the rule-based system found a distribution of emotions very similar to the gold standard, whereas the lightly supervised system deviated a bit more. This suggests that at least the number of rules and their frequency of invocation were approximately correct for the rule-based system, even if the actual assignment of emotions was sometimes incorrect. We note further that the recall of the rule-based and lightly supervised systems was comparable, but they differed significantly with respect to precision. Taking their union had the effect of driving precision down sharply while increasing recall, resulting in an F1 score approximately equal to that of the rule-based system. The intersection of the rule-based system and the lightly supervised system attains relatively high precision (57%) but does so at the expense of recall (not surprisingly).
Table 3 shows a confusion matrix for our rule-based system. This was created by assigning partial credit when a sentence had multiple emotion labels in the gold standard but the system did not predict all of those. The totals were rounded to improve the readability of the table. Note that this confusion matrix does not include counts of cases where no-emotion was assigned by either the gold standard or the system (and the other disagreed). Thus, this matrix only represents cases where both the gold standard and system assign an emotion.
The diagonal total represents true positives (tp) and is equal to 564. Note that the total number of cases where an emotion is assigned both by the gold standard and the system is 793, meaning that accuracy in this case is 71%. This suggests that the significant problem faced by our rule-based system is handling the no-emotion case, and either falsely predicting that no-emotion appears (false positive), or failing to predict an emotion when one should be assigned (false negative). The confusion matrix shows that most errors revolve around the instructions emotion. This is not surprising since it is the most common emotion, and it is also perhaps one of the most general and ambiguous. It is also interesting to note that, for example, guilt and sorrow appear to be frequently confused. The same can be said for instructions and information.
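The accuracy figure above is simply the diagonal total divided by the number of cases where both the gold standard and the system assigned an emotion:

```python
diagonal_total = 564  # true positives (sum of the confusion-matrix diagonal)
both_assigned = 793   # cases where gold standard and system both assigned an emotion
print(round(diagonal_total / both_assigned, 2))  # 0.71
```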
Table 4 shows a confusion matrix where the no-emotion case was included. This was created by explicitly adding a no-emotion label to the gold standard data and system output. Here you can see that the diagonal total for true positives is raised to 1246, but the total number of emotions assigned is now 2283, meaning that accuracy has fallen to approximately 54.5%.
In general our rule-based system did reasonably well in identifying no-emotion sentences. Of 988 sentences in the test data with no-emotion assigned, our rule-based system correctly identified 709 of them, for an accuracy of 71%. The effect of this can be seen when scoring our rule-based system relative to the gold standard using the official scoring program (with no emotion tags included). In that case the results were F1 = 0.54175, Precision = 0.53126, Recall = 0.55265, and N = 2351.
However, what the confusion matrix shows very clearly is a significant number of cases where no-emotion is not assigned when it should be, and where no-emotion is assigned when an emotion is actually present. This is clearly the dominant factor in determining the rule-based system's level of performance.
Note that while no-emotion was not included as a tag in the gold standard data, implicitly it exists and contributes to the error rate in those cases of disagreement. The scoring software assigns a false positive when an emotion is assigned when one should not be, and a false negative when no-emotion is assigned when one should be. The confusion matrices above show that these cases dominate the totals of false positives and negatives, much more so than errors caused by assigning the wrong emotion (which are considered false positives by the scoring software).
Tables 5, 6, and 7 show the performance measures per emotion for our rule-based and lightly supervised systems and their union. Note that the values in these tables were created by extracting all system or gold standard sentences tagged with the given emotion, and then running that subset of the data through the official scoring software. This did not use the no-emotion tagged version of the data created to build the confusion matrices, but rather the actual submitted results and official gold standard.
In Table 5 we see that overall rule-based performance is comparable to that attained with the instructions and hopelessness emotions. This is not surprising since these are the two most common emotions. We also note that love was predicted very successfully, but that information was done rather poorly. Interestingly, in Table 6 we see that love is predicted with much less accuracy than in the rule-based approach.
There is a considerable body of work dedicated to the study of suicide notes. While we did not conduct an exhaustive review, we did attempt to familiarize ourselves with that literature. In particular we were interested if the rules that we developed for our manual rule-based system correspond to what has been observed in other studies.
Perhaps the most obvious and important issue with suicide notes is using them to determine the motives for a suicide, and attempting to generalize those findings. For example, Lester et al4 studied 262 suicides in Australia, and found that older people were more often motivated by a desire to escape pain and sickness, whereas this was less common in men. This study makes the point that demographic information could be a very useful piece of information in determining emotion or motive in a suicide note. To that end, Ho et al5 found that older people gave more instructions, whereas younger frequently asked for forgiveness.
Shapero6 collected the Birmingham Corpus of Suicide Notes, which contains 286 notes, 212 by males and 74 by females. This corpus also includes 33 genuine and 33 simulated suicide notes from Shneidman.7 Among the goals of this study was to try and characterize the differences between real and simulated notes. One of the observations most relevant to this Challenge is that simulated notes tend to include many fewer instructions.
This paper reports on the results from the Duluth systems that participated in the Sentiment Analysis track of the i2b2/VA/Cincinnati Children’s 2011 Challenge. We found that our manually constructed rule-based system performed significantly better than our lightly supervised system that was intended to try and mimic the human process. In general we observed that correctly identifying sentences that contain no-emotion was critical to performing well in this task, and that our systems had some difficulty with that. This may be in part because we did not specifically try to identify sentences that contained no-emotion, rather we simply assumed that any sentence with no assigned emotion contained no-emotion. In future work we would like to construct specific rules for identifying no-emotion cases, to see if that might reduce both our false negatives and false positives.
bAll confusion matrices were generated using code made available by Berry de Bruijn of the National Research Council of Canada. We thank him for this very significant contribution to this paper.
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.