As the majority of the target annotations are expressions of emotions, we sought to incorporate information about the psychological and emotional content of the notes by using Linguistic Inquiry and Word Count (LIWC).7 LIWC is a psycholinguistic resource that assigns one or more psychological categories, such as positive emotion, to individual words. For example, the word “happy” would be labeled with the category positive emotion. By scanning a text and assigning categories to applicable words in that text, one can derive an aggregate signature of the psychological character of that text. LIWC currently has 80 categories.
To perform category assignment for a given sentence, LIWC performs a lexical match against its word-to-category dictionary. As such, LIWC is essentially performing a look-up, without any part-of-speech identification or sense disambiguation, and employs a few simple look-aheads to deal with a handful of ambiguous cases. In the authors’ experience, when given a word, LIWC usually presumes that word’s primary part of speech and word sense when assigning psychological and emotional dimensions. For example, the categories induced by the word “cold” correspond to the adjective relating to the physical sensation of lowered temperature, rather than the adjective describing a person with little or no emotion or the noun denoting an infection. Despite this apparent deficiency, a previous study found LIWC to be better overall for identifying emotions than similar psycholinguistic resources.8
For each sentence, we applied LIWC and used the returned counts directly as features. Because LIWC’s analysis also includes explicitly non-emotional categories, such as the presence of pronouns and prepositions, that may be redundant with information already encoded by the POS tagger, we used only the LIWC categories with emotional content.
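The look-up and counting steps described above can be sketched as follows. This is a minimal illustration and not the LIWC software itself: the toy `LEXICON`, the `EMOTIONAL` category set, and the function name `liwc_features` are hypothetical stand-ins, and the real dictionary is far larger and includes the look-ahead rules mentioned earlier.

```python
import re
from collections import Counter

# Hypothetical toy word-to-category dictionary (an assumption for
# illustration); the real LIWC lexicon is much larger.
LEXICON = {
    "happy": {"affect", "posemo"},
    "sad": {"affect", "negemo"},
    "i": {"pronoun"},
    "in": {"preps"},
}

# Assumed subset of categories treated as carrying emotional content.
EMOTIONAL = {"affect", "posemo", "negemo"}


def liwc_features(sentence):
    """Count emotional category hits in one sentence by plain lexical match.

    No part-of-speech tagging or sense disambiguation is performed: every
    occurrence of a word receives the same categories regardless of context.
    """
    counts = Counter()
    for token in re.findall(r"[a-z']+", sentence.lower()):
        for category in LEXICON.get(token, ()):
            if category in EMOTIONAL:
                counts[category] += 1
    return counts
```

Under these assumptions, the returned counts can be appended to a sentence's feature vector as they are, while non-emotional categories such as the pronoun entry are dropped before counting.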
One of the primary motivations for using psycholinguistic resources such as LIWC is to introduce additional knowledge that could help identify the rarer emotions. As shown in , the top eight emotions account for 90% of all annotations. This leaves the remaining seven emotions at risk of being overpowered, as the optimizer used to train the emotion classifier is likelier to favor the majority classes and to neglect emitting the minority classes as hypotheses. We hoped to ameliorate this by introducing into the feature set potentially strong signals that correlate highly with just those minority classes, which should improve the classifier’s performance on those classes.
During development, we found that LIWC tended to assign multiple labels to words that we would ideally like to identify with a single label. For example, words commonly associated with the pride emotion consistently mapped to the LIWC affect, posemo, and achieve categories. This may cause problems for the learner when another target emotion also scores high on a subset of those categories, such as affect and posemo. By using a single feature to tie those occurrences together, we hoped to produce a stronger signal that the optimizer can use during classifier training. We therefore introduced another feature that looks for specific combinations of LIWC categories over each word; for each match found, the corresponding single feature is added to that instance. We targeted the minority emotions sorrow, pride, and happiness/peacefulness with this feature.
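One way such a combination feature could work is sketched below. Only the pride combination (affect, posemo, achieve) comes from the text; the feature name `combo_pride`, the rule table layout, and the function name are assumptions made for illustration, and the combinations used for sorrow and happiness/peacefulness are not stated, so they are omitted.

```python
from collections import Counter

# Hypothetical rule table: a set of categories that must all be present on
# a single word, mapped to the single feature emitted for that word.
COMBINATIONS = {
    frozenset({"affect", "posemo", "achieve"}): "combo_pride",
}


def combination_features(word_categories):
    """Emit one combined feature per word whose categories match a rule.

    word_categories is a list of category sets, one per word in the
    sentence, as produced by a dictionary look-up.
    """
    counts = Counter()
    for categories in word_categories:
        for required, feature in COMBINATIONS.items():
            # Subset test: the word must carry every required category.
            if required <= categories:
                counts[feature] += 1
    return counts
```

A word carrying only affect and posemo does not fire the rule, which is what lets the combined feature separate pride from other positive emotions that share those two categories.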
Although LIWC profiles text along 80 dimensions, we consider only three of these clearly relevant to the 8 “emotion” tags of this challenge: affect, negemo (negative emotion), and posemo (positive emotion). Even these did not exhibit a very strong correspondence with the target emotions we wish to annotate. We also found that the targeted emotions were more often expressed by phrases than by individual words.
To this end, we developed our own custom word and phrase lists that targeted the emotion annotations of interest. Like LIWC, these are applied over a source sentence, and the number of matches found in that sentence is entered as a feature. These lists were developed using the training notes, as well as our own experience.
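Counting list matches per sentence might look like the sketch below. The phrases shown are invented placeholders, since the actual lists are not given here, and the names `PHRASE_LISTS` and `phrase_list_features` are likewise hypothetical.

```python
import re

# Placeholder lists for illustration only; not the lists used in this work.
PHRASE_LISTS = {
    "sorrow": ["so sorry", "heartbroken"],
    "pride": ["proud of"],
}


def _compile(phrases):
    # Word boundaries prevent matches inside longer tokens; re.escape keeps
    # any punctuation in a phrase from being read as regex syntax.
    alternation = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


PATTERNS = {name: _compile(phrases) for name, phrases in PHRASE_LISTS.items()}


def phrase_list_features(sentence):
    """Return the number of list matches per target emotion for a sentence."""
    return {name: len(p.findall(sentence)) for name, p in PATTERNS.items()}
```

Because multi-word phrases are matched as units, this kind of feature can capture emotions that single-word look-ups tend to miss.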