In order to produce a multi-label classification of emotions, we need to extract a variety of features from the suicide notes. Some of these features are explicit, others implicit. We use statistical techniques to detect both explicit indicators of emotion (words and phrases that directly evoke an emotion) and implicit indicators (such as words or topics often associated with an emotion). Both types of indicators are important, as language allows for both explicit declaration and more subtle implication of the author’s emotional state. In addition, we use several similarity metrics to evaluate the “emotional distance” between sentences in the training and testing sets.
Statistical distillation of emotion-bearing phrases from the training data
The most accessible means of learning phrases associated with a given emotion is to discover phrases associated with that emotion in labeled data. This captures both common explicit references to emotion and common implicit phrases that frequently occur in a specific emotion’s context. For example, in the i2b2 training data, the phrase “can’t go on” is highly associated with the emotion Hopelessness, “please forgive me” is associated with Guilt, and “bless you” is associated with Love.
We perform a statistical dependence test for each possible phrase/emotion pairing, where the null hypothesis is that the phrase and emotion are unrelated and only co-occur by coincidence. To calculate this we use pointwise mutual information (PMI):
PMI(x, e) = log [ p(x|e) / p(x) ]

where p(x|e) is the probability of the phrase x occurring in a sentence labeled as emotion e, and p(x) is the probability of seeing phrase x in the training data.
Here we consider a phrase to be any fixed number of tokens (we experimented with 1-, 2-, and 3-token phrases) rather than a syntactic constituent. Examples of top phrases for the most common emotions are shown in the table below. When classifying new sentences, phrases scoring above a given threshold are extracted and matched to their associated emotions, where the exact threshold is specific to the individual feature. For more details on the actual features used, see the Feature types section. Additionally, we experimented with Fisher’s exact test as an alternative to PMI, but while it proved successful in previous work,8 it did not have a positive impact on this task.
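As an illustration, the PMI scoring described above can be sketched as follows. The function name and data layout are our own; phrase probabilities are estimated as the fraction of sentences containing the phrase, which is one plausible reading of the definitions above.

```python
import math
from collections import Counter

def pmi_scores(sentences, n=2):
    """Score each n-gram/emotion pair with pointwise mutual information.
    `sentences` is a list of (tokens, emotions) pairs, where `emotions`
    is the set of labels on that sentence (a hypothetical data layout)."""
    phrase_counts = Counter()   # sentences containing phrase x
    pair_counts = Counter()     # sentences labeled e that contain x
    emotion_counts = Counter()  # sentences labeled with emotion e
    total = len(sentences)
    for tokens, emotions in sentences:
        # Treat a phrase as present/absent per sentence (hence the set).
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        phrase_counts.update(ngrams)
        for e in emotions:
            emotion_counts[e] += 1
            for g in ngrams:
                pair_counts[(g, e)] += 1
    scores = {}
    for (g, e), c in pair_counts.items():
        p_x_given_e = c / emotion_counts[e]  # p(x | e)
        p_x = phrase_counts[g] / total       # p(x)
        scores[(g, e)] = math.log(p_x_given_e / p_x)
    return scores
```

Phrases scoring above the feature-specific threshold would then be kept as indicators for their emotion.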
Representative phrases chosen from the i2b2 training data by statistical phrase discovery.
Statistical distillation of emotion-bearing words from unlabeled data
In addition to collecting phrases from a small, labeled emotion corpus, words associated with emotions can be collected from a large, unlabeled corpus. Instead of using the manually labeled sentences, we utilize the emotion-evoking terms drawn from WordNet Affect.5
We perform no word sense disambiguation. Rather, WordNet Affect is transformed from a sense-based inventory into a surface-form lexical inventory that matches all possible senses of a word. Sentences containing a term from WordNet Affect are assumed to evoke the emotion that term is associated with in the WordNet Affect ontology (eg, “afraid” evokes Fear, “ecstatic” evokes Happiness). Our source of unlabeled data is the English Gigaword corpus,9
which contains over 8.5 million newswire articles. Due to the size of the data, only individual words are considered, as phrases of length two or more would require significantly more processing. PMI is again used to determine the most statistically indicative words.
We manually identified 21 emotions from the WordNet Affect ontology that would best correspond to the i2b2 emotions as well as additional high-level emotions that might prove useful. These chosen emotions are: emotion, mental-state, positive-emotion, negative-emotion, anxiety, liking, dislike, hate, joy (for Happiness), contentment (Pride), love (Love), gratitude (Thankfulness), calmness (Happiness_Peacefulness), positive-fear (Fear), positive-hope (Hopefulness), sorrow (Sorrow), sadness (Sorrow), regret-sorrow (Guilt), anger (Anger), forgiveness (Forgiveness), and despair (Hopelessness).
The primary limitations of this approach are (1) the assumption that sentences containing an emotion term actually evoke that emotion, and (2) the assumption that emotion in newswire is lexically expressed similarly to emotion in transcribed suicide notes. However, this approach finds emotion-evoking words not present in the small, labeled i2b2 corpus, and features based on these words (see the Feature types section) do prove useful for detecting emotions in suicide notes. Alternatively, unsupervised topic detection can also cluster words indicating the same emotions, allowing the discovery of many more emotion-bearing words. We therefore use topic modeling as a means of discovering additional features.
Related sentences can be clustered using topic modeling, which groups them based on implicit topical information. Topic modeling techniques, such as latent Dirichlet allocation (LDA),10
can discover cross-document similarities even when sentences have no words in common. We use the MALLET implementation of LDA11
and treat every sentence as its own document.
LDA then considers every sentence as a bag-of-words. It assumes each sentence is associated with a probabilistic mixture of topics, and each topic is composed of a probabilistic mixture of words. For example, one topic might deal with family and contain words such as “love”, “dear”, and “daughter”. Another topic might be more financial in nature and contain words such as “money”, “debt”, and “payment”. With LDA, the granularity of the topics may be adjusted by increasing or decreasing the total number of topics. Additionally, LDA is completely unsupervised, so it can operate over a large amount of unlabeled data.
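The paper uses the MALLET implementation; as an illustrative stand-in, the same sentence-as-document setup can be sketched with scikit-learn. The toy sentences and the topic count are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: each sentence is treated as its own "document".
sentences = [
    "i love you dear daughter",
    "the money in the bank will cover the debt",
    "please forgive me i am so sorry",
    "take the check from my purse to pay the bill",
]
counts = CountVectorizer().fit_transform(sentences)

# The paper runs 10 topics over the full corpus; 2 suffice for this toy data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic mixture per sentence; rows sum to 1
```

The number of components controls topic granularity, exactly as described above.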
We used LDA for modeling topics because we believe there is a relationship between topics and emotions. For instance, sentences about health are likely to address the reason for the author’s suicide and convey an emotion like Hopelessness. A sentence discussing financial issues is likely to contain Information. And a sentence topically related to religion is likely to evoke Forgiveness or Thankfulness. The table below contains the results of running LDA on the i2b2 training data with 10 topics (word casing has been removed). As can be seen in the table, common words such as “i”, “you”, and “have” are present in many topics since they do not add much topical information, while words like “bank”, “dollars”, “check”, and “purse” are co-located in a single topic (Topic 1), suggesting financial information. Other topics (eg, Topic 7) do not seem to form cohesive topic clusters. This is likely a result of running LDA over sentences instead of documents, as sentences are less likely than documents to have clearly defined topics.
Top words for each topic determined by LDA from the i2b2 training data.
Given an unlabeled sentence, the results of running a topic model can then easily be used to find similar sentences in the training data (see Similarity metrics) and the sentence’s inferred topics can be directly used as features in a classifier (see Feature types). Importantly, LDA’s compact topic representation generalizes well to valid semantic spaces, so if two sentences are in similar topics, they likely evoke similar emotions. Additionally, sentences containing similar emotions can be found through the use of similarity metrics.
Given the importance of implicit information in emotion detection, it is difficult to devise universal rules for what constitutes an emotional statement. Rather, this is often defined empirically by the task’s annotated training data. The relatively low inter-annotator agreement on this data, reported by the organizers as 0.546, confirms this. Thus, instead of designing methods that extract information from a sentence so that a classifier may decide what emotion is present, we focus on methods that find similar sentences and their emotions. In this case the classifier’s role is merely to weigh the results of multiple similarity metrics, thus simplifying the learning problem. We experimented with numerous similarity metrics but settled on just three: unweighted token overlap, tf-idf weighted token overlap, and topic similarity.
Unweighted token overlap treats both sentences as bags of words and measures the percentage of tokens the two sentences have in common. In set notation:
sim(S1, S2) = 2|S1 ∩ S2| / (|S1| + |S2|)

where S1 and S2 are the non-unique words (ie, multisets of tokens) in the two sentences, and |S| indicates the number of words in the sentence.
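A minimal sketch of this metric, using Python's `Counter` as a multiset; the exact normalization (a Dice-style coefficient over the two sentence lengths) is our reading of "percentage of tokens in common", not a formula given in the paper.

```python
from collections import Counter

def token_overlap(s1, s2):
    """Fraction of (non-unique) tokens shared by two tokenized sentences.
    Dice-style normalization is an assumption."""
    c1, c2 = Counter(s1), Counter(s2)
    shared = sum((c1 & c2).values())  # multiset intersection size
    return 2 * shared / (len(s1) + len(s2))
```

Identical sentences score 1.0 and disjoint sentences score 0.0.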
Tf-idf weighted overlap is simply a weighted version of token overlap designed to favor rarer words. The weights are assigned using term frequency-inverse document frequency (tf-idf), the standard means of assigning term importance in the field of information retrieval. Term frequency (tf) is the number of times the word appears in the sentence. Inverse document frequency (idf) is the inverse of the number of documents in a given corpus (we use English Gigaword) in which the term appears. We smooth the document frequency by assigning a minimum document count of 10 to rare words. This weighting method therefore gives greater importance to rarer words and almost no weight to stop words and punctuation, as they are present in nearly every document.
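One hypothetical realization of the weighted variant: shared tokens count by their tf-idf weight, with the document-count floor of 10 described above. The log-idf form and the normalization are our choices; the paper does not give the exact formula.

```python
import math
from collections import Counter

def tfidf_weighted_overlap(s1, s2, doc_freq, num_docs, min_df=10):
    """Token overlap where each shared token contributes its tf-idf weight.
    `doc_freq` maps a word to its document count in a background corpus
    (Gigaword in the paper); counts below `min_df` are floored."""
    c1, c2 = Counter(s1), Counter(s2)

    def weight(word, tf):
        df = max(doc_freq.get(word, 0), min_df)  # smoothing for rare words
        return tf * math.log(num_docs / df)

    shared = sum(weight(w, min(c1[w], c2[w])) for w in c1.keys() & c2.keys())
    total = sum(weight(w, c[w]) for c in (c1, c2) for w in c)
    return 2 * shared / total if total else 0.0
```

A match on a rare word like "debt" moves the score far more than a match on a stop word like "the".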
Topic-based similarity differs from word overlap similarity metrics in that it can find similar sentences that have few or no words in common. LDA assigns topic distributions to both documents (sentences in our case) and words. Typically, topic-based similarity metrics would use the topic distribution associated with the sentence. However, given the short length of sentences relative to the documents that topic models typically use, the sentence topic distribution can be quite noisy. Instead, we average the topic distributions for each word in the sentence in order to get an overall topic distribution. The topic distributions of two sentences are then compared using the inverse Jensen-Shannon divergence:
JS(A, B) = (1/2) KL(A || M) + (1/2) KL(B || M)

where A and B are the two topic distributions, M is the average of the two distributions, and KL(A || B) is the Kullback-Leibler divergence:

KL(A || B) = Σ_i A(i) log(A(i) / B(i))
Jensen-Shannon is simply a symmetric extension to Kullback-Leibler. Jensen-Shannon has proved useful in calculating the similarity of two probability distributions in many NLP applications.12
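The divergence computation above can be written directly from the definitions; the final inversion into a similarity score (here 1 / (1 + JS)) is a hypothetical choice, as the paper does not give the exact form.

```python
import math

def kl(a, b):
    """Kullback-Leibler divergence KL(A || B) for discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(a, b) if p > 0)

def js(a, b):
    """Jensen-Shannon divergence: symmetric, via the mixture M."""
    m = [(p + q) / 2 for p, q in zip(a, b)]
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

def topic_similarity(a, b):
    """Turn the divergence into a similarity (assumed inversion)."""
    return 1.0 / (1.0 + js(a, b))
```

Identical distributions give JS of 0 (similarity 1.0); maximally different ones give JS of log 2.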
These three metrics are used to compute the most similar sentences to a query sentence.
Features based on these similarity metrics can then use k-nearest-neighbor (KNN) style classification in order to indicate the emotions in similar sentences from the training data. Since KNN is a computationally expensive O(n²) operation, we pre-cache all possible sentence distances and the nearest 100 neighbors for each sentence. This caching process takes approximately one hour per similarity metric on a single CPU core. See the Feature types section for more details on these features.
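A minimal sketch of the caching step, assuming the cache stores (score, index) pairs per sentence; the names and cache layout are ours.

```python
import heapq

def precompute_neighbors(sentences, sim, k=100):
    """Cache the k most similar training sentences for each sentence.
    O(n^2) calls to `sim`, done once so classification-time features
    become cheap lookups."""
    cache = {}
    for i, s in enumerate(sentences):
        scored = ((sim(s, t), j) for j, t in enumerate(sentences) if j != i)
        cache[i] = heapq.nlargest(k, scored)  # best-first [(score, index), ...]
    return cache
```

Any of the three similarity functions above can be passed in as `sim`, producing one cache per metric.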
The approaches described in the previous section are integrated into a supervised classification framework, shown in . The exact choice of features is optimized relative to the classifier using an automated feature selection technique.
We utilize a series of binary SVM classifiers13
to perform emotion detection. Each classifier operates independently on a single emotion, resulting in 15 separate binary classifiers. The combination of these separate classifiers can be thought of as a single multi-label classifier, allowing a sentence to be annotated with zero or more emotions. If multiple binary classifiers return a positive result, the sentence has more than one emotion; if every binary classifier returns a negative result, the sentence has no emotions. While it would be possible to use separate features for each binary classifier, many of the emotions have very few training instances, so this might lead to over-fitting. Additionally, SVMs allow a bias parameter to be set to weight an individual outcome, which is useful when dealing with rare outcomes: a high-frequency outcome will always be chosen over a very low-frequency outcome if both are given equal weight, leading to good precision but very low recall for many of the emotions in this task. We set the bias parameter for each outcome to the inverse probability of that outcome in the training data (eg, Blame constitutes 2.1% of all emotions in the training data, so the positive output in the Blame classifier has weight 0.979 and the negative output has weight 0.021).
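As an illustrative stand-in for the SVM implementation cited above, scikit-learn's `LinearSVC` exposes a `class_weight` dict that approximates the per-outcome bias weighting described here; the function and its interface are our own.

```python
from sklearn.svm import LinearSVC

def emotion_classifier(X, y, emotion_freq):
    """One binary SVM for a single emotion. `emotion_freq` is the fraction
    of training instances carrying the emotion (eg, 0.021 for Blame), so
    the rare positive outcome gets the high weight 1 - emotion_freq."""
    clf = LinearSVC(class_weight={1: 1 - emotion_freq, 0: emotion_freq})
    return clf.fit(X, y)
```

Fifteen such classifiers, one per emotion, together form the multi-label classifier.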
Based on the methods outlined in the Approaches section as well as a bag-of-words baseline, we created the following feature types (or templates):
- SentenceUnigrams: A baseline bag-of-words feature.
- StatisticalLabeledPhrases(phrase_size, threshold): Returns all phrases in the sentence judged to be statistically indicative for any emotion. Parameters specify the phrase size (number of tokens) and the minimum PMI threshold.
- StatisticalUnlabeledWordSum(emotion): Real-valued feature that calculates the sum of word scores for the given emotion from the unlabeled data based on the WordNet Affect ontology. Unlike the feature based on statistical phrases from labeled data, the words from unlabeled data might not be present in the training data and therefore these features must be directly tied to a specific emotion.
- StatisticalUnlabeledWordStrongest(emotion): Real-valued feature that indicates the score of the strongest word for the given emotion instead of the sum.
- TopicScore(topic): The score for a given LDA topic for the sentence.
- MostCommonEmotion(sim_type, num_neighbors): A k-nearest-neighbor feature that indicates the most common emotion from a sentence’s nearest neighbors. All sentences from the training data are considered as potential neighbors. Parameters specify which of the three similarity metrics (unweighted token overlap, tf-idf weighted token overlap, topic similarity) to use as well as the number of neighbors to consider.
- StrongestEmotionScore(emotion, sim_type, num_neighbors): A real-valued feature that returns the similarity between the current sentence and the nearest neighbor that contains the given emotion. Parameters include the emotion, the similarity measure, and the maximum number of nearest neighbors to consider before returning a similarity of zero.
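For example, the MostCommonEmotion feature template could be realized as follows, assuming the neighbor cache sketched earlier (best-first (score, index) pairs) and a label map; all names here are hypothetical.

```python
from collections import Counter

def most_common_emotion(neighbors, labels, k):
    """Majority emotion among a sentence's k nearest training neighbors.
    `neighbors` is a best-first list of (score, index) pairs;
    `labels` maps a training-sentence index to its set of emotions."""
    votes = Counter()
    for _, idx in neighbors[:k]:
        votes.update(labels[idx])
    return votes.most_common(1)[0][0] if votes else None
```

Instantiating this with each of the three similarity metrics and several values of k yields the parameterized feature family described above.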
The feature types previously discussed (along with many more not discussed here) have far too many parameterizations and combinations to manually select the best subset of parametrized features. Rather, we use an automatic feature selection technique known as floating forward feature selection14
or greedy forward/backward selection. This method iteratively improves the set of features using greedy selection. Each iteration is composed of a ‘forward’ step, which adds at most one feature, and a ‘backward’ step, which removes already-added features. In the forward step, all unused features (ie, all possible parameterizations of the feature types above) are individually tested in combination with the current set of chosen features. The single feature that most improves cross-validation performance is added to the chosen feature set. If no new feature improves performance, the algorithm terminates. In the backward step, features in the chosen set that hurt cross-validation performance are removed. Intuitively, some features may become redundant or even harmful after new features are added, so pruning the chosen set can improve performance. The result of running feature selection on a 5-fold cross-validation of the training data is shown in the table below. These are the features used in our official submission to the i2b2 emotion detection task. All features from the table are used in all 15 classifiers. While features such as “StrongestEmotionScore(Anger, 15)” seem to target a specific emotion, they may be useful in other classifiers as well. Additionally, this allowed us to run our feature selection algorithm just once, using the scores for all 15 classifiers to guide the selection process, rather than running feature selection separately for each of the 15 emotion classifiers.
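The forward/backward loop just described can be sketched generically; `evaluate` stands in for the cross-validation scoring of a candidate feature set, and the function name is ours.

```python
def floating_forward_selection(candidates, evaluate):
    """Greedy forward/backward feature selection. `evaluate(features)`
    returns a cross-validation score for a feature set; higher is better."""
    chosen = []
    best = evaluate(chosen)
    while True:
        # Forward step: try each unused feature; add the best improvement.
        gains = [(evaluate(chosen + [f]), f) for f in candidates if f not in chosen]
        if not gains:
            return chosen
        score, feat = max(gains)
        if score <= best:
            return chosen  # no feature improves performance: terminate
        chosen.append(feat)
        best = score
        # Backward step: drop any chosen feature whose removal now helps.
        improved = True
        while improved:
            improved = False
            for f in list(chosen):
                trimmed = [g for g in chosen if g != f]
                s = evaluate(trimmed)
                if s > best:
                    chosen, best, improved = trimmed, s, True
                    break
```

Each call to `evaluate` corresponds to a full 5-fold cross-validation run, which is why the search is run once over all 15 classifiers' scores rather than per classifier.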
Features selected through automatic feature selection.
The feature selector chose features from each of the approaches discussed in the Approaches section, suggesting that they add complementary information. Two features were chosen based on the lexicon built from the labeled data: one uses 2-token phrases and the other uses 3-token phrases. The fact that a higher threshold was chosen for the 3-token phrases suggests that 3-token phrases can be quite noisy, so a higher threshold is necessary to filter all but the most indicative phrases. Two features were also chosen from the lexicon built from the unlabeled data, using the WordNet Affect emotions forgiveness and positive-fear. It is difficult to determine why these two were chosen instead of others, but given that Forgiveness and Fear were two of the rarest annotations in the training data, it is likely that the classifiers for other emotions were able to effectively use these features as well. The topic score for Topic 8 was chosen, a topic that deals with the author’s actions and thoughts. This was likely useful for distinguishing between the typical emotions and the two command-like emotions Information and Instructions. Finally, six separate features were chosen based on all three of the described similarity metrics. While most of the similarity features deal with specific emotions (such as the two that use Sorrow), they can still be useful for other emotion classifiers. For instance, knowing that no similar sentence has a strong Sorrow score can help the positive classification of the emotions Pride and Love. The similarity feature based on None (ie, the sentence has no emotion) was probably useful for making negative classifications in each emotion classifier.