Our sentiment classification system combines ML algorithms with regular-expression pattern rules. It builds on techniques such as named entity recognition, word normalization, POS tagging, and synonym expansion through WordNet. These NLP techniques helped generalize the data and improve the performance of both the ML and rule-based systems.
As for the ML approach, the major advantage we observed was that it generalized well across the problem space when provided with sufficient training data. Our ML system performed better on the final test set (F-score of 0.564) than in 10-fold cross-validation on the training corpus (F-score of 0.552). However, some emotions have only a small number of instances (eg, abuse and forgiveness), and these posed a great challenge to the ML system. Apart from the inherent challenges in identifying human emotions, irregular data formatting and annotation posed additional challenges. Our other ML explorations included stemming of words for token normalization, application of SMOTE [14] and of a support vector machine (SVM) with unequal error costs for the class imbalance problem, application of a sequential labeling method to exploit cross-line dependency, and incorporation of POS and other syntactic information as features. These attempts either did not improve performance or could not be explored fully within the limited time frame of the challenge event.
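To make the oversampling idea concrete, the following is a minimal, simplified sketch of SMOTE-style interpolation in plain Python. It is an illustration only and not the implementation used in our experiments: the function name `smote`, the parameter values, and the toy two-dimensional feature vectors are all assumptions made for this example.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself),
        # ranked by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Toy feature vectors standing in for a rare emotion class (hypothetical data)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
print(len(new_points), "synthetic samples created")
```

Each synthetic sample lies on the line segment between two existing minority samples, so oversampling adds no information beyond what the few original instances already contain.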
Some emotions are expressed with relatively explicit indicator keywords and simple patterns (eg, love and thankfulness). For these emotions, pattern matching rules seem to be effective. Other emotions, however, are expressed in various ways, and their phrasal patterns are difficult to generalize with manual pattern matching rules (eg, hopelessness, guilt, instructions, and information). Although a few descriptive patterns can be found for these emotions (ie, an instructions sentence starts with a base verb form; information contains addresses, phone numbers, monetary information, etc.), they are not always correct and do not cover many cases. In addition, a shortcoming of our rule-based system is that it was over-fit to the training data. As a result, classification performance on the test data was much lower than on the training data (F-score of 0.628 vs. 0.555 in the training and test sets, respectively).
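The kinds of rules described above can be sketched as follows. The regular expressions, the keyword lists, and the short list of base verbs are hypothetical examples written for this illustration; they are not the rules of our actual system.

```python
import re

# Hypothetical rules: every pattern and word list here is an illustrative assumption.
RULES = {
    # explicit indicator keywords
    "love": re.compile(r"\b(i\s+love\s+you|with\s+(all\s+my\s+)?love)\b", re.I),
    "thankfulness": re.compile(r"\b(thanks|thank\s+you)\b", re.I),
    # phone-number-like digit groups or dollar amounts
    "information": re.compile(r"\b\d{3}[-.\s]\d{4}\b|\$\s?\d+", re.I),
    # sentence starting with a base verb form (tiny sample verb list)
    "instructions": re.compile(r"^\s*(please\s+)?(call|give|pay|sell|send|tell|take)\b", re.I),
}

def classify(sentence):
    """Return the sorted list of emotion labels whose pattern matches the sentence."""
    return sorted(label for label, pattern in RULES.items() if pattern.search(sentence))

for s in ["Thanks _NAME_.", "I love you.", "Please call 555-1234 and pay $200.", "I am so tired."]:
    print(s, classify(s))
```

Keyword rules of this kind are easy to write for love and thankfulness, whereas the verb list for instructions must keep growing to cover new cases, which is one way such a rule set becomes over-fit to the training data.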
The Union system produced the highest recall because it merged both systems’ outputs; however, its precision was lower than that of the others. Although the Union system produced the highest micro-average F-score, a simple union of the two system outputs did not achieve much gain over the ML system alone.
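The precision/recall trade-off of a union can be illustrated with a small sketch. The gold labels and system outputs below are invented toy data, not challenge data, and `micro_prf` is a hypothetical helper name; the sketch assumes that each prediction is a (sentence id, emotion) pair.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F-score over (sentence_id, emotion) pairs."""
    tp = len(gold & predicted)  # pairs present in both gold and prediction
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Invented toy annotations (not taken from the corpus)
gold = {(1, "love"), (2, "thankfulness"), (3, "guilt"), (4, "hopelessness")}
ml_out = {(1, "love"), (3, "guilt"), (4, "anger")}
rule_out = {(1, "love"), (2, "thankfulness"), (3, "blame")}
union_out = ml_out | rule_out  # the union keeps every label either system proposes

for name, out in [("ML", ml_out), ("Rule", rule_out), ("Union", union_out)]:
    p, r, f = micro_prf(gold, out)
    print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because the union keeps the false positives of both systems, its recall can never be lower than that of either component, while its precision may drop.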
Besides the challenges pertaining to sentiment analysis, the insights we gained from the current exercise include the irregular and heterogeneous nature of real-life data broadly gathered for clinical NLP applications. We believe these problems cannot be resolved solely through standardization of data, format, and/or annotation, because in reality incoming data are not always well formatted. NLP systems need to accommodate unexpected vocabulary and formatting, as well as inconsistent annotation, to some extent. That being said, regular formatting of text and consistent annotation are highly desirable. Notably, in System 1 (ML emphasis), our re-annotation of the training corpus improved emotion classification by ~1% in micro-average F-score on both the training and test sets.
In the current corpus, we observed that the exact same sentence can have different emotions assigned to it in the gold standard. For example, the following cases are inconsistently assigned a given emotion (ie, sometimes assigned a given emotion, sometimes not):
“Thanks _NAME_.” <e = “thankfulness”>
“I love you.” <e = “love”>
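Such cases can be located automatically. The sketch below groups identical sentences (after lowercasing and whitespace normalization) and reports those that carry more than one distinct label set. The function name, the input format, and the toy annotations are assumptions made for this illustration.

```python
from collections import defaultdict

def find_inconsistent(annotations):
    """annotations: iterable of (sentence, set_of_emotions) pairs.
    Returns the normalized sentences that appear with more than one distinct label set."""
    seen = defaultdict(set)
    for sentence, emotions in annotations:
        key = " ".join(sentence.lower().split())  # lowercase, collapse whitespace
        seen[key].add(frozenset(emotions))
    return {s: labels for s, labels in seen.items() if len(labels) > 1}

# Toy annotations mirroring the examples above (hypothetical data)
annotations = [
    ("Thanks _NAME_.", {"thankfulness"}),
    ("Thanks _NAME_.", set()),
    ("I love you.", {"love"}),
    ("I love you.", set()),
    ("Goodbye.", set()),
]
inconsistent = find_inconsistent(annotations)
print(sorted(inconsistent))
```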
In addition, similar sentences are often assigned different emotion classes. We assume this is because the perception of emotion is subjective and can also be affected by the context of the whole document, which might cause inconsistency in emotion annotation. The relatively low F-scores in this task (mean = 0.4875 across all participating teams in this I2B2 challenge) might also reflect this intrinsic difficulty in emotion assignment. To partially incorporate nearby context, we tried using the previous sentence together with the current one in ML training, but this did not improve classification performance.
For some emotions (eg, sorrow, blame, and anger), it seems necessary to understand the overall contextual meaning of the sentence rather than rely on simple indicator keywords. Both the ML system trained without syntactic/semantic features and the string pattern matching rules are prone to fail in identifying those emotions.
Furthermore, some emotions seem to be annotated based on document-level understanding rather than on individual sentences handled separately. Those emotions are hard to classify correctly unless the system understands the overall context and the feelings of the person. The current system, trained and tested without considering deep syntactic/semantic aspects, would have difficulty identifying them correctly.