All of the systems we submitted achieved results in line with the performance we obtained using cross-validation on the training set. Since, for testing, we could train on the full data set (instead of 80% as in the 5-fold cross-validation setup), we expected a small improvement in our F1 score on the test data; this was realized for our first (S1) and third (S3) systems but not for our second (S2). One possibility is that the individualized optimization of split points on a per-emotion basis used in S2 led to overfitting.
We trained our classifiers to perform well specifically in terms of the F1 score, as this was the scoring goal of the i2b2 competition. The F1 score is the harmonic mean of precision and recall. In other settings, recall is known as sensitivity; it captures the proportion of annotated emotions that the classifier detects. Precision focuses on the proportion of positive predictions that were correct; this is known in many other settings, notably health research, as positive predictive value. In those same settings, recall (sensitivity) is supplemented, not by precision but by specificity, which focuses on the non-annotation of a label. Here, specificity would be the proportion of sentences that should not be annotated with a given emotional label that correctly did not have that label applied. The use of the F1 score as a summary in this setting completely ignores the true negatives (ie, sentences that were correctly not labeled with a given emotion). We believe that these true negatives were actually important. With this exercise we were effectively running a series of yes/no annotation exercises on the same dataset. The clear majority of sentences were not labeled with each given emotion: annotated sentences for each given emotion were quite rare in both the training and test datasets. The correct answer, in most instances, was to not apply each particular emotion. It is interesting to note that the specificity was extremely high (and, therefore, very encouraging) for all of the classifiers that we developed.
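For clarity, a minimal sketch of these metrics is given below, treating each emotion as a separate yes/no labeling task; the function name and 0/1 encoding are illustrative rather than drawn from our implementation.

def binary_metrics(gold, predicted):
    """gold, predicted: lists of 0/1 flags for one emotion over all sentences."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0    # positive predictive value
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0  # uses the true negatives that F1 ignores
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, specificity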
The annotated datasets were the gold standard in these exercises and reflected the time and effort of many people. Each sentence was annotated by at least three annotators, with an annotation assigned to a sentence only when two or more annotators agreed. However, there were many instances in both the training and test datasets where we found that emotions were inappropriately applied or omitted by the annotators, as detected by our classifiers. For example, the guidelines state that a sentence should be annotated with “forgiveness” if the author is forgiving someone, not if the author is asking for forgiveness. Yet, the sentence “Forgive me for this rash act but I alone did it.” was wrongly annotated with “forgiveness” in the training set by the annotators.
In addition, there were instances where exactly the same sentence did not attract the same emotion from the annotators. In the training data, some sentences appeared multiple times across notes. For example, the sentence “I love you.” appeared 7 times in the training data: 5 instances were annotated with “love” and 2 were left unannotated. Our Memorize Labels heuristic relied on the assumption that sentences appearing multiple times across the test and training data would be labeled consistently. However, there were at least 15 sentences in the test data that appeared in the training data with different annotations. Although the particular context of a sentence could affect the labeling, there were two pairs of notes appearing in both the test and training data that were nearly identical. These pairs of notes were the result of a single author writing two notes to different people; one note was in the training data and one in the test data. Even here we found inconsistencies. In one pair, 17 annotations were made to the note found in the training data, yet only 7 annotations were made to the note in the test data. For example, “I want to wear my red and black dress [at my funeral]” was annotated as “instructions” in the training data and was left unannotated in the test data.
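The heuristic itself is simple; a sketch along the following lines (names and data structures are assumptions, not our exact code) conveys the idea.

from collections import defaultdict

def build_memory(training_sentences):
    """training_sentences: iterable of (sentence_text, set_of_emotion_labels)."""
    memory = defaultdict(set)
    for text, labels in training_sentences:
        memory[text.strip().lower()] |= labels   # pool the labels seen for this exact sentence
    return memory

def memorized_labels(memory, test_sentence):
    # Copy remembered labels for a previously seen sentence; unseen sentences get no labels.
    return memory.get(test_sentence.strip().lower(), set())

As the “I love you.” example shows, such a memory is only as reliable as the consistency of the annotations it stores.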
The suicide notes provided in the data set were transcriptions of hand-written notes. These notes contained many spelling errors and tokenization inconsistencies. It was unclear where these errors originated, but we suspect some were genuine errors from the author and others were transcription errors introduced in preparing the data sets; eg, “3333 Burnet Ave” sometimes appears as “3333 Burent Ave”, and other such errors could have been introduced. Our spelling correction algorithm fixed minor errors (eg, “sufering” → “suffering”; “attemp” → “attempt”; “beond” → “beyond”) but failed to correct more complex errors that involved more than a single substitution, transposition, insertion, or deletion. For example, “capsuls” was amended by our system to “capsule” instead of “capsules”. Additionally, many spelling errors involved words (often spelled somewhat phonetically) that could not be corrected at all by this simple method: “hemorige” (“hemorrhage”), “disponded” (“despondent”), “rearenge” (“rearrange”). Some misspelled words were “corrected” erroneously, and there were also instances where the algorithm “corrected” words that were not incorrect: for example, changing the abbreviations “appt” (appointment) → “apt” and “tel” (telephone) → “tell”. Some of these errors may have been addressed more accurately by using an n-gram language model to estimate the best possible correction.10
For example, the phrase “get bettery charged” should have been corrected to “get battery charged” but was instead changed to “get better charged”.
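This behavior follows from the single-edit design; a minimal sketch in the style of Norvig’s well-known corrector illustrates it (the corpus file and names are assumptions, not our exact implementation).

import re
from collections import Counter

# Word frequencies from a reference corpus; "corpus.txt" is a placeholder.
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def single_edits(word):
    """All strings one substitution, transposition, insertion, or deletion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Keep known words; otherwise pick the most frequent known single-edit candidate.
    # "bettery" reaches both "better" and "battery" in one edit, so corpus frequency
    # decides; "hemorige" is more than one edit from "hemorrhage" and falls through.
    if word in WORDS:
        return word
    candidates = [w for w in single_edits(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word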
Introducing dependency relations (Dd) into the model provided a large boost to overall system performance: the second largest increases in precision, recall, and overall F1. The variable dependencies feature (Dv) conflated dependencies such as “dobj(blame, John)” and “dobj(blame, Mary)” into “dobj(blame, x)”. We expected that this could help with data sparsity issues, and we demonstrated large gains in recall when using this feature. Unfortunately, it also introduced a large drop in precision. These variable dependencies introduced considerable noise, possibly because we were not differentiating between the arguments of the dependencies we were conflating. For example, the annotation guidelines for the blame label state that the author of the note should be blaming someone. However, conflating “dobj(blame, money)” and “dobj(blame, weight)” with “dobj(blame, Mary)” is unhelpful given these guidelines. Had we used entity detection to determine that both Mary and John were people, we could have constructed “dobj(blame, PERSON)”, separating those examples from “dobj(blame, THING)” and potentially improving our performance.
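A sketch of the contrast between the literal (Dd) and variable (Dv) forms, under the assumption that dependencies arrive as (relation, governor, dependent) triples, is as follows.

def dependency_features(dependencies, variable=False):
    """dependencies: iterable of (relation, governor, dependent) triples."""
    features = []
    for rel, gov, dep in dependencies:
        if variable:
            features.append(f"{rel}({gov},x)")       # Dv: dependent conflated away
        else:
            features.append(f"{rel}({gov},{dep})")   # Dd: literal dependency
    return features

# With entity detection one could instead emit, eg, dobj(blame,PERSON) versus
# dobj(blame,THING), retaining the distinction the guidelines care about.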
As one might expect, each classifier performed particularly poorly on the emotions that occurred infrequently in the training data. Indeed, the performance of the classifiers for these emotions was so poor that we obtained better results by simply ignoring these emotions rather than including them in our final labeling. Improving our performance on these emotions should be the focus of continuing development of this work; we suspect that additional training data would have helped.
Our attempts to use classifier combinations were only partially successful. We demonstrated that introducing the emotionless classifier to boost the confidence of our labelings provided a large increase in recall; however, this came with a large decrease in precision, leaving the F1 score largely unchanged. We also explored using logistic regression to select a panel of orthogonal classifiers with combinations of features that might better balance precision and recall. The effort to select small panels from 48 combinations of features using regression models was not feasible in the time available to us but may warrant further investigation. An exhaustive evaluation of all pairs and triples of classifiers found that no combination of two or three classifiers outperformed the best standalone classifier.
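The exhaustive search itself is straightforward; a sketch is shown below, where classifiers maps names to per-sentence label sets on held-out data, score_f1 is a scoring function, and the union-based combination rule is an assumption made for illustration.

from itertools import combinations

def best_combination(classifiers, gold, score_f1, max_size=3):
    """Evaluate every pair and triple of classifiers, combining by label union."""
    best_combo, best_score = None, float("-inf")
    for size in range(2, max_size + 1):
        for combo in combinations(classifiers, size):
            combined = [set().union(*(classifiers[name][i] for name in combo))
                        for i in range(len(gold))]
            score = score_f1(gold, combined)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score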
In developing our classifiers, we tried to consider the practical applications of the findings from this exercise.11
The loss of any life is sad, and the early termination of one’s own life particularly so. There is no doubt in our minds that the sentiments expressed in the suicide notes must have been present prior to the actual time of suicide, and we consider that there may have been previous efforts to express these emotions to other people. One might therefore consider employing an automatic detection algorithm on social networking platforms that could review posts and trigger access to support networks. However, only systems with very high precision would be of any practical value: high precision is more important than high recall because we would not wish to propose interventions unless we were extremely confident in our predictions.
Towards that goal, our final system, S3, attempted to achieve high precision at the expense of recall while minimizing the impact on the F1 score. We could potentially have pushed precision further, with detrimental effects on recall and the F1 score, though it is encouraging that we achieved high precision while maintaining an F1 score similar to the mean of all systems submitted to this shared task. Further research might be required to establish whether the sentiments expressed in suicide notes are truly expressed beforehand. Further information on the age, gender, and physical and psychiatric health of the authors may also be of value.
Previous work in suicide note authorship detection used structural and grammatical features such as the number of paragraphs in the note, the number of misspellings, and the depth of parse tree.12
Our intuition was that these features would not have been useful here, though we corrected spelling errors and used dependency relations from a parser. Structural and grammatical features should be investigated further.