The training corpus consists of 600 hand-annotated suicide notes, while the test corpus comprises 300 suicide notes. These documents are of several kinds, mainly last wills and testaments. The corpus has been fully de-identified (names, dates, addresses) and tokenized.
Documents in the training corpus are very brief: on average, 7 sentences and 132.5 tokens (mostly words, but also punctuation marks) per document. Proportions are similar for the test corpus.
Documents include spelling errors (conctract, poicies). There are a few residual processing errors, particularly around the apostrophe in genitives and contractions, where spaces have been introduced (could n’t, Mary’ s) or the apostrophe has been replaced by a star with missing tokenization (don*t, wasn*t). Sentence segmentation is noisy: several short sentences are sometimes encoded as one single sentence. In the training corpus, 2,173 distinct sentences have been hand-annotated; among them, 302 sentences received several category labels (see the table below).
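The residual apostrophe artifacts described above can be repaired with a few surface rules. The sketch below is our illustration only (not part of the challenge pipeline) and covers just the artifact types cited here:

```python
import re

def repair_tokens(text: str) -> str:
    """Repair residual tokenization artifacts observed in the corpus.

    Minimal sketch: the rules below cover only the artifact types cited
    in the text (star-for-apostrophe, stray spaces around apostrophes);
    real notes may need additional rules.
    """
    # don*t / wasn*t -> don't / wasn't (star used in place of apostrophe)
    text = re.sub(r"(\w)\*(t\b)", r"\1'\2", text)
    # Mary' s -> Mary's (space introduced after a genitive apostrophe)
    text = re.sub(r"'\s+s\b", "'s", text)
    # could n't -> couldn't (space introduced before the contraction)
    text = re.sub(r"\s+n't\b", "n't", text)
    return text
```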
Number of sentences for each number of annotations per line in both training and test corpora.
Lines with several annotated emotions are long sentences: the two lines bearing five emotion labels are between 73 and 82 tokens long. As an example, the longest line (“My Dearest Son Bill : Please forgive mother for taking this way out of my umbearable trouble with your Dad Smith—Son I ’ve loved you and Dad beyond words and have suffered the tortures of hell for Smith but his lies and misconduct to me as a wife is more than I can shoulder any more—Son God has been good to you and mother and please be big and just know that God needs me in rest .”) has been annotated with the following five emotion classes:
abuse, blame, guilt, hopelessness, and love. The distribution of the annotations among the different categories is given in the table below.
Number of annotations for each category in both training and test corpora.
Here is an example document from the test corpus together with its reference annotation.
INPUT FILE: 20080901735_0621.txt
John : I am going to tell you this at the last.
You and John and Mother are what I am thinking—I can’t go on—my life is ruined.
I am ill and heart—broken.
Always I have felt alone and never more alone than now.
Please God forgive me for all my wrong doing.
I am lost and frightened.
God help me,
Bless my son and my mother.
OUTPUT FILE: 20080901735_0621.con.txt
c = “You and John and Mother are what I am thinking—I can’t go on—my life is ruined .” 2:0 2:21||e = “hopelessness”
c = “Always I have felt alone and never more alone than now .” 4:0 4:11||e = “sorrow”
c = “I am lost and frightened .” 7:0 7:5||e = “fear”
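The reference annotations follow a simple line-based format: a concept span `c`, two line:token offset pairs, and an emotion `e`. The following parser is a sketch based only on the example above; it assumes straight rather than typographic quotes, and the offset interpretation (start and end token of the span) is our reading of the format:

```python
import re
from typing import NamedTuple

class Annotation(NamedTuple):
    text: str                  # annotated sentence (concept span)
    start: tuple               # (line, token) offset of the first token
    end: tuple                 # (line, token) offset of the last token
    emotion: str               # emotion label

# One annotation per line:  c = "..." line:tok line:tok||e = "label"
LINE_RE = re.compile(
    r'c\s*=\s*"(?P<text>.*)"\s+'
    r'(?P<l1>\d+):(?P<t1>\d+)\s+(?P<l2>\d+):(?P<t2>\d+)'
    r'\|\|\s*e\s*=\s*"(?P<emotion>.*)"'
)

def parse_con(lines):
    """Parse the reference annotations of a .con.txt file (sketch)."""
    anns = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            anns.append(Annotation(
                m["text"],
                (int(m["l1"]), int(m["t1"])),
                (int(m["l2"]), int(m["t2"])),
                m["emotion"],
            ))
    return anns
```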
We have found the task to be difficult for the following reasons.
- Multiple labels per sentence. In the following example, the annotators assigned two labels, among them instructions:
In case of sudden death, I wish to have the City of Cincinnati burn my remains with the least publicity as possible as I am just a sick old man and rest is what I want.
Multiple labeling makes the task more difficult for machine-learning classifiers that normally work with a single label per sample.
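A standard workaround is the binary-relevance reduction, which turns the multi-label problem into one binary problem per label. The sketch below is our illustration of this generic reduction (`binary_relevance` is a hypothetical helper, not the system actually used in the challenge):

```python
def binary_relevance(samples, labels):
    """Binary-relevance reduction: one binary dataset per label.

    `samples` is a list of (text, set_of_gold_labels) pairs; the result
    maps each label to a list of (text, 0/1) pairs, so that any
    single-label binary classifier can then be trained per label.
    """
    datasets = {lab: [] for lab in labels}
    for text, gold in samples:
        for lab in labels:
            datasets[lab].append((text, int(lab in gold)))
    return datasets
```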
- No annotation. When no annotation was assigned to a sentence, two interpretations are possible: either no emotion is expressed, or the annotators disagreed. Here is an example, where a sentence could have been annotated with the label love but was left without annotation:
I love you all, but I can’t continue to be a burden to you.
This ambiguity of unannotated sentences adds noise to the training data.
- Fine-grained labels. Certain labels have very close meanings and are consequently hard to distinguish from one another: for example instructions, guilt vs. forgiveness, or sorrow vs. hopelessness.
- Unbalanced distribution of labels. Certain labels appear much more frequently than others in the training (and test) set. The most frequent label, instructions, appears 820 times in the training set, while the label forgiveness appears only 6 times. This makes rare classes all the more difficult to learn, due to possible biases during training.
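A common mitigation for such imbalance is inverse-frequency class weighting. As a sketch, using the two extreme counts quoted above (`class_weights` is an illustrative helper, not part of our system):

```python
def class_weights(label_counts):
    """Inverse-frequency class weights: weight = total / (n_classes * count).

    Rare labels receive proportionally larger weights, so that the loss
    (or the sampling scheme) does not get dominated by frequent classes.
    """
    total = sum(label_counts.values())
    n = len(label_counts)
    return {lab: total / (n * c) for lab, c in label_counts.items()}

# With the two extreme counts cited in the text:
w = class_weights({"instructions": 820, "forgiveness": 6})
# forgiveness receives a much larger weight than instructions
```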
- Lack of additional training data. The task organizers provided the training corpus; however, it is extremely difficult to find additional training material. To our knowledge, there are no publicly available text corpora of suicide letters or other similar resources. Building such a corpus is also problematic due to the nature of the task and the lack of information about the guidelines used by the annotators.