The table below lists the F1 scores of the best models from the training data by technique, method, and label; the micro-averaged F1 score is also provided for each method. Each of the techniques contributed one of the top three performing methods: rules (0.8396), STM using SVM (0.4630), and weights using DT (0.4206).
Training set F1 score by label and method.
Notable is the large discrepancy in performance between the rules and the other two techniques. Due to small sample sizes and time constraints, the rules were built using the entire training dataset; thus, their performance on the training dataset was expected to be overly optimistic relative to what would be seen on the test dataset. The other two techniques, however, both used stratified cross-validation to train and test models, so the training results of the STM and weight-based models were assumed to be more in line with the performance that could be expected on test data.
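To make the cross-validation setup concrete, a minimal sketch is given below. It is an illustration rather than the study's actual code, and it assumes a scikit-learn workflow with LinearSVC as a stand-in classifier; the real features and implementation are not shown in this section.

```python
# Minimal sketch (not the study's code): per-label stratified cross-validation
# with predictions pooled across labels and folds to yield a micro-averaged F1.
# Assumes a feature matrix X (instances x features) and a binary indicator
# matrix Y (instances x labels); LinearSVC stands in for the SVM used with STM.
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def micro_f1_by_cv(X, Y, n_splits=5, seed=0):
    all_true, all_pred = [], []
    for j in range(Y.shape[1]):                        # one binary task per label
        y = Y[:, j]
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, test_idx in skf.split(X, y):
            clf = LinearSVC().fit(X[train_idx], y[train_idx])
            all_true.extend(y[test_idx])
            all_pred.extend(clf.predict(X[test_idx]))
    # Pooling predictions across labels and folds makes this a micro-averaged F1
    return f1_score(all_true, all_pred)
```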
After finding the best models per technique and method, a variety of ensemble models were created and tested. The ensemble models selected from the training set and submitted for the test set are shown in the two tables below: the first shows which methods were included in each ensemble, the cutpoint used, and overall performance measures, whereas the second breaks down the F1 score by label. The first submission used only rules, while the other two submissions combined rules with weights or STM.
(All ensembles required only a single vote to classify an instance as positive.)
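The voting scheme can be illustrated with a minimal sketch, assuming each member method outputs binary per-label predictions; this is not the authors' implementation, and the variable names are illustrative only.

```python
# Minimal sketch (not the authors' implementation) of the vote-based ensembles:
# each member method supplies binary per-label predictions, and an instance is
# assigned a label when the number of positive votes reaches the cutpoint.
import numpy as np

def ensemble_votes(member_predictions, cutpoint=1):
    """member_predictions: list of (instances x labels) binary arrays,
    one per method (eg, rules, weights, STM)."""
    votes = np.sum(member_predictions, axis=0)     # positive votes per instance/label
    return (votes >= cutpoint).astype(int)

# For the submitted ensembles, a single vote was sufficient, eg:
# submission3 = ensemble_votes([rules_pred, stm_pred], cutpoint=1)
```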
Training and testing performance by submission.
F1 Score by submission and label.
The first submission for the test set demonstrated the rules were overfit, dropping almost 0.50 in F1 score (0.8396 to 0.3408). Many of the rules did not capture the same pattern of word usage in the test set (no true positives were found in eight of the 15 labels), leading to a substantial drop in recall. In addition, many of the patterns found in the training set also applied to sentences without the same specified label, generating a large number of false positives.
The second and third submissions fared better than rules alone, increasing the F1 score by 0.1362 and 0.1615, respectively. Combining rules with weighting methods resulted in two more labels finding true positives (BLAME and FEAR). In addition, five of the seven labels with true positives found by the rules had an increased F1 score.
While the third submission did not result in any additional labels finding true positives, it performed the best overall. Submission 3 had the highest F1 score (0.5023) and recall (0.5055) and the second highest precision (0.4992). At the individual label level, the third submission outperformed the first submission on six of the seven labels with true positives and the second submission on six of the eight labels with true positives.
The results of the third submission were analyzed for errors. For each label, a random sample of up to 50 false positives and 50 false negatives was examined (a sampling sketch is given after the list below). Overall, a few common themes emerged.
- A clear delineation between various labels was difficult to discern. For instance, sentences were incorrectly classified as INFORMATION instead of INSTRUCTIONS, and vice versa.
- Complex language usage was not accounted for because our techniques employed shallow text analysis. For instance, errors were found in sentences with sarcasm (eg, “also am sorry you never cared” → ANGER), negation (eg, “... she doesn’t love me ...” → not LOVE), and emotions stated in a general sense rather than expressed by the writer (eg, “... all us good men expect from the woman we love ...” → not LOVE).
- Wide variability in word usage and meaning made uncovering robust and generalizable patterns challenging, especially for rules. Having a document collection that spanned a 70-year period and included writers of heterogeneous backgrounds contributed to the variation.
- Finally, it was unclear why some sentences were or were not assigned certain labels in the gold standard. Some assignments appeared to be based on context from surrounding sentences, but the rationale for others was not as apparent.
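The per-label error sampling referenced before the list can be illustrated with a minimal sketch; the DataFrame layout and column names below are hypothetical, not taken from the study.

```python
# Minimal sketch (not the authors' code) of the per-label error sampling:
# draw up to 50 false positives and 50 false negatives for manual review.
# Column names ('sentence', '<label>_gold', '<label>_pred') are hypothetical.
import pandas as pd  # df below is assumed to be a pandas DataFrame

def sample_errors(df, label, n=50, seed=0):
    gold, pred = df[f"{label}_gold"], df[f"{label}_pred"]
    fp = df[(gold == 0) & (pred == 1)]             # predicted but not in gold standard
    fn = df[(gold == 1) & (pred == 0)]             # in gold standard but missed
    return {
        "false_positives": fp.sample(min(n, len(fp)), random_state=seed),
        "false_negatives": fn.sample(min(n, len(fn)), random_state=seed),
    }
```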