The performance of experiments was measured using three standard measures: precision (P), recall (R) and F-measure (F).
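Since all tables in this section report micro-averaged scores, it may help to make the pooling explicit. The following is a minimal sketch (function name and counts are illustrative): micro-averaging sums true positives, false positives, and false negatives across all categories before computing P, R, and F once.

```python
# Micro-averaged precision, recall, and F-measure: pool the per-category
# counts of true positives (tp), false positives (fp), and false negatives
# (fn) across all categories, then compute P, R, and F on the pooled counts.

def micro_prf(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts for three categories:
# pooled tp=60, fp=20, fn=20, so P = R = F = 0.75.
print(micro_prf([(30, 10, 5), (20, 5, 10), (10, 5, 5)]))  # → (0.75, 0.75, 0.75)
```

Unlike macro-averaging, this weights each sentence equally, so frequent categories dominate the score.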
To compare different statistical classifiers, we used bag-of-1–4-gram features to train SVM, Naïve Bayes, and decision tree boosting classifiers26, using the 600 training notes in 10-fold cross-validation experiments. The table below shows the micro-averaged results of the three classifiers. The SVM (with the bias of each category tuned) performed best of the three. The results also show that the SVM is flexible: after tuning the threshold parameter of each category, the SVM ranked first among the three classifiers. We therefore chose the SVM as the classifier for this task because of its flexibility.
10-fold cross-validation micro-averaged results using bag-of-1–4-gram features as baselines (the categories abuse, anger, blame, and pride are not included, to avoid false positives).
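The baseline comparison above can be sketched as follows, using scikit-learn as a stand-in for the authors' toolkit (the original implementation is not specified, and the per-category bias tuning is omitted here):

```python
# Sketch of the baseline comparison: bag-of-1-4-gram features, 10-fold
# cross-validation, and three classifiers (linear SVM, Naive Bayes, and
# boosted decision trees). scikit-learn is an assumed stand-in toolkit.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def compare_classifiers(sentences, labels):
    """Return the mean micro-averaged F1 of each classifier under 10-fold CV."""
    classifiers = {
        "svm": LinearSVC(),
        "naive_bayes": MultinomialNB(),
        "boosting": AdaBoostClassifier(),  # boosted decision stumps by default
    }
    return {
        name: cross_val_score(
            # Bag of 1-4 grams, recomputed inside each fold via the pipeline.
            make_pipeline(CountVectorizer(ngram_range=(1, 4)), clf),
            sentences, labels, cv=10, scoring="f1_micro",
        ).mean()
        for name, clf in classifiers.items()
    }
```

Putting the vectorizer inside the pipeline ensures the n-gram vocabulary is rebuilt from each fold's training split, avoiding leakage into the held-out fold.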
The effect of the framework, i.e., dividing the fifteen categories into three groups, is shown in the table below. The first row uses preprocessed sentences and 1–4 grams without feature selection, with the categories abuse, anger, blame, pride, and forgiveness excluded to avoid false positives. The second row uses the framework as explained above. Dividing the categories into three groups improves the F-measure of the system by 4.96%.
Micro-averaged results for fifteen categories on test data (evaluating the use of the framework).
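The framework can be pictured as three group-specific components whose per-sentence predictions are merged, rather than one flat fifteen-way classifier. The sketch below is an assumed structure: the group memberships are inferred from the surrounding text (eight subjective categories, two objective categories, and five low-frequency categories), and the three model arguments are hypothetical callables.

```python
# Hypothetical sketch of the three-group framework. Group membership is
# inferred from the surrounding text, not taken verbatim from the paper.
SUBJECTIVE = ["fear", "guilt", "hopelessness", "sorrow",
              "happiness_peacefulness", "hopefulness", "love", "thankfulness"]
OBJECTIVE = ["information", "instructions"]
LOW_FREQUENCY = ["abuse", "anger", "blame", "pride", "forgiveness"]

def classify(sentence, subjective_model, objective_model, low_freq_rules):
    """Merge multi-label predictions from the three group-specific parts."""
    labels = []
    labels += subjective_model(sentence)  # e.g. bias-tuned SVMs
    labels += objective_model(sentence)   # e.g. SVMs on normalized text
    labels += low_freq_rules(sentence)    # e.g. handcrafted patterns
    return labels
```

Splitting the problem this way lets each group use the feature set that suits it (spanning n-grams for subjective categories, item/location normalization for objective ones), as the following experiments show.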
The effect of the spanning n-gram features for the eight subjective categories is shown in the table below. All systems used bias-tuned SVMs as classifiers. The baseline uses 1–4 gram features. Replacing them with unigrams and four types of spanning n-grams, without feature selection, improves both precision and recall by a margin. After feature ranking and selection with the help of the LiveJournal corpus, the classification results improve further.
Micro-averaged results for eight subjective categories on test data (Evaluating spanning n-gram features and feature selection).
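Spanning n-grams pair words that co-occur within a window but are not necessarily adjacent. The four specific types used in the system are not detailed here, so the sketch below shows one common variant, gap-marked skip-bigrams, as an illustrative assumption:

```python
# Illustrative spanning (gapped) n-gram extraction: skip-bigrams that allow
# up to max_gap intervening tokens, with the gap marked by "*". This is one
# assumed variant, not the paper's exact four spanning n-gram types.

def skip_bigrams(tokens, max_gap=2):
    """Return word pairs within a window; gapped pairs carry a '*' marker."""
    feats = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            gap = j - i - 1
            feats.append((left, tokens[j]) if gap == 0
                         else (left, "*", tokens[j]))
    return feats

# ("a", "*", "c") captures that "a" and "c" co-occur despite the gap.
print(skip_bigrams(["a", "b", "c"], max_gap=1))
# → [('a', 'b'), ('a', '*', 'c'), ('b', 'c')]
```

Because such features are numerous and sparse, a ranking and selection step over a large in-domain corpus (here, LiveJournal) is what keeps them from hurting precision.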
The influence of item/location generalization on the objective categories is shown in the table below. The baseline uses pure 1–4 gram features. The results show that generalization greatly improves the performance on the two categories. To assess the effectiveness of the knowledge from eBay, we conducted experiments with and without item normalization. The eBay knowledge contributes a more significant performance gain for information than for instructions.
Micro-averaged results for objective categories on test data (Evaluating item/location normalization and eBay knowledge).
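A plausible form of the item/location generalization step is lexicon-driven placeholder substitution, so that n-gram features like "give my <ITEM> to" generalize across notes. The sketch below is an assumption about that form; the lexicons are illustrative stand-ins (the paper draws its item knowledge from eBay):

```python
# Sketch of item/location generalization (assumed form): replace concrete
# item and place mentions with placeholder tokens before n-gram extraction.
import re

ITEM_LEXICON = {"watch", "ring", "car", "insurance policy"}  # e.g. from eBay
LOCATION_LEXICON = {"chicago", "the bank", "my desk"}        # illustrative

def generalize(sentence):
    s = sentence.lower()
    # Substitute longest entries first so multi-word entries win.
    for item in sorted(ITEM_LEXICON, key=len, reverse=True):
        s = re.sub(r"\b" + re.escape(item) + r"\b", "<ITEM>", s)
    for loc in sorted(LOCATION_LEXICON, key=len, reverse=True):
        s = re.sub(r"\b" + re.escape(loc) + r"\b", "<LOC>", s)
    return s

print(generalize("Give my watch to John, it is in my desk"))
# → give my <ITEM> to john, it is in <LOC>
```

This also suggests why the eBay knowledge helps information more than instructions: a broad item lexicon collapses many distinct product mentions into one feature.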
We submitted three systems for the task. System 1 used the 600 notes provided by the i2b2 organizers as training data and 300 notes as test data. In System 2, extra labeled information sentences were added for training. In System 3, more sentences from all categories were added on top of System 2. These labeled sentences originated from www.suicideproject.org, a website where people share stories about their painful thoughts and unbearable lives. To collect them, 220,000 unlabeled sentences from the web were fed as test data into the information and instructions models, and all 220,000 sentences were ranked by classifier confidence. Sentences above a confidence threshold were manually chosen, and a total of 268 sentences (158 information and 110 instructions) were annotated. Note that we also added the sporadic labeled sentences of other categories that we encountered while labeling information sentences.
For System 3, we collected posts expressing emotions similar to those in suicide notes and manually labeled sentences following the same annotation schema as this task. A total of 1,814 sentences were labeled.
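The confidence-based selection step used to harvest extra training sentences can be sketched as follows (an assumed form; `score_fn` stands for the trained model's confidence score, and the threshold is whatever cut-off the annotators chose):

```python
# Sketch of confidence-based candidate selection (assumed form): score the
# unlabeled web sentences with a trained model, rank by confidence, and keep
# only those above a threshold for manual annotation.

def select_candidates(sentences, score_fn, threshold):
    """Return sentences whose confidence exceeds threshold, highest first."""
    scored = [(score_fn(s), s) for s in sentences]
    scored.sort(reverse=True)
    return [s for conf, s in scored if conf > threshold]
```

Ranking before thresholding keeps the manual annotation effort focused on the sentences the model is most certain about, which is how 220,000 candidates were reduced to 268 annotated ones.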
The final test results were micro-averaged; the table below shows the micro-averaged results for sentiment analysis in suicide notes. The experimental results demonstrate that adding extra labeled data improves the overall performance.
Micro-averaged results for sentiment analysis in suicide notes.