Overall features most often selected for the machine algorithms included:
Words: am, and, are, Betty, but, could, did, do, everything, for, good, goodbye, had, have, he, her, I, in, is, it, Jones, leave, life, longer, love, Mary, more, mother, my, n’t, now, Smith, so, that, the, things, this, to, Tom, was, with, and you. While it is reasonable to suggest that the anonymized proper nouns like Jones, Mary, Tom and Smith should not be part of the feature space, they were included because they can act as proxies for individual names;
Part of speech tagging: Cardinal number (CD), Determiner (DT), Preposition or subordinating conjunction (IN), Adjective (JJ), Adjective, superlative (JJS), Modal (MD), Noun, singular or mass (NN), Proper noun, singular (NNP), Noun, plural (NNS), Prepositional phrase (PP), Personal pronoun (PRP), Possessive pronoun (PRP$), Adverb (RB), Verb, base form (VB), Verb, past participle (VBN), Verb, non-3rd person singular present (VBP), and Verb, 3rd person singular present (VBZ);
Readability: the Flesch Reading Ease score is a 100-point scale on which higher scores indicate easier-to-read text. The Flesch-Kincaid Grade Level is a number that corresponds to a U.S. school grade level; and
Emotions: giving things away, hopeless, regret and sorrow.
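The two readability features listed above have standard closed-form definitions; a minimal sketch, assuming word, sentence, and syllable counts come from an upstream tokenizer:

```python
def flesch_reading_ease(words, sentences, syllables):
    # Standard Flesch Reading Ease: roughly a 0-100 scale, higher = easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Flesch-Kincaid Grade Level: the result maps to a U.S. school grade.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 100-word passage with 8 sentences and 130 syllables.
ease = flesch_reading_ease(100, 8, 130)
grade = flesch_kincaid_grade(100, 8, 130)
```

Both scores depend only on average sentence length and average syllables per word, which is why they can be computed for short notes as easily as for long documents.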
The human raters relied on the ontology shown in the figure Suicide Ontology. This ontology is much more extensive than the four emotions selected by the information gain function.
Feature selection and data reduction are listed in Feature Selection Process and Results. Information gain was calculated only for the training data in each bootstrapped sample. From the initial 1063 possible features, 66 were ultimately selected based on information gain and frequency. They included words, parts of speech, concepts, and reading scores.
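Information gain for a candidate feature is the drop in label entropy after splitting the notes on that feature; a minimal sketch on toy genuine/elicited labels (illustrative data, not the study's):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Label entropy minus the weighted entropy after splitting on the feature.
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy example: a feature that perfectly separates genuine from elicited notes.
feat = [1, 1, 0, 0]
labs = ["genuine", "genuine", "elicited", "elicited"]
```

A perfectly predictive binary feature yields a gain equal to the label entropy (1.0 bit for a balanced binary label), while an uninformative one yields a gain of 0; ranking features by this quantity and keeping the top scorers is the screening step described above.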
Genuine and Elicited Notes Descriptive Statistics provides the mean and standard deviation of a number of note characteristics. It shows the fifteen features with the smallest p-values in two-sample Wilcoxon tests. Hypothesis testing is only one of many feature-selection methods and may not always describe the data accurately.38
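The screening statistic behind those p-values is the two-sample Wilcoxon rank-sum test; it can be sketched with its normal approximation (an illustrative implementation, not the one used in the study):

```python
import math

def rank_sum_p(x, y):
    # Two-sample Wilcoxon rank-sum test with a normal approximation.
    # Assumes no ties, which keeps rank assignment trivial; a real screen
    # over word frequencies would need a tie correction.
    nx, ny = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Rank sum of sample x (ranks start at 1).
    w = sum(i + 1 for i, (_, group) in enumerate(pooled) if group == 0)
    mean = nx * (nx + ny + 1) / 2.0
    sd = math.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    z = (w - mean) / sd
    # Two-sided p-value from the standard-normal CDF.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
```

For a given feature, `x` would hold its values in the genuine notes and `y` its values in the elicited notes; small p-values flag features whose distributions differ between the two groups.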
Some machine learning algorithms, such as LMT, have feature selection embedded. It is worth examining whether different feature-selection algorithms yield the same results, although this is possible only for some simple problems.
Genuine and elicited notes descriptive statistics.
Human & Machine Raters after 25 × bootstraps shows that mental health providers performed better than psychiatry trainees, but not as well as the best machine learning algorithms. The psychiatry trainees' overall categorization was roughly equivalent to a coin flip. Mental health providers were significantly better than trainees, accurately classifying notes about 63% of the time.
Human & Machine raters after 25 × bootstraps.
The table compares the machine classification algorithms with psychiatry trainees and mental health providers. On average, the best machine algorithm (Logistic Model Trees) performed significantly better than the mental health providers. All of the algorithms performed significantly better than the psychiatry trainees, and nine of the ten machine algorithms performed significantly better than the mental health providers.
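The 25× bootstrap comparison of raters can be mimicked by resampling the scored notes and recomputing accuracy on each resample; a minimal sketch with synthetic correctness indicators (the 74% and 50% rates below are invented for illustration, not the study's figures):

```python
import random

def bootstrap_accuracies(correct, n_boot=25, seed=0):
    # 'correct' is a 0/1 list: whether the rater classified each note correctly.
    rng = random.Random(seed)
    n = len(correct)
    accs = []
    for _ in range(n_boot):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accs.append(sum(sample) / n)
    return accs

# Hypothetical raters over 100 notes: a machine right 74% of the time,
# a trainee right 50% of the time (coin-flip level).
machine = [1] * 74 + [0] * 26
trainee = [1] * 50 + [0] * 50
m_accs = bootstrap_accuracies(machine)
t_accs = bootstrap_accuracies(trainee)
```

Comparing the two accuracy distributions (rather than two single numbers) is what allows statements like "significantly better" above: the spread across bootstrap resamples estimates how much each rater's accuracy would vary on a fresh sample of notes.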
The performance of the different machine learning algorithms is complementary. J48/PART (0.640/0.645) suggests that a tree representation is mediocre for these data. On the other hand, Linear SMO (SVM)/LMT (0.705/0.744) suggests that there is some linear separability between the two categories. In addition, logistic regression outperformed the linear support vector machine, with LMT reaching 0.744.
Logistic Model Tree: when all features and all suicide notes are used for training and LMT is trained on the entire data set, there is only one leaf, with two linear functions that categorize the suicide notes. Only three features are shared by the Wilcoxon test and LMT (maximal frequency of a word, Flesch-Kincaid grade level, and cardinal-number frequency). The two equations misclassified only four documents. The features selected by LMT describe sentences (number of words, depth of the parsed tree), whereas hypothesis testing selected features that describe different aspects of the notes.
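A leaf holding two linear functions classifies a note by comparing their scores, equivalently by a softmax over them; a minimal sketch with invented coefficients (these are not the tree's actual equations):

```python
import math

def lmt_leaf_predict(features, w_genuine, w_elicited):
    # One score per class: intercept plus a dot product of weights and features.
    # The leaf predicts the class with the larger score; a softmax over the
    # two scores gives the class probability.
    s_g = w_genuine[0] + sum(w * f for w, f in zip(w_genuine[1:], features))
    s_e = w_elicited[0] + sum(w * f for w, f in zip(w_elicited[1:], features))
    p_genuine = math.exp(s_g) / (math.exp(s_g) + math.exp(s_e))
    return ("genuine" if p_genuine >= 0.5 else "elicited"), p_genuine

# Hypothetical 3-feature note: [max word frequency, FK grade, cardinal-number freq.]
label, p = lmt_leaf_predict([0.2, 5.0, 0.0],
                            w_genuine=[0.5, 1.0, -0.1, 2.0],
                            w_elicited=[-0.5, -1.0, 0.1, -2.0])
```

Because the final tree has a single leaf, the model degenerates to exactly this kind of logistic comparison over the whole feature space, which is consistent with the partial linear separability noted above.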
Logistic Model Tree when all features and all suicide notes are used for training.
The tree can be difficult to read, so we offer the following example as an explanation. Hyperspace Definition shows a three-dimensional cube, or a hyperspace with three features. Axis z represents the Flesch-Kincaid reading score, axis y represents the MLS method, and axis x represents the MDS method. In this case, the values the methods compute create a hyperplane, shown in the center of the defined hypercube. Points above the hyperplane are labeled with a "+"; in our case these represent genuine notes. Points below the hyperplane are labeled "−"; in our case, these represent elicited notes.
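The +/− labeling described above reduces to the sign of each point's score against the hyperplane; a minimal sketch with an invented plane through the cube (the coefficients are assumptions, not fitted values):

```python
def side_of_hyperplane(point, normal, offset):
    # Sign of n·x - b: positive means "+" (genuine side),
    # negative or zero means "-" (elicited side).
    score = sum(n * x for n, x in zip(normal, point)) - offset
    return "+" if score > 0 else "-"

# Three axes as in the figure: x = MDS, y = MLS, z = Flesch-Kincaid score,
# with a hypothetical plane x + y + z = 1.5 slicing through the unit cube.
normal = (1.0, 1.0, 1.0)
offset = 1.5
corner_high = side_of_hyperplane((1.0, 1.0, 1.0), normal, offset)
corner_low = side_of_hyperplane((0.0, 0.0, 0.0), normal, offset)
```

Any linear classifier in this three-feature space, including the linear functions at the LMT leaf, labels notes in exactly this way: it only matters which side of the plane a note's feature vector falls on.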