The given topic classification task is a supervised learning problem in which each sentence is assigned to one or more predefined topics. The first problem associated with this task, and with text categorization in general, is the high dimensionality of the feature space, since the features chosen are most often the frequencies of individual words. Such a representation suffers from the curse of dimensionality (or Hughes effect),17 whereby, for a fixed training dataset size, the predictive power of a machine learning algorithm decreases as the dimensionality increases. To reduce the number of features, we applied two strategies: generalization and mutual information.
As part of generalization, individual words were mapped to different types of categories. At the most general level, words were mapped to their POS, ie, the syntactic information attached to words during linguistic pre-processing. A total of 21 POS classes were used. While this certainly overgeneralizes individual words, some types of POS information do provide useful clues for topic classification. For example, tokens containing numerical information, and thus tagged as CD (ie, cardinal number), vary widely in their content, which by itself does not provide information that can improve the classification. However, their general type alone (ie, CD) provides a useful classification clue, as it was mainly associated with the Information class, where it usually represents information such as addresses, phone numbers, quantities, and dates. Here are some representative examples to illustrate its use in sentences classified as Information (a sketch of this generalization step follows the examples):
- All my things are at <CD>3333</CD> Burnet Ave.
- Can be reached by phone State—<CD>636–2051</CD>.
- $<CD>147.00</CD> in purse.
- I am paid up there till <CD>01–01–01</CD>, there is $<CD>145.00</CD> in cash in bank book.
- My Social Security number is <CD>333–33–3333</CD>.
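As a minimal sketch of this generalization step, the following uses NLTK's default Penn Treebank tagger as a stand-in for the linguistic pre-processing described above; the tagger and its tagset (which differs from our 21 POS classes) are assumptions for illustration only.

```python
# Sketch: POS-based generalization, assuming NLTK's default tagger.
from collections import Counter

import nltk  # requires NLTK's tokenizer and tagger data to be downloaded


def pos_features(sentence):
    """Replace each token with its POS tag and count tag frequencies."""
    tokens = nltk.word_tokenize(sentence)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return Counter(tags)


print(pos_features("All my things are at 3333 Burnet Ave."))
# The CD count captures the presence of numerical information,
# e.g. Counter({'NNP': 2, 'CD': 1, ...})
```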
At a slightly lower level, we generalized individual words into their lexical domains. Namely, WordNet synsets are organized into 45 lexical domains based on syntactic category and logical groupings.18 Here we provide some examples of words from the verb.possession domain (ie, verbs of buying, selling, owning, etc.) used in sentences classified as Instructions (a sketch of this generalization follows the examples):
- <possession>Buy</possession> John some clothes or what he needs most.
- I give my sister power of attorney to <possession>cash</possession> checks.
- Don’t <possession>pay</possession> any more rent.
- Don’t <possession>sell</possession> the new house, please.
- I also told Mr. J. Johnson not to <possession>spend</possession> any more than $300.00 on my burial.
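A minimal sketch of this domain-level generalization, assuming NLTK's WordNet interface, which exposes the lexical domains as "lexicographer file" names (eg, verb.possession); taking the first listed sense of each word is a simplifying assumption, as the sense-selection strategy is not the point here.

```python
# Sketch: generalizing words to WordNet lexical domains (lexicographer files).
from collections import Counter

from nltk.corpus import wordnet as wn  # requires the 'wordnet' corpus


def domain_features(tagged_lemmas):
    """Map (lemma, POS) pairs to lexical domains and count the domains."""
    counts = Counter()
    for lemma, pos in tagged_lemmas:
        synsets = wn.synsets(lemma, pos=pos)
        if synsets:
            # Simplification: take the first (most frequent) sense only.
            counts[synsets[0].lexname()] += 1
    return counts


print(domain_features([("buy", wn.VERB), ("pay", wn.VERB), ("sell", wn.VERB)]))
# Counter({'verb.possession': 3})
```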
Where we needed more fine-grained groupings of words than provided by WordNet, we assembled our own lexicons to support lexical representation of relevant semantic categories (see ) in addition to lexicons of words related to the topics considered (see ), eg,
- That <health>arthritis</health> and hardening of the <health>arteries</health> are too much for me.
- Have him notify my <occupation>lawyer</occupation>.
- May <religious>God</religious>, <family>family</family>, <people>friends</people>, and <name>John</name> forgive me.
- <instruction>Bury</instruction> me at least of <instruction>expense</instruction>.
- You have been <anger>mean</anger> and also <anger>cruel</anger>.
- <love>Dearest</love> <love>darling</love> I <love>love</love> you.
Having generalized individual words in the ways described above, we counted the frequencies of their general categories and used these counts as features instead of the frequencies of individual words (a sketch of this counting scheme follows the examples below). This was done for all generic feature types apart from the occupation type, where we differentiated between individual words, as they were shown to be associated with different topics, eg,
- Fear: I know I should see a <occupation>doctor</occupation> but I’ve been afraid to ask you.
- Instructions: Have him notify my <occupation>lawyer</occupation> John J. Johnson.
- Anger: They are gang of <occupation>politicians</occupation> and grafters.
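A minimal sketch of this counting scheme is given below; the lexicon entries are illustrative stand-ins for the hand-built lexicons, not their actual contents.

```python
# Sketch: category-frequency features, with occupation words kept individual.
from collections import Counter

# Illustrative entries only; the real lexicons were assembled by hand.
LEXICONS = {
    "health": {"arthritis", "arteries"},
    "occupation": {"doctor", "lawyer", "politician"},
    "anger": {"mean", "cruel"},
}


def lexicon_features(lemmas):
    feats = Counter()
    for lemma in lemmas:
        for category, words in LEXICONS.items():
            if lemma in words:
                if category == "occupation":
                    feats[f"occupation:{lemma}"] += 1  # keep word identity
                else:
                    feats[category] += 1  # count the category only
    return feats


print(lexicon_features(["have", "him", "notify", "my", "lawyer"]))
# Counter({'occupation:lawyer': 1})
```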
To address the potential loss of important information due to overgeneralization, we also used individual words as features, restricted to those selected using mutual information,15 an additional strategy we applied to reduce the dimensionality of the feature space. Mutual information, as one of the most effective feature selection mechanisms, was used to reduce the number of words considered by identifying the most informative ones. In this case, individual words were still used as features and their frequencies were counted. This reduced the number of words considered from 4,506 to the 153 most informative ones.
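One way to realize this selection step is sketched below with scikit-learn as an assumed stand-in for the actual implementation; the toy sentences and single-topic labels are invented for illustration.

```python
# Sketch: selecting the most informative words by mutual information.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

sentences = ["Don't pay any more rent.", "You have been mean and cruel."]
labels = ["instructions", "anger"]  # simplified: one topic per sentence

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)  # word-frequency features

# k was 153 above; k=2 here to suit the toy data.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, labels)
selected = vectorizer.get_feature_names_out()[selector.get_support()]
print(selected)  # the k words retained as individual-word features
```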
In order to identify explicit mentions of a range of different emotions, we used the WordNet-Affect3 lexicon described in the previous section, eg,
- I’ve tasted the last bitter dregs of <despair>despair</despair>, disillusion, <forlornness>loneliness</forlornness>, <misery>misery</misery>, poverty, strife, confinement, <positive-concern><negative-concern><distress>worry</distress></negative-concern></positive-concern>, <grief>grief</grief>, failure, and everything else that could contribute to an ignominious end.
Each word found in the lexicon was mapped to the emotion(s) often expressed by that word. Where a word was mapped to multiple emotions (eg, worry was mapped to three emotions: positive-concern, negative-concern, and distress), all mappings were used, that is, no disambiguation was performed. This type of noise was left to be resolved by the machine learning algorithm at a later stage. Each occurrence of an emotion word was used to increase the corresponding feature value. In addition, the hierarchy of emotions was used to generalize emotions at all levels: when a word was mapped to an emotion, we also used that emotion's ancestors as features and increased their values too. For example, the word despair was mapped directly to the despair emotion, and indirectly to its ancestors: negative-emotion, emotion, affective-state, mental-state, and root. This allowed the machine learning algorithm to decide on the optimal subset of emotional features to use, as well as the optimal level of granularity for differentiating between emotions. Finally, only those emotions identified in the training dataset were used as features, which reduced the number of these features from 306 to 58.
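The following sketch illustrates the ancestor propagation; the parent map is a tiny hand-made excerpt standing in for the WordNet-Affect hierarchy, and the word-to-emotion mappings are taken from the examples above.

```python
# Sketch: emotion-feature counting with ancestor propagation.
from collections import Counter

PARENT = {  # child -> parent, up to the hierarchy root (excerpt only)
    "despair": "negative-emotion",
    "negative-emotion": "emotion",
    "emotion": "affective-state",
    "affective-state": "mental-state",
    "mental-state": "root",
}

WORD_EMOTIONS = {  # all mappings kept; no disambiguation
    "despair": ["despair"],
    "worry": ["positive-concern", "negative-concern", "distress"],
}


def emotion_features(lemmas):
    feats = Counter()
    for lemma in lemmas:
        for emotion in WORD_EMOTIONS.get(lemma, []):
            node = emotion
            while node is not None:  # count the emotion and every ancestor
                feats[node] += 1
                node = PARENT.get(node)
    return feats


print(emotion_features(["despair"]))
# Counter({'despair': 1, 'negative-emotion': 1, 'emotion': 1,
#          'affective-state': 1, 'mental-state': 1, 'root': 1})
```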
To identify the emotional tone of a sentence (ie, its positive or negative polarity), we used SentiWordNet,4 which maps words to their positive and negative scores. The polarity of a sentence was calculated by aggregating the polarity scores of individual words in three ways: taking the maximum score, summing the scores, and averaging them. We used all of the aggregated scores as features, again leaving it to the machine learning algorithm to decide which features are most useful in terms of classification performance. Two other emotive lexicons were used, which simply classify words as being positive or negative.5,6
The positive or negative polarity of a sentence was quantified as the percentage of positive and negative words found in the sentence. We also counted the occurrences of negation words (eg, no, never, hardly, etc.) to help identify negative tone. In addition, as an emotion represents a psycho-physiological experience of an individual’s state of mind, it is often expressed subjectively from the first person’s point of view and often involves other people, who are the cause or the object of the emotion. We already used four lexicons to support the recognition of different groups of people and their roles (see ), but since people are most often referred to by pronouns, we also counted the occurrences of personal and possessive pronouns to help identify such references. We differentiated between the first person, as the potential subject of an emotion, and all other persons (except for the gender-neutral ones) as its object.
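A sketch of the three polarity aggregations, assuming NLTK's SentiWordNet interface and, as a simplification, first-sense scoring (sense selection is not specified above):

```python
# Sketch: max/sum/average aggregation of SentiWordNet polarity scores.
from statistics import mean

from nltk.corpus import sentiwordnet as swn  # requires 'sentiwordnet' corpus


def polarity_features(lemmas):
    scores = []
    for lemma in lemmas:
        senti = list(swn.senti_synsets(lemma))
        if senti:
            # Signed score (positive minus negative), first sense only.
            scores.append(senti[0].pos_score() - senti[0].neg_score())
    if not scores:
        return {"max": 0.0, "sum": 0.0, "avg": 0.0}
    return {"max": max(scores), "sum": sum(scores), "avg": mean(scores)}


print(polarity_features(["dearest", "darling", "love"]))
```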
So far, we have used the bag-of-words model, in which each sentence is represented as an unordered collection of individual words normalized to their lemmas, ignoring the relationships between them. However, such relational information, eg, represented through the use of bigrams, may substantially improve the quality of features and thus increase the overall classification performance.19 Still, matching longer phrases may lead to a decrease in performance due to high dimensionality and low frequency.20 Therefore, it is essential to optimize the choice of more complex features. Instead of matching exact phrases, we opted for regular expressions as a more flexible way of representing relationships between individual words. Such flexibility results in both lower dimensionality and higher frequency of the features, thus avoiding the performance degradation associated with introducing longer phrases into the feature space. We also conflated the rules by associating them with specific topics and aggregating their frequencies, to further address the dimensionality and frequency issues. A separate feature was introduced for each topic, and its value was calculated as the number of regular expressions matched from the corresponding set. provides examples of regular expressions used to introduce more complex features on top of those based on individual words.
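A minimal sketch of the topic-conflated regex features; the patterns below are invented for illustration and are not those from the referenced table.

```python
# Sketch: one feature per topic, valued as the number of its patterns matched.
import re

TOPIC_PATTERNS = {  # hypothetical patterns for illustration only
    "instructions": [r"\bdon'?t\s+\w+", r"\bnotify\s+my\b"],
    "information": [r"\$\s*\d+(\.\d{2})?", r"\b\d{3}[-–]\d{4}\b"],
}


def regex_features(sentence):
    return {
        topic: sum(1 for p in patterns if re.search(p, sentence, re.IGNORECASE))
        for topic, patterns in TOPIC_PATTERNS.items()
    }


print(regex_features("Don't pay any more rent."))
# {'instructions': 1, 'information': 0}
```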
To summarize, each sentence was mapped to a feature vector consisting of the different feature types described in .
Features used to represent a sentence.