In this study, we examined the effectiveness of combining certain TM techniques with domain expert knowledge for TC purposes in the case of VAERS. To our knowledge, no previous efforts have been reported for TM and medical TC in VAERS or any other SRS, although various methods have been applied to other data sources and have shown potential for AE identification. For example, NLP methods have been applied to hospital discharge summaries17 and other data mining methods to structured EHR data.40–42
Our validated results showed that TM at any level could effectively support TC in VAERS. For example, the rule-based, BT, and w-SVM classifiers appeared to perform well in terms of macro-R, albeit at some cost in MER. A simple calculation over 10 000 reports for two classifiers (eg, w-SVM and NB for low-level patterns) with MERs of 0.10 and 0.04, respectively, would show an actual difference of 600 misclassified (either as potentially positive or as negative) reports between them. The actual cost in terms of extra workload would be those misclassified as potentially positive (based on our data that would be equal to 494 reports), but the actual cost in terms of safety surveillance would be those misclassified as negative (ie, 106 reports). Based on our error analysis, <7% (7 out of 103) of the true positive cases would be falsely classified as negative by our best performing classifier, that is, the rule-based classifier. We believe that this level of misclassification, in the context of the extensive known limitations of SRS, is probably acceptable, although we hope to engage in future efforts to refine our algorithm to reduce it further. This further illustrates that high sensitivity is one of the important properties of a classifier used to identify rare adverse events, because such a classifier returns a smaller number of false negative reports.
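The arithmetic behind this comparison can be retraced with a short sketch. It is only an illustration of the figures quoted above: the function name `misclassified` is hypothetical, and the 494/106 split is taken directly from the text rather than derived here.

```python
def misclassified(n_reports: int, mer: float) -> int:
    """Expected number of misclassified reports for a given MER."""
    return round(n_reports * mer)

N_REPORTS = 10_000

# MERs quoted in the text for the two classifiers being compared
high_mer_errors = misclassified(N_REPORTS, 0.10)
low_mer_errors = misclassified(N_REPORTS, 0.04)
difference = high_mer_errors - low_mer_errors

# Split of that difference as reported in the text (not derived here)
extra_workload = 494    # misclassified as potentially positive
missed_for_safety = 106  # misclassified as negative
assert extra_workload + missed_for_safety == difference

# Error analysis for the rule-based classifier: 7 of 103 true positives missed
false_negative_rate = 7 / 103
sensitivity = 1 - false_negative_rate

print(f"difference in misclassified reports: {difference}")
print(f"false negative rate: {false_negative_rate:.3f}")
print(f"sensitivity: {sensitivity:.3f}")
```

The false negative rate of 7/103 is what the text summarizes as "<7%".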
It could be argued that our approach lacks the automated feature extraction aspect, which has been previously reported as a strategy for TC.43 The issue of automatically extracting features that characterize the AE accurately requires care. The problem we are called to solve is the identification of rare or very rare events from the data at hand. Because features need to be related not only statistically but also causally to the outcome, informative feature selection better serves our purposes. The basis for our claim has been the availability of solid standards (ie, Brighton case definitions) that are being used by physicians in their daily practice. Accordingly, the extraction of three feature representations supported the application of our multi-level approach. Thus, we treated the case reports not only as bag-of-lemmas looking for lemmas7 (the bag-of-words approach, stemmed or unstemmed, is rather limited13) but also as sources of patterns (low- and high-level); we extracted these patterns to examine their role in TC.
Informative feature selection mandated not only the use of Brighton criteria but also MOs' contributions since: (i) certain criteria were not met in the case reports and should be excluded from the feature space, for example, ‘capillary refill time,’ (ii) non-medical words were often used by patients to describe a symptom and should be included, for example, the word ‘funny’ within variations of the phrase ‘my throat felt funny’ to describe ‘itchy throat’ or ‘throat closure,’ and (iii) other words raised a concern for further investigation even though they were not listed in Brighton definitions, for example, ‘epinephrine’ or ‘anaphylaxis.’
Regarding our TM methods, the construction of a controlled dictionary and lexicon is considered laborious, demanding, and costly because it relies on the recruitment of human experts.44 However, the informative development of a flexible and relatively small controlled dictionary/lexicon appeared to be very effective in our study. The same applied to the use of the dedicated semantic tagger. A part-of-speech tagger would assign non-informative tags to a span of text (ie, symptom text in VAERS) that follows no common syntax; it would not support the grammar rules either. The grammar was also built in the same context: to better serve the extraction of the feature representations and facilitate both the rule-based classification and the training of ML classifiers.
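As a rough illustration of how a small controlled lexicon and a semantic tagger of this kind might operate, consider the minimal sketch below. It is not the lexicon, tagger, or grammar used in this study: the patterns, the tag names (`THROAT_SYMPTOM`, `ALERT_TERM`), and the functions `tag_report` and `needs_review` are all hypothetical, and only the example phrases are taken from the text.

```python
import re
from typing import Dict, Set

# Hypothetical mini-lexicon: regular expression -> semantic tag.
# The phrases come from the examples in the text; the tags are invented.
LEXICON: Dict[str, str] = {
    r"\bthroat\b.*\bfunny\b": "THROAT_SYMPTOM",  # lay wording used by patients
    r"\bitchy throat\b": "THROAT_SYMPTOM",
    r"\bthroat closure\b": "THROAT_SYMPTOM",
    r"\bepinephrine\b": "ALERT_TERM",            # not a Brighton criterion
    r"\banaphylaxis\b": "ALERT_TERM",
}

def tag_report(text: str) -> Set[str]:
    """Return the set of semantic tags whose patterns occur in the text."""
    lowered = text.lower()
    return {tag for pattern, tag in LEXICON.items()
            if re.search(pattern, lowered)}

def needs_review(tags: Set[str]) -> bool:
    """Toy rule (an assumption): flag a report if any tag was found."""
    return bool(tags)

if __name__ == "__main__":
    for report in (
        "My throat felt funny after the shot",
        "Patient was given epinephrine in the ER",
        "Sore arm at injection site",
    ):
        tags = tag_report(report)
        print(report, "->", sorted(tags), needs_review(tags))
```

Because the tags are semantic rather than syntactic, a phrase such as "my throat felt funny" is mapped to a symptom category even though it follows no fixed syntax, which is the motivation given above for preferring a dedicated semantic tagger over a part-of-speech tagger.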
Rule-based TC systems have been criticized for the lack of generalizability of their rules, a problem known as the ‘knowledge acquisition bottleneck.’44 However, their value in handling specific conditions should not be ignored, such as in the Obesity NLP Challenge, where the top 10 solutions were rule-based.27 ML methods are not as transparent as the rule-based systems but have been used extensively for TC.44 Our results showed that the rule-based classifier performed slightly better, probably due to the informative feature selection. Either rule-based or ML techniques could be applied to SRS databases and allow better use of human resources by reducing MOs' workload.
It could be argued that ensembles, a cascade of classifiers, or even a modified feature space would handle the classification errors. Nevertheless, the principles of our study and the nature of VAERS would require the consideration of certain aspects prior to the application of such strategies. First, the construction of the feature space was based on the domain expert contribution; short of fully automated feature extraction, any alterations (use of new lemmas or introduction of new rules) based on this feature analysis would benefit from consultation with clinical experts to increase the chance that any such modifications would lead to meaningful results. Second, a classification error will always be introduced by the MOs, who may decide to acquire more information for a ‘suspicious’ report even if it does not meet all the criteria.
The methodology described in this paper and the discussion of the related aspects raise the interesting question of generalizability, that is, the transfer of components to the identification of other AEFIs. The development of a broader medical lexicon and a set of basic rules could be suggested to support the extraction of all symptoms related to the main AEFIs, such as Guillain–Barré syndrome and acute disseminated encephalomyelitis. Based on these key components, other advanced rules representing the specific criteria per AEFI (as stated in the Brighton definitions and described by MOs) could classify each report accordingly.
Our study lies partly in the field of text filtering since it investigated ways to automate the classification of streams of reports submitted in an asynchronous way.45 Generally, MOs' intention is twofold: first, to identify the potentially positive and block the negative reports (step 1); second, to further classify those that proved to be positive into more specific categories (step 2), for example, anaphylaxis case reports into levels of diagnostic certainty. This process is similar to the classification of incoming emails as ‘spam’ or ‘non-spam’ and the subsequent categorization of the ‘non-spam’ emails.46–48 Here, any further categorization would require information gathering through the review of medical records that are provided in portable document format (PDF) only. The inherent difficulties related to the production of these files limit their usability, and the possibility of utilizing this source remains to be investigated.