Creating the Smoke-blind Dataset
To pursue a computational approach to determining smoking status in the absence of explicit evidence, we first needed to create a set of documents in which smoking information was not present (the smoke-blind dataset). To do this, we removed from the training set all 252 discharge summaries that were labeled as Unknown. The remaining 146 records were hand-edited (RW) to remove overt references to smoking.
The Smoke-blind Systems
To classify the smoke-blind data set, we chose to avoid a keyword-based system. Although we could hypothesize a list of keywords that might be present in the discharge summary of a smoker (e.g., hypertension, coronary artery disease, lung cancer), none of these phrases could be used to definitively classify a record. Rather, such phrases provide partial evidence of smoking. In this way, multiple phrases present in the same document can serve as accumulating evidence, increasing the confidence of a classification.
Another reason to avoid the keyword-based method is that some keywords more strongly indicate smoking (lung cancer) than others (hypertension). It is unlikely that a human expert could manually devise a complete list of possible keywords and weight each of these keywords appropriately.
For these two reasons, we wanted a classifier that could learn these phrases and weights from training data. We chose to use an NB classifier, trained on word bigrams found in the smoke-blind data. An NB classifier chooses the label that maximizes the similarity between a record, R, and a class label, Cj, where the similarity is defined as:
$$\mathrm{sim}(R, C_j) = P(C_j)\, P(R \mid C_j)$$
The a priori probability of the class labels, P(Cj), was assumed to be uniform because we did not expect the evaluation data to have the same underlying distribution as the training data. The conditional probability P(R|Cj) was based on a bigram language model using modified Kneser-Ney discounting [4,5].
The classifier was used to build two systems. The first system (NB System 1) was trained on the smoke-blind dataset with labels provided as part of the shared task training set. This training set included 80 smoking and 66 nonsmoking records.
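As a rough illustration of the classifier described above, the following Python sketch scores a record against one bigram language model per class and picks the highest-scoring label. It is not the implementation used in this work: add-k smoothing stands in for modified Kneser-Ney discounting to keep the example short, and the class name `BigramNB`, the whitespace tokenizer, and the constant `k` are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercased whitespace tokens with boundary markers, as adjacent pairs."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramNB:
    """Naive Bayes over per-class bigram language models, uniform class prior."""

    def __init__(self, k=0.1):
        self.k = k  # add-k smoothing constant (a stand-in for Kneser-Ney)
        self.bigram_counts = {}   # label -> Counter of (prev, word)
        self.context_counts = {}  # label -> Counter of prev
        self.vocab = set()

    def fit(self, records, labels):
        for text, label in zip(records, labels):
            bc = self.bigram_counts.setdefault(label, Counter())
            cc = self.context_counts.setdefault(label, Counter())
            for prev, word in bigrams(text):
                bc[(prev, word)] += 1
                cc[prev] += 1
                self.vocab.update((prev, word))
        return self

    def log_likelihood(self, text, label):
        """log P(R | C_j) under the smoothed bigram model for one class."""
        bc, cc = self.bigram_counts[label], self.context_counts[label]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen words
        return sum(
            math.log((bc[(prev, word)] + self.k) / (cc[prev] + self.k * v))
            for prev, word in bigrams(text)
        )

    def predict(self, text):
        # With a uniform prior, P(C_j) is constant, so the argmax
        # depends only on log P(R | C_j).
        return max(self.bigram_counts, key=lambda c: self.log_likelihood(text, c))
```

Because the prior is uniform, it drops out of the comparison entirely; a non-uniform prior would add a `math.log` of the class prior to each score.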
The second system (NB System 2) used an expanded training set by supplementing the smoke-blind dataset with the 41 additional records that were part of the shared task’s official test set. We automatically labeled these additional records using our rule-based classifier, which we expected to be very accurate on records containing explicit smoking information, and then made these additional records smoke-blind using the previously described procedure (RW). The combination of these additional records and the original smoke-blind dataset formed a larger training set for this second system with 104 smoking and 83 nonsmoking records.
NB System 1 and NB System 2 were evaluated using leave-one-out cross-validation, which maximizes the number of records available for training while ensuring that the system is never trained on the individual record that is being classified.
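The leave-one-out procedure can be sketched in a few lines of Python. The function below is a generic illustration, not the evaluation code used in this work; the `train` and `predict` callables are hypothetical placeholders for whatever classifier is being evaluated.

```python
def leave_one_out(records, labels, train, predict):
    """Return one prediction per record, each from a model that never saw it.

    `train(records, labels)` must return a model, and
    `predict(model, record)` must return a label.
    """
    predictions = []
    for i in range(len(records)):
        # Train on every record except the i-th one.
        train_records = records[:i] + records[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = train(train_records, train_labels)
        # Classify only the held-out record.
        predictions.append(predict(model, records[i]))
    return predictions
```

With N records this trains N models, each on N - 1 records, which is affordable for training sets of a few hundred documents.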
These classifiers were trained and evaluated using coarse-grained labels only, folding Past Smoker and Current Smoker into the existing label Smoker. This was necessary because, after removing all evidence of smoking from the patient summaries, it would have been extremely difficult (if not impossible) to recover the temporal information needed to distinguish between a current and a past smoker.
Expert Annotation
Because all explicit smoking cues were removed, it was possible that this smoke-blind dataset would not contain enough information, even for human experts, to confidently predict the label of many records. Therefore, to test the effectiveness of the NB method trained and evaluated on the smoke-blind data, we recruited three human annotators with expert medical knowledge: a statistician experienced in oncology clinical trials (A1), an oncology certified nurse (A2), and an oncology research fellow (A3).
We expected the annotation to be time consuming, so we provided the annotators with only a subset of the 146 smoke-blind summaries: a total of 54 summaries, composed of 34 smokers and 20 nonsmokers.
These three annotators were asked to make educated guesses about smoking status based on their knowledge of health and medicine and their common sense. We provided guidelines worded closely to those used by the task organizers, noting that all direct evidence of tobacco smoking status had been removed and that absence of information about smoking status was not an indication of a nonsmoker.
As was done with the NB task, these annotators were asked to provide only coarse-grained smoking status: Smoker, Nonsmoker, and Unknown, omitting Current Smoker and Past Smoker. It is important to remember that the smoke-blind dataset excluded all summaries labeled as Unknown by the shared task organizers. Therefore, the annotators were not attempting to predict when a record had an Unknown label attached to it; rather, annotators were allowed to provide the label Unknown when they could not determine the smoking status of the patient described in the discharge summary.
We evaluated the performance of each annotator individually, and we obtained a combined answer (Â) by taking a simple plurality of the three annotators’ assessments. We considered Unknown a nonvote, and returned the label Missing when there was no plurality, or when all three annotators chose Unknown (Figure 2).
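The combination rule can be written as a short Python function. This is a sketch of one reading of the rule described above: it assumes that a label chosen by a single annotator wins when the other two annotators answer Unknown, since Unknown counts as a nonvote. The function name `combine` is hypothetical.

```python
from collections import Counter


def combine(votes):
    """Plurality of annotator labels, treating Unknown as a nonvote."""
    counts = Counter(v for v in votes if v != "Unknown")
    if not counts:
        return "Missing"  # every annotator chose Unknown
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "Missing"  # tie between labels, so no plurality
    return ranked[0][0]
```

For example, `combine(["Smoker", "Nonsmoker", "Unknown"])` yields Missing because the two remaining votes are tied, while `combine(["Smoker", "Smoker", "Nonsmoker"])` yields Smoker.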
Analysis
We assessed the performance of the rule-based system, both NB systems, and our human annotators using standard methodology from the fields of natural language processing and medical statistics to calculate recall (sensitivity), precision (positive predictive value), specificity, and F-measure.
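For reference, the four measures can be computed from gold and predicted labels as in the Python sketch below. It makes two assumptions not stated above: the F-measure is taken to be the balanced F1 (the harmonic mean of precision and recall), and any label other than the positive class, including Unknown or Missing, is counted as a negative. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(gold, predicted, positive="Smoker"):
    """Recall, precision, specificity, and balanced F-measure for one class."""
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if g == positive and p == positive:
            tp += 1
        elif g == positive:
            fn += 1  # positive record that was not labeled positive
        elif p == positive:
            fp += 1  # negative record that was labeled positive
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    recall = ratio(tp, tp + fn)        # sensitivity
    precision = ratio(tp, tp + fp)     # positive predictive value
    specificity = ratio(tn, tn + fp)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_measure": f_measure,
    }
```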
We submitted the maximum-permitted three entries to the i2b2 Shared Task [1,6]. The first entry labeled the test dataset of 104 records using the rule-based classifier. The second entry used NB System 1, and the third entry used NB System 2. The performance of these entries is discussed in the Results section.