summarizes our results (see also Table 3 and Table 4 available as a JAMIA on-line data supplement at
www.jamia.org). The best F-score from our informal evaluation is 92.64; the F-score baseline for this set is 63.31. We limited ourselves to three runs on Set 2 to avoid overfitting to the test data. The three runs differed slightly in the features used to build the Layer 3 model for classifying past smoker, current smoker, and smoker instances, and in the negation-detection features used in the Layer 2 model for discovering nonsmoker instances. Nevertheless, the three runs produced similar F-scores.
| Table 2. Best results. Informal evaluation setup: training on Set 1; testing on Set 2. Formal evaluation setup: training on Set 1 and Set 2; testing on Set 3. (Numbers in brackets are 95% exact confidence intervals.) |
For the formal evaluation and the final i2b2 submission, our models were built from Set 1 and Set 2 and were run on the official i2b2 test set (Set 3). We submitted three sets of results, produced by models that differed slightly as described in the preceding paragraph. Our best F-score for this final formal evaluation is 85.57 (the most-frequent-category baseline is 60.58). If we remove the unknown category and consider two categories (current smoker and noncurrent, where noncurrent includes smoker, past smoker, and nonsmoker), precision, recall, and F-score are 53.33, 72.72, and 61.53, respectively, for current, and 88.46, 76.66, and 82.14, respectively, for noncurrent.
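The two-category figures above can be checked against the usual definition of the F-score. The following is a minimal sketch, assuming the balanced F-measure (the harmonic mean of precision and recall, with equal weight on each); the function name `f_score` is ours for illustration and is not part of the system described here.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall).

    Inputs and output are on the same scale (here, percentages).
    Assumes equal weighting of precision and recall.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text for the two-category setting.
current = f_score(53.33, 72.72)
noncurrent = f_score(88.46, 76.66)

print(f"current: {current:.2f}")
print(f"noncurrent: {noncurrent:.2f}")
```

Under this assumption, the computed values agree with the reported F-scores for both categories to two decimal places.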
Our error analysis uncovered several areas for improvement. Currently, our negation detection does not account for lexical items that indicate nonsmoker status without an explicit negation, e.g., nonsmoker. In addition, phrases such as “nor does she smoke” are not flagged as negated.
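To illustrate the two error types just described, the following is a minimal rule-based sketch of how such cues might be caught. It is not the negation component of our system: the patterns, the names `LEXICAL_NONSMOKER` and `NOR_NEGATION`, and the function `has_nonsmoker_cue` are hypothetical and cover only the examples mentioned in this paragraph.

```python
import re

# Hypothetical pattern for lexical items that signal nonsmoker status
# without an explicit negation word, e.g., "nonsmoker", "non-smoker".
LEXICAL_NONSMOKER = re.compile(r"\bnon-?\s?smok(?:er|ing)\b", re.IGNORECASE)

# Hypothetical pattern for "nor"-style negation, e.g., "nor does she smoke".
NOR_NEGATION = re.compile(r"\bnor\s+(?:does|did)\s+(?:he|she)\s+smoke\b", re.IGNORECASE)


def has_nonsmoker_cue(sentence: str) -> bool:
    """Return True if the sentence matches either illustrative cue pattern."""
    return bool(LEXICAL_NONSMOKER.search(sentence) or NOR_NEGATION.search(sentence))


for text in (
    "The patient is a nonsmoker.",
    "She does not drink, nor does she smoke.",
    "He smokes one pack per day.",
):
    print(has_nonsmoker_cue(text), "-", text)
```

In practice, the output of such rules would more plausibly be supplied to the classifier as additional features than used as a final decision.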
Our temporal resolution component does not include an explicit one-year rule for distinguishing between past smoker and current smoker, but relies on the features and labeled data to learn the differences. The most challenging category for our system to classify is past smoker. Our system’s upper bound for this category is 78% when training and testing are performed on the same data. Potential enhancements are the inclusion of metadata, e.g., section headings, as features, and experimentation with higher-order SVMs, especially for temporal resolution.
We also noticed interesting cases, such as a report that contained the sentence “He does drink alcohol three drinks per day, denies any current tobacco use.” The final classification provided by the challenge organizers is unknown, even though this sentence alone would tempt one to assign the nonsmoker label. When human experts assign smoking status, they may draw on information from the entire report and make inferences from the facts, medical and otherwise, present in the entire record, which requires processing beyond sentence classification.