UNKNOWN was the most frequent label in the data set. Of the 502 i2b2 training and test documents, 315 (63%) were labeled UNKNOWN by i2b2 annotators. In the Nuance data set, 3,452 out of 4,292 documents (80%) were marked UNKNOWN after the engine found no smoking mentions in them.
The accuracy of filtering UNKNOWNs using the extraction engine was 100% on i2b2 test data. We cannot report the accuracy of engine filtering on the other data sets, because we did not examine the filtered documents to see whether they actually contained smoking mentions. However, the effectiveness of the filtering on i2b2 test data, together with additional spot checks, indicated that this was probably a minor source of error in our system. Note that documents that were not filtered by the engine could still subsequently be classified as UNKNOWN based on the nature of their associated features.
The extraction engine filtering of UNKNOWNs contributed significantly to the accuracy of document classification. The high degree of accuracy for this category raised our overall classification accuracy.
Effect of Data Set Size on Document Classification Accuracy
We restricted our investigation of the effect of data set size on document classification accuracy to the classification step applied to the set of documents with almost all UNKNOWNs filtered out, as described above. (Note that our accuracy for this step is significantly lower than the overall accuracy we report for the full sets, which include all documents.) For each data set (i2b2, Nuance, and Combo), we created a sequence of subsets of increasing size by random sampling. Each experiment was repeated 10 times, and our results are averaged over these trials. In each trial, we used 10-fold cross-validation to estimate classification accuracy.
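The procedure above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the unstratified random sampling, the way folds are built, and the `majority_baseline` placeholder classifier are all assumptions introduced here, since the text does not specify them.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit_predict, k, rng):
    """Mean held-out accuracy over k folds."""
    if len(X) < k:
        raise ValueError("need at least k documents for k-fold cross-validation")
    scores = []
    for held_out in k_fold_indices(len(X), k, rng):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        preds = fit_predict(
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in held_out],
        )
        correct = sum(p == y[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return mean(scores)


def learning_curve(X, y, sizes, fit_predict, n_trials=10, k=10, seed=0):
    """For each subset size, average k-fold accuracy over n_trials random subsets."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        trial_scores = []
        for _ in range(n_trials):
            # Draw a random subset of the requested size (without replacement).
            sample = rng.sample(range(len(X)), size)
            Xs = [X[i] for i in sample]
            ys = [y[i] for i in sample]
            trial_scores.append(cross_val_accuracy(Xs, ys, fit_predict, k, rng))
        curve[size] = mean(trial_scores)
    return curve


def majority_baseline(train_X, train_y, test_X):
    """Hypothetical placeholder classifier: predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_X)
```

A call such as `learning_curve(docs, labels, [50, 100, 200], majority_baseline)` returns a dictionary mapping each subset size to its averaged accuracy; any real classifier with the same `fit_predict(train_X, train_y, test_X)` shape could be substituted for the placeholder.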
For both the direct and mediated approaches, accuracy increases sharply as subset size grows from 50 to 200 documents (see the two figures below). Above 200 documents, accuracy increases more slowly. We appear to be approaching the point where additional data would be of little help.
Effect of data set size on classification accuracy—direct approach
Effect of data set size on classification accuracy—mediated approach
We conducted the same experiment without using our extraction engine to mark mentions and assemble their features. All (filtered) documents were used in their entirety (not just the sentences containing mentions, as in the direct and mediated approaches), and the only features collected were the words and bigrams used in the documents.
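As an illustration of this kind of feature set, the sketch below collects the words and bigrams present in a document. The tokenization rule, the lowercasing, and the choice of binary presence rather than counts are assumptions made for the example; the text does not say how these were handled.

```python
import re


def word_and_bigram_features(text):
    """Return the set of words and adjacent-word bigrams present in text."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)
```

For a sentence such as "Patient denies smoking.", this yields the three words plus the bigrams "patient denies" and "denies smoking". Applied to whole documents rather than to mention sentences, most of the resulting features are unrelated to smoking.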
The accuracy for all subsets is significantly lower (see the no-engine-features figure below). More interestingly, there is no sharp increase between subsets of 50 and 200 documents, and the accuracy keeps rising steadily to the end of the scale. We expect that we would still observe a considerable increase if we could train our models on significantly larger amounts of data. It is not clear, though, whether the accuracy obtained by this approach could ever match the accuracy obtained using our extraction engine-generated features, and if so, how large the data set would need to be.
Effect of data set size on classification accuracy: no engine features (note the different scale from the direct and mediated figures above)
Using informative features provided by our medical fact extraction engine allows our system to learn complicated concepts with much smaller data sets. This point is made even more clearly by a direct comparison of models trained with various feature subsets (see the feature set comparison below).
Effect of feature set on classification accuracy—Nuance data set, direct approach
Classification using entire documents with only word-based features (“all doc word features”) performs worst of all. We observe a significant improvement when we base the classification on sentences containing mentions only (all other curves). Among these, using engine-generated linguistic features (“engine features”) from these sentences produces better results than using only word-based features (“word features”), and the combination of engine-generated features and word-based features (“both”) performs the best. We report our results on the Nuance data set; the results on other sets were similar.
Finally, in most of our experiments, the more homogeneous data sets (Nuance, i2b2) have higher accuracy than the joint data set (the experiment with no engine-generated features, reported in the no-engine-features figure above, is an exception). We suspect that homogeneity is beneficial, although its effect is small.
Overall Accuracy Including Filtered UNKNOWNs
The experiments presented in the previous section concern classification accuracy for the subset of documents remaining after a filtering step. For completeness, we report our overall accuracy here. The only data set for which we can report it with full confidence is the i2b2 data set, where all the documents were inspected manually. For the Nuance and Combo data sets, we rely on the accuracy of the filtering step, which was tested only on a small subset of the documents.
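The way the two steps combine into one overall figure can be sketched as a weighted average. The function below is an illustrative assumption about the arithmetic, not a description of the authors' code; `filter_accuracy` defaults to 1.0 to mirror the assumption that every filtered document is a correctly labeled UNKNOWN.

```python
def overall_accuracy(n_filtered, n_classified, classifier_accuracy,
                     filter_accuracy=1.0):
    """Combine filtering and classification accuracy into one overall score.

    n_filtered: documents labeled UNKNOWN by the filtering step
    n_classified: documents passed on to the classifier
    classifier_accuracy: accuracy on the classified documents only
    filter_accuracy: fraction of filtered documents that are truly UNKNOWN
    """
    total = n_filtered + n_classified
    if total == 0:
        raise ValueError("no documents")
    correct = n_filtered * filter_accuracy + n_classified * classifier_accuracy
    return correct / total


# Nuance counts from the text: 3,452 filtered out of 4,292 documents.
# The classifier accuracy of 0.80 is a made-up value for illustration only.
example = overall_accuracy(3452, 4292 - 3452, 0.80)
```

Because the filtered documents make up most of each data set, even a large difference in classifier accuracy on the remaining documents translates into a much smaller difference in the overall score.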
As can be seen in the overall accuracy figure below, the mediated and direct approaches have very similar overall accuracy scores (with direct slightly better). The approach using entire documents and no extraction engine-generated features (“no engine features”) is clearly inferior, although the difference here is mitigated by the effect of adding perfectly classified filtered documents. The last category, “no filtering,” is the result of an attempt to classify all documents, including those that were filtered for other approaches. In this experiment, the extraction engine was not used for any part of the process, and the documents were used in their entirety. Features were defined as words and bigrams present in documents.
Overall accuracy (including filtered documents) assuming 100% accuracy of the filtering step
This last experiment clearly shows the benefit of restricting one’s attention to relevant documents and their relevant fragments. Classification of entire documents requires much more data; classification when a large proportion of the data includes no relevant features (no smoking-related comments) requires more data still. This is particularly evident with the i2b2 data set, which, due to its small size, improves the most with the addition of preprocessing by the extraction engine.