In addition to the officially submitted and scored runs, we conducted a series of experiments to determine which features of our system contribute to improved performance. Table 2 shows the results of these experiments for the 5-way classification problem, and Table 3 shows the performance of our submissions on the 3-way task.
Table 2 Micro- and Macro-averaged F1 Results for 5-way Classification by Tested Systems on Test Collection
Table 3 Micro- and Macro-averaged F1 Results for 3-way Classification by Tested Systems on Test Collection
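The micro- and macro-averaged F1 measures in these tables differ in how per-class errors are pooled: micro-averaging counts every document equally, so frequent classes dominate, while macro-averaging counts every class equally. A minimal sketch using scikit-learn's f1_score, with invented labels rather than data from the test collection:

```python
from sklearn.metrics import f1_score

# Invented gold and predicted labels for a 5-way task; not data from
# the actual test collection.
gold = ["CURRENT", "PAST", "UNKNOWN", "UNKNOWN", "NON-SMOKER"]
pred = ["CURRENT", "UNKNOWN", "UNKNOWN", "UNKNOWN", "NON-SMOKER"]

# Micro-averaging pools true/false positives over all documents, so
# frequent classes such as UNKNOWN dominate the score; macro-averaging
# computes F1 per class and takes the unweighted mean, so the rare
# PAST class counts just as much.
print(f1_score(gold, pred, average="micro"))  # 0.8
print(f1_score(gold, pred, average="macro"))  # 0.7
```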
The systems labeled Run1-3 were our officially submitted runs; the systems labeled A-I are additional system configurations scored using methods equivalent to those used by the task organizers. The system labeled “i2b2 Best” was the best-performing run submitted to the task organizers. Configurations A-I systematically vary properties of our classification approach, making it possible to determine the contribution of each property to performance.
As seen in Table 2, Run2 was our best submitted run, using Lucene tokenization, zero-vector filtering, SVM weighting, and ECOC, but not the post-processing rules. This run scored second among all runs submitted to the challenge. Although we did not observe this effect in our cross-validation experiments before submitting our official runs, on the test data stop word filtering degrades performance. Comparing system A to Run2, the only differences are the tokenization and the lack of stop word filtering in system A. System A has a micro-F1 of 0.9000, 0.0140 better than Run2 and actually higher than the best-performing system submitted to the i2b2 challenge task.
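The effect of stop word filtering can be illustrated with a small sketch; scikit-learn's CountVectorizer stands in here for the Lucene analysis chain our system actually used, and the documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the patient has a history of tobacco use",
        "patient quit smoking two years ago"]

# One vectorizer keeps all (two-character or longer) tokens; the other
# drops English stop words, removing e.g. "the", "has", and "of".
keep_all = CountVectorizer()
filtered = CountVectorizer(stop_words="english")

print(sorted(keep_all.fit(docs).vocabulary_))
print(sorted(filtered.fit(docs).vocabulary_))
```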
By far the largest effect was due to the hot-spotting technique. This can be seen by comparing the results of system A with system I, hot-spotting being the only difference between these approaches. The micro-F1 difference is large, 0.2960, with the hot-spotting system achieving a micro-F1 of 0.9000 and the non-hot-spotting system only 0.6040. Simply removing hot-spotting transforms the best-performing system presented here into the worst.
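As a rough sketch of the hot-spotting technique (the trigger terms, window size, and function below are illustrative assumptions, not the exact configuration of our system):

```python
import re

# Illustrative trigger terms only; the keyword list used by our actual
# system is not reproduced here.
TRIGGERS = re.compile(r"smok|tobacco|cigar|nicotine", re.IGNORECASE)

def hot_spots(text, window=5):
    """Return snippets of +/- `window` tokens around each trigger hit."""
    tokens = text.split()
    spots = []
    for i, tok in enumerate(tokens):
        if TRIGGERS.search(tok):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            spots.append(" ".join(tokens[lo:hi]))
    return spots

doc = "The patient quit smoking ten years ago and denies current tobacco use."
print(hot_spots(doc))
# Only these snippets are passed to feature extraction; a document with
# no hits yields an all-zero feature vector (see zero-vector filtering).
```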
Table 2 includes several other informative comparisons. The difference between Run2 and E is the use of zero-vector filtering; Run2 outperforms E by 0.0200. SVM weighting is the difference between Run2 and B, and Run2 outperforms B by 0.0140. The ECOC technique is compared to one-against-all-others in Run1 versus system G; here Run1 outperforms G by 0.0200. For the 5-way evaluations, the use of the post-processing rules was counterproductive: this technique is the only difference between Run1 and Run2, and Run2 does slightly better without it.
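These comparisons involve three implementation choices: zero-vector filtering, per-class SVM weighting, and the multiclass decomposition (ECOC versus one-against-all-others). A minimal sketch of all three, assuming scikit-learn; LinearSVC, class_weight="balanced", code_size=4, and the toy data are stand-ins rather than the settings of our actual system:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OutputCodeClassifier

rng = np.random.RandomState(0)
X = rng.rand(40, 10)             # toy hot-spot feature vectors
X[:4] = 0.0                      # pretend four documents had no hot spots
y = rng.randint(0, 5, size=40)   # five smoking-status classes

# Zero-vector filtering (the Run2 vs. E comparison): drop training
# documents whose feature vectors are entirely zero.
mask = X.any(axis=1)
X_train, y_train = X[mask], y[mask]

# class_weight="balanced" stands in for SVM weighting (Run2 vs. B);
# the per-class weights our system used may have differed.
base = LinearSVC(class_weight="balanced")

# One-against-all-others decomposition (system G) versus an
# error-correcting output code decomposition (Run1); code_size sets the
# number of binary problems per class and is a guess here.
ovr = OneVsRestClassifier(base).fit(X_train, y_train)
ecoc = OutputCodeClassifier(base, code_size=4, random_state=0).fit(X_train, y_train)
print(ovr.predict(X[:3]), ecoc.predict(X[:3]))
```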
Finally, system H uses the inferior choice for each of the three alternative techniques (zero-vector filtering, weighting, ECOC), using hot-spotting but no post-processing rules (as with our best system), and achieves a micro-F1 of 0.8360, which is 0.0640 less than the best-performing system. Interestingly, this is very close to the sum of the individual contributions of each of the left-out features, including the stop word filtering difference between system A and Run2: 0.0140 + 0.0200 + 0.0140 + 0.0200 = 0.0680.