Biomedical text mining has become a thriving field because it has proved effective in a wide range of application areas, such as identifying biological entities in text [1], assigning insurance codes to clinical records [2], and facilitating queries in biomedical databases [3]. For a survey, see Cohen and Hersh, 2005 [4].
Discharge summaries offer a rich source of data for information extraction (IE) tasks, including classification. Several open challenges have been announced in this field, among them the automated assignment of insurance codes to radiology reports [5] and the identification of smoking status [6].
Processing textual medical records such as discharge summaries facilitates medical studies by providing statistically relevant data for analysis. One example is the analysis of a particular disease and its comorbidities across sets of patients. Findings drawn from the connections observed among a set of diseases are of key importance for treatment and prevention.
In this paper we present results on the i2b2 Obesity Challenge shared task, a multiclass, multilabel classification task focused on obesity and its 15 most common comorbidities (termed diseases). For each document, the task was to assign to each disease one of the following semantic labels: present, absent, questionable, or unmentioned (for a full description, see Uzuner [7]).
The Obesity Challenge thus poses an atypical, two-dimensional classification problem, with one dimension for the disease and one for the semantic label.
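The two-dimensional label space described above can be illustrated with a minimal Python sketch. It is only an illustration of the task structure, not the challenge's data format or our system: the names `Label`, `DISEASES`, and `annotate` are hypothetical, the comorbidity names are placeholders (they are not listed here), and defaulting every disease to unmentioned is an assumption made for the example.

```python
from enum import Enum


class Label(Enum):
    """The four semantic labels named in the task description."""
    PRESENT = "present"
    ABSENT = "absent"
    QUESTIONABLE = "questionable"
    UNMENTIONED = "unmentioned"


# Placeholder names (assumption): obesity plus its 15 comorbidities,
# which are not enumerated in this section.
DISEASES = ["obesity"] + [f"comorbidity_{i:02d}" for i in range(1, 16)]


def annotate(assignments):
    """Return one label per disease for a single document.

    `assignments` maps a disease name to a label string. Diseases that
    are not assigned default to UNMENTIONED (an assumption of this sketch).
    """
    result = {disease: Label.UNMENTIONED for disease in DISEASES}
    for disease, label in assignments.items():
        if disease not in result:
            raise KeyError(f"unknown disease: {disease}")
        result[disease] = Label(label)  # raises ValueError for an unknown label
    return result


# Usage: one document, two diseases explicitly labelled.
document_labels = annotate({"obesity": "present", "comorbidity_03": "questionable"})
```

Each document therefore carries 16 independent decisions (the multilabel dimension), and each decision is a choice among four classes (the multiclass dimension).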
Rule-based systems dominate the top 10 solutions; interestingly, no machine-learning-based approach is among them (see the survey by Uzuner [7] and the online-only version available at www.jamia.org).
Rule-based text classifiers (also known as expert systems) were widespread (see, e.g., Hayes et al [8]) before the steady growth of computational capacity made machine learning approaches more popular. The rule-based approach is often criticized for the knowledge acquisition bottleneck (Sebastiani [9]): each rule must be created manually, and the portability and flexibility of such systems are often very limited. These concerns are valid for categorization problems in which the rules are domain-dependent and the semantics of the categories may shift. However, if the expert knowledge is available in knowledge bases (such as ontologies, which are typical of the biomedical domain) and the rules can be generated automatically, the manual overhead can be reduced to error analysis. Consequently, expert and rule-based systems are often applied to a variety of problems in the medical domain (see, e.g., Zeng et al [10] and Chi et al [11]).
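The idea of generating rules automatically from a knowledge base can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the classifier presented in this paper: the `KNOWLEDGE_BASE` dictionary stands in for a real ontology, its entries are illustrative only, and the helper names `build_rules` and `matched_diseases` are hypothetical. The sketch only detects that a term is mentioned; it does not handle negation or other context.

```python
import re

# Hypothetical stand-in for an ontology: disease -> known surface forms.
# The entries are illustrative assumptions, not taken from a real resource.
KNOWLEDGE_BASE = {
    "obesity": ["obesity", "obese"],
    "hypertension": ["hypertension", "high blood pressure"],
}


def build_rules(knowledge_base):
    """Compile one case-insensitive, whole-word pattern per disease."""
    rules = {}
    for disease, terms in knowledge_base.items():
        # Longer terms first, so multiword forms are tried before shorter ones.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        rules[disease] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return rules


def matched_diseases(text, rules):
    """Return the sorted names of diseases whose rule fires on the text."""
    return sorted(name for name, pattern in rules.items() if pattern.search(text))


rules = build_rules(KNOWLEDGE_BASE)
hits = matched_diseases("Patient is obese with high blood pressure.", rules)
```

Because the rules are derived from the dictionary rather than written by hand, extending coverage means editing the knowledge base, and the remaining manual work is inspecting the cases where a rule fires wrongly or fails to fire.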
In the medical field, there is a growing need for interactive systems; however, the challenge did not address this aspect. Health experts usually do not trust a system that acts like a black box; instead, they want to verify the evidence that supports each decision. Our system is transparent to humans, whereas a system that relies on sophisticated and hence hard-to-understand machine learning techniques may require additional effort to achieve this goal.
Next we describe our context-aware rule-based classifier, present its performance on the i2b2 Obesity Challenge, and briefly discuss the results and the lessons learnt from our study. For the community, we provide an online appendix to this paper (available as an online data supplement at www.jamia.org) and an online demo (available at categorizer.tmit.bme.hu/~illes/i2b2/obesity_demo).