In this study the authors describe the system submitted by the team of University of Szeged to the second i2b2 Challenge in Natural Language Processing for Clinical Data. The challenge focused on the development of automatic systems that analyzed clinical discharge summary texts and addressed the following question: “Who's obese and what co-morbidities do they (definitely/most likely) have?”. Target diseases included obesity and its 15 most frequent comorbidities exhibited by patients, while the target labels corresponded to expert judgments based on textual evidence and intuition (separately).
The authors applied statistical methods to preselect the most common and confident terms and evaluated outlier documents by hand to discover infrequent spelling variants. The authors expected a system with dictionaries gathered semi-automatically to have a good performance with moderate development costs (the authors examined just a small proportion of the records manually).
Following the standard evaluation method of the second Workshop on challenges in Natural Language Processing for Clinical Data, the authors used both macro- and microaveraged Fβ=1 measure for evaluation.
The authors' submission achieved a microaverage Fβ=1 score of 97.29% for classification based on textual evidence (macroaverage Fβ=1 = 76.22%) and 96.42% for intuitive judgments (macroaverage Fβ=1 = 67.27%).
The results demonstrate the feasibility of the authors' approach and show that even very simple systems with a shallow linguistic analysis can achieve remarkable accuracy scores for classifying clinical records on a limited set of concepts.
Medical institutes usually store a considerable amount of valuable information (patient data) as free text. Such information has great potential to aid research related to diseases or improving the quality of medical care. The size of document repositories makes automated processing in a cost-efficient and timely manner an increasingly important issue. The intelligent processing of clinical texts is the main goal of Natural Language Processing 1 for medical texts.
In this work, we introduce our system for identifying morbidities in the flow-text parts of clinical discharge summaries. The system was designed and implemented for the Obesity Challenge organized by the Informatics for Integrating Biology and the Bedside (i2b2), National Center for Biomedical Computing in Spring 2008. The full paper with more detailed description is published as the online supplement of this study, and is available at http://www.jamia.org.
The importance of applying Natural Language Processing (NLP) techniques to facilitate processes in clinical care and clinical research that require the analysis of textual data is clearly evidenced by the increasing number of publications and case studies related to the topic.
There have been several shared tasks that involved multilabel classification of clinical documents. The smoker challenge organized by i2b2 in 2006 2 targeted the identification of the patient's smoker status. The clinical coding challenge 3 organized by the Computational Medicine Center of Cincinnati Children's Hospital in 2007 focused on the assignment of ICD codes to radiology reports to enable automated billing.
The target diseases of the Obesity Challenge included obesity and its 15 most frequent comorbidities exhibited by patients, while the target labels corresponded to expert judgments based on textual evidence and intuition. The development of systems that can successfully replicate the decisions made by obesity experts would be desirable to facilitate large-scale research on obesity, one of the leading preventable causes of death. 4–6
Although several results are reported in peer-reviewed literature on medical text classification, 8–10 the most obvious references to work related to this study are the systems submitted to the same challenge by other participants.
The two main approaches of participants were the construction of rule-based dictionary lookup systems and statistical classifiers based on the Bag-of-Words (or bi- and trigram) representation of documents.
The dictionaries of rule-based systems mostly consisted of the names of the diseases and their spelling variants, abbreviations, etc. One team also used other related clinical named entities. 11 The dictionaries used were constructed mainly manually (either by domain experts 12 or computer scientists 13 ), but one team applied a fully automatic approach to construct their lexicons. 14
Machine learning methods applied by participating systems ranged from Maximum Entropy Classifiers 15 and Support Vector Machines 11 to Bayesian classifiers (Naïve Bayes 16 and Bayesian Network 17 ). These systems showed competitive performance on the frequent classes but had major difficulties in predicting the less represented negative and uncertain information in the texts.
Based on our previous experiences in similar tasks, 18,19 we observed that the classic word n-gram (uni-, bi-, or trigram) representation is not well suited to specific medical text classification problems like the obesity challenge, regardless of the learning method applied. This is mainly because the target information is contained in just a few sentences (possibly scattered over the text), and most of the text is irrelevant to the problem.
In this sense the obesity challenge is more like an Information Extraction task, which gathers the relevant information from scattered sentences of the document, then makes the document-level decision based on the extracted information.
These aspects motivated us to develop a rule-based system for the challenge that exploits lists of keywords that trigger important sentences (that is, the names and various spellings of the actual disease) and to implement a simple context analyzer that enables the correct prediction of negative and uncertain information in text. We applied statistical methods to complement, assist, and speed up manual work wherever possible.
The system can be tested online at http://www.inf.u-szeged.hu/rgai/obesity. The most important resources of the system can be downloaded from the same site and are free for reuse if properly acknowledged.
Our approach focused on the rapid development of dictionary-lookup-based systems, which also took into account the document structure and the context of disease terms for classification.
We expected a system with dictionaries gathered semi-automatically to show a good performance with moderate development costs (we examined just a small proportion of the patient records manually).
For the challenge we applied a dictionary-lookup-based system. That is, we collected a dictionary of terms and abbreviations for each disease separately, processed each document, and collected occurrences of dictionary terms from the text. Sentences containing disease terms were then further evaluated to decide the appropriate class label for the corresponding disease. Further evaluations included a judgment of relevance (information on the patient instead of family members, etc) and an analysis of context to detect negation and uncertainty.
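The lookup step described above can be illustrated with a minimal sketch. This is not the code used in the challenge: the toy dictionary, the naive sentence splitter, and the function names `split_sentences` and `find_mentions` are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary; the real term lists are not reproduced here.
DISEASE_TERMS = {
    "hypertension": ["hypertension", "htn"],
    "diabetes": ["diabetes mellitus", "diabetes", "dm"],
}


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(text, dictionary=DISEASE_TERMS):
    """Return {disease: [sentences containing at least one of its terms]}."""
    mentions = {}
    for sentence in split_sentences(text):
        for disease, terms in dictionary.items():
            for term in terms:
                pattern = r"\b" + re.escape(term) + r"\b"
                if re.search(pattern, sentence, re.IGNORECASE):
                    mentions.setdefault(disease, []).append(sentence)
                    break  # one hit per disease per sentence is enough
    return mentions
```

The collected sentences would then be passed on to the relevance and context checks, which decide the class label for each disease.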
After locating and evaluating all the relevant pieces of information in the document, the main decision function of our system was based on the following rules (the rules were executed in order, and once a rule was matched, the system assigned the relevant classification):
Classify a document as:
Our intuitive model was based on the textual model. That is, we attempted to discriminate the documents classified as unmentioned by our textual classifier to intuitive yes or no classes. When the textual system assigned a label that was different from unmentioned, we accepted that decision as an intuitive judgment as well. Although somewhat simplistic, this assumption turned out to be quite reasonable.
To classify textual unmentioned documents, we collected phrases and numeric expressions that indicated an intuitive yes label: names of associated drugs and medications, phrases related to certain social habits of the patients (e.g., cigarette for hypertension), tension values, weight, etc. While the phrases were collected using a semi-automated procedure similar to the one used to set up the disease term dictionaries, the numeric expressions describing relevant biomarkers were constructed by hand. Since these terms usually contained implicit information on the corresponding disease, it made no sense to evaluate their context for uncertainty. That is, the lists gathered specifically for the intuitive task were not used to predict intuitive questionable labels.
After locating and evaluating all relevant pieces of information in the document, the main decision function of our system was based on the following rules:
The terms included in the dictionaries were gathered semi-automatically: we filtered them according to their frequency (infrequent terms were discarded to reduce the number of term candidates and avoid overfitting on the data) and then ranked each term according to its conditional probability of the positive class (p[yes|word]). We evaluated the top-ranked terms and manually added the meaningful ones to the corresponding disease-name dictionary. This way a 95% complete dictionary could be gathered quite rapidly: only the few dozen most frequent and reliable keywords had to be evaluated manually for every disease.
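The preselection step can be sketched as follows. This is a minimal illustration under stated assumptions: frequencies are counted per document, p[yes|word] is estimated as the fraction of documents containing the word that carry a yes label, and the function name `rank_terms` and the `min_count` cutoff are hypothetical.

```python
from collections import Counter


def rank_terms(documents, labels, min_count=2):
    """Rank candidate terms by an estimate of p(yes | word).

    documents: list of token lists, one per document.
    labels: one label per document; only 'yes' counts as positive.
    Returns (term, p_yes, document_frequency) tuples, best candidates first.
    """
    doc_freq = Counter()
    yes_freq = Counter()
    for tokens, label in zip(documents, labels):
        for token in set(tokens):  # count each term once per document
            doc_freq[token] += 1
            if label == "yes":
                yes_freq[token] += 1
    scored = [
        (token, yes_freq[token] / count, count)
        for token, count in doc_freq.items()
        if count >= min_count  # discard infrequent candidates
    ]
    # Highest probability first, then most frequent, then alphabetical.
    return sorted(scored, key=lambda item: (-item[1], -item[2], item[0]))
```

The top of the returned list is what a human reviewer would inspect; the decision to add a term to a dictionary remains manual.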
Next, we collected pseudo terms (i.e., longer phrases containing a previously added disease term that are irrelevant to the disease) using a similar semi-automated procedure. This step was performed to avoid the overfitting of the dictionary lookup system (e.g., “depression”, but not “st. depression” or “hypertension” but not “pulmonary hypertension”).
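One simple way to honor pseudo terms is to mask them before the lookup. The sketch below assumes this masking strategy and a hypothetical helper name `has_true_mention`; the source does not say how the exclusion was implemented.

```python
import re


def has_true_mention(sentence, term, pseudo_terms):
    """True if `term` occurs in `sentence` outside every pseudo term."""
    lowered = sentence.lower()
    # Mask longer pseudo terms first so that overlapping phrases are handled.
    for pseudo in sorted(pseudo_terms, key=len, reverse=True):
        lowered = lowered.replace(pseudo.lower(), " ")
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, lowered) is not None
```

With the pseudo term "pulmonary hypertension", a sentence that mentions only pulmonary hypertension no longer triggers the hypertension dictionary, while a sentence mentioning both still does.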
The disease name dictionaries we collected were then extended with a few spelling variants manually, to handle different spellings of the same term.
We also made use of an unmentioned dictionary that triggered the exclusion of the text from further processing. This way we excluded sections under headings like “FAMILY HISTORY:” and also phrases like “son with …”, “family history of …” from further processing. To define the scope of irrelevant phrases, we used the same context-identifier as that for negation and uncertainty detection (see below).
The system with the above-described components was able to tag documents with yes labels or leave them as unmentioned. We then extracted the sentences containing disease names from documents that the system tagged as yes but that were annotated as questionable or no, and these sentences served as the basis for implementing a simple negation and uncertainty detection module. This module exploited a list of negation/uncertainty cues and a list of delimiters (which triggered the end of scope). This approach is similar to NegEx. 20 Our biomedical text corpus annotated for negation and uncertainty 21 also demonstrates that this simple scope resolution approach works well for clinical texts.
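The cue-and-delimiter idea can be sketched as below. The cue and delimiter lists are hypothetical examples rather than the lists used in the system, and the sketch assumes that a cue only governs the text that follows it, up to the first delimiter or the end of the sentence.

```python
import re

# Hypothetical cue and delimiter lists, for illustration only.
NEGATION_CUES = ["no", "denies", "without", "negative for"]
UNCERTAINTY_CUES = ["possible", "probable", "rule out", "suspected"]
DELIMITERS = ["but", "however", ",", ";"]


def _scopes(sentence, cues):
    """Yield (start, end) character spans governed by any of the cues."""
    lowered = sentence.lower()
    delimiter_pattern = "|".join(
        r"\b" + re.escape(d) + r"\b" if d.isalpha() else re.escape(d)
        for d in DELIMITERS
    )
    for cue in cues:
        for match in re.finditer(r"\b" + re.escape(cue) + r"\b", lowered):
            rest = lowered[match.end():]
            stop = re.search(delimiter_pattern, rest)
            end = match.end() + (stop.start() if stop else len(rest))
            yield match.end(), end


def assertion_status(sentence, term):
    """Return 'no', 'questionable', 'yes', or 'unmentioned' for `term`."""
    hit = re.search(r"\b" + re.escape(term.lower()) + r"\b", sentence.lower())
    if hit is None:
        return "unmentioned"
    for label, cues in (("no", NEGATION_CUES), ("questionable", UNCERTAINTY_CUES)):
        if any(start <= hit.start() < end for start, end in _scopes(sentence, cues)):
            return label
    return "yes"
```

In "No hypertension, but diabetes is present." the comma closes the scope of "no", so hypertension is negated while diabetes is asserted.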
We extended the system with intuitive dictionaries that triggered intuitive yes labels. These dictionaries were used to classify a document as an intuitive yes when it was judged to be unmentioned by the textual classifier system.
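The fallback from the textual to the intuitive decision can be sketched as follows. The function name `intuitive_label` and the example term list are hypothetical; only "cigarette" for hypertension is taken from the text, and plain substring matching is a simplifying assumption.

```python
def intuitive_label(textual_label, text, intuitive_terms):
    """Derive an intuitive label from the textual one.

    Any textual label other than 'unmentioned' is accepted unchanged.
    Unmentioned documents become 'yes' if an intuitive trigger term
    occurs in the text, and 'no' otherwise. 'questionable' is never
    produced by the trigger terms.
    """
    if textual_label != "unmentioned":
        return textual_label
    lowered = text.lower()
    if any(term.lower() in lowered for term in intuitive_terms):
        return "yes"
    return "no"
```

For example, `intuitive_label("unmentioned", "Smokes one pack of cigarettes daily.", ["cigarette"])` yields an intuitive yes, whereas a textual questionable label is passed through untouched.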
We also added a model that looked for numeric expressions preceding or following certain keywords (that is, biomarker expressions) in the text to classify intuitive yes documents. Thresholds for the numeric expressions were set to provide the optimal performance on the training dataset.
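A biomarker rule of this kind might look like the sketch below. The keyword, the threshold, the character window, and the function name `biomarker_exceeds` are assumptions for illustration; the actual expressions and tuned thresholds are not given in the text.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"


def biomarker_exceeds(text, keyword, threshold, window=20):
    """True if a number close to `keyword` is at least `threshold`.

    A number counts as close when at most `window` non-digit characters
    separate it from the keyword, on either side and on the same line.
    """
    kw = re.escape(keyword)
    gap = r"[^\d\n]{0," + str(window) + r"}?"
    after = re.compile(r"\b" + kw + r"\b" + gap + NUMBER, re.IGNORECASE)
    before = re.compile(NUMBER + gap + r"\b" + kw + r"\b", re.IGNORECASE)
    values = [
        float(match.group(1))
        for pattern in (after, before)
        for match in pattern.finditer(text)
    ]
    return any(value >= threshold for value in values)
```

A hypothetical obesity rule could then be written as `biomarker_exceeds(text, "bmi", 30.0)`, with the threshold chosen on the training data as described above.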
According to the official evaluation, our best model achieved a textual F-macro score of 84% on the training set (which degraded to 76% on the test set) and an intuitive F-macro score of 82% on the training set (which degraded to 67% on the test set); detailed results can be seen in and . This system came sixth in the textual F-macro ranking and second in the intuitive F-macro ranking (third best and second best microaveraged scores, respectively). The microaveraged results were in the high 90s because the system was especially accurate on the yes and unmentioned classes (yes and no for intuitive judgment), and these classes had many more examples than the questionable and textual no classes.
Our intuitive model was based on the textual model. This is why we got a worse performance in intuitive questionable tagging on the test data: we neglected textual unmentioned documents that had an intuitive questionable label because there were too few of them in the training data to model this phenomenon, especially without background medical knowledge.
Our system achieved the second best result on the previously unseen test set for both the micro- and macro-averaged evaluation (intuitive task). The good micro ranking tells us that the dictionaries we collected had good coverage compared to those of other participants, while our second place in the macro ranking confirms that predicting intuitive questionable cases proved rather difficult (or even impossible) for the other participating systems as well.
The model suffered from a lack of coverage for the no and questionable classes in textual annotation as well (the performance dropped from 84 to 76% in the textual task, mainly due to more no and questionable documents left as unmentioned than in the training set).
We should add here that the main evaluation metric of the challenge was the macro-averaged F-measure. This metric gave special emphasis to the rare no and questionable classes, which means that a few dozen examples had a major impact on the results.
This explains both the worse results on the test set (it was particularly hard to model these infrequent classes), and some seemingly strong drops (e.g., for osteoarthritis) or increases (e.g., for obesity) in performance for particular diseases. Micro-averaged results, which take all document-label pairs into account with a uniform weight, are more stable. Moreover, our third place in the textual micro ranking surely confirms that our disease term dictionaries had a reasonably good coverage (compared to other systems), while our context analyzer overlooked some no and questionable cases (sixth place in macro ranking).
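The gap between the two averages can be reproduced on toy data. The sketch below uses one common definition (macro-F as the unweighted mean of per-class F1 scores, micro-F from pooled counts); the official challenge scorer may differ in detail, and the label counts are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, predicted):
    """Return (micro_f1, macro_f1) for single-label classification."""
    classes = sorted(set(gold) | set(predicted))
    counts = {c: [0, 0, 0] for c in classes}  # tp, fp, fn per class
    for g, p in zip(gold, predicted):
        if g == p:
            counts[g][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted class
            counts[g][2] += 1  # false negative for the gold class
    macro = sum(f1(*counts[c]) for c in classes) / len(classes)
    totals = [sum(counts[c][i] for c in classes) for i in range(3)]
    micro = f1(*totals)
    return micro, macro


# Invented example: 100 documents, two rare 'questionable' cases missed.
gold = ["yes"] * 50 + ["unmentioned"] * 48 + ["questionable"] * 2
pred = ["yes"] * 50 + ["unmentioned"] * 50
micro, macro = micro_macro_f1(gold, pred)
```

Here only two documents are misclassified, so the micro-averaged score stays at 0.98, while the macro-averaged score falls to roughly 0.66 because the rare class contributes an F1 of zero with the same weight as the frequent classes.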
We suppose that the relatively good results achieved by our model are due to the high-precision term-dictionaries and context-analysis rules. We argue that such simple solutions are efficient whenever the classification depends on the presence or absence of certain single facts (assertions) in the text. In such problems, usually one sentence (in some cases, two or three) contains the target information. This means that the information can be extracted using a simple approach based on dictionary lookup and modifier detection and the recognition of complex dependencies in the document is not necessary.
For a detailed analysis and comparison of the submitted systems and their performance, see Uzuner 2009. 7
As regards the classification accuracy scores, the method proposed here looks quite promising for the automated processing of large datasets to gather information on obesity and related diseases. Classes with a few hundred training examples for each disease (yes and unmentioned for textual and yes and no for intuitive annotation) generally achieved a micro-averaged F-measure of around 97%. This suggests that our approach is indeed capable of locating the most relevant pieces of information for each of the 16 diseases addressed in most of the documents. We should mention here that the manual filtering of the synonym lists (which were collected using statistical methods) required no more than 10 minutes per disease on average, and the lists used for context-analysis seemed to be independent of the particular disease. The more time consuming step was the manual evaluation of singleton documents that contributed to less than 1% of the system performance. These points make us think that our approach could be scaled up to a larger set of diseases without much effort.
Lower scores were observed for infrequent classes (with only 1–10 examples on average per class/disease pair) and we think that having more examples for questionable cases and negative examples (textual no label) would probably lead to a substantial improvement in performance on these particular classes as well. Overall, we believe that our results demonstrate the feasibility of our approach for classifying clinical records and show that even very simple systems with a shallow linguistic analysis can achieve remarkable accuracy scores for classifying clinical records on a limited set of concepts.
The authors thank the organizers of the i2b2 Obesity challenge and the annotators of the challenge dataset for their efforts that enabled us to work on this interesting and challenging problem; and the anonymous JAMIA reviewers for their valuable comments and suggestions. This work was supported in part by the NKTH grant of the Jedlik Ányos R&D Programme 2007 of the Hungarian government (codename TUDORKA7).