In order to survey, facilitate, and evaluate studies of medical language processing on clinical narratives, i2b2 (Informatics for Integrating Biology to the Bedside) organized its second challenge and workshop. This challenge focused on automatically extracting information on obesity and fifteen of its most common comorbidities from patient discharge summaries. For each patient, obesity and any of the comorbidities could be Present, Absent, or Questionable (i.e., possible) in the patient, or Unmentioned in the discharge summary of the patient. i2b2 provided data for, and invited the development of, automated systems that can classify obesity and its comorbidities into these four classes based on individual discharge summaries. This article refers to obesity and comorbidities as diseases. It refers to the categories Present, Absent, Questionable, and Unmentioned as classes. The task of classifying obesity and its comorbidities is called the Obesity Challenge.
The data released by i2b2 were annotated for textual judgments reflecting the explicitly reported information on diseases, and intuitive judgments reflecting medical professionals' reading of the information presented in discharge summaries. There were very few examples of some disease classes in the data. The Obesity Challenge paid particular attention to the performance of systems on these less well-represented classes.
A total of 30 teams participated in the Obesity Challenge. Each team was allowed to submit two sets of up to three system runs for evaluation, resulting in a total of 136 submissions. The submissions represented a combination of rule-based and machine learning approaches.
Evaluation of system runs shows that the best predictions of textual judgments come from systems that filter the potentially noisy portions of the narratives, project dictionaries of disease names onto the remaining text, apply negation extraction, and process the text through rules. Information on disease-related concepts, such as symptoms and medications, and general medical knowledge help systems infer intuitive judgments on the diseases.
Narrative patient records allow doctors to write precise notes. The narratives do not use controlled vocabularies, and thus allow doctors flexibility of expression. 1 However, narratives also make the information they contain inaccessible to automated clinical systems. Natural language processing (NLP) and medical language processing (MLP) focus on technologies that can extract structured information from narratives. 2
The Obesity Challenge was motivated by the clinical need for technologies that can help counter the current obesity epidemic. 3 Its goal was to systematically evaluate NLP and MLP systems. Run as a shared task, the challenge was organized as a part of an i2b2 (Informatics for Integrating Biology to the Bedside) “Driving Biology Project.” A total of 30 teams participated in the Obesity Challenge and met at a workshop cosponsored by the American Medical Informatics Association. This paper provides an overview of the challenge, describes the data and the evaluation metrics, reviews the best performing systems, and identifies directions for future MLP research.
Systematic, head-to-head evaluations of technology can help advance the state of the art and guide future research. 4 Shared tasks provide a way of conducting such evaluations. They provide the participants with a common set of training documents annotated with the ground truth for a particular task and evaluate all participants on the same held-out set.
Outside the medical domain, shared tasks have included the Message Understanding Conference 5 and the Text REtrieval Conference (TREC), 6 organized by the National Institute of Standards and Technology. 7 Shared tasks for biomedicine have included BioCreAtIvE 8 and TREC Genomics. 9
In 2006, we organized the first MLP shared task on clinical narratives. 10 This task focused on two challenges involving discharge summaries: automatic de-identification of personal health information (the De-identification Challenge) 11 and automatic evaluation of the smoking status of patients (the Smoking Challenge). 12 These shared tasks were followed by a similar effort of the University of Cincinnati Computational Medicine Center. 13 The Obesity Challenge continued i2b2's efforts to make existing clinical records available to the research community. Extracting information about obesity and comorbidities from narrative discharge summaries was the focus of this challenge.
To define the Obesity Challenge task, two experts from the Massachusetts General Hospital Weight Center studied 50 (25 each) random pilot discharge summaries from the Partners HealthCare Research Patient Data Repository. The experts identified fifteen frequently occurring obesity comorbidities: asthma, atherosclerotic cardiovascular disease (CAD), congestive heart failure (CHF), depression, diabetes mellitus (DM), gallstones/cholecystectomy, gastroesophageal reflux disease (GERD), gout, hypercholesterolemia, hypertension (HTN), hypertriglyceridemia, obstructive sleep apnea (OSA), osteoarthritis (OA), peripheral vascular disease (PVD), and venous insufficiency. They defined the Obesity Challenge task as the automatic classification of obesity and the above comorbidities, referred to as diseases, as Present, Absent, or Questionable in a patient, or Unmentioned in the discharge summary of the patient. We define these classes as follows:
We expect that the technologies developed in response to the challenge will be useful for indexing, classifying, and summarizing obesity-related facts found in discharge summaries. All relevant Institutional Review Boards approved the i2b2 Obesity Challenge.
Obesity Challenge data consisted of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository. These data were chosen from the discharge summaries of patients who were overweight or diabetic and had been hospitalized for obesity or diabetes sometime since 12/1/04. Some of the selected summaries included no mention of the stems “obes” and “diabet”; others included at least one mention of these stems.
De-identification was performed semi-automatically. All private health information was replaced with synthetic identifiers. 11
The data for the challenge were annotated by two obesity experts from the Massachusetts General Hospital Weight Center. The experts were given a textual task, which asked them to classify each disease (see list of diseases above) as Present, Absent, Questionable, or Unmentioned based on explicitly documented information in the discharge summaries, e.g., the statement “the patient is obese”. The experts were also given an intuitive task, which asked them to classify each disease as Present, Absent, or Questionable by applying their intuition and judgment to information in the discharge summaries, e.g., the statement “the patient weighs 230 lbs and is 5 ft 2 inches”. We refer to the textual task annotations as textual judgments and the intuitive task annotations as intuitive judgments.
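The second example illustrates why the intuitive task calls for inference: the text never says "obese", yet the standard body mass index (BMI) formula applied to the stated weight and height gives a value well above the commonly used obesity threshold of 30 kg/m². A minimal sketch of that arithmetic follows; the function name and constants are illustrative and are not part of the challenge materials.

```python
LB_TO_KG = 0.45359237   # kilograms per pound (exact by definition)
IN_TO_M = 0.0254        # metres per inch (exact by definition)
OBESITY_BMI = 30.0      # commonly used BMI threshold for obesity, in kg/m^2

def bmi_from_imperial(weight_lb, height_ft, height_in):
    """Return body mass index (kg/m^2) from pounds and feet + inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = (height_ft * 12 + height_in) * IN_TO_M
    return weight_kg / height_m ** 2

# "the patient weighs 230 lbs and is 5 ft 2 inches"
bmi = bmi_from_imperial(230, 5, 2)
print(f"BMI = {bmi:.1f} kg/m^2, above threshold: {bmi >= OBESITY_BMI}")
```

The computed BMI is about 42 kg/m², so a reader with this background knowledge would judge obesity as Present even though no explicit statement appears in the text.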
Given the tasks, the experts agreed that:
The two experts independently annotated our 1237 discharge summaries. The kappa (κ) agreement 14 between the two annotators on each disease is shown in . The lowest κ on textual judgments was 0.71. For 12 diseases, κ on textual judgments was above 0.8; for four diseases, κ on textual judgments was between 0.71 and 0.79. The lowest κ on intuitive judgments was 0.44. For seven diseases, κ on intuitive judgments was above 0.8; for six of the diseases, κ on intuitive judgments was between 0.6 and 0.79. Although the κ values are open to interpretation, 15 a κ of 0.8 is widely used as the threshold for “almost perfect agreement”, and κ values of 0.6–0.79 indicate “substantial agreement”. 14 Please see the online supplement at http://jamia.org for a description of agreement calculation and extended analysis of agreement.
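For readers unfamiliar with κ, the sketch below computes the standard two-annotator Cohen's κ for one disease: observed agreement corrected for the agreement expected by chance from each annotator's label frequencies. It assumes this standard formulation; the exact calculation used for the challenge is described in the online supplement and may differ in detail. The function name and the toy labels are illustrative.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of documents on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy annotations for one disease on four summaries (not challenge data).
expert_1 = ["Present", "Present", "Unmentioned", "Unmentioned"]
expert_2 = ["Present", "Unmentioned", "Unmentioned", "Unmentioned"]
kappa = cohen_kappa(expert_1, expert_2)
print(kappa)
```

In this toy case the annotators agree on three of four summaries (0.75), chance agreement is 0.5, and κ is 0.5.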
After annotation, a resident from the Massachusetts General Hospital resolved the disagreements in textual judgments. Majority vote among the three annotators determined the ground truth for the textual task. In the absence of a third obesity expert who could resolve the disagreements in intuitive judgments, only judgments agreed on by the two obesity experts were used as the ground truth for the intuitive task. shows the correspondence between the ground truth textual and intuitive judgments. Most textual Present judgments map to intuitive Present judgments. Similar observations hold for the other classes.
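The two resolution rules described above can be sketched as follows. The function names are hypothetical, and returning `None` when all three textual annotators differ is an assumption made for illustration; the text does not say how such cases were handled.

```python
from collections import Counter

def textual_ground_truth(expert_1, expert_2, resident):
    """Majority vote among the two experts and the resident.

    Returns None when all three labels differ (an assumed convention).
    """
    label, votes = Counter([expert_1, expert_2, resident]).most_common(1)[0]
    return label if votes >= 2 else None

def intuitive_ground_truth(expert_1, expert_2):
    """Keep an intuitive judgment only when both experts agree on it."""
    return expert_1 if expert_1 == expert_2 else None

print(textual_ground_truth("Present", "Questionable", "Present"))
print(intuitive_ground_truth("Present", "Absent"))
```

Under the second rule, a summary on which the experts disagree contributes no intuitive ground truth for that disease.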
and show data distribution into training and test sets per disease. The distributions are non-uniform. In studying datasets with unbalanced class distributions, it is easier to focus on the better populated classes and ignore the less well-represented ones due to their limited contribution to overall performance. In our case, the less well-represented classes indicate the possibility or absence of a disease in a patient. Accurate recognition of these classes allows their inclusion in structured knowledge bases that can support future clinical decisions. Please refer to the online supplement at http://jamia.org for Table 5 and baseline results on these data.
We evaluated system performances using micro- and macro-averaged precision (P), recall (R), and F-measure (F1). Given the emphasis of the Obesity Challenge on the less well-represented classes, we used macro-averaged F-measure as the primary metric for evaluation. Micro-averaged F-measure maintained a global perspective on the results.
For each disease, the macro-averaged metrics represent the arithmetic mean of the precision, recall, and F-measure on the Present, Absent, Questionable, and Unmentioned classes that are observed in the ground truth for that disease (see Eqs 1, 2 and 3). The macro-averaged precision, recall, and F-measure of the system are obtained from the precision, recall, and F-measure on the classes observed in the ground truth for all diseases. In these formulae, M is the number of classes.
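Written out from the description above, and assuming $P_i$, $R_i$, and $F_i$ denote the precision, recall, and F-measure on class $i$, with $F_i$ the usual harmonic mean of $P_i$ and $R_i$, the macro-averages take the following form. The symbols are chosen here for illustration.

```latex
\begin{align}
P_{\text{macro}} &= \frac{1}{M}\sum_{i=1}^{M} P_i \tag{1}\\
R_{\text{macro}} &= \frac{1}{M}\sum_{i=1}^{M} R_i \tag{2}\\
F_{\text{macro}} &= \frac{1}{M}\sum_{i=1}^{M} F_i,
\qquad F_i = \frac{2\,P_i\,R_i}{P_i + R_i} \tag{3}
\end{align}
```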
Macro-averages give equal weight to each class, including rare ones. 16 As a result, two systems that make the same raw number of mistakes can end up with two different macro-averaged scores.
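A small worked example makes this concrete. The sketch below uses toy labels (not challenge data) for one disease with eight Present and two Absent summaries. System A makes its single mistake on a Present summary; system B makes its single mistake on an Absent summary. Macro-averaged F-measure is computed here as the arithmetic mean of per-class F-measures over the classes observed in the ground truth; the helper names are illustrative.

```python
def class_counts(gold, pred, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    return tp, fp, fn

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(gold, pred):
    """Mean of per-class F1 over the classes observed in the ground truth."""
    classes = sorted(set(gold))
    return sum(f1(*class_counts(gold, pred, c)) for c in classes) / len(classes)

gold = ["Present"] * 8 + ["Absent"] * 2
system_a = ["Absent"] + ["Present"] * 7 + ["Absent"] * 2   # errs on a Present summary
system_b = ["Present"] * 8 + ["Present", "Absent"]         # errs on an Absent summary

errors_a = sum(g != p for g, p in zip(gold, system_a))
errors_b = sum(g != p for g, p in zip(gold, system_b))
macro_a = macro_f1(gold, system_a)
macro_b = macro_f1(gold, system_b)
print(errors_a, errors_b, round(macro_a, 4), round(macro_b, 4))
```

Both systems make one mistake, but the macro-averaged F-measure is about 0.867 for system A and about 0.804 for system B, because an error on the rare Absent class costs that class half of its recall.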
Equation 4 and Equation 5 show the formulae for computing micro-averaged precision and recall from true positives (TP), false positives (FP), and false negatives (FN) for each class. 16,17 In these formulae, M is the number of classes. Micro-averaged F-measure is the harmonic mean of micro-averaged precision and recall (Eq 6). Micro-averages give equal weight to each sample regardless of its class. They are dominated by those classes with the greatest number of samples.
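Written out from the description above, with $TP_i$, $FP_i$, and $FN_i$ denoting the true positives, false positives, and false negatives for class $i$, the micro-averages take the following form. The subscripted symbols are chosen here for illustration.

```latex
\begin{align}
P_{\text{micro}} &= \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} \left(TP_i + FP_i\right)} \tag{4}\\
R_{\text{micro}} &= \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} \left(TP_i + FN_i\right)} \tag{5}\\
F_{\text{micro}} &= \frac{2\,P_{\text{micro}}\,R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}} \tag{6}
\end{align}
```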
A total of 30 teams participated in the Obesity Challenge (see ). Training data were released in March 2008. Test data were released in June 2008. Each team submitted up to three system runs for predicting textual judgments and three for predicting intuitive judgments on test data.
We received a total of 68 textual and 68 intuitive system runs. 21–46 To obtain textual task results, we ranked each team on its best performing textual system run. To assess the intuitive task, we ranked each team on its best performing intuitive system run. We review the top ten textual and intuitive systems in ranked order below.
Of the top ten textual systems, Yang et al., 22 Solt et al., 42 Ware et al., 28 Childs et al., 24 Mishra et al., 43 Szarvas et al., 21 and DeShazo et al. 26 filtered information only indirectly related to the patient out of the narrative summaries and marked negations and uncertainty through methods that resembled NegEx 47 or ConText. 48 In addition:
Yang et al. used a precompiled dictionary of disease, symptom, treatment, and medication terms. They looked for sentences with either exact or approximate matches. For documents that contained more than one sentence about a disease, they determined the class for that disease based on a weighted combination of the evidence in sentences. 22
Solt et al. stripped the documents of personal identifiers, expanded abbreviations, and split discharge summaries into sections. To mark a disease as Present, they developed a rule-based classifier with disease names, synonyms, spelling variants, and semantically related terms. They partitioned text using contextual clues that indicate negative or uncertain statements and fed the partitions into a series of binary classifiers that determined whether each disease was Questionable, Absent, or Present, in that order. Diseases that failed to receive any of these three labels were labeled Unmentioned. 42
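The ordering of the cascade matters: a mention is tested for uncertainty first, then for negation, and only then accepted as Present. The sketch below shows that control flow only. The cue lists and the keyword checks are toy stand-ins invented for illustration; they are not the classifiers or term lists used by Solt et al.

```python
# Hypothetical cue lists for a single disease (gout); illustration only.
UNCERTAIN_CUES = ("possible", "rule out", "suspected")
NEGATION_CUES = ("no history of", "denies", "negative for")

def mentions_gout(text):
    return "gout" in text.lower()

def is_questionable(text):
    return mentions_gout(text) and any(c in text.lower() for c in UNCERTAIN_CUES)

def is_absent(text):
    return mentions_gout(text) and any(c in text.lower() for c in NEGATION_CUES)

def is_present(text):
    return mentions_gout(text)

# Binary classifiers applied in this order; the first one that fires wins.
CASCADE = [
    ("Questionable", is_questionable),
    ("Absent", is_absent),
    ("Present", is_present),
]

def classify(text):
    for label, fires in CASCADE:
        if fires(text):
            return label
    return "Unmentioned"  # no classifier fired

for note in ["Possible gout flare.", "Patient denies gout.",
             "History of gout.", "Admitted for hip fracture."]:
    print(note, "->", classify(note))
```

Placing the Present test last lets it stay simple: by the time it runs, uncertain and negated mentions have already been claimed by the earlier classifiers.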
Ware et al. used regular expressions with a set of disease-related keywords and their synonyms. They assumed that keywords not marked as negated, historical, or associated with a relative would indicate a disease is present. 28
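A keyword-plus-context approach of this general kind can be sketched with regular expressions. The keyword and cue patterns below are hypothetical examples for asthma, and the rule that a cue must precede the keyword within the same sentence is a simplifying assumption; this is not the rule set of Ware et al.

```python
import re

# Hypothetical disease keywords and context cues; illustration only.
KEYWORDS = re.compile(r"\b(asthma|reactive airway disease)\b", re.IGNORECASE)
EXCLUDING_CUES = re.compile(
    r"\b(no|denies|without"                                    # negation
    r"|childhood"                                              # historical
    r"|mother|father|sister|brother|family history)\b",        # relative
    re.IGNORECASE,
)

def keyword_indicates_present(summary):
    """True if some keyword occurs with no excluding cue before it in its sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        for match in KEYWORDS.finditer(sentence):
            preceding = sentence[:match.start()]
            if not EXCLUDING_CUES.search(preceding):
                return True
    return False

print(keyword_indicates_present("No chest pain. Known asthma on albuterol."))
print(keyword_indicates_present("Her mother had asthma."))
```

Splitting into sentences first keeps a negation in one sentence ("No chest pain.") from suppressing a keyword in the next.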
Childs et al. used the rule-based Rocket AeroText information extraction system 49 with keywords, their synonyms, and patterns generated by medical experts. They weighed and combined the evidence for each class of each disease. 24
Mishra et al. marked the text with a set of disease-related keywords compiled by analyzing the training set. They determined the total number of positive, negative, and uncertain assertions for each disease in a discharge summary. The class with the highest number of assertions related to the disease labeled the disease. Ties were broken in favor of positive assertions. 43
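The counting rule can be sketched as follows. The mapping from assertion types to classes, the Unmentioned label for a disease with no assertions, and the preference for negative over uncertain in a tie that excludes positive are assumptions added for illustration; the text specifies only that ties were broken in favor of positive assertions.

```python
from collections import Counter

# Assumed mapping from assertion type to class, listed in tie-break order.
ASSERTION_TO_CLASS = {
    "positive": "Present",
    "negative": "Absent",
    "uncertain": "Questionable",
}
PRIORITY = list(ASSERTION_TO_CLASS)  # earlier entries win ties

def label_from_assertions(assertions):
    """Label one disease in one summary from its list of assertion types."""
    counts = Counter(a for a in assertions if a in ASSERTION_TO_CLASS)
    if not counts:
        return "Unmentioned"  # assumed label when nothing was asserted
    # Highest count wins; on a tie, the earlier entry in PRIORITY wins.
    winner = max(PRIORITY, key=lambda kind: (counts[kind], -PRIORITY.index(kind)))
    return ASSERTION_TO_CLASS[winner]

print(label_from_assertions(["positive", "negative"]))              # tie
print(label_from_assertions(["negative", "negative", "positive"]))
```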
Szarvas et al. used term frequency and conditional probability in the Present class to preselect the most common terms that could aid classification. They supplemented this list with spelling variants and infrequent terms. The resulting dictionaries, along with disease contexts and document structure, formed the backbone of their rule-based system. 21
Savova et al. 25 and Patrick et al. 44 deviated from the pattern of text filtering and negation extraction. Savova et al. combined an information extraction system, a maximum entropy classifier, and an SVM. They evaluated these approaches, and determined the best one for each disease on each of the textual and intuitive tasks. They then allowed the identified best method to judge a disease for a task. 25
Patrick et al. used a combination of rules and a decision-tree classifier with features that included signs, symptoms, and medication names related to each disease. They also leveraged the correlations between diseases. 44
DeShazo et al. analyzed 300 of the discharge summaries, annotating them for information that supported ground truth textual judgments. They employed a rule base to propagate the information supporting ground truth judgments to the rest of the corpus. 26
Most intuitive systems benefited from the output of the textual systems. Solt et al., 42 Szarvas et al., 21 and Childs et al. 24 determined a default mapping between textual and intuitive judgments and used it as the starting point. The top four intuitive systems employed rule bases that incorporated “disease-specific, non-preventive medications and their brand names”, disease-related procedures, and symptoms highly correlated with diseases, 42 “numeric expressions corresponding to measurements”, 21 and medication names. 24,28
Different from the top four, Ambert et al. took a machine learning approach to the intuitive task. They combined hot-spot filtering with error-correcting output codes. They identified words that demonstrated high information gain with respect to each disease, extracted the text within a 100-character window of these words, marked the negations, and vectorized the extracted text. Of the created vectors, “the ones that were absent any non-zero features” were automatically labeled Absent. The rest were labeled using error-correcting output codes that weighted each class inversely proportionally to its size. 45
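The hot-spot extraction step can be sketched as below. Two points are assumptions made for illustration: the window is taken as 100 characters on each side of a match, and the hot words are supplied by the caller rather than selected by information gain. Negation marking, vectorization, and the error-correcting output codes are omitted; this is not the implementation of Ambert et al.

```python
import re

def hot_spot_windows(text, hot_words, radius=100):
    """Return the text around each case-insensitive occurrence of a hot word."""
    windows = []
    for word in hot_words:
        for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
            start = max(0, match.start() - radius)
            end = min(len(text), match.end() + radius)
            windows.append(text[start:end])
    return windows

# Toy document: 150 filler characters on each side of one hot word.
document = "x" * 150 + "Insulin" + "y" * 150
windows = hot_spot_windows(document, ["insulin"])
print(len(windows), len(windows[0]))

# A summary with no hot words yields no windows and hence no features.
print(hot_spot_windows("Admitted for hip fracture.", ["insulin"]))
```

A summary that produces no windows corresponds to a vector with no non-zero features, which is the case the quoted rule labels Absent.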
Meystre extracted sections and sentences from each discharge summary using regular expressions and rules. In these excerpts, he disambiguated acronyms and extracted concept identifiers from the Unified Medical Language System (UMLS). 50 He supplemented the identified concepts with medications and biomarker values that could indicate a disease. He determined intuitive labels using NegEx and ConText. 46
Yang et al. based their intuitive predictions on evidence sentences containing information about symptoms, clinical measurements, and medications. They processed the sentences using clinical information, so the symptoms more directly related to a disease were more heavily weighted. The evidence sentences were considered to mark the presence of a disease unless a negation extractor marked them as negative or uncertain. In diseases with multiple evidence sentences, the information was combined. 22
DeShazo et al. used SVMs for their intuitive system. This system used features derived from the text by the rule-based classifier they developed for the textual task. 26
Matthews evaluated as features stemmed word tokens, bigrams, trigrams, UMLS semantic types of concepts, and negation as extracted by NegEx. He identified the most useful features for each class and applied Bayesian networks to classify diseases. 33
The results for the textual task are shown in and in . shows that the best macro-averaged F-measure on the textual task was 0.8052; the best micro-averaged F-measure was 0.9773. shows that the macro-averaged performance difference between the top two systems is not statistically significant. The top three systems are not significantly different in their micro-averaged F-measures. and show the top ten intuitive systems, as ranked by the macro-averaged F-measure. The best macro-averaged F-measure on the intuitive task is 0.6745; the best micro-averaged F-measure is 0.9654. shows that the top three systems are not statistically different in either macro- or micro-averaged F-measures.
shows that the top ten systems on the textual task had F-measures ranging from 0.92 to 0.97 on the Present class. Their F-measures range from 0.97 to 0.99 on the Unmentioned class. On the Absent class, the F-measures range from 0.39 to 0.66; on the Questionable class, the F-measures range from 0 to 0.62. shows that seven out of the top ten systems produced a zero F-measure on the Questionable class on the intuitive task. The best F-measure for this class is 0.12. The F-measures of the top ten systems range from 0.92 to 0.95 on the Present class and from 0.97 to 0.98 on the Absent class.
Rule-based approaches played a significant role in the top ten systems in the textual task. Machine learning approaches contributed to the top ten systems in the intuitive task but were less dominant in the textual task.
Given the similar approaches taken by the top ten textual systems, we expect that their performance differences resulted from the accuracy of their negation extraction modules and the completeness of their dictionaries. The approaches taken by the intuitive systems were more varied. In general, clinical information, world knowledge, and information from the textual task benefited the top ten intuitive systems. A subset of the top ten textual and intuitive systems took advantage of medical experts, indicating the value of engaging medical professionals in system development.
A subset of the top ten textual and intuitive systems encodes expert knowledge in the form of hand-crafted rules and patterns, generated either through direct interactions with domain experts or through (laypersons') observations on the ground truth created by domain experts. “Expert knowledge is a combination of a theoretical understanding of the problem and a collection of heuristic problem-solving rules that experience has shown to be effective in the domain” 51 . However, such knowledge is limited to a closed-domain, narrowly defined task. Expert systems based on this knowledge, e.g., the hand-crafted systems developed for the Obesity Challenge, perform well when tested within the domain of their focus; however, they require some work to be adapted to new tasks and domains.
Despite the limitations on their generalizability, MLP systems that can address the Obesity Challenge with near-human-level performance were developed within a three-month period. Although some systems started from an existing system, 24,46 most, including two of the best systems developed for the Obesity Challenge, 22,42 were built from scratch.
The main complexity and difficulty of the Obesity Challenge, in contrast to past challenges 12,13 and most mainstream MLP work, came from the focus on less well-represented classes. The worst macro-averaged F-measures on the challenge were 0.2237 and 0.3358, in the textual and intuitive tasks respectively.
In particular, the textual Questionable class contained some discharge summaries that were incorrectly classified by all system runs. One such summary, marked Questionable for GERD, stated “The patient was continued on her PPI for GERD prophylaxis. … required increasing her dosage of Nexium secondary to GERD-like symptoms.”
Similarly, for the textual Absent class, no system runs could correctly predict the judgment for CAD in a discharge summary which stated, “no history of cancer or heart disease.” In general, textual Absent judgment required careful study of the context where diseases are mentioned. For example, recognizing the absence of diabetes when a patient “had no further insulin requirement and was not a diabetic” requires correct interpretation of this text. Only a subset of the submitted system runs correctly classified this case.
The Present class was easier to predict. For example, all systems correctly labeled a discharge summary which stated “adult onset diabetes mellitus”. However, even the Present class was not straightforward when the discharge summary failed to mention the disease by name. For example, a discharge summary about “ventral hernia” and “atrial fibrillation” that did not mention “coronary artery disease” or “cardiovascular disease” was judged Present for CAD. Only a subset of the submitted system runs predicted this textual judgment. Prediction of textual Present judgments was even more difficult in summaries using biomarkers or other related information to describe a disease. For example, none of the system runs submitted to the i2b2 challenge could correctly predict the ground truth judgment for obesity on the discharge summary that stated “The patient's admission weight was 106.2 kg. Her discharge weight was 100.7 kilograms”, and “weight should be monitored daily.”
The textual Unmentioned class was the easiest to predict. Most of these judgments were classified correctly by almost all the submitted system runs. Those textual Unmentioned judgments that could not be predicted correctly demonstrate peculiarities of the data. For example, the author's reading of the statement “The patient was an obese male” indicates a textual label of Present for obesity and disagrees with the ground truth label of Unmentioned.
Given the characteristics of the data and the observations on performance on the less well-represented classes, removing the emphasis from these classes would have made the Obesity Challenge much more mainstream and much more straightforward, but not trivial. Eighty-five percent of the systems in the intuitive task and 93% of the systems in the textual task achieved micro-averaged F-measures above 0.8. Two of the best performing systems from the Obesity Challenge are open source and can either be downloaded for local installations or utilized online. 52,53
The Obesity Challenge demonstrates the difficulty of differentiating textual judgments from intuitive ones. The overlap in information used by automated systems for identifying textual and intuitive judgments and the author's observations on the Obesity Challenge data indicate that textual judgments of domain experts may differ from textual judgments of lay persons. In other words, the annotators' domain knowledge may have led them to consider some inferred information as explicit. As a result, some judgments that could be considered intuitive by lay persons were found among the textual judgments. 54
However, even with unclear boundaries between textual and intuitive judgments, the automated systems built by lay persons effectively extracted much useful information from discharge summaries. These systems performed best on the most factual and objective pieces of information. They experienced more difficulty arriving at conclusions only medical experts could infer. Most of the factual and objective pieces of information were identified by simple rule-based systems armed with dictionaries of terms and negation extraction modules. Machine learning approaches that studied the patterns in the textual judgments provided a beginning to correctly predicting intuitive judgments. We should emphasize that the relative performance of the systems is likely to change if we have much larger corpora for both training and testing. The unavailability of such corpora is likely to be the largest bottleneck for future progress in MLP.
This work was supported in part by the NIH Road Map for Medical Research Grant U54LM008748. Institutional Review Board approval has been granted for the studies presented in this manuscript. The author thanks all participating teams for their contributions to the challenge, and AMIA for its support in organizing the workshop that accompanied the challenge.