Because annotated data sets are scarce, few studies have applied machine learning based approaches to extracting named entities (NEs) from clinical text. The 2009 i2b2 NLP challenge was a task to extract six types of medication related NEs (medication names, dosage, mode, frequency, duration, and reason) from hospital discharge summaries. Several machine learning based systems were developed for the challenge and showed good performance. Those systems often involve two steps: 1) recognition of medication related entities; and 2) determination of the relation between a medication name and its modifiers (e.g., dosage). A few machine learning algorithms, including Conditional Random Field (CRF) and Maximum Entropy, have been applied to the Named Entity Recognition (NER) task in the first step. In this study, we developed a Support Vector Machine (SVM) based method to recognize medication related entities. In addition, we systematically investigated various types of features for NER in clinical text. Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the SVM-based NER system achieved its best F-score of 90.05% (93.20% Precision, 87.12% Recall) when semantic features generated from a rule-based system were included.
Named Entity Recognition (NER) is an important step in natural language processing (NLP). It has many applications in the general language domain, such as identifying person names, locations, and organizations. NER is also crucial for biomedical literature mining (Hirschman, Morgan, & Yeh, 2002; Krauthammer & Nenadic, 2004), and many studies have focused on biomedical entities such as gene/protein names. There are two main types of approaches to identifying biomedical entities: rule-based and machine learning based. While rule-based approaches use existing biomedical knowledge and resources, machine learning (ML) based approaches rely heavily on annotated training data. The advantage of rule-based approaches is that they usually achieve stable performance across different data sets because they draw on verified resources, while machine learning approaches often report better results when sufficient high-quality training data are available. To harness the advantages of both, a combination of the two, called the hybrid approach, has often been used as well. CRF and SVM are two common machine learning algorithms that have been widely used in biomedical NER (Takeuchi & Collier, 2003; Kazama, Makino, Ohta, & Tsujii, 2002; Yamamoto, Kudo, Konagaya, & Matsumoto, 2003; Torii, Hu, Wu, & Liu, 2009; Li, Savova, & Kipper-Schuler, 2008). Some studies reported better results using CRF (Li, Savova, & Kipper-Schuler, 2008), while others showed that SVM was better (Tsochantaridis, Joachims, & Hofmann, 2005) in NER. Keerthi and Sundararajan (Keerthi & Sundararajan, 2007) conducted experiments and demonstrated that CRF and SVM were quite close in performance when identical feature functions were used.
There has been a large ongoing effort to process clinical text in Electronic Medical Records (EMRs). Many clinical NLP systems have been developed, including MedLEE (Friedman, Alderson, Austin, Cimino, & Johnson, 1994), SymTex (Haug et al., 1997), and MetaMap (Aronson, 2001). Most of those systems recognize clinical named entities such as diseases, medications, and labs using rule-based methods such as lexicon lookup, mainly for two reasons: 1) there are very rich knowledge bases and vocabularies of clinical entities, such as the Unified Medical Language System (UMLS) (Lindberg, Humphreys, & McCray, 1993), which includes over 100 controlled biomedical vocabularies, such as RxNorm, SNOMED, and ICD-9-CM; 2) very few annotated data sets of clinical text are available for machine learning based approaches.
Medication is one of the most important types of information in clinical text. Several studies have worked on extracting drug names from clinical notes. Evans et al. (Evans, Brownlow, Hersh, & Campbell, 1996) showed that drug and dosage phrases in discharge summaries could be identified by the CLARIT system with an accuracy of 80%. Chhieng et al. (Chhieng, Day, Gordon, & Hicks, 2007) reported a precision of 83% when using a string matching method to identify drug names in clinical records. Levin et al. (Levin, Krol, Doshi, & Reich, 2007) developed an effective rule-based system to extract drug names from anesthesia records and map to RxNorm concepts with 92.2% sensitivity and 95.7% specificity. Sirohi and Peissig (Sirohi & Peissig, 2005) studied the effect of lexicon sources on drug extraction. Recently, Xu et al. (Xu et al., 2010) developed a rule-based system for medication information extraction, called MedEx, and reported F-scores over 90% on extracting drug names, dose, route, and frequency from discharge summaries.
Starting in 2007, Informatics for Integrating Biology and the Bedside (i2b2), an NIH-funded National Center for Biomedical Computing (NCBC) based at Partners Healthcare System in Boston, organized a series of shared tasks on NLP in clinical text. The 2009 i2b2 NLP challenge was to extract medication names, as well as their corresponding signature information including dosage, mode, frequency, duration, and reason, from de-identified hospital discharge summaries (Uzüner, Solti, & Cadag, 2009). At the beginning of the challenge, a training set of 696 notes was provided by the organizers. Among them, 17 notes were annotated by the i2b2 organizers, based on an annotation guideline (see Table 1 for examples of medication information in the guideline), and the rest were un-annotated notes. Participating teams developed their systems based on the training set, and they were allowed to annotate additional notes in the training set. The test data set included 547 clinical notes, from which 251 notes were randomly picked by the organizers. Those 251 notes were then annotated by participating teams, as well as the organizers, and they served as the gold standard for evaluating the performance of systems submitted by participating teams. An example of original text and annotated text is shown in Figure 1.
The results of systems submitted by the participating teams were presented at the i2b2 workshop, and short papers describing each system were made available on the password-protected i2b2 web site. Among the top 10 systems, there were 6 rule-based, 2 machine learning based, and 2 hybrid systems. The best system, which used a machine learning based approach, reached the highest F-score of 85.7% (Patrick & Li, 2009). The second best system, which was a rule-based system using the existing MedEx tool, reported an F-score of 82.1% (Doan, Bastarache, Klimkowski, Denny, & Xu, 2009). The difference between those two systems was statistically significant. However, this finding was not very surprising, as the machine learning based system utilized an additional 147 notes annotated by the participating team, while the rule-based system mainly used the 17 annotated training notes to customize the system.
Interestingly, the two machine learning systems in the top ten achieved very different performance: one (Patrick et al., 2009) achieved an F-score of 85.7% and ranked first, while the other (Li et al., 2009) achieved an F-score of 76.4% and ranked 10th in the final evaluation. Both systems used CRF for NER, with a comparable amount of training data (145 and 147 notes respectively). The large difference in F-score between those two systems could be due to the quality of the training set and to the feature sets used for classification. More recently, the i2b2 organizers also reported a Maximum Entropy (ME) based approach for the 2009 challenge (Halgrim, Xia, Solti, Cadag, & Uzuner, 2010). Using the same annotated data set as in (Patrick et al., 2009), they reported an F-score of 84.1% when combined features such as unigrams, word bigrams/trigrams, and labels of previous words were used. These results indicated the importance of the feature sets used by machine learning algorithms in this task.
For supervised machine learning based systems in the i2b2 challenge, the task was usually divided into two steps: 1) NER of six medication related findings; and 2) determination of the relation between detected medication names and other entities. It is obvious that NER is the first crucial step and it affects the performance of the whole system. However, short papers presented at the i2b2 workshop did not show much detailed evaluation on NER components in machine learning based systems. The variation in performance of different machine learning based systems also motivated us to further investigate the effect of different types of features on recognizing medication related entities.
In this study, we developed an SVM-based NER system for recognizing medication related entities, which is a sub-task of the i2b2 challenge. We systematically investigated the effects of typical local contextual features that have been reported in many biomedical NER studies. Our studies provided some valuable insights to NER tasks of medical entities in clinical text.
A total of 268 annotated discharge summaries (17 from the training set and 251 from the test set) from the i2b2 challenge were used in this study. This annotated corpus contains 9,689 sentences, 326,474 words, and 27,589 entities. Annotated notes were converted into a BIO format, and different types of feature sets were used in an SVM classifier for NER. Performance of the NER system was evaluated using precision, recall, and F-score, based on 10-fold cross-validation.
The annotated corpus was converted into a BIO format (see an example in Figure 2). Specifically, each word was assigned to a class as follows: B means beginning of an entity, I means inside an entity, and O means outside of an entity. As we have six types of entities, we have six different B classes and six different I classes. For example, for medication names, we define the B class as “B-m”, and the I class as “I-m”. Therefore, we had a total of 13 possible classes for each word (including the O class).
After preprocessing, the NER problem can be considered a classification problem: assign one of the 13 class labels to each word.
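The BIO conversion described above can be illustrated with a minimal Python sketch. It is not the preprocessing code used in this study. Only the code “m” for medication names comes from the text; the other five type codes, the function names, and the example sentence are hypothetical, and entity spans are assumed to be given as token offsets.

```python
# Entity-type codes: "m" (medication name) is from the text; the other
# five codes are hypothetical placeholders for dosage, mode, frequency,
# duration, and reason.
ENTITY_TYPES = ["m", "do", "mo", "f", "du", "r"]


def bio_labels(types=ENTITY_TYPES):
    """Return the label set: one B- and one I- class per type, plus O."""
    labels = ["O"]
    for t in types:
        labels += [f"B-{t}", f"I-{t}"]
    return labels


def to_bio(tokens, spans):
    """Convert entity spans (start, end_exclusive, type) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence with one medication name and one frequency.
tokens = ["He", "takes", "fluocinonide", "0.5%", "cream", "twice", "daily"]
spans = [(2, 5, "m"), (5, 7, "f")]
tags = to_bio(tokens, spans)
```

With six entity types, `bio_labels()` yields the 13 classes mentioned above, and each token in `tokens` receives exactly one of them.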
Support Vector Machine (SVM) is a machine learning method that is widely used in many NLP tasks such as chunking, POS tagging, and NER. Essentially, it constructs a binary classifier from labeled training samples. Given a set of training samples, the SVM training phase tries to find the optimal hyperplane, which maximizes the distance to the training samples nearest to it (called support vectors). SVM takes its input as a vector and maps it into a feature space using a kernel function.
In this paper we used TinySVM1 along with YamCha2, developed at NAIST (Kudo & Matsumoto, 2000; Kudo & Matsumoto, 2001). We used a polynomial kernel of degree 2, a context window of +/-2, and the pairwise (one-against-one) strategy for multi-class classification. The pairwise strategy builds K(K-1)/2 binary classifiers, where K is the number of classes (in this case K=13, giving 78 classifiers). Each binary classifier decides which of its two classes the sample belongs to. Each binary classifier has one vote, and the final output is the class with the most votes. These parameters have been used in many biomedical NER tasks (e.g., Takeuchi & Collier, 2003; Kazama et al., 2002; Yamamoto et al., 2003).
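The pairwise voting scheme can be sketched in a few lines of Python. This is an illustration of the general one-against-one technique, not YamCha's implementation: the `decide` callback stands in for a trained binary SVM, the toy scores are invented for the example, and breaking ties by class order is an assumption.

```python
from collections import Counter
from itertools import combinations


def num_pairwise_classifiers(k):
    """Number of binary classifiers needed for k classes: k(k-1)/2."""
    return k * (k - 1) // 2


def pairwise_vote(classes, decide):
    """Run every pairwise classifier and return the class with most votes.

    decide(a, b) must return either a or b; here it is a stand-in for a
    trained binary SVM. Ties go to the class listed first (an assumption).
    """
    votes = Counter(decide(a, b) for a, b in combinations(classes, 2))
    return max(classes, key=lambda c: votes[c])


# Toy stand-in for the binary classifiers: prefer the higher-scoring class.
scores = {"O": 0.1, "B-m": 0.7, "I-m": 0.2}
winner = pairwise_vote(
    list(scores), lambda a, b: a if scores[a] >= scores[b] else b
)
```

For the 13 BIO classes used here, `num_pairwise_classifiers(13)` gives 78 binary classifiers, each trained only on the samples of its two classes.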
In this study, we investigated different types of features for the SVM-based NER system for medication related entities, including 1) words; 2) Part-of-Speech (POS) tags; 3) morphological clues; 4) orthographies of words; 5) previous history features; 6) semantic tags determined by MedEx, a rule based medication extraction system. Details of those features are described below:
MedEx was originally developed at Vanderbilt University for extracting medication information from clinical text (Xu et al., 2010). MedEx labels medication related entities with a set of pre-defined semantic categories, which overlap with the six entities defined in the i2b2 challenge but are not exactly the same. For example, MedEx breaks the phrase “fluocinonide 0.5% cream” into drug name: “fluocinonide”, strength: “0.5%”, and form: “cream”, while i2b2 labels the whole phrase as a medication name. There are a total of 11 pre-defined semantic categories, which are listed in (Xu et al., 2010c). When the Vanderbilt team applied MedEx to the i2b2 challenge, they customized and extended MedEx to label medication related entities as required by i2b2. Those customizations included:
In summary, the MedEx system produces two sets of semantic tags: 1) initial tags, identified by the original MedEx system; and 2) final tags, identified by the MedEx system customized for the i2b2 challenge. The initial tagger is equivalent to the simple dictionary lookup methods used in many NER systems. The final tagger is a more advanced method that integrates other levels of information such as sections and spellings. The initial tags comprise the 11 pre-defined semantic categories in MedEx, and the final tags consist of the 6 types of NEs required by i2b2. We were therefore interested in studying the effects of both types of MedEx tags. These semantic tags were also converted into the BIO format when they were used as features.
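To make the feature representation concrete, the following Python sketch builds a feature dictionary for one token from its +/-2 context window. It is an illustration only: the exact feature definitions used in this study are not reproduced here, so the orthographic categories, the affix length of 3, the padding symbol, the tag codes, and the example sentence are all assumptions.

```python
import re


def orthography(word):
    """Coarse word-shape class (hypothetical categories)."""
    if re.fullmatch(r"\d+", word):
        return "DIGITS"
    if any(ch.isdigit() for ch in word):
        return "HAS_DIGIT"
    if word.isupper():
        return "ALL_CAPS"
    if word[:1].isupper():
        return "INIT_CAP"
    return "LOWER"


def token_features(tokens, sem_tags, i, window=2, affix_len=3):
    """Features for token i: words and semantic tags in a +/-window context,
    plus orthographic and affix (morphological) clues for the token itself."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        inside = 0 <= j < len(tokens)       # pad positions outside the sentence
        feats[f"w[{off}]"] = tokens[j].lower() if inside else "<PAD>"
        feats[f"sem[{off}]"] = sem_tags[j] if inside else "<PAD>"
    word = tokens[i]
    feats["orth"] = orthography(word)
    feats["prefix"] = word[:affix_len].lower()
    feats["suffix"] = word[-affix_len:].lower()
    return feats


# Hypothetical sentence with BIO-formatted semantic tags from a rule-based tagger.
tokens = ["aspirin", "81", "mg", "po", "daily"]
sem_tags = ["B-m", "B-do", "I-do", "B-mo", "B-f"]
feats = token_features(tokens, sem_tags, 1)
```

POS tags and previously predicted labels (the history features) would be added to the same dictionary in the same windowed fashion.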
In this study, we measured Precision, Recall, and F-score using the CoNLL evaluation script4. Precision is the ratio between the number of NE chunks correctly identified by the system and the total number of NE chunks found by the system; Recall is the ratio between the number of NE chunks correctly identified by the system and the total number of NE chunks in the gold standard. Experiments were run on a Linux machine with 16GB RAM and an 8-core Intel Xeon 2.0GHz processor. The performance of different types of feature sets was evaluated using 10-fold cross-validation.
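The chunk-level metrics defined above can be sketched in Python as follows. This is a simplified illustration of exact-match chunk evaluation, not the CoNLL script itself; the F-score is taken to be the harmonic mean of precision and recall, and the function names and example tag sequences are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (start, end_exclusive, type) chunks in a BIO sequence."""
    chunks, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last chunk
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            chunks.add((start, i, etype))          # close the open chunk
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]              # open a new chunk
    return chunks


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(gold_tags, pred_tags):
    """Exact-match chunk precision, recall, and F-score."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)                     # same boundaries and same type
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical example: the system misplaces the frequency chunk by one token.
gold = ["B-m", "I-m", "O", "B-do", "O", "B-f"]
pred = ["B-m", "I-m", "O", "B-do", "B-f", "O"]
p, r, f = evaluate(gold, pred)
```

As a consistency check, `f_score(93.20, 87.12)` is approximately 90.06, which matches the best F-score reported in this study up to rounding of the precision and recall values.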
Table 2 shows the precision, recall, and F-score of the SVM-based NER system for all six types of entities when different combinations of feature sets were used. Among them, the best F-score of 90.05% was achieved when all feature sets were used. A number of interesting conclusions can be drawn from those results. First, the contribution of different types of features to the system's performance varies. For example, the “previous history feature” and the “morphology feature” improved the performance substantially (F-score from 81.76% to 83.83%, and from 83.81% to 86.06% respectively). These findings were consistent with previously reported results on protein/gene NER (Kazama et al., 2002; Takeuchi & Collier, 2003; Yamamoto et al., 2003). However, “POS” and “orthographic” features contributed very little, not as much as in protein/gene name recognition tasks. This could be related to the differences between gene/protein phrases and medication phrases: more orthographic clues are observed in gene/protein names. Second, the “semantic tags” features alone, even when using only the original tagger in MedEx, improved the performance dramatically (from 81.76% to 86.51% or 89.47%). This indicates that the knowledge bases in the biomedical domain are crucial to biomedical NER. Third, the customized final semantic tagger in MedEx had much better performance than the original tagger, which indicated that advanced semantic tagging methods that integrate other levels of linguistic information (e.g., sections) were more useful than simple dictionary lookup methods.
Table 3 shows the precision, recall, and F-score for each type of entity, from MedEx alone and from the baseline and best runs of the SVM-based NER system. As we can see, the best SVM-based NER system, which combines all types of features (including inputs from MedEx), was much better than the MedEx system alone (90.05% vs. 85.86%). This suggested that combining rule-based systems with machine learning approaches could yield the best performance in biomedical NER tasks.
Among the six types of medication entities, we noticed that four types (medication names, dosage, mode, and frequency) achieved very high F-scores (over 92%), while the two others (duration and reason) had low F-scores (up to 50%). This finding was consistent with results from the i2b2 challenge. Duration and reason are more difficult to identify because they do not have well-formed patterns and few knowledge bases exist for them.
This study only focused on the first step of the i2b2 medication extraction challenge – NER. Our next plan is to work on the second step of determining relations between medication names and other entities, thus allowing us to compare our results with those reported in the i2b2 challenge. In addition, we will also evaluate and compare the performance of other ML algorithms such as CRF and ME on the same NER task.
In this study, we developed an SVM-based NER system for medication related entities. We systematically investigated different types of features, and our results showed that by incorporating semantic features from a rule-based system, the ML-based NER system could achieve its best F-score of 90.05% in recognizing medication related entities, using the i2b2 annotated data set. The experiments also showed that optimized usage of external knowledge bases was crucial to high-performance ML-based NER systems for medical entities such as drug names.
The authors would like to thank the i2b2 organizers for organizing the 2009 i2b2 challenge and providing the data set for research studies. This study was supported in part by NCI grant R01CA141307-01.
1 Available at http://chasen.org/~taku/software/TinySVM/
2 Available at http://chasen.org/~taku/software/YamCha/
4 Available at http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt