Search tips
Search criteria

Results 1-25 (438488)

Clipboard (0)

Related Articles

1.  Disambiguating the species of biomedical named entities using natural language parsers 
Bioinformatics  2010;26(5):661-667.
Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers.
Results: We build a corpus for organism disambiguation where every occurrence of protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification.
Availability: The corpus and demo are available at, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at
PMCID: PMC2828111  PMID: 20053840
2.  A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources 
To create a sense inventory of abbreviations and acronyms from clinical texts.
The most frequently occurring abbreviations and acronyms from 352 267 dictated clinical notes were used to create a clinical sense inventory. Senses of each abbreviation and acronym were manually annotated from 500 random instances and lexically matched with long forms within the Unified Medical Language System (UMLS V.2011AB), Another Database of Abbreviations in Medline (ADAM), and Stedman's Dictionary, Medical Abbreviations, Acronyms & Symbols, 4th edition (Stedman's). Redundant long forms were merged after they were lexically normalized using Lexical Variant Generation (LVG).
The clinical sense inventory was found to have skewed sense distributions, practice-specific senses, and incorrect uses. Of 440 abbreviations and acronyms analyzed in this study, 949 long forms were identified in clinical notes. This set was mapped to 17 359, 5233, and 4879 long forms in UMLS, ADAM, and Stedman's, respectively. After merging long forms, only 2.3% matched across all medical resources. The UMLS, ADAM, and Stedman's covered 5.7%, 8.4%, and 11% of the merged clinical long forms, respectively. The sense inventory of clinical abbreviations and acronyms and anonymized datasets generated from this study are available for public use at (‘Sense Inventories’, website).
Clinical sense inventories of abbreviations and acronyms created using clinical notes and medical dictionary resources demonstrate challenges with term coverage and resource integration. Further work is needed to help with standardizing abbreviations and acronyms in clinical care and biomedicine to facilitate automated processes such as text-mining and information extraction.
PMCID: PMC3932450  PMID: 23813539
Abbreviations as Topic*; Medical Records*; Natural Language Processing*; Word sense disambiguation; Clinical sense inventory
3.  Combining Corpus-derived Sense Profiles with Estimated Frequency Information to Disambiguate Clinical Abbreviations 
AMIA Annual Symposium Proceedings  2012;2012:1004-1013.
Abbreviations are widely used in clinical notes and are often ambiguous. Word sense disambiguation (WSD) for clinical abbreviations therefore is a critical task for many clinical natural language processing (NLP) systems. Supervised machine learning based WSD methods are known for their high performance. However, it is time consuming and costly to construct annotated samples for supervised WSD approaches and sense frequency information is often ignored by these methods. In this study, we proposed a profile-based method that used dictated discharge summaries as an external source to automatically build sense profiles and applied them to disambiguate abbreviations in hospital admission notes via the vector space model. Our evaluation using a test set containing 2,386 annotated instances from 13 ambiguous abbreviations in admission notes showed that the profile-based method performed better than two baseline methods and achieved a best average precision of 0.792. Furthermore, we developed a strategy to combine sense frequency information estimated from a clustering analysis with the profile-based method. Our results showed that the combined approach largely improved the performance and achieved a highest precision of 0.875 on the same test set, indicating that integrating sense frequency information with local context is effective for clinical abbreviation disambiguation.
PMCID: PMC3540457  PMID: 23304376
4.  Methods for Building Sense Inventories of Abbreviations in Clinical Notes 
To develop methods for building corpus-specific sense inventories of abbreviations occurring in clinical documents.
A corpus of internal medicine admission notes was collected and instances of each clinical abbreviation in the corpus were clustered to different sense clusters. One instance from each cluster was manually annotated to generate a final list of senses. Two clustering-based methods (Expectation Maximization—EM and Farthest First—FF) and one random sampling method for sense detection were evaluated using a set of 12 clinical abbreviations.
The clustering-based sense detection methods were evaluated using a set of clinical abbreviations that were manually sense annotated. “Sense Completeness” and “Annotation Cost” were used to measure the performance of different methods. Clustering error rates were also reported for different clustering algorithms.
A clustering-based semi-automated method was developed to build corpus-specific sense inventories for abbreviations in hospital admission notes. Evaluation demonstrated that this method could largely reduce manual annotation cost and increase the completeness of sense inventories when compared with a manual annotation method using random samples.
The authors developed an effective clustering-based method for building corpus-specific sense inventories for abbreviations in a clinical corpus. To the best of the authors knowledge, this is the first time clustering technologies have been used to help building sense inventories of abbreviations in clinical text. The results demonstrated that the clustering-based method performed better than the manual annotation method using random samples for the task of building sense inventories of clinical abbreviations.
PMCID: PMC2605589  PMID: 18952935
5.  A Preliminary Study of Clinical Abbreviation Disambiguation in Real Time 
Applied Clinical Informatics  2015;6(2):364-374.
To save time, healthcare providers frequently use abbreviations while authoring clinical documents. Nevertheless, abbreviations that authors deem unambiguous often confuse other readers, including clinicians, patients, and natural language processing (NLP) systems. Most current clinical NLP systems “post-process” notes long after clinicians enter them into electronic health record systems (EHRs). Such post-processing cannot guarantee 100% accuracy in abbreviation identification and disambiguation, since multiple alternative interpretations exist.
Authors describe a prototype system for real-time Clinical Abbreviation Recognition and Disambiguation (rCARD) – i.e., a system that interacts with authors during note generation to verify correct abbreviation senses. The rCARD system design anticipates future integration with web-based clinical documentation systems to improve quality of healthcare records. When clinicians enter documents, rCARD will automatically recognize each abbreviation. For abbreviations with multiple possible senses, rCARD will show a ranked list of possible meanings with the best predicted sense at the top. The prototype application embodies three word sense disambiguation (WSD) methods to predict the correct senses of abbreviations. We then conducted three experments to evaluate rCARD, including 1) a performance evaluation of different WSD methods; 2) a time evaluation of real-time WSD methods; and 3) a user study of typing clinical sentences with abbreviations using rCARD.
Using 4,721 sentences containing 25 commonly observed, highly ambiguous clinical abbreviations, our evaluation showed that the best profile-based method implemented in rCARD achieved a reasonable WSD accuracy of 88.8% (comparable to SVM – 89.5%) and the cost of time for the different WSD methods are also acceptable (ranging from 0.630 to 1.649 milliseconds within the same network). The preliminary user study also showed that the extra time costs by rCARD were about 5% of total document entry time and users did not feel a significant delay when using rCARD for clinical document entry.
The study indicates that it is feasible to integrate a real-time, NLP-enabled abbreviation recognition and disambiguation module with clinical documentation systems.
PMCID: PMC4493336  PMID: 26171081
Clinical abbreviation; machine learning; clinical documentation system
6.  A new clustering method for detecting rare senses of abbreviations in clinical notes 
Journal of biomedical informatics  2012;45(6):1075-1083.
Abbreviations are widely used in clinical documents and they are often ambiguous. Building a list of possible senses (also called sense inventory) for each ambiguous abbreviation is the first step to automatically identify correct meanings of abbreviations in given contexts. Clustering based methods have been used to detect senses of abbreviations from a clinical corpus [1]. However, rare senses remain challenging and existing algorithms are not good enough to detect them. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation, at two different levels of annotation cost (10 vs. 20 instances for each abbreviation). Our results showed that the TCRS-based method could detect 85% senses on average; while the EM-based method found only 75% senses, when similar annotation effort (about 20 instances) was used. Further analysis demonstrated that the improvement by the TCRS method was mainly from additionally detected rare senses, thus indicating its usefulness for building more complete sense inventories of clinical abbreviations.
PMCID: PMC3729222  PMID: 22742938
Natural language processing; Word sense discrimination; Clustering; Clinical abbreviations
7.  Biomedical term mapping databases 
Nucleic Acids Research  2004;33(Database Issue):D289-D293.
Longer words and phrases are frequently mapped onto a shorter form such as abbreviations or acronyms for efficiency of communication. These abbreviations are pervasive in all aspects of biology and medicine and as the amount of biomedical literature grows, so does the number of abbreviations and the average number of definitions per abbreviation. Even more confusing, different authors will often abbreviate the same word/phrase differently. This ambiguity impedes our ability to retrieve information, integrate databases and mine textual databases for content. Efforts to standardize nomenclature, especially those doing so retrospectively, need to be aware of different abbreviatory mappings and spelling variations. To address this problem, there have been several efforts to develop computer algorithms to identify the mapping of terms between short and long form within a large body of literature. To date, four such algorithms have been applied to create online databases that comprehensively map biomedical terms and abbreviations within MEDLINE: ARGH (, the Stanford Biomedical Abbreviation Server (, AcroMed ( and SaRAD ( In addition to serving as useful computational tools, these databases serve as valuable references that help biologists keep up with an ever-expanding vocabulary of terms.
PMCID: PMC540091  PMID: 15608198
8.  Using UMLS Lexical Resources to Disambiguate Abbreviations in Clinical Text 
Clinical text is rich in acronyms and abbreviations, and they are highly ambiguous. As a pre-processing step before subsequent NLP analysis, we are developing and evaluating clinical abbreviation disambiguation methods. The evaluation of two sequential steps, the detection and the disambiguation of abbreviations, is reported here, for various types of clinical notes. For abbreviations detection, our result indicated the SPECIALIST Lexicon LRABR needed to be revised for better abbreviation detection. Our semi-supervised method using generated training data based on expanded form matching for 12 frequent abbreviations in our clinical notes reached over 90% accuracy in five-fold cross validation and unsupervised approach produced comparable results with the semi-supervised methods.
PMCID: PMC3243121  PMID: 22195128
9.  Challenges and Practical Approaches with Word Sense Disambiguation of Acronyms and Abbreviations in the Clinical Domain 
Although acronyms and abbreviations in clinical text are used widely on a daily basis, relatively little research has focused upon word sense disambiguation (WSD) of acronyms and abbreviations in the healthcare domain. Since clinical notes have distinctive characteristics, it is unclear whether techniques effective for acronym and abbreviation WSD from biomedical literature are sufficient.
The authors discuss feature selection for automated techniques and challenges with WSD of acronyms and abbreviations in the clinical domain.
There are significant challenges associated with the informal nature of clinical text, such as typographical errors and incomplete sentences; difficulty with insufficient clinical resources, such as clinical sense inventories; and obstacles with privacy and security for conducting research with clinical text. Although we anticipated that using sophisticated techniques, such as biomedical terminologies, semantic types, part-of-speech, and language modeling, would be needed for feature selection with automated machine learning approaches, we found instead that simple techniques, such as bag-of-words, were quite effective in many cases. Factors, such as majority sense prevalence and the degree of separateness between sense meanings, were also important considerations.
The first lesson is that a comprehensive understanding of the unique characteristics of clinical text is important for automatic acronym and abbreviation WSD. The second lesson learned is that investigators may find that using simple approaches is an effective starting point for these tasks. Finally, similar to other WSD tasks, an understanding of baseline majority sense rates and separateness between senses is important. Further studies and practical solutions are needed to better address these issues.
PMCID: PMC4330198  PMID: 25705556
Abbreviations as Topic; Medical Records; Natural Language Processing; Artificial Intelligence; Automated Pattern Recognition
10.  Automatic Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS 
Motivation. The UMLS has been used in natural language processing applications such as information retrieval and information extraction systems. The mapping of free-text to UMLS concepts is important for these applications. To improve the mapping, we need a method to disambiguate terms that possess multiple UMLS concepts. In the general English domain, machine-learning techniques have been applied to sense-tagged corpora, in which senses (or concepts) of ambiguous terms have been annotated (mostly manually). Sense disambiguation classifiers are then derived to determine senses (or concepts) of those ambiguous terms automatically. However, manual annotation of a corpus is an expensive task. We propose an automatic method that constructs sense-tagged corpora for ambiguous terms in the UMLS using MEDLINE abstracts.
Methods. For a term W that represents multiple UMLS concepts, a collection of MEDLINE abstracts that contain W is extracted. For each abstract in the collection, occurrences of concepts that have relations with W as defined in the UMLS are automatically identified. A sense-tagged corpus, in which senses of W are annotated, is then derived based on those identified concepts. The method was evaluated on a set of 35 frequently occurring ambiguous biomedical abbreviations using a gold standard set that was automatically derived. The quality of the derived sense-tagged corpus was measured using precision and recall.
Results. The derived sense-tagged corpus had an overall precision of 92.9% and an overall recall of 47.4%. After removing rare senses and ignoring abbreviations with closely related senses, the overall precision was 96.8% and the overall recall was 50.6%.
Conclusions. UMLS conceptual relations and MEDLINE abstracts can be used to automatically acquire knowledge needed for resolving ambiguity when mapping free-text to UMLS concepts.
PMCID: PMC349379  PMID: 12386113
11.  Mapping Abbreviations to Full Forms in Biomedical Articles 
Objective: To develop methods that automatically map abbreviations to their full forms in biomedical articles.
Methods: The authors developed two methods of mapping defined and undefined abbreviations (defined abbreviations are paired with their full forms in the articles, whereas undefined ones are not). For defined abbreviations, they developed a set of pattern-matching rules to map an abbreviation to its full form and implemented the rules into a software program, AbbRE (for “abbreviation recognition and extraction”). Using the opinions of domain experts as a reference standard, they evaluated the recall and precision of AbbRE for defined abbreviations in ten biomedical articles randomly selected from the ten most frequently cited medical and biological journals. They also measured the percentage of undefined abbreviations in the same set of articles, and they investigated whether they could map undefined abbreviations to any of four public abbreviation databases (GenBank LocusLink, swissprot, LRABR of the UMLS Specialist Lexicon, and Bioabacus).
Results: AbbRE had an average 0.70 recall and 0.95 precision for the defined abbreviations. The authors found that an average of 25 percent of abbreviations were defined in biomedical articles and that of a randomly selected subset of undefined abbreviations, 68 percent could be mapped to any of four abbreviation databases. They also found that many abbreviations are ambiguous (i.e., they map to more than one full form in abbreviation databases).
Conclusion: AbbRE is efficient for mapping defined abbreviations. To couple AbbRE with abbreviation databases for the mapping of undefined abbreviations, not only exhaustive abbreviation databases but also a method to resolve the ambiguity of abbreviations in the databases are needed.
PMCID: PMC344586  PMID: 11971887
12.  Building Linked Open Data towards integration of biomedical scientific literature with DBpedia 
There is a growing need for efficient and integrated access to databases provided by diverse institutions. Using a linked data design pattern allows the diverse data on the Internet to be linked effectively and accessed efficiently by computers. Previously, we developed the Allie database, which stores pairs of abbreviations and long forms (LFs, or expanded forms) used in the life sciences. LFs define the semantics of abbreviations, and Allie provides a Web-based search service for researchers to look up the LF of an unfamiliar abbreviation. This service encounters two problems. First, it does not display each LF’s definition, which could help the user to disambiguate and learn the abbreviations more easily. Furthermore, there are too many LFs for us to prepare a full dictionary from scratch. On the other hand, DBpedia has made the contents of Wikipedia available in the Resource Description Framework (RDF), which is expected to contain a significant number of entries corresponding to LFs. Therefore, linking the Allie LFs to DBpedia entries may present a solution to the Allie’s problems. This requires a method that is capable of matching large numbers of string pairs within a reasonable period of time because Allie and DBpedia are frequently updated.
We built a Linked Open Data set that links LFs to DBpedia titles by applying key collision methods (i.e., fingerprint and n-gram fingerprint) to their literals, which are simple approximate string-matching methods. In addition, we used UMLS resources to normalise the life science terms. As a result, combining the key collision methods with the domain-specific resources performed best, and 44,027 LFs have links to DBpedia titles. We manually evaluated the accuracy of the string matching by randomly sampling 1200 LFs, and our approach achieved an F-measure of 0.98. In addition, our experiments revealed the following. (1) Performances were similar independently from the frequency of the LFs in MEDLINE. (2) There is a relationship (r2 = 0.96, P < 0.01) between the occurrence frequencies of LFs in MEDLINE and their presence probabilities in DBpedia titles.
The obtained results help Allie users locate the correct LFs. Because the methods are computationally simple and yield a high performance and because the most frequently used LFs in MEDLINE appear more often in DBpedia titles, we can continually and reasonably update the linked dataset to reflect the latest publications and additions to DBpedia. Joining LFs between scientific literature and DBpedia enables cross-resource exploration for mutual benefits.
PMCID: PMC3621846  PMID: 23497538
13.  Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation 
BMC Bioinformatics  2011;12:223.
Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD.
In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH heading to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS CUI linked to the MeSH heading. Each instance has been assigned a UMLS Concept Unique Identifier (CUI). We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set.
The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE.
We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to that of the results previously obtained by these algorithms on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods.
The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.
PMCID: PMC3123611  PMID: 21635749
14.  A study of abbreviations in MEDLINE abstracts. 
Abbreviations are widely used in writing, and the understanding of abbreviations is important for natural language processing applications. Abbreviations are not always defined in a document and they are highly ambiguous. A knowledge base that consists of abbreviations with their associated senses and a method to resolve the ambiguities are needed. In this paper, we studied the UMLS coverage, textual variants of senses, and the ambiguity of abbreviations in MEDLINE abstracts. We restricted our study to three-letter abbreviations which were defined using parenthetical expressions. When grouping similar expansions together and representing senses using groups, we found that after ignoring senses where the total number of occurrences within the corresponding group was less than 100, 82.8% of the senses matched the UMLS, covered over 93% of occurrences that were considered, and had an average of 7.74 expansions for each sense. Abbreviations are highly ambiguous: 81.2% of the abbreviations were ambiguous, and had an average of 16.6 senses. However, after ignoring senses with occurrences of less than 5, 64.6% of the abbreviations were ambiguous, and had an average of 4.91 senses.
PMCID: PMC2244212  PMID: 12463867
15.  Ambiguity of Human Gene Symbols in LocusLink and MEDLINE: Creating an Inventory and a Disambiguation Test Collection 
Genes are discovered almost on a daily basis and new names have to be found. Although there are guidelines for gene nomenclature, the naming process is highly creative. Human genes are often named with a gene symbol and a longer, more descriptive term; the short form is very often an abbreviation of the long form. Abbreviations in biomedical language are highly ambiguous, i.e., one gene symbol often refers to more than one gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE for the use of human gene symbols derived from LocusLink. It turns out that just over 40% of these symbols occur in MEDLINE, however, many of these occurrences are not related to genes. Along the process of making an inventory, a disambiguation test collection is constructed automatically.
PMCID: PMC1480234  PMID: 14728264
16.  FACTA: a text search engine for finding associated biomedical concepts 
Bioinformatics  2008;24(21):2559-2560.
Summary: FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank.
Availability: The system is available at
PMCID: PMC2572701  PMID: 18772154
17.  A Study of Abbreviations in Clinical Notes 
Various natural language processing (NLP) systems have been developed to unlock patient information from narrative clinical notes in order to support knowledge based applications such as error detection, surveillance and decision support. In many clinical notes, abbreviations are widely used without mention of their definitions, which is very different from the use of abbreviations in the biomedical literature. Thus, it is critical, but more challenging, for NLP systems to correctly interpret abbreviations in these notes. In this paper we describe a study of a two-step model for building a clinical abbreviation database: first, abbreviations in a text corpus were detected and then a sense inventory was built for those that were found. Four detection methods were developed and evaluated. Results showed that the best detection method had a precision of 91.4% and recall of 80.3%. A simple method was used to build sense inventories from two different knowledge sources: the Unified Medical Language System (UMLS) and a MEDLINE abbreviation database (ADAM). Evaluation showed the inventory from the UMLS appeared to be the more appropriate of the two for defining the sense of abbreviations, but was not ideal. It covered 35% of the senses and had an ambiguity rate of 40% for those that were covered. However, annotation by domain experts appears necessary for uncovered abbreviations and to determine the correct senses.
PMCID: PMC2655910  PMID: 18693951
18.  Collocation analysis for UMLS knowledge-based word sense disambiguation 
BMC Bioinformatics  2011;12(Suppl 3):S4.
The effectiveness of knowledge-based word sense disambiguation (WSD) approaches depends in part on the information available in the reference knowledge resource. Off the shelf, these resources are not optimized for WSD and might lack terms to model the context properly. In addition, they might include noisy terms which contribute to false positives in the disambiguation results.
We analyzed some collocation types which could improve the performance of knowledge-based disambiguation methods. Collocations are obtained by extracting candidate collocations from MEDLINE and then assigning them to one of the senses of an ambiguous word. We performed this assignment either using semantic group profiles or a knowledge-based disambiguation method. In addition to collocations, we used second-order features from a previously implemented approach.
Specifically, we measured the effect of these collocations in two knowledge-based WSD methods. The first method, AEC, uses the knowledge from the UMLS to collect examples from MEDLINE which are used to train a Naïve Bayes approach. The second method, MRD, builds a profile for each candidate sense based on the UMLS and compares the profile to the context of the ambiguous word.
We have used two WSD test sets which contain disambiguation cases which are mapped to UMLS concepts. The first one, the NLM WSD set, was developed manually by several domain experts and contains words with high frequency occurrence in MEDLINE. The second one, the MSH WSD set, was developed automatically using the MeSH indexing in MEDLINE. It contains a larger set of words and covers a larger number of UMLS semantic types.
The results indicate an improvement after the use of collocations, although the approaches have different performance depending on the data set. In the NLM WSD set, the improvement is larger for the MRD disambiguation method using second-order features. Assignment of collocations to a candidate sense based on UMLS semantic group profiles is more effective in the AEC method.
In the MSH WSD set, the increment in performance is modest for all the methods. Collocations combined with the MRD disambiguation method have the best performance. The MRD disambiguation method and second-order features provide an insignificant change in performance. The AEC disambiguation method gives a modest improvement in performance. Assignment of collocations to a candidate sense based on knowledge-based methods has better performance.
Collocations improve the performance of knowledge-based disambiguation methods, although results vary depending on the test set and method used. Generally, the AEC method is sensitive to query drift. Using AEC, just a few selected terms provide a large improvement in disambiguation performance. The MRD method handles noisy terms better but requires a larger set of terms to improve performance.
PMCID: PMC3111590  PMID: 21658291
19.  Abbreviation and Acronym Disambiguation in Clinical Discourse 
Use of abbreviations and acronyms is pervasive in clinical reports despite many efforts to limit the use of ambiguous and unsanctioned abbreviations and acronyms. Due to the fact that many abbreviations and acronyms are ambiguous with respect to their sense, complete and accurate text analysis is impossible without identification of the sense that was intended for a given abbreviation or acronym. We present the results of an experiment where we used the contexts harvested from the Internet through Google API to collect contextual data for a set of 8 acronyms found in clinical notes at the Mayo Clinic. We then used the contexts to disambiguate the sense of abbreviations in a manually annotated corpus.
PMCID: PMC1560669  PMID: 16779108
20.  Word Sense Disambiguation of Clinical Abbreviations with Hyperdimensional Computing 
AMIA Annual Symposium Proceedings  2013;2013:1007-1016.
Automated Word Sense Disambiguation in clinical documents is a prerequisite to accurate extraction of medical information. Emerging methods utilizing hyperdimensional computing present new approaches to this problem. In this paper, we evaluate one such approach, the Binary Spatter Code Word Sense Disambiguation algorithm, on 50 ambiguous abbreviation sets derived from clinical notes. This algorithm uses reversible vector transformations to encode ambiguous terms and their context-specific senses into vectors representing surrounding terms. The sense for a new context is then inferred from vectors representing the terms it contains. One-to-one BSC-WSD achieves average accuracy of 94.55% when considering the orientation and distance of neighboring terms relative to the target abbreviation, outperforming Support Vector Machine and Naïve Bayes classifiers. Furthermore, it is practical to deal with all 50 abbreviations in an identical manner using a single one-to-many BSC-WSD model with average accuracy of 93.91%, which is not possible with common machine learning algorithms.
PMCID: PMC3900125  PMID: 24551390
21.  A UMLS-based spell checker for natural language processing in vaccine safety 
The Institute of Medicine has identified patient safety as a key goal for health care in the United States. Detecting vaccine adverse events is an important public health activity that contributes to patient safety. Reports about adverse events following immunization (AEFI) from surveillance systems contain free-text components that can be analyzed using natural language processing. To extract Unified Medical Language System (UMLS) concepts from free text and classify AEFI reports based on concepts they contain, we first needed to clean the text by expanding abbreviations and shortcuts and correcting spelling errors. Our objective in this paper was to create a UMLS-based spelling error correction tool as a first step in the natural language processing (NLP) pipeline for AEFI reports.
We developed spell checking algorithms using open source tools. We used de-identified AEFI surveillance reports to create free-text data sets for analysis. After expansion of abbreviated clinical terms and shortcuts, we performed spelling correction in four steps: (1) error detection, (2) word list generation, (3) word list disambiguation and (4) error correction. We then measured the performance of the resulting spell checker by comparing it to manual correction.
We used 12,056 words to train the spell checker and tested its performance on 8,131 words. During testing, sensitivity, specificity, and positive predictive value (PPV) for the spell checker were 74% (95% CI: 74–75), 100% (95% CI: 100–100), and 47% (95% CI: 46%–48%), respectively.
We created a prototype spell checker that can be used to process AEFI reports. We used the UMLS Specialist Lexicon as the primary source of dictionary terms and the WordNet lexicon as a secondary source. We used the UMLS as a domain-specific source of dictionary terms to compare potentially misspelled words in the corpus. The prototype sensitivity was comparable to currently available tools, but the specificity was much superior. The slow processing speed may be improved by trimming it down to the most useful component algorithms. Other investigators may find the methods we developed useful for cleaning text using lexicons specific to their area of interest.
PMCID: PMC1805499  PMID: 17295907
22.  Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues 
BMC Bioinformatics  2006;7:334.
Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance.
Experiments were designed to measure the effect of "sample size" (i.e. size of the datasets), "sense distribution" (i.e. the distribution of the different meanings of the ambiguous word) and "degree of difficulty" (i.e. the measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations: BPD, BSA, PCA, and RSV, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an unusually large increase in sample size was needed to increase performance slightly, which was impractical, 2) the sense distribution did not have an effect on performance when the senses were separable, 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense, 4) error rates were proportional to the similarity of senses, and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated.
Several different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that guarantees the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report on the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizablility and the limitations of the methodology.
PMCID: PMC1550263  PMID: 16822321
23.  Allie: a database and a search service of abbreviations and long forms 
Many abbreviations are used in the literature especially in the life sciences, and polysemous abbreviations appear frequently, making it difficult to read and understand scientific papers that are outside of a reader’s expertise. Thus, we have developed Allie, a database and a search service of abbreviations and their long forms (a.k.a. full forms or definitions). Allie searches for abbreviations and their corresponding long forms in a database that we have generated based on all titles and abstracts in MEDLINE. When a user query matches an abbreviation, Allie returns all potential long forms of the query along with their bibliographic data (i.e. title and publication year). In addition, for each candidate, co-occurring abbreviations and a research field in which it frequently appears in the MEDLINE data are displayed. This function helps users learn about the context in which an abbreviation appears. To deal with synonymous long forms, we use a dictionary called GENA that contains domain-specific terms such as gene, protein or disease names along with their synonymic information. Conceptually identical domain-specific terms are regarded as one term, and then conceptually identical abbreviation-long form pairs are grouped taking into account their appearance in MEDLINE. To keep up with new abbreviations that are continuously introduced, Allie has an automatic update system. In addition, the database of abbreviations and their long forms with their corresponding PubMed IDs is constructed and updated weekly.
Database URL: The Allie service is available at
PMCID: PMC3077826  PMID: 21498548
24.  Automated Disambiguation of Acronyms and Abbreviations in Clinical Texts: Window and Training Size Considerations 
AMIA Annual Symposium Proceedings  2012;2012:1310-1319.
Acronyms and abbreviations within electronic clinical texts are widespread and often associated with multiple senses. Automated acronym sense disambiguation (WSD), a task of assigning the context-appropriate sense to ambiguous clinical acronyms and abbreviations, represents an active problem for medical natural language processing (NLP) systems. In this paper, fifty clinical acronyms and abbreviations with 500 samples each were studied using supervised machine-learning techniques (Support Vector Machines (SVM), Naïve Bayes (NB), and Decision Trees (DT)) to optimize the window size and orientation and determine the minimum training sample size needed for optimal performance. Our analysis of window size and orientation showed best performance using a larger left-sided and smaller right-sided window. To achieve an accuracy of over 90%, the minimum required training sample size was approximately 125 samples for SVM classifiers with inverted cross-validation. These findings support future work in clinical acronym and abbreviation WSD and require validation with other clinical texts.
PMCID: PMC3540435  PMID: 23304410
25.  A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation 
Objective: The aim of this study was to investigate relations among different aspects in supervised word sense disambiguation (WSD; supervised machine learning for disambiguating the sense of a term in a context) and compare supervised WSD in the biomedical domain with that in the general English domain. Methods:The study involves three data sets (a biomedical abbreviation data set, a general biomedical term data set, and a general English data set). The authors implemented three machine-learning algorithms, including (1) naïve Bayes (NBL) and decision lists (TDLL), (2) their adaptation of decision lists (ODLL), and (3) their mixed supervised learning (MSL). There were six feature representations (various combinations of collocations, bag of words, oriented bag of words, etc.) and five window sizes (2, 4, 6, 8, and 10). Results: Supervised WSD is suitable only when there are enough sense-tagged instances with at least a few dozens of instances for each sense. Collocations combined with neighboring words are appropriate selections for the context. For terms with unrelated biomedical senses, a large window size such as the whole paragraph should be used, while for general English words a moderate window size between 4 and 10 should be used. The performance of the authors' implementation of decision list classifiers for abbreviations was better than that of traditional decision list classifiers. However, the opposite held for the other two sets. Also, the authors' mixed supervised learning was stable and generally better than others for all sets. Conclusion: From this study, it was found that different aspects of supervised WSD depend on each other. The experiment method presented in the study can be used to select the best supervised WSD classifier for each ambiguous term.
PMCID: PMC436083  PMID: 15064284

Results 1-25 (438488)