Results 1-21 (21)

1.  Patient-level temporal aggregation for text-based asthma status ascertainment 
To specify the problem of patient-level temporal aggregation from clinical text and introduce several probabilistic methods for addressing that problem. The patient-level perspective differs from the prevailing natural language processing (NLP) practice of evaluating at the term, event, sentence, document, or visit level.
We utilized an existing pediatric asthma cohort with manual annotations. After generating a basic feature set via standard clinical NLP methods, we introduce six methods of aggregating time-distributed features from the document level to the patient level. These aggregation methods are used to classify patients according to their asthma status in two hypothetical settings: retrospective epidemiology and clinical decision support.
In both settings, solid patient classification performance was obtained with machine learning algorithms on a number of evidence aggregation methods, with Sum aggregation obtaining the highest F1 score of 85.71% on the retrospective epidemiological setting, and a probability density function-based method obtaining the highest F1 score of 74.63% on the clinical decision support setting. Multiple techniques also estimated the diagnosis date (index date) of asthma with promising accuracy.
The clinical decision support setting is a more difficult problem. We rule out some aggregation methods rather than determining the best overall aggregation method, since our preliminary data set represented a practical setting in which manually annotated data were limited.
Results contrasted the strengths of several aggregation algorithms in different settings. Multiple approaches exhibited good patient classification performance, and also predicted the timing of estimates with reasonable accuracy.
PMCID: PMC4147607  PMID: 24833775
Patient classification; Asthma epidemiology; Natural language processing; Information extraction
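As a rough illustration of the Sum aggregation named in the abstract above, the sketch below adds document-level feature counts into one vector per patient. The input format, the feature names, and the function name `sum_aggregate` are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
from collections import defaultdict

def sum_aggregate(doc_features):
    """Sum document-level feature counts into one vector per patient.

    doc_features: iterable of (patient_id, {feature_name: count}) pairs,
    one pair per clinical document (hypothetical input format).
    """
    patients = defaultdict(lambda: defaultdict(int))
    for patient_id, features in doc_features:
        for name, count in features.items():
            patients[patient_id][name] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {pid: dict(feats) for pid, feats in patients.items()}

# Hypothetical documents: two notes for patient p1, one for p2.
docs = [
    ("p1", {"wheezing": 1, "albuterol": 1}),
    ("p1", {"wheezing": 2}),
    ("p2", {"cough": 1}),
]
patient_vectors = sum_aggregate(docs)
```

The resulting per-patient vectors could then be passed to any standard classifier. Summation discards the timing of each document, which is presumably why the other aggregation methods in the paper treat time explicitly.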
2.  MedXN: an open source medication extraction and normalization tool for clinical text 
We developed the Medication Extraction and Normalization (MedXN) system to extract comprehensive medication information and normalize it to the most appropriate RxNorm concept unique identifier (RxCUI) as specifically as possible.
Medication descriptions in clinical notes were decomposed into medication name and attributes, which were separately extracted using RxNorm dictionary lookup and regular expression. Then, each medication name and its attributes were combined together according to RxNorm convention to find the most appropriate RxNorm representation. To do this, we employed serialized hierarchical steps implemented in Apache's Unstructured Information Management Architecture. We also performed synonym expansion, removed false medications, and employed inference rules to improve the medication extraction and normalization performance.
An evaluation on test data of 397 medication mentions showed F-measures of 0.975 for medication name and over 0.90 for most attributes. The RxCUI assignment produced F-measures of 0.932 for medication name and 0.864 for full medication information. Most false negative RxCUI assignments in full medication information are due to human assumption of missing attributes and medication names in the gold standard.
The MedXN system was able to extract comprehensive medication information with high accuracy and demonstrated good normalization capability to RxCUI as long as explicit evidence existed. More sophisticated inference rules might result in further improvements to specific RxCUI assignments for incomplete medication descriptions.
PMCID: PMC4147619  PMID: 24637954
Medication Extraction; Medication Normalization; Electronic Medical Records; RxNorm; Natural Language Processing
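The dictionary-lookup-plus-regular-expression approach described in the abstract above might look roughly like the following sketch. The patterns, the three-drug name set standing in for an RxNorm dictionary, and the function name `extract_medication` are all hypothetical; none of this is MedXN's actual rule set, and no RxCUI normalization is attempted.

```python
import re

# Hypothetical, deliberately small pattern set; not MedXN's actual rules.
STRENGTH = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b", re.IGNORECASE)
FREQUENCY = re.compile(r"\b(once daily|twice daily|daily|bid|tid|qid)\b", re.IGNORECASE)
ROUTE = re.compile(r"\b(oral|by mouth|po|iv|topical)\b", re.IGNORECASE)
DRUG_NAMES = {"aspirin", "metformin", "lisinopril"}  # stand-in for an RxNorm dictionary

def extract_medication(text):
    """Return the medication name and attributes found in one description."""
    lowered = text.lower()
    # Dictionary lookup for the medication name (first match in sorted order).
    name = next(
        (d for d in sorted(DRUG_NAMES) if re.search(rf"\b{re.escape(d)}\b", lowered)),
        None,
    )
    # Regular expressions for the attributes.
    strength = STRENGTH.search(text)
    route = ROUTE.search(text)
    frequency = FREQUENCY.search(text)
    return {
        "name": name,
        "strength": f"{strength.group(1)} {strength.group(2).lower()}" if strength else None,
        "route": route.group(1).lower() if route else None,
        "frequency": frequency.group(1).lower() if frequency else None,
    }

example = extract_medication("Metformin 500 mg by mouth twice daily")
```

Extracting the name and attributes separately, as the abstract describes, keeps each pattern simple; combining them into a single normalized representation is a later step that this sketch leaves out.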
3.  Drug Normalization for Cancer Therapeutic and Druggable Genome Target Discovery 
Heterogeneous drug data representation among different druggable genome knowledge resources and datasets delays effective cancer therapeutic target discovery within the broad scientific community. The objective of the present paper is to describe the challenges and lessons learned from our efforts in developing and evaluating a standards-based drug normalization framework targeting cancer druggable genome datasets. Our findings suggested that mechanisms need to be established to deal with spelling errors and irregularities in normalizing clinical drug data in The Cancer Genome Atlas (TCGA), whereas the annotations from NCI Thesaurus (NCIt) and PubChem are two layers of normalization that potentially bridge between the clinical phenotypes and the druggable genome knowledge for effective cancer therapeutic target discovery.
PMCID: PMC4525232  PMID: 26306243
4.  Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium 
Research objective
To develop scalable informatics infrastructure for normalization of both structured and unstructured electronic health record (EHR) data into a unified, concept-based model for high-throughput phenotype extraction.
Materials and methods
Software tools and applications were developed to extract information from EHRs. Representative and convenience samples of both structured and unstructured data from two EHR systems—Mayo Clinic and Intermountain Healthcare—were used for development and validation. Extracted information was standardized and normalized to meaningful use (MU) conformant terminology and value set standards using Clinical Element Models (CEMs). These resources were used to demonstrate semi-automatic execution of MU clinical-quality measures modeled using the Quality Data Model (QDM) and an open-source rules engine.
Using CEMs and open-source natural language processing and terminology services engines—namely, Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Common Terminology Services (CTS2)—we developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. We demonstrated the applicability of this platform by executing a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL on a randomly selected cohort of 273 Mayo Clinic patients. The platform identified 21 and 18 patients for the denominator and numerator of the quality measure, respectively. Validation results indicate that all identified patients meet the QDM-based criteria.
End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and terminologies, as well as robust information models for storing, discovering, and processing that information. This study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent format for high-throughput phenotyping to identify patient cohorts.
PMCID: PMC3861933  PMID: 24190931
Electronic health record; Meaningful Use; Normalization; Natural Language Processing; Phenotype Extraction
5.  Analysis of Medication and Indication Occurrences in Clinical Notes 
AMIA Annual Symposium Proceedings  2014;2014:1046-1055.
A medication indication is a valid reason to use medication. Comprehensive information on medication and its intended indications has valuable potential applications for patient treatments, quality improvements, and clinical decision support. Though there are some publicly available medication resources, this medication and indication information consists primarily of labeled uses approved by the FDA. Additionally, linking those medications and the corresponding indications is not easy to accomplish. Furthermore, research that analyzes actual medication and indication occurrences used in real clinical practice is limited. In this study, we compiled clinician-asserted medication and indication pairs from a large cohort of Mayo Clinic electronic medical records (EMRs) and normalized them to the standard forms (ie, medication to the RxNorm ingredient and indication to SNOMED-CT). We then analyzed medication and indication occurrences and compared them with the public resource in various ways, including off-label statistics.
PMCID: PMC4419933  PMID: 25954414
6.  Automated chart review for asthma cohort identification using natural language processing: an exploratory study 
A significant proportion of children with asthma have delayed diagnosis of asthma by health care providers. Manual chart review according to established criteria is more accurate than directly using diagnosis codes, which tend to under-identify asthmatics, but chart reviews are more costly and less timely.
To evaluate the accuracy of a computational approach to asthma ascertainment, characterizing its utility and feasibility toward large-scale deployment in electronic medical records.
A natural language processing (NLP) system was developed for extracting predetermined criteria for asthma from unstructured text in electronic medical records and then inferring asthma status based on these criteria. Using manual chart reviews as a gold standard, asthma status (yes vs no) and identification date (first date of a “yes” asthma status) were determined by the NLP system.
Patients were a group of children (n = 112, 84% Caucasian, 49% girls) younger than 4 years (mean 2.0 years, standard deviation 1.03 years) who participated in previous studies. The NLP approach to asthma ascertainment showed sensitivity, specificity, positive predictive value, negative predictive value, and median delay in diagnosis of 84.6%, 96.5%, 88.0%, 95.4%, and 0 months, respectively; this compared favorably with diagnosis codes, at 30.8%, 93.2%, 57.1%, 82.2%, and 2.3 months, respectively.
Automated asthma ascertainment from electronic medical records using NLP is feasible and more accurate than traditional approaches such as diagnosis codes. Considering the difficulty of labor-intensive manual record review, NLP approaches for asthma ascertainment should be considered for improving clinical care and research, especially in large-scale efforts.
PMCID: PMC3839107  PMID: 24125142
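For readers less familiar with the four accuracy measures reported above, the sketch below computes them from a 2x2 confusion matrix. The counts are an assumption: the abstract gives only n = 112 and the percentages, and these values were chosen because they sum to 112 and reproduce the reported NLP figures after rounding. They should not be read as the study's actual table.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard diagnostic accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Illustrative counts for n = 112; assumed, not reported in the abstract.
metrics = diagnostic_metrics(tp=22, fp=3, fn=4, tn=83)
percentages = {name: round(value * 100, 1) for name, value in metrics.items()}
```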
7.  Comprehensive temporal information detection from clinical text: medical events, time, and TLINK identification 
Temporal information detection systems have been developed by the Mayo Clinic for the 2012 i2b2 Natural Language Processing Challenge.
To construct automated systems for EVENT/TIMEX3 extraction and temporal link (TLINK) identification from clinical text.
Materials and methods
The i2b2 organizers provided 190 annotated discharge summaries as the training set and 120 discharge summaries as the test set. Our Event system used a conditional random field classifier with a variety of features including lexical information, natural language elements, and medical ontology. The TIMEX3 system employed a rule-based method using regular expression pattern match and systematic reasoning to determine normalized values. The TLINK system employed both rule-based reasoning and machine learning. All three systems were built in an Apache Unstructured Information Management Architecture framework.
Our TIMEX3 system performed the best (F-measure of 0.900, value accuracy 0.731) among the challenge teams. The Event system produced an F-measure of 0.870, and the TLINK system an F-measure of 0.537.
Our TIMEX3 system demonstrated good capability of regular expression rules to extract and normalize time information. Event and TLINK machine learning systems required well-defined feature sets to perform well. We could also leverage expert knowledge as part of the machine learning features to further improve TLINK identification performance.
PMCID: PMC3756269  PMID: 23558168
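A rule-based time-expression extractor of the kind the abstract above describes can be sketched with two regular expressions that find dates and normalize them to ISO 8601 values. The patterns, the US month/day ordering, and the function name `normalize_dates` are assumptions for illustration; the actual TIMEX3 system covers far more expression types and reasoning steps.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Assumed formats: MM/DD/YYYY (US ordering) and "Month D, YYYY".
DATE_NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
DATE_TEXT = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b", re.IGNORECASE)

def normalize_dates(text):
    """Return (surface string, ISO date) pairs in order of appearance."""
    found = []
    for m in DATE_NUMERIC.finditer(text):
        month, day, year = int(m.group(1)), int(m.group(2)), int(m.group(3))
        found.append((m.start(), m.group(0), f"{year:04d}-{month:02d}-{day:02d}"))
    for m in DATE_TEXT.finditer(text):
        month = MONTHS[m.group(1).lower()]
        day, year = int(m.group(2)), int(m.group(3))
        found.append((m.start(), m.group(0), f"{year:04d}-{month:02d}-{day:02d}"))
    found.sort()  # order by position in the text
    return [(surface, value) for _, surface, value in found]

dates = normalize_dates("Admitted 03/14/2012, discharged March 20, 2012.")
```

This sketch does not validate that the day and month are in range, and it ignores relative expressions such as "two days later", which require the kind of reasoning the abstract mentions.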
8.  Facilitating post-surgical complication detection through sublanguage analysis 
Identification of postsurgical complications is the first step towards improving patient safety and health care quality as well as reducing health care cost. Existing NLP-based approaches for retrieving postsurgical complications are based on search strategies. Here, we conduct a sublanguage analysis study using free text reports available for a cohort of patients with postsurgical complications identified manually to compare the keywords identified by subject matter experts with words/phrases automatically identified by sublanguage analysis. The results suggest that search-based approaches may miss some cases and the sublanguage analysis results can be used as a base to develop an information extraction system or support search-based NLP approaches by augmenting search queries.
PMCID: PMC4333707  PMID: 25717405
9.  Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules 
This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.
Materials and methods
The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.
The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set.
A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.
Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at
PMCID: PMC3422831  PMID: 22707745
Natural language processing; machine learning; information extraction; electronic medical record; coreference resolution; text mining; computational linguistics; named entity recognition; distributional semantics; relationship extraction; information storage and retrieval (text and images)
10.  Analysis of Cross-Institutional Medication Description Patterns in Clinical Narratives 
Biomedical Informatics Insights  2013;6(Suppl 1):7-16.
A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common.
PMCID: PMC3702197  PMID: 23847423
medication extraction; electronic medical record; natural language processing
11.  Identifying Abdominal Aortic Aneurysm Cases and Controls using Natural Language Processing of Radiology Reports  
Prevalence of abdominal aortic aneurysm (AAA) is increasing due to longer life expectancy and implementation of screening programs. Patient-specific longitudinal measurements of AAA are important to understand pathophysiology of disease development and modifiers of abdominal aortic size. In this paper, we applied natural language processing (NLP) techniques to process radiology reports and developed a rule-based algorithm to identify AAA patients and also extract the corresponding aneurysm size with the examination date. AAA patient cohorts were determined by a hierarchical approach that: 1) selected potential AAA reports using keywords; 2) classified reports into AAA-case vs. non-case using rules; and 3) determined the AAA patient cohort based on a report-level classification. Our system was built in an Unstructured Information Management Architecture framework that allows efficient use of existing NLP components. Our system produced an F-score of 0.961 for AAA-case report classification with an accuracy of 0.984 for aneurysm size extraction.
PMCID: PMC3845740  PMID: 24303276
12.  Workflow-based Data Reconciliation for Clinical Decision Support: Case of Colorectal Cancer Screening and Surveillance  
A major barrier for computer-based clinical decision support (CDS) is the difficulty in obtaining the patient information required for decision making. The information gap is often due to deficiencies in the clinical documentation. One approach to address this gap is to gather and reconcile data from related documents or data sources. In this paper we consider the case of a CDS system for colorectal cancer screening and surveillance. We describe the use of workflow analysis to design data reconciliation processes. Further, we perform a quantitative analysis of the impact of these processes on system performance using a dataset of 106 patients. Results show that data reconciliation considerably improves the performance of the system. Our study demonstrates that workflow-based data reconciliation can play a vital role in designing new-generation CDS systems that are based on complex guideline models and use natural language processing (NLP) to obtain patient data.
PMCID: PMC3845748  PMID: 24303280
13.  An Information Extraction Framework for Cohort Identification Using Electronic Health Records  
Information extraction (IE), a natural language processing (NLP) task that automatically extracts structured or semi-structured information from free text, has become popular in the clinical domain for supporting automated systems at point-of-care and enabling secondary use of electronic health records (EHRs) for clinical and translational research. However, a high performance IE system can be very challenging to construct due to the complexity and dynamic nature of human language. In this paper, we report an IE framework for cohort identification using EHRs that is a knowledge-driven framework developed under the Unstructured Information Management Architecture (UIMA). A system to extract specific information can be developed by subject matter experts through expert knowledge engineering of the externalized knowledge resources used in the framework.
PMCID: PMC3845757  PMID: 24303255
14.  Drug side effect extraction from clinical narratives of psychiatry and psychology patients 
To extract physician-asserted drug side effects from electronic medical record clinical narratives.
Materials and methods
Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System.
The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction.
The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction.
Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.
PMCID: PMC3241172  PMID: 21946242
Natural language processing; machine learning; information extraction; electronic medical record; Information storage and retrieval (text and images); discovery; and text and data mining methods; Other methods of information extraction; Natural-language processing; bioinformatics; Ontologies; Knowledge representations, Controlled terminologies and vocabularies; Information Retrieval; HIT Data Standards; Human-computer interaction and human-centered computing; Providing just-in-time access to the biomedical literature and other health information; Applications that link biomedical knowledge from diverse primary sources (includes automated indexing); Linking the genotype and phenotype
15.  Towards a semantic lexicon for clinical natural language processing 
A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.
PMCID: PMC3540492  PMID: 23304329
16.  Dependency Parser-based Negation Detection in Clinical Narratives 
Negation of clinical named entities is common in clinical documents and is a crucial factor to accurately compile patients’ clinical conditions and to further support complex phenotype detection. In 2009, Mayo Clinic released the clinical Text Analysis and Knowledge Extraction System (cTAKES), which includes a negation annotator that identifies negation status of a named entity by searching for negation words within a fixed word distance. However, this negation strategy is not sophisticated enough to correctly identify complicated patterns of negation. This paper aims to investigate whether the dependency structure from the cTAKES dependency parser can improve the negation detection performance. Manually compiled negation rules derived from dependency paths were tested. Dependency negation rules do not limit the negation scope to word distance; instead, they are based on syntactic context. We found that dependency-based negation was a superior alternative to the current cTAKES negation annotator.
PMCID: PMC3392064  PMID: 22779038
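To make the limitation of fixed-distance negation concrete, here is a minimal sketch of a window-based baseline of the general kind the abstract above describes. The negation word list, the backward-only window, and the function name `is_negated` are assumptions; this is not the cTAKES annotator's code.

```python
# Hypothetical negation cue list; real systems use much larger lexicons.
NEGATION_WORDS = {"no", "not", "denies", "without", "negative"}

def is_negated(tokens, entity_index, window=4):
    """Return True if a negation word occurs within `window` tokens before the entity."""
    start = max(0, entity_index - window)
    return any(tok.lower() in NEGATION_WORDS for tok in tokens[start:entity_index])

tokens = "Patient denies chest pain but reports chronic cough".split()

pain_negated = is_negated(tokens, 3)             # "pain": "denies" is in range
cough_negated = is_negated(tokens, 7, window=4)  # "cough": "denies" is out of range
cough_wide = is_negated(tokens, 7, window=6)     # a wider window wrongly negates "cough"
```

The last call shows the trade-off: widening the window to catch distant negations also captures cues that belong to a different clause, which is the kind of error that rules based on syntactic context aim to avoid.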
17.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications 
We aim to build and evaluate an open-source natural language processing system for information extraction from electronic medical record clinical free-text. We describe and evaluate our system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at The cTAKES builds on existing open-source technologies—the Unstructured Information Management Architecture framework and OpenNLP natural language processing toolkit. Its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations. Performance of individual components: sentence boundary detector accuracy=0.949; tokenizer accuracy=0.949; part-of-speech tagger accuracy=0.936; shallow parser F-score=0.924; named entity recognizer and system-level evaluation F-score=0.715 for exact and 0.824 for overlapping spans, and accuracy for concept mapping, negation, and status attributes for exact and overlapping spans of 0.957, 0.943, 0.859, and 0.580, 0.939, and 0.839, respectively. Overall performance is discussed against five applications. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free-text.
PMCID: PMC2995668  PMID: 20819853
18.  Classification of Medication Status Change in Clinical Narratives 
The patient’s medication history and status changes play essential roles in medical treatment. A notable amount of medication status information typically resides in unstructured clinical narratives that require a sophisticated approach to automated classification. In this paper, we investigated rule-based and machine learning methods of medication status change classification from clinical free text. We also examined the impact of balancing training data in machine learning classification when using the data from skewed class distribution.
PMCID: PMC3041444  PMID: 21347081
19.  Mayo Clinic Smoking Status Classification System: Extensions and Improvements 
This paper describes improvements of and extensions to the Mayo Clinic 2006 smoking status classification system. The new system aims at addressing some of the limitations of the previous one. The performance improvements were mainly achieved through remodeling the negation detection for non-smoker, temporal resolution to distinguish a past and current smoker, and improved detection of the smoking status category of unknown. In addition, we introduced a rule-based component for patient-level smoking status assignments in which the individual smoking statuses of all clinical documents for a given patient are aggregated and analyzed to produce the final patient smoking status. The enhanced system builds upon components from Mayo’s clinical Text Analysis and Knowledge Extraction System developed within IBM’s Unstructured Information Management Architecture framework. This reusability minimized the development effort. The extended system is in use to identify smoking status risk factors for a peripheral artery disease NHGRI study.
PMCID: PMC2815365  PMID: 20351929
20.  Optimal Training Sets for Bayesian Prediction of MeSH® Assignment 
The aim of this study was to improve naïve Bayes prediction of Medical Subject Headings (MeSH) assignment to documents using optimal training sets found by an active learning inspired method.
The authors selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH term, they found an optimal training set, a subset of the whole training set. An optimal training set consists of all documents including a given MeSH term (C₁ class) and those documents not including a given MeSH term (C₋₁ class) that are closest to the C₁ class. These small sets were used to predict MeSH assignments in the MEDLINE® database.
Average precision was used to compare MeSH assignment using the naïve Bayes learner trained on the whole training set, optimal sets, and random sets. The authors compared 95% lower confidence limits of average precisions of naïve Bayes with upper bounds for average precisions of a K-nearest neighbor (KNN) classifier.
For all 20 MeSH assignments, the optimal training sets produced nearly 200% improvement over use of the whole training sets. In 17 of those MeSH assignments, naïve Bayes using optimal training sets was statistically better than a KNN. In 15 of those, optimal training sets performed better than optimized feature selection. Overall naïve Bayes averaged 14% better than a KNN for all 20 MeSH assignments. Using these optimal sets with another classifier, C-modified least squares (CMLS), produced an additional 6% improvement over naïve Bayes.
Using a smaller optimal training set greatly improved learning with naïve Bayes. The performance is superior to a KNN. The small training set can be used with other sophisticated learning methods, such as CMLS, where using the whole training set would not be feasible.
PMCID: PMC2442263  PMID: 18436913
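The training-set construction described in the abstract above (all positive documents plus the negative documents closest to the positive class) can be sketched as follows. Measuring closeness as squared Euclidean distance to the positive-class centroid is an assumption made here for simplicity; the abstract does not say how closeness was defined, and the function names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_training_set(docs, labels, n_negatives):
    """Keep every positive (label 1) and the n_negatives nearest negatives (label -1)."""
    positives = [d for d, y in zip(docs, labels) if y == 1]
    negatives = [d for d, y in zip(docs, labels) if y == -1]
    center = centroid(positives)
    # Assumed closeness measure: distance from each negative to the positive centroid.
    nearest = sorted(negatives, key=lambda d: squared_distance(d, center))[:n_negatives]
    return positives, nearest

# Toy 2-D document vectors (hypothetical).
docs = [[1.0, 1.0], [1.0, 0.0], [0.9, 0.6], [0.0, 0.0], [5.0, 5.0]]
labels = [1, 1, -1, -1, -1]
positives, nearest_negatives = select_training_set(docs, labels, n_negatives=2)
```

Dropping distant negatives, such as `[5.0, 5.0]` here, shrinks the training set while keeping the examples nearest the class boundary.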
21.  Abbreviation definition identification based on automatic precision estimates 
BMC Bioinformatics  2008;9:402.
The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation.
On the Medstract corpus our algorithm produced 97% precision and 85% recall which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well-known Schwartz and Hearst algorithm.
We developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.
PMCID: PMC2576267  PMID: 18817555
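As one example of what a single identification strategy might look like, the sketch below pairs a parenthesized abbreviation with the preceding words when their first letters spell it out. This is only an illustrative strategy under stated assumptions; it is not the paper's algorithm, it does not compute pseudo-precision, and the function name `first_letter_definition` is hypothetical.

```python
import re

# A parenthesized run of 2 to 10 letters is treated as a candidate abbreviation.
PAREN = re.compile(r"\(([A-Za-z]{2,10})\)")

def first_letter_definition(text):
    """Return (abbreviation, definition) pairs found by first-letter matching."""
    pairs = []
    for m in PAREN.finditer(text):
        short = m.group(1)
        words = text[:m.start()].split()
        # Take as many preceding words as the abbreviation has letters.
        candidate = words[-len(short):]
        if len(candidate) == len(short) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, short)
        ):
            pairs.append((short, " ".join(candidate)))
    return pairs

pairs = first_letter_definition(
    "We used natural language processing (NLP) on clinical notes.")
```

A strategy this strict misses definitions that skip function words or use inner letters, which is one reason an algorithm would combine several strategies and order them by estimated reliability.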