1.  Patient-level temporal aggregation for text-based asthma status ascertainment 
To specify the problem of patient-level temporal aggregation from clinical text and introduce several probabilistic methods for addressing that problem. The patient-level perspective differs from the prevailing natural language processing (NLP) practice of evaluating at the term, event, sentence, document, or visit level.
We utilized an existing pediatric asthma cohort with manual annotations. After generating a basic feature set via standard clinical NLP methods, we introduced six methods of aggregating time-distributed features from the document level to the patient level. These aggregation methods were used to classify patients according to their asthma status in two hypothetical settings: retrospective epidemiology and clinical decision support.
In both settings, machine learning algorithms achieved solid patient classification performance with several of the evidence aggregation methods: Sum aggregation obtained the highest F1 score (85.71%) in the retrospective epidemiology setting, and a probability density function-based method obtained the highest F1 score (74.63%) in the clinical decision support setting. Multiple techniques also estimated the diagnosis date (index date) of asthma with promising accuracy.
The clinical decision support setting is the more difficult problem. Because our preliminary data set reflected a practical setting in which manually annotated data were limited, we ruled out some aggregation methods rather than determining a single best overall method.
The results contrast the strengths of several aggregation algorithms in different settings. Multiple approaches exhibited good patient classification performance and also estimated the timing of diagnosis with reasonable accuracy.
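The Sum aggregation described above can be sketched minimally: sum document-level evidence scores per patient and threshold the total. The scores and threshold here are hypothetical, not the study's features.

```python
from collections import defaultdict

def sum_aggregate(doc_scores, threshold):
    """Sum document-level asthma evidence scores per patient, then
    classify patients whose total meets the threshold. Illustrative
    sketch only; real features would come from a clinical NLP pipeline."""
    totals = defaultdict(float)
    for patient_id, score in doc_scores:
        totals[patient_id] += score
    return {pid: total >= threshold for pid, total in totals.items()}

# Hypothetical (patient_id, evidence_score) pairs, one per document
docs = [("p1", 0.9), ("p1", 0.8), ("p2", 0.2), ("p2", 0.1), ("p3", 1.5)]
status = sum_aggregate(docs, threshold=1.0)
```

The other aggregation methods in the paper differ only in how the per-document evidence is combined, so they slot into the same patient-level loop.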
PMCID: PMC4147607  PMID: 24833775
Patient classification; Asthma epidemiology; Natural language processing; Information extraction
2.  Using Large Clinical Corpora for Query Expansion in Text-based Cohort Identification 
In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common sense approach of “use all available data” is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.
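A mixture of relevance models, at its core, linearly interpolates the expansion-term distributions estimated from each collection. This sketch assumes toy two-term distributions and illustrative weights, not the paper's estimated parameters.

```python
def mix_relevance_models(models, weights):
    """Linearly interpolate per-collection term distributions
    (the mixture-of-relevance-models idea, in sketch form)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    mixed = {}
    for model, w in zip(models, weights):
        for term, prob in model.items():
            mixed[term] = mixed.get(term, 0.0) + w * prob
    return mixed

# Hypothetical expansion-term distributions from two auxiliary collections
clinical = {"asthma": 0.6, "wheeze": 0.4}
general = {"asthma": 0.3, "inhaler": 0.7}
mixed = mix_relevance_models([clinical, general], [0.7, 0.3])
```

The finding that selective sampling beats "use all available data" corresponds to choosing which models enter the mixture and with what weights.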
PMCID: PMC4058413  PMID: 24680983
Cohort Identification; Information retrieval; Query expansion; Clinical text; Electronic Medical Records
3.  An ancient Chinese wisdom for metabolic engineering: Yin-Yang 
In ancient Chinese philosophy, Yin-Yang describes two contrary forces that are interconnected and interdependent. This concept also holds true in microbial cell factories, where Yin represents energy metabolism in the form of ATP, and Yang represents carbon metabolism. Current biotechnology can effectively edit the microbial genome or introduce novel enzymes to redirect carbon fluxes. On the other hand, microbial metabolism loses significant free energy as heat when converting sugar into ATP, while maintenance energy expenditures further aggravate the ATP shortage. This limitation of the cell's "powerhouse" prevents hosts from achieving high carbon yields and rates. Via an Escherichia coli flux balance analysis model, we further demonstrate the penalty of ATP cost on biofuel synthesis. To ensure that the cell's powerhouse is sufficient in microbial cell factories, we propose five principles: 1. Take advantage of native pathways for product synthesis. 2. Pursue biosynthesis relying only on pathways or genetic parts without significant ATP burden. 3. Combine microbial production with chemical conversions (semi-biosynthesis) to reduce biosynthesis steps. 4. Create "minimal cells" or use non-model microbial hosts with higher energy fitness. 5. Develop a photosynthesis chassis that can utilize light energy and cheap carbon feedstocks. Meanwhile, metabolic flux analysis can be used to quantify both carbon and energy metabolisms. The fluxomics results are essential for evaluating the industrial potential of laboratory strains, avoiding false starts and dead ends during metabolic engineering.
PMCID: PMC4374363  PMID: 25889067
ATP; Energy metabolism; Flux analysis; Free energy; Maintenance loss; Semi-biosynthesis
4.  Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium 
Research objective
To develop scalable informatics infrastructure for normalization of both structured and unstructured electronic health record (EHR) data into a unified, concept-based model for high-throughput phenotype extraction.
Materials and methods
Software tools and applications were developed to extract information from EHRs. Representative and convenience samples of both structured and unstructured data from two EHR systems—Mayo Clinic and Intermountain Healthcare—were used for development and validation. Extracted information was standardized and normalized to meaningful use (MU) conformant terminology and value set standards using Clinical Element Models (CEMs). These resources were used to demonstrate semi-automatic execution of MU clinical-quality measures modeled using the Quality Data Model (QDM) and an open-source rules engine.
Using CEMs and open-source natural language processing and terminology services engines—namely, Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Common Terminology Services (CTS2)—we developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. We demonstrated the applicability of this platform by executing a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL on a randomly selected cohort of 273 Mayo Clinic patients. The platform identified 21 and 18 patients for the denominator and numerator of the quality measure, respectively. Validation results indicate that all identified patients meet the QDM-based criteria.
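The executed quality measure reduces to a denominator/numerator computation over normalized patient records. This sketch uses hypothetical field names, not the CEM or QDM schemas.

```python
def ldl_quality_measure(patients, year=2012):
    """QDM-style measure sketch: denominator = diabetic patients aged
    18-75; numerator = those whose most recent LDL test in the
    measurement year was < 100 mg/dL. Record layout is hypothetical."""
    denominator, numerator = [], []
    for p in patients:
        if p["diabetes"] and 18 <= p["age"] <= 75:
            denominator.append(p["id"])
            tests = [t for t in p["ldl_tests"] if t["year"] == year]
            if tests and max(tests, key=lambda t: t["date"])["value"] < 100:
                numerator.append(p["id"])
    return denominator, numerator

patients = [
    {"id": 1, "age": 60, "diabetes": True,
     "ldl_tests": [{"year": 2012, "date": "2012-03-01", "value": 95}]},
    {"id": 2, "age": 45, "diabetes": True,
     "ldl_tests": [{"year": 2012, "date": "2012-05-01", "value": 130}]},
    {"id": 3, "age": 80, "diabetes": True, "ldl_tests": []},
]
denominator, numerator = ldl_quality_measure(patients)
```

The point of the normalization platform is that this logic can run unchanged over data from any institution once records are mapped into a common model.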
End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and terminologies, as well as robust information models for storing, discovering, and processing that information. This study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent format for high-throughput phenotyping to identify patient cohorts.
PMCID: PMC3861933  PMID: 24190931
Electronic health record; Meaningful Use; Normalization; Natural Language Processing; Phenotype Extraction
5.  Negation’s Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing 
PLoS ONE  2014;9(11):e112774.
A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been “solved.” This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.
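For contrast with the machine-learning Polarity Module, a minimal rule-based negation check in the NegEx tradition can be sketched as follows; the trigger list and window size are illustrative assumptions.

```python
NEG_TRIGGERS = {"no", "denies", "without", "negative"}

def is_negated(sentence, concept, window=5):
    """Minimal NegEx-style rule: is the concept token preceded by a
    negation trigger within a small token window? A sketch of the
    rule-based baseline, not the paper's Polarity Module."""
    tokens = sentence.lower().split()
    if concept not in tokens:
        return False
    idx = tokens.index(concept)
    return any(t in NEG_TRIGGERS for t in tokens[max(0, idx - window):idx])
```

The generalizability problem shows up precisely in pieces like `NEG_TRIGGERS`: a trigger list tuned on one corpus silently misses the negation conventions of another.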
PMCID: PMC4231086  PMID: 25393544
6.  Automated chart review for asthma cohort identification using natural language processing: an exploratory study 
A significant proportion of children with asthma have delayed diagnosis of asthma by health care providers. Manual chart review according to established criteria is more accurate than directly using diagnosis codes, which tend to under-identify asthmatics, but chart reviews are more costly and less timely.
To evaluate the accuracy of a computational approach to asthma ascertainment, characterizing its utility and feasibility toward large-scale deployment in electronic medical records.
A natural language processing (NLP) system was developed for extracting predetermined criteria for asthma from unstructured text in electronic medical records and then inferring asthma status based on these criteria. Using manual chart reviews as a gold standard, asthma status (yes vs no) and identification date (first date of a “yes” asthma status) were determined by the NLP system.
Patients were a group of children (n = 112, 84% Caucasian, 49% girls) younger than 4 years (mean 2.0 years, standard deviation 1.03 years) who participated in previous studies. The NLP approach to asthma ascertainment showed sensitivity, specificity, positive predictive value, negative predictive value, and median delay in diagnosis of 84.6%, 96.5%, 88.0%, 95.4%, and 0 months, respectively; this compared favorably with diagnosis codes, at 30.8%, 93.2%, 57.1%, 82.2%, and 2.3 months, respectively.
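The four reported metrics all derive from a 2x2 confusion matrix against the chart-review gold standard. The counts below are hypothetical illustrations, not the study's actual matrix.

```python
def clf_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion
    matrix -- the metrics reported for the NLP system above."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical counts chosen for illustration
m = clf_metrics(tp=22, fp=3, tn=82, fn=4)
```

Note that PPV and NPV depend on the asthma prevalence in the cohort, so they transfer to other populations less directly than sensitivity and specificity do.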
Automated asthma ascertainment from electronic medical records using NLP is feasible and more accurate than traditional approaches such as diagnosis codes. Considering the difficulty of labor-intensive manual record review, NLP approaches for asthma ascertainment should be considered for improving clinical care and research, especially in large-scale efforts.
PMCID: PMC3839107  PMID: 24125142
7.  MACE: model based analysis of ChIP-exo 
Nucleic Acids Research  2014;42(20):e156.
Understanding the role of a given transcription factor (TF) in regulating gene expression requires precise mapping of its binding sites in the genome. Chromatin immunoprecipitation-exo, an emerging technique using λ exonuclease to digest TF unbound DNA after ChIP, is designed to reveal transcription factor binding site (TFBS) boundaries with near-single nucleotide resolution. Although ChIP-exo promises deeper insights into transcription regulation, no dedicated bioinformatics tool exists to leverage its advantages. Most ChIP-seq and ChIP-chip analytic methods are not tailored for ChIP-exo, and thus cannot take full advantage of high-resolution ChIP-exo data. Here we describe a novel analysis framework, termed MACE (model-based analysis of ChIP-exo) dedicated to ChIP-exo data analysis. The MACE workflow consists of four steps: (i) sequencing data normalization and bias correction; (ii) signal consolidation and noise reduction; (iii) single-nucleotide resolution border peak detection using the Chebyshev Inequality and (iv) border matching using the Gale-Shapley stable matching algorithm. When applied to published human CTCF, yeast Reb1 and our own mouse ONECUT1/HNF6 ChIP-exo data, MACE is able to define TFBSs with high sensitivity, specificity and spatial resolution, as evidenced by multiple criteria including motif enrichment, sequence conservation, direct sequence pileup, nucleosome positioning and open chromatin states. In addition, we show that the fundamental advance of MACE is the identification of two boundaries of a TFBS with high resolution, whereas other methods only report a single location of the same event. The two boundaries help elucidate the in vivo binding structure of a given TF, e.g. whether the TF may bind as dimers or in a complex with other co-factors.
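Step (iii) of the MACE workflow rests on Chebyshev's inequality, which bounds the probability of a value deviating more than k standard deviations from the mean by 1/k². A minimal sketch of that border-detection idea, on a toy read-depth signal (not the full MACE model):

```python
import statistics

def chebyshev_outliers(signal, max_tail_prob=0.25):
    """Flag positions whose value deviates from the mean by more than
    k standard deviations, where Chebyshev's inequality guarantees the
    tail probability is at most 1/k**2 <= max_tail_prob."""
    k = (1.0 / max_tail_prob) ** 0.5
    mean = statistics.mean(signal)
    sd = statistics.pstdev(signal)
    return [i for i, x in enumerate(signal) if abs(x - mean) > k * sd]

# Toy per-position read depth with one sharp border at index 3
depth = [2, 3, 2, 40, 3, 2, 2, 3]
peaks = chebyshev_outliers(depth)
```

Because Chebyshev's bound holds for any distribution, the detector needs no assumption about the shape of the ChIP-exo signal, only its mean and variance.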
PMCID: PMC4227761  PMID: 25249628
8.  Comparative Analysis of Online Health Queries Originating From Personal Computers and Smart Devices on a Consumer Health Information Portal 
The number of people using the Internet and mobile/smart devices for health information seeking is increasing rapidly. Although the user experience for online health information seeking varies with the device used, for example, smart devices (SDs) like smartphones/tablets versus personal computers (PCs) like desktops/laptops, very few studies have investigated how online health information seeking behavior (OHISB) may differ by device.
The objective of this study is to examine differences in OHISB between PCs and SDs through a comparative analysis of large-scale health search queries submitted through Web search engines from both types of devices.
Using the Web analytics tool, IBM NetInsight OnDemand, and based on the type of devices used (PCs or SDs), we obtained the most frequent health search queries between June 2011 and May 2013 that were submitted on Web search engines and directed users to the Mayo Clinic’s consumer health information website. We performed analyses on “Queries with considering repetition counts (QwR)” and “Queries without considering repetition counts (QwoR)”. The dataset contains (1) 2.74 million and 3.94 million QwoR, respectively for PCs and SDs, and (2) more than 100 million QwR for both PCs and SDs. We analyzed structural properties of the queries (length of the search queries, usage of query operators and special characters in health queries), types of search queries (keyword-based, wh-questions, yes/no questions), categorization of the queries based on health categories and information mentioned in the queries (gender, age-groups, temporal references), misspellings in the health queries, and the linguistic structure of the health queries.
Query strings used for health information searching via PCs and SDs differ by almost 50%. The most searched health categories are "Symptoms" (1 in 3 search queries), "Causes", and "Treatments & Drugs". The distribution of search queries across health categories differs with the device used for the search. Health queries tend to be longer and more specific than general search queries. Health queries from SDs are longer and have slightly fewer spelling mistakes than those from PCs. Users specify words related to women and children more often than words related to men or other age groups. Most health queries are formulated using keywords; the second-most common forms are wh- and yes/no questions. Users ask more health questions using SDs than PCs. Almost all health queries have at least one noun, and health queries from SDs are more descriptive than those from PCs.
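The keyword / wh-question / yes/no breakdown used above can be approximated with a first-token heuristic. The word lists are illustrative assumptions, not the study's classifier.

```python
WH_WORDS = {"what", "how", "why", "when", "where", "which", "who"}
YESNO_STARTS = {"is", "are", "can", "does", "do", "will", "should"}

def query_type(query):
    """Rough classifier for the query-type categories analyzed above:
    wh-question, yes/no question, or keyword query."""
    first = query.lower().split()[0]
    if first in WH_WORDS:
        return "wh-question"
    if first in YESNO_STARTS:
        return "yes/no"
    return "keyword"
```

A production analysis would also need to handle misspellings and query operators, both of which the study measures separately.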
This study is a large-scale comparative analysis of health search queries to understand the effects of device type (PCs vs SDs) used on OHISB. The study indicates that the device used for online health information search plays an important role in shaping how health information searches by consumers and patients are executed.
PMCID: PMC4115262  PMID: 25000537
online health information seeking; health information search; eHealth; mHealth; search query analysis; health search log; mobile health; health seeking behavior
9.  Open Source Clinical NLP – More than Any Single System 
The number of Natural Language Processing (NLP) tools and systems for processing clinical free-text has grown as interest and processing capability have surged. Unfortunately, any two systems typically cannot simply interoperate, even when both are built upon a framework designed to facilitate the creation of pluggable components. We present two ongoing activities promoting open source clinical NLP. The Open Health Natural Language Processing (OHNLP) Consortium was originally founded to foster a collaborative community around clinical NLP, releasing UIMA-based open source software. OHNLP’s mission currently includes maintaining a catalog of clinical NLP software and providing interfaces to simplify the interaction of NLP systems. Meanwhile, Apache cTAKES aims to integrate best-of-breed annotators, providing a world-class NLP system for accessing clinical information within free-text. These two activities are complementary: OHNLP promotes open source clinical NLP activities in the research community, and Apache cTAKES bridges research to health information technology (HIT) practice.
PMCID: PMC4419764  PMID: 25954581
10.  Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules 
This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.
Materials and methods
The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.
The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set.
A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data, especially given an insufficient number of examples. On the other hand, a completely deterministic system can suffer a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.
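The multi-pass sieve idea, ordering deterministic rules from most to least precise so that early links constrain later ones, can be sketched as below. The mentions and the single exact-match sieve are hypothetical stand-ins for the system's rule set.

```python
def sieve_resolve(mentions, sieves):
    """Apply deterministic sieves in decreasing order of precision;
    each sieve may link still-unresolved mentions to an antecedent.
    A sketch of the multi-pass architecture, not MedCoref itself."""
    links = {}
    for sieve in sieves:  # most precise rules run first
        for m in mentions:
            if m not in links:
                antecedent = sieve(m, mentions, links)
                if antecedent is not None:
                    links[m] = antecedent
    return links

def exact_match(m, mentions, links):
    """High-precision sieve: link to an earlier, case-insensitively
    identical mention."""
    for earlier in mentions[: mentions.index(m)]:
        if earlier.lower() == m.lower():
            return earlier
    return None

mentions = ["the tumor", "The Tumor", "it"]
links = sieve_resolve(mentions, [exact_match])
```

In the full system, the final pronoun sieve ("it" above, left unresolved here) is where the machine-learning alternative is swapped in.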
Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at
PMCID: PMC3422831  PMID: 22707745
Natural language processing; machine learning; information extraction; electronic medical record; coreference resolution; text mining; computational linguistics; named entity recognition; distributional semantics; relationship extraction; information storage and retrieval (text and images)
11.  Using Empirically Constructed Lexical Resources for Named Entity Recognition 
Biomedical Informatics Insights  2013;6(Suppl 1):17-27.
Because of privacy concerns and the expense of creating annotated corpora, existing small annotated corpora might not have sufficient examples for learning to statistically extract all named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when using machine learning for named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM) regions, and term clustering, all of which are considered distributional semantic features. Adding the n-nearest words feature produced a greater increase in F-score over a baseline system than adding a manually constructed lexicon did. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons, but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes.
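The n-nearest-words feature ranks words by the cosine similarity of their context vectors. The vectors below are toy co-occurrence counts, not ones learned from a real corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def n_nearest(word, vectors, n=2):
    """The 'n-nearest words' distributional feature: the n words whose
    context vectors are most similar to the target word's."""
    others = [(w, cosine(vectors[word], v))
              for w, v in vectors.items() if w != word]
    others.sort(key=lambda pair: -pair[1])
    return [w for w, _ in others[:n]]

# Toy co-occurrence vectors over three hypothetical context dimensions
vectors = {
    "aspirin":   [5, 1, 0],
    "ibuprofen": [4, 1, 0],
    "fever":     [1, 5, 1],
    "cough":     [0, 4, 2],
}
nearest = n_nearest("aspirin", vectors, n=1)
```

Feeding such nearest-neighbor lists to the NER model lets an unseen drug name inherit evidence from distributionally similar words the model has seen.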
PMCID: PMC3702195  PMID: 23847424
natural language processing; distributional semantics; concept extraction; named entity recognition; empirical lexical resources
12.  Computational Semantics in Clinical Text 
Biomedical Informatics Insights  2013;6(Suppl 1):3-5.
PMCID: PMC3702196  PMID: 23847422
13.  Analysis of Cross-Institutional Medication Description Patterns in Clinical Narratives 
Biomedical Informatics Insights  2013;6(Suppl 1):7-16.
A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common.
PMCID: PMC3702197  PMID: 23847423
medication extraction; electronic medical record; natural language processing
14.  Workflow-based Data Reconciliation for Clinical Decision Support: Case of Colorectal Cancer Screening and Surveillance  
A major barrier for computer-based clinical decision support (CDS) is the difficulty of obtaining the patient information required for decision making. The information gap is often due to deficiencies in the clinical documentation. One approach to addressing this gap is to gather and reconcile data from related documents or data sources. In this paper we consider the case of a CDS system for colorectal cancer screening and surveillance. We describe the use of workflow analysis to design data reconciliation processes. Further, we perform a quantitative analysis of the impact of these processes on system performance using a dataset of 106 patients. Results show that data reconciliation considerably improves the performance of the system. Our study demonstrates that workflow-based data reconciliation can play a vital role in designing new-generation CDS systems that are based on complex guideline models and use natural language processing (NLP) to obtain patient data.
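One simple reconciliation policy consistent with the approach described, taking a field's value from the most recent document that actually records it, can be sketched as follows; the record layout is a hypothetical simplification.

```python
def reconcile(records, field):
    """Reconciliation sketch: return the value of `field` from the most
    recent document that contains it, or None if no document does."""
    dated = [r for r in records if r.get(field) is not None]
    if not dated:
        return None
    return max(dated, key=lambda r: r["date"])[field]

# Hypothetical documents; an intermediate note omits the field entirely
docs = [
    {"date": "2012-01-10", "last_colonoscopy": "2008-06-01"},
    {"date": "2012-04-02", "last_colonoscopy": None},
    {"date": "2012-03-15", "last_colonoscopy": "2011-09-20"},
]
value = reconcile(docs, "last_colonoscopy")
```

The workflow analysis in the paper determines *which* documents and sources feed such a policy for each guideline data element.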
PMCID: PMC3845748  PMID: 24303280
15.  An Information Extraction Framework for Cohort Identification Using Electronic Health Records  
Information extraction (IE), a natural language processing (NLP) task that automatically extracts structured or semi-structured information from free text, has become popular in the clinical domain for supporting automated systems at point-of-care and enabling secondary use of electronic health records (EHRs) for clinical and translational research. However, a high performance IE system can be very challenging to construct due to the complexity and dynamic nature of human language. In this paper, we report an IE framework for cohort identification using EHRs that is a knowledge-driven framework developed under the Unstructured Information Management Architecture (UIMA). A system to extract specific information can be developed by subject matter experts through expert knowledge engineering of the externalized knowledge resources used in the framework.
PMCID: PMC3845757  PMID: 24303255
16.  Enhancing clinical concept extraction with distributional semantics 
Journal of Biomedical Informatics  2011;45(1):129-140.
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text.
The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract the mentions of medical problems, treatments and tests from clinical narratives. It takes advantage of all Medline abstracts indexed as being of the publication type “clinical trials” to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to the traditional features such as dictionary matching, pattern matching and part-of-speech tags, we also used as a feature words that appear in similar contexts to the word in question (that is, words that have a similar vector representation measured with the commonly used cosine metric, where vector representations are derived using methods of distributional semantics). To the best of our knowledge, this is the first effort exploring the use of distributional semantics, the semantics derived empirically from unannotated text often using vector space models, for a sequence classification task such as concept extraction. Therefore, we first experimented with different sliding window models and found the model with parameters that led to best performance in a preliminary sequence labeling task.
The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared to a supervised-only approach as a baseline, the micro-averaged f-measure for exact match increased from 80.3% to 82.3% and the micro-averaged f-measure based on inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and also considering the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
PMCID: PMC3272090  PMID: 22085698
NLP; Information extraction; NER; Distributional Semantics; Clinical Informatics
17.  A common type system for clinical natural language processing 
One challenge in reusing clinical data stored in electronic medical records is that these data are heterogeneous. Clinical Natural Language Processing (NLP) plays an important role in transforming information in clinical text to a standard representation that is comparable and interoperable. Information may be processed and shared when a type system specifies the allowable data structures. Therefore, we aim to define a common type system for clinical NLP that enables interoperability between structured and unstructured data generated in different clinical settings.
We describe a common type system for clinical NLP that has an end target of deep semantics based on Clinical Element Models (CEMs), thus interoperating with structured data and accommodating diverse NLP approaches. The type system has been implemented in UIMA (Unstructured Information Management Architecture) and is fully functional in a popular open-source clinical NLP system, cTAKES (clinical Text Analysis and Knowledge Extraction System) versions 2.0 and later.
We have created a type system that targets deep semantics, allowing NLP systems to encapsulate knowledge from text and share it alongside heterogeneous clinical data sources. Rather than the surface semantics that are typically the end product of NLP algorithms, CEM-based semantics explicitly build in deep clinical semantics as the point of interoperability with more structured data types.
PMCID: PMC3575354  PMID: 23286462
Natural Language Processing; Standards and interoperability; Clinical information extraction; Clinical Element Models; Common type system
18.  Towards a semantic lexicon for clinical natural language processing 
A semantic lexicon which associates words and phrases in text with concepts is critical for extracting and encoding clinical information in free text, and therefore for achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may give limited coverage of concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases are distributed in a large corpus and how well the UMLS captures their semantics. A corpus-driven semantic lexicon, MedLex, has been constructed, with semantics based on the UMLS and supplemented with variants and usage information mined from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows that the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation for capturing clinical information comprehensively. The study also yields some insights into developing practical NLP systems.
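The coverage statistic underlying this kind of corpus analysis, the fraction of corpus tokens matched by a lexicon, is straightforward to compute. The corpus and lexicon below are toy stand-ins.

```python
def lexicon_coverage(tokens, lexicon):
    """Fraction of corpus tokens covered by a lexicon -- the kind of
    token-level coverage statistic the corpus analysis above reports."""
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return hits / len(tokens)

corpus = "patient reports severe headache and nausea".split()
lexicon = {"patient", "headache", "nausea"}
coverage = lexicon_coverage(corpus, lexicon)
```

Tracking how this number rises as variants mined from clinical text are added to the lexicon is one way to quantify MedLex's gain over the raw UMLS.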
PMCID: PMC3540492  PMID: 23304329
19.  A Cognitive-Behavioral Treatment for Irritable Bowel Syndrome Using Interoceptive Exposure to Visceral Sensations 
Behaviour research and therapy  2011;49(6-7):413-421.
Irritable bowel syndrome (IBS) is a chronic and debilitating medical condition with few efficacious pharmacological or psychosocial treatment options available. Evidence suggests that visceral anxiety may be implicated in IBS onset and severity. Thus, cognitive behavioral treatment (CBT) that targets visceral anxiety may alleviate IBS symptoms.
The current study examined the efficacy of a CBT protocol for the treatment of IBS which directly targeted visceral sensations. Participants (N = 110) were randomized to receive 10 sessions of either: (a) CBT with interoceptive exposure to visceral sensations (IE); (b) stress management (SM); or (c) an attention control (AC), and were assessed at baseline, mid-treatment, post-treatment, and follow-up sessions.
Consistent with hypotheses, the IE group outperformed AC on several indices of outcome, and outperformed SM in some domains. No differences were observed between SM and AC. The results suggest that IE may be a particularly efficacious treatment for IBS.
Implications for research and clinical practice are discussed.
PMCID: PMC3100429  PMID: 21565328
20.  Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis 
To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.
Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.
For the corpus analysis, a negligible number of mapped terms in the Mayo corpus had more than six words or 55 characters. Of the source terminologies in the UMLS, the Consumer Health Vocabulary and the Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes, at 106 426 and 94 788 unique terms, respectively. Of the 15 semantic groups in the UMLS, seven accounted for 92.08% of term occurrences in the Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, applying five example filters to the i2b2/VA data reduced the lexicon to 19.13% of the size of the UMLS while reducing matched terms by only 2%.
The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well-formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
PMCID: PMC3392861  PMID: 22493050
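The length thresholds reported above suggest a straightforward way to prune a UMLS-derived lexicon before matching. A minimal sketch in Python, assuming the six-word/55-character cut-offs from the Mayo analysis (the function name and toy terms are illustrative, not from the paper):

```python
def keep_term(term: str, max_words: int = 6, max_chars: int = 55) -> bool:
    """Keep a term only if it falls within the length thresholds
    observed for mapped terms in the corpus analysis."""
    return len(term) <= max_chars and len(term.split()) <= max_words

# Toy lexicon: the second entry exceeds both thresholds and is pruned.
lexicon = [
    "myocardial infarction",
    "structure of posterior horn of lateral meniscus of left knee joint",
]
filtered = [t for t in lexicon if keep_term(t)]
```

Pruning by such intrinsic features is cheap and, per the findings above, transfers across institutions better than frequency-based filters.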
21.  Using SNOMED-CT to encode summary level data – a corpus analysis 
Extracting and encoding clinical information captured in free text with standard medical terminologies is vital to enabling secondary use of electronic medical records (EMRs) for clinical decision support, improved patient safety, and clinical/translational research. A critical portion of free text comprises ‘summary level’ information in the form of problem lists, diagnoses, and reasons for visit. We conducted a systematic analysis of how well SNOMED-CT represents summary level information, utilizing a large collection of summary level data in the form of itemized entries. Results indicate that about 80% of the entries can be encoded with SNOMED-CT normalized phrases. When one unmapped token is tolerated, 96% of the itemized entries can be encoded with SNOMED-CT concepts. The study provides a solid foundation for developing an automated system to encode summary level data using SNOMED-CT.
PMCID: PMC3392059  PMID: 22779045
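The one-unmapped-token tolerance described above can be sketched as a simple coverage check. A hypothetical Python illustration, assuming entries are pre-tokenized and that tokens covered by SNOMED-CT phrases are available as a set (all names and the toy vocabulary are invented for illustration):

```python
def encodable(entry_tokens, mapped_vocab, tolerance=1):
    """An itemized entry counts as encodable if at most `tolerance`
    of its tokens have no SNOMED-CT mapping."""
    unmapped = [t for t in entry_tokens if t.lower() not in mapped_vocab]
    return len(unmapped) <= tolerance

# Toy vocabulary of tokens covered by SNOMED-CT phrases.
vocab = {"chronic", "kidney", "disease", "stage"}
entry = "chronic kidney disease stage III".split()
encodable(entry, vocab)  # "III" is the single unmapped token
```

With `tolerance=0` this models strict normalized-phrase matching (the ~80% case above); with `tolerance=1` it models the relaxed 96% case.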
22.  Dependency Parser-based Negation Detection in Clinical Narratives 
Negation of clinical named entities is common in clinical documents and is a crucial factor in accurately compiling patients’ clinical conditions and in supporting complex phenotype detection. In 2009, Mayo Clinic released the clinical Text Analysis and Knowledge Extraction System (cTAKES), which includes a negation annotator that identifies the negation status of a named entity by searching for negation words within a fixed word distance. However, this strategy is not sophisticated enough to correctly identify complicated patterns of negation. This paper investigates whether the dependency structure from the cTAKES dependency parser can improve negation detection performance. Manually compiled negation rules derived from dependency paths were tested. Dependency negation rules do not limit the negation scope by word distance; instead, they are based on syntactic context. We found that dependency-based negation is a superior alternative to the current cTAKES negation annotator.
PMCID: PMC3392064  PMID: 22779038
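The contrast between window-based and dependency-based negation can be illustrated with a toy example. The sketch below is not the cTAKES implementation; the parse, cue list, and rule are simplified assumptions, checking whether a negation cue governs the entity anywhere along its path of heads:

```python
# Toy dependency parse for "Patient denies chest pain.":
# token -> (head, relation). Assumed parse, for illustration only.
parse = {
    "patient": ("denies", "nsubj"),
    "denies":  ("ROOT", "root"),
    "chest":   ("pain", "nn"),
    "pain":    ("denies", "dobj"),
}
NEG_CUES = {"denies", "no", "without", "negative"}

def path_to_root(token, parse):
    """Collect the chain of heads from a token up to the root."""
    path = []
    while token in parse and parse[token][0] != "ROOT":
        token = parse[token][0]
        path.append(token)
    return path

def is_negated(entity_head, parse):
    """A dependency-style rule: the entity is negated if any head
    governing it is a negation cue, regardless of word distance."""
    return any(head in NEG_CUES for head in path_to_root(entity_head, parse))
```

Because the rule follows syntactic governance rather than a fixed word window, intervening modifiers ("chest") or long distances between cue and entity do not break it.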
23.  Semantic Characteristics of NLP-extracted Concepts in Clinical Notes vs. Biomedical Literature 
AMIA Annual Symposium Proceedings  2011;2011:1550-1558.
Natural language processing (NLP) has become crucial in unlocking information stored in free text, from both clinical notes and biomedical literature. Clinical notes convey clinical information related to individual patient health care, while biomedical literature communicates scientific findings. This work focuses on semantic characterization of texts at an enterprise scale, comparing and contrasting the two domains and their NLP approaches. We analyzed the empirical distributional characteristics of NLP-discovered named entities in Mayo Clinic clinical notes from 2001–2010, and in the 2011 MetaMapped Medline Baseline. We give qualitative and quantitative measures of domain similarity and point to the feasibility of transferring resources and techniques. An important by-product for this study is the development of a weighted ontology for each domain, which gives distributional semantic information that may be used to improve NLP applications.
PMCID: PMC3243230  PMID: 22195220
24.  Introduction of a Self-report Version of the Prescription Drug Use Questionnaire and Relationship to Medication Agreement Non-Compliance 
The Prescription Drug Use Questionnaire (PDUQ) is one of several published tools developed to help clinicians better identify the presence of opioid abuse or dependence in patients with chronic pain. This paper introduces a patient version of the PDUQ (PDUQp), a 31-item questionnaire derived from the items of the original tool and designed for self-administration, and describes evidence for its validity and reliability in a sample of patients with chronic nonmalignant pain on opioid therapy. Further, this study examines instances of discontinuation from opioid medication treatment related to violation of the medication agreement (MAVRD) in this population, and their relationship with problematic opioid misuse behaviors and with PDUQ and PDUQp scores. A sample of 135 consecutive patients with chronic nonmalignant pain was recruited from a multidisciplinary Veterans Affairs chronic pain clinic and prospectively followed over one year of opioid therapy. Utilizing the PDUQ as a criterion measure, moderate to good concurrent and predictive validity data for the PDUQp are presented, as well as an item-by-item comparison of the two formats. Reliability data indicate moderate test stability over time. Of those patients whose opioid treatment was discontinued due to MAVRD (n = 38; 28% of the sample), 40% (n = 11) were discontinued for specific problematic opioid misuse behaviors. Based upon specificity and sensitivity analyses, a suggested cut-off PDUQp score for predicting MAVRD is provided. This study supports the PDUQp as a useful tool for assessing and predicting problematic opioid medication use in a chronic pain patient sample.
PMCID: PMC2630195  PMID: 18508231
Chronic nonmalignant pain; opioid medications; substance use disorder; problematic opioid use and/or misuse; medication agreements
25.  General Nitrogen Regulation of Nitrate Assimilation Regulatory Gene nasR Expression in Klebsiella oxytoca M5al 
Journal of Bacteriology  1999;181(23):7274-7284.
Klebsiella oxytoca can assimilate nitrate and nitrite by using enzymes encoded by the nasFEDCBA operon. Expression of the nasF operon is controlled by general nitrogen regulation (Ntr) via the NtrC transcription activator and by pathway-specific nitrate and nitrite induction via the NasR transcription antiterminator. This paper reports our analysis of nasR gene expression. We constructed strains bearing single-copy Φ(nasR-lacZ) operon fusions within the chromosomal rhaBAD-rhaSR locus. The expression of ΔrhaBS::[Φ(nasR-lacZ)] operon fusions was induced about 10-fold during nitrogen-limited growth. Induction was reduced in both ntrC and rpoN null mutants, indicating that Ntr control of nasR gene expression requires the NtrC and σN (σ54) proteins. Sequence inspection of the nasR control region reveals an apparent σN-dependent promoter but no apparent NtrC protein binding sites. Analysis of site-specific mutations coupled with primer extension analysis authenticated the σN-dependent nasR promoter. Fusion constructs with only about 70 nucleotides (nt) upstream of the transcription initiation site exhibited patterns of β-galactosidase expression indistinguishable from Φ(nasR-lacZ) constructs with about 470 nt upstream. Expression was independent of the Nac protein, implying that NtrC is a direct activator of nasR transcription. Together, these results indicate that nasR gene expression does not require specific upstream NtrC-binding sequences, as previously noted for argT gene expression in Salmonella typhimurium (G. Schmitz, K. Nikaido, and G. F.-L. Ames, Mol. Gen. Genet. 215:107–117, 1988).
PMCID: PMC103690  PMID: 10572131
