Results 1-12 (12)

1.  Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation 
Natural language processing (NLP) tasks are commonly decomposed into subtasks, chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics using statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach integrating a novel lexical-generation probability rule used in a transformation-based learner to boost POS performance on clinical narratives.
Two target corpora from independent healthcare institutions were constructed from high frequency clinical narratives. Four leading POS taggers with their out-of-the-box models trained from general English and biomedical abstracts were evaluated against these clinical corpora. A high performing domain adaptation method, Easy Adapt, was compared to our newly proposed method ClinAdapt.
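As background on the Easy Adapt baseline named above: it is usually described as a feature-augmentation method in which every feature is copied into a shared ("general") version and a domain-specific version, so a standard learner can decide per feature whether to share weights across domains. The following Python sketch illustrates that idea only. It is not the implementation used in the paper, and the function name, the feature name, and the "source"/"target" labels are assumptions made for illustration.

```python
def easy_adapt(features, domain):
    """Return a sparse feature dict with shared and domain-specific copies.

    features: dict mapping feature name -> value (hypothetical POS features)
    domain:   "source" (e.g. general English) or "target" (e.g. clinical text)
    """
    if domain not in ("source", "target"):
        raise ValueError("domain must be 'source' or 'target'")
    augmented = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value   # copy shared by both domains
        augmented[f"{domain}:{name}"] = value  # copy seen only in this domain
    # The other domain's copy is implicitly zero in this sparse representation.
    return augmented


example = easy_adapt({"suffix=ing": 1}, "target")
```

A token from the clinical corpus would then carry both "general:suffix=ing" and "target:suffix=ing", while the same feature from the general-English corpus would carry "general:suffix=ing" and "source:suffix=ing".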
The evaluated POS taggers drop in accuracy by 8.5–15% when tested on clinical narratives. The highest performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3–91.0% on clinical texts. ClinAdapt reports 93.2–93.9%.
ClinAdapt successfully boosts POS tagging performance through domain adaptation requiring a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation leading to better overall results on complex processing tasks.
PMCID: PMC3756264  PMID: 23486109
Natural Language Processing; NLP; POS Tagging; Domain Adaptation; Clinical Narratives
2.  High rates of early treatment discontinuation in hepatitis C-infected US veterans 
BMC Research Notes  2014;7:266.
Patients with chronic hepatitis C (HCV) frequently discontinued dual therapy with pegylated interferon alfa (Peg-IFN) plus ribavirin (RBV) before reaching the recommended duration of 48 or 24 weeks for genotypes (G) 1/4 or 2/3, respectively. We quantified rates of discontinuation despite efficacy (non-LOE) versus lack of efficacy (LOE) versus discontinuation for unknown reasons in a national database of United States veterans.
We identified a population-based cohort of U.S. veterans with encounters from 2004 through 2009 who had lab-confirmed HCV infection and initiated therapy with Peg-IFN plus RBV in Veterans Health Administration medical centers. Pharmacy data were used to determine therapy duration, defined as the sum of Peg-IFN days supplied. Patients “discontinued” if they failed to receive at least 44 (G1/4) or 20 weeks (G2/3) of therapy. We classified discontinuations as due to non-LOE, LOE, or unknown reasons using a classification rule based on treatment duration and laboratory confirmed response.
Of 321,238 diagnosed HCV patients during the evaluation period, 9.7% initiated therapy and 6.4% met all other inclusion criteria. 54.9% of patients discontinued early; of these, 41.2% discontinued due to non-LOE reasons, 12.5% discontinued for LOE reasons, and 46.3% discontinued for unknown reasons. Among non-LOE discontinuers, most (60.1%) discontinued in the first 4 weeks of therapy, which constitutes 13.6% of all treated patients.
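The 13.6% figure follows from chaining the three reported proportions. A short Python check, using only the percentages stated in the abstract (the variable names are ours):

```python
# Proportions as reported in the abstract
discontinued_early = 0.549   # share of treated patients who discontinued early
non_loe_share = 0.412        # share of early discontinuers with non-LOE reasons
within_4_weeks = 0.601       # share of non-LOE discontinuers stopping in the first 4 weeks

# Share of all treated patients who stopped for non-LOE reasons within 4 weeks
share_of_all_treated = discontinued_early * non_loe_share * within_4_weeks
print(f"{share_of_all_treated:.1%}")  # prints 13.6%
```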
We observed a high proportion of early discontinuations with dual-therapy regimens in a national cohort of HCV-infected veterans. If this trend persists in the triple-therapy era, then efforts must be undertaken to improve adherence.
PMCID: PMC4012175  PMID: 24758162
Hepatitis C virus; Pegylated interferon; Discontinuation; Veterans
3.  Automated extraction of ejection fraction for quality measurement using regular expressions in Unstructured Information Management Architecture (UIMA) for heart failure 
Left ventricular ejection fraction (EF) is a key component of heart failure quality measures used within the Department of Veterans Affairs (VA). Our goals were to build a natural language processing system to extract the EF from free-text echocardiogram reports to automate measurement reporting and to validate the accuracy of the system using a comparison reference standard developed through human review. This project was a Translational Use Case Project within the VA Consortium for Healthcare Informatics.
Materials and methods
We created a set of regular expressions and rules to capture the EF using a random sample of 765 echocardiograms from seven VA medical centers. The documents were randomly assigned to two sets: a set of 275 used for training and a second set of 490 used for testing and validation. To establish the reference standard, two independent reviewers annotated all documents in both sets; a third reviewer adjudicated disagreements.
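The abstract does not reproduce the regular expressions themselves, and the system was built in UIMA. Purely as an illustration of the general approach, the Python sketch below matches an EF mention followed by a value or range. The pattern, the function names, and the choice to classify on the upper bound of a range are all assumptions, not the authors' rules.

```python
import re

# Hypothetical pattern: an EF mention, a short filler, then a value or range in percent.
EF_PATTERN = re.compile(
    r"\b(?:LVEF|EF|ejection\s+fraction)\b"     # concept mention
    r"[^0-9%\n]{0,40}?"                        # filler such as " is estimated at "
    r"(\d{1,2})(?:\s*(?:-|to)\s*(\d{1,2}))?"   # single value, or low-high range
    r"\s*%",
    re.IGNORECASE,
)


def extract_ef(text):
    """Return a (low, high) percentage pair for each EF mention found."""
    results = []
    for match in EF_PATTERN.finditer(text):
        low = int(match.group(1))
        high = int(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


def has_ef_below_40(text):
    """Document-level flag; uses the upper bound of a range (an assumption)."""
    return any(high < 40 for _, high in extract_ef(text))
```

For example, extract_ef("The left ventricular ejection fraction is estimated at 30-35%.") returns [(30, 35)], and has_ef_below_40 is true for that sentence.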
System test results for document-level classification of EF of <40% had a sensitivity (recall) of 98.41%, a specificity of 100%, a positive predictive value (precision) of 100%, and an F measure of 99.2%. System test results at the concept level had a sensitivity of 88.9% (95% CI 87.7% to 90.0%), a positive predictive value of 95% (95% CI 94.2% to 95.9%), and an F measure of 91.9% (95% CI 91.2% to 92.7%).
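The F measures above are consistent with the balanced (F1) harmonic mean of precision and recall. A minimal Python check, assuming that definition (the abstract does not state the weighting):

```python
def f_measure(precision, recall):
    """Balanced F measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Document level: precision 100%, recall 98.41%
document_level = f_measure(1.0, 0.9841)

# Concept level: precision 95%, recall 88.9% (inputs already rounded in the abstract)
concept_level = f_measure(0.95, 0.889)
```

The document-level value rounds to the reported 99.2%. The concept-level value comes out within about 0.1 percentage point of the reported 91.9%, which is expected because the published precision and recall are themselves rounded.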
An EF value of <40% can be accurately identified in VA echocardiogram reports.
An automated information extraction system can be used to accurately extract EF for quality measurement.
PMCID: PMC3422820  PMID: 22437073
Natural language processing (NLP); heart failure; left ventricular ejection fraction (EF); Improving healthcare workflow and process efficiency; applied informatics; Improving government and community policy relevant to informatics and health quality; process modeling and hypothesis generation; Informatics; Enhancing the conduct of biological/clinical research and trials; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); visualization of data and knowledge; uncertain reasoning and decision theory; languages and computational methods; statistical analysis of large datasets; advanced algorithms; discovery and text and data mining methods; other methods of information extraction; automated learning; human-computer interaction and human-centered computing; cognitive study (including experiments emphasizing verbal protocol analysis and usability); knowledge representations; knowledge acquisition and knowledge management; delivering health information and knowledge to the public; processing and display; analysis; image representation; controlled terminologies and vocabularies; ontologies; knowledge bases; ejection; fraction; machine learning; simulation of complex systems (at all levels: molecules to work groups to organizations); developing/using clinical decision support (other than diagnostic) and guideline systems; detecting disease outbreaks and biological threats
4.  Evaluation of record linkage between a large healthcare provider and the Utah Population Database 
Electronically linked datasets have become an important part of clinical research. Information from multiple sources can be used to identify comorbid conditions and patient outcomes, measure use of healthcare services, and enrich demographic and clinical variables of interest. Innovative approaches for creating research infrastructure beyond a traditional data system are necessary.
Materials and methods
Records from a large healthcare system's enterprise data warehouse (EDW) were linked to a statewide population database, and a master subject index was created. The authors evaluate the linkage, along with the impact of missing information in EDW records and the coverage of the population database. The makeup of the EDW and population database provides a subset of cancer records that exist in both resources, which allows a cancer-specific evaluation of the linkage.
About 3.4 million records (60.8%) in the EDW were linked to the population database with a minimum accuracy of 96.3%. It was estimated that approximately 24.8% of target records were absent from the population database, which enabled the effect of the amount and type of information missing from a record on the linkage to be estimated. However, 99% of the records from the oncology data mart linked; they had fewer missing fields and this correlated positively with the number of patient visits.
Discussion and conclusion
A general-purpose research infrastructure was created which allows disease-specific cohorts to be identified. The usefulness of creating an index between institutions is that it allows each institution to maintain control and confidentiality of their own information.
PMCID: PMC3392872  PMID: 21926112
Master subject index; record linking; confidentiality; cancer cohort; population database; informatics; statistics
5.  Creation and Storage of Standards-based Pre-scanning Patient Questionnaires in PACS as DICOM Objects 
Journal of Digital Imaging  2010;24(5):823-827.
Radiology departments around the country have completed the first evolution to digital imaging by becoming filmless. The next step in this evolution is to become truly paperless. Both patient and non-patient paperwork must be eliminated for this transition to occur. A paper-based set of patient pre-scanning questionnaires was replaced with web-based forms for use in an outpatient imaging center. We discuss the process by which questionnaire elements are converted into SNOMED-CT terminology concepts, stored for future use, and sent to PACS in Digital Imaging and Communications in Medicine (DICOM) format to be permanently stored with the relevant study in the DICOM image database.
PMCID: PMC3180552  PMID: 20976611
Paperless; Pseudo paperless; Filmless; SNOMED-CT; Data mining; Clinical workflow; Data collection
6.  2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text 
The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records presented three tasks: a concept extraction task focused on the extraction of medical concepts from patient reports; an assertion classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. i2b2 and the VA provided an annotated reference standard corpus for the three tasks. Using this reference standard, 22 systems were developed for concept extraction, 21 for assertion classification, and 16 for relation classification.
These systems showed that machine learning approaches could be augmented with rule-based systems to determine concepts, assertions, and relations. Depending on the task, the rule-based systems can either provide input for machine learning or post-process the output of machine learning. Ensembles of classifiers, information from unlabeled data, and external knowledge sources can help when the training data are inadequate.
PMCID: PMC3168320  PMID: 21685143
Information storage and retrieval (text and images); discovery, and text and data mining methods; other methods of information extraction; natural-language processing; automated learning; visualization of data and knowledge; uncertain reasoning and decision theory; languages, and computational methods; statistical analysis of large datasets; advanced algorithms; human-computer interaction and human-centered computing; NLP; machine learning; informatics
7.  Identification of methicillin-resistant Staphylococcus aureus within the Nation’s Veterans Affairs Medical Centers using natural language processing 
Accurate information is needed to direct healthcare systems’ efforts to control methicillin-resistant Staphylococcus aureus (MRSA). Assembling complete and correct microbiology data is vital to understanding and addressing the multiple drug-resistant organisms in our hospitals.
Herein, we describe a system that securely gathers microbiology data from the Department of Veterans Affairs (VA) network of databases. Using natural language processing methods, we applied an information extraction process to extract organisms and susceptibilities from the free-text data. We then validated the extraction against independently derived electronic data and expert annotation.
We estimate that the collected microbiology data are 98.5% complete and that methicillin-resistant Staphylococcus aureus was extracted accurately 99.7% of the time.
Applying natural language processing methods to microbiology records appears to be a promising way to extract accurate and useful nosocomial pathogen surveillance data. Both scientific inquiry and the data’s reliability will be dependent on the surveillance system’s capability to compare from multiple sources and circumvent systematic error. The dataset constructed and methods used for this investigation could contribute to a comprehensive infectious disease surveillance system or other pressing needs.
PMCID: PMC3394221  PMID: 22533507
8.  Qualitative Analysis of Workflow Modifications Used to Generate the Reference Standard for the 2010 i2b2/VA Challenge 
AMIA Annual Symposium Proceedings  2011;2011:1243-1251.
The Department of Veterans Affairs (VA) and the Informatics for Integrating Biology and the Bedside (i2b2) team partnered to generate the reference standard for the 2010 i2b2/VA challenge task on concept extraction, assertion classification, and relation classification. The purpose of this paper is to report an in-depth qualitative analysis of the experience and perceptions of human annotators for these tasks. Transcripts of semi-structured interviews were analyzed using qualitative methods to identify key constructs and themes related to these annotation tasks. Interventions were embedded with these tasks using pre-annotation of clinical concepts and a modified annotation workflow. From the human perspective, annotation tasks involve an inherent conflict between bias, accuracy, and efficiency. This analysis deepens understanding of the biases, complexities and impact of variations in the annotation process that may affect annotation task reliability and reference standard validity that are generalizable for other similar large-scale clinical corpus annotation projects.
PMCID: PMC3243132  PMID: 22195185
9.  Using Java to Generate Globally Unique Identifiers for DICOM Objects 
Digital imaging and communications in medicine (DICOM) specifies that all DICOM objects have globally unique identifiers (UIDs). Creating these UIDs can be a difficult task due to the variety of techniques in use and the requirement to ensure global uniqueness. We present a simple technique of combining a root organization identifier, assigned descriptive identifiers, and Java-generated unique identifiers to construct DICOM-compliant UIDs.
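The abstract does not say which Java facility produces the unique component, so the sketch below is only an illustration of the root + descriptors + unique-suffix structure. It is written in Python rather than Java, it substitutes uuid.uuid4() for the unique part, and the root "1.2.3.4" and the descriptor values are placeholders, not a registered organization root. The 64-character limit and the digits-and-dots format come from the DICOM UID rules.

```python
import re
import uuid

MAX_UID_LENGTH = 64  # DICOM limits a UID to 64 characters
# Numeric components separated by dots; no leading zeros unless the component is "0"
UID_PATTERN = re.compile(r"^(0|[1-9]\d*)(\.(0|[1-9]\d*))+$")


def make_uid(root, descriptors):
    """Build a UID from an organization root, descriptive ids, and a unique suffix.

    root:        dotted numeric organization root (placeholder in this sketch)
    descriptors: non-negative integers, e.g. a device id and an object-type code
    """
    unique = uuid.uuid4().int  # random 128-bit integer, at most 39 decimal digits
    parts = [root, *(str(d) for d in descriptors), str(unique)]
    uid = ".".join(parts)
    if len(uid) > MAX_UID_LENGTH or not UID_PATTERN.match(uid):
        raise ValueError(f"not a valid DICOM UID: {uid}")
    return uid


uid = make_uid("1.2.3.4", [7, 1])  # hypothetical root, device 7, object type 1
```

Because the random suffix can take up to 39 digits, the root and descriptors together must stay within the remaining characters, and the function raises an error rather than truncating.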
PMCID: PMC3043668  PMID: 17896137
Digital imaging and communications in medicine (DICOM); structured reporting; digital imaging
10.  Translating the IHE Teaching File and Clinical Trial Export (TCE) Profile Document Templates into Functional DICOM Structured Report Objects 
Journal of Digital Imaging  2007;21(4):390-407.
The Integrating the Healthcare Enterprise (IHE) Teaching File and Clinical Trial Export (TCE) integration profile describes a standard workflow for exporting key images from an image manager/archive to a teaching file, clinical trial, or electronic publication application. Two specific digital imaging and communication in medicine (DICOM) structured reports (SR) reference the key images and contain associated case information. This paper presents step-by-step instructions for translating the TCE document templates into functional and complete DICOM SR objects. Others will benefit from these instructions in developing TCE compliant applications.
PMCID: PMC3043848  PMID: 17805930
Digital imaging and communications in medicine (DICOM); integrating healthcare enterprise (IHE); extensible markup language (XML); electronic teaching file; clinical trial; electronic publishing
11.  Using Applet–Servlet Communication for Optimizing Window, Level and Crop for DICOM to JPEG Conversion 
Journal of Digital Imaging  2007;21(3):348-354.
In the creation of interesting radiological cases in a digital teaching file, it is necessary to adjust the window and level settings of an image to effectively display the educational focus. The web-based applet described in this paper presents an effective solution for real-time window and level adjustments without leaving the picture archiving and communications system workstation. Optimized images are created, as user-defined parameters are passed between the applet and a servlet on the Health Insurance Portability and Accountability Act-compliant teaching file server.
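For readers unfamiliar with window and level settings, the adjustment is commonly a linear mapping of a chosen intensity range onto the 8-bit display range used by JPEG. The Python sketch below shows a simplified form of that mapping. It is not the applet or servlet code from the paper, the function name and sample values are ours, and it omits the small offsets used in the DICOM standard's exact formula.

```python
def apply_window_level(pixels, window, level):
    """Map raw pixel values to 0-255 using a linear window/level transform.

    window: width of the displayed intensity range (must be positive)
    level:  center of the displayed intensity range
    """
    if window <= 0:
        raise ValueError("window must be positive")
    low = level - window / 2  # values at or below this display as black
    output = []
    for value in pixels:
        scaled = (value - low) / window * 255
        clipped = min(255.0, max(0.0, scaled))  # values above the window display as white
        output.append(int(round(clipped)))
    return output


# Hypothetical soft-tissue style setting: window 400, level 40
display = apply_window_level([-1000, -160, 240, 1000], window=400, level=40)
```

With these settings, values at or below -160 map to 0 and values at or above 240 map to 255, so display is [0, 0, 255, 255].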
PMCID: PMC3043843  PMID: 17534682
Electronic teaching file; image manipulation; web technology
