Natural language processing (NLP) tasks are commonly decomposed into subtasks, chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics using statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach integrating a novel lexical-generation probability rule used in a transformation-based learner to boost POS performance on clinical narratives.
Two target corpora from independent healthcare institutions were constructed from high frequency clinical narratives. Four leading POS taggers with their out-of-the-box models trained from general English and biomedical abstracts were evaluated against these clinical corpora. A high performing domain adaptation method, Easy Adapt, was compared to our newly proposed method ClinAdapt.
The evaluated POS taggers drop in accuracy by 8.5–15% when tested on clinical narratives. The highest performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3–91.0% on clinical texts. ClinAdapt reports 93.2–93.9%.
ClinAdapt successfully boosts POS tagging performance through domain adaptation requiring a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation leading to better overall results on complex processing tasks.
Natural Language Processing; NLP; POS Tagging; Domain Adaptation; Clinical Narratives
Patients with chronic hepatitis C (HCV) frequently discontinued dual therapy with pegylated interferon alfa (Peg-IFN) plus ribavirin (RBV) before reaching the recommended duration of 48 or 24 weeks for genotypes (G) 1/4 or 2/3, respectively. We quantified rates of discontinuation despite efficacy (non-LOE) versus lack of efficacy (LOE) versus discontinuation for unknown reasons in a national database of United States veterans.
We identified a population-based cohort of U.S. veterans with encounters from 2004 through 2009 who had lab-confirmed HCV infection and initiated therapy with Peg-IFN plus RBV in Veterans Health Administration medical centers. Pharmacy data were used to determine therapy duration, defined as the sum of Peg-IFN days supplied. Patients “discontinued” if they failed to receive at least 44 (G1/4) or 20 weeks (G2/3) of therapy. We classified discontinuations as due to non-LOE, LOE, or unknown reasons using a classification rule based on treatment duration and laboratory confirmed response.
Of 321,238 diagnosed HCV patients during the evaluation period, 9.7% initiated therapy and 6.4% met all other inclusion criteria. 54.9% of patients discontinued early; of these, 41.2% discontinued due to non-LOE reasons, 12.5% discontinued for LOE reasons, and 46.3% discontinued for unknown reasons. Among non-LOE discontinuers, most (60.1%) discontinued in the first 4 weeks of therapy, which constitutes 13.6% of all treated patients.
We observed a high proportion of early discontinuations with dual-therapy regimens in a national cohort of HCV-infected veterans. If this trend persists in the triple-therapy era, then efforts must be undertaken to improve adherence.
Hepatitis C virus; Pegylated interferon; Discontinuation; Veterans
Left ventricular ejection fraction (EF) is a key component of heart failure quality measures used within the Department of Veteran Affairs (VA). Our goals were to build a natural language processing system to extract the EF from free-text echocardiogram reports to automate measurement reporting and to validate the accuracy of the system using a comparison reference standard developed through human review. This project was a Translational Use Case Project within the VA Consortium for Healthcare Informatics.
Materials and methods
We created a set of regular expressions and rules to capture the EF using a random sample of 765 echocardiograms from seven VA medical centers. The documents were randomly assigned to two sets: a set of 275 used for training and a second set of 490 used for testing and validation. To establish the reference standard, two independent reviewers annotated all documents in both sets; a third reviewer adjudicated disagreements.
System test results for document-level classification of EF of <40% had a sensitivity (recall) of 98.41%, a specificity of 100%, a positive predictive value (precision) of 100%, and an F measure of 99.2%. System test results at the concept level had a sensitivity of 88.9% (95% CI 87.7% to 90.0%), a positive predictive value of 95% (95% CI 94.2% to 95.9%), and an F measure of 91.9% (95% CI 91.2% to 92.7%).
An EF value of <40% can be accurately identified in VA echocardiogram reports.
An automated information extraction system can be used to accurately extract EF for quality measurement.
Natural language processing (NLP); heart failure; left ventricular ejection fraction (EF); Improving healthcare workflow and process efficiency; applied informatics; Improving government and community policy relevant to informatics and health quality; process modeling and hypothesis generation; Informatics; Enhancing the conduct of biological/clinical research and trials; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); visualization of data and knowledge; uncertain reasoning and decision theory; languages and computational methods; statistical analysis of large datasets; advanced algorithms; discovery and text and data mining methods; other methods of information extraction; automated learning; human-computer interaction and human-centered computing; cognitive study (including experiments emphasizing verbal protocol analysis and usability); knowledge representations; knowledge acquisition and knowledge management; delivering health information and knowledge to the public; processing and display; analysis; image representation; controlled terminologies and vocabularies; ontologies; knowledge bases; ejection; fraction; machine learning; simulation of complex systems (at all levels: molecules to work groups to organizations); developing/using clinical decision support (other than diagnostic) and guideline systems; detecting disease outbreaks and biological threats
Electronically linked datasets have become an important part of clinical research. Information from multiple sources can be used to identify comorbid conditions and patient outcomes, measure use of healthcare services, and enrich demographic and clinical variables of interest. Innovative approaches for creating research infrastructure beyond a traditional data system are necessary.
Materials and methods
Records from a large healthcare system's enterprise data warehouse (EDW) were linked to a statewide population database, and a master subject index was created. The authors evaluate the linkage, along with the impact of missing information in EDW records and the coverage of the population database. The makeup of the EDW and population database provides a subset of cancer records that exist in both resources, which allows a cancer-specific evaluation of the linkage.
About 3.4 million records (60.8%) in the EDW were linked to the population database with a minimum accuracy of 96.3%. It was estimated that approximately 24.8% of target records were absent from the population database, which enabled the effect of the amount and type of information missing from a record on the linkage to be estimated. However, 99% of the records from the oncology data mart linked; they had fewer missing fields and this correlated positively with the number of patient visits.
Discussion and conclusion
A general-purpose research infrastructure was created which allows disease-specific cohorts to be identified. The usefulness of creating an index between institutions is that it allows each institution to maintain control and confidentiality of their own information.
Master subject index; record linking; confidentiality; cancer cohort; population database; informatics; statistics; record linking; master subject index; population database
Radiology departments around the country have completed the first evolution to digital imaging by becoming filmless. The next step in this evolution is to become truly paperless. Both patient and non-patient paperwork has to be eliminated in order for this transition to occur. A paper-based set of patient pre-scanning questionnaires were replaced with web-based forms for use in an outpatient imaging center. We discuss this process by which questionnaire elements are converted into SNOMED-CT terminology concepts, stored for future use, and sent to PACS in Digital Imaging and Communications in Medicine (DICOM) format to be permanently stored with the relevant study in the DICOM image database.
Paperless; Pseudo paperless; Filmless; SNOMED-CT; Data mining; Clinical workflow; Data collection
The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records presented three tasks: a concept extraction task focused on the extraction of medical concepts from patient reports; an assertion classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. i2b2 and the VA provided an annotated reference standard corpus for the three tasks. Using this reference standard, 22 systems were developed for concept extraction, 21 for assertion classification, and 16 for relation classification.
These systems showed that machine learning approaches could be augmented with rule-based systems to determine concepts, assertions, and relations. Depending on the task, the rule-based systems can either provide input for machine learning or post-process the output of machine learning. Ensembles of classifiers, information from unlabeled data, and external knowledge sources can help when the training data are inadequate.
Information storage and retrieval (text and images); discovery; and text and data mining methods; Other methods of information extraction; Natural-language processing; Automated learning; visualization of data and knowledge; uncertain reasoning and decision theory; languages, and computational methods; statistical analysis of large datasets; advanced algorithms; discovery; other methods of information extraction; automated learning; human-computer interaction and human-centered computing; NLP; machine learning; Informatics
Accurate information is needed to direct healthcare systems’ efforts to control methicillin-resistant Staphylococcus aureus (MRSA). Assembling complete and correct microbiology data is vital to understanding and addressing the multiple drug-resistant organisms in our hospitals.
Herein, we describe a system that securely gathers microbiology data from the Department of Veterans Affairs (VA) network of databases. Using natural language processing methods, we applied an information extraction process to extract organisms and susceptibilities from the free-text data. We then validated the extraction against independently derived electronic data and expert annotation.
We estimate that the collected microbiology data are 98.5% complete and that methicillin-resistant Staphylococcus aureus was extracted accurately 99.7% of the time.
Applying natural language processing methods to microbiology records appears to be a promising way to extract accurate and useful nosocomial pathogen surveillance data. Both scientific inquiry and the data’s reliability will be dependent on the surveillance system’s capability to compare from multiple sources and circumvent systematic error. The dataset constructed and methods used for this investigation could contribute to a comprehensive infectious disease surveillance system or other pressing needs.
The Department of Veterans Affairs (VA) and the Informatics for Integrating Biology and the Bedside (i2b2) team partnered to generate the reference standard for the 2010 i2b2/VA challenge task on concept extraction, assertion classification, and relation classification. The purpose of this paper is to report an in-depth qualitative analysis of the experience and perceptions of human annotators for these tasks. Transcripts of semi-structured interviews were analyzed using qualitative methods to identify key constructs and themes related to these annotation tasks. Interventions were embedded with these tasks using pre-annotation of clinical concepts and a modified annotation workflow. From the human perspective, annotation tasks involve an inherent conflict between bias, accuracy, and efficiency. This analysis deepens understanding of the biases, complexities and impact of variations in the annotation process that may affect annotation task reliability and reference standard validity that are generalizable for other similar large-scale clinical corpus annotation projects.
Digital imaging and communication in medicine (DICOM) specifies that all DICOM objects have globally unique identifiers (UIDs). Creating these UIDs can be a difficult task due to the variety of techniques in use and the requirement to ensure global uniqueness. We present a simple technique of combining a root organization identifier, assigned descriptive identifiers, and JAVA generated unique identifiers to construct DICOM compliant UIDs.
Digital imaging and communications in medicine (DICOM); structured reporting; digital imaging
The Integrating the Healthcare Enterprise (IHE) Teaching File and Clinical Trial Export (TCE) integration profile describes a standard workflow for exporting key images from an image manager/archive to a teaching file, clinical trial, or electronic publication application. Two specific digital imaging and communication in medicine (DICOM) structured reports (SR) reference the key images and contain associated case information. This paper presents step-by-step instructions for translating the TCE document templates into functional and complete DICOM SR objects. Others will benefit from these instructions in developing TCE compliant applications.
Digital imaging and communications in medicine (DICOM); integrating healthcare enterprise (IHE); extensible markup; language (XML); electronic teaching file; clinical trial; electronic; publishing
In the creation of interesting radiological cases in a digital teaching file, it is necessary to adjust the window and level settings of an image to effectively display the educational focus. The web-based applet described in this paper presents an effective solution for real-time window and level adjustments without leaving the picture archiving and communications system workstation. Optimized images are created, as user-defined parameters are passed between the applet and a servlet on the Health Insurance Portability and Accountability Act-compliant teaching file server.
Electronic teaching file; image manipulation; web technology