Results 1-23 (23)

1.  Development of a HIPAA-compliant environment for translational research data and analytics 
High-performance computing centers (HPC) traditionally have far less restrictive privacy management policies than those encountered in healthcare. We show how an HPC can be re-engineered to accommodate clinical data while retaining its utility in computationally intensive tasks such as data mining, machine learning, and statistics. We also discuss deploying protected virtual machines. A critical planning step was to engage the university's information security operations and the information security and privacy office. Access to the environment requires a double authentication mechanism. The first level of authentication requires access to the university's virtual private network and the second requires that the users be listed in the HPC network information service directory. The physical hardware resides in a data center with controlled room access. All employees of the HPC and its users take the university's local Health Insurance Portability and Accountability Act training series. In the first 3 years, researcher count has increased from 6 to 58.
PMCID: PMC3912719  PMID: 23911553
High-performance Computing; Translational Medical Research; Clinical Research Informatics; HIPAA
2.  Automatic Extraction of Nanoparticle Properties Using Natural Language Processing: NanoSifter an Application to Acquire PAMAM Dendrimer Properties 
PLoS ONE  2014;9(1):e83932.
In this study, we demonstrate the use of natural language processing methods to extract, from nanomedicine literature, numeric values of biomedical property terms of poly(amidoamine) dendrimers. We have developed a method for extracting these values for properties taken from the NanoParticle Ontology, using the General Architecture for Text Engineering and a Nearly-New Information Extraction System. We also created a method for associating the identified numeric values with their corresponding dendrimer properties, called NanoSifter.
We demonstrate that our system can correctly extract numeric values of dendrimer properties reported in the cancer treatment literature with high recall, precision, and f-measure. The micro-averaged recall was 0.99, precision was 0.84, and f-measure was 0.91. Similarly, the macro-averaged recall was 0.99, precision was 0.87, and f-measure was 0.92. To our knowledge, these results are the first application of text mining to extract and associate dendrimer property terms and their corresponding numeric values.
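The micro- and macro-averaged figures above differ because they pool results differently. The following minimal Python sketch illustrates the two averaging schemes. The property names and counts are invented for illustration and are not the study's data; the helper names (`prf`, `micro_average`, `macro_average`) are likewise hypothetical.

```python
def prf(tp, fp, fn):
    """Return (recall, precision, F-measure) from raw counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure


# Hypothetical counts per property:
# (true positives, false positives, false negatives)
counts = {
    "diameter": (90, 10, 1),
    "molecular_weight": (8, 4, 0),
}


def micro_average(counts):
    """Pool the counts over all properties, then compute the scores once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts):
    """Compute the scores per property, then take the unweighted mean."""
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


micro = micro_average(counts)
macro = macro_average(counts)
print("micro (recall, precision, F):", micro)
print("macro (recall, precision, F):", macro)
```

Micro-averaging is dominated by frequently reported properties, while macro-averaging weights every property equally, so a rarely reported property with poor precision moves the macro figure more than the micro figure.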
PMCID: PMC3879259  PMID: 24392101
3.  AMIA's Code of Professional and Ethical Conduct 
PMCID: PMC3555329  PMID: 22733977
Ethics; web 2.0; qualitative methods; consumer health; patient education; e-health; system implementation and management issues; improving the education and skills training of health professionals; developing/using clinical decision support (other than diagnostic) and guideline systems; systems to support and improve diagnostic accuracy; measuring/improving patient safety and reducing medical errors; qualitative/ethnographic field study; enhancing the conduct of biological/clinical research and trials; classical experimental and quasi-experimental study methods (lab and field); consumer health informatics; ethical study methods; clinical research informatics; nutrition; clinical natural language processing; developing/using computerized provider order entry; other specific ehr applications (results review); medication administration; ethics committees; code of ethics; bioethics
4.  Identifying clinical/translational research cohorts: ascertainment via querying an integrated multi-source database 
Ascertainment of potential subjects has been a longstanding problem in clinical research. Various methods have been proposed, including using data in electronic health records. However, these methods typically suffer from scaling effects—some methods work well for large cohorts; others work for small cohorts only.
We propose a method that provides a simple identification of pre-research cohorts and relies on data available in most states in the USA: merged public health data sources.
Materials and methods
The Utah Population Database Limited query tool allows users to build complex queries that may span several types of health records, such as cancer registries, inpatient hospital discharges, and death certificates; in addition, these can be combined with family history information. The architectural approach incorporates several coding systems for medical information. It provides a front-end graphical user interface and enables researchers to build and run queries and view aggregate results. Multiple strategies have been incorporated to maintain confidentiality.
This tool was rapidly adopted; since its release, 241 users representing a wide range of disciplines from 17 institutions have signed the user agreement and used the query tool. Three examples are discussed: pregnancy complications co-occurring with cardiovascular disease; spondyloarthritis; and breast cancer.
Discussion and conclusions
This query tool was designed to provide results as pre-research so that institutional review board approval would not be required. This architecture uses well-described technologies that should be within the reach of most institutions.
PMCID: PMC3555332  PMID: 23059733
medical informatics applications; clinical trials; patient selection; public health informatics; clinical research informatics
5.  qDIET: toward an automated, self-sustaining knowledge base to facilitate linking point-of-sale grocery items to nutritional content 
The United States, indeed the world, struggles with a serious obesity epidemic. The costs of this epidemic in terms of healthcare dollar expenditures and human morbidity/mortality are staggering. Surprisingly, clinicians are ill-equipped in general to advise patients on effective, longitudinal weight loss strategies. We argue that one factor hindering clinicians and patients in effective shared decision-making about weight loss is the absence of a metric that can be reasoned about and monitored over time, as clinicians do routinely with, say, serum lipid levels or HgA1C. We propose that a dietary quality measure championed by the USDA and NCI, the HEI-2005/2010, is an ideal metric for this purpose. We describe a new tool, the quality Dietary Information Extraction Tool (qDIET), which is a step toward an automated, self-sustaining process that can link retail grocery purchase data to the appropriate USDA databases to permit the calculation of the HEI-2005/2010.
PMCID: PMC3900174  PMID: 24551333
6.  Text summarization as a decision support aid 
PubMed data potentially can provide decision support information, but PubMed was not exclusively designed to be a point-of-care tool. Natural language processing applications that summarize PubMed citations hold promise for extracting decision support information. The objective of this study was to evaluate the efficiency of a text summarization application called Semantic MEDLINE, enhanced with a novel dynamic summarization method, in identifying decision support data.
We downloaded PubMed citations addressing the prevention and drug treatment of four disease topics. We then processed the citations with Semantic MEDLINE, enhanced with the dynamic summarization method. We also processed the citations with a conventional summarization method, as well as with a baseline procedure. We evaluated the results using clinician-vetted reference standards built from recommendations in a commercial decision support product, DynaMed.
For the drug treatment data, Semantic MEDLINE enhanced with dynamic summarization achieved average recall and precision scores of 0.848 and 0.377, while conventional summarization produced 0.583 average recall and 0.712 average precision, and the baseline method yielded average recall and precision values of 0.252 and 0.277. For the prevention data, Semantic MEDLINE enhanced with dynamic summarization achieved average recall and precision scores of 0.655 and 0.329. The baseline technique resulted in recall and precision scores of 0.269 and 0.247. No conventional Semantic MEDLINE method accommodating summarization for prevention exists.
Semantic MEDLINE with dynamic summarization outperformed conventional summarization in terms of recall, and outperformed the baseline method in both recall and precision. This new approach to text summarization demonstrates potential in identifying decision support data for multiple needs.
PMCID: PMC3461485  PMID: 22621674
7.  Sentiment Analysis of Suicide Notes: A Shared Task 
Biomedical informatics insights  2012;5(Suppl 1):3-16.
This paper reports on a shared task involving the assignment of emotions to suicide notes. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in the corpus of fully anonymized clinical text and annotated suicide notes. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.
PMCID: PMC3299408  PMID: 22419877
Sentiment analysis; suicide; suicide notes; natural language processing; computational linguistics; shared task; challenge 2011
8.  Automatically Detecting Medications and the Reason for their Prescription in Clinical Narrative Text Documents 
An important proportion of the information about the medications a patient is taking is mentioned only in narrative text in the electronic health record. Automated information extraction can make this information accessible for decision-support, research, or any other automated processing. In the context of the “i2b2 medication extraction challenge,” we have developed a new NLP application called Textractor to automatically extract medications and details about them (e.g., dosage, frequency, reason for their prescription). This application and its evaluation with part of the reference standard for this “challenge” are presented here, along with an analysis of the development of this reference standard. During this evaluation, Textractor reached a system-level overall F1-measure, the reference metric for this challenge, of about 77% for exact matches. The best performance was measured with medication routes (F1-measure 86.4%), and the worst with prescription reasons (F1-measure 29%). These results are consistent with the agreement observed between human annotators when developing the reference standard, and with other published research.
PMCID: PMC3238676  PMID: 20841823
Pharmaceutical Preparations; Drug Prescriptions; Natural Language Processing; Program Evaluation; Knowledge Bases
9.  Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents 
To describe a new medication information extraction system—Textractor—developed for the ‘i2b2 medication extraction challenge’. The development, functionalities, and official evaluation of the system are detailed.
Textractor is based on the Apache Unstructured Information Management Architecture (UIMA) framework, and uses methods that are a hybrid between machine learning and pattern matching. Two modules in the system are based on machine learning algorithms, while other modules use regular expressions, rules, and dictionaries, and one module embeds MetaMap Transfer.
The official evaluation was based on a reference standard of 251 discharge summaries annotated by all teams participating in the challenge. The metrics used were recall, precision, and the F1-measure. They were calculated with exact and inexact matches, and were averaged at the level of systems and documents.
The reference metric for this challenge, the system-level overall F1-measure, reached about 77% for exact matches, with a recall of 72% and a precision of 83%. Performance was the best with route information (F1-measure about 86%), and was good for dosage and frequency information, with F1-measures of about 82–85%. Results were not as good for durations, with F1-measures of 36–39%, and for reasons, with F1-measures of 24–27%.
The official evaluation of Textractor for the i2b2 medication extraction challenge demonstrated satisfactory performance. This system was among the 10 best performing systems in this challenge.
PMCID: PMC2995680  PMID: 20819864
10.  Dynamic summarization of bibliographic-based data 
Traditional information retrieval techniques typically return excessive output when directed at large bibliographic databases. Natural Language Processing applications strive to extract salient content from the excessive data. Semantic MEDLINE, a National Library of Medicine (NLM) natural language processing application, highlights relevant information in PubMed data. However, Semantic MEDLINE implements manually coded schemas, accommodating few information needs. Currently, there are only five such schemas, while many more would be needed to realistically accommodate all potential users. The aim of this project was to develop and evaluate a statistical algorithm that automatically identifies relevant bibliographic data; the new algorithm could be incorporated into a dynamic schema to accommodate various information needs in Semantic MEDLINE, and eliminate the need for multiple schemas.
We developed a flexible algorithm named Combo that combines three statistical metrics, the Kullback-Leibler Divergence (KLD), Riloff's RlogF metric (RlogF), and a new metric called PredScal, to automatically identify salient data in bibliographic text. We downloaded citations from a PubMed search query addressing the genetic etiology of bladder cancer. The citations were processed with SemRep, an NLM rule-based application that produces semantic predications. SemRep output was processed by Combo, in addition to the standard Semantic MEDLINE genetics schema and independently by the two individual KLD and RlogF metrics. We evaluated each summarization method using an existing reference standard within the task-based context of genetic database curation.
Combo asserted 74 genetic entities implicated in bladder cancer development, whereas the traditional schema asserted 10 genetic entities; the KLD and RlogF metrics individually asserted 77 and 69 genetic entities, respectively. Combo achieved 61% recall and 81% precision, with an F-score of 0.69. The traditional schema achieved 23% recall and 100% precision, with an F-score of 0.37. The KLD metric achieved 61% recall, 70% precision, with an F-score of 0.65. The RlogF metric achieved 61% recall, 72% precision, with an F-score of 0.66.
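The F-scores above can be cross-checked against the reported recall and precision. This minimal Python sketch assumes the abstract's "F-score" is the balanced F-score (the harmonic mean of recall and precision), which the abstract does not state explicitly; the function name `f_score` is hypothetical.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) as reported in the abstract, expressed as fractions.
reported = {
    "Combo": (0.61, 0.81),
    "Traditional schema": (0.23, 1.00),
    "KLD": (0.61, 0.70),
    "RlogF": (0.61, 0.72),
}

for name, (recall, precision) in reported.items():
    print(f"{name}: F = {f_score(recall, precision):.3f}")
```

Under that assumption, each recomputed value falls within 0.01 of the reported F-score; the small residual is presumably due to the recall and precision percentages being rounded.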
Semantic MEDLINE summarization using the new Combo algorithm outperformed a conventional summarization schema in a genetic database curation task. It potentially could streamline information acquisition for other needs without having to hand-build multiple saliency schemas.
PMCID: PMC3042900  PMID: 21284871
11.  Using UMLS Lexical Resources to Disambiguate Abbreviations in Clinical Text 
Clinical text is rich in acronyms and abbreviations, and they are highly ambiguous. As a pre-processing step before subsequent NLP analysis, we are developing and evaluating clinical abbreviation disambiguation methods. The evaluation of two sequential steps, the detection and the disambiguation of abbreviations, is reported here, for various types of clinical notes. For abbreviation detection, our results indicated that the SPECIALIST Lexicon LRABR needs to be revised. Our semi-supervised method, which uses training data generated by expanded-form matching for 12 frequent abbreviations in our clinical notes, reached over 90% accuracy in five-fold cross-validation, and an unsupervised approach produced comparable results.
PMCID: PMC3243121  PMID: 22195128
12.  From simply inaccurate to complex and inaccurate: complexity in standards-based quality measures 
Quality measurement has been slow to make a major impact in health care. Initial measures were too simple to affect outcomes of importance. Incentive programs such as Meaningful Use encourage better measures, but in the process the measures may become more complex. We evaluated the measures selected for Meaningful Use in two ways: we counted unique concept identifiers, taxonomies, and aggregated concepts as measures of complexity; and we surveyed informatics professionals to assess difficulty. There were 20,316 unique concept identifiers, 35 taxonomies, and 317 aggregated concepts across the 45 measures. Half the respondents rated the measures as at least moderately difficult. The number of identifiers was associated with fewer implementations (r=−.37); rating-of-difficulty was associated with more taxonomies (r=.24). The impact on accuracy may be substantial when moving to measures intended to be more relevant to clinical outcomes but requiring the use of more taxonomies, unused structured concept identifiers, or concepts only in free text fields.
PMCID: PMC3243137  PMID: 22195085
13.  Linking Supermarket Sales Data To Nutritional Information: An Informatics Feasibility Study 
Grocery sales are a data source of potential value to dietary assessment programs in public health informatics. However, the lack of a computable method for mapping between nutrient and food item information represents a major obstacle. We studied the feasibility of linking point-of-sale data to USDA-SR nutrient database information in a sustainable way. We analyzed 2,009,533 de-identified sales items purchased by 32,785 customers over a two-week period. We developed a method using the item category hierarchy in the supermarket’s database to link purchased items to records from the USDA-SR. We describe our methodology and its rationale and limitations. Approximately 70% of all items were mapped and linked to the SR; approximately 90% of all items could be mapped with an equivalent expenditure of additional effort. 100% of all items were mapped to USDA standard food groups. We conclude that mapping grocery sales data to nutritional information is feasible.
PMCID: PMC3243220  PMID: 22195115
14.  Document Clustering of Clinical Narratives: a Systematic Study of Clinical Sublanguages 
AMIA Annual Symposium Proceedings  2011;2011:1099-1107.
It is widely believed that different clinical domains use their own sublanguage in clinical notes, complicating natural language processing, but this has never been demonstrated on a broad selection of note types. Starting from formal sublanguage theory, we constructed a feature space based on vocabulary and semantic types used in 17 different clinical domains by three author types (physicians, nurses, and social workers) in both the in- and outpatient settings. We supplied the resulting vectors to CLUTO, a robust clustering tool suitable for this high-dimensional space. Our results confirm that note types with a broad clinical scope, e.g., History & Physicals and Discharge Summaries, cluster together, while note types with a narrow clinical scope form surprisingly pure, disjoint sublanguages. A reasonable conclusion from this study is that any tool relying on term statistics or semantics trained on one clinical note type may not work well on any other.
PMCID: PMC3243234  PMID: 22195171
15.  Biomedical text summarization to support genetic database curation: using Semantic MEDLINE to create a secondary database of genetic information 
This paper examines the development and evaluation of an automatic summarization system in the domain of molecular genetics. The system is a potential component of an advanced biomedical information management application called Semantic MEDLINE and could assist librarians in developing secondary databases of genetic information extracted from the primary literature.
An existing summarization system was modified for identifying biomedical text relevant to the genetic etiology of disease. The summarization system was evaluated on the task of identifying data describing genes associated with bladder cancer in MEDLINE citations. A gold standard was produced using records from Genetics Home Reference and Online Mendelian Inheritance in Man. Genes in text found by the system were compared to the gold standard. Recall, precision, and F-measure were calculated.
The system achieved recall of 46%, and precision of 88% (F-measure = 0.61) by taking Gene References into Function (GeneRIFs) into account.
The new summarization schema for genetic etiology has potential as a component in Semantic MEDLINE to support the work of data curators.
PMCID: PMC2947139  PMID: 20936065
16.  Automatic Acquisition of Sublanguage Semantic Schema: Towards the Word Sense Disambiguation of Clinical Narratives 
Natural language processing of clinical notes is challenging due to a high degree of semantic ambiguity. Previous research has uncovered ways to improve disambiguation accuracy using manually created rules of semantic sentence structure. However, applying a natural language processing system in a new clinical domain using this method is very labor intensive. This paper presents an automatic method of developing such disambiguation rules for a wide range of clinical domains. Our rules are based on the co-occurrence patterns of semantic types of terms unambiguously mapped to UMLS concepts by MetaMap. These patterns are combined into a sublanguage semantic schema that can be used by an existing natural language processing system such as MetaMap. The differences of co-occurrence patterns across clinical notes of different domains are presented here as evidence of clinical sublanguages.
PMCID: PMC3041300  PMID: 21347051
17.  Integrating a Federated Healthcare Data Query Platform With Electronic IRB Information Systems 
Human subjects are indispensable for clinical and translational research. Federal and local agencies issue regulations governing the conduct of research involving human subjects in order to properly protect study participants. Institutional Review Boards (IRBs) have the authority to review human subject research to ensure concordance with these regulations. One of the primary goals of the IRB oversight is to protect research participants’ privacy by carefully reviewing the data used and disclosed during a study. However, there are major challenges for IRBs in the typical research process. Due to the information disconnect between the data providers (e.g., a clinical data warehouse) and the IRB, it is often impossible to tell exactly what data has been disclosed to investigators. This causes time-consuming, inefficient, and often ineffective monitoring of clinical studies. This paper proposes an integrated architecture that interconnects a federated healthcare data query platform with an electronic IRB system.
PMCID: PMC3041360  PMID: 21346987
18.  A Code of Professional Ethical Conduct for the American Medical Informatics Association 
The AMIA Board of Directors has decided to periodically publish AMIA’s Code of Professional Ethical Conduct for its members in the Journal of the American Medical Informatics Association. The Code also will be available on the AMIA Web site as it continues to evolve in response to feedback from the AMIA membership. The AMIA Board acknowledges the continuing work and dedication of the AMIA Ethics Committee. AMIA is the copyright holder of this work.
PMCID: PMC2244909  PMID: 17460125
19.  Measuring Diagnoses: ICD Code Accuracy 
Health Services Research  2005;40(5 Pt 2):1620-1639.
To examine potential sources of errors at each step of the described inpatient International Classification of Diseases (ICD) coding process.
Data Sources/Study Setting
The use of disease codes from the ICD has expanded from classifying morbidity and mortality information for statistical purposes to diverse sets of applications in research, health care policy, and health care finance. By describing a brief history of ICD coding, detailing the process for assigning codes, identifying where errors can be introduced into the process, and reviewing methods for examining code accuracy, we help code users more systematically evaluate code accuracy for their particular applications.
Study Design/Methods
We summarize the inpatient ICD diagnostic coding process from patient admission to diagnostic code assignment. We examine potential sources of errors at each step and offer code users a tool for systematically evaluating code accuracy.
Principal Findings
Main error sources along the “patient trajectory” include amount and quality of information at admission, communication among patients and providers, the clinician's knowledge and experience with the illness, and the clinician's attention to detail. Main error sources along the “paper trail” include variance in the electronic and written records, coder training and experience, facility quality-control efforts, and unintentional and intentional coder errors, such as misspecification, unbundling, and upcoding.
By clearly specifying the code assignment process and heightening their awareness of potential error sources, code users can better evaluate the applicability and limitations of codes for their particular situations. ICD codes can then be used in the most appropriate ways.
PMCID: PMC1361216  PMID: 16178999
ICD codes; accuracy; error sources
20.  Leveraging Semantic Knowledge in IRB Databases to Improve Translation Science 
We introduce the notion that research administrative databases (RADs), such as those increasingly used to manage information flow in the Institutional Review Board (IRB), offer a novel, useful, and mine-able data source overlooked by informaticists. As a proof of concept, using an IRB database we extracted all titles and abstracts from system startup through January 2007 (n=1,876); formatted these in a pseudo-MEDLINE format; and processed them through the SemRep semantic knowledge extraction system. Even though SemRep is tuned to find semantic relations in MEDLINE citations, we found that it performed comparably well on the IRB texts. When adjusted to eliminate non-healthcare IRB submissions (e.g., economic and education studies), SemRep extracted an average of 7.3 semantic relations per IRB abstract (compared to an average of 11.1 for MEDLINE citations) with a precision of 70% (compared to 78% for MEDLINE). We conclude that RADs, as represented by IRB data, are mine-able with existing tools, but that performance will improve as these tools are tuned for RAD structures.
PMCID: PMC2655842  PMID: 18693856
21.  Developing a Taxonomy for Research in Adverse Drug Events: Potholes and Signposts 
Computerized decision support and order entry show great promise for reducing adverse drug events (ADEs). The evaluation of these solutions depends on a framework of definitions and classifications that is clear and practical. Unfortunately, the literature does not always provide a clear path to defining and classifying adverse drug events. While not a systematic review, this paper uses examples from the literature to illustrate problems that investigators will confront as they develop a conceptual framework for their research. It also proposes a targeted taxonomy that can facilitate a clear and consistent approach to the research of ADEs and aid in the comparison to results of past and future studies. This paper outlines the ambiguity in definitions of ADEs that has arisen from the conflation of regulatory and quality terminology. It proposes a typology for ADEs by drug and disease effect and outlines problems inherent in the study of ADEs related to disease effects, errors, and omitted therapies. The paper also highlights the difficulty of assessing seriousness and causality and the problems with commonly used scales for these assessments. Finally, although national or international agreement on a taxonomy for ADEs is a distant or unachievable goal, individual investigations and the literature as a whole will be improved by prospective, explicit classification of ADEs and inclusion of the study’s approach to classification in publications.
PMCID: PMC419426
22.  Critical Gaps in the World’s Largest Electronic Medical Record: Ad Hoc Nursing Narratives and Invisible Adverse Drug Events 
The Veterans Health Administration (VHA), of the U.S. Department of Veterans Affairs, operates one of the largest healthcare networks in the world. Its electronic medical record (EMR) is fully integrated into clinical practice, having evolved over several decades of design, testing, trial, and error. It is unarguably the world’s largest EMR, and as such it makes an important case study for a host of timely informatics issues. The VHA consistently has been at the vanguard of patient safety, especially in its provider-oriented EMR. We describe here a study of a large set of adverse drug events (ADEs) that eluded a rigorous ADE survey based on prospective EMR chart review. These numerous ADEs were undetected (and hence invisible) in the EMR, missed by an otherwise sophisticated ADE detection scheme. We speculate how these invisible nursing ADE narratives persist and what they portend for safety re-engineering.
PMCID: PMC1480185  PMID: 14728184
23.  Laboratory Computerization: The Case for a Prospective Analysis 
The argument is made that computerization of a laboratory should be preceded by a thorough prospective analysis of laboratory operations. Points to be pondered include complementation of retrospective data, system cost justification, system performance justification, post-installation personnel adjustments, improved system utilization, improved manual performance, and insight into “how much” system to buy. A brief, general outline is offered describing how to approach such a study.
PMCID: PMC2580220
