The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus under a Data Use Agreement, we would like to encourage other computational linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; and (3) the tests to show the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development.
Material and Methods
In a previous study we built an original de-identification gold standard corpus annotated with true Protected Health Information (PHI) from 3,503 randomly selected clinical notes for the 22 most frequent clinical note types of our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. Finally, we evaluated the research value of this new dataset by comparing the performance of an existing published in-house de-identification system, when trained on the new de-identification gold standard corpus, with the performance of the same system, when trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets that were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes.
Performance of the de-identification system using the new gold standard corpus as a training set was very close to training on the original corpus (92.56 vs. 93.48 overall F-measures). Best i2b2/PhysioNet/CCHMC cross-training performances were obtained when training on the new shared CCHMC gold standard corpus, although performances were still lower than corpus-specific trainings.
Discussion and conclusion
We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus with limited drop in machine learning de-identification performance.
Natural Language Processing; Privacy of Patient Data; Health Insurance Portability and Accountability Act; Automated De-identification; De-identification Gold Standard; Protected Health Information
Predictive models built using temporal data in electronic health records (EHRs) can potentially play a major role in improving management of chronic diseases. However, these data present a multitude of technical challenges, including irregular sampling of data and varying length of available patient history. In this paper, we describe and evaluate three different approaches that use machine learning to build predictive models using temporal EHR data of a patient.
The first approach is a commonly used non-temporal approach that aggregates values of the predictors in the patient’s medical history. The other two approaches exploit the temporal dynamics of the data. The two temporal approaches vary in how they model temporal information and handle missing data. Using data from the EHR of Mount Sinai Medical Center, we learned and evaluated the models in the context of predicting loss of estimated glomerular filtration rate (eGFR), the most common assessment of kidney function.
Our results show that incorporating temporal information in a patient’s medical history can lead to better prediction of loss of kidney function. They also demonstrate that exactly how this information is incorporated is important. In particular, our results demonstrate that the relative importance of different predictors varies over time, and that using multi-task learning to account for this is an appropriate way to robustly capture the temporal dynamics in EHR data. Using a case study, we also demonstrate how the multi-task learning based model can yield predictive models with better performance for identifying patients at high risk of short-term loss of kidney function.
Electronic health records; Temporal analysis; Progression of kidney function loss; Risk stratification
Epilepsy is a common, serious neurological disorder with a complex set of possible phenotypes ranging from pathologic abnormalities to variations in electroencephalogram. This paper presents a system called Phenotype Extraction in Epilepsy (PEEP) for extracting complex epilepsy phenotypes and their correlated anatomical locations from clinical discharge summaries, a primary data source for this purpose. PEEP generates candidate phenotype and anatomical location pairs by embedding a named entity recognition method, based on the Epilepsy and Seizure Ontology, into the National Library of Medicine's MetaMap program. Such candidate pairs are further processed using a correlation algorithm. The derived phenotypes and correlated locations have been used for cohort identification with an integrated ontology-driven visual query interface. To evaluate the performance of PEEP, 400 de-identified discharge summaries were used for development and an additional 262 were used as test data. PEEP achieved a micro-averaged precision of 0.924, recall of 0.931, and F1-measure of 0.927 for extracting epilepsy phenotypes. The performance on the extraction of correlated phenotypes and anatomical locations shows a micro-averaged F1-measure of 0.856 (Precision: 0.852, Recall: 0.859). The evaluation demonstrates that PEEP is an effective approach to extracting complex epilepsy phenotypes for cohort identification.
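Micro-averaging pools the true-positive, false-positive, and false-negative counts over all categories before computing precision, recall, and F1. A minimal sketch of that calculation follows; the category names and counts are hypothetical and are not taken from the PEEP evaluation.

```python
from typing import Dict, Tuple


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, F1.

    `counts` maps a category name to (true_pos, false_pos, false_neg).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical per-category counts, for illustration only.
example_counts = {
    "seizure_type": (90, 10, 5),
    "eeg_pattern": (40, 0, 15),
}
precision, recall, f1 = micro_prf(example_counts)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the counts are pooled, frequent categories dominate the micro-average; a macro-average would instead weight each category equally.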
Epilepsy; Information Extraction; Cohort Identification
Targeted drugs dramatically improve the treatment outcomes in cancer patients; however, these innovative drugs are often associated with unexpectedly high cardiovascular toxicity. Currently, cardiovascular safety represents both a challenging issue for drug developers, regulators, researchers, and clinicians and a concern for patients. While FDA drug labels have captured many of these events, spontaneous reporting systems are a main source for post-marketing drug safety surveillance in ‘real-world’ (outside of clinical trials) cancer patients. In this study, we present approaches to extracting, prioritizing, filtering, and confirming cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS).
Data and Methods
The dataset includes records of 4,285,097 patients from FAERS. We first extracted drug-cardiovascular event (drug-CV) pairs from FAERS through named entity recognition and mapping processes. We then compared six ranking algorithms in prioritizing true positive signals among extracted pairs using known drug-CV pairs derived from FDA drug labels. We also developed three filtering algorithms to further improve precision. Finally, we manually validated extracted drug-CV pairs using 21 million published MEDLINE records.
We extracted a total of 11,173 drug-CV pairs from FAERS. We showed that ranking by frequency is significantly more effective than by the five standard signal detection methods (246% improvement in precision for top-ranked pairs). The filtering algorithm we developed further improved overall precision by 91.3%. By manual curation using literature evidence, we show that about 51.9% of the 617 drug-CV pairs that appeared in both FAERS and MEDLINE sentences are true positives. In addition, 80.6% of these positive pairs have not been captured by FDA drug labeling.
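Ranking by frequency and scoring the top of the list against label-derived pairs can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the drug and event names, the report list, and the set of known pairs are all hypothetical placeholders.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (drug, cardiovascular event)


def rank_by_frequency(reports: Iterable[Pair]) -> List[Pair]:
    """Order drug-event pairs by how often they are reported (most frequent first)."""
    counts = Counter(reports)
    # Ties are broken alphabetically so the ranking is deterministic.
    return [pair for pair, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def precision_at_k(ranked: List[Pair], known: Set[Pair], k: int) -> float:
    """Fraction of the top-k ranked pairs that appear in the known (label-derived) set."""
    top = ranked[:k]
    return sum(pair in known for pair in top) / len(top) if top else 0.0


# Hypothetical extracted reports and hypothetical known pairs.
reports = (
    [("drug_a", "cardiac failure")] * 3
    + [("drug_b", "hypertension")] * 2
    + [("drug_a", "arrhythmia")]
)
known_pairs = {("drug_a", "cardiac failure"), ("drug_a", "arrhythmia")}

ranked = rank_by_frequency(reports)
p_at_2 = precision_at_k(ranked, known_pairs, k=2)
print(ranked, p_at_2)
```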
The unique drug-CV association dataset that we created based on FAERS could facilitate our understanding and prediction of cardiotoxic events associated with targeted cancer drugs.
targeted cancer therapy; cardiotoxicity; data mining; post-market drug safety surveillance; Personalized Medicine
Personalized medicine aims to deliver the right drug to the right patient in the right dose. Pharmacogenomics (PGx) aims to identify genetic variants that may affect drug efficacy and toxicity. The availability of a comprehensive and accurate PGx-specific drug-gene relationship knowledge base is important for personalized medicine. However, building a large-scale PGx-specific drug-gene knowledge base is a difficult task. In this study, we developed a bootstrapping, semi-supervised learning approach to iteratively extract and rank drug-gene pairs according to their relevance to drug pharmacogenomics. Starting with a single PGx-specific seed pair and 20 million MEDLINE abstracts, the extraction algorithm achieved a precision of 0.219, recall of 0.368 and F1 of 0.274 after two iterations, a significant improvement over the results of using non-PGx-specific seeds (precision: 0.011, recall: 0.018, and F1: 0.014) or co-occurrence (precision: 0.015, recall: 1.000, and F1: 0.030). After the extraction step, the ranking algorithm further improved the precision from 0.219 to 0.561 for top-ranked pairs. By comparing to a dictionary-based approach with a PGx-specific gene lexicon as input, we showed that the bootstrapping approach has better performance in terms of both precision and F1 (precision: 0.251 vs. 0.152, recall: 0.396 vs. 0.856 and F1: 0.292 vs. 0.254). By integrative analysis using a large drug adverse event database, we have shown that the extracted drug-gene pairs strongly correlate with drug adverse events. In conclusion, we developed a novel semi-supervised bootstrapping approach for effective PGx-specific drug-gene pair extraction from a large number of MEDLINE articles with minimal human input.
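The general shape of a bootstrapping loop of this kind can be sketched as below: known pairs yield textual patterns, and those patterns in turn yield new pairs. This is a simplified illustration, not the authors' algorithm. It assumes candidate sentences have already been reduced to (drug, connecting text, gene) triples, it omits the ranking step, and every drug, gene, and pattern string is a hypothetical placeholder.

```python
from typing import Iterable, List, Set, Tuple

Candidate = Tuple[str, str, str]  # (drug, connecting text, gene)
Pair = Tuple[str, str]            # (drug, gene)


def bootstrap(candidates: List[Candidate], seeds: Iterable[Pair], iterations: int = 2):
    """Alternate between learning patterns from known pairs and pairs from known patterns."""
    pairs: Set[Pair] = set(seeds)
    patterns: Set[str] = set()
    for _ in range(iterations):
        # Step 1: any connecting text seen with a known pair becomes a pattern.
        patterns |= {ctx for drug, ctx, gene in candidates if (drug, gene) in pairs}
        # Step 2: any pair seen with a known pattern becomes a known pair.
        pairs |= {(drug, gene) for drug, ctx, gene in candidates if ctx in patterns}
    return pairs, patterns


# Hypothetical candidate triples; none of these come from MEDLINE.
candidates = [
    ("drug_a", "response is associated with variants in", "gene_1"),
    ("drug_b", "response is associated with variants in", "gene_2"),
    ("drug_b", "metabolism depends on genotype of", "gene_2"),
    ("drug_c", "metabolism depends on genotype of", "gene_3"),
    ("drug_d", "was co-administered with inhibitors of", "gene_4"),
]

pairs, patterns = bootstrap(candidates, seeds=[("drug_a", "gene_1")], iterations=2)
print(sorted(pairs))
```

In this toy run the second iteration reaches ("drug_c", "gene_3") only through a pattern learned from a pair found in the first iteration, which is the behavior that lets a single seed expand; the unrelated "co-administered" sentence is never picked up.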
pharmacogenomics; text mining; information extraction; personalized medicine
Systems approaches to analyzing disease phenotype networks in combination with protein functional interaction networks have great potential in illuminating disease pathophysiological mechanisms. While many genetic networks are readily available, disease phenotype networks remain largely incomplete. In this study, we built a large-scale Disease Manifestation Network (DMN) from 50,543 highly accurate disease-manifestation semantic relationships in the Unified Medical Language System (UMLS). Our new phenotype network contains 2305 nodes and 373,527 weighted edges to represent the disease phenotypic similarities. We first compared DMN with the networks representing genetic relationships among diseases, and demonstrated that the phenotype clustering in DMN reflects common disease genetics. Then we compared DMN with a widely-used disease phenotype network in previous gene discovery studies, called mimMiner, which was extracted from the textual descriptions in Online Mendelian Inheritance in Man (OMIM). We demonstrated that DMN contains different knowledge from the existing phenotype data source. Finally, a case study on Marfan syndrome further proved that DMN contains useful information and can provide leads to discover unknown disease causes. Integrating DMN in systems approaches with mimMiner and other data offers the opportunities to predict novel disease genetics. We made DMN publicly available at nlp/case.edu/public/data/DMN.
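One way to turn disease-manifestation relationships into weighted disease-disease edges is to score each pair of diseases by the overlap of their manifestation sets. The sketch below assumes cosine similarity over unweighted (binary) manifestation sets; the abstract does not state the weighting scheme, so this is an illustrative choice, and the disease and manifestation labels are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Set, Tuple


def cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity between two binary manifestation sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def build_network(manifestations: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Return weighted edges between diseases that share at least one manifestation."""
    edges = {}
    for d1, d2 in combinations(sorted(manifestations), 2):
        weight = cosine(manifestations[d1], manifestations[d2])
        if weight > 0:
            edges[(d1, d2)] = weight
    return edges


# Hypothetical disease-manifestation relationships.
manifestations = {
    "disease_a": {"m1", "m2"},
    "disease_b": {"m2", "m3"},
    "disease_c": {"m4"},
}
edges = build_network(manifestations)
print(edges)
```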
Ontology; Disease phenotype network; Network analysis
In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common sense approach of “use all available data” is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.
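For reference, mean average precision (MAP) averages, over queries, the precision measured at the rank of each relevant document. A minimal sketch with hypothetical document identifiers and relevance judgments:

```python
from typing import List, Set, Tuple


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision values at the rank of each relevant document retrieved."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Mean of per-query average precision."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries: (ranked results, relevant set).
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```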
Cohort Identification; Information retrieval; Query expansion; Clinical text; Electronic Medical Records
To address the need for greater evidence-based evaluation of Health Information Technology (HIT) systems we introduce a method of usability testing termed tree testing. In a tree test, participants are presented with an abstract hierarchical tree of the system taxonomy and asked to navigate through the tree in completing representative tasks. We apply tree testing to a commercially available health application, demonstrating a use case and providing a comparison with more traditional in-person usability testing methods. Online tree tests (N=54) and in-person usability tests (N=15) were conducted from August to September 2013. Tree testing provided a method to quantitatively evaluate the information structure of a system using various navigational metrics including completion time, task accuracy, and path length. The results of the analyses compared favorably to the results seen from the traditional usability test. Tree testing provides a flexible, evidence-based approach for researchers to evaluate the information structure of HITs. In addition, remote tree testing provides a quick, flexible, and high volume method of acquiring feedback in a structured format that allows for quantitative comparisons. With the diverse nature and often large quantities of health information available, addressing issues of terminology and concept classifications during the early development process of a health information system will improve navigation through the system and save future resources. Tree testing is a usability method that can be used to quickly and easily assess information hierarchy of health information systems.
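The navigational metrics named above (completion time, task accuracy, and path length) can be computed from logged participant paths roughly as follows. The node labels, timings, and the definition of path length as the number of moves after the root are assumptions made for this sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Attempt:
    path: List[str]   # nodes visited in order, ending at the node the participant selected
    seconds: float    # time taken to complete the task


def summarize(attempts: List[Attempt], target: str) -> Dict[str, float]:
    """Aggregate task accuracy, mean path length, and mean completion time for one task."""
    return {
        "accuracy": mean(1.0 if a.path[-1] == target else 0.0 for a in attempts),
        # Path length counted as moves after the starting (root) node.
        "mean_path_length": mean(len(a.path) - 1 for a in attempts),
        "mean_seconds": mean(a.seconds for a in attempts),
    }


# Hypothetical attempts at a task whose correct destination is "Refills".
attempts = [
    Attempt(["Home", "Medications", "Refills"], 12.0),
    Attempt(["Home", "Appointments", "Home", "Medications", "Refills"], 30.0),
    Attempt(["Home", "Records", "Lab results"], 18.0),
]
summary = summarize(attempts, target="Refills")
print(summary)
```

A longer-than-minimal path with a correct final selection (the second attempt) suggests backtracking, which is one signal that a category label is misleading.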
User-Computer Interface; Usability Methods; Information System Evaluation
Time motion studies were first described in the early 20th century in industrial engineering, referring to a quantitative data collection method where an external observer captured detailed data on the duration and movements required to accomplish a specific task, coupled with an analysis focused on improving efficiency. Since then, they have been broadly adopted by biomedical researchers and have become a focus of attention due to the current interest in clinical workflow related factors. However, attempts to aggregate results from these studies have been difficult because of significant variability in the implementation and reporting of methods. While efforts have been made to standardize the reporting of such data and findings, there is still no common understanding of what “time motion studies” are. This not only hinders reviews, but could also partially explain the methodological variability in the domain literature (duration of the observations, number of tasks, multitasking, training rigor and reliability assessments), which is caused by an attempt to cluster dissimilar sub-techniques. A crucial milestone towards the standardization and validation of time motion studies is a common understanding of the term, accompanied by a proper recognition of the distinct techniques it encompasses. Towards this goal, we conducted a review of the literature aiming at identifying what is being referred to as “time motion studies”. We provide a detailed description of the distinct methods used in articles referenced or classified as “time motion studies”, and conclude that the term is currently used not only to define the original technique, but also to describe a broad spectrum of studies whose only common factor is the capture and/or analysis of the duration of one or more events.
To maintain alignment with the existing broad scope of the term, we propose a disambiguation approach by preserving the expanded conception, while recommending the use of a specific qualifier “continuous observation time motion studies” to refer to variations of the original method (the use of an external observer recording data continuously). In addition, we present a more granular naming for sub-techniques within continuous observation time motion studies, expecting to reduce the methodological variability within each sub-technique and facilitate future results aggregation.
Time and Motion Studies; methods; standardization
In this study we report on potential drug-drug interactions between drugs occurring in patient clinical data. Results are based on relationships in SemMedDB, a database of structured knowledge extracted from all MEDLINE citations (titles and abstracts) using SemRep. The core of our methodology is to construct two potential drug-drug interaction schemas, based on relationships extracted from SemMedDB. In the first schema, Drug1 and Drug2 interact through Drug1’s effect on some gene, which in turn affects Drug2. In the second, Drug1 affects Gene1, while Drug2 affects Gene2. Gene1 and Gene2, together, then have an effect on some biological function. After checking each drug pair from the medication lists of each of 22 patients, we found 19 known and 62 unknown drug-drug interactions using both schemas. For example, our results suggest that the interaction of Lisinopril, an ACE inhibitor commonly prescribed for hypertension, and the antidepressant sertraline can potentially increase the likelihood and possibly the severity of psoriasis. We also assessed the relationships extracted by SemRep from a linguistic perspective and found that the precision of SemRep was 0.58 for 300 randomly selected sentences from MEDLINE. Our study demonstrates that the use of structured knowledge in the form of relationships from the biomedical literature can support the discovery of potential drug-drug interactions occurring in patient clinical data. Moreover, SemMedDB provides a good knowledge resource for expanding the range of drugs, genes, and biological functions considered as elements in various drug-drug interaction pathways.
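The two schemas can be read as path queries over subject-predicate-object triples. The sketch below is an illustration only: it does not use the SemMedDB schema or API, it ignores which predicate links two concepts, and all drug, gene, and function names are hypothetical placeholders (the predicate labels are included only to make the triples readable).

```python
from itertools import combinations, permutations
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def index_objects(triples: List[Triple]) -> Dict[str, Set[str]]:
    """Map each subject to the set of concepts it is asserted to act on."""
    affects: Dict[str, Set[str]] = {}
    for subj, _pred, obj in triples:
        affects.setdefault(subj, set()).add(obj)
    return affects


def schema_one(drugs: Set[str], triples: List[Triple]) -> Set[Tuple[str, str, str]]:
    """Drug1 acts on a gene, and that gene in turn acts on Drug2."""
    affects = index_objects(triples)
    found = set()
    for d1, d2 in permutations(sorted(drugs), 2):
        for gene in affects.get(d1, set()) - drugs:
            if d2 in affects.get(gene, set()):
                found.add((d1, gene, d2))
    return found


def schema_two(drugs: Set[str], triples: List[Triple]) -> Set[Tuple[str, str, str, str, str]]:
    """Drug1 acts on Gene1, Drug2 acts on Gene2, and both genes act on the same function."""
    affects = index_objects(triples)
    found = set()
    for d1, d2 in combinations(sorted(drugs), 2):
        for g1 in affects.get(d1, set()) - drugs:
            for g2 in affects.get(d2, set()) - drugs:
                if g1 == g2:
                    continue
                shared = (affects.get(g1, set()) & affects.get(g2, set())) - drugs
                for function in shared:
                    found.add((d1, g1, d2, g2, function))
    return found


# Hypothetical triples and a hypothetical patient medication list.
triples = [
    ("drug_a", "INHIBITS", "gene_x"),
    ("gene_x", "INTERACTS_WITH", "drug_b"),
    ("drug_a", "STIMULATES", "gene_y"),
    ("drug_c", "INHIBITS", "gene_z"),
    ("gene_y", "AFFECTS", "function_f"),
    ("gene_z", "AFFECTS", "function_f"),
]
medications = {"drug_a", "drug_b", "drug_c"}

via_gene = schema_one(medications, triples)
via_function = schema_two(medications, triples)
print(via_gene, via_function)
```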
Drug-drug interactions; MEDLINE; SemRep; SemMedDB; Natural language processing; Clinical data; Semantic predication
Computer-assisted image retrieval applications could assist radiologist interpretations by identifying similar images in large archives as a means of providing decision support. However, the semantic gap between low-level image features and their high-level semantics may impair system performance. Indeed, it can be challenging to comprehensively characterize the images using low-level imaging features to fully capture the visual appearance of diseases on images, and recently the use of semantic terms has been advocated to provide semantic descriptions of the visual contents of images. However, most of the existing image retrieval strategies do not consider the intrinsic properties of these terms during the comparison of the images beyond treating them as simple binary (presence/absence) features. We propose a new framework that includes semantic features in images and that enables retrieval of similar images in large databases based on their semantic relations. It is based on two main steps: (1) annotation of the images with semantic terms extracted from an ontology, and (2) evaluation of the similarity of image pairs by computing the similarity between the terms using the Hierarchical Semantic-Based Distance (HSBD) coupled to an ontological measure. The combination of these two steps provides a means of capturing the semantic correlations among the terms used to characterize the images, which can be considered a potential solution to the semantic gap problem. We validate this approach in the context of the retrieval and the classification of 2D regions of interest (ROIs) extracted from computed tomographic (CT) images of the liver. Under this framework, retrieval accuracy of more than 0.96 was obtained on a 30-image dataset using the Normalized Discounted Cumulative Gain (NDCG) index, a standard technique used to measure the effectiveness of information retrieval algorithms when a separate reference standard is available.
Classification results of more than 95% were obtained on a 77-image dataset. For comparison, the use of the Earth Mover's Distance (EMD), an alternative distance metric that considers all the existing relations among the terms, led to a retrieval accuracy of 0.95 and classification results of 93% at a higher computational cost. The results provided by the presented framework are competitive with the state-of-the-art and emphasize the usefulness of the proposed methodology for radiology image retrieval and classification.
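The NDCG index compares the discounted gain of a returned ranking with that of the ideal ordering of the same items. The sketch below uses the common linear-gain form with a log2 rank discount; the abstract does not say which NDCG variant was used, and the relevance grades are hypothetical.

```python
from math import log2
from typing import List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: List[float]) -> float:
    """DCG of the ranking divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Hypothetical reference-standard similarity grades of retrieved images, in ranked order.
perfect = ndcg([3, 2, 1, 0])   # already in ideal order
imperfect = ndcg([0, 1, 3])    # most similar image ranked last
print(perfect, round(imperfect, 3))
```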
Image retrieval; classification; semantic image annotation; linguistic knowledge; semantic-based distances; ontologies; computed tomographic (CT) images; liver lesions
We aim to quantify HMG-CoA reductase inhibitor (statin) prescriber-intended exposure-time using a generalizable algorithm that interrogates data stored in the electronic health record (EHR).
Materials and methods
This study was conducted using the Marshfield Clinic (MC) Personalized Medicine Research Project (PMRP), a central Wisconsin-based population and biobank with, on average, 30 years of electronic health data available in the independently-developed MC Cattails MD EHR. Individuals with evidence of statin exposure were identified from the electronic records, and manual chart abstraction of all mentions of prescribed statins was completed. We then performed electronic chart abstraction of prescriber-intended exposure time for statins, using previously identified logic to capture pill-splitting events, normalizing dosages to atorvastatin-equivalent dose. Four models using iterative training sets were tested to capture statin end-dates. Calculated cumulative provider-intended exposures were compared to manually abstracted gold-standard measures of ordered statin prescriptions, and aggregate model results (totals) for training and validation populations were compared. The most successful model was the one with the smallest discordance between modeled and manually abstracted Atorvastatin 10 mg/year Equivalents (AEs).
Of the approximately 20,000 patients enrolled in the PMRP, 6243 were identified with statin exposure during the study period (1997–2011), 59.8% of whom had been prescribed multiple statins over an average of approximately 11 years. When the best-fit algorithm was implemented and validated by manual chart review for the statin-ordered population, it was found to capture 95.9% of the correlation between calculated and expected statin provider-intended exposure time for a random validation set, and the best-fit model was able to predict intended statin exposure to within a standard deviation of 2.6 AEs, with a standard error of +0.23 AEs.
We demonstrate that normalized provider-intended statin exposure time can be estimated using a combination of structured clinical data sources, including a medications ordering system and a clinical appointment coordination system, supplemented with text data from clinical notes.
Anticholesteremic agents; Algorithm; Electronic health records; Statins; HMG-CoA; Drug dosage calculations
Intensive care monitoring systems are typically developed from population data, but do not take into account the variability among individual patients’ characteristics. This study develops patient-specific alarm algorithms in real time. Classification tree and neural network learning were carried out in batch mode on individual patients’ vital sign numerics in successive intervals of incremental duration to generate binary classifiers of patient state and thus to determine when to issue an alarm. Results suggest that the performance of these classifiers follows the course of a learning curve. After eight hours of patient-specific training during each of ten monitoring sessions, our neural networks reached average sensitivity, specificity, positive predictive value, and accuracy of 0.96, 0.99, 0.79, and 0.99 respectively. The classification trees achieved 0.84, 0.98, 0.72, and 0.98 respectively. Thus, patient-specific modeling in real time is not only feasible but also effective in generating alerts at the bedside.
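The four reported measures all derive from the confusion matrix of a binary alarm classifier. The sketch below uses hypothetical counts (not the study's data) chosen to show how positive predictive value can sit well below specificity when true alarm states are rare.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def alarm_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, positive predictive value, and accuracy."""
    return {
        "sensitivity": _ratio(tp, tp + fn),   # alarm-worthy states that were flagged
        "specificity": _ratio(tn, tn + fp),   # stable states that were left alone
        "ppv": _ratio(tp, tp + fp),           # issued alarms that were warranted
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts: 50 alarm-worthy intervals among 1,000 monitored intervals.
metrics = alarm_metrics(tp=48, fp=12, tn=938, fn=2)
print({name: round(value, 3) for name, value in metrics.items()})
```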
patient-specific adaptivity; real-time batch learning; classification tree; neural network; patient monitoring; critical care
Medical message boards are online resources where users with a particular condition exchange information, some of which they might not otherwise share with medical providers. Many of these boards contain a large number of posts and contain patient opinions and experiences that would be potentially useful to clinicians and researchers. We present an approach that is able to collect a corpus of medical message board posts, de-identify the corpus, and extract information on potential adverse drug effects discussed by users. Using a corpus of posts to breast cancer message boards, we identified drug event pairs using co-occurrence statistics. We then compared the identified drug event pairs with adverse effects listed on the package labels of tamoxifen, anastrozole, exemestane, and letrozole. Of the pairs identified by our system, 75–80% were documented on the drug labels. Some of the undocumented pairs may represent previously unidentified adverse drug effects.
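A simple co-occurrence statistic for drug-event pairs can be sketched as follows. The abstract does not name the statistic used, so pointwise mutual information (PMI) over post-level co-occurrence is an assumption here, as are the naive substring matching and the example posts, which are invented for illustration.

```python
from collections import Counter
from itertools import product
from math import log2
from typing import Dict, List, Set, Tuple


def cooccurrence_pmi(posts: List[str], drugs: Set[str], events: Set[str]) -> Dict[Tuple[str, str], float]:
    """Score each (drug, event) pair that co-occurs in at least one post by PMI."""
    n = len(posts)
    term_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for text in posts:
        lowered = text.lower()
        # Naive substring matching; a real system would need proper term recognition.
        drugs_found = {d for d in drugs if d in lowered}
        events_found = {e for e in events if e in lowered}
        term_counts.update(drugs_found | events_found)
        pair_counts.update(product(drugs_found, events_found))
    return {
        (drug, event): log2(count * n / (term_counts[drug] * term_counts[event]))
        for (drug, event), count in pair_counts.items()
    }


# Invented posts, for illustration only.
posts = [
    "Started tamoxifen last month and the hot flashes are constant.",
    "Hot flashes again on tamoxifen.",
    "Letrozole gave me joint pain.",
    "Anyone have joint pain after surgery?",
]
scores = cooccurrence_pmi(posts, drugs={"tamoxifen", "letrozole"}, events={"hot flashes", "joint pain"})
print(scores)
```

Pairs with high scores would then be compared against the events listed on each drug's label, as the abstract describes.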
data mining; information extraction; medical message board; drug adverse effect
Standardized terminological systems for biomedical information have provided considerable benefits to biomedical applications and research. However, practical use of this information often requires mapping across terminological systems—a complex and time-consuming process. This paper demonstrates the complexity and challenges of mapping across terminological systems in the context of medication information. It provides a review of medication terminological systems and their linkages, then describes a case study in which we mapped proprietary medication codes from an electronic health record to SNOMED-CT and the UMLS Metathesaurus. The goal was to create a polyhierarchical classification system for querying an i2b2 clinical data warehouse. We found that three methods were required to accurately map the majority of actively prescribed medications. Only 62.5% of source medication codes could be mapped automatically. The remaining codes were mapped using a combination of semi-automated string comparison with expert selection, and a completely manual approach. Compound drugs were especially difficult to map: only 7.5% could be mapped using the automatic method. General challenges to mapping across terminological systems include (1) the availability of up-to-date information to assess the suitability of a given terminological system for a particular use case, and to assess the quality and completeness of cross-terminology links; (2) the difficulty of correctly using complex, rapidly evolving, modern terminologies; (3) the time and effort required to complete and evaluate the mapping; (4) the need to address differences in granularity between the source and target terminologies; and (5) the need to continuously update the mapping as terminological systems evolve.
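The semi-automated step, string comparison followed by expert selection, might look like the sketch below, which proposes the closest target terms for a source medication name and leaves the final choice to a reviewer. It uses the standard-library `difflib.get_close_matches`; the similarity cutoff, the source string, and the target term list are hypothetical and do not reproduce the study's procedure or actual SNOMED-CT content.

```python
from difflib import get_close_matches
from typing import List


def suggest_targets(source_name: str, target_names: List[str], n: int = 3, cutoff: float = 0.6) -> List[str]:
    """Return up to n target terms most similar to the source name, best match first."""
    # Compare case-insensitively, then map back to the original target spelling.
    lookup = {name.lower(): name for name in target_names}
    hits = get_close_matches(source_name.lower(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[hit] for hit in hits]


# Hypothetical target terminology entries and a hypothetical proprietary source name.
targets = [
    "Atorvastatin 10 mg oral tablet",
    "Atorvastatin 20 mg oral tablet",
    "Lisinopril 10 mg oral tablet",
]
suggestions = suggest_targets("ATORVASTATIN 10MG TAB", targets)
print(suggestions)  # candidates for an expert to confirm or reject
```

Keeping a human in the loop matters here because near-identical strings can differ in strength or form, which is exactly the kind of granularity mismatch listed among the general challenges.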
Medication terminological systems; Standards; Terminology mapping; Review of medication terminological systems
Reducing care variability through guidelines has significantly benefited patients. Nonetheless, guideline-based clinical decision support (CDS) systems are not widely implemented or used, are frequently out-of-date, and cannot address complex care for which guidelines do not exist. Here, we develop and evaluate a complementary approach - using Bayesian network (BN) learning to generate adaptive, context-specific treatment menus based on local order-entry data. These menus can be used as a draft for expert review, in order to minimize development time for local decision support content. This is in keeping with the vision outlined in the US Health Information Technology Strategic Plan, which describes a healthcare system that learns from itself.
Materials and Methods
We used the Greedy Equivalence Search algorithm to learn four 50-node domain-specific BNs from 11,344 encounters: abdominal pain in the emergency department, inpatient pregnancy, hypertension in the urgent visit clinic, and altered mental state in the intensive care unit. We developed a system to produce situation-specific, rank-ordered treatment menus from these networks. We evaluated this system with a hospital-simulation methodology and computed Area Under the Receiver-Operator Curve (AUC) and average menu position at time of selection. We also compared this system with a similar association-rule-mining approach.
A short order menu on average contained the next order (weighted average length 3.91–5.83 items). Overall predictive ability was good: average AUC above 0.9 for 25% of order types and overall average AUC .714–.844 (depending on domain). However, AUC had high variance (.50–.99). Higher AUC correlated with tighter clusters and more connections in the graphs, indicating importance of appropriate contextual data. Comparison with an association rule mining approach showed similar performance for only the most common orders with dramatic divergence as orders are less frequent.
Discussion and Conclusion
This study demonstrates that local clinical knowledge can be extracted from treatment data for decision support. This approach is appealing because: it reflects local standards; it uses data already being captured; and it produces human-readable treatment-diagnosis networks that could be curated by a human expert to reduce workload in developing localized CDS content. The BN methodology captured transitive associations and co-varying relationships, which existing approaches do not. It also performs better as orders become less frequent and require more context. This system is a step forward in harnessing local, empirical data to enhance decision support.
clinical decision support; data mining; Bayesian Analysis
Biomedical prediction based on clinical and genome-wide data has become increasingly important in disease diagnosis and classification. To solve the prediction problem in an effective manner for the improvement of clinical care, we develop a novel Artificial Neural Network (ANN) method based on Matrix Pseudo-Inversion (MPI) for use in biomedical applications. The MPI-ANN is constructed as a three-layer (i.e., input, hidden, and output layers) feed-forward neural network, and the weights connecting the hidden and output layers are directly determined based on MPI without a lengthy learning iteration. The LASSO (Least Absolute Shrinkage and Selection Operator) method is also presented for comparative purposes. Single Nucleotide Polymorphism (SNP) simulated data and real breast cancer data are employed to validate the performance of the MPI-ANN method via 5-fold cross validation. Experimental results demonstrate the efficacy of the developed MPI-ANN for disease classification and prediction, in view of the significantly superior accuracy (i.e., the rate of correct predictions), as compared with LASSO. The results based on the real breast cancer data also show that the MPI-ANN has better performance than other machine learning methods (including support vector machine (SVM), logistic regression (LR), and an iterative ANN). In addition, experiments demonstrate that our MPI-ANN could be used for bio-marker selection as well.
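The idea of determining the hidden-to-output weights in one step with a pseudo-inverse can be sketched as below. This is not the authors' MPI-ANN: the abstract does not say how the input-to-hidden weights are set, so the sketch assumes they are random and fixed, uses a tanh activation, and relies on NumPy's `numpy.linalg.pinv`. The XOR data are a toy example.

```python
import numpy as np


def train_pinv_network(X: np.ndarray, y: np.ndarray, n_hidden: int = 20, seed: int = 0):
    """Three-layer feed-forward net whose output weights come from a pseudo-inverse.

    Assumption: input-to-hidden weights are random and left untrained.
    """
    rng = np.random.default_rng(seed)
    w_in = rng.normal(size=(X.shape[1], n_hidden))   # input -> hidden weights
    bias = rng.normal(size=n_hidden)                 # hidden biases
    hidden = np.tanh(X @ w_in + bias)                # hidden-layer activations
    # Least-squares output weights in a single step, with no iterative training.
    w_out = np.linalg.pinv(hidden) @ y
    return w_in, bias, w_out


def predict(model, X: np.ndarray) -> np.ndarray:
    """Forward pass through the fixed hidden layer and the solved output weights."""
    w_in, bias, w_out = model
    return np.tanh(X @ w_in + bias) @ w_out


# Toy XOR problem: not linearly separable, but solvable through the hidden layer.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

model = train_pinv_network(X, y)
predictions = predict(model, X)
print(np.round(predictions, 3))
```

With more hidden units than training samples the pseudo-inverse solution typically fits the training set exactly, so held-out evaluation (the abstract reports 5-fold cross validation) is needed to judge generalization.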
Biomedical prediction and classification; Neural networks; Matrix pseudo-inversion; Least Absolute Shrinkage and Selection Operator (LASSO); Single Nucleotide Polymorphism (SNP); Cancer
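The core idea of the MPI-ANN abstract above, solving the hidden-to-output weights in one step with a matrix pseudo-inverse instead of iterative learning, can be illustrated with a minimal sketch. The abstract does not say how the input-to-hidden weights are obtained, so the sketch assumes they are drawn at random and kept fixed; the tanh activation, the 0.5 decision threshold, and the function names `train_mpi_ann` and `predict_mpi_ann` are likewise illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_mpi_ann(X, y, n_hidden=20, seed=0):
    """Fit a three-layer feed-forward net; only the output weights are solved for."""
    rng = np.random.default_rng(seed)
    # Assumption: input-to-hidden weights and biases are random and never updated.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)  # hidden-layer activations, shape (n_samples, n_hidden)
    # Hidden-to-output weights: least-squares solution via the Moore-Penrose pseudo-inverse.
    beta = np.linalg.pinv(H) @ y
    return W, b, beta

def predict_mpi_ann(model, X):
    """Return 0/1 class predictions by thresholding the network output at 0.5."""
    W, b, beta = model
    return (np.tanh(X @ W + b) @ beta > 0.5).astype(int)

if __name__ == "__main__":
    # Synthetic two-class data, for illustration only.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    model = train_mpi_ann(X, y, n_hidden=50)
    print("training accuracy:", (predict_mpi_ann(model, X) == y).mean())
```

Because the output weights come from a single linear solve, training cost is dominated by one pseudo-inverse of the hidden-activation matrix rather than by repeated gradient updates.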
Correlation of data within electronic health records is necessary for implementation of various clinical decision support functions, including patient summarization. A key type of correlation is linking medications to clinical problems; while some databases of problem-medication links are available, they are not robust and depend on problems and medications being encoded in particular terminologies. Crowdsourcing represents one approach to generating robust knowledge bases across a variety of terminologies, but more sophisticated approaches are necessary to improve accuracy and reduce manual data review requirements.
We sought to develop and evaluate a clinician reputation metric to facilitate the identification of appropriate problem-medication pairs through crowdsourcing without requiring extensive manual review.
We retrieved medications from our clinical data warehouse that had been prescribed and manually linked to one or more problems by clinicians during e-prescribing between June 1, 2010 and May 31, 2011. We identified measures likely to be associated with the percentage of accurate problem-medication links made by clinicians. Using logistic regression, we created a metric for identifying clinicians who had made greater than or equal to 95% appropriate links. We evaluated the accuracy of the approach by comparing links made by those physicians identified as having appropriate links to a previously manually validated subset of problem-medication pairs.
Of 867 clinicians who asserted a total of 237,748 problem-medication links during the study period, 125 had a reputation metric that predicted the percentage of appropriate links greater than or equal to 95%. These clinicians asserted a total of 2,464 linked problem-medication pairs (983 distinct pairs). Compared to a previously validated set of problem-medication pairs, the reputation metric achieved a specificity of 99.5% and marginally improved the sensitivity of previously described knowledge bases.
A reputation metric may be a valuable measure for identifying high quality clinician-entered, crowdsourced data.
Electronic health records; Crowdsourcing; Knowledge bases; Medical records; Problem-oriented
The benefits of using ontology subsets versus full ontologies are well-documented for many applications. In this study, we propose an efficient subset extraction approach for a domain using a biomedical ontology repository with mappings, a cross-ontology, and a source subset from a related domain. As a case study, we extracted a subset of drugs from RxNorm using the UMLS Metathesaurus, the NDF-RT cross-ontology, and the CORE problem list subset of SNOMED CT. The extracted subset, which we termed RxNorm/CORE, was 4% of the size of the full RxNorm (0.4% when considering ingredients only). For evaluation, we used CORE and RxNorm/CORE as thesauri for the annotation of clinical documents and compared their performance to that of their respective full ontologies (i.e., SNOMED CT and RxNorm). The wide range in recall of both CORE (29–69%) and RxNorm/CORE (21–35%) suggests that more quantitative research is needed to assess the benefits of using ontology subsets as thesauri in annotation applications. Our approach to subset extraction, however, opens a door to creating other types of clinically useful domain-specific subsets and offers an alternative in scenarios where well-established subset extraction techniques run into difficulties or cannot be applied.
Ontologies; SNOMED CT; RxNorm; NDF-RT; UMLS; Medical records; Annotation
Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. However, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora.
This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency.
The public release of the NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks.
Disease name recognition; Named entity recognition; Disease name normalization; Corpus annotation; Disease name corpus
Polysemy is a frequent issue in biomedical terminologies. In the Unified Medical Language System (UMLS), polysemous terms are either represented as several independent concepts, or clustered into a single, multiply-categorized concept. The objective of this study is to analyze polysemous concepts in the UMLS through their categorization and hierarchical relations for auditing purposes.
We used the association of a concept with multiple Semantic Groups (SGs) as a surrogate for polysemy. We first extracted multi-SG (MSG) concepts from the UMLS Metathesaurus and characterized them in terms of the combinations of SGs with which they are associated. We then clustered MSG concepts in order to identify major types of polysemy. We also analyzed the inheritance of SGs in MSG concepts. Finally, we manually reviewed the categorization of the MSG concepts for auditing purposes.
The 1208 MSG concepts in the Metathesaurus are associated with 30 distinct pairs of SGs. We created 75 semantically homogeneous clusters of MSG concepts, and 276 MSG concepts could not be clustered for lack of hierarchical relations. The clusters were characterized by the most frequent pairs of semantic types of their constituent MSG concepts. MSG concepts exhibit limited semantic compatibility with their parent and child concepts. A large majority of MSG concepts (92%) are adequately categorized. Examples of miscategorized concepts are presented.
This work is a systematic analysis and manual review of all concepts categorized by multiple SGs in the UMLS. The correctly-categorized MSG concepts do reflect polysemy in the UMLS Metathesaurus. The analysis of inheritance of SGs proved useful for auditing concept categorization in the UMLS.
Biomedical terminologies; Auditing methods; Unified Medical Language System (UMLS); Polysemy; Semantic categorization
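The first step of the polysemy study above, extracting concepts associated with more than one Semantic Group and tallying the SG pairs they fall into, might be sketched as follows. This is a simplified illustration rather than the authors' pipeline: the concept identifiers are made up, the type-to-group table covers only four UMLS semantic types, and the helper names `multi_sg_concepts` and `sg_pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Small excerpt of the semantic type -> Semantic Group mapping (illustrative subset).
TYPE_TO_GROUP = {
    "Disease or Syndrome": "Disorders",
    "Finding": "Disorders",
    "Pharmacologic Substance": "Chemicals & Drugs",
    "Laboratory Procedure": "Procedures",
}

def multi_sg_concepts(concept_types):
    """Return {concept: set of SGs} for concepts whose types span more than one SG."""
    result = {}
    for concept, types in concept_types.items():
        groups = {TYPE_TO_GROUP[t] for t in types}
        if len(groups) > 1:
            result[concept] = groups
    return result

def sg_pair_counts(msg_concepts):
    """Count how many multi-SG concepts are associated with each pair of SGs."""
    return Counter(
        pair
        for groups in msg_concepts.values()
        for pair in combinations(sorted(groups), 2)
    )

if __name__ == "__main__":
    # Made-up concept identifiers and categorizations, for illustration only.
    concepts = {
        "X0000001": ["Pharmacologic Substance", "Laboratory Procedure"],
        "X0000002": ["Disease or Syndrome", "Finding"],  # two types, one SG
        "X0000003": ["Finding", "Laboratory Procedure"],
    }
    msg = multi_sg_concepts(concepts)
    print(sorted(msg))
    print(sg_pair_counts(msg))
```

A concept with several semantic types that all belong to the same group, such as the second entry, is not counted as multi-SG under this definition.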
Electronic health records (EHR) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient’s clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms requires time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using Inductive Logic Programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping.
Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance.
We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models were evaluated for each ML approach; ILP achieved better overall model performance in AUROC than PART (p=0.039), J48 (p=0.003), and JRIP (p=0.003).
ILP has the potential to improve phenotyping by independently delivering rules for phenotype definitions that clinical experts can interpret, or intuitive phenotypes to assist experts.
Relational learning using ILP offers a viable approach to EHR-driven phenotyping.
Machine learning; Electronic health record; Inductive logic programming; Phenotyping; Relational learning
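The two-sided sign test mentioned in the phenotyping abstract above compares two approaches by counting, across phenotypes, how often one scores higher than the other. A minimal sketch follows; the function name `sign_test_two_sided`, the choice to drop ties, and the AUROC values in the usage example are assumptions made for illustration and are not taken from the study.

```python
from math import comb

def sign_test_two_sided(a, b):
    """Two-sided sign test p-value for paired scores a[i] vs. b[i]; ties are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

if __name__ == "__main__":
    # Hypothetical AUROC values for nine phenotypes under two approaches.
    approach_a = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89, 0.80]
    approach_b = [0.86, 0.84, 0.90, 0.83, 0.85, 0.86, 0.88, 0.87, 0.82]
    print(round(sign_test_two_sided(approach_a, approach_b), 3))
```

With nine paired comparisons, eight wins and one loss give a p-value of 20/512, about 0.039, and nine wins give 2/512, about 0.004, which shows how coarse the test is at this sample size.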
Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines due to difficulties of accurate parsing. This paper proposes a novel methodology to derive a semantic index for clinical eligibility documents based on a controlled vocabulary of frequent tags, which are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically when the repository increases. We proposed to apply the semantic index to filter clinical trial search results and we concluded that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text leads to an effective semantic index for the clinical eligibility documents and promotes their computational reuse.
Information Storage and Retrieval; Clinical Trials; Tags; Information Filtering; Eligibility Criteria; Controlled Vocabulary
Information overload is a significant problem facing online clinical trial searchers. We present eTACTS, a novel interactive retrieval framework using common eligibility tags to dynamically filter clinical trial search results.
Materials and Methods
eTACTS mines frequent eligibility tags from free-text clinical trial eligibility criteria and uses these tags for trial indexing. After an initial search, eTACTS presents to the user a tag cloud representing the current results. When the user selects a tag, eTACTS retains only those trials containing that tag in their eligibility criteria and generates a new cloud based on tag frequency and co-occurrences in the remaining trials. The user can then select a new tag or unselect a previous tag. The process iterates until a manageable number of trials is returned. We evaluated eTACTS in terms of filtering efficiency, diversity of the search results, and user eligibility for the filtered trials using both qualitative and quantitative methods.
eTACTS (1) rapidly reduced search results from over a thousand trials to ten; (2) highlighted trials that are generally not top-ranked by conventional search engines; and (3) retrieved a greater number of suitable trials than existing search engines.
eTACTS enables intuitive clinical trial searches by indexing eligibility criteria with effective tags. User evaluation was limited to one case study and a small group of evaluators due to the long duration of the experiment. Although a larger-scale evaluation could be conducted, this feasibility study demonstrated significant advantages of eTACTS over existing clinical trial search engines.
A dynamic eligibility tag cloud can potentially enhance state-of-the-art clinical trial search engines by allowing intuitive and efficient filtering of the search result space.
Information Storage and Retrieval; Clinical Trials; Dynamic Information Filtering; Interactive Information Retrieval; Tag Cloud; Association Rules; Eligibility Criteria
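The select-filter-regenerate loop described for eTACTS above can be approximated in a few lines. This is a simplified sketch, not the eTACTS implementation: it assumes each trial is already indexed as a set of tags, ranks the cloud by raw frequency only (the co-occurrence weighting mentioned in the abstract is omitted), and uses hypothetical names (`filter_trials`, `tag_cloud`) and made-up trial identifiers.

```python
from collections import Counter

def filter_trials(trials, selected_tags):
    """Keep only trials whose tag set contains every currently selected tag."""
    return {tid: tags for tid, tags in trials.items() if selected_tags <= tags}

def tag_cloud(trials, selected_tags, top_n=10):
    """Rank the tags of the remaining trials by frequency, skipping selected ones.

    Assumption: frequency-only ranking; no co-occurrence weighting is applied.
    """
    counts = Counter(t for tags in trials.values() for t in tags - selected_tags)
    return counts.most_common(top_n)

if __name__ == "__main__":
    # Made-up trial index: trial identifier -> set of eligibility tags.
    trials = {
        "T1": {"diabetes", "insulin", "adult"},
        "T2": {"diabetes", "pregnancy"},
        "T3": {"asthma", "adult"},
    }
    selected = {"diabetes"}           # the user clicks a tag in the cloud
    remaining = filter_trials(trials, selected)
    print(sorted(remaining))          # trials still in the result set
    print(tag_cloud(remaining, selected))
```

Unselecting a tag corresponds to removing it from `selected` and filtering the original trial set again, so the loop can move in either direction until the result set is manageable.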
We describe a domain-independent methodology to extend SemRep coverage beyond the biomedical domain. SemRep, a natural language processing application originally designed for biomedical texts, uses the knowledge sources provided by the Unified Medical Language System (UMLS©). Ontological and terminological extensions to the system are needed in order to support other areas of knowledge. We extended SemRep's application by developing a semantic representation of a previously unsupported domain. This was achieved by adapting well-known ontology engineering phases and integrating them with the UMLS knowledge sources on which SemRep crucially depends. While the process to extend SemRep coverage has been successfully applied in earlier projects, this paper presents in detail the stepwise approach we followed and the mechanisms implemented. A case study in the field of medical informatics illustrates how the ontology engineering phases have been adapted for optimal integration with the UMLS. We provide qualitative and quantitative results, which indicate the validity and usefulness of our methodology.
Natural Language Processing Application; Domain-Independent Ontology Development Methodology; Semantic Predications; UMLS Knowledge Sources