
Results 1-19 (19)

1.  Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature 
BMC Bioinformatics  2015;16(1):185.
Advances in next-generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of it remains ‘locked’ in the unstructured text of biomedical publications. There is a substantial lag between publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on sentence-level association extraction, with performance evaluated against gold standard text annotations prepared specifically for text mining systems.
We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse-level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation-disease associations in curated database records. The discourse-level analysis component of MutD contributed a gain of more than 10% in F-measure over sentence-level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators, and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for these defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%.
Our quantitative analysis reveals that MutD can effectively extract protein mutation-disease associations when benchmarked against curated database records. The analysis also demonstrates that incorporating discourse-level analysis significantly improved the performance of extracting protein-mutation-disease associations. Future work includes extending MutD to full-text articles.
PMCID: PMC4457984  PMID: 26047637
Mutation mining; Text mining; Protein mutation disease association
The creation of biological pathway knowledge bases is largely driven by manual curation based on evidence from the scientific literature, and it is highly challenging for curators to keep up with the literature. Text mining applications have been developed over the last decade to help human curators speed up the curation pace, but most of them aim to identify the most relevant papers for curation, with little attempt to directly extract pathway information from text. In this paper, we describe a rule-based literature mining system to extract pathway information from text. We evaluated the system using curated pharmacokinetic (PK) and pharmacodynamic (PD) pathways in PharmGKB. The system achieved F-measures of 63.11% and 34.99% for entity extraction and event extraction, respectively, against all PubMed abstracts cited in PharmGKB. It may be possible to improve the system performance by incorporating statistical machine learning approaches. This study also helped us gain insights into the barriers to automated event extraction from text for pathway curation.
PMCID: PMC3902837  PMID: 24297561
3.  Automated chart review for asthma cohort identification using natural language processing: an exploratory study 
A significant proportion of children with asthma have delayed diagnosis of asthma by health care providers. Manual chart review according to established criteria is more accurate than directly using diagnosis codes, which tend to under-identify asthmatics, but chart reviews are more costly and less timely.
To evaluate the accuracy of a computational approach to asthma ascertainment, characterizing its utility and feasibility toward large-scale deployment in electronic medical records.
A natural language processing (NLP) system was developed for extracting predetermined criteria for asthma from unstructured text in electronic medical records and then inferring asthma status based on these criteria. Using manual chart reviews as a gold standard, asthma status (yes vs no) and identification date (first date of a “yes” asthma status) were determined by the NLP system.
Patients were a group of children (n = 112, 84% Caucasian, 49% girls) younger than 4 years (mean 2.0 years, standard deviation 1.03 years) who participated in previous studies. The NLP approach to asthma ascertainment showed sensitivity, specificity, positive predictive value, negative predictive value, and median delay in diagnosis of 84.6%, 96.5%, 88.0%, 95.4%, and 0 months, respectively; this compared favorably with diagnosis codes, at 30.8%, 93.2%, 57.1%, 82.2%, and 2.3 months, respectively.
Automated asthma ascertainment from electronic medical records using NLP is feasible and more accurate than traditional approaches such as diagnosis codes. Considering the difficulty of labor-intensive manual record review, NLP approaches for asthma ascertainment should be considered for improving clinical care and research, especially in large-scale efforts.
PMCID: PMC3839107  PMID: 24125142
4.  Automated Recommendation for Cervical Cancer Screening and Surveillance 
Cancer Informatics  2014;13(Suppl 3):1-6.
Because of the complexity of cervical cancer prevention guidelines, clinicians often fail to follow best-practice recommendations. Moreover, existing clinical decision support (CDS) systems generally recommend a cervical cytology every three years for all female patients, which is inappropriate for patients with abnormal findings that require surveillance at shorter intervals. To address this problem, we developed a decision tree-based CDS system that integrates national guidelines to provide comprehensive guidance to clinicians. Validation was performed in several iterations by comparing recommendations generated by the system with those of clinicians for 333 patients. The CDS system extracted relevant patient information from the electronic health record and applied the guideline model with an overall accuracy of 87%. Providers without CDS assistance needed an average of 1 minute 39 seconds to decide on recommendations for management of abnormal findings. Overall, our work demonstrates the feasibility and potential utility of an automated recommendation system for cervical cancer screening and surveillance.
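A decision-tree-based guideline model of this kind can be sketched as a set of ordered rules. The ages, intervals, and actions below are hypothetical placeholders for illustration only, not the actual national guidelines or the system's model:

```python
# Illustrative sketch of a decision-tree-style screening rule, in the
# spirit of the guideline-based CDS described above. All thresholds and
# actions here are hypothetical, not actual clinical guidance.
def screening_recommendation(age, last_pap_result, months_since_pap):
    """Return a (hypothetical) cervical screening recommendation."""
    if last_pap_result == "abnormal":
        # Abnormal findings require surveillance at a shorter interval,
        # not the default three-year routine recommendation.
        return "colposcopy referral / 6-month surveillance"
    if age < 21:
        return "no routine screening"
    if months_since_pap >= 36:
        return "routine Pap test due"
    return "routine Pap test in %d months" % (36 - months_since_pap)
```

The point of the tree structure is that surveillance rules take precedence over the blanket three-year default that simpler CDS systems apply.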
PMCID: PMC4214690  PMID: 25368505
cervical cancer; clinical decision support; natural language processing; Papanicolaou test; colposcopy
5.  Comprehensive temporal information detection from clinical text: medical events, time, and TLINK identification 
Temporal information detection systems have been developed by the Mayo Clinic for the 2012 i2b2 Natural Language Processing Challenge.
To construct automated systems for EVENT/TIMEX3 extraction and temporal link (TLINK) identification from clinical text.
Materials and methods
The i2b2 organizers provided 190 annotated discharge summaries as the training set and 120 discharge summaries as the test set. Our Event system used a conditional random field classifier with a variety of features including lexical information, natural language elements, and medical ontology. The TIMEX3 system employed a rule-based method using regular-expression pattern matching and systematic reasoning to determine normalized values. The TLINK system employed both rule-based reasoning and machine learning. All three systems were built in an Apache Unstructured Information Management Architecture framework.
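The rule-based TIMEX3 idea, regular-expression matching followed by normalization to a standard value, can be sketched as follows. The single MM/DD/YYYY pattern is illustrative only; the actual system uses a much larger rule set plus systematic reasoning for relative expressions:

```python
import re

# Minimal sketch of rule-based temporal expression extraction and
# normalization. One illustrative pattern, not the actual Mayo rules.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def normalize_dates(text):
    """Extract date mentions and normalize them to ISO 8601 YYYY-MM-DD."""
    results = []
    for m in DATE_PATTERN.finditer(text):
        month, day, year = m.groups()
        results.append((m.group(0), "%s-%02d-%02d" % (year, int(month), int(day))))
    return results
```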
Our TIMEX3 system performed the best (F-measure of 0.900, value accuracy 0.731) among the challenge teams. The Event system produced an F-measure of 0.870, and the TLINK system an F-measure of 0.537.
Our TIMEX3 system demonstrated good capability of regular expression rules to extract and normalize time information. Event and TLINK machine learning systems required well-defined feature sets to perform well. We could also leverage expert knowledge as part of the machine learning features to further improve TLINK identification performance.
PMCID: PMC3756269  PMID: 23558168
6.  Detecting concept mentions in biomedical text using hidden Markov model: multiple concept types at once or one at a time? 
Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.
Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy.
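The difference between the two strategies can be seen in the size of the tagger's label space under a BIO encoding. The concept types shown are those of the i2b2/VA corpus; the label-set construction below is generic, not tied to any particular HMM toolkit:

```python
# Sketch of the two training strategies in terms of BIO label spaces.
def bio_labels(concept_types):
    """Build the BIO label set for a tagger covering the given types."""
    labels = ["O"]
    for t in concept_types:
        labels += ["B-" + t, "I-" + t]
    return labels

# all-types-at-once: one model whose label space covers all three types
all_at_once = bio_labels(["problem", "treatment", "test"])
# one-type-at-a-time: three separate models with three labels each
one_at_a_time = [bio_labels([t]) for t in ["problem", "treatment", "test"]]
```

In the all-types-at-once setting, the single model can exploit contrasts between types (reducing type confusion), which is one plausible reading of the results above.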
The current results suggest that detection of concept phrases can be improved by tackling multiple concept types simultaneously. This also suggests that new corpora for machine learning models should be annotated with multiple concept types. Further investigation is expected to yield insights into the underlying mechanism that achieves good performance when multiple concept types are considered.
PMCID: PMC3908466  PMID: 24438362
Natural language processing; Information storage and retrieval; Data mining; Electronic health records
7.  Identifying protein complexes with fuzzy machine learning model 
Proteome Science  2013;11(Suppl 1):S21.
Many computational approaches have been developed to detect protein complexes from protein-protein interaction (PPI) networks. However, these PPI networks are built from high-throughput experiments, and the presence of unreliable interactions in them makes this task very challenging.
In this study, we proposed a Genetic-Algorithm Fuzzy Naïve Bayes (GAFNB) filter to classify protein complexes from candidate subgraphs. It takes unreliability into consideration and tackles the presence of unreliable interactions in protein complexes. We first obtained candidate protein complexes through existing popular methods. Each candidate protein complex is represented by 29 graph features and 266 biological-property-based features. The GAFNB model is then applied to classify the candidate complexes as positive or negative.
Our evaluation indicates that protein complex identification algorithms combined with GAFNB filtering outperform the original ones. For evaluation of the GAFNB model, we also compared its performance with that of Naïve Bayes (NB). Results show that GAFNB performed better than NB, indicating that a fuzzy model is more suitable when unreliability is present.
We conclude that filtering candidate protein complexes with the GAFNB model can improve the effectiveness of protein complex identification. It is necessary to consider unreliability in this task.
PMCID: PMC3908516  PMID: 24565338
8.  Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules 
This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.
Materials and methods
The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.
The best system that uses a multi-pass sieve has an overall score of 0.836 (average of the B3, MUC, BLANC, and CEAF F scores) for the training set and 0.843 for the test set.
A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate the irregularities encountered in data, especially given an insufficient number of examples. On the other hand, a completely deterministic system can suffer a decrease in recall (sensitivity) when its rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.
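The multi-pass sieve idea, deterministic passes applied in decreasing order of precision, each refining the entity clusters built so far, can be sketched as follows. The single exact-match sieve is a hypothetical stand-in for MedCoref's actual rule passes:

```python
# Minimal sketch of the multi-pass sieve framework for coreference.
# Each sieve sees the clusters produced by the more precise passes
# before it; the exact-match rule below is illustrative only.
def exact_match_sieve(mentions, clusters):
    """Merge mentions whose lowercased text matches an existing cluster."""
    for mention in mentions:
        for cluster in clusters:
            if mention.lower() in (c.lower() for c in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters

def run_sieves(mentions, sieves):
    """Apply each sieve in order of preciseness, threading clusters through."""
    clusters = []
    for sieve in sieves:
        clusters = sieve(mentions, clusters)
    return clusters
```

A learned pronoun resolver can be dropped in as the final element of the `sieves` list, which is essentially how the rule/ML combination described above is organized.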
Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at
PMCID: PMC3422831  PMID: 22707745
Natural language processing; machine learning; information extraction; electronic medical record; coreference resolution; text mining; computational linguistics; named entity recognition; distributional semantics; relationship extraction; information storage and retrieval (text and images)
9.  Analysis of Cross-Institutional Medication Description Patterns in Clinical Narratives 
Biomedical Informatics Insights  2013;6(Suppl 1):7-16.
A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common.
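A semantic pattern with interspersed context words can be approximated by a regular expression with a non-greedy gap between attributes. This single pattern is an illustration only, far simpler than the pattern inventory the paper analyzes:

```python
import re

# Sketch of a (medication, dosage, frequency) semantic pattern that
# tolerates context words between attributes via a non-greedy gap.
MED_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+\s?mg)\b"       # medication + dosage
    r".*?\b(?P<freq>daily|b\.?i\.?d\.?|t\.?i\.?d\.?)",  # gap, then frequency
    re.IGNORECASE,
)

def extract_medication(sentence):
    """Return the matched medication attributes, or None."""
    m = MED_PATTERN.search(sentence)
    return m.groupdict() if m else None
```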
PMCID: PMC3702197  PMID: 23847423
medication extraction; electronic medical record; natural language processing
10.  Formative evaluation of the accuracy of a clinical decision support system for cervical cancer screening 
We previously developed and reported on a prototype clinical decision support system (CDSS) for cervical cancer screening. However, the system is complex as it is based on multiple guidelines and free-text processing. Therefore, the system is susceptible to failures. This report describes a formative evaluation of the system, which is a necessary step to ensure deployment readiness of the system.
Materials and methods
Care providers who are potential end-users of the CDSS were invited to provide their recommendations for a random set of patients that represented diverse decision scenarios. The recommendations of the care providers and those generated by the CDSS were compared. Mismatched recommendations were reviewed by two independent experts.
A total of 25 users participated in this study and provided recommendations for 175 cases. The CDSS had an accuracy of 87% and 12 types of CDSS errors were identified, which were mainly due to deficiencies in the system's guideline rules. When the deficiencies were rectified, the CDSS generated optimal recommendations for all failure cases, except one with incomplete documentation.
Discussion and conclusions
The crowd-sourcing approach for construction of the reference set, coupled with the expert review of mismatched recommendations, facilitated an effective evaluation and enhancement of the system by identifying decision scenarios that were missed by the system's developers. The described methodology will be useful for other researchers who seek to rapidly evaluate and enhance the deployment readiness of complex decision support systems.
PMCID: PMC3721177  PMID: 23564631
Uterine Cervical Neoplasms; Decision Support Systems, Clinical; Guideline Adherence; Validation Studies as Topic; Vaginal Smears; Crowdsourcing
11.  Workflow-based Data Reconciliation for Clinical Decision Support: Case of Colorectal Cancer Screening and Surveillance  
A major barrier for computer-based clinical decision support (CDS) is the difficulty of obtaining the patient information required for decision making. The information gap is often due to deficiencies in clinical documentation. One approach to addressing this gap is to gather and reconcile data from related documents or data sources. In this paper we consider the case of a CDS system for colorectal cancer screening and surveillance. We describe the use of workflow analysis to design data reconciliation processes. Further, we perform a quantitative analysis of the impact of these processes on system performance using a dataset of 106 patients. Results show that data reconciliation considerably improves the performance of the system. Our study demonstrates that workflow-based data reconciliation can play a vital role in designing new-generation CDS systems that are based on complex guideline models and use natural language processing (NLP) to obtain patient data.
PMCID: PMC3845748  PMID: 24303280
12.  An Information Extraction Framework for Cohort Identification Using Electronic Health Records  
Information extraction (IE), a natural language processing (NLP) task that automatically extracts structured or semi-structured information from free text, has become popular in the clinical domain for supporting automated systems at the point of care and enabling secondary use of electronic health records (EHRs) for clinical and translational research. However, a high-performance IE system can be very challenging to construct due to the complexity and dynamic nature of human language. In this paper, we report a knowledge-driven IE framework for cohort identification using EHRs, developed under the Unstructured Information Management Architecture (UIMA). A system to extract specific information can be developed by subject matter experts through knowledge engineering of the externalized knowledge resources used in the framework.
PMCID: PMC3845757  PMID: 24303255
13.  Pooling annotated corpora for clinical concept extraction 
The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, creating the annotations requires considerable expenditure and labor. A potential alternative is to reuse existing corpora from other institutions by pooling them with local corpora to train machine taggers. In this paper we investigated the latter approach by pooling corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system, MedTagger, which consists of dictionary lookup, part-of-speech (POS) tagging, and machine learning for named entity prediction and concept extraction. We hope that our current work will serve as a useful case study for facilitating reuse of annotated corpora across institutions.
We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling.
The effectiveness of pooling corpora depends on several factors, including the compatibility of annotation guidelines, the distribution of report types, and the sizes of the local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that: i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline difference partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as report type should be created during the annotation process.
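Pooling after guideline reconciliation can be sketched as below. Both the record format and the determiner-stripping rule are hypothetical examples of reconciling a span-boundary difference; real reconciliation is guideline-specific:

```python
# Toy sketch of pooling two annotated corpora after reconciling one
# guideline difference (hypothetical example: one guideline includes
# leading determiners in the annotated span, the other does not).
def reconcile(record):
    """Harmonize span boundaries by stripping a leading determiner."""
    span = record["span"]
    for det in ("the ", "a ", "an "):
        if span.lower().startswith(det):
            span = span[len(det):]
            break
    return {**record, "span": span}

def pool(local_corpus, foreign_corpus):
    """Training data = local annotations + reconciled foreign ones."""
    return local_corpus + [reconcile(r) for r in foreign_corpus]
```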
PMCID: PMC3599895  PMID: 23294871
14.  Towards a semantic lexicon for clinical natural language processing 
A semantic lexicon that associates words and phrases in text with concepts is critical for extracting and encoding clinical information in free text, and therefore for achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may provide limited coverage of concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases are distributed in a large corpus and how well the UMLS captures their semantics. A corpus-driven semantic lexicon, MedLex, has been constructed, with semantics based on the UMLS supplemented by variants mined, and usage information gathered, from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows that the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation for capturing clinical information comprehensively. The study also yields insights for developing practical NLP systems.
PMCID: PMC3540492  PMID: 23304329
15.  Using machine learning for concept extraction on clinical documents from multiple data sources 
Concept extraction is a process to identify phrases referring to concepts of interests in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.
We used BioTagger-GM to train machine learning taggers, which we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources.
As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degradation of the performance varied depending on data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training.
Our study shows how the performance of machine learning taggers is degraded when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
PMCID: PMC3168314  PMID: 21709161
Natural language processing; medical informatics; medical records systems; computerized
16.  Clinical decision support with automated text processing for cervical cancer screening 
To develop a computerized clinical decision support system (CDSS) for cervical cancer screening that can interpret free-text Papanicolaou (Pap) reports.
Materials and Methods
The CDSS was constituted by two rulebases: the free-text rulebase for interpreting Pap reports and a guideline rulebase. The free-text rulebase was developed by analyzing a corpus of 49 293 Pap reports. The guideline rulebase was constructed using national cervical cancer screening guidelines. The CDSS accesses the electronic medical record (EMR) system to generate patient-specific recommendations. For evaluation, the screening recommendations made by the CDSS for 74 patients were reviewed by a physician.
Results and Discussion
Evaluation revealed that the CDSS outputs the optimal screening recommendations for 73 out of 74 test patients and it identified two cases for gynecology referral that were missed by the physician. The CDSS aided the physician to amend recommendations in six cases. The failure case was because human papillomavirus (HPV) testing was sometimes performed separately from the Pap test and these results were reported by a laboratory system that was not queried by the CDSS. Subsequently, the CDSS was upgraded to look up the HPV results missed earlier and it generated the optimal recommendations for all 74 test cases.
Single institution and single expert study.
An accurate CDSS system could be constructed for cervical cancer screening given the standardized reporting of Pap tests and the availability of explicit guidelines. Overall, the study demonstrates that free text in the EMR can be effectively utilized through natural language processing to develop clinical decision support tools.
PMCID: PMC3422840  PMID: 22542812
Cervical; clinical decision support; clinical informatics; clinical natural language processing; computerized; controlled terminologies and vocabularies; decision support; decision support systems; humans; machine learning; medical records systems; natural language processing; ontologies; uterine cervical neoplasms
17.  Using SNOMED-CT to encode summary level data – a corpus analysis 
Extracting and encoding clinical information captured in free text with standard medical terminologies is vital for enabling secondary use of electronic medical records (EMRs) for clinical decision support, improved patient safety, and clinical/translational research. A critical portion of free text is comprised of ‘summary level’ information in the form of problem lists, diagnoses, and reasons for visit. We conducted a systematic analysis of SNOMED-CT in representing summary level information, utilizing a large collection of summary level data in the form of itemized entries. Results indicate that about 80% of the entries can be encoded with SNOMED-CT normalized phrases. When tolerating one unmapped token, 96% of the itemized entries can be encoded with SNOMED-CT concepts. The study provides a solid foundation for developing an automated system to encode summary level data using SNOMED-CT.
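The "tolerating one unmapped token" criterion can be sketched as a coverage check over an entry's tokens. The phrase list passed in is a hypothetical stand-in for SNOMED-CT normalized phrases:

```python
# Sketch of the unmapped-token tolerance criterion: an itemized entry
# counts as encodable if all of its tokens, or all but `tolerance` of
# them, are covered by normalized phrases from the terminology.
def encodable(entry, normalized_phrases, tolerance=1):
    """True if at most `tolerance` tokens of the entry are unmapped."""
    tokens = entry.lower().split()
    covered = set()
    for phrase in normalized_phrases:
        p = phrase.lower().split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] == p:
                covered.update(range(i, i + len(p)))
    return len(tokens) - len(covered) <= tolerance
```

With tolerance 0 this corresponds to the stricter "fully encodable" measure (about 80% of entries above); with tolerance 1 it corresponds to the 96% figure.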
PMCID: PMC3392059  PMID: 22779045
18.  Feasibility of pooling annotated corpora for clinical concept extraction 
The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, it is expensive to prepare annotated corpora at individual institutions, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resultant machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling the corpora is found to decrease the F1-score. We examine the annotation guidelines to identify factors behind the incompatibility of the corpora and suggest that the clinical NLP community develop a standard annotation guideline to allow compatibility of annotated corpora.
PMCID: PMC3392069  PMID: 22779047
19.  Evaluation of the Effect of Decision Support on the Efficiency of Primary Care Providers in the Outpatient Practice 
Background: Clinical decision support (CDS) for primary care has been shown to improve the delivery of preventive services. However, there is little evidence on physician efficiency gains due to CDS assistance. In this article, we report a pilot study measuring the impact of CDS on the time spent by physicians deciding on preventive services and chronic disease management. Methods: We randomly selected 30 patients from a primary care practice and assigned them to 10 physicians. The physicians were asked to perform chart review to decide on preventive services and chronic disease management for the assigned patients. The patient assignment followed a randomized crossover design, such that each patient received 2 sets of recommendations—one from a physician with CDS assistance and the other from a different physician without CDS assistance. We compared the physician recommendations made with CDS assistance against those made without it. Results: The physicians required an average of 1 minute 44 seconds when they had access to the decision support system and 5 minutes when they were unassisted. Hence the CDS assistance resulted in an estimated saving of 3 minutes 16 seconds (65%) of the physicians’ time, which was statistically significant (P < .0001). There was no statistically significant difference in the number of recommendations. Conclusion: Our findings suggest that CDS assistance significantly reduced the time spent by physicians deciding on preventive services and chronic disease management. The result needs to be confirmed by similar studies at other institutions.
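The reported saving follows directly from the two averages given in the Results:

```python
# Reproducing the reported time saving from the two averages above
# (5 minutes unassisted vs 1 minute 44 seconds with CDS).
unassisted = 5 * 60                      # 300 s without CDS
with_cds = 1 * 60 + 44                   # 104 s with CDS
saving = unassisted - with_cds           # 196 s = 3 min 16 s
pct = round(100 * saving / unassisted)   # 65% of chart-review time saved
```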
PMCID: PMC4259917  PMID: 25155103
preventive health services; clinical decision support systems; task performance and analysis; time motion study; chronic diseases
