Results 1-13 (13)
 

1.  Detecting concept mentions in biomedical text using hidden Markov model: multiple concept types at once or one at a time? 
Background
Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.
Results
Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy.
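For reference, the F-score used to compare the strategies is the harmonic mean of precision and recall. The following is a minimal sketch, assuming exact-match scoring over (start, end, type) spans. The function name and the toy spans are illustrative and are not taken from the study's evaluation scripts.

```python
def f_score(gold, predicted):
    """Exact-match F1 over collections of (start, end, concept_type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the second prediction has the right boundary but the
# wrong type (a type confusion), so it counts as an error under exact match.
gold = [(0, 2, "problem"), (5, 6, "test"), (9, 11, "treatment")]
pred = [(0, 2, "problem"), (5, 6, "treatment"), (9, 11, "treatment")]
score = f_score(gold, pred)
```

Under this scoring, both boundary errors and type confusions lower the score, which is why the abstract reports them as separate sources of improvement.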
Conclusions
The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. This also suggests that we should annotate multiple concept types when developing a new corpus for machine learning models. Further investigation is expected to provide insight into the underlying mechanism that yields good performance when multiple concept types are considered.
doi:10.1186/2041-1480-5-3
PMCID: PMC3908466  PMID: 24438362
Natural language processing; Information storage and retrieval; Data mining; Electronic health records
2.  Identifying protein complexes with fuzzy machine learning model 
Proteome Science  2013;11(Suppl 1):S21.
Background
Many computational approaches have been developed to detect protein complexes from protein-protein interaction (PPI) networks. However, these PPI networks are always built from high-throughput experiments. The presence of unreliable interactions in PPI network makes this task very challenging.
Methods
In this study, we propose a Genetic-Algorithm Fuzzy Naïve Bayes (GAFNB) filter to classify protein complexes from candidate subgraphs. It takes unreliability into consideration and addresses the presence of unreliable interactions in protein complexes. We first obtained candidate protein complexes using existing popular methods. Each candidate protein complex is represented by 29 graph features and 266 biological property-based features. The GAFNB model is then applied to classify the candidate complexes as positive or negative.
Results
Our evaluation indicates that the protein complex identification algorithms using the GAFNB model filtering outperform original ones. For evaluation of GAFNB model, we also compared the performance of GAFNB with Naïve Bayes (NB). Results show that GAFNB performed better than NB. It indicates that a fuzzy model is more suitable when unreliability is present.
Conclusions
We conclude that filtering candidate protein complexes with GAFNB model can improve the effectiveness of protein complex identification. It is necessary to consider the unreliability in this task.
doi:10.1186/1477-5956-11-S1-S21
PMCID: PMC3908516  PMID: 24565338
3.  Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules 
Objective
This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.
Materials and methods
The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.
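The ordering idea behind a multi-pass sieve can be sketched as follows. This is a simplified illustration under stated assumptions, not MedCoref's implementation: the two rules (`exact_match`, `head_match`), the mention dictionaries, and the function `run_sieves` are hypothetical, and the sketch omits the entity information that the real system gathers between passes.

```python
def run_sieves(mentions, sieves):
    """Apply each sieve in order; a sieve decides whether two mentions corefer."""
    # Start with every mention in its own cluster (simple union-find parents).
    cluster_of = {i: i for i in range(len(mentions))}

    def find(i):
        while cluster_of[i] != i:
            i = cluster_of[i]
        return i

    for sieve in sieves:  # most precise rule first
        for j in range(len(mentions)):
            # Candidate antecedents precede mention j; try the closest first.
            for i in reversed(range(j)):
                if find(i) != find(j) and sieve(mentions[i], mentions[j]):
                    cluster_of[find(j)] = find(i)
                    break

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(mentions[i]["text"])
    return sorted(clusters.values())


def exact_match(a, b):
    # High-precision rule: same markable type and identical text (case-insensitive).
    return a["type"] == b["type"] and a["text"].lower() == b["text"].lower()


def head_match(a, b):
    # Lower-precision rule: same type and same final word.
    return (a["type"] == b["type"]
            and a["text"].lower().split()[-1] == b["text"].lower().split()[-1])


mentions = [
    {"text": "chest pain", "type": "problem"},
    {"text": "aspirin", "type": "treatment"},
    {"text": "Chest pain", "type": "problem"},
    {"text": "the pain", "type": "problem"},
]
clusters = run_sieves(mentions, [exact_match, head_match])
```

Because later sieves only add links to clusters formed by earlier ones, the more precise rules take priority and looser rules fill in what remains.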
Results
The best system that uses a multi-pass sieve has an overall score of 0.836 (average of the B3, MUC, BLANC, and CEAF F scores) for the training set and 0.843 for the test set.
Discussion
A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.
Conclusion
Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.
doi:10.1136/amiajnl-2011-000766
PMCID: PMC3422831  PMID: 22707745
Natural language processing; machine learning; information extraction; electronic medical record; coreference resolution; text mining; computational linguistics; named entity recognition; distributional semantics; relationship extraction; information storage and retrieval (text and images)
4.  Analysis of Cross-Institutional Medication Description Patterns in Clinical Narratives 
Biomedical Informatics Insights  2013;6(Suppl 1):7-16.
A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common.
doi:10.4137/BII.S11634
PMCID: PMC3702197  PMID: 23847423
medication extraction; electronic medical record; natural language processing
5.  Formative evaluation of the accuracy of a clinical decision support system for cervical cancer screening 
Objectives
We previously developed and reported on a prototype clinical decision support system (CDSS) for cervical cancer screening. However, the system is complex as it is based on multiple guidelines and free-text processing. Therefore, the system is susceptible to failures. This report describes a formative evaluation of the system, which is a necessary step to ensure deployment readiness of the system.
Materials and methods
Care providers who are potential end-users of the CDSS were invited to provide their recommendations for a random set of patients that represented diverse decision scenarios. The recommendations of the care providers and those generated by the CDSS were compared. Mismatched recommendations were reviewed by two independent experts.
Results
A total of 25 users participated in this study and provided recommendations for 175 cases. The CDSS had an accuracy of 87% and 12 types of CDSS errors were identified, which were mainly due to deficiencies in the system's guideline rules. When the deficiencies were rectified, the CDSS generated optimal recommendations for all failure cases, except one with incomplete documentation.
Discussion and conclusions
The crowd-sourcing approach for construction of the reference set, coupled with the expert review of mismatched recommendations, facilitated an effective evaluation and enhancement of the system, by identifying decision scenarios that were missed by the system's developers. The described methodology will be useful for other researchers who seek rapidly to evaluate and enhance the deployment readiness of complex decision support systems.
doi:10.1136/amiajnl-2013-001613
PMCID: PMC3721177  PMID: 23564631
Uterine Cervical Neoplasms; Decision Support Systems, Clinical; Guideline Adherence; Validation Studies as Topic; Vaginal Smears; Crowdsourcing
6.  Workflow-based Data Reconciliation for Clinical Decision Support: Case of Colorectal Cancer Screening and Surveillance  
A major barrier for computer-based clinical decision support (CDS) is the difficulty in obtaining the patient information required for decision making. The information gap is often due to deficiencies in the clinical documentation. One approach to address this gap is to gather and reconcile data from related documents or data sources. In this paper we consider the case of a CDS system for colorectal cancer screening and surveillance. We describe the use of workflow analysis to design data reconciliation processes. Further, we perform a quantitative analysis of the impact of these processes on system performance using a dataset of 106 patients. Results show that data reconciliation considerably improves the performance of the system. Our study demonstrates that workflow-based data reconciliation can play a vital role in designing new-generation CDS systems that are based on complex guideline models and use natural language processing (NLP) to obtain patient data.
PMCID: PMC3845748  PMID: 24303280
7.  An Information Extraction Framework for Cohort Identification Using Electronic Health Records  
Information extraction (IE), a natural language processing (NLP) task that automatically extracts structured or semi-structured information from free text, has become popular in the clinical domain for supporting automated systems at point-of-care and enabling secondary use of electronic health records (EHRs) for clinical and translational research. However, a high performance IE system can be very challenging to construct due to the complexity and dynamic nature of human language. In this paper, we report an IE framework for cohort identification using EHRs that is a knowledge-driven framework developed under the Unstructured Information Management Architecture (UIMA). A system to extract specific information can be developed by subject matter experts through expert knowledge engineering of the externalized knowledge resources used in the framework.
PMCID: PMC3845757  PMID: 24303255
8.  Pooling annotated corpora for clinical concept extraction 
Background
The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester, to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system MedTagger that consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions.
Results
We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling.
Conclusions
The effectiveness of pooling corpora depends on several factors, which include compatibility of annotation guidelines, distribution of report types, and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that: i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as type of the report should be created during the annotation process.
doi:10.1186/2041-1480-4-3
PMCID: PMC3599895  PMID: 23294871
9.  Towards a semantic lexicon for clinical natural language processing 
A semantic lexicon that associates words and phrases in text with concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases are distributed in a large corpus and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation for capturing clinical information comprehensively. The study also yields some insights into developing practical NLP systems.
PMCID: PMC3540492  PMID: 23304329
10.  Using machine learning for concept extraction on clinical documents from multiple data sources 
Objective
Concept extraction is a process to identify phrases referring to concepts of interests in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.
Methods
We used BioTagger-GM to train machine learning taggers, which we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources.
Results
As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degradation of the performance varied depending on data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training.
Conclusion
Our study shows how the performance of machine learning taggers is degraded when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
doi:10.1136/amiajnl-2011-000155
PMCID: PMC3168314  PMID: 21709161
Natural language processing; medical informatics; medical records systems, computerized
11.  Clinical decision support with automated text processing for cervical cancer screening 
Objective
To develop a computerized clinical decision support system (CDSS) for cervical cancer screening that can interpret free-text Papanicolaou (Pap) reports.
Materials and Methods
The CDSS was constituted by two rulebases: the free-text rulebase for interpreting Pap reports and a guideline rulebase. The free-text rulebase was developed by analyzing a corpus of 49 293 Pap reports. The guideline rulebase was constructed using national cervical cancer screening guidelines. The CDSS accesses the electronic medical record (EMR) system to generate patient-specific recommendations. For evaluation, the screening recommendations made by the CDSS for 74 patients were reviewed by a physician.
Results and Discussion
Evaluation revealed that the CDSS output the optimal screening recommendations for 73 out of 74 test patients, and it identified two cases for gynecology referral that were missed by the physician. The CDSS helped the physician amend recommendations in six cases. The failure case occurred because human papillomavirus (HPV) testing was sometimes performed separately from the Pap test and these results were reported by a laboratory system that was not queried by the CDSS. Subsequently, the CDSS was upgraded to look up the HPV results missed earlier and it generated the optimal recommendations for all 74 test cases.
Limitations
Single institution and single expert study.
Conclusion
An accurate CDSS system could be constructed for cervical cancer screening given the standardized reporting of Pap tests and the availability of explicit guidelines. Overall, the study demonstrates that free text in the EMR can be effectively utilized through natural language processing to develop clinical decision support tools.
doi:10.1136/amiajnl-2012-000820
PMCID: PMC3422840  PMID: 22542812
Cervical; clinical decision support; clinical informatics; clinical natural language processing; computerized; controlled terminologies and vocabularies; decision support; decision support systems; humans; machine learning; medical records systems; natural language processing; ontologies; uterine cervical neoplasms
12.  Using SNOMED-CT to encode summary level data – a corpus analysis 
Extracting and encoding clinical information captured in free text with standard medical terminologies is vital to enable secondary use of electronic medical records (EMRs) for clinical decision support, improved patient safety, and clinical/translational research. A critical portion of free text is comprised of ‘summary level’ information in the form of problem lists, diagnoses and reasons of visit. We conducted a systematic analysis of SNOMED-CT in representing the summary level information utilizing a large collection of summary level data in the form of itemized entries. Results indicate that about 80% of the entries can be encoded with SNOMED-CT normalized phrases. When tolerating one unmapped token, 96% of the itemized entries can be encoded with SNOMED-CT concepts. The study provides a solid foundation for developing an automated system to encode summary level data using SNOMED-CT.
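The idea of encoding an entry while tolerating one unmapped token can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not describe the actual mapping procedure, the order-insensitive `normalize` function stands in for whatever phrase normalization the study used, and the three-phrase lexicon is a toy stand-in for SNOMED-CT.

```python
def normalize(phrase):
    # Assumed normalization: lowercase and ignore word order.
    return " ".join(sorted(phrase.lower().split()))


def encodable(entry, lexicon, tolerate=0):
    """True if the entry maps to a lexicon phrase, optionally dropping one token."""
    if normalize(entry) in lexicon:
        return True
    if tolerate >= 1:
        tokens = entry.lower().split()
        for k in range(len(tokens)):
            rest = tokens[:k] + tokens[k + 1:]
            if rest and normalize(" ".join(rest)) in lexicon:
                return True
    return False


# Toy lexicon and itemized entries (hypothetical, not from the study's data).
lexicon = {normalize(p) for p in ["diabetes mellitus", "chest pain", "hypertension"]}
entries = ["Chest pain", "mellitus diabetes", "severe hypertension", "follow up visit"]

strict = sum(encodable(e, lexicon) for e in entries) / len(entries)
relaxed = sum(encodable(e, lexicon, tolerate=1) for e in entries) / len(entries)
```

In this toy data the coverage rises from `strict` to `relaxed` because "severe hypertension" maps once the modifier is dropped, mirroring the direction of the 80% to 96% change reported above.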
PMCID: PMC3392059  PMID: 22779045
13.  Feasibility of pooling annotated corpora for clinical concept extraction 
Availability of annotated corpora has facilitated application of machine learning algorithms to concept extraction from clinical notes. However, it is expensive to prepare annotated corpora in individual institutions, and pooling of annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling of corpora from two different sources can improve performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, pooling of corpora is found to decrease the F1-score. We examine the annotation guidelines to identify factors for incompatibility of the corpora and suggest development of a standard annotation guideline by the clinical NLP community to allow compatibility of annotated corpora.
PMCID: PMC3392069  PMID: 22779047
