Results 1-12 (12)

1.  Combining an Expert-Based Medical Entity Recognizer to a Machine-Learning System: Methods and a Case Study 
Biomedical Informatics Insights  2013;6(Suppl 1):51-62.
Medical entity recognition is generally performed by data-driven methods based on supervised machine learning. Expert-based systems, where linguistic and domain expertise are directly provided to the system, are often combined with data-driven systems. We present here a case study where an existing expert-based medical entity recognition system, Ogmios, is combined with a data-driven system, Caramba, based on a linear-chain Conditional Random Field (CRF) classifier. Our case study specifically highlights the risk of overfitting incurred by an expert-based system. We observe that it prevents the combination of the 2 systems from obtaining improvements in precision, recall, or F-measure, and analyze the underlying mechanisms through a post-hoc feature-level analysis. Wrapping the expert-based system alone as attributes input to a CRF classifier does boost its F-measure from 0.603 to 0.710, bringing it on par with the data-driven system. The generalization of this method remains to be further investigated.
PMCID: PMC3776026  PMID: 24052691
natural language processing; information extraction; medical records; machine learning; hybrid methods; overfitting
2.  Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features 
The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP.
PMCID: PMC3728208  PMID: 23926434
text classification; text categorization; database; genome-wide association studies; GWAS; natural language processing
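The χ² feature selection step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a binary label (1 for a heart, lung, or blood study, 0 otherwise) and word n-grams up to bigrams; the function names `ngrams` and `chi2_scores` and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter

def ngrams(text, n_max=2):
    """Return the set of word n-grams (1..n_max) found in a text."""
    tokens = text.lower().split()
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

def chi2_scores(docs, labels, n_max=2):
    """Score each n-gram with the chi-square statistic of its 2x2
    contingency table (feature present/absent vs. label 1/0)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        (pos_df if y else neg_df).update(ngrams(text, n_max))
    scores = {}
    for f in set(pos_df) | set(neg_df):
        a = pos_df[f]      # label-1 documents containing f
        b = neg_df[f]      # label-0 documents containing f
        c = n_pos - a      # label-1 documents without f
        d = n_neg - b      # label-0 documents without f
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[f] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Toy data, invented for illustration only.
docs = ["heart failure cohort", "lung function cohort",
        "diabetes retina study", "cancer tumor study"]
labels = [1, 1, 0, 0]
scores = chi2_scores(docs, labels)
top = sorted(scores, key=scores.get, reverse=True)[:5]
```

Features with the highest scores are the ones whose presence is most strongly associated with the label; a classifier would then be trained on the top-ranked subset.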
3.  Using Conversation Topics for Predicting Therapy Outcomes in Schizophrenia 
Biomedical Informatics Insights  2013;6(Suppl 1):39-50.
Previous research shows that aspects of doctor-patient communication in therapy can predict patient symptoms, satisfaction and future adherence to treatment (a significant problem with conditions such as schizophrenia). However, automatic prediction has so far shown success only when based on low-level lexical features, and it is unclear how well these can generalize to new data, or whether their effectiveness is due to their capturing aspects of style, structure or content. Here, we examine the use of topic as a higher-level measure of content, more likely to generalize and to have more explanatory power. Investigations show that while topics predict some important factors such as patient satisfaction and ratings of therapy quality, they lack the full predictive power of lower-level features. For some factors, unsupervised methods produce models comparable to manual annotation.
PMCID: PMC3740209  PMID: 23943658
topic modelling; LDA; doctor-patient communication
4.  Towards Converting Clinical Phrases into SNOMED CT Expressions 
Biomedical Informatics Insights  2013;6(Suppl 1):29-37.
Converting information contained in natural language clinical text into computer-amenable structured representations can automate many clinical applications. As a step towards that goal, we present a method which could help in converting novel clinical phrases into new expressions in SNOMED CT, a standard clinical terminology. Since expressions in SNOMED CT are written in terms of their relations with other SNOMED CT concepts, we formulate the important task of identifying relations between clinical phrases and SNOMED CT concepts. We present a machine learning approach for this task and using the dataset of existing SNOMED CT relations we show that it performs well.
PMCID: PMC3702194  PMID: 23847425
SNOMED CT; clinical phrases; relation identification; natural language processing
5.  Using Empirically Constructed Lexical Resources for Named Entity Recognition 
Biomedical Informatics Insights  2013;6(Suppl 1):17-27.
Because of privacy concerns and the expense involved in creating an annotated corpus, the existing small annotated corpora might not have sufficient examples for learning to statistically extract all the named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when using machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM)-regions, and term clustering, all of which are considered distributional semantic features. Adding the n-nearest words feature to a baseline system resulted in a greater increase in F-score than adding a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons, but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes.
PMCID: PMC3702195  PMID: 23847424
natural language processing; distributional semantics; concept extraction; named entity recognition; empirical lexical resources
6.  Computational Semantics in Clinical Text 
Biomedical Informatics Insights  2013;6(Suppl 1):3-5.
PMCID: PMC3702196  PMID: 23847422
7.  Analysis of Cross-Institutional Medication Description Patterns in Clinical Narratives 
Biomedical Informatics Insights  2013;6(Suppl 1):7-16.
A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common.
PMCID: PMC3702197  PMID: 23847423
medication extraction; electronic medical record; natural language processing
8.  Computational Semantics in Clinical Text Supplement 
Biomedical Informatics Insights  2013;6(Suppl 1):1-2.
PMCID: PMC3702198  PMID: 23847421
9.  Using n-Grams for Syndromic Surveillance in a Turkish Emergency Department Without English Translation: A Feasibility Study 
Syndromic surveillance is designed for early detection of disease outbreaks. An important data source for syndromic surveillance is free-text chief complaints (CCs), which are generally recorded in the local language. For automated syndromic surveillance, CCs must be classified into predefined syndromic categories. The n-gram classifier is created by using text fragments to measure associations between CCs and a syndromic grouping of ICD codes.
The objective was to create a Turkish n-gram CC classifier for the respiratory syndrome and then compare daily volumes between the n-gram CC classifier and a respiratory ICD-10 code grouping on a test set of data.
The design was a feasibility study based on retrospective cohort data. The setting was a university hospital emergency department (ED) in Turkey. Included were all ED visits in the 2002 database of this hospital. Two of the authors created a respiratory grouping of International Classification of Diseases, 10th Revision (ICD-10-CM) codes by consensus, chosen to be similar to a standard respiratory (RESP) grouping of ICD codes created by the Electronic Surveillance System for Early Notification of Community-based Epidemics (ESSENCE), a project of the Centers for Disease Control and Prevention. An n-gram method adapted from AT&T Labs’ technologies was applied to the first 10 months of data as a training set to create a Turkish CC RESP classifier. The classifier was then tested on the subsequent 2 months of visits to generate a time series graph and determine the correlation between daily volumes measured by the CC classifier and those measured by the RESP ICD-10 grouping.
The Turkish ED database contained 30,157 visits. The correlation (R²) of n-gram versus ICD-10 for the test set was 0.78.
The n-gram method automatically created a CC RESP classifier of the Turkish CCs that performed similarly to the ICD-10 RESP grouping. The n-gram technique has the advantage of systematic, consistent, and rapid deployment as well as language independence.
PMCID: PMC3653813  PMID: 23700370
disease outbreaks; epidemiology; public health; surveillance; n-gram
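The general idea in the abstract above (text fragments associated with a syndrome grouping, then a correlation between two daily count series) can be sketched as follows. This is only an illustration under stated assumptions, not the AT&T Labs method used in the study: it assumes character 4-grams, estimates for each fragment the share of training complaints that carried a respiratory code, and takes R² to be the squared Pearson correlation. All function names and the toy complaints are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Set of character n-grams of a complaint, padded with spaces."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def train(complaints, is_resp, n=4):
    """For each n-gram, estimate the share of training complaints
    containing it that had a respiratory (RESP) code."""
    seen, hits = Counter(), Counter()
    for cc, y in zip(complaints, is_resp):
        grams = char_ngrams(cc, n)
        seen.update(grams)
        if y:
            hits.update(grams)
    return {g: hits[g] / seen[g] for g in seen}

def score(model, cc, n=4):
    """Mean RESP share over the n-grams of cc seen in training
    (0.0 when none of its n-grams were seen)."""
    known = [model[g] for g in char_ngrams(cc, n) if g in model]
    return sum(known) / len(known) if known else 0.0

def r_squared(xs, ys):
    """Squared Pearson correlation between two daily count series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy) if sxx and syy else 0.0

# Toy Turkish complaints, invented for illustration only.
train_ccs = ["öksürük", "nefes darlığı", "karın ağrısı", "baş ağrısı"]
train_resp = [True, True, False, False]
model = train(train_ccs, train_resp)
resp_score = score(model, "öksürük ve ateş")
```

A complaint would be counted as respiratory when its score exceeds a chosen threshold, and the resulting daily counts could then be compared with the ICD-10 grouping's daily counts using `r_squared`. Because the fragments are learned directly from the local-language text, no translation step is involved.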
10.  Recognizing Scientific Artifacts in Biomedical Literature 
Today’s search engines and digital libraries offer little or no support for discovering those scientific artifacts (hypotheses, supporting/contradicting statements, or findings) that form the core of scientific written communication. Consequently, we currently have no means of identifying central themes within a domain or of detecting gaps between accepted knowledge and newly emerging knowledge as a means for tracking the evolution of hypotheses from incipient phases to maturity or decline. We present a hybrid Machine Learning approach using an ensemble of four classifiers, for recognizing scientific artifacts (ie, hypotheses, background, motivation, objectives, and findings) within biomedical research publications, as a precursory step to the general goal of automatically creating argumentative discourse networks that span across multiple publications. The performance achieved by the classifiers ranges from 15.30% to 78.39%, subject to the target class. The set of features used for classification has led to promising results. Furthermore, their use strictly in a local, publication scope, ie, without aggregating corpus-wide statistics, increases the versatility of the ensemble of classifiers and enables its direct applicability without the necessity of re-training.
PMCID: PMC3623603  PMID: 23645987
scientific artifacts; conceptualization zones; information extraction
11.  Decomposing Phenotype Descriptions for the Human Skeletal Phenome 
Over the last few years, a significant amount of research has been performed on ontology-based formalization of phenotype descriptions. The intrinsic value and knowledge captured within such descriptions can only be expressed by taking advantage of their inner structure, which implicitly combines qualities and anatomical entities. We present a meta-model (the Phenotype Fragment Ontology) and a processing pipeline that together enable the automatic decomposition and conceptualization of phenotype descriptions for the human skeletal phenome. We use this approach to showcase the usefulness of the generic concept of phenotype decomposition by performing an experimental study on all skeletal phenotype concepts defined in the Human Phenotype Ontology.
PMCID: PMC3572876  PMID: 23440304
human skeletal phenome; phenotype decomposition; phenotype segmentation; ontologies
12.  What’s In a Note: Construction of a Suicide Note Corpus 
This paper reports on the results of an initiative to create and annotate a corpus of suicide notes that can be used for machine learning. Ultimately, the corpus included 1,278 notes written by people who died by suicide. Each note was reviewed by at least three annotators who mapped words or sentences to a schema of emotions. This corpus has already been used for extensive scientific research.
PMCID: PMC3500150  PMID: 23170067
natural language processing; computational linguistics; corpus; suicide