The benefits of using ontology subsets versus full ontologies are well-documented for many applications. In this study, we propose an efficient subset extraction approach for a domain using a biomedical ontology repository with mappings, a cross-ontology, and a source subset from a related domain. As a case study, we extracted a subset of drugs from RxNorm using the UMLS Metathesaurus, the NDF-RT cross-ontology, and the CORE problem list subset of SNOMED CT. The extracted subset, which we termed RxNorm/CORE, was 4% of the size of the full RxNorm (0.4% when considering ingredients only). For evaluation, we used CORE and RxNorm/CORE as thesauri for the annotation of clinical documents and compared their performance to that of their respective full ontologies (i.e., SNOMED CT and RxNorm). The wide range in recall of both CORE (29–69%) and RxNorm/CORE (21–35%) suggests that more quantitative research is needed to assess the benefits of using ontology subsets as thesauri in annotation applications. Our approach to subset extraction, however, opens a door to creating other types of clinically useful domain-specific subsets and offers an alternative in scenarios where well-established subset extraction techniques run into difficulties or cannot be applied.
Ontologies; SNOMED CT; RxNorm; NDF-RT; UMLS; Medical records; Annotation
Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. However, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora.
This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.
The public release of the NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks.
Disease name recognition; Named entity recognition; Disease name normalization; Corpus annotation; Disease name corpus
Polysemy is a frequent issue in biomedical terminologies. In the Unified
Medical Language System (UMLS), polysemous terms are either represented as several
independent concepts, or clustered into a single, multiply-categorized concept. The
objective of this study is to analyze polysemous concepts in the UMLS through their
categorization and hierarchical relations for auditing purposes.
We used the association of a concept with multiple Semantic Groups (SGs) as a
surrogate for polysemy. We first extracted multi-SG (MSG) concepts from the UMLS
Metathesaurus and characterized them in terms of the combinations of SGs with which they
are associated. We then clustered MSG concepts in order to identify major types of
polysemy. We also analyzed the inheritance of SGs in MSG concepts. Finally, we manually
reviewed the categorization of the MSG concepts for auditing purposes.
The 1208 MSG concepts in the Metathesaurus are associated with 30 distinct
pairs of SGs. We created 75 semantically homogeneous clusters of MSG concepts, and 276
MSG concepts could not be clustered for lack of hierarchical relations. The clusters
were characterized by the most frequent pairs of semantic types of their constituent MSG
concepts. MSG concepts exhibit limited semantic compatibility with their parent and
child concepts. A large majority of MSG concepts (92%) are adequately categorized.
Examples of miscategorized concepts are presented.
This work is a systematic analysis and manual review of all concepts
categorized by multiple SGs in the UMLS. The correctly-categorized MSG concepts do
reflect polysemy in the UMLS Metathesaurus. The analysis of inheritance of SGs proved
useful for auditing concept categorization in the UMLS.
Biomedical terminologies; Auditing methods; Unified Medical Language System (UMLS); Polysemy; Semantic categorization
Electronic health records (EHRs) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient’s clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms requires time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using Inductive Logic Programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping.
Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance.
We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated. Nine phenotypic models for each ML approach were evaluated, resulting in better overall model performance in AUROC using ILP when compared to PART (p=0.039), J48 (p=0.003) and JRIP (p=0.003).
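The two-sided sign test used to compare approaches across phenotypes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the win/loss counts in the usage line are hypothetical and are not reported in the abstract.

```python
from math import comb

def sign_test_two_sided(wins: int, losses: int) -> float:
    """Two-sided sign test p-value under H0: P(win) = 0.5.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # One-tail probability of a split at least this lopsided, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: one approach has the higher AUROC on 8 of 9 phenotypes.
p_value = sign_test_two_sided(8, 1)
print(round(p_value, 3))  # 0.039
```

With only nine paired comparisons, the test can reach significance at the 0.05 level only when one approach wins at least eight of them.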
ILP has the potential to improve phenotyping by independently delivering clinically expert interpretable rules for phenotype definitions, or intuitive phenotypes to assist experts.
Relational learning using ILP offers a viable approach to EHR-driven phenotyping.
Machine learning; Electronic health record; Inductive logic programming; Phenotyping; Relational learning
Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines due to difficulties of accurate parsing. This paper proposes a novel methodology to derive a semantic index for clinical eligibility documents based on a controlled vocabulary of frequent tags, which are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically when the repository increases. We proposed to apply the semantic index to filter clinical trial search results and we concluded that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text leads to an effective semantic index for the clinical eligibility documents and promotes their computational reuse.
Information Storage and Retrieval; Clinical Trials; Tags; Information Filtering; Eligibility Criteria; Controlled Vocabulary
Information overload is a significant problem facing online clinical trial searchers. We present eTACTS, a novel interactive retrieval framework using common eligibility tags to dynamically filter clinical trial search results.
Materials and Methods
eTACTS mines frequent eligibility tags from free-text clinical trial eligibility criteria and uses these tags for trial indexing. After an initial search, eTACTS presents to the user a tag cloud representing the current results. When the user selects a tag, eTACTS retains only those trials containing that tag in their eligibility criteria and generates a new cloud based on tag frequency and co-occurrences in the remaining trials. The user can then select a new tag or unselect a previous tag. The process iterates until a manageable number of trials is returned. We evaluated eTACTS in terms of filtering efficiency, diversity of the search results, and user eligibility to the filtered trials using both qualitative and quantitative methods.
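The select-and-filter loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the trial index and tag names are invented toy data, and the cloud is ranked by tag frequency only, whereas eTACTS also takes tag co-occurrences into account.

```python
from collections import Counter

# Hypothetical toy index: trial ID -> set of eligibility tags (not real data).
TRIAL_TAGS = {
    "T1": {"diabetes", "adult", "insulin"},
    "T2": {"diabetes", "adult", "pregnancy"},
    "T3": {"diabetes", "insulin"},
    "T4": {"hypertension", "adult"},
}

def filter_trials(index, selected):
    """Keep only trials whose tag set contains every selected tag."""
    return {tid: tags for tid, tags in index.items() if selected <= tags}

def tag_cloud(index, selected):
    """Count how many remaining trials carry each not-yet-selected tag."""
    counts = Counter(tag for tags in index.values() for tag in tags - selected)
    return counts.most_common()

selected = {"diabetes"}
remaining = filter_trials(TRIAL_TAGS, selected)   # trials T1, T2 and T3
cloud = tag_cloud(remaining, selected)            # tags to offer next
```

Unselecting a tag amounts to removing it from `selected` and filtering the full index again, so each step of the iteration is stateless.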
eTACTS (1) rapidly reduced search results from over a thousand trials to ten; (2) highlighted trials that are generally not top-ranked by conventional search engines; and (3) retrieved a greater number of suitable trials than existing search engines.
eTACTS enables intuitive clinical trial searches by indexing eligibility criteria with effective tags. User evaluation was limited to one case study and a small group of evaluators due to the long duration of the experiment. Although a larger-scale evaluation could be conducted, this feasibility study demonstrated significant advantages of eTACTS over existing clinical trial search engines.
A dynamic eligibility tag cloud can potentially enhance state-of-the-art clinical trial search engines by allowing intuitive and efficient filtering of the search result space.
Information Storage and Retrieval; Clinical Trials; Dynamic Information Filtering; Interactive Information Retrieval; Tag Cloud; Association Rules; Eligibility Criteria
We describe a domain-independent methodology to extend SemRep coverage beyond the biomedical domain. SemRep, a natural language processing application originally designed for biomedical texts, uses the knowledge sources provided by the Unified Medical Language System (UMLS©). Ontological and terminological extensions to the system are needed in order to support other areas of knowledge. We extended SemRep's application by developing a semantic representation of a previously unsupported domain. This was achieved by adapting well-known ontology engineering phases and integrating them with the UMLS knowledge sources on which SemRep crucially depends. While the process to extend SemRep coverage has been successfully applied in earlier projects, this paper presents in detail the stepwise approach we followed and the mechanisms implemented. A case study in the field of medical informatics illustrates how the ontology engineering phases have been adapted for optimal integration with the UMLS. We provide qualitative and quantitative results, which indicate the validity and usefulness of our methodology.
Natural Language Processing Application; Domain-Independent Ontology Development Methodology; Semantic Predications; UMLS Knowledge Sources
The role of social media in biomedical knowledge mining, including clinical, medical and healthcare informatics, prescription drug abuse epidemiology and drug pharmacology, has become increasingly significant in recent years. Social media offers opportunities for people to share opinions and experiences freely in online communities, which may contribute information beyond the knowledge of domain professionals. This paper describes the development of a novel Semantic Web platform called PREDOSE (PREscription Drug abuse Online Surveillance and Epidemiology), which is designed to facilitate the epidemiologic study of prescription (and related) drug abuse practices using social media. PREDOSE uses web forum posts and domain knowledge, modeled in a manually created Drug Abuse Ontology (DAO) (pronounced dow), to facilitate the extraction of semantic information from User Generated Content (UGC). A combination of lexical, pattern-based and semantics-based techniques is used together with the domain knowledge to extract fine-grained semantic information from UGC. In a previous study, PREDOSE was used to obtain the datasets from which new knowledge in drug abuse research was derived. Here, we report on various platform enhancements, including an updated DAO, new components for relationship and triple extraction, and tools for content analysis, trend detection and emerging patterns exploration, which enhance the capabilities of the PREDOSE platform. Given these enhancements, PREDOSE is now more equipped to impact drug abuse research by alleviating traditional labor-intensive content analysis tasks.
Using custom web crawlers that scrape UGC from publicly available web forums, PREDOSE first automates the collection of web-based social media content for subsequent semantic annotation. The annotation scheme is modeled in the DAO, and includes domain specific knowledge such as prescription (and related) drugs, methods of preparation, side effects, routes of administration, etc. The DAO is also used to help recognize three types of data, namely: 1) entities, 2) relationships and 3) triples. PREDOSE then uses a combination of lexical and semantic-based techniques to extract entities and relationships from the scraped content, and a top-down approach for triple extraction that uses patterns expressed in the DAO. In addition, PREDOSE uses publicly available lexicons to identify initial sentiment expressions in text, and then a probabilistic optimization algorithm (from related research) to extract the final sentiment expressions. Together, these techniques enable the capture of fine-grained semantic information from UGC, and querying, search, trend analysis and overall content analysis of social media related to prescription drug abuse. Moreover, extracted data are also made available to domain experts for the creation of training and test sets for use in evaluation and refinements in information extraction techniques.
A recent evaluation of the information extraction techniques applied in the PREDOSE platform indicates 85% precision and 72% recall in entity identification, on a manually created gold standard dataset. In another study, PREDOSE achieved 36% precision in relationship identification and 33% precision in triple extraction, through manual evaluation by domain experts. Given the complexity of the relationship and triple extraction tasks and the abstruse nature of social media texts, we interpret these as favorable initial results. Extracted semantic information is currently in use in an online discovery support system, by prescription drug abuse researchers at the Center for Interventions, Treatment and Addictions Research (CITAR) at Wright State University.
A comprehensive platform for entity, relationship, triple and sentiment extraction from such abstruse texts has never been developed for drug abuse research. PREDOSE has already demonstrated the importance of mining social media by providing data from which new findings in drug abuse research were uncovered. Given the recent platform enhancements, including the refined DAO, components for relationship and triple extraction, and tools for content, trend and emerging pattern analysis, it is expected that PREDOSE will play a significant role in advancing drug abuse epidemiology in future.
Entity Identification; Relationship Extraction; Triple Extraction; Sentiment Extraction; Semantic Web; Drug Abuse Ontology; Prescription Drug Abuse; Epidemiology
Over two decades of research has been conducted using mobile devices for health-related behaviors, yet many of these studies lack rigor. There are few evaluation frameworks for assessing the usability of mHealth, which is critical as the use of this technology proliferates. As the development of interventions using mobile technology increases, future work in this domain necessitates the use of a rigorous usability evaluation framework.
We used two exemplars to assess the appropriateness of the Health IT Usability Evaluation Model (Health-ITUEM) for evaluating the usability of mHealth technology. In the first exemplar, we conducted 6 focus group sessions to explore adolescents’ use of mobile technology for meeting their health information needs. In the second exemplar, we conducted 4 focus group sessions following an Ecological Momentary Assessment study in which 60 adolescents were given a smartphone with pre-installed health-related applications (apps).
We coded the focus group data using the 9 concepts of the Health-ITUEM: Error prevention, Completeness, Memorability, Information needs, Flexibility/Customizability, Learnability, Performance speed, Competency, Other outcomes. To develop a finer granularity of analysis, the nine concepts were broken into positive, negative, and neutral codes. A total of 27 codes were created. Two raters (R1 & R2) initially coded all text and a third rater (R3) reconciled coding discordance between raters R1 and R2.
A total of 133 codes were applied to Exemplar 1. In Exemplar 2 there were a total of 286 codes applied to 195 excerpts. Performance speed, Other outcomes, and Information needs were among the most frequently occurring codes.
Our two exemplars demonstrated the appropriateness and usefulness of the Health-ITUEM in evaluating mobile health technology. Further assessment of this framework with other study populations should consider whether Memorability and Error prevention are necessary to include when evaluating mHealth technology.
Usability; evaluation framework; mobile health; Health-ITUEM
Temporal information in clinical narratives plays an important role in patients’ diagnosis, treatment and prognosis. In order to represent narrative information accurately, medical natural language processing (MLP) systems need to correctly identify and interpret temporal information. To promote research in this area, the Informatics for Integrating Biology and the Bedside (i2b2) project developed a temporally annotated corpus of clinical narratives. This corpus contains 310 de-identified discharge summaries, with annotations of clinical events, temporal expressions and temporal relations. This paper describes the process followed for the development of this corpus and discusses annotation guideline development, annotation methodology, and corpus quality.
Natural Language Processing; Temporal Reasoning; Medical Informatics; Corpus Building; Annotation
We address the TLINK track of the 2012 i2b2 challenge on temporal relations. Unlike other approaches to this task, we (1) employ sophisticated linguistic knowledge derived from semantic and discourse relations, rather than focus on morpho-syntactic knowledge; and (2) leverage a novel combination of rule-based and learning-based approaches, rather than rely solely on one or the other. Experiments show that our knowledge-rich, hybrid approach yields an F-score of 69.3, which is the best result reported to date on this dataset.
temporal relations; semantic relations; discourse relations; hybrid approaches
In this article, we evaluate a knowledge-based word sense disambiguation (WSD) method that determines the intended concept associated with an ambiguous word in biomedical text using semantic similarity and relatedness measures. These measures quantify the degree of similarity or relatedness between concepts in the Unified Medical Language System (UMLS). The objective of this work is to develop a method that can disambiguate terms in biomedical text by exploiting similarity and relatedness information extracted from biomedical resources and to evaluate the efficacy of these measures on WSD.
We evaluate our method on a biomedical dataset (MSH-WSD) that contains 203 ambiguous terms and acronyms.
We show that information content-based measures derived from either a corpus or taxonomy obtain a higher disambiguation accuracy than path-based measures or relatedness measures on the MSH-WSD dataset.
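To illustrate how an information content-based measure can drive disambiguation, here is a minimal sketch. It assumes a Resnik-style measure (the information content of the most informative common ancestor); the abstract does not name the specific measures evaluated, and the taxonomy, probabilities and function names below are invented for illustration and are not UMLS data.

```python
import math

# Hypothetical toy taxonomy (child -> parent); "root" has no parent.
PARENT = {
    "common_cold": "disease",
    "influenza": "disease",
    "cold_temperature": "temperature",
    "disease": "root",
    "temperature": "root",
}
# Hypothetical p(c): probability of meeting concept c or any of its descendants.
PROB = {"root": 1.0, "disease": 0.5, "temperature": 0.5,
        "common_cold": 0.2, "influenza": 0.3, "cold_temperature": 0.4}

def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path

def information_content(concept):
    """IC(c) = -log p(c): rarer (more specific) concepts carry more information."""
    return -math.log(PROB[concept])

def resnik(c1, c2):
    """Similarity = IC of the most informative ancestor shared by both concepts."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)

def disambiguate(candidate_senses, context_concepts):
    """Pick the sense with the highest summed similarity to the context concepts."""
    return max(candidate_senses,
               key=lambda s: sum(resnik(s, c) for c in context_concepts))

# The ambiguous word "cold" appearing near a mention of influenza.
best = disambiguate(["common_cold", "cold_temperature"], ["influenza"])
print(best)  # common_cold
```

The probabilities here stand in for counts that would come from either a corpus or the taxonomy structure, the two sources of information content mentioned above.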
The WSD system is open source and freely available from http://search.cpan.org/dist/UMLS-SenseRelate/. The MSH-WSD dataset is available from the National Library of Medicine http://wsd.nlm.nih.gov.
Natural Language Processing; NLP; Word Sense Disambiguation; WSD; semantic similarity and relatedness; biomedical documents
Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervised approach to extracting named entities from biomedical text. We describe a stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data. A noun phrase chunker followed by a filter based on inverse document frequency extracts candidate entities from free text. Classification of candidate entities into categories of interest is carried out by leveraging principles from distributional semantics. Experiments show that our system, especially the entity classification step, yields competitive results on two popular biomedical datasets of clinical notes and biological literature, and outperforms a baseline dictionary match approach. Detailed error analysis provides a road map for future work.
Natural Language Processing; Named entity recognition; Distributional Semantics; UMLS; Chunking
Patients increasingly visit online health communities to get help on managing health. The large scale of these online communities makes it impossible for the moderators to engage in all conversations; yet, some conversations need their expertise. Our work applies low-cost text classification methods to a new problem: determining whether a thread in an online health forum needs moderators’ help.
We employed a binary classifier on WebMD’s online diabetes community data. To train the classifier, we considered three feature types: (1) word unigrams, (2) sentiment analysis features, and (3) thread length. We applied feature selection methods based on χ² statistics and undersampling to account for unbalanced data. We then performed a qualitative error analysis to investigate the appropriateness of the gold standard.
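The two preprocessing steps named above, chi-square feature scoring and undersampling, can be sketched as follows. This is an illustrative outline rather than the study's pipeline: the function names, the counts in the usage lines and the fixed random seed are assumptions.

```python
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 term/class contingency table.

    a = positive threads containing the term, b = negative threads containing it,
    c = positive threads without the term,    d = negative threads without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def undersample(examples, labels, seed=0):
    """Randomly drop majority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    k = min(len(pos), len(neg))
    return [(x, 1) for x in rng.sample(pos, k)] + [(x, 0) for x in rng.sample(neg, k)]

# Hypothetical counts for one unigram across 100 threads.
score = chi2_2x2(30, 10, 20, 40)

# Hypothetical unbalanced data: 2 threads needing help, 8 not needing it.
balanced = undersample(list(range(10)), [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
```

Features would then be ranked by `score` and only the top-scoring ones kept; the cut-off is a tuning choice that the abstract does not specify.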
Using sentiment analysis features, feature selection methods, and balanced training data increased the AUC value up to 0.75 and the F1-score up to 0.54 compared to the baseline of using word unigrams with no feature selection methods on unbalanced data (0.65 AUC and 0.40 F1-score). The error analysis uncovered additional reasons for why moderators respond to patients’ posts.
We showed how feature selection methods and balanced training data can improve the overall classification performance. We present implications of weighing precision versus recall for assisting moderators of online health communities. Our error analysis uncovered social, legal, and ethical issues around addressing community members’ needs. We also note challenges in producing a gold standard, and discuss potential solutions for addressing these challenges.
Social media environments provide popular venues in which patients gain health-related information. Our work contributes to understanding scalable solutions for providing moderators’ expertise in these large-scale, social media environments.
Online health communities; consumer health; human-computer interaction; text mining; health information seeking
Building classification models from clinical data using machine learning methods often relies on labeling of patient examples by human experts. The standard machine learning framework assumes that the labels are assigned by a homogeneous process. However, in reality the labels may come from multiple experts, and it may be difficult to obtain a set of class labels everybody agrees on; it is not uncommon that different experts have different subjective opinions on how a specific patient example should be classified. In this work we propose and study a new multi-expert learning framework that assumes the class labels are provided by multiple experts and that these experts may differ in their class label assessments. The framework explicitly models different sources of disagreements and lets us naturally combine labels from different human experts to obtain (1) a consensus classification model representing the model the group of experts converge to, and (2) individual expert models. We test the proposed framework by building a model for the problem of detection of Heparin Induced Thrombocytopenia (HIT), where examples are labeled by three experts. We show that our framework is superior to multiple baselines (including the standard machine learning framework in which expert differences are ignored) and that our framework leads to both improved consensus and individual expert models.
Clinical records include both coded and free-text fields that interact to reflect complicated patient stories. The information often covers not only the present medical condition and events experienced by the patient, but also refers to relevant events in the past (such as signs, symptoms, tests or treatments). In order to automatically construct a timeline of these events, we first need to extract the temporal relations between pairs of events or time expressions presented in the clinical notes. We designed separate extraction components for different types of temporal relations, utilizing a novel hybrid system that combines machine learning with a graph-based inference mechanism to extract the temporal links. The temporal graph is a directed graph based on parse tree dependencies of the simplified sentences and frequent pattern clues. We generalized the sentences in order to discover patterns that, given the complexities of natural language, might not be directly discoverable in the original sentences. The proposed hybrid system reached an F-measure of 0.63, with precision at 0.76 and recall at 0.54, on the 2012 i2b2 Natural Language Processing corpus for the temporal relation (TLink) extraction task, achieving the highest precision and third highest F-measure among participating teams in the TLink track.
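As a quick check on how the reported scores relate, the sketch below computes the F-measure from precision and recall. It assumes the reported figure is the balanced F1 (the harmonic mean of the two), which the reported numbers are consistent with; the function name is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_measure(0.76, 0.54)
print(round(f1, 2))  # 0.63
```

Because the harmonic mean is pulled toward the smaller value, a system with high precision but moderate recall ends up with an F-measure closer to its recall.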
Temporal relation extraction; Clinical text mining; Automatic patient timeline; Natural Language Processing; Machine learning; Temporal graph
A new model of health care is emerging in which individuals can take charge of their health by connecting to online communities and social networks for personalized support and collective knowledge. Web 2.0 technologies expand the traditional notion of online support groups into a broad and evolving range of informational, emotional, as well as community-based concepts of support. In order to apply these technologies to patient-centered care, it is necessary to incorporate more inclusive conceptual frameworks of social support and community-based research methodologies. This paper introduces a conceptualization of online social support, reviews current challenges in online support research, and outlines six recommendations for the design, evaluation, and implementation of social support in online communities, networks, and groups. The six recommendations are illustrated by CanConnect, an online community for cancer survivors in Middle Tennessee. These recommendations address the interdependencies between online and real-world support and emphasize an inclusive framework of interpersonal and community-based support. The applications of these six recommendations are illustrated through a discussion of online support for cancer survivors.
Social support; online community; social networking; consumer informatics; cancer survivorship
• Addressing the challenge of the second translational gap is key to improving healthcare processes.
• Data-driven methodologies improve likelihood of success.
• We propose the Improvement Data Model (IDM) for data collection and reporting for local improvement.
• WISH, a prototype software tool based on IDM, is used by over 600 users in 50+ improvement projects.
Continuous data collection and analysis have been shown to be essential to achieving improvement in healthcare. However, the data required for local improvement initiatives are often not readily available from hospital Electronic Health Record (EHR) systems or not routinely collected. Furthermore, improvement teams are often restricted in time and funding and thus require inexpensive and rapid tools to support their work. Hence, the informatics challenge in healthcare local improvement initiatives consists of providing a mechanism for rapid modelling of the local domain by non-informatics experts, including performance metric definitions, and grounded in established improvement techniques. We investigate the feasibility of a model-driven software approach to address this challenge, whereby an improvement model designed by a team is used to automatically generate required electronic data collection instruments and reporting tools. To that end, we have designed a generic Improvement Data Model (IDM) to capture the data items and quality measures relevant to the project, and constructed Web Improvement Support in Healthcare (WISH), a prototype tool that takes user-generated IDM models and creates a data schema, data collection web interfaces, and a set of live reports, based on Statistical Process Control (SPC), for use by improvement teams. The software has been successfully used in over 50 improvement projects, with more than 700 users. We present in detail the experiences of one of those initiatives, the Chronic Obstructive Pulmonary Disease project in Northwest London hospitals. The specific challenges of improvement in healthcare are analysed and the benefits and limitations of the approach are discussed.
D2.1 (Software engineering): requirements/specification; J.3 (life and medical sciences): Health; Model-driven architectures; Healthcare analytics; Quality improvement; Data collection; Metrics; Performance analytics
The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of “an attack” on GWAS data by Homer et al. Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach that provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed by Uhler et al. for releasing differentially-private χ²-statistics by allowing for an arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls’ data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov.
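For readers unfamiliar with differential privacy, the sketch below shows the generic Laplace mechanism applied to a single test statistic. It is not the release method of the paper: the function names are illustrative, and the sensitivity value in the usage line is a made-up placeholder, whereas the actual sensitivity of a χ² or allelic test statistic has to be derived from the numbers of cases and controls.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_statistic(value, sensitivity, epsilon, rng):
    """Release one statistic with epsilon-differential privacy via the Laplace mechanism.

    `sensitivity` is the largest change in the statistic caused by altering one
    individual's data; it is an assumed input here, not derived in this sketch.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
# Hypothetical statistic and sensitivity, for illustration only.
noisy = private_statistic(value=12.5, sensitivity=4.0, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives stronger privacy but larger noise, which is the risk-utility trade-off mentioned above. Releasing several statistics this way consumes privacy budget for each one.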
differential privacy; genome-wide association study (GWAS); Pearson χ2-test; allelic test; contingency table; single-nucleotide polymorphism (SNP)
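To illustrate the kind of release discussed in the abstract above, the following sketch computes a Pearson χ2 statistic and an allelic test statistic from a case-control genotype table and perturbs them with Laplace noise. It assumes the standard Laplace mechanism; the abstract does not give the sensitivity bounds, so `sensitivity` is left as a caller-supplied parameter and the value used here is only a placeholder. All names and counts are hypothetical.

```python
import random


def pearson_chi2(table):
    """Pearson chi-square statistic for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def allelic_table(genotype_table):
    """Collapse genotype counts (0, 1, 2 minor alleles per row) into allele counts."""
    return [[2 * aa + ab, ab + 2 * bb] for aa, ab, bb in genotype_table]


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_release(stat, sensitivity, epsilon, rng=None):
    """Laplace mechanism: add noise with scale sensitivity / epsilon."""
    rng = rng or random.Random()
    return stat + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical genotype counts: first row cases, second row controls.
genotypes = [[30, 50, 20], [45, 40, 15]]
chi2 = pearson_chi2(genotypes)
allelic = pearson_chi2(allelic_table(genotypes))
# ASSUMPTION: placeholder sensitivity; it must be derived for the actual statistic and design.
noisy_chi2 = private_release(chi2, sensitivity=4.0, epsilon=1.0, rng=random.Random(0))
```

Smaller values of `epsilon` give stronger privacy but noisier statistics, which is the risk-utility trade-off the abstract refers to.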
With the coming deluge of genome data, the need to store and process large-scale genome data, provide easy access to biomedical analysis tools, and share and retrieve data efficiently presents significant challenges. The variability in data volume results in variable computing and storage requirements; biomedical researchers are therefore pursuing more reliable, dynamic and convenient methods for conducting sequencing analyses. This paper proposes a Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses, which enables reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner. Our platform extends the existing Galaxy workflow system by adding data management capabilities for transferring large quantities of data efficiently and reliably (via Globus Transfer), domain-specific analysis tools preconfigured for immediate use by researchers (via user-specific tools integration), automatic deployment on the Cloud for on-demand resource allocation and pay-as-you-go pricing (via Globus Provision), a Cloud provisioning tool for auto-scaling (via the HTCondor scheduler), and support for validating the correctness of workflows (via semantic verification tools). Two bioinformatics workflow use cases and a performance evaluation are presented to validate the feasibility of the proposed approach.
Bioinformatics; Scientific workflow; Sequencing analyses; Cloud computing; Galaxy
Biomedical taxonomies, thesauri and ontologies, such as the International Classification of Diseases (a taxonomy) or the National Cancer Institute Thesaurus (an OWL-based ontology), play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the International Classification of Diseases, which is currently under active development by the World Health Organization, contains nearly 50,000 classes representing a wide variety of diseases and causes of death. This growth in size was accompanied by an evolution in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners and other stakeholders. Understanding the way these different stakeholders collaborate will enable us to improve editing environments that support such collaborations. In this paper, we uncover how large ontology-engineering projects, such as the International Classification of Diseases in its 11th revision, unfold by analyzing usage logs of five different biomedical ontology-engineering projects of varying sizes and scopes using Markov chains. We discover intriguing interaction patterns (e.g., which properties users frequently change after specific given ones) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development.
From our analysis, we identify commonalities and differences between different projects that have implications for project managers, ontology editors, developers and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.
Collaborative ontology engineering; Markov chains; sequential patterns; collaboration; ontology-engineering tool; user interface
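As a minimal sketch of the kind of analysis described in the abstract above, first-order Markov chain transition probabilities can be estimated from sequences of edited properties by counting consecutive pairs. The abstract does not state the chain order or the log format, so the first-order assumption, the session structure and the property names below are all hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Maximum-likelihood first-order transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Count every consecutive (current, next) pair within a session.
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {
            state: count / total for state, count in followers.items()
        }
    return probabilities


# Hypothetical usage log: each session lists the properties a user changed, in order.
sessions = [
    ["title", "definition", "title", "definition"],
    ["title", "synonym"],
]
probabilities = transition_probabilities(sessions)
```

In this toy log, a change to `title` is followed by a change to `definition` in two of three cases, which is the type of pattern ("which property is changed after a given one") the abstract mentions.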
To compare linear and Laplacian SVMs on a clinical text classification task; to evaluate the effect of unlabeled training data on Laplacian SVM performance.
The development of machine learning-based clinical text classifiers requires the creation of labeled training data, obtained via manual review by clinicians. Due to the effort and expense involved in labeling data, training data sets in the clinical domain are of limited size. In contrast, electronic medical record (EMR) systems contain hundreds of thousands of unlabeled notes that are not used by supervised machine learning approaches. Semi-supervised learning algorithms use both labeled and unlabeled data to train classifiers, and can outperform their supervised counterparts.
We trained support vector machines (SVMs) and Laplacian SVMs on a training reference standard of 820 abdominal CT, MRI, and ultrasound reports labeled for the presence of potentially malignant liver lesions that require follow-up (positive class prevalence 77%). The Laplacian SVM used 19,845 randomly sampled unlabeled notes in addition to the training reference standard. We evaluated SVMs and Laplacian SVMs on a test set of 520 labeled reports.
The Laplacian SVM trained on labeled and unlabeled radiology reports significantly outperformed supervised SVMs (macro-F1 0.773 vs. 0.741, sensitivity 0.943 vs. 0.911, positive predictive value 0.877 vs. 0.883). Performance improved with the number of labeled and unlabeled notes used to train the Laplacian SVM (Pearson’s ρ=0.529 for correlation between number of unlabeled notes and macro-F1 score). These results suggest that practical semi-supervised methods such as the Laplacian SVM can leverage the large, unlabeled corpora that reside within EMRs to improve clinical text classification.
Semi-supervised learning; Support vector machine; Graph Laplacian; Natural language processing
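The way a Laplacian SVM draws on unlabeled notes can be sketched through its graph Laplacian term: labeled and unlabeled examples are linked in a similarity graph, and the classifier is penalized when its outputs differ across strongly connected examples. The sketch below only builds the graph and evaluates that penalty; it is not a full Laplacian SVM. The k-nearest-neighbour graph, the RBF weights and the one-dimensional toy points are assumptions, since the abstract does not describe the graph construction.

```python
import math


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def knn_graph(points, k=1, gamma=1.0):
    """Symmetric k-nearest-neighbour graph with RBF edge weights."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: squared_distance(points[i], points[j]))
        for j in others[:k]:
            w = math.exp(-gamma * squared_distance(points[i], points[j]))
            weights[i][j] = weights[j][i] = w  # symmetrise the graph
    return weights


def graph_laplacian(weights):
    """Unnormalised Laplacian L = D - W."""
    n = len(weights)
    return [
        [(sum(weights[i]) if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness_penalty(f, laplacian):
    """f^T L f, which equals the sum over edges of w_ij * (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * laplacian[i][j] * f[j] for i in range(n) for j in range(n))


# Hypothetical feature vectors for labeled and unlabeled notes: two tight clusters.
points = [(0.0,), (0.1,), (5.0,), (5.1,)]
L = graph_laplacian(knn_graph(points, k=1))
smooth = smoothness_penalty([1, 1, -1, -1], L)  # outputs agree within clusters
rough = smoothness_penalty([1, -1, 1, -1], L)   # outputs disagree within clusters
```

A decision function that agrees within each cluster incurs no penalty, while one that splits a cluster is penalized; this is how unlabeled points can shape the decision boundary without contributing labels.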
We demonstrate the importance of explicit definitions of electronic health record (EHR) data completeness and how different conceptualizations of completeness may impact findings from EHR-derived datasets. This study has important repercussions for researchers and clinicians engaged in the secondary use of EHR data. We describe four prototypical definitions of EHR completeness: documentation, breadth, density, and predictive completeness. Each definition dictates a different approach to the measurement of completeness. These measures were applied to representative data from NewYork-Presbyterian Hospital’s clinical data warehouse. We found that according to any definition, the number of complete records in our clinical database is far lower than the nominal total. The proportion that meets criteria for completeness is heavily dependent on the definition of completeness used, and the different definitions generate different subsets of records. We conclude that the concept of completeness in EHR is contextual. We urge data consumers to be explicit in how they define a complete record and transparent about the limitations of their data.
data quality; electronic health records; secondary use; completeness
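The point that different definitions select different subsets of records can be illustrated with a toy example. The abstract names the four definitions but not how each was measured, so the two checks below (breadth as "every required data type is present", density as "enough data points within a time window") are hypothetical operationalizations, and the records are invented for illustration.

```python
def breadth_complete(record, required_types):
    """Assumed breadth check: at least one entry of every required data type."""
    return set(required_types) <= {data_type for _day, data_type in record}


def density_complete(record, min_points, window_days):
    """Assumed density check: at least `min_points` entries within some window."""
    days = sorted(day for day, _data_type in record)
    for i, start in enumerate(days):
        in_window = sum(1 for d in days[i:] if d - start < window_days)
        if in_window >= min_points:
            return True
    return False


# Invented records: patient id -> list of (day, data type) entries.
records = {
    "p1": [(1, "lab"), (200, "medication"), (400, "visit")],
    "p2": [(1, "lab"), (3, "lab"), (5, "lab"), (8, "lab")],
    "p3": [(1, "lab"), (2, "medication"), (3, "visit"), (4, "lab")],
}
required = {"lab", "medication", "visit"}

by_breadth = {p for p, r in records.items() if breadth_complete(r, required)}
by_density = {p for p, r in records.items() if density_complete(r, 4, 30)}
```

Here `by_breadth` and `by_density` overlap only in `p3`, so a study that calls a record "complete" without stating the definition could be describing quite different cohorts.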