Identifying partial mappings between two terminologies is of special importance when one terminology is finer-grained than the other, as is the case for the Human Phenotype Ontology (HPO), mainly used for research purposes, and SNOMED CT, mainly used in healthcare.
To investigate and contrast lexical and logical approaches to deriving partial mappings between HPO and SNOMED CT.
1) Lexical approach—We identify modifiers in HPO terms and attempt to map demodified terms to SNOMED CT through UMLS; 2) Logical approach—We leverage subsumption relations in HPO to infer partial mappings to SNOMED CT; 3) Comparison—We analyze the specific contribution of each approach and evaluate the quality of the partial mappings through manual review.
There are 7358 HPO concepts with no complete mapping to SNOMED CT. We identified partial mappings lexically for 33% of them and logically for 82%. We identified partial mappings both lexically and logically for 27%. The clinical relevance of the partial mappings (for a cohort selection use case) is 49% for lexical mappings and 67% for logical mappings.
Through complete and partial mappings, 92% of the 10,454 HPO concepts can be mapped to SNOMED CT (30% complete and 62% partial). Equivalence mappings between HPO and SNOMED CT allow for interoperability between data described using these two systems. However, due to differences in focus and granularity, equivalence is only possible for 30% of HPO classes. In the remaining cases, partial mappings provide a next-best approach for traversing between the two systems. Both lexical and logical mapping techniques produce mappings that cannot be generated by the other technique, suggesting that the two techniques are complementary to each other. Finally, this work demonstrates interesting properties (both lexical and logical) of HPO and SNOMED CT and illustrates some limitations of mapping through UMLS.
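The lexical approach outlined above (identify and strip modifiers, then look up the demodified term) can be sketched as follows. The modifier list and the toy SNOMED CT index are illustrative assumptions, not drawn from the actual mapping pipeline:

```python
# Sketch of the lexical "demodification" approach: strip modifier words from
# a term and look up the residual string in a target terminology index.
# MODIFIERS and SNOMED_INDEX are toy examples, not actual pipeline content.
MODIFIERS = {"severe", "mild", "bilateral", "recurrent", "chronic"}

# Toy index mapping normalized term strings to SNOMED CT codes.
SNOMED_INDEX = {"otitis media": "65363002"}

def demodify(term: str) -> str:
    """Remove known modifier words from a term."""
    words = term.lower().split()
    return " ".join(w for w in words if w not in MODIFIERS)

def partial_map(term: str):
    """Return a partial mapping for a term, or None if no match."""
    return SNOMED_INDEX.get(demodify(term))

print(partial_map("Recurrent otitis media"))  # maps via the demodified term
```

A real implementation would map through UMLS rather than a flat dictionary, but the demodify-then-lookup structure is the same.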
Partial mapping; Human phenotype; Ontology; Standard terminologies; Interoperability
To investigate approaches to supporting the analysis of historical medication datasets with RxNorm.
We created two sets of National Drug Codes (NDCs). One is based on historical NDCs harvested from versions of RxNorm from 2007 to present. The other comprises all sources of NDCs in the current release of RxNorm, including proprietary sources. We evaluated these two resources against four sets of NDCs obtained from various sources.
In two historical medication datasets, 14–19% of the NDCs were obsolete, but 91–96% of these obsolete NDCs could be recovered and mapped to active drug concepts.
Adding historical data significantly increases NDC mapping to active RxNorm drugs. A service for mapping historical NDC datasets leveraging RxNorm was added to the RxNorm API and is available at https://rxnav.nlm.nih.gov/.
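When working with NDC datasets like those described above, a common preprocessing step (assumed here; the abstract does not detail it) is normalizing hyphenated 10-digit NDCs to the 11-digit 5-4-2 format before lookup:

```python
# Sketch: normalizing a 10-digit NDC (4-4-2, 5-3-2, or 5-4-1 format) to the
# 11-digit 5-4-2 form commonly used when querying RxNorm-based services.
def normalize_ndc(ndc: str) -> str:
    parts = ndc.split("-")
    if len(parts) != 3:
        raise ValueError("expected a hyphenated 10-digit NDC")
    labeler, product, package = parts
    # Pad whichever segment is short so the result is 5-4-2 (11 digits).
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)

print(normalize_ndc("0002-7597-01"))   # 4-4-2 -> 00002759701
print(normalize_ndc("50242-040-62"))   # 5-3-2 -> 50242004062
```

Normalized NDCs can then be submitted to NDC-lookup services such as those in the RxNorm API.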
There is limited consensus among drug information sources on what constitutes drug-drug interactions (DDIs). We investigate DDI information in two publicly available sources, NDF-RT and DrugBank.
We acquire drug-drug interactions from NDF-RT and DrugBank, and normalize the drugs to RxNorm. We compare interactions between NDF-RT and DrugBank and evaluate both sources against a reference list of 360 critical interactions. We compare the interactions detected with NDF-RT and DrugBank on a large prescription dataset. Finally, we contrast NDF-RT and DrugBank against a commercial source.
DrugBank drug-drug interaction information has limited overlap with NDF-RT (24–30%). The coverage of the reference set by both sources is about 60%. Applied to a prescription dataset of 35.5M pairs of co-prescribed systemic clinical drugs, NDF-RT would have identified 808,285 interactions, while DrugBank would have identified 1,170,693. Of these, 382,833 are common. The commercial source Multum provides a more systematic coverage (91%) of the reference list.
This investigation confirms the limited overlap of DDI information between NDF-RT and DrugBank. Additional research is required to determine which source is better, if any. Usage of any of these sources in clinical decision systems should disclose these limitations.
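The overlap comparison between sources can be sketched by treating each interaction as an unordered pair of normalized drugs. The drug names and pairs below are illustrative, not actual NDF-RT or DrugBank content:

```python
# Sketch: comparing DDI lists from two sources after normalizing drugs to a
# common vocabulary (the names below stand in for RxNorm-normalized drugs).
def as_pairs(interactions):
    """Represent each interaction as an unordered pair (frozenset)."""
    return {frozenset(p) for p in interactions}

ndfrt = as_pairs([("warfarin", "aspirin"), ("simvastatin", "amiodarone")])
drugbank = as_pairs([("aspirin", "warfarin"), ("digoxin", "clarithromycin")])

# Order-insensitive comparison: (a, b) and (b, a) count as the same DDI.
common = ndfrt & drugbank
overlap = len(common) / len(ndfrt | drugbank)  # Jaccard-style overlap
print(len(common), round(overlap, 2))  # 1 0.33
```

Using frozensets makes the intersection insensitive to the order in which each source lists the two drugs of an interaction.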
Electronic supplementary material
The online version of this article (doi:10.1186/s13326-015-0018-0) contains supplementary material, which is available to authorized users.
Drug-drug interactions; NDF-RT; DrugBank
While the association between a drug and an adverse event (ADE) is generally detected at the level of individual drugs, ADEs are often discussed at the class level, i.e., at the level of pharmacologic classes (e.g., in drug labels). We propose two approaches, one visual and one computational, to exploring the contribution of individual drugs to the class signal.
Having established a dataset of ADEs from MEDLINE, we aggregate drugs into ATC classes and ADEs into high-level MeSH terms. We compute statistical associations between drugs and ADEs at the drug level and at the class level. Finally, we visualize the signals at increasing levels of resolution using heat maps. We also automate the exploration of drug-ADE associations at the class level using clustering techniques.
Using our visual approach, we were able to uncover known associations, e.g., between fluoroquinolones and tendon injuries, and between statins and rhabdomyolysis. Using our computational approach, we systematically analyzed 488 associations between a drug class and an ADE.
The findings gained from our exploratory techniques should be of interest to the curators of ADE repositories and drug safety professionals. Our approach can be applied to different drug-ADE datasets, using different drug classification systems and different signal detection algorithms.
Electronic supplementary material
The online version of this article (doi:10.1186/s13326-015-0017-1) contains supplementary material, which is available to authorized users.
Adverse drug events; Drug classes; Anatomical Therapeutic Chemical (ATC) drug classification system; Class effect; Heat maps; Pharmacovigilance
Statements about RDF statements, or meta triples, provide additional information about individual triples, such as the source, the time or place of occurrence, or the certainty. Integrating such meta triples into semantic knowledge bases would enable the querying and reasoning mechanisms to be aware of the provenance, time, location, or certainty of triples. However, an efficient RDF representation for such meta knowledge of triples remains challenging. The existing standard reification approach allows such meta knowledge of RDF triples to be expressed in RDF in two steps. The first step is representing the triple by a Statement instance whose subject, predicate, and object are indicated separately in three different triples. The second step is creating assertions about that instance as if it were the statement. While reification is simple and intuitive, this approach has no formal semantics and, as noted in the RDF Primer, is not commonly used in practice.
In this paper, we propose a novel approach called Singleton Property for representing statements about statements and provide a formal semantics for it. We explain how this singleton property approach fits well with the existing syntax and formal semantics of RDF, and the syntax of SPARQL query language. We also demonstrate the use of singleton property in the representation and querying of meta knowledge in two examples of Semantic Web knowledge bases: YAGO2 and BKR. Our experiments on the BKR show that the singleton property approach gives a decent performance in terms of number of triples, query length and query execution time compared to existing approaches. This approach, which is also simple and intuitive, can be easily adopted for representing and querying statements about statements in other knowledge bases.
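The difference between standard reification and the singleton property approach can be illustrated with plain triples. The example fact and abbreviated URIs below are illustrative, not taken from YAGO2 or BKR:

```python
# Sketch contrasting standard RDF reification with the singleton property
# approach, using plain tuples for triples. URIs are abbreviated; the fact
# and its provenance are illustrative.
fact = ("ex:BobDylan", "ex:livesIn", "ex:Duluth")

# Standard reification: four triples to reify the statement,
# plus one triple per meta assertion.
reified = [
    ("ex:stmt1", "rdf:type", "rdf:Statement"),
    ("ex:stmt1", "rdf:subject", fact[0]),
    ("ex:stmt1", "rdf:predicate", fact[1]),
    ("ex:stmt1", "rdf:object", fact[2]),
    ("ex:stmt1", "ex:source", "ex:Wikipedia"),
]

# Singleton property: a unique property instance asserts the fact itself,
# so only two triples are needed before the meta assertion.
singleton = [
    (fact[0], "ex:livesIn#1", fact[2]),
    ("ex:livesIn#1", "rdf:singletonPropertyOf", fact[1]),
    ("ex:livesIn#1", "ex:source", "ex:Wikipedia"),
]

print(len(reified), len(singleton))  # 5 3
```

The reduction in triple count per annotated statement is one reason the approach compares favorably on the number-of-triples metric reported above.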
Semantic Web; Meta triples; RDF; SPARQL; Reification; RDF Singleton Property
The NDF-RT (National Drug File Reference Terminology) is an ontology, which describes drugs and their properties and supports computerized physician order entry systems. NDF-RT’s classes are mostly specified using only necessary conditions and lack sufficient conditions, making its use limited until recently, when asserted drug-class relations were added. The addition of these asserted drug-class relations presents an opportunity to compare them with drug-class relations that can be inferred using the properties of drugs and drug classes in NDF-RT.
We enriched NDF-RT’s drug-classes with sufficient conditions, added property equivalences, and then used an OWL reasoner to infer drug-class membership relations. We compared the inferred class relations to the recently added asserted relations derived from FDA Structured Product Labels.
The inferred and asserted relations only match in about 50% of the cases, due to incompleteness of the drug descriptions and quality issues in the class definitions.
This investigation quantifies and categorizes the disparities between asserted and inferred drug-class relations and illustrates issues with class definitions and drug descriptions. In addition, it serves as an example of the benefits description logic (DL) reasoning can add to ontology development and evaluation.
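The inference step can be mimicked in plain code: a drug is a member of a class when its property annotations satisfy the class's defining (necessary and sufficient) conditions. The drug descriptions and class definition below are illustrative, not actual NDF-RT content:

```python
# Sketch: inferring drug-class membership by testing a class's defining
# conditions against drug property annotations, mimicking in plain Python
# what an OWL reasoner does once sufficient conditions are added.
# Drug descriptions and the class definition are illustrative.
drugs = {
    "propranolol": {"MoA": {"Adrenergic beta-Antagonists"}},
    "amlodipine": {"MoA": {"Calcium Channel Antagonists"}},
}

# Class defined by a necessary-and-sufficient condition on mechanism of action.
beta_blockers = {"MoA": {"Adrenergic beta-Antagonists"}}

def is_member(drug_props, class_def):
    """True if the drug satisfies every property condition of the class."""
    return all(class_def[p] <= drug_props.get(p, set()) for p in class_def)

inferred = [d for d, props in drugs.items() if is_member(props, beta_blockers)]
print(inferred)  # ['propranolol']
```

Discrepancies of the kind reported above arise when a drug lacks the annotation required by a class definition (incomplete description) or when the definition itself is too broad or too narrow.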
Electronic supplementary material
The online version of this article (doi:10.1186/s13326-015-0007-3) contains supplementary material, which is available to authorized users.
Ontology; Description logics; Quality assurance; National drug file-reference terminology
Relation reversals in ontological systems refer to patterns in which a path from concept A to concept B in one version becomes a path with the positions of A and B switched in another version. We present a scalable approach, using cloud computing, to systematically extract all hierarchical relation reversals among 8 SNOMED CT versions from 2009 to 2014. Taking advantage of our MapReduce algorithms for computing transitive closure and large-scale set operations, 48 reversals were found in 18 minutes using a 30-node local cloud, through 28 pairwise comparisons of the 8 versions that completely cover all possible scenarios. Except for one, all such reversals occurred in three sub-hierarchies: Body Structure, Clinical Finding, and Procedure. Two (2) reversal pairs involved an uncoupling of the pair before the is-a coupling was reversed. Twelve (12) reversal pairs involved paths of length two, and none (0) involved paths beyond length two. Such reversals not only represent areas of potential need for additional modeling work, but are also important for identifying and handling cycles in the comparative visualization of ontological evolution.
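The core computation behind reversal detection, transitive closure of the is-a hierarchy in each version followed by set operations across versions, can be sketched in memory as a simple analogue of the MapReduce pipeline. The two toy hierarchies below are illustrative:

```python
# Sketch: compute the transitive closure of each version's is-a hierarchy,
# then intersect one closure with the flipped pairs of the other to find
# reversals. An in-memory analogue of the MapReduce computation.
def transitive_closure(edges):
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure
               if b == c and (a, d) not in closure}
        if not new:
            return closure
        closure |= new

v1 = transitive_closure({("A", "B"), ("B", "C")})   # A is-a B is-a C
v2 = transitive_closure({("C", "B"), ("B", "A")})   # flipped in version 2
reversals = {(a, b) for (a, b) in v1 if (b, a) in v2}
print(sorted(reversals))  # all three ancestor/descendant pairs reversed
```

The quadratic join in `transitive_closure` is exactly the step the MapReduce algorithms distribute; the reversal check itself is a large-scale set intersection.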
Medical terminology collects and organizes the many different kinds of terms employed in the biomedical domain both by practitioners and also in the course of biomedical research. In addition to serving as labels for biomedical classes, these names reflect the organizational principles of biomedical vocabularies and ontologies. Some names represent invariant features (classes, universals) of biomedical reality (i.e., they are a matter for ontology). Other names, however, convey also how this reality is perceived, measured, and understood by health professionals (i.e., they belong to the domain of epistemology). We analyze terms from several biomedical vocabularies in order to throw light on the interactions between ontological and epistemological components of these terminologies. We identify four cases: 1) terms containing classification criteria, 2) terms reflecting detectability, modality, uncertainty, and vagueness, 3) terms created in order to obtain a complete partition of a given domain, and 4) terms reflecting mere fiat boundaries. We show that epistemology-loaded terms are pervasive in biomedical vocabularies, that the “classes” they name often do not comply with sound classification principles, and that they are therefore likely to cause problems in the evolution and alignment of terminologies and associated ontologies.
The objective of this study is to compare description logics (DLs) and frames for representing large-scale biomedical ontologies and reasoning with them. The ontology under investigation is the Foundational Model of Anatomy (FMA). We converted it from its frame-based representation in Protégé into OWL DL. The OWL reasoner Racer helped identify unsatisfiable classes in the FMA. Support for consistency checking is clearly an advantage of using DLs rather than frames. The interest of reclassification was limited, due to the difficulty of defining necessary and sufficient conditions for anatomical entities. The sheer size and complexity of the FMA was also an issue.
Non-lattice fragments are often indicative of structural anomalies in ontological systems and, as such, represent possible areas of focus for subsequent quality assurance work. However, extracting the non-lattice fragments in large ontological systems is computationally expensive if not prohibitive, using a traditional sequential approach. In this paper we present a general MapReduce pipeline, called MaPLE (MapReduce Pipeline for Lattice-based Evaluation), for extracting non-lattice fragments in large partially ordered sets and demonstrate its applicability in ontology quality assurance. Using MaPLE in a 30-node Hadoop local cloud, we systematically extracted non-lattice fragments in 8 SNOMED CT versions from 2009 to 2014 (each containing over 300k concepts), with an average total computing time of less than 3 hours per version. With dramatically reduced time, MaPLE makes it feasible not only to perform exhaustive structural analysis of large ontological hierarchies, but also to systematically track structural changes between versions. Our change analysis showed that the average change rates on the non-lattice pairs are up to 38.6 times higher than the change rates of the background structure (concept nodes). This demonstrates that fragments around non-lattice pairs exhibit significantly higher rates of change in the process of ontological evolution.
Many complex information needs that arise in biomedical disciplines require exploring multiple documents in order to obtain information. While traditional information retrieval techniques that return a single ranked list of documents are quite common for such tasks, they may not always be adequate. The main issue is that ranked lists typically impose a significant burden on users to filter out irrelevant documents. Additionally, users must intuitively reformulate their search query when relevant documents have not been highly ranked. Furthermore, even after interesting documents have been selected, very few mechanisms exist that enable document-to-document transitions. In this paper, we demonstrate the utility of assertions extracted from biomedical text (called semantic predications) to facilitate retrieving relevant documents for complex information needs. Our approach offers an alternative to query reformulation by establishing a framework for transitioning from one document to another. We evaluate this novel knowledge-driven approach using precision and recall metrics on the 2006 TREC Genomics Track.
semantic predications; question answering; background knowledge; literature-based discovery; text mining
We present a scalable, SPARQL-based computational pipeline for testing the lattice-theoretic properties of partial orders represented as RDF triples. The use case for this work is quality assurance in biomedical ontologies, one desirable property of which is conformance to lattice structures. At the core of our pipeline is the algorithm called NuMi, for detecting the Number of Minimal upper bounds of any pair of elements in a given finite partial order. Our technical contribution is the coding of NuMi completely in SPARQL. To show its scalability, we applied NuMi to the entirety of SNOMED CT, the largest clinical ontology (over 300,000 concepts). Our experimental results have been groundbreaking: for the first time, all non-lattice pairs in SNOMED CT have been identified exhaustively from 34 million candidate pairs using over 2.5 billion queries issued to Virtuoso. The percentage of non-lattice pairs ranges from 0 to 1.66 among the 19 SNOMED CT hierarchies. These non-lattice pairs represent target areas for focused curation by domain experts. RDF, SPARQL and related tooling provide an efficient platform for implementing lattice algorithms on large data structures.
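The test behind NuMi can be sketched as follows: a pair of concepts is a non-lattice pair when it has more than one minimal common ancestor in the is-a hierarchy. The toy hierarchy below is illustrative:

```python
# Sketch of the NuMi idea: count the minimal common ancestors (minimal upper
# bounds, with ancestors taken as "above") of a pair of concepts; more than
# one means the pair violates the lattice property. Toy hierarchy only.
def ancestors(node, parents):
    """All ancestors of a node in a child -> parents mapping."""
    seen, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def minimal_common_ancestors(x, y, parents):
    common = ancestors(x, parents) & ancestors(y, parents)
    # Keep only common ancestors with no other common ancestor below them.
    return {a for a in common
            if not any(a in ancestors(b, parents) for b in common - {a})}

# Toy hierarchy: x and y share two incomparable parents p and q.
parents = {"x": ["p", "q"], "y": ["p", "q"], "p": ["root"], "q": ["root"]}
mca = minimal_common_ancestors("x", "y", parents)
print(len(mca) > 1)  # True: {p, q} makes (x, y) a non-lattice pair
```

The pipeline expresses this same computation in SPARQL so that it can be evaluated directly over the RDF representation of the hierarchy.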
There is a need to develop methods to automatically incorporate prior knowledge to support the prediction and validation of novel functional associations. One such important source is represented by the Gene Ontology (GO)™ and the many model organism databases of gene products annotated to the GO. We investigated quantitative relationships between the GO-driven similarity of genes and their functional interactions by analyzing different types of associations in Saccharomyces cerevisiae and Caenorhabditis elegans. Interacting genes exhibited significantly higher levels of GO-driven similarity (GOS) in comparison to random pairs of genes used as a surrogate for negative interactions. The Biological Process hierarchy provides more reliable results for co-regulatory and protein-protein interactions. GOS represent a relevant resource to support prediction of functional networks in combination with other resources.
The Gene Ontology and annotations derived from the S. cerevisiae Genome Database were analyzed to calculate functional similarity of gene products. Three methods for measuring similarity (including a distance-based approach) were implemented. Significant, quantitative relationships between similarity and expression correlation of pairs of genes were detected. Using a known gene expression dataset in yeast, this study compared more than three million pairs of gene products on the basis of these functional properties. Highly correlated genes exhibit strong similarity based on information originating from the gene ontology taxonomies. Such a similarity is significantly stronger than that observed between weakly correlated genes. This study supports the feasibility of applying gene ontology-driven similarity methods to functional prediction tasks, such as the validation of gene expression analyses and the identification of false positives in protein interaction studies.
Gene Ontology; functional similarity; gene expression correlation
This research explores the feasibility of semantic similarity approaches to supporting predictive tasks in functional genomics. It aims to establish potential relationships between ontology-based similarity of gene products and important functional properties, such as gene expression correlation. Similarity measures based on the information content of the Gene Ontology (GO) were analyzed. Models have been implemented using data obtained from well-known studies in S. cerevisiae. Results suggest that there may exist significant relationships between gene expression correlation and semantic similarity. Analyses of protein complex data show that, in general, there is a significant correlation between the semantic similarity exhibited by a pair of genes and the probability of finding them in the same complex. These results can also be interpreted as an assessment of the quality and consistency of the information represented in the GO.
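Information-content measures of the kind analyzed here score two terms by the specificity of their shared ancestry. A minimal Resnik-style sketch, with toy annotation counts standing in for frequencies derived from GO annotation databases, looks like this:

```python
# Sketch: information-content (Resnik-style) similarity between GO terms.
# IC(t) = -log p(t), where p(t) is the annotation frequency of term t
# (with counts propagated to ancestors). Counts below are illustrative.
import math

counts = {"root": 100, "metabolism": 40, "glycolysis": 5}
total = counts["root"]

def ic(term):
    """Information content: rarer terms are more informative."""
    return -math.log(counts[term] / total)

def resnik(common_ancestors):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(ic(t) for t in common_ancestors)

# Two genes annotated under 'glycolysis' share it as a common ancestor.
print(round(resnik({"root", "metabolism", "glycolysis"}), 2))  # 3.0
```

Gene pairs whose most informative common ancestor is specific (high IC) score high, which is the property being correlated with expression data in the study.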
Association rules mining methods have been recently applied to gene expression data analysis to reveal relationships between genes and different conditions and features. However, not much effort has focused on detecting the relation between gene expression maps and related gene functions. Here we describe such an approach to mine association rules among gene functions in clusters of similar gene expression maps on mouse brain. The experimental results show that the detected association rules make sense biologically. By inspecting the obtained clusters and the genes having the gene functions of frequent itemsets, interesting clues were discovered that provide valuable insight to biological scientists. Moreover, discovered association rules can be potentially used to predict gene functions based on similarity of gene expression maps.
association rules mining; gene expression maps; gene functions; clustering; voxelation
Taxonomies are commonly used for organizing knowledge, particularly in biomedicine where the taxonomy of living organisms and the classification of diseases are central to the domain. The principles used to produce taxonomies are either intrinsic (properties of the partial ordering relation) or added to make knowledge more manageable (opposition of siblings and economy). The applicability of these principles in the biomedical domain is presented using the Unified Medical Language System (UMLS) and issues raised by the application of these principles are illustrated. While intrinsic principles are not challenged, we argue that the opposition of siblings brings to bear excessive constraints on a domain ontology and that the adverse effects of economy may outweigh its benefits. The two-level structure used in the UMLS is discussed.
Theory; Taxonomic relation; Ontology; Biomedical domain; Unified Medical Language System
Undetected adverse drug reactions (ADRs) pose a major burden on the health system. Data mining methodologies designed to identify signals of novel ADRs are of great importance for drug safety surveillance. The development and evaluation of these methodologies requires proper reference benchmarks. While progress has recently been made in developing such benchmarks, our understanding of the performance characteristics of the data mining methodologies is limited because existing benchmarks do not support prospective performance evaluations. We address this shortcoming by providing a reference standard to support prospective performance evaluations. The reference standard was systematically curated from drug labeling revisions, such as new warnings, which were issued and communicated by the US Food and Drug Administration in 2013. The reference standard includes 62 positive test cases and 75 negative controls, and covers 44 drugs and 38 events. We provide usage guidance and empirical support for the reference standard by applying it to analyze two data sources commonly mined for drug safety surveillance.
The Resource Description Framework (RDF) format is being used by a large number of scientific applications to store and disseminate their datasets. Provenance information, describing the source or lineage of the datasets, plays an increasingly significant role in ensuring data quality, computing trust values of the datasets, and ranking query results. Current provenance tracking approaches using the RDF reification vocabulary suffer from a number of known issues, including lack of formal semantics, use of blank nodes, and application-dependent interpretation of reified RDF triples. In this paper, we introduce a new approach called Provenance Context Entity (PaCE) that uses the notion of provenance context to create provenance-aware RDF triples. We also define the formal semantics of PaCE through a simple extension of the existing RDF(S) semantics that ensures compatibility of PaCE with existing Semantic Web tools and implementations. We have implemented the PaCE approach in the Biomedical Knowledge Repository (BKR) project at the US National Library of Medicine. The evaluations demonstrate a minimum of 49% reduction in the total number of provenance-specific RDF triples generated using the PaCE approach as compared to RDF reification. In addition, performance for complex queries improves by three orders of magnitude, and remains comparable to the RDF reification approach for simpler provenance queries.
Provenance context entity; Biomedical knowledge repository; Context theory; RDF reification; Provenir ontology
To investigate the extent to which pharmacoepidemiologic groupings are homogeneous in terms of clinical properties.
In our analysis, we classified drug subgroups from the pharmacoepidemiologic Anatomical Therapeutic Chemical (ATC) classification system based on clinical drug properties. We established mappings from ATC fifth-level drug entities to drug property annotations in the National Drug File Reference Terminology (NDF-RT), including therapeutic categories, mechanisms of action, and physiologic effects. Based on the annotations for the individual drugs, we computed homogeneity scores for all ATC groups and analyzed their distribution.
We found ATC groups to be generally homogeneous, more so for mechanisms of action and physiologic effects than for therapeutic intent. However, only half of all ATC drugs can be analyzed with this approach, in part because of missing properties in NDF-RT.
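A homogeneity score of the kind described can be sketched as the proportion of a group's drugs sharing the most frequent property value. The scoring function and annotations below are illustrative, not necessarily the paper's exact definition:

```python
# Sketch: a homogeneity score for an ATC group, computed as the proportion
# of member drugs annotated with the group's most frequent property value.
# The scoring function and annotations are illustrative.
from collections import Counter

def homogeneity(annotations):
    """annotations: one property value per drug in the group."""
    most_common_count = Counter(annotations).most_common(1)[0][1]
    return most_common_count / len(annotations)

# Hypothetical mechanism-of-action annotations for a 4-drug ATC group.
moa = ["beta-blocker", "beta-blocker", "beta-blocker", "diuretic"]
print(homogeneity(moa))  # 0.75
```

A score of 1.0 indicates a fully homogeneous group for that property; analyzing the distribution of such scores across ATC groups is the study's core method.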
ATC; NDF-RT; pharmacoepidemiologic groups; clinical use
To investigate errors identified in SNOMED CT by human reviewers with help from the Abstraction Network methodology, and to examine why they had escaped detection by the Description Logic (DL) classifier. Case study: two examples of errors are presented in detail (one missing IS-A relation and one duplicate concept). After correction, SNOMED CT was reclassified to ensure that no new inconsistency was introduced.
DL-based auditing techniques built in terminology development environments ensure the logical consistency of the terminology. However, complementary approaches are needed for identifying and addressing other types of errors.
Systematized nomenclature of medicine; Comparative study; Quality assurance; Description logics; Abstraction network
The UMLS Semantic Network and Metathesaurus are two complementary knowledge sources. While many studies compare relationships across the two structures, their alignment has never been attempted. We applied two methods based on lexical and conceptual similarity to aligning the Semantic Network with the UMLS Metathesaurus. Approximately two thirds of the semantic types could be aligned by lexical similarity. Conceptual similarity suggested mappings in all but ten cases. Potential applications enabled by the alignment are discussed, namely auditing the consistency between the Semantic Network and the Metathesaurus and extending the Semantic Network downwards. The relative contribution and limitations of the two methods used for the alignment are also discussed.
UMLS; Alignment; Hierarchical relationships
In French hospitals, medical diagnosis coding with the ICD10 is commonly performed, and effective tools would help coders in their task.
The aim of this work was to improve an existing coding help system. This system, which already incorporates the ICD10 analytical index, was to be augmented with the terms of the alphabetical index, which includes lexical variants as well as additional terms. Adding this second volume of the ICD would allow more terms to be coded and would reduce documentary silence (missed retrievals).
The first step of this work was a careful study of a theoretical model of the ICD content. The alphabetical index file was then submitted to lexical analysis and automatically transformed for integration into the existing coding help system.
Compromises had to be made between the theoretical model and what could be obtained in practice by automatic processing of the file. Finally, the alphabetical index was added to the initial thesaurus, yielding 42,000 terms and 4,000 additional words. Links between words and codes were also considerably increased, which enhanced the search capabilities of the tool and further reduced documentary silence. Conversely, search time increased.
Difficulties are encountered when trying to turn a manual tool into an automatic search tool.
Coding; Disease Classification; Medical Diagnosis; ICD10; Knowledge Representation
Our objective was to enable an end-user to create complex queries to drug information sources through functional composition, by creating sequences of functions from application program interfaces (API) to drug terminologies. The development of a functional composition model seeks to link functions from two distinct APIs. An ontology was developed using Protégé to model the functions of the RxNorm and NDF-RT APIs by describing the semantics of their input and output. A set of rules was developed to define the interoperable conditions for functional composition. The operational definition of interoperability between function pairs is established by executing the rules on the ontology. We illustrate that the functional composition model supports common use cases, including checking interactions for RxNorm drugs and deploying allergy lists defined in reference to drug properties in NDF-RT. This model supports the RxMix application (http://mor.nlm.nih.gov/RxMix/), an application we developed for enabling complex queries to the RxNorm and NDF-RT APIs.
RxNorm; NDF-RT; application programming interface; web service composition; complex queries
To quantify semantic inconsistency in UMLS concepts from the perspective of their hierarchical relations and to show through examples how semantically-inconsistent concepts can help reveal erroneous synonymy relations.
Inconsistency is defined in reference to concepts from the UMLS Metathesaurus. Consistency is evaluated by comparing the semantic groups of the two concepts in each pair of hierarchically-related concepts. A limited number of inconsistent concepts was inspected manually.
81,512 concepts are inconsistent due to the differences in semantic groups between a concept and its parent. Four examples of wrong synonymy are presented.
A vast majority of inconsistent hierarchical relations are not indicative of any errors. We discovered an interesting semantic pattern along hierarchies, which seems associated with wrong synonymy.
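The consistency check can be sketched by comparing the semantic groups of the two concepts in each hierarchically-related pair. The group assignments and relations below are illustrative; the spurious relation stands in for the kind of pair that can reveal wrong synonymy:

```python
# Sketch: flagging hierarchically-related UMLS concept pairs whose semantic
# groups differ. Concept-to-group assignments and relations are illustrative.
groups = {
    "Myocardial infarction": "Disorders",
    "Heart": "Anatomy",
    "Ischemic heart disease": "Disorders",
}

hierarchy = [  # (child, parent) pairs
    ("Myocardial infarction", "Ischemic heart disease"),
    ("Myocardial infarction", "Heart"),  # spurious relation, for illustration
]

inconsistent = [(c, p) for c, p in hierarchy if groups[c] != groups[p]]
print(inconsistent)  # the Disorders/Anatomy pair is flagged
```

As the results above indicate, most flagged pairs are benign, so the check is a triage step whose output still requires manual review.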
Unified medical language system; Semantic consistency