To assess whether errors can be found in LOINC by changing its representation to OWL DL and comparing its classification to that of SNOMED CT.
We created Description Logic definitions for LOINC concepts in OWL and merged the ontology with SNOMED CT to enrich the relatively flat hierarchy of LOINC parts. LOINC - SNOMED CT mappings were acquired through UMLS. The resulting ontology was classified with the ConDOR reasoner.
Transformation into DL helped to identify 427 sets of logically equivalent LOINC codes, 676 sets of logically equivalent LOINC parts, and 239 inconsistencies in LOINC multiaxial hierarchy. Automatic classification of LOINC and SNOMED CT combined increased the connectivity within LOINC hierarchy and increased its coverage by an additional 9,006 LOINC codes.
LOINC is a well-maintained terminology. While only a relatively small number of logical inconsistencies were found, we identified a number of areas where LOINC could benefit from the application of Description Logic.
To develop methods for assessing the validity, consistency and currency of value sets for clinical quality measures, in order to support the developers of quality measures in which such value sets are used.
We assessed the well-formedness of the codes (in a given code system), the existence and currency of the codes in the corresponding code system, using the UMLS and RxNorm terminology services. We also investigated the overlap among value sets using the Jaccard similarity measure.
We extracted 163,788 codes (76,062 unique codes) from 1463 unique value sets in the 113 quality measures published by the National Quality Forum (NQF) in December 2011. Overall, 5% of the codes are invalid (4% of the unique codes). We also found 67 duplicate value sets and 10 pairs of value sets exhibiting a high degree of similarity (Jaccard > .9).
Invalid codes affect a large proportion of the value sets (19%). 79% of the quality Measures have at least one value set exhibiting errors. However, 50% of the quality measures exhibit errors in less than 10 % of their value sets. The existence of duplicate and highly-similar value sets suggests the need for an authoritative repository of value sets and related tooling in order to support the development of quality measures.
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder—a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and Bca, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at: http://bioinf.umbc.edu/EMU/ftp.
Supplementary information: Supplementary data are available at Bioinformatics online.
A critical aspect of the NIH Translational Research roadmap, which seeks to accelerate the delivery of "bench-side" discoveries to patient's "bedside," is the management of the provenance metadata that keeps track of the origin and history of data resources as they traverse the path from the bench to the bedside and back. A comprehensive provenance framework is essential for researchers to verify the quality of data, reproduce scientific results published in peer-reviewed literature, validate scientific process, and associate trust value with data and results. Traditional approaches to provenance management have focused on only partial sections of the translational research life cycle and they do not incorporate "domain semantics", which is essential to support domain-specific querying and analysis by scientists.
We identify a common set of challenges in managing provenance information across the pre-publication and post-publication phases of data in the translational research lifecycle. We define the semantic provenance framework (SPF), underpinned by the Provenir upper-level provenance ontology, to address these challenges in the four stages of provenance metadata:
(a) Provenance collection - during data generation
(b) Provenance representation - to support interoperability, reasoning, and incorporate domain semantics
(c) Provenance storage and propagation - to allow efficient storage and seamless propagation of provenance as the data is transferred across applications
(d) Provenance query - to support queries with increasing complexity over large data size and also support knowledge discovery applications
We apply the SPF to two exemplar translational research projects, namely the Semantic Problem Solving Environment for Trypanosoma cruzi (T.cruzi SPSE) and the Biomedical Knowledge Repository (BKR) project, to demonstrate its effectiveness.
The SPF provides a unified framework to effectively manage provenance of translational research data during pre and post-publication phases. This framework is underpinned by an upper-level provenance ontology called Provenir that is extended to create domain-specific provenance ontologies to facilitate provenance interoperability, seamless propagation of provenance, automated querying, and analysis.
To facilitate the integration of terminologies into applications, various terminology services application programming interfaces (API) have been developed in the recent past. In this study, three publicly available terminology services API, RxNav, UMLSKS and LexBIG, are compared and functionally evaluated with respect to the retrieval of information from one biomedical terminology, RxNorm, to which all three services provide access. A list of queries is established covering a wide spectrum of terminology services functionalities such as finding RxNorm concepts by their name, or navigating different types of relationships. Test data were generated from the RxNorm dataset to evaluate the implementation of the functionalities in the three API. The results revealed issues with various aspects of the API implementation (eg, handling of obsolete terms by LexBIG) and documentation (eg, navigational paths used in RxNav) that were subsequently addressed by the development teams of the three API investigated. Knowledge about such discrepancies helps inform the choice of an API for a given use case.
The authors used the i2b2 Medication Extraction Challenge to evaluate their entity extraction methods, contribute to the generation of a publicly available collection of annotated clinical notes, and start developing methods for ontology-based reasoning using structured information generated from the unstructured clinical narrative.
Extraction of salient features of medication orders from the text of de-identified hospital discharge summaries was addressed with a knowledge-based approach using simple rules and lookup lists. The entity recognition tool, MetaMap, was combined with dose, frequency, and duration modules specifically developed for the Challenge as well as a prototype module for reason identification.
Evaluation metrics and corresponding results were provided by the Challenge organizers.
The results indicate that robust rule-based tools achieve satisfactory results in extraction of simple elements of medication orders, but more sophisticated methods are needed for identification of reasons for the orders and durations.
Owing to the time constraints and nature of the Challenge, some obvious follow-on analysis has not been completed yet.
The authors plan to integrate the new modules with MetaMap to enhance its accuracy. This integration effort will provide guidance in retargeting existing tools for better processing of clinical text.
Translational medicine requires the integration of knowledge using heterogeneous data from health care to the life sciences. Here, we describe a collaborative effort to produce a prototype Translational Medicine Knowledge Base (TMKB) capable of answering questions relating to clinical practice and pharmaceutical drug discovery.
We developed the Translational Medicine Ontology (TMO) as a unifying ontology to integrate chemical, genomic and proteomic data with disease, treatment, and electronic health records. We demonstrate the use of Semantic Web technologies in the integration of patient and biomedical data, and reveal how such a knowledge base can aid physicians in providing tailored patient care and facilitate the recruitment of patients into active clinical trials. Thus, patients, physicians and researchers may explore the knowledge base to better understand therapeutic options, efficacy, and mechanisms of action.
This work takes an important step in using Semantic Web technologies to facilitate integration of relevant, distributed, external sources and progress towards a computational platform to support personalized medicine.
TMO can be downloaded from http://code.google.com/p/translationalmedicineontology and TMKB can be accessed at http://tm.semanticscience.org/sparql.
To develop an approximate matching method for finding the closest drug names within existing RxNorm content for drug name variants found in local drug formularies.
We used a drug-centric algorithm to determine the closest strings between the RxNorm data set and local variants which failed the exact and normalized string matching searches. Aggressive measures such as token splitting, drug name expansion and spelling correction are used to try and resolve drug names. The algorithm is evaluated against three sets containing a total of 17,164 drug name variants.
Mapping of the local variant drug names to the targeted concept descriptions ranged from 83.8% to 92.8% in three test sets. The algorithm identified the appropriate RxNorm concepts as the top candidate in 76.8%, 67.9% and 84.8% of the cases in the three test sets and among the top three candidates in 90–96% of the cases.
Using a drug-centric token matching approach with aggressive measures to resolve unknown names provides effective mappings to clinical drug names and has the potential of facilitating the work of drug terminology experts in mapping local formularies to reference terminologies.
This paper proposes a novel semantic method for auditing associative relations in biomedical terminologies. We tested our methodology on two Unified Medical Language System (UMLS) knowledge sources.
We use the UMLS semantic groups as high-level representations of the domain and range of relationships in the Metathesaurus and in the Semantic Network. A mapping created between Metathesaurus relationships and Semantic Network relationships forms the basis for comparing the signatures of a given Metathesaurus relationship to the signatures of the semantic relationship to which it is mapped. The consistency of Metathesaurus relations is studied for each relationship.
Of the 177 associative relationships in the Metathesaurus, 84 (48%) exhibit a high degree of consistency with the corresponding Semantic Network relationships. Overall, 63% of the 1.8M associative relations in the Metathesaurus are consistent with relations in the Semantic Network.
The semantics of associative relationships in biomedical terminologies should be defined explicitly by their developers. The Semantic Network would benefit from being extended with new relationships and with new relations for some existing relationships. The UMLS editing environment could take advantage of the correspondence established between relationships in the Metathesaurus and the Semantic Network. Finally, the auditing method also yielded useful information for refining the mapping of associative relationships between the two sources.
Biomedical terminologies; Associative relationships; Auditing methods; Unified Medical Language System (UMLS)
The era of applied genomic medicine is quickly approaching accompanied by the increasing availability of detailed genetic information. Understanding the genetic etiology behind complex, multi-gene diseases remains an important challenge. In order to uncover the putative genetic etiology of complex diseases, we designed a method that explores the relationships between two major terminological and ontological resources: the Unified Medical Language System (UMLS) and the Gene Ontology (GO). The UMLS has a mainly clinical emphasis; Gene Ontology (GO) has become the standard for biological annotations of genes and gene products. Using statistical and semantic relationships within and between the two resources, we are able to infer relationships between disease concepts in the UMLS and gene products annotated using GO and its associated databases. We validated our inferences by comparing them to the known gene-disease relationships, as defined in the Online Mendelian Inheritance in Man1s morbidmap. The proof-of-concept methods presented here are unique in that they bypass the ambiguity of the direct extraction of gene or disease term from MEDLINE. Additionally, our methods provide direct links to clinically significant diseases through established terminologies or ontologies. The preliminary results presented here indicate the potential utility of exploiting the existing, manually curated relationships in biomedical resources as a tool for the discovery of potentially valuable new gene-disease relationships. The GenesTrace system may be accessed at the following URL: http://phene .cpmc.columbia.edu:8080/genesTrace/index.jsp
RxNorm is a standardized nomenclature for clinical drug entities developed by the National Library of Medicine. In this paper, we audit relations in RxNorm for consistency and completeness through the systematic analysis of the graph of its concepts and relationships.
The representation of multi-ingredient drugs is normalized in order to make it compatible with that of single-ingredient drugs. All meaningful paths between two nodes in the type graph are computed and instantiated. Alternate paths are automatically compared and manually inspected in case of inconsistency.
The 115 meaningful paths identified in the type graph can be grouped into 28 groups with respect to start and end nodes. Of the 19 groups of alternate paths (i.e., with two or more paths) between the start and end nodes, 9 (47%) exhibit inconsistencies. Overall, 28 (24%) of the 115 paths are inconsistent with other alternate paths. A total of 348 inconsistencies were identified in the April 2008 version of RxNorm and reported to the RxNorm team, of which 215 (62%) had been corrected in the January 2009 version of RxNorm.
The inconsistencies identified involve missing nodes (93), missing links (17), extraneous links (237) and one case of mix-up between two ingredients. Our auditing method proved effective in identifying a limited number of errors that had defeated the quality assurance mechanisms currently in place in the RxNorm production system. Some recommendations for the development of RxNorm are provided.
Biomedical terminologies; Auditing methods; RxNorm; Quality assurance; Graphs
The National Cancer Institute (NCI) is developing an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG®), to support collaboration within the cancer research community. A key part of the caBIG architecture is the establishment of terminology standards for representing data. In order to evaluate the suitability of existing controlled terminologies, the caBIG Vocabulary and Data Elements Workspace (VCDE WS) working group has developed a set of criteria that serve to assess a terminology's structure, content, documentation, and editorial process. This paper describes the evolution of these criteria and the results of their use in evaluating four standard terminologies: the Gene Ontology (GO), the NCI Thesaurus (NCIt), the Common Terminology for Adverse Events (known as CTCAE), and the laboratory portion of the Logical Objects, Identifiers, Names and Codes (LOINC). The resulting caBIG criteria are presented as a matrix that may be applicable to any terminology standardization effort.
Terminology; Ontology; Auditing; Evaluation
To develop normalization methods for managing the variation in clinical drug names.
Manual examination of drug names from RxNorm and local variants collected from formularies led to the identification of three types of drug-specific normalization rules: expansion of abbreviations (e.g., tab to tablet);reformatting of specific elements (e.g., space between number and unit); and removal of salt variants (e.g., succinate from metoprolol succinate).
After drug-specific normalization, recall of 3397 previously non-matching names from formularies reaches 45% overall (70% of some subsets), compared to 10–20% after generic normalization. Ambiguity has not increased significantly in the RxNorm dataset.
A limited number of drug-specific normalization operations provide significant improvement over general language normalization.
To assess whether 1) the necessary drug classes and 2) the necessary drug-class membership relations are represented in biomedical terminologies in order to support clinical decision regarding drug-drug interactions.
In order to investigate drug classes and drug-class membership in clinical terminologies, we start by establishing a reference list of these entities. Then, we map drugs and classes to the UMLS, where we investigate their relations.
186 (83%) of the 223 names for drug classes mapped to the UMLS. The single best source is SNOMED CT with 75%. 140 (89%) of the 157 drug-membership relations were found in the UMLS.
One important category of drug classes missing from all clinical terminologies is related to drug metabolism by the Cytochrome P450 enzyme family.
One criterion for the well-formedness of ontologies is that their hierarchical structure forms a lattice. Formal Concept Analysis (FCA) has been used as a technique for assessing the quality of ontologies, but is not scalable to large ontologies such as SNOMED CT (> 300k concepts). We developed a methodology called Lattice-based Structural Auditing (LaSA), for auditing biomedical ontologies, implemented through automated SPARQL queries, in order to exhaustively identify all non-lattice pairs in SNOMED CT. The percentage of non-lattice pairs ranges from 0 to 1.66 among the 19 SNOMED CT hierarchies. Preliminary manual inspection of a limited portion of the over 544k non-lattice pairs, among over 356 million candidate pairs, revealed inconsistent use of precoordination in SNOMED CT, but also a number of false positives. Our results are consistent with those based on FCA, with the advantage that the LaSA pipeline is scalable and applicable to ontological systems consisting mostly of taxonomic links.
Health professionals are faced with challenges when they have to exploit the semantics of concepts present in clinical terminologies in support of research activities. The difficulty lies in the fact that this semantics is represented not only through the labels of concepts, but also their position in the hierarchy, and, when available, their logical and textual definitions. We investigate and contrast the lexical, hierarchical, and logical representations of concepts in SNOMED CT through the example of Anemia and three other disorders. The four use cases we developed suggest that the lexical, hierarchical, and logical representations of concepts have a limited degree of overlap, but are complementary. Finally, we draw practical implications from our findings for SNOMED CT users and developers.
This paper illustrates how Semantic Web technologies (especially RDF, OWL, and SPARQL) can support information integration and make it easy to create semantic mashups (semantically integrated resources). In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.
We use an ontology-driven approach to integrate two gene resources (Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc), for five organisms, including humans. We created the Entrez Knowledge Model (EKoM), an information model in OWL for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway resources. The integrated schema is populated with data from the pathway resources, publicly available in BioPAX-compatible format, and gene resources for which a population procedure was created. The SPARQL query language is used to formulate queries over the integrated knowledge base to answer the three biological queries.
Simple SPARQL queries could easily identify hub genes, i.e., those genes whose gene products participate in many pathways or interact with many other gene products. The identification of the genes expressed in the brain turned out to be more difficult, due to the lack of a common identification scheme for proteins.
Semantic Web technologies provide a valid framework for information integration in the life sciences. Ontology-driven integration represents a flexible, sustainable and extensible solution to the integration of large volumes of information. Additional resources, which enable the creation of mappings between information sources, are required to compensate for heterogeneity across namespaces.
Semantic Web; Semantic mashup; Nicotine dependence; Information integration; Ontologies
Reliable identification and assignment of cis-regulatory elements in promoter regions is a challenging problem in biology. The sophistication of transcriptional regulation in higher eukaryotes, particularly in metazoans, could be an important factor contributing to their organismal complexity. Here we present an integrated approach where networks of co-expressed genes are combined with gene ontology–derived functional networks to discover clusters of genes that share both similar expression patterns and functions. Regulatory elements are identified in the promoter regions of these gene clusters using a Gibbs sampling algorithm implemented in the A-GLAM software package. Using this approach, we analyze the cell-cycle co-expression network of the yeast Saccharomyces cerevisiae, showing that this approach correctly identifies cis-regulatory elements present in clusters of co-expressed genes.
Promoter sequences; transcription factor–binding sites; co-expression; networks; gene ontology; Gibbs sampling
Motivation: For many years, the Unified Medical Language System (UMLS) semantic network (SN) has been used as an upper-level semantic framework for the categorization of terms from terminological resources in biomedicine. BioTop has recently been developed as an upper-level ontology for the biomedical domain. In contrast to the SN, it is founded upon strict ontological principles, using OWL DL as a formal representation language, which has become standard in the semantic Web. In order to make logic-based reasoning available for the resources annotated or categorized with the SN, a mapping ontology was developed aligning the SN with BioTop.
Methods: The theoretical foundations and the practical realization of the alignment are being described, with a focus on the design decisions taken, the problems encountered and the adaptations of BioTop that became necessary. For evaluation purposes, UMLS concept pairs obtained from MEDLINE abstracts by a named entity recognition system were tested for possible semantic relationships. Furthermore, all semantic-type combinations that occur in the UMLS Metathesaurus were checked for satisfiability.
Results: The effort-intensive alignment process required major design changes and enhancements of BioTop and brought up several design errors that could be fixed. A comparison between a human curator and the ontology yielded only a low agreement. Ontology reasoning was also used to successfully identify 133 inconsistent semantic-type combinations.
Availability: BioTop, the OWL DL representation of the UMLS SN, and the mapping ontology are available at http://www.purl.org/biotop/.
Evolutionary knowledge is often used to facilitate computational attempts at gene function prediction. One rich source of evolutionary information is the relative rates of gene sequence divergence, and in this report we explore the connection between gene evolutionary rates and function. We performed a genome-scale evaluation of the relationship between evolutionary rates and functional annotations for the yeast Saccharomyces cerevisiae. Non-synonymous (dN) and synonymous (dS) substitution rates were calculated for 1,095 orthologous gene sets common to S. cerevisiae and six other closely related yeast species. Differences in evolutionary rates between pairs of genes (ΔdN & ΔdS) were then compared to their functional similarities (sGO), which were measured using Gene Ontology (GO) annotations. Substantial and statistically significant correlations were found between ΔdN and sGO, whereas there is no apparent relationship between ΔdS and sGO. These results are consistent with a mode of action for natural selection that is based on similar rates of elimination of deleterious protein coding sequence variants for functionally related genes. The connection between gene evolutionary rates and function was stronger than seen for phylogenetic profiles, which have previously been employed to inform functional inference. The co-evolution of functionally related yeast genes points to the relevance of specific function for the efficacy of natural selection and underscores the utility of gene evolutionary rates for functional predictions.
Functional inference; Co-evolution; natural selection; genome evolution; gene ontology
To investigate the feasibility of using SNOMED CT as an entry point for coding adverse drug reactions and map them automatically to MedDRA for reporting purposes and interoperability with legacy repositories.
On the one hand, we attempt to map SNOMED CT concepts to MedDRA concepts through the UMLS, using synonymy and explicit mapping relations. On the other, we compute the set of all fine-grained concepts that can be reached from concepts having a mapping to MedDRA.
58% of the Preferred Terms in MedDRA have a mapping to SNOMED CT. Through the descendants in SNOMED CT, 108,305 additional SNOMED CT concepts can be linked to MedDRA.
Fine-grained SNOMED CT concepts can be mapped automatically to MedDRA. This approach has the potential to enable the collection of adverse events related to drugs directly from clinical repositories. The quality of the mapping needs to be evaluated.
Linkages between animal models of diseases and human data enable the development of translational research hypotheses. The objective of this study is to investigate two approaches to integrating phenotype and clinical information. On the one hand, we develop a terminology mapping between phenotypes from the Mammalian Phenotype Ontology (MPO) and Online Mendelian Inheritance in Man (OMIM) through the Unified Medical Language System (UMLS). On the other, we associate MPO phenotypes with OMIM manifestations through annotations made to orthologous genes. 1,469 MPO concepts (22%) were mapped successfully to some disease concept in the UMLS, of which 869 were present in OMIM. Among the 16,764 distinct MGI genes associated with human orthologs, 1,968 distinct genes were associated with both MPO and OMIM annotations. The UMLS is a valuable resource for linking phenotype terms to clinical terminologies, and these mappings between terminologies can help enrich gene annotation databases and unify phenotype representation.
An ontology is a formal representation of a domain modeling the entities in the domain and their relations. When a domain is represented by multiple ontologies, there is need for creating mappings among these ontologies in order to facilitate the integration of data annotated with these ontologies and reasoning across ontologies. The objective of this paper is to recapitulate our experience in aligning large anatomical ontologies and to reflect on some of the issues and challenges encountered along the way. The four anatomical ontologies under investigation are the Foundational Model of Anatomy, GALEN, the Adult Mouse Anatomical Dictionary and the NCI Thesaurus. Their underlying representation formalisms are all different. Our approach to aligning concepts (directly) is automatic, rule-based, and operates at the schema level, generating mostly point-to-point mappings. It uses a combination of domain-specific lexical techniques and structural and semantic techniques (to validate the mappings suggested lexically). It also takes advantage of domain-specific knowledge (lexical knowledge from external resources such as the Unified Medical Language System, as well as knowledge augmentation and inference techniques). In addition to point-to-point mapping of concepts, we present the alignment of relationships and the mapping of concepts group-to-group. We have also successfully tested an indirect alignment through a domain-specific reference ontology. We present an evaluation of our techniques, both against a gold standard established manually and against a generic schema matching system. The advantages and limitations of our approach are analyzed and discussed throughout the paper.
Ontology; ontology alignment; knowledge representation; anatomy; Semantic Web
Entrez Gene (EG), Online Mendelian Inheritance in Man (OMIM) and the Gene Ontology (GO) are three complementary knowledge resources that can be used to correlate genomic data with disease information. However, bridging between genotype and phenotype through these resources currently requires manual effort or the development of customized software. In this paper, we argue that integrating EG and GO provides a robust and flexible solution to this problem. We demonstrate how the Resource Description Framework (RDF) developed for the Semantic Web can be used to represent and integrate these resources and enable seamless access to them as a unified resource. We illustrate the effectiveness of our approach by answering a real-world biomedical query linking a specific molecular function, glycosyltransferase, to the disorder congenital muscular dystrophy.
knowledge integration; Semantic Web; RDF; Entrez Gene; Gene Ontology