Non-lattice fragments are often indicative of structural anomalies in ontological systems and, as such, represent possible areas of focus for subsequent quality assurance work. However, extracting the non-lattice fragments in large ontological systems is computationally expensive if not prohibitive, using a traditional sequential approach. In this paper we present a general MapReduce pipeline, called MaPLE (MapReduce Pipeline for Lattice-based Evaluation), for extracting non-lattice fragments in large partially ordered sets and demonstrate its applicability in ontology quality assurance. Using MaPLE in a 30-node Hadoop local cloud, we systematically extracted non-lattice fragments in 8 SNOMED CT versions from 2009 to 2014 (each containing over 300k concepts), with an average total computing time of less than 3 hours per version. With dramatically reduced time, MaPLE makes it feasible not only to perform exhaustive structural analysis of large ontological hierarchies, but also to systematically track structural changes between versions. Our change analysis showed that the average change rates on the non-lattice pairs are up to 38.6 times higher than the change rates of the background structure (concept nodes). This demonstrates that fragments around non-lattice pairs exhibit significantly higher rates of change in the process of ontological evolution.
Many complex information needs that arise in biomedical disciplines require exploring multiple documents in order to obtain information. While traditional information retrieval techniques that return a single ranked list of documents are quite common for such tasks, they may not always be adequate. The main issue is that ranked lists typically impose a significant burden on users to filter out irrelevant documents. Additionally, users must intuitively reformulate their search query when relevant documents have not been highly ranked. Furthermore, even after interesting documents have been selected, very few mechanisms exist that enable document-to-document transitions. In this paper, we demonstrate the utility of assertions extracted from biomedical text (called semantic predications) to facilitate retrieving relevant documents for complex information needs. Our approach offers an alternative to query reformulation by establishing a framework for transitioning from one document to another. We evaluate this novel knowledge-driven approach using precision and recall metrics on the 2006 TREC Genomics Track.
semantic predications; question answering; background knowledge; literature-based discovery; text mining
We present a scalable, SPARQL-based computational pipeline for testing the lattice-theoretic properties of partial orders represented as RDF triples. The use case for this work is quality assurance in biomedical ontologies, one desirable property of which is conformance to lattice structures. At the core of our pipeline is the algorithm called NuMi, for detecting the Number of Minimal upper bounds of any pair of elements in a given finite partial order. Our technical contribution is the coding of NuMi completely in SPARQL. To show its scalability, we applied NuMi to the entirety of SNOMED CT, the largest clinical ontology (over 300,000 concepts). Our experimental results have been groundbreaking: for the first time, all non-lattice pairs in SNOMED CT have been identified exhaustively from 34 million candidate pairs using over 2.5 billion queries issued to Virtuoso. The percentage of non-lattice pairs ranges from 0 to 1.66 among the 19 SNOMED CT hierarchies. These non-lattice pairs represent target areas for focused curation by domain experts. RDF, SPARQL and related tooling provide an efficient platform for implementing lattice algorithms on large data structures.
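To make the notion of a minimal upper bound concrete, the following is a minimal in-memory sketch in Python, not the SPARQL implementation described above. It assumes the partial order is given as a hypothetical `parents` mapping (child to set of direct parents) and that a pair is flagged as non-lattice when it has more than one minimal upper bound; the function names and the toy hierarchy are illustrative only.

```python
from itertools import combinations

def ancestors(parents, node):
    """Return all strict ancestors of node in a DAG given as child -> set of parents."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def minimal_upper_bounds(parents, a, b):
    """Return the minimal elements among the common upper bounds of a and b."""
    upper_a = ancestors(parents, a) | {a}
    upper_b = ancestors(parents, b) | {b}
    common = upper_a & upper_b
    # An upper bound is minimal if it is not a strict ancestor of another common upper bound.
    return {u for u in common
            if not any(u in ancestors(parents, v) for v in common if v != u)}

def non_lattice_pairs(parents, nodes):
    """List pairs with more than one minimal upper bound (assumed flagging criterion)."""
    return [(a, b) for a, b in combinations(sorted(nodes), 2)
            if len(minimal_upper_bounds(parents, a, b)) > 1]

# Toy hierarchy (hypothetical): a and b both sit under x and y, which share one parent.
parents = {"a": {"x", "y"}, "b": {"x", "y"}, "x": {"top"}, "y": {"top"}, "top": set()}
print(minimal_upper_bounds(parents, "a", "b"))
print(non_lattice_pairs(parents, parents.keys()))
```

This brute-force version recomputes ancestor sets for every pair, so it only suits small examples; the scale reported above (34 million candidate pairs) is what motivates a dedicated pipeline.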
There is a need to develop methods to automatically incorporate prior knowledge to support the prediction and validation of novel functional associations. One such important source is represented by the Gene Ontology (GO)™ and the many model organism databases of gene products annotated to the GO. We investigated quantitative relationships between the GO-driven similarity of genes and their functional interactions by analyzing different types of associations in Saccharomyces cerevisiae and Caenorhabditis elegans. Interacting genes exhibited significantly higher levels of GO-driven similarity (GOS) in comparison to random pairs of genes used as a surrogate for negative interactions. The Biological Process hierarchy provides more reliable results for co-regulatory and protein-protein interactions. GOS represent a relevant resource to support prediction of functional networks in combination with other resources.
The Gene Ontology and annotations derived from the S. cerevisiae Genome Database were analyzed to calculate functional similarity of gene products. Three methods for measuring similarity (including a distance-based approach) were implemented. Significant, quantitative relationships between similarity and expression correlation of pairs of genes were detected. Using a known gene expression dataset in yeast, this study compared more than three million pairs of gene products on the basis of these functional properties. Highly correlated genes exhibit strong similarity based on information originating from the gene ontology taxonomies. Such a similarity is significantly stronger than that observed between weakly correlated genes. This study supports the feasibility of applying gene ontology-driven similarity methods to functional prediction tasks, such as the validation of gene expression analyses and the identification of false positives in protein interaction studies.
Gene Ontology; functional similarity; gene expression correlation
This research explores the feasibility of semantic similarity approaches to supporting predictive tasks in functional genomics. It aims to establish potential relationships between ontology-based similarity of gene products and important functional properties, such as gene expression correlation. Similarity measures based on the information content of the Gene Ontology (GO) were analyzed. Models have been implemented using data obtained from well-known studies in S. cerevisiae. Results suggest that there may exist significant relationships between gene expression correlation and semantic similarity. Analyses of protein complex data show that, in general, there is a significant correlation between the semantic similarity exhibited by a pair of genes and the probability of finding them in the same complex. These results can also be interpreted as an assessment of the quality and consistency of the information represented in the GO.
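The abstract above does not name the specific information-content measures used. As one well-known example of this family, the sketch below computes Resnik similarity, where the similarity of two terms is the information content of their most informative common ancestor. The `parents` mapping, the annotation counts, and the single-root assumption are all hypothetical choices made for illustration.

```python
import math

def ancestors_inclusive(parents, term):
    """Return term together with all of its ancestors (child -> set of parents)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(parents, annotation_counts):
    """IC(t) = -log p(t), where p(t) counts annotations to t and all its descendants."""
    totals = {term: 0 for term in parents}
    for term, count in annotation_counts.items():
        for ancestor in ancestors_inclusive(parents, term):
            totals[ancestor] += count
    # Assumes a single root, which accumulates every annotation.
    root_total = max(totals.values())
    return {t: -math.log(n / root_total) for t, n in totals.items() if n > 0}

def resnik(parents, ic, term1, term2):
    """Information content of the most informative common ancestor."""
    common = ancestors_inclusive(parents, term1) & ancestors_inclusive(parents, term2)
    return max((ic[c] for c in common if c in ic), default=0.0)

# Toy ontology and annotation counts (hypothetical, not real GO data).
parents = {"root": set(), "A": {"root"}, "B": {"root"}, "A1": {"A"}, "A2": {"A"}}
counts = {"A1": 1, "A2": 1, "B": 2}
ic = information_content(parents, counts)
print(resnik(parents, ic, "A1", "A2"))  # siblings under A share an informative ancestor
print(resnik(parents, ic, "A1", "B"))   # only the root is shared
```

A gene-level score would then aggregate term-level scores over the annotations of two gene products, for instance by averaging or taking the maximum; that aggregation step is left out here.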
Association rules mining methods have been recently applied to gene expression data analysis to reveal relationships between genes and different conditions and features. However, not much effort has focused on detecting the relation between gene expression maps and related gene functions. Here we describe such an approach to mine association rules among gene functions in clusters of similar gene expression maps on mouse brain. The experimental results show that the detected association rules make sense biologically. By inspecting the obtained clusters and the genes having the gene functions of frequent itemsets, interesting clues were discovered that provide valuable insight to biological scientists. Moreover, discovered association rules can be potentially used to predict gene functions based on similarity of gene expression maps.
association rules mining; gene expression maps; gene functions; clustering; voxelation
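For readers unfamiliar with association rules, the sketch below shows the two standard quantities, support and confidence, on a toy example. It assumes each cluster of similar expression maps is reduced to the set of gene functions observed in it (a "transaction") and only derives pairwise rules; the function labels, thresholds, and data are hypothetical and not taken from the study.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pairwise_rules(transactions, min_support, min_confidence):
    """Return rules lhs -> rhs between single items that meet both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support(transactions, {a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support(transactions, {lhs})
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

# Each transaction: gene functions seen in one cluster (toy data).
transactions = [
    frozenset({"transport", "binding"}),
    frozenset({"transport", "binding", "kinase"}),
    frozenset({"transport"}),
    frozenset({"kinase"}),
]
print(pairwise_rules(transactions, min_support=0.5, min_confidence=0.9))
# -> [('binding', 'transport', 0.5, 1.0)]
```

Here "binding" implies "transport" with full confidence, while the reverse rule is rejected because "transport" also occurs without "binding".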
Taxonomies are commonly used for organizing knowledge, particularly in biomedicine where the taxonomy of living organisms and the classification of diseases are central to the domain. The principles used to produce taxonomies are either intrinsic (properties of the partial ordering relation) or added to make knowledge more manageable (opposition of siblings and economy). The applicability of these principles in the biomedical domain is presented using the Unified Medical Language System (UMLS) and issues raised by the application of these principles are illustrated. While intrinsic principles are not challenged, we argue that the opposition of siblings brings to bear excessive constraints on a domain ontology and that the adverse effects of economy may outweigh its benefits. The two-level structure used in the UMLS is discussed.
Theory; Taxonomic relation; Ontology; Biomedical domain; Unified Medical Language System
Undetected adverse drug reactions (ADRs) pose a major burden on the health system. Data mining methodologies designed to identify signals of novel ADRs are of deep importance for drug safety surveillance. The development and evaluation of these methodologies requires proper reference benchmarks. While progress has recently been made in developing such benchmarks, our understanding of the performance characteristics of the data mining methodologies is limited because existing benchmarks do not support prospective performance evaluations. We address this shortcoming by providing a reference standard to support prospective performance evaluations. The reference standard was systematically curated from drug labeling revisions, such as new warnings, which were issued and communicated by the US Food and Drug Administration in 2013. The reference standard includes 62 positive test cases and 75 negative controls, and covers 44 drugs and 38 events. We provide usage guidance and empirical support for the reference standard by applying it to analyze two data sources commonly mined for drug safety surveillance.
The Resource Description Framework (RDF) format is being used by a large number of scientific applications to store and disseminate their datasets. The provenance information, describing the source or lineage of the datasets, is playing an increasingly significant role in ensuring data quality, computing trust value of the datasets, and ranking query results. Current provenance tracking approaches using the RDF reification vocabulary suffer from a number of known issues, including lack of formal semantics, use of blank nodes, and application-dependent interpretation of reified RDF triples. In this paper, we introduce a new approach called Provenance Context Entity (PaCE) that uses the notion of provenance context to create provenance-aware RDF triples. We also define the formal semantics of PaCE through a simple extension of the existing RDF(S) semantics that ensures compatibility of PaCE with existing Semantic Web tools and implementations. We have implemented the PaCE approach in the Biomedical Knowledge Repository (BKR) project at the US National Library of Medicine. The evaluations demonstrate a minimum of 49% reduction in total number of provenance-specific RDF triples generated using the PaCE approach as compared to RDF reification. In addition, performance for complex queries improves by three orders of magnitude and remains comparable to the RDF reification approach for simpler provenance queries.
Provenance context entity; Biomedical knowledge repository; Context theory; RDF reification; Provenir ontology
To investigate the extent to which pharmacoepidemiologic groupings are
homogeneous in terms of clinical properties.
In our analysis, we classified drug subgroups from the pharmacoepidemiologic
Anatomical Therapeutic Chemical (ATC) classification system based on clinical drug
properties. We established mappings from ATC fifth level drug entities to drug property
annotations in the National Drug File Reference Terminology (NDF-RT), including
therapeutic categories, mechanisms of action, and physiologic effects. Based on the
annotations for the individual drugs we computed homogeneity scores for all ATC groups
and analyzed their distribution.
We found ATC groups to be generally homogeneous, more so for mechanisms of
action and physiologic effects than for therapeutic intent. However, only half of all
ATC drugs can be analyzed with this approach, in part because of missing properties in
ATC; NDF-RT; pharmacoepidemiologic groups; clinical use
To investigate errors identified in SNOMED CT by human reviewers with help from the Abstraction Network methodology and examine why they had escaped detection by the Description Logic (DL) classifier. Case study; Two examples of errors are presented in detail (one missing IS-A relation and one duplicate concept). After correction, SNOMED CT is reclassified to ensure that no new inconsistency was introduced.
DL-based auditing techniques built in terminology development environments ensure the logical consistency of the terminology. However, complementary approaches are needed for identifying and addressing other types of errors.
Systematized nomenclature of medicine; Comparative study; Quality assurance; Description logics; Abstraction network
The UMLS Semantic Network and Metathesaurus are two complementary knowledge sources. While many studies compare relationships across the two structures, their alignment has never been attempted. We applied two methods based on lexical and conceptual similarity to aligning the Semantic Network with the UMLS Metathesaurus. Approximately two thirds of the semantic types could be aligned by lexical similarity. Conceptual similarity suggested mappings in all but ten cases. Potential applications enabled by the alignment are discussed, namely auditing the consistency between the Semantic Network and the Metathesaurus and extending the Semantic Network downwards. The relative contribution and limitations of the two methods used for the alignment are also discussed.
UMLS; Alignment; Hierarchical relationships
In French hospitals, medical diagnosis coding with the ICD10 is commonly performed and the use of effective tools would help coders in their task.
Aim of this work
To improve an existing coding help system. This system, which already includes the ICD10 analytical index, would be extended with the terms of the alphabetical index, which includes lexical variants as well as additional terms. Adding the second volume of the ICD would allow more terms to be coded and would reduce documentary silence.
The first step of this work was a careful study of a theoretical model of the ICD content. Then the alphabetical index file was submitted to a lexical analysis, and it was automatically transformed to be integrated into the existing coding help system.
A compromise had to be made between the theoretical model and what could be obtained in practice by automatic processing of the file. Finally, the alphabetical index was added to the initial thesaurus, which represents 42,000 terms and 4,000 additional words. Links between words and codes were also considerably increased, which has enhanced the search capabilities of the tool and lessened documentary silence. Conversely, the search time has increased.
Difficulties are bound to be encountered when trying to turn a manual tool into an automatic search tool.
Coding; Disease Classification; Medical Diagnosis; ICD10; Knowledge Representation
Our objective was to enable an end-user to create complex queries to drug information sources through functional composition, by creating sequences of functions from application programming interfaces (APIs) to drug terminologies. The development of a functional composition model seeks to link functions from two distinct APIs. An ontology was developed using Protégé to model the functions of the RxNorm and NDF-RT APIs by describing the semantics of their input and output. A set of rules was developed to define the interoperable conditions for functional composition. The operational definition of interoperability between function pairs is established by executing the rules on the ontology. We illustrate that the functional composition model supports common use cases, including checking interactions for RxNorm drugs and deploying allergy lists defined in reference to drug properties in NDF-RT. This model supports the RxMix application (http://mor.nlm.nih.gov/RxMix/), an application we developed for enabling complex queries to the RxNorm and NDF-RT APIs.
RxNorm; NDF-RT; application programming interface; web service composition; complex queries
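The idea of composing functions by matching the semantics of their inputs and outputs can be sketched as follows. This is a simplified illustration under the assumption that two functions are composable when the output type of the first equals the input type of the second; the actual rules in the ontology may be richer. The function names and type labels are hypothetical and are not the real RxNorm or NDF-RT API functions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApiFunction:
    """A function described only by the semantic types of its input and output."""
    name: str
    input_type: str
    output_type: str

def composable_pairs(functions):
    """Return (f, g) name pairs where the output of f can feed the input of g."""
    return [(f.name, g.name)
            for f in functions
            for g in functions
            if f is not g and f.output_type == g.input_type]

# Hypothetical function descriptions, for illustration only.
functions = [
    ApiFunction("find_drug_id", "drug name", "drug identifier"),
    ApiFunction("list_interactions", "drug identifier", "interaction"),
    ApiFunction("list_classes", "drug identifier", "drug class"),
]
print(composable_pairs(functions))
```

Chaining the pairs found this way yields sequences such as name to identifier to interactions, which is the kind of multi-step query the abstract describes.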
To quantify semantic inconsistency in UMLS concepts from the perspective of their hierarchical relations and to show through examples how semantically-inconsistent concepts can help reveal erroneous synonymy relations.
Inconsistency is defined in reference to concepts from the UMLS Metathesaurus. Consistency is evaluated by comparing the semantic groups of the two concepts in each pair of hierarchically-related concepts. A limited number of inconsistent concepts was inspected manually.
81,512 concepts are inconsistent due to the differences in semantic groups between a concept and its parent. Four examples of wrong synonymy are presented.
A vast majority of inconsistent hierarchical relations are not indicative of any errors. We discovered an interesting semantic pattern along hierarchies, which seems associated with wrong synonymy.
Unified medical language system; Semantic consistency
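The comparison of semantic groups across hierarchical pairs can be illustrated with a small sketch. It assumes that a child-parent pair counts as inconsistent when the two concepts share no semantic group; the abstract does not spell out the exact criterion, so this is one plausible reading. The concept identifiers are made up.

```python
def inconsistent_pairs(semantic_groups, hierarchy):
    """Return child-parent pairs whose concepts share no semantic group (assumed criterion)."""
    return [(child, parent)
            for child, parent in hierarchy
            if semantic_groups[child].isdisjoint(semantic_groups[parent])]

# Hypothetical concepts mapped to semantic groups.
semantic_groups = {
    "finding_1": {"Disorders"},
    "disease_1": {"Disorders"},
    "drug_1": {"Chemicals & Drugs"},
    "procedure_1": {"Procedures"},
}
# (child, parent) hierarchical relations.
hierarchy = [("finding_1", "disease_1"), ("drug_1", "procedure_1")]
print(inconsistent_pairs(semantic_groups, hierarchy))
# -> [('drug_1', 'procedure_1')]
```

As the conclusions note, most pairs flagged this way do not indicate errors, so the output is a list of candidates for manual review rather than a list of defects.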
Polysemy is a frequent issue in biomedical terminologies. In the Unified
Medical Language System (UMLS), polysemous terms are either represented as several
independent concepts, or clustered into a single, multiply-categorized concept. The
objective of this study is to analyze polysemous concepts in the UMLS through their
categorization and hierarchical relations for auditing purposes.
We used the association of a concept with multiple Semantic Groups (SGs) as a
surrogate for polysemy. We first extracted multi-SG (MSG) concepts from the UMLS
Metathesaurus and characterized them in terms of the combinations of SGs with which they
are associated. We then clustered MSG concepts in order to identify major types of
polysemy. We also analyzed the inheritance of SGs in MSG concepts. Finally, we manually
reviewed the categorization of the MSG concepts for auditing purposes.
The 1208 MSG concepts in the Metathesaurus are associated with 30 distinct
pairs of SGs. We created 75 semantically homogeneous clusters of MSG concepts, and 276
MSG concepts could not be clustered for lack of hierarchical relations. The clusters
were characterized by the most frequent pairs of semantic types of their constituent MSG
concepts. MSG concepts exhibit limited semantic compatibility with their parent and
child concepts. A large majority of MSG concepts (92%) are adequately categorized.
Examples of miscategorized concepts are presented.
This work is a systematic analysis and manual review of all concepts
categorized by multiple SGs in the UMLS. The correctly-categorized MSG concepts do
reflect polysemy in the UMLS Metathesaurus. The analysis of inheritance of SGs proved
useful for auditing concept categorization in the UMLS.
Biomedical terminologies; Auditing methods; Unified Medical Language System (UMLS); Polysemy; Semantic categorization
Methods for comparing associative relationships across ontologies often rely solely on lexical similarity between the names of the relationships, which may lead to missed matches and inaccurate matches. In this paper, we propose a novel method based on the analysis of paths between equivalent concepts across ontologies. Patterns of relationships are identified for each associative relationship. The most frequent patterns indicate a correspondence between an associative relationship in one ontology and one relationship (or combination thereof) in the other. We applied this method to two ontologies of anatomy. Our method was able to identify the correspondence between relationships even in the absence of lexical similarity between relationship names. The various types of matches identified are discussed as well as the application of this method to detecting inconsistencies across the ontologies.
Ontology; associative relationship; hierarchical relationship; ontology matching; anatomy; GALEN; Foundational Model of Anatomy
To characterize the relationships among UMLS concepts that co-occur as MeSH descriptors in MEDLINE citations (1990-1999).
18,485 UMLS concepts involved in 7,928,608 directed pairs of co-occurring concepts were studied. For each directed pair of concepts C1-C2: (i) the “family” of C1 was built, using the UMLS Metathesaurus, and we tested whether or not C2 belonged to C1's family; (ii) we used the semantic categorization of Metathesaurus concepts through the UMLS Semantic Network and Semantic Groups to represent the semantics of the relationships between C1 and C2.
In 6.5% of the directed pairs, the co-occurring concept C2 was found within the “family” of C1. Detailed results are given. The most frequent co-occurrences involved “Chemicals & Drugs” and “Chemicals & Drugs”, as well as “Disorders” and “Chemicals & Drugs”.
This work takes advantage of both symbolic and statistical information represented in the UMLS, and analyzes their overlap. Further research is suggested.
UMLS; Semantics; MeSH; co-occurrences
The conceptual complexity of a domain can make it difficult for users of information systems to comprehend and interact with the knowledge embedded in those systems. The Unified Medical Language System® (UMLS) currently integrates over 730,000 biomedical concepts from more than fifty biomedical vocabularies. The UMLS semantic network reduces the complexity of this construct by grouping concepts according to the semantic types that have been assigned to them. For certain purposes, however, an even smaller and coarser-grained set of semantic type groupings may be desirable. In this paper, we discuss our approach to creating such a set. We present six basic principles, and then apply those principles in aggregating the existing 134 semantic types into a set of 15 groupings. We present some of the difficulties we encountered and the consequences of the decisions we have made. We discuss some possible uses of the semantic groups, and we conclude with implications for future work.
Unified Medical Language System; Knowledge Representation; Medical Informatics
Animal models are a key resource for the investigation of human diseases. In contrast to functional annotation, phenotype annotation is less standard, and comparing phenotypes across species remains challenging. The objective of this paper is to propose a framework for comparing phenotype annotations of orthologous genes based on the Medical Subject Headings (MeSH) indexing of biomedical articles in which these genes are discussed.
17,769 pairs of orthologous genes (mouse and human) are downloaded from the Mouse Genome Informatics (MGI) system and linked to biomedical articles through Entrez Gene. MeSH index terms corresponding to diseases are extracted from Medline.
11,111 pairs of genes exhibited at least one phenotype annotation for each gene in the pair. Among these, 81% have at least one phenotype annotation in common, 80% have at least one annotation specific to the human gene and 84% have at least one annotation specific to the mouse gene. Four disease categories represent 54% of all phenotype annotations.
This framework supports the curation of phenotype annotation and the generation of research hypotheses based on comparative studies.
Phenotype; Comparative study; Medical informatics computing; Medical subject headings
The Value Set Authority Center (VSAC) at the National Library of Medicine (NLM) provides downloadable access to all official versions of vocabulary value sets contained in the Clinical Quality Measures (CQMs) used in the certification criteria for electronic health record systems (“Meaningful Use” incentive program). Each value set consists of the numerical values (codes) and human-readable names (descriptions), drawn from standard vocabularies such as LOINC, RxNorm and SNOMED CT®, that are used to define clinical data elements used in clinical quality measures (e.g., patients with diabetes, tricyclic antidepressants). The content of the VSAC will gradually expand to incorporate value sets for other use cases, as well as for new measures and updates to existing measures.
Value sets; Clinical quality measures
This work aims at understanding the state of the art in the broad contextual research area of “medical concept representation”. Our data support the general understanding that the focus of research has moved toward medical ontologies, which we interpret as a paradigm shift. Both the opinion of socially active groups of researchers and changes in bibliometric data since 1988 support this opinion. Socially active researchers mention the OBO Foundry, SNOMED CT, and the UMLS as anchor activities.
medical concepts; ontology; bibliometric analysis; social computing
The MAOUSSC (Model for Assistance in the Orientation of a User within Coding Systems) Web server supports collaborative work on the description of medical procedures. The specifications for the MAOUSSC application are conceptual modeling, definition of semantically fully described procedures, re-use of an existing vocabulary, the UMLS, and sharability. This paper reports on some difficulties in applying those principles in a networked building and updating of the terminology. The users are physicians who have to represent procedure terms in the MAOUSSC formalism. They must apply the constraints of the underlying model, and re-use the representation of the UMLS knowledge base. In our experience, we found that the implementation of syntactic and semantic constraints was not sufficient. Guidelines for pragmatic aspects in representation are required to make a collaborative approach in terminology building more operational.
Medical Procedures; Nomenclature; Web; Collaborative approach; Knowledge Representation; Pragmatics
The objective of this study is to develop a framework for assessing the consistency of drug classes across sources, such as MeSH and ATC. Our framework integrates and contrasts lexical and instance-based ontology alignment techniques. Moreover, we propose metrics for assessing not only equivalence relations, but also inclusion relations among drug classes.
We identified 226 equivalence relations between MeSH and ATC classes through the lexical alignment, and 223 through the instance-based alignment, with limited overlap between the two (36). We also identified 6,257 inclusion relations. Discrepancies between lexical and instance-based alignments are illustrated and discussed.
Our work is the first attempt to align drug classes with sophisticated instance-based techniques, while also distinguishing between equivalence and inclusion relations. Additionally, it is the first application of aligning drug classes in ATC and MeSH. By providing a detailed account of similarities and differences between drug classes across sources, our framework has the prospect of effectively supporting the creation of a mapping of drug classes between ATC and MeSH by domain experts.
Drug classes; MeSH; ATC; Instance-based mapping; Lexical mapping
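The abstract does not give the exact metrics used for equivalence and inclusion. As stand-ins, the sketch below uses two common set-overlap measures on the drug members of each class: the Jaccard index for equivalence and a containment ratio for inclusion. The thresholds, function names, and example classes are assumptions made for illustration.

```python
def jaccard(a, b):
    """Shared members divided by all members of either class."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    """Fraction of the members of a that also belong to b."""
    return len(a & b) / len(a) if a else 0.0

def classify(a, b, equivalence_threshold=0.8, inclusion_threshold=0.8):
    """Label the relation between two classes from their drug members (assumed thresholds)."""
    if jaccard(a, b) >= equivalence_threshold:
        return "equivalent"
    if containment(a, b) >= inclusion_threshold:
        return "a included in b"
    if containment(b, a) >= inclusion_threshold:
        return "b included in a"
    return "unrelated"

# Toy classes defined by their member drugs (illustrative, not taken from ATC or MeSH).
narrow_class = {"atenolol", "metoprolol"}
broad_class = {"atenolol", "metoprolol", "propranolol"}
print(classify(narrow_class, broad_class))  # -> a included in b
print(classify(broad_class, broad_class))   # -> equivalent
```

Separating the symmetric measure from the asymmetric one is what allows a narrow class to be reported as included in a broader class instead of simply failing an equivalence test.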