The Health Level 7 Clinical Document Architecture (CDA) is widely accepted as the format for electronic clinical documents. Because CDA documents carry rich ontological references, ontology-based semantic queries can be used to retrieve them. In this paper, we present iSMART (interactive Semantic MedicAl Record reTrieval), a prototype system designed for ontology-based semantic query of CDA documents. The clinical information in CDA documents is extracted into RDF triples by a declarative XML-to-RDF transformer. An ontology reasoner was developed to infer additional information by drawing on background knowledge from the SNOMED CT ontology. An RDF query engine is then used to enable the semantic queries. The system has been evaluated using real clinical documents collected from a large hospital in southern China.
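The pipeline described above (XML to triples, ontology-based inference, then query) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the simplified CDA-like XML, the placeholder concept codes, the `hasFinding` predicate, and the toy is-a table standing in for SNOMED CT. It is not the iSMART implementation, which uses a declarative transformer and an RDF query engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified CDA-like fragment (real CDA documents use
# XML namespaces and a much deeper structure).
CDA_XML = """
<ClinicalDocument id="doc1">
  <observation><code code="C_VIRAL_PNEUMONIA" codeSystem="SNOMED-CT"/></observation>
  <observation><code code="C_FRACTURE" codeSystem="SNOMED-CT"/></observation>
</ClinicalDocument>
""".strip()

# Toy is-a hierarchy standing in for SNOMED CT (placeholder identifiers).
IS_A = {
    "C_VIRAL_PNEUMONIA": ["C_PNEUMONIA"],
    "C_PNEUMONIA": ["C_LUNG_DISEASE"],
    "C_FRACTURE": ["C_INJURY"],
}

def extract_triples(xml_text):
    """Turn each coded entry into a (document, hasFinding, code) triple."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id")
    return {(doc_id, "hasFinding", code.get("code")) for code in root.iter("code")}

def ancestors(concept):
    """All concepts reachable through is-a links (transitive closure)."""
    seen, stack = set(), [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def infer(triples):
    """Add a triple for every ancestor of each asserted finding."""
    inferred = set(triples)
    for s, p, o in triples:
        inferred.update((s, p, a) for a in ancestors(o))
    return inferred

def query(triples, predicate, obj):
    """Return the subjects of all triples matching (?, predicate, obj)."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)

asserted = extract_triples(CDA_XML)
print(query(asserted, "hasFinding", "C_LUNG_DISEASE"))         # no match before inference
print(query(infer(asserted), "hasFinding", "C_LUNG_DISEASE"))  # matches after inference
```

The point of the sketch is the difference between the two queries: a document coded only with a specific concept becomes retrievable by a more general one once the hierarchy is applied.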
This paper presents a novel framework for Visual Exploratory Search of Relationship Graphs on Smartphones (VESRGS) that is composed of three major components: inference and representation of semantic relationship graphs on the Web via meta-search, visual exploratory search of relationship graphs through both querying and browsing strategies, and human-computer interactions via the multi-touch interface and mobile Internet on smartphones. In comparison with traditional lookup search methodologies, the proposed VESRGS system is characterized by the following perceived advantages. 1) It infers rich semantic relationships between the querying keywords and other related concepts from large-scale meta-search results from the Google, Yahoo! and Bing search engines, and represents semantic relationships via graphs; 2) the exploratory search approach empowers users to naturally and effectively explore and discover knowledge in a rich information world of interlinked relationship graphs in a personalized fashion; 3) it effectively takes advantage of smartphones’ user-friendly interfaces, ubiquitous Internet connection and portability. Our extensive experimental results have demonstrated that the VESRGS framework can significantly improve users’ ability to find the relationship information most relevant to their own specific needs. We envision that the VESRGS framework can be a starting point for future exploration of novel, effective search strategies in the mobile Internet era.
A novel internet-based application is presented which provides access to anatomy knowledge through symbolic modality expressed by keywords taken from controlled or non-controlled terminology. The system is based on a database where anatomical concepts have been organized into a hierarchical framework. Along with term queries that retrieve concepts containing or exactly matching the keyword used, the system also provides semantic access to anatomical information. Queries can be set up that retrieve concepts relating to a particular meaning and sharing a particular relationship. Moreover, the application can refine the search of the terms by querying the UMLS knowledge server. Anatomical image data have been integrated by using the Visible Human Dataset. A set of these images has been indexed according to our anatomical classification and is used inside the application. The system has been implemented through Java client-server technology and works within standard Internet browsers.
Present solutions for the representation and retrieval of medical information from online sources are not very satisfying. Either the retrieval process lacks precision and completeness, or the representation does not support the update and maintenance of the represented information. Most efforts are currently put into improving the combination of search engines and HTML-based documents. However, due to the current shortcomings of methods for natural language understanding, there are clear limitations to this approach. Furthermore, this approach does not solve the maintenance problem. At least for medical information exceeding a certain complexity, approaches that rely on structured knowledge representation and corresponding retrieval mechanisms seem to be required.
Knowledge-based information systems are based on the following fundamental ideas. The representation of information is based on ontologies that define the structure of the domain's concepts and their relations. Views on domain models are defined and represented as retrieval schemata. Retrieval schemata can be interpreted as canonical query types focussing on specific aspects of the provided information (e.g. diagnosis or therapy centred views). Based on these retrieval schemata it can be decided which parts of the information in the domain model must be represented explicitly and formalised to support the retrieval process. Propositional logic is used as the representation language. All other information can be represented in a structured but informal way using text, images etc. Layout schemata are used to assign layout information to retrieved domain concepts. Depending on the target environment, HTML or XML can be used.
Based on this approach two knowledge-based information systems have been developed. The 'Ophthalmologic Knowledge-based Information System for Diabetic Retinopathy' (OKIS-DR) provides information on diagnoses, findings, examinations, guidelines, and reference images related to diabetic retinopathy. OKIS-DR uses combinations of findings to specify the information that must be retrieved. The second system focuses on nutrition-related allergies and intolerances. Information on the allergies and intolerances of a patient is used to retrieve general information on the specified combination of allergies and intolerances. As a special feature, the system generates tables showing food types and products that are tolerated or not tolerated by patients. Evaluation by external experts and user groups showed that the described approach of knowledge-based information systems increases the precision and completeness of knowledge retrieval. Due to the structured and non-redundant representation of information, the maintenance and update of the information can be simplified. Both systems are available as WWW-based online knowledge bases and on CD-ROM (cf. http://mta.gsf.de topic: products).
Knowledge-based Information Systems; Knowledge-based Systems; Information Retrieval
This paper describes a pilot query interface that has been constructed to help us explore a “concept-based” approach for searching the Neuroscience Information Framework (NIF). The query interface is concept-based in the sense that the search terms submitted through the interface are selected from a standardized vocabulary of terms (concepts) that are structured in the form of an ontology. The NIF contains three primary resources: the NIF Resource Registry, the NIF Document Archive, and the NIF Database Mediator. These NIF resources are very different in their nature and therefore pose challenges when designing a single interface from which searches can be automatically launched against all three resources simultaneously. The paper first discusses briefly several background issues involving the use of standardized biomedical vocabularies in biomedical information retrieval, and then presents a detailed example that illustrates how the pilot concept-based query interface operates. The paper concludes by discussing certain lessons learned in the development of the current version of the interface.
Data search; Web search; Ontologies; Database mediation; Data federation; Text search; Neuroscience
Summary: FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank.
Availability: The system is available at http://www.nactem.ac.uk/software/facta/
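The idea of indexing both words and concepts, then ranking concepts by how often they co-occur with the query, can be sketched as follows. This is an illustrative assumption rather than FACTA's actual implementation: the toy documents, the concept identifiers, and the use of a raw co-occurrence count as the ranking statistic are all hypothetical.

```python
from collections import defaultdict

# Toy corpus: each document carries its words and pre-annotated concept IDs
# (all identifiers below are placeholders).
DOCS = {
    1: {"words": {"insulin", "resistance"}, "concepts": {"DISEASE:diabetes", "GENE:INS"}},
    2: {"words": {"insulin", "therapy"}, "concepts": {"DISEASE:diabetes"}},
    3: {"words": {"aspirin"}, "concepts": {"DISEASE:stroke"}},
}

def build_indexes(docs):
    """Build two inverted indexes ahead of query time: word -> docs, concept -> docs."""
    word_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in doc["words"]:
            word_index[word].add(doc_id)
        for concept in doc["concepts"]:
            concept_index[concept].add(doc_id)
    return word_index, concept_index

def rank_concepts(query_word, word_index, concept_index):
    """Rank concepts by the number of documents they share with the query word."""
    hits = word_index.get(query_word, set())
    counts = ((c, len(ids & hits)) for c, ids in concept_index.items())
    ranked = [(c, n) for c, n in counts if n > 0]
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

word_index, concept_index = build_indexes(DOCS)
print(rank_concepts("insulin", word_index, concept_index))
```

Because both indexes exist before the query arrives, ranking reduces to set intersections, which is one plausible reason a pre-indexed design can respond quickly on large result sets.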
The immense corpus of biomedical literature existing today poses challenges in information search and integration. Many links between pieces of knowledge occur or are significant only under certain contexts—rather than under the entire corpus. This study proposes using networks of ontology concepts, linked based on their co-occurrences in annotations of abstracts of biomedical literature and descriptions of experiments, to draw conclusions based on context-specific queries and to better integrate existing knowledge. In particular, a Bayesian network framework is constructed to allow for the linking of related terms from two biomedical ontologies under the queried context concept. Edges in such a Bayesian network allow associations between biomedical concepts to be quantified and inference to be made about the existence of some concepts given prior information about others. This approach could potentially be a powerful inferential tool for context-specific queries, applicable to ontologies in other fields as well.
We present an image retrieval framework based on automatic query expansion in a concept feature space by generalizing the vector space model of information retrieval. In this framework, images are represented by vectors of weighted concepts similar to the keyword-based representation used in text retrieval. To generate the concept vocabularies, a statistical model is built by utilizing Support Vector Machine (SVM)-based classification techniques. The images are represented as “bag of concepts” that comprise perceptually and/or semantically distinguishable color and texture patches from local image regions in a multi-dimensional feature space. To explore the correlation between the concepts and overcome the assumption of feature independence in this model, we propose query expansion techniques in the image domain from a new perspective based on both local and global analysis. For the local analysis, the correlations between the concepts based on the co-occurrence pattern, and the metrical constraints based on the neighborhood proximity between the concepts in encoded images, are analyzed by considering local feedback information. We also analyze the concept similarities in the collection as a whole in the form of a similarity thesaurus and propose an efficient query expansion based on the global analysis. The experimental results on a photographic collection of natural scenes and a biomedical database of different imaging modalities demonstrate the effectiveness of the proposed framework in terms of precision and recall.
Image Retrieval; Vector Space Model; Support Vector Machine; Relevance Feedback; Query Expansion
Our group has built an information retrieval system based on a complex semantic markup of medical textbooks. We describe the construction of a set of web-based knowledge-acquisition tools that expedites the collection and maintenance of the concepts required for text markup and the search interface required for information retrieval from the marked text. In the text markup system, domain experts (DEs) identify sections of text that contain one or more elements from a finite set of concepts. End users can then query the text using a predefined set of questions, each of which identifies a subset of complementary concepts. The search process matches that subset of concepts to relevant points in the text. The current process requires that the DE invest significant time to generate the required concepts and questions. We propose a new system--called ACQUIRE (Acquisition of Concepts and Queries in an Integrated Retrieval Environment)--that assists a DE in two essential tasks in the text-markup process. First, it helps her to develop, edit, and maintain the concept model: the set of concepts with which she marks the text. Second, ACQUIRE helps her to develop a query model: the set of specific questions that end users can later use to search the marked text. The DE incorporates concepts from the concept model when she creates the questions in the query model. The major benefit of the ACQUIRE system is a reduction in the time and effort required for the text-markup process. We compared the process of concept- and query-model creation using ACQUIRE to the process used in previous work by rebuilding two existing models that we previously constructed manually. We observed a significant decrease in the time required to build and maintain the concept and query models.
The growing volume of scientific literature on the Web and the absence of efficient tools for classifying and searching documents are the two most important factors that influence the speed of the search and the quality of the results. Previous studies have shown that the use of ontologies makes it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and takes a step further towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).
The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, robust to the ambiguity of terms, and can provide very good performance even when a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) for the full range of the explored terms (4,078 MeSH terms), and that the algorithm’s performance is resilient to terms’ ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) in the explored MeSH terms which were found to be ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the results of the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach performed with higher F-Measure in both ambiguous and monosemous MeSH terms.
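For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below uses made-up per-term values, not the paper's data, and assumes (the abstract does not say) that an "average F-measure" is computed by averaging per-term F-measures. Under that assumption, the average F-measure generally differs from the F-measure of the averaged precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(per_term):
    """Average of per-term F-measures, given a list of (precision, recall) pairs."""
    return sum(f_measure(p, r) for p, r in per_term) / len(per_term)

# Hypothetical (precision, recall) values for two terms.
per_term = [(1.0, 0.5), (1.0, 1.0)]

avg_p = sum(p for p, _ in per_term) / len(per_term)
avg_r = sum(r for _, r in per_term) / len(per_term)

print(macro_f(per_term))        # average of per-term F-measures
print(f_measure(avg_p, avg_r))  # F-measure of averaged precision and recall
```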
Information retrieval (IR) is the field of computer science that deals with the processing of documents containing free text, so that they can be rapidly retrieved based on keywords specified in a user’s query. IR technology is the basis of Web-based search engines, and plays a vital role in biomedical research, because it is the foundation of software that supports literature search. Documents can be indexed by both the words they contain, as well as the concepts that can be matched to domain-specific thesauri; concept matching, however, poses several practical difficulties that make it unsuitable for use by itself. This article provides an introduction to IR and summarizes various applications of IR and related technologies to genomics.
information retrieval; full-text indexing; text processing; genomics
Clinicians have traditionally documented patient data using natural language text. With the increasing prevalence of computer systems in health care, an increasing amount of medical record text will be stored electronically. However, for such textual documents to be indexed, shared, and processed adequately by computers, it will be important to be able to identify concepts in the documents using a common medical terminology. Automated methods for extracting concepts in a standard terminology would enhance retrieval and analysis of medical record data. This paper discusses a method for extracting concepts from medical record documents using the medical terminology SNOMED-III (Systematized Nomenclature of Human and Veterinary Medicine, Version III). The technique employs a linear least squares fit that maps training set phrases to SNOMED concepts. This mapping can be used for unknown text inputs in the same domain as the training set to predict SNOMED concepts that are contained in the document. We have implemented the method in the domain of congestive heart failure for history and physical exam texts. Our system has a reasonable response time. We tested the system over a range of thresholds. The system performed with 90% sensitivity and 83% specificity at the lowest threshold, and 42% sensitivity and 99.9% specificity at the highest threshold.
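The trade-off reported across thresholds can be made concrete with a small sketch. The scores and gold-standard concepts below are invented for illustration, and the scoring step itself (the least squares mapping) is not reproduced. The sketch only shows how raising a decision threshold on predicted concept scores tends to lower sensitivity and raise specificity.

```python
def sensitivity_specificity(scores, gold, threshold):
    """Compute (sensitivity, specificity) when concepts scoring >= threshold are predicted."""
    tp = fn = tn = fp = 0
    for concept, score in scores.items():
        predicted = score >= threshold
        actual = concept in gold
        if predicted and actual:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Hypothetical predicted scores for four candidate concepts in one document.
scores = {"A": 0.9, "B": 0.6, "C": 0.4, "D": 0.1}
gold = {"A", "B"}  # concepts actually present (placeholder gold standard)

for threshold in (0.3, 0.8):
    print(threshold, sensitivity_specificity(scores, gold, threshold))
```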
Because of the increasing number of electronic resources, designing efficient tools to retrieve and exploit them is a major challenge. Some improvements have been offered by semantic Web technologies and applications based on domain ontologies. In the life sciences, for instance, the Gene Ontology is widely exploited in genomic applications, and the Medical Subject Headings vocabulary is the basis of the indexing of biomedical publications and of the information retrieval process offered by PubMed. However, current search engines suffer from two main drawbacks: there is limited user interaction with the list of retrieved resources, and no explanation of their adequacy to the query is provided. Users may thus be confused by the selection and have no idea how to adapt their queries so that the results match their expectations.
This paper describes an information retrieval system that relies on a domain ontology to widen the set of relevant documents that is retrieved and that uses a graphical rendering of query results to favor user interactions. Semantic proximities between ontology concepts and aggregating models are used to assess document adequacy with respect to a query. The selection of documents is displayed in a semantic map to provide graphical indications that make explicit to what extent they match the user's query; this man/machine interface favors a more interactive and iterative exploration of the data corpus by facilitating query concept weighting and visual explanation. We illustrate the benefit of using this information retrieval system on two case studies, one of which aims at collecting human genes related to transcription factors involved in the hemopoiesis pathway.
The ontology-based information retrieval system described in this paper (OBIRS) is freely available at: http://www.ontotoolkit.mines-ales.fr/ObirsClient/. This environment is a first step towards a user-centred application in which the system highlights relevant information to support decision making.
The proliferation of medical terms poses a number of challenges in the sharing of medical information among different stakeholders. Ontologies are commonly used to establish relationships between different terms, yet their role in querying has not been investigated in detail. In this paper, we study the problem of supporting ontology-based keyword search queries on a database of electronic medical records. We present several approaches to support this type of query, study the advantages and limitations of each approach, and summarize the lessons learned as best practices.
Three Problem List Terminologies (PLT) were tested using a web-based application simulating a clinical data entry environment to evaluate coverage and coding efficiency. The three PLTs were: the CORE Problem List Subset of SNOMED CT, a clinical subset extracted from the full SNOMED CT and the PLT currently used at the Mayo Clinic. Candidate problem statements were randomly extracted from free text problem list entries contained in two electronic medical record systems. Physician reviewers searched for concepts in one of the three PLTs that most closely matched a problem statement. Altogether 45 reviewers reviewed 15 problems each. The coverage of the much smaller CORE Subset was comparable to Clinical SNOMED for combined exact or partial matches. The CORE Subset required the shortest time to find a concept. This may be related to the smaller size of the pick lists for the CORE Subset.
Objective: The idea of testing a hypothesis is central to the practice of biomedical research. However, the results of testing a hypothesis are published mainly in the form of prose articles. Encoding the results as scientific assertions that are both human and machine readable would greatly enhance the synergistic growth and dissemination of knowledge.
Design: We have developed MachineProse (MP), an ontological framework for the concise specification of scientific assertions. MP is based on the idea of an assertion constituting a fundamental unit of knowledge. This is in contrast to current approaches that use discrete concept terms from domain ontologies for annotation and assertions are only inferred heuristically.
Measurements: We use illustrative examples to highlight the advantages of MP over the use of the Medical Subject Headings (MeSH) system and keywords in indexing scientific articles.
Results: We show how MP makes it possible to carry out semantic annotation of publications that is machine readable and allows for precise search capabilities. In addition, when used by itself, MP serves as a knowledge repository for emerging discoveries. A prototype for proof of concept has been developed that demonstrates the feasibility and novel benefits of MP. As part of the MP framework, we have created an ontology of relationship types with about 100 terms optimized for the representation of scientific assertions.
Conclusion: MachineProse is a novel semantic framework that we believe may be used to summarize research findings, annotate biomedical publications, and support sophisticated searches.
Randomized clinical trial (RCT) papers provide reliable information about the efficacy of medical interventions. Current keyword-based search methods for retrieving medical evidence overload users with irrelevant information, as these methods often do not take into consideration the semantics encoded within abstracts and the search query. Personalized semantic search, intelligent clinical question answering and medical evidence summarization aim to solve this information overload problem. Most of these approaches would benefit significantly if the information available in the abstracts were structured into meaningful categories (e.g., background, objective, method, result and conclusion). While many journals use a structured abstract format, the majority of RCT abstracts still remain unstructured.
We have developed a novel automated approach to structuring RCT abstracts by combining text classification and Hidden Markov Modeling (HMM) techniques. The results (precision of 0.94, recall of 0.93) of our approach are a significant improvement over previously reported work on automated sentence categorization in RCT abstracts.
In general, it is very straightforward to store concept identifiers in electronic medical records and represent them in messages. Information models typically specify the fields that can contain coded entries. For each of these fields there may be additional constraints governing exactly which concept identifiers are applicable. However, because modern terminologies such as SNOMED CT are compositional, allowing concept expressions to be pre-coordinated within the terminology or post-coordinated within the medical record, there remains the potential to express a concept in more than one way. Often the various representations are similar but not equivalent. This paper describes an approach for retrieving these pre- and post-coordinated concept expressions: (1) Create concept expressions using a logically well-structured terminology (e.g., SNOMED CT) according to the rules of a well-specified information model (in this paper we use the HL7 RIM); (2) Transform pre- and post-coordinated concept expressions into a normalized form; (3) Transform queries into the same normalized form. The normalized instances can then be directly compared to the query. Several implementation considerations have been identified. Transformations into a normal form and execution of queries that require traversal of hierarchies need to be optimized. A detailed understanding of both the information model and the terminology model is a prerequisite. Queries based on the semantic properties of concepts are only as complete as the semantic information contained in the terminology model. Despite these considerations, the approach appears powerful and will continue to be refined.
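Steps (2) and (3) above can be sketched in miniature. The concept names, the `DEFINITIONS` table, and the attribute names are hypothetical placeholders rather than real SNOMED CT content. The sketch also omits the hierarchy traversal that the paper notes is needed in practice: attribute values are compared for exact equality only.

```python
# Hypothetical definitions: each pre-coordinated concept is defined as a more
# primitive focus concept plus attribute/value refinements.
DEFINITIONS = {
    "FractureOfLeftFemur": ("Fracture", {"site": "Femur", "laterality": "Left"}),
    "FractureOfFemur": ("Fracture", {"site": "Femur"}),
}

def normalize(focus, refinements=None):
    """Expand a concept expression into (primitive focus, frozenset of attribute pairs).

    Attributes stated explicitly in the record (post-coordination) take
    precedence over those pulled from definitions; this tie-breaking rule is
    an assumption of the sketch.
    """
    attrs = dict(refinements or {})
    while focus in DEFINITIONS:
        focus, defined = DEFINITIONS[focus]
        for name, value in defined.items():
            attrs.setdefault(name, value)
    return focus, frozenset(attrs.items())

def matches(query, instance):
    """An instance satisfies a query if the foci agree and it has every queried attribute."""
    query_focus, query_attrs = query
    focus, attrs = instance
    return query_focus == focus and query_attrs <= attrs

pre = normalize("FractureOfLeftFemur")                     # pre-coordinated
post = normalize("FractureOfFemur", {"laterality": "Left"})  # post-coordinated
print(pre == post)
print(matches(normalize("FractureOfFemur"), pre))
```

Once both forms reduce to the same normalized value, a stored instance and a query can be compared directly, regardless of which coordination style was used to record the instance.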
Searching for useful information in unstructured medical multimedia data has been a difficult problem in information retrieval. This paper reports an effective semantic medical multimedia retrieval approach that can reflect the users' query intent. First, semantic annotations are attached to the multimedia documents in the medical multimedia database. Second, the ontology that represents the semantic information is hidden in the head of the multimedia documents. The main innovations of this approach are cross-type retrieval support and semantic information preservation. Experimental results indicate good precision and efficiency of our approach for medical multimedia retrieval in comparison with some traditional approaches.
Most present-day information retrieval systems use the presence or absence of certain words to decide which documents are appropriate for a user's query. This approach has had certain successes, but it fails to capture relationships between the concepts represented by the words, and hence reduces the potential specificity of both indexing and searching of documents. A richer representation of the semantics of documents and queries, and methods for reasoning about these representations, have been provided by artificial intelligence. Navigational tools for browsing and authoring knowledge bases (KBs) add a convenient technique for focusing in the complex landscape of semantic representations. The center of such representations is usually a frame or a semantic network system. We are developing a prototype Unified Medical Language System (UMLS) taxonomy to represent objects and relationships in medicine. One focus of our research is improved methods for indexing and querying repositories of biomedical literature. The technique we propose is based on the notion of relatedness of concepts. To this end we define heuristics that find related concepts and apply them to the UMLS taxonomy. Preliminary results from experiments with the implemented heuristics demonstrate their potential usefulness.
Summary: Search engines running on MEDLINE abstracts have been widely used by biologists to find publications that are related to their research. The existing search engines such as PubMed, however, have limitations when applied for the task of seeking textual evidence of relations between given concepts. The limitations are mainly due to the problem that the search engines do not effectively deal with multi-term queries which may imply semantic relations between the terms. To address this problem, we present MedEvi, a novel search engine that imposes positional restriction on occurrences matching multi-term queries, based on the observation that terms with semantic relations which are explicitly stated in text are not found too far from each other. MedEvi further identifies additional keywords of biological and statistical significance from local context of matching occurrences in order to help users reformulate their queries for better results.
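A positional restriction of this kind can be illustrated with a simple token-window check. This is a minimal sketch, not MedEvi's algorithm: the tokenizer, the `window` parameter, and the example sentence are assumptions, and the abstract does not state what distance the system actually allows.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def proximity_match(text, terms, window):
    """Return True if every term occurs and some set of occurrences spans fewer than `window` tokens."""
    tokens = tokenize(text)
    positions = [
        [i for i, token in enumerate(tokens) if token == term.lower()]
        for term in terms
    ]
    if any(not p for p in positions):
        return False  # at least one query term is missing entirely
    # Try every combination of one occurrence per term.
    return any(max(combo) - min(combo) < window for combo in product(*positions))

sentence = "Mutations in BRCA1 increase the risk of breast cancer."
print(proximity_match(sentence, ["BRCA1", "cancer"], window=10))
print(proximity_match(sentence, ["BRCA1", "cancer"], window=5))
```

Checking every combination of occurrences is quadratic or worse in the number of hits; it keeps the sketch short but would need a smarter scan on long documents.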
Infobuttons have been established to be an effective resource for addressing information needs at the point of care, as evidenced by recent research and their inclusion in government-based electronic health record incentive programs in the United States. Yet they have achieved wide success only in a specific set of domains (lab data, medication orders, and problem lists) and only for discrete, singular concepts that are already documented in the electronic medical record. In this manuscript, we present an effort to broaden their utility by connecting a semantic web-based phenotyping engine with an infobutton framework in order to identify and address broader issues in patient data, derived from multiple data sources. We have tested these patterns by defining and testing semantic definitions of pre-diabetes and metabolic syndrome. We intend to carry forward relevant information to the infobutton framework to present timely, relevant education resources to patients and providers.
Semantic-similarity measures quantify concept similarities in a given ontology. Potential applications for these measures include search, data mining, and knowledge discovery in database or decision-support systems that utilize ontologies. To date, there have not been comparisons of the different semantic-similarity approaches on a single ontology. Such a comparison can offer insight into the validity of different approaches. We compared 3 approaches to semantic-similarity metrics (which rely on expert opinion, ontologies only, and information content) with 4 metrics applied to SNOMED-CT. We found poor agreement between the metrics based on information content and the ontology-only metric. The metric based only on the ontology structure correlated most with expert opinion. Our results suggest that metrics based on the ontology only may be preferable to information-content-based metrics, and point to the need for more research on validating the different approaches.
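To make the contrast between ontology-only and information-content-based metrics concrete, the sketch below implements one example of each on a toy hierarchy: an inverse path-length measure and the Lin measure. The abstract does not name the specific metrics that were compared, so these two are stand-ins, and the hierarchy and corpus frequencies are invented.

```python
import math

# Toy single-parent hierarchy (child -> parent) and hypothetical corpus counts.
PARENT = {
    "root": None, "disease": "root", "heart_disease": "disease",
    "mi": "heart_disease", "angina": "heart_disease", "pneumonia": "disease",
}
FREQ = {"mi": 10, "angina": 10, "pneumonia": 20}

def path_to_root(concept):
    """List the concept followed by each of its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def lowest_common_ancestor(a, b):
    ancestors_b = set(path_to_root(b))
    return next(c for c in path_to_root(a) if c in ancestors_b)

def path_similarity(a, b):
    """Ontology-only metric: inverse of (1 + number of edges between a and b)."""
    lca = lowest_common_ancestor(a, b)
    distance = path_to_root(a).index(lca) + path_to_root(b).index(lca)
    return 1 / (1 + distance)

def subtree_count(concept):
    """Corpus count of a concept plus the counts of all its descendants."""
    children = [c for c, parent in PARENT.items() if parent == concept]
    return FREQ.get(concept, 0) + sum(subtree_count(c) for c in children)

def information_content(concept):
    return -math.log(subtree_count(concept) / subtree_count("root"))

def lin_similarity(a, b):
    """Information-content metric (Lin): 2 * IC(lca) / (IC(a) + IC(b))."""
    denominator = information_content(a) + information_content(b)
    if denominator == 0:
        return 1.0
    return 2 * information_content(lowest_common_ancestor(a, b)) / denominator

for pair in (("mi", "angina"), ("mi", "pneumonia")):
    print(pair, path_similarity(*pair), lin_similarity(*pair))
```

The two metrics need not rank concept pairs the same way, because the information-content value depends on the corpus counts as well as on the hierarchy.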
SNOMED-CT has been promoted as a reference terminology for electronic health record (EHR) systems. Many important EHR functions are based on the assumption that medical concepts will be coded consistently by different users. This study is designed to measure agreement among three physicians using two SNOMED-CT terminology browsers to encode 242 concepts from five ophthalmology case presentations in a publicly-available clinical journal. Inter-coder reliability, based on exact coding match by each physician, was 44% using one browser and 53% using the other. Intra-coder reliability testing revealed that a different SNOMED-CT code was obtained up to 55% of the time when the two browsers were used by one user to encode the same concept. These results suggest that the reliability of SNOMED-CT coding is imperfect, and may be a function of browsing methodology. A combination of physician training, terminology refinement, and browser improvement may help increase the reproducibility of SNOMED-CT coding.
The optimal retrieval of a literature search in biomedicine depends on the appropriate use of Medical Subject Headings (MeSH), descriptors and keywords among authors and indexers. We hypothesized that authors, investigators and indexers in four biomedical databases are not consistent in their use of terminology in Complementary and Alternative Medicine (CAM).
Based on a research question addressing the validity of spinal palpation for the diagnosis of neuromuscular dysfunction, we developed four search concepts with their respective controlled vocabulary and key terms. We calculated the frequency of MeSH, descriptors, and keywords used by authors in titles and abstracts in comparison to standard practices in semantic and analytic indexing in MEDLINE, MANTIS, CINAHL, and Web of Science.
Multiple searches resulted in the final selection of 38 relevant studies that were indexed at least in one of the four selected databases. Of the four search concepts, validity showed the greatest inconsistency in terminology among authors, indexers and investigators. The use of spinal terms showed the greatest consistency. Of the 22 neuromuscular dysfunction terms provided by the investigators, 11 were not contained in the controlled vocabulary and six were never used by authors or indexers. Most authors did not seem familiar with the controlled vocabulary for validity in the area of neuromuscular dysfunction. Recently, standard glossaries have been developed to assist in the research development of manual medicine.
Searching biomedical databases for CAM is challenging due to inconsistent use of controlled vocabulary and indexing procedures in different databases. A standard terminology should be used by investigators in conducting their search strategies and authors when writing titles, abstracts and submitting keywords for publications.