Search tips
Search criteria

Results 1-25 (784357)

Clipboard (0)

Related Articles

1.  Selective retrieval of pre- and post-coordinated SNOMED concepts. 
In general, it is very straightforward to store concept identifiers in electronic medical records and represent them in messages. Information models typically specify the fields that can contain coded entries. For each of these fields there may be additional constraints governing exactly which concept identifiers are applicable. However, because modern terminologies such as SNOMED CT are compositional, allowing concept expressions to be pre-coordinated within the terminology or post-coordinated within the medical record, there remains the potential to express a concept in more than one way. Often times, the various representations are similar, but not equivalent. This paper describes an approach for retrieving these pre- and post-coordinated concept expressions: (1) Create concept expressions using a logically-well-structured terminology (e.g., SNOMED CT) according to the rules of a well-specified information model (in this paper we use the HL7 RIM); (2) Transform pre- and post-coordinated concept expressions into a normalized form; (3) Transform queries into the same normalized form. The normalized instances can then be directly compared to the query. Several implementation considerations have been identified. Transformations into a normal form and execution of queries that require traversal of hierarchies need to be optimized. A detailed understanding of the information model and the terminology model are prerequisites. Queries based on the semantic properties of concepts are only as complete as the semantic information contained in the terminology model. Despite these considerations, the approach appears powerful and will continue to be refined.
PMCID: PMC2244193  PMID: 12463817
2.  iSMART: Ontology-based Semantic Query of CDA Documents 
The Health Level 7 Clinical Document Architecture (CDA) is widely accepted as the format for electronic clinical document. With the rich ontological references in CDA documents, the ontology-based semantic query could be performed to retrieve CDA documents. In this paper, we present iSMART (interactive Semantic MedicAl Record reTrieval), a prototype system designed for ontology-based semantic query of CDA documents. The clinical information in CDA documents will be extracted into RDF triples by a declarative XML to RDF transformer. An ontology reasoner is developed to infer additional information by combining the background knowledge from SNOMED CT ontology. Then an RDF query engine is leveraged to enable the semantic queries. This system has been evaluated using the real clinical documents collected from a large hospital in southern China.
PMCID: PMC2815425  PMID: 20351883
3.  User centered and ontology based information retrieval system for life sciences 
BMC Bioinformatics  2012;13(Suppl 1):S4.
Because of the increasing number of electronic resources, designing efficient tools to retrieve and exploit them is a major challenge. Some improvements have been offered by semantic Web technologies and applications based on domain ontologies. In life science, for instance, the Gene Ontology is widely exploited in genomic applications and the Medical Subject Headings is the basis of biomedical publications indexation and information retrieval process proposed by PubMed. However current search engines suffer from two main drawbacks: there is limited user interaction with the list of retrieved resources and no explanation for their adequacy to the query is provided. Users may thus be confused by the selection and have no idea on how to adapt their queries so that the results match their expectations.
This paper describes an information retrieval system that relies on domain ontology to widen the set of relevant documents that is retrieved and that uses a graphical rendering of query results to favor user interactions. Semantic proximities between ontology concepts and aggregating models are used to assess documents adequacy with respect to a query. The selection of documents is displayed in a semantic map to provide graphical indications that make explicit to what extent they match the user's query; this man/machine interface favors a more interactive and iterative exploration of data corpus, by facilitating query concepts weighting and visual explanation. We illustrate the benefit of using this information retrieval system on two case studies one of which aiming at collecting human genes related to transcription factors involved in hemopoiesis pathway.
The ontology based information retrieval system described in this paper (OBIRS) is freely available at: This environment is a first step towards a user centred application in which the system enlightens relevant information to provide decision help.
PMCID: PMC3434427  PMID: 22373375
4.  Migrating existing clinical content from ICD-9 to SNOMED 
To identify challenges in mapping internal International Classification of Disease, 9th edition, Clinical Modification (ICD-9-CM) encoded legacy data to Systematic Nomenclature of Medicine (SNOMED), using SNOMED-prescribed compositional approaches where appropriate, and to explore the mapping coverage provided by the US National Library of Medicine (NLM)'s SNOMED clinical core subset.
This study selected ICD-CM codes that occurred at least 100 times in the organization's problem list or diagnosis data in 2008. After eliminating codes whose exact mappings were already available in UMLS, the remainder were mapped manually with software assistance.
Of the 2194 codes, 784 (35.7%) required manual mapping. 435 of these represented concept types documented in SNOMED as deprecated: these included the qualifying phrases such as ‘not elsewhere classified’. A third of the codes were composite, requiring multiple SNOMED code to map. Representing 45 composite concepts required introducing disjunction (‘or’) or set-difference (‘without’) operators, which are not currently defined in SNOMED. Only 47% of the concepts required for composition were present in the clinical core subset. Search of SNOMED for the correct concepts often required extensive application of knowledge of both English and medical synonymy.
Strategies to deal with legacy ICD data must address the issue of codes created by non-taxonomist users. The NLM core subset possibly needs augmentation with concepts from certain SNOMED hierarchies, notably qualifiers, body structures, substances/products and organisms. Concept-matching software needs to utilize query expansion strategies, but these may be effective in production settings only if a large but non-redundant SNOMED subset that minimizes the proportion of extensively pre-coordinated concepts is also available.
PMCID: PMC2995664  PMID: 20819871
Controlled vocabularies; ICD-9; SNOMED; vocabulary mapping
5.  A Comparative Evaluation of Full-text, Concept-based, and Context-sensitive Search 
Study comparatively (1) concept-based search, using documents pre-indexed by a conceptual hierarchy; (2) context-sensitive search, using structured, labeled documents; and (3) traditional full-text search. Hypotheses were: (1) more contexts lead to better retrieval accuracy; and (2) adding concept-based search to the other searches would improve upon their baseline performances.
Use our Vaidurya architecture, for search and retrieval evaluation, of structured documents classified by a conceptual hierarchy, on a clinical guidelines test collection.
Precision computed at different levels of recall to assess the contribution of the retrieval methods. Comparisons of precisions done with recall set at 0.5, using t-tests.
Performance increased monotonically with the number of query context elements. Adding context-sensitive elements, mean improvement was 11.1% at recall 0.5. With three contexts, mean query precision was 42% ± 17% (95% confidence interval [CI], 31% to 53%); with two contexts, 32% ± 13% (95% CI, 27% to 38%); and one context, 20% ± 9% (95% CI, 15% to 24%). Adding context-based queries to full-text queries monotonically improved precision beyond the 0.4 level of recall. Mean improvement was 4.5% at recall 0.5. Adding concept-based search to full-text search improved precision to 19.4% at recall 0.5.
The study demonstrated usefulness of concept-based and context-sensitive queries for enhancing the precision of retrieval from a digital library of semi-structured clinical guideline documents. Concept-based searches outperformed free-text queries, especially when baseline precision was low. In general, the more ontological elements used in the query, the greater the resulting precision.
PMCID: PMC2213470  PMID: 17213502
6.  A randomized controlled trial of concept based indexing of Web page content. 
OBJECTIVE: Medical information is increasingly being presented in a web-enabled format. Medical journals, guidelines, and textbooks are all accessible in a web-based format. It would be desirable to link these reference sources to the electronic medical record to provide education, to facilitate guideline implementation and usage and for decision support. In order for these rich information sources to be accessed via the medical record they will need to be indexed by a single comparable underlying reference terminology. METHODS: We took a random sample of 100 web pages out of the 6,000 web pages on the Mayo Clinic's Health Oasis web site. The web pages were divided into four datasets each containing 25 pages. These were humanly reviewed by four clinicians to identify all of the health concepts present (R1DA, R2DB, R3DC, R4DD). The web pages were simultaneously indexed using the SNOMED-RT beta release. The indexing engine has been previously described and validated. A new clinician reviewed the indexed web pages to determine the accuracy of the automated mappings as compared with the human identified concepts (R4DA, R3DB, R2DC, R1DD). RESULTS: This review found 13,220 health concepts. Of these 10,383 concepts were identified by the initial human review (78.5% +/- 3.6%). The automated process identified 10,083 concepts correctly (76.3% +/- 4.0%) from within this corpus. The computer identified 2,420 concepts, which were not identified by the clinician's review but were upon further consideration important to include as health concepts. There was on average a 17.1% +/- 3.5% variability in the human reviewers ability to identify the important health concepts within web page content. Concept Based Indexing provided a positive predictive value (PPV) of finding a health concept of 79.3% as compared with keyword indexing which only has a PPV of 33.7% (p < 0.001). CONCLUSION: SNOMED-RT is a reasonable ontology for web page indexing. Concept based indexing provides a significantly greater accuracy in identifying health concepts when compared with keyword indexing.
PMCID: PMC2243865  PMID: 11079877
7.  MachineProse: an Ontological Framework for Scientific Assertions 
Objective: The idea of testing a hypothesis is central to the practice of biomedical research. However, the results of testing a hypothesis are published mainly in the form of prose articles. Encoding the results as scientific assertions that are both human and machine readable would greatly enhance the synergistic growth and dissemination of knowledge.
Design: We have developed MachineProse (MP), an ontological framework for the concise specification of scientific assertions. MP is based on the idea of an assertion constituting a fundamental unit of knowledge. This is in contrast to current approaches that use discrete concept terms from domain ontologies for annotation and assertions are only inferred heuristically.
Measurements: We use illustrative examples to highlight the advantages of MP over the use of the Medical Subject Headings (MeSH) system and keywords in indexing scientific articles.
Results: We show how MP makes it possible to carry out semantic annotation of publications that is machine readable and allows for precise search capabilities. In addition, when used by itself, MP serves as a knowledge repository for emerging discoveries. A prototype for proof of concept has been developed that demonstrates the feasibility and novel benefits of MP. As part of the MP framework, we have created an ontology of relationship types with about 100 terms optimized for the representation of scientific assertions.
Conclusion: MachineProse is a novel semantic framework that we believe may be used to summarize research findings, annotate biomedical publications, and support sophisticated searches.
PMCID: PMC1447552  PMID: 16357355
8.  Determining correspondences between high-frequency MedDRA concepts and SNOMED: a case study 
The Systematic Nomenclature of Medicine Clinical Terms (SNOMED CT) is being advocated as the foundation for encoding clinical documentation. While the electronic medical record is likely to play a critical role in pharmacovigilance - the detection of adverse events due to medications - classification and reporting of Adverse Events is currently based on the Medical Dictionary of Regulatory Activities (MedDRA). Complete and high-quality MedDRA-to-SNOMED CT mappings can therefore facilitate pharmacovigilance.
The existing mappings, as determined through the Unified Medical Language System (UMLS), are partial, and record only one-to-one correspondences even though SNOMED CT can be used compositionally. Efforts to map previously unmapped MedDRA concepts would be most productive if focused on concepts that occur frequently in actual adverse event data.
We aimed to identify aspects of MedDRA that complicate mapping to SNOMED CT, determine pattern in unmapped high-frequency MedDRA concepts, and to identify types of integration errors in the mapping of MedDRA to UMLS.
Using one years' data from the US Federal Drug Administrations Adverse Event Reporting System, we identified MedDRA preferred terms that collectively accounted for 95% of both Adverse Events and Therapeutic Indications records. After eliminating those already mapping to SNOMED CT, we attempted to map the remaining 645 Adverse-Event and 141 Therapeutic-Indications preferred terms with software assistance.
All but 46 Adverse-Event and 7 Therapeutic-Indications preferred terms could be composed using SNOMED CT concepts: none of these required more than 3 SNOMED CT concepts to compose. We describe the common composition patterns in the paper. About 30% of both Adverse-Event and Therapeutic-Indications Preferred Terms corresponded to single SNOMED CT concepts: the correspondence was detectable by human inspection but had been missed during the integration process, which had created duplicated concepts in UMLS.
Identification of composite mapping patterns, and the types of errors that occur in the MedDRA content within UMLS, can focus larger-scale efforts on improving the quality of such mappings, which may assist in the creation of an adverse-events ontology.
PMCID: PMC2988705  PMID: 21029418
9.  Automated Semantic Indexing of Figure Captions to Improve Radiology Image Retrieval 
We explored automated concept-based indexing of unstructured figure captions to improve retrieval of images from radiology journals.
The MetaMap Transfer program (MMTx) was used to map the text of 84,846 figure captions from 9,004 peer-reviewed, English-language articles to concepts in three controlled vocabularies from the UMLS Metathesaurus, version 2006AA. Sampling procedures were used to estimate the standard information-retrieval metrics of precision and recall, and to evaluate the degree to which concept-based retrieval improved image retrieval.
Precision was estimated based on a sample of 250 concepts. Recall was estimated based on a sample of 40 concepts. The authors measured the impact of concept-based retrieval to improve upon keyword-based retrieval in a random sample of 10,000 search queries issued by users of a radiology image search engine.
Estimated precision was 0.897 (95% confidence interval, 0.857–0.937). Estimated recall was 0.930 (95% confidence interval, 0.838–1.000). In 5,535 of 10,000 search queries (55%), concept-based retrieval found results not identified by simple keyword matching; in 2,086 searches (21%), more than 75% of the results were found by concept-based search alone.
Concept-based indexing of radiology journal figure captions achieved very high precision and recall, and significantly improved image retrieval.
PMCID: PMC2732225  PMID: 19261938
10.  Querying phenotype-genotype relationships on patient datasets using semantic web technology: the example of cerebrotendinous xanthomatosis 
Semantic Web technology can considerably catalyze translational genetics and genomics research in medicine, where the interchange of information between basic research and clinical levels becomes crucial. This exchange involves mapping abstract phenotype descriptions from research resources, such as knowledge databases and catalogs, to unstructured datasets produced through experimental methods and clinical practice. This is especially true for the construction of mutation databases. This paper presents a way of harmonizing abstract phenotype descriptions with patient data from clinical practice, and querying this dataset about relationships between phenotypes and genetic variants, at different levels of abstraction.
Due to the current availability of ontological and terminological resources that have already reached some consensus in biomedicine, a reuse-based ontology engineering approach was followed. The proposed approach uses the Ontology Web Language (OWL) to represent the phenotype ontology and the patient model, the Semantic Web Rule Language (SWRL) to bridge the gap between phenotype descriptions and clinical data, and the Semantic Query Web Rule Language (SQWRL) to query relevant phenotype-genotype bidirectional relationships. The work tests the use of semantic web technology in the biomedical research domain named cerebrotendinous xanthomatosis (CTX), using a real dataset and ontologies.
A framework to query relevant phenotype-genotype bidirectional relationships is provided. Phenotype descriptions and patient data were harmonized by defining 28 Horn-like rules in terms of the OWL concepts. In total, 24 patterns of SWQRL queries were designed following the initial list of competency questions. As the approach is based on OWL, the semantic of the framework adapts the standard logical model of an open world assumption.
This work demonstrates how semantic web technologies can be used to support flexible representation and computational inference mechanisms required to query patient datasets at different levels of abstraction. The open world assumption is especially good for describing only partially known phenotype-genotype relationships, in a way that is easily extensible. In future, this type of approach could offer researchers a valuable resource to infer new data from patient data for statistical analysis in translational research. In conclusion, phenotype description formalization and mapping to clinical data are two key elements for interchanging knowledge between basic and clinical research.
PMCID: PMC3444309  PMID: 22849591
11.  COM3/369: Knowledge-based Information Systems: A new approach for the representation and retrieval of medical information 
Present solutions for the representation and retrieval of medical information from online sources are not very satisfying. Either the retrieval process lacks of precision and completeness the representation does not support the update and maintenance of the represented information. Most efforts are currently put into improving the combination of search engines and HTML based documents. However, due to the current shortcomings of methods for natural language understanding there are clear limitations to this approach. Furthermore, this approach does not solve the maintenance problem. At least medical information exceeding a certain complexity seems to afford approaches that rely on structured knowledge representation and corresponding retrieval mechanisms.
Knowledge-based information systems are based on the following fundamental ideas. The representation of information is based on ontologies that define the structure of the domain's concepts and their relations. Views on domain models are defined and represented as retrieval schemata. Retrieval schemata can be interpreted as canonical query types focussing on specific aspects of the provided information (e.g. diagnosis or therapy centred views). Based on these retrieval schemata it can be decided which parts of the information in the domain model must be represented explicitly and formalised to support the retrieval process. As representation language propositional logic is used. All other information can be represented in a structured but informal way using text, images etc. Layout schemata are used to assign layout information to retrieved domain concepts. Depending on the target environment HTML or XML can be used.
Based on this approach two knowledge-based information systems have been developed. The 'Ophthalmologic Knowledge-based Information System for Diabetic Retinopathy' (OKIS-DR) provides information on diagnoses, findings, examinations, guidelines, and reference images related to diabetic retinopathy. OKIS-DR uses combinations of findings to specify the information that must be retrieved. The second system focuses on nutrition related allergies and intolerances. Information on allergies and intolerances of a patient are used to retrieve general information on the specified combination of allergies and intolerances. As a special feature the system generates tables showing food types and products that are tolerated or not tolerated by patients. Evaluation by external experts and user groups showed that the described approach of knowledge-based information systems increases the precision and completeness of knowledge retrieval. Due to the structured and non-redundant representation of information the maintenance and update of the information can be simplified. Both systems are available as WWW based online knowledge bases and CD-ROMs (cf. topic: products).
PMCID: PMC1761778
Knowledge-based Information Systems; Knowledge-based Systems; Information Retrieval
12.  Discovering gene annotations in biomedical text databases 
BMC Bioinformatics  2008;9:143.
Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need forautomated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data.
In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discoveredknowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching to a pattern and produce Gene Ontology annotations for genes and gene products.
In our experiments, GEANN has reached to the precision level of 78% at therecall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general.
GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieve high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exactmatching" with the advantage of locating approximate pattern occurrences with similar semantics. Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values.
PMCID: PMC2335285  PMID: 18325104
13.  Automation and integration of components for generalized semantic markup of electronic medical texts. 
Our group has built an information retrieval system based on a complex semantic markup of medical textbooks. We describe the construction of a set of web-based knowledge-acquisition tools that expedites the collection and maintenance of the concepts required for text markup and the search interface required for information retrieval from the marked text. In the text markup system, domain experts (DEs) identify sections of text that contain one or more elements from a finite set of concepts. End users can then query the text using a predefined set of questions, each of which identifies a subset of complementary concepts. The search process matches that subset of concepts to relevant points in the text. The current process requires that the DE invest significant time to generate the required concepts and questions. We propose a new system--called ACQUIRE (Acquisition of Concepts and Queries in an Integrated Retrieval Environment)--that assists a DE in two essential tasks in the text-markup process. First, it helps her to develop, edit, and maintain the concept model: the set of concepts with which she marks the text. Second, ACQUIRE helps her to develop a query model: the set of specific questions that end users can later use to search the marked text. The DE incorporates concepts from the concept model when she creates the questions in the query model. The major benefit of the ACQUIRE system is a reduction in the time and effort required for the text-markup process. We compared the process of concept- and query-model creation using ACQUIRE to the process used in previous work by rebuilding two existing models that we previously constructed manually. We observed a significant decrease in the time required to build and maintain the concept and query models.
PMCID: PMC2232691  PMID: 10566457
14.  A controlled trial of automated classification of negation from clinical notes 
Identification of negation in electronic health records is essential if we are to understand the computable meaning of the records: Our objective is to compare the accuracy of an automated mechanism for assignment of Negation to clinical concepts within a compositional expression with Human Assigned Negation. Also to perform a failure analysis to identify the causes of poorly identified negation (i.e. Missed Conceptual Representation, Inaccurate Conceptual Representation, Missed Negation, Inaccurate identification of Negation).
41 Clinical Documents (Medical Evaluations; sometimes outside of Mayo these are referred to as History and Physical Examinations) were parsed using the Mayo Vocabulary Server Parsing Engine. SNOMED-CT™ was used to provide concept coverage for the clinical concepts in the record. These records resulted in identification of Concepts and textual clues to Negation. These records were reviewed by an independent medical terminologist, and the results were tallied in a spreadsheet. Where questions on the review arose Internal Medicine Faculty were employed to make a final determination.
SNOMED-CT was used to provide concept coverage of the 14,792 Concepts in 41 Health Records from John's Hopkins University. Of these, 1,823 Concepts were identified as negative by Human review. The sensitivity (Recall) of the assignment of negation was 97.2% (p < 0.001, Pearson Chi-Square test; when compared to a coin flip). The specificity of assignment of negation was 98.8%. The positive likelihood ratio of the negation was 81. The positive predictive value (Precision) was 91.2%
Automated assignment of negation to concepts identified in health records based on review of the text is feasible and practical. Lexical assignment of negation is a good test of true Negativity as judged by the high sensitivity, specificity and positive likelihood ratio of the test. SNOMED-CT had overall coverage of 88.7% of the concepts being negated.
PMCID: PMC1142321  PMID: 15876352
15.  A Query Expansion Framework in Image Retrieval Domain Based on Local and Global Analysis 
We present an image retrieval framework based on automatic query expansion in a concept feature space by generalizing the vector space model of information retrieval. In this framework, images are represented by vectors of weighted concepts similar to the keyword-based representation used in text retrieval. To generate the concept vocabularies, a statistical model is built by utilizing Support Vector Machine (SVM)-based classification techniques. The images are represented as “bag of concepts” that comprise perceptually and/or semantically distinguishable color and texture patches from local image regions in a multi-dimensional feature space. To explore the correlation between the concepts and overcome the assumption of feature independence in this model, we propose query expansion techniques in the image domain from a new perspective based on both local and global analysis. For the local analysis, the correlations between the concepts based on the co-occurrence pattern, and the metrical constraints based on the neighborhood proximity between the concepts in encoded images, are analyzed by considering local feedback information. We also analyze the concept similarities in the collection as a whole in the form of a similarity thesaurus and propose an efficient query expansion based on the global analysis. The experimental results on a photographic collection of natural scenes and a biomedical database of different imaging modalities demonstrate the effectiveness of the proposed framework in terms of precision and recall.
PMCID: PMC3150552  PMID: 21822350
Image Retrieval; Vector Space Model; Support Vector Machine; Relevance Feedback; Query Expansion
16.  Towards a Framework for Developing Semantic Relatedness Reference Standards 
Journal of biomedical informatics  2010;44(2):251-265.
Our objective is to develop a framework for creating reference standards for functional testing of computerized measures of semantic relatedness. Currently, research on computerized approaches to semantic relatedness between biomedical concepts relies on reference standards created for specific purposes using a variety of methods for their analysis. In most cases, these reference standards are not publicly available and the published information provided in manuscripts that evaluate computerized semantic relatedness measurement approaches is not sufficient to reproduce the results. Our proposed framework is based on the experiences of medical informatics and computational linguistics communities and addresses practical and theoretical issues with creating reference standards for semantic relatedness. We demonstrate the use of the framework on a pilot set of 101 medical term pairs rated for semantic relatedness by 13 medical coding experts. While the reliability of this particular reference standard is in the “moderate” range; we show that using clustering and factor analyses offers a data-driven approach to finding systematic differences among raters and identifying groups of potential outliers. We test two ontology-based measures of relatedness and provide both the reference standard containing individual ratings and the R program used to analyze the ratings as open-source. Currently, these resources are intended to be used to reproduce and compare results of studies involving computerized measures of semantic relatedness. Our framework may be extended to the development of reference standards in other research areas in medical informatics including automatic classification, information retrieval from medical records and vocabulary/ontology development.
PMCID: PMC3063326  PMID: 21044697
semantic relatedness; reference standards; reliability; inter-annotator agreement
17.  A Maximum-Entropy approach for accurate document annotation in the biomedical domain 
Journal of Biomedical Semantics  2012;3(Suppl 1):S2.
The increasing number of scientific literature on the Web and the absence of efficient tools used for classifying and searching the documents are the two most important factors that influence the speed of the search and the quality of the results. Previous studies have shown that the usage of ontologies makes it possible to process document and query information at the semantic level, which greatly improves the search for the relevant information and makes one step further towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).
The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, robust to the ambiguity of terms, and can provide very good performance even when a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) for the full range of the explored terms (4,078 MeSH terms), and that the algorithm’s performance is resilient to terms’ ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) in the explored MeSH terms which were found to be ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the results of the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach performed with higher F-Measure in both ambiguous and monosemous MeSH terms.
PMCID: PMC3337257  PMID: 22541593
18.  An Experiment Comparing Lexical and Statistical Methods for Extracting MeSH Terms from Clinical Free Text 
Abstract Objective: A primary goal of the University of Pittsburgh's 1990-94 UMLS-sponsored effort was to develop and evaluate PostDoc (a lexical indexing system) and Pindex (a statistical indexing system) comparatively, and then in combination as a hybrid system. Each system takes as input a portion of the free text from a narrative part of a patient's electronic medical record and returns a list of suggested MeSH terms to use in formulating a Medline search that includes concepts in the text. This paper describes the systems and reports an evaluation. The intent is for this evaluation to serve as a step toward the eventual realization of systems that assist healthcare personnel in using the electronic medical record to construct patient-specific searches of Medline.
Design: The authors tested the performances of PostDoc, Pindex, and a hybrid system, using text taken from randomly selected clinical records, which were stratified to include six radiology reports, six pathology reports, and six discharge summaries. They identified concepts in the clinical records that might conceivably be used in performing a patient-specific Medline search. Each system was given the free text of each record as an input. The extent to which a system-derived list of MeSH terms captured the relevant concepts in these documents was determined based on blinded assessments by the authors.
Results: PostDoc output a mean of approximately 19 MeSH terms per report, which included about 40% of the relevant report concepts. Pindex output a mean of approximately 57 terms per report and captured about 45% of the relevant report concepts. A hybrid system captured approximately 66% of the relevant concepts and output about 71 terms per report.
Conclusion: The outputs of PostDoc and Pindex are complementary in capturing MeSH terms from clinical free text. The results suggest possible approaches to reduce the number of terms output while maintaining the percentage of terms captured, including the use of UMLS semantic types to constrain the output list to contain only clinically relevant MeSH terms.
PMCID: PMC61276  PMID: 9452986
19.  LOINC and SNOMED CT Code Use in Electronic Laboratory Reporting—US, 2011 
To examine the use of LOINC and SNOMED CT codes for coding laboratory orders and results in laboratory reports sent from 63 non-federal hospitals to the BioSense Program in calendar year 2011.
Monitoring laboratory test reports could aid disease surveillance by adding diagnostic specificity to early warning signals and thus improving the efficiency of public health investigation of detected signals. Laboratory data could also be employed to direct and evaluate interventions and countermeasures, while monitoring outbreak trends and progress; this would ultimately result in better outbreak response and management, and enhanced situation awareness. Since Electronic Laboratory Reporting (ELR) has the potential to be more accurate, timely, and cost-effective than reporting by other means of communication (e.g., mail, fax, etc.), ELR adoption has been systematically promoted as a public health priority. However, the continuing use of non-standard, local codes or text to represent laboratory test type and results complicates the use of ELR data in public health practice. Use of structured, unique, and widely available coding system(s) to support the concepts represented by locally assigned laboratory test order and result information improves the computational characteristics of ELR data. Out of several coding strategies available, the Office of the U.S. National Coordinator for Health Information Technology has recently suggested incorporating Logical Observation Identifiers Names and Codes (LOINC) for laboratory orders and Systemized Nomenclature of Medicine- Clinical Terms (SNOMED CT) codes for laboratory results to standardize ELR.
We assessed the use of LOINC and SNOMED CT codes in laboratory data reported to BioSense, a near real-time national-level, electronic syndromic surveillance system, managed by the Centers for Disease Control and Prevention. ELR data reported by 63 non-federal hospitals to BioSense in 2011 were analyzed to examine LOINC and SNOMED CT use in coding laboratory orders and results. We used Relma software, developed and distributed by Regenstrief Institute Inc for identifying LOINC codes.
In 2011, a total of 14,028,774 laboratory test order or result reports from 821,108 individual patients were reported from the 63 hospitals in 14 states. Since, by design the BioSense Program monitors a select set of syndromes mainly representing infectious conditions, 94% of the total reports were microbiology test orders or results. Seventy-seven percent of all test orders (n = 10,776,494) used LOINC codes. Of all test results with at least one value either in observation identifier (OBX3) or observation value (OBX5) segments of their Health Level 7 (HL7) ELR message (n = 12,313,952), 81% had only LOINC codes, 0.1% had only SNOMED codes, 7% had both LOINC and SNOMED codes, and 12% used no codes. In total, 1,428 unique LOINC and 608 unique SNOMED codes were used to describe the results, and 805 unique LOINC codes were used to describe the orders. Of the 608 unique SNOMED codes, 111 (18.3%) did not have corresponding LOINC codes. Fifty-one (46%) of these 111 SNOMED codes could have been matched to corresponding LOINC codes based on the concept. However, our search for matching LOINC codes in Relma for certain SNOMED concepts indicated that LOINC does not have codes for select types of laboratory test results, particularly qualifier (such as reactive, negative, and resistant) or structural (labia, urethra, and vagina) concepts.
Our analysis showed that the use of SNOMED CT codes for laboratory test results by non-federal hospitals reporting laboratory data to BioSense was extremely limited. These hospitals more frequently used LOINC codes than SNOMED CT in reporting test results. We found that a large percentage of test results with SNOMED CT codes could be represented by LOINC codes that exactly or closely match SNOMED CT codes. Using LOINC codes to report both test order and results in these databases could increase the availability and use of laboratory data in public health and surveillance activities. However, to increase the sensitivity of the coding further, a small number of tests could benefit by using LOINC along with SNOMED CT codes. Evaluation of use of syndromic surveillance case definitions that incorporate laboratory result information is required to determine if it improves syndromic surveillance performance for enhanced outbreak detection or improved situation awareness.
PMCID: PMC3692777
LOINC; SNOMED; Laboratory reporting; ELR
20.  Evaluating the Coverage of Controlled Health Data Terminologies 
Objective: To determine the extent to which a combination of existing machine-readable health terminologies cover the concepts and terms needed for a comprehensive controlled vocabulary for health information systems by carrying out a distributed national experiment using the Internet and the UMLS Knowledge Sources, lexical programs, and server.
Methods: Using a specially designed Web-based interface to the UMLS Knowledge Source Server, participants searched the more than 30 vocabularies in the 1996 UMLS Metathesaurus and three planned additions to determine if concepts for which they desired controlled terminology were present or absent. For each term submitted, the interface presented a candidate exact match or a set of potential approximate matches from which the participant selected the most closely related concept. The interface captured a profile of the terms submitted by the participant and for each term searched, information about the concept (if any) selected by the participant. The term information was loaded into a database at NLM for review and analysis and was also available to be downloaded by the participant. A team of subject experts reviewed records to identify matches missed by participants and to correct any obvious errors in relationships. The editors of SNOMED International and the Read Codes were given a random sample of reviewed terms for which exact meaning matches were not found to identify exact matches that were missed or any valid combinations of concepts that were synonymous to input terms. The 1997 UMLS Metathesaurus was used in the semantic type and vocabulary source analysis because it included most of the three planned additions.
Results: Sixty-three participants submitted a total of 41,127 terms, which represented 32,679 normalized strings. More than 80% of the terms submitted were wanted for parts of the patient record related to the patient's condition. Following review, 58% of all submitted terms had exact meaning matches in the controlled vocabularies in the test, 41% had related concepts, and 1% were not found. Of the 28% of the terms which were narrower in meaning than a concept in the controlled vocabularies, 86% shared lexical items with the broader concept, but had additional modification. The percentage of exact meaning matches varied by specialty from 45% to 71%. Twenty-nine different vocabularies contained meanings for some of the 23,837 terms (a maximum of 12,707 discrete concepts) with exact meaning matches. Based on preliminary data and analysis, individual vocabularies contained <1% to 63% of the terms and <1% to 54% of the concepts. Only SNOMED International and the Read Codes had more than 60% of the terms and more than 50% of the concepts.
Conclusions: The combination of existing controlled vocabularies included in the test represents the meanings of the majority of the terminology needed to record patient conditions, providing substantially more exact matches than any individual vocabulary in the set. From a technical and organizational perspective, the test was successful and should serve as a useful model, both for distributed input to the enhancement of controlled vocabularies and for other kinds of collaborative informatics research.
PMCID: PMC61267  PMID: 9391936
21.  Semantically linking and browsing PubMed abstracts with gene ontology 
BMC Genomics  2008;9(Suppl 1):S10.
The technological advances in the past decade have lead to massive progress in the field of biotechnology. The documentation of the progress made exists in the form of research articles. The PubMed is the current most used repository for bio-literature. PubMed consists of about 17 million abstracts as of 2007 that require methods to efficiently retrieve and browse large volume of relevant information. The State-of-the-art technologies such as GOPubmed use simple keyword-based techniques for retrieving abstracts from the PubMed and linking them to the Gene Ontology (GO). This paper changes the paradigm by introducing semantics enabled technique to link the PubMed to the Gene Ontology, called, SEGOPubmed for ontology-based browsing. Latent Semantic Analysis (LSA) framework is used to semantically interface PubMed abstracts to the Gene Ontology.
The Empirical analysis is performed to compare the performance of the SEGOPubmed with the GOPubmed. The analysis is initially performed using a few well-referenced query words. Further, statistical analysis is performed using GO curated dataset as ground truth. The analysis suggests that the SEGOPubmed performs better than the classic GOPubmed as it incorporates semantics.
The LSA technique is applied on the PubMed abstracts obtained based on the user query and the semantic similarity between the query and the abstracts. The analyses using well-referenced keywords show that the proposed semantic-sensitive technique outperformed the string comparison based techniques in associating the relevant abstracts to the GO terms. The SEGOPubmed also extracted the abstracts in which the keywords do not appear in isolation (i.e. they appear in combination with other terms) that could not be retrieved by simple term matching techniques.
PMCID: PMC2386052  PMID: 18366599
22.  Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE Network experience 
Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and more effective gene-based disease management. A key aspect in facilitating such studies requires standardized representation of the phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, as well as how their adoption in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis.
The authors mapped phenotype data dictionaries from five different eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases such as peripheral arterial disease and type 2 diabetes. For mapping, standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine), were used. The mapping process comprised both lexical (via searching for relevant pre-coordinated concepts and data elements) and semantic (via post-coordination) techniques. Where feasible, new data elements were curated to enhance the coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies.
Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts before any additional curation of terminology and metadata resources was initiated by eMERGE investigators. After curation of 54 new caDSR CDEs and nine new NCI thesaurus concepts and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT. A web-based tool was also implemented to assist in semi-automatic mapping of data elements.
This study emphasizes the requirement for standardized representation of clinical research data using existing metadata and terminology resources and provides simple techniques and software for data element mapping using experiences from the eMERGE Network.
PMCID: PMC3128396  PMID: 21597104
Ritu and pupu and 12; informatics; ontologies; knowledge representations; controlled terminologies and vocabularies; machine learning; terminologies; metadata; mapping; harmonization; eMERGE Network
23.  Using SNOMED CT to Represent Two Interface Terminologies 
Interface terminologies are designed to support interactions between humans and structured medical information. In particular, many interface terminologies have been developed for structured computer based documentation systems. Experts and policy-makers have recommended that interface terminologies be mapped to reference terminologies. The goal of the current study was to evaluate how well the reference terminology SNOMED CT could map to and represent two interface terminologies, MEDCIN and the Categorical Health Information Structured Lexicon (CHISL).
Automated mappings between SNOMED CT and 500 terms from each of the two interface terminologies were evaluated by human reviewers, who also searched SNOMED CT to identify better mappings when this was judged to be necessary. Reviewers judged whether they believed the interface terms to be clinically appropriate, whether the terms were covered by SNOMED CT concepts and whether the terms' implied semantic structure could be represented by SNOMED CT.
Outcomes included concept coverage by SNOMED CT for study terms and their implied semantics. Agreement statistics and compositionality measures were calculated.
The SNOMED CT terminology contained concepts to represent 92.4% of MEDCIN and 95.9% of CHISL terms. Semantic structures implied by study terms were less well covered, with some complex compositional expressions requiring semantics not present in SNOMED CT. Among sampled terms, those from MEDCIN were more complex than those from CHISL, containing an average 3.8 versus 1.8 atomic concepts respectively, p<0.001.
Our findings support using SNOMED CT to provide standardized representations of information created using these two terminologies, but suggest that enriching SNOMED CT semantics would improve representation of the external terms.
PMCID: PMC2605600  PMID: 18952944
24.  Visual Exploratory Search of Relationship Graphs on Smartphones 
PLoS ONE  2013;8(11):e79379.
This paper presents a novel framework for Visual Exploratory Search of Relationship Graphs on Smartphones (VESRGS) that is composed of three major components: inference and representation of semantic relationship graphs on the Web via meta-search, visual exploratory search of relationship graphs through both querying and browsing strategies, and human-computer interactions via the multi-touch interface and mobile Internet on smartphones. In comparison with traditional lookup search methodologies, the proposed VESRGS system is characterized with the following perceived advantages. 1) It infers rich semantic relationships between the querying keywords and other related concepts from large-scale meta-search results from Google, Yahoo! and Bing search engines, and represents semantic relationships via graphs; 2) the exploratory search approach empowers users to naturally and effectively explore, adventure and discover knowledge in a rich information world of interlinked relationship graphs in a personalized fashion; 3) it effectively takes the advantages of smartphones’ user-friendly interfaces and ubiquitous Internet connection and portability. Our extensive experimental results have demonstrated that the VESRGS framework can significantly improve the users’ capability of seeking the most relevant relationship information to their own specific needs. We envision that the VESRGS framework can be a starting point for future exploration of novel, effective search strategies in the mobile Internet era.
PMCID: PMC3817041  PMID: 24223936
25.  Semantic similarity in the biomedical domain: an evaluation across knowledge sources 
BMC Bioinformatics  2012;13:261.
Semantic similarity measures estimate the similarity between concepts, and play an important role in many text processing tasks. Approaches to semantic similarity in the biomedical domain can be roughly divided into knowledge based and distributional based methods. Knowledge based approaches utilize knowledge sources such as dictionaries, taxonomies, and semantic networks, and include path finding measures and intrinsic information content (IC) measures. Distributional measures utilize, in addition to a knowledge source, the distribution of concepts within a corpus to compute similarity; these include corpus IC and context vector methods. Prior evaluations of these measures in the biomedical domain showed that distributional measures outperform knowledge based path finding methods; but more recent studies suggested that intrinsic IC based measures exceed the accuracy of distributional approaches. Limitations of previous evaluations of similarity measures in the biomedical domain include their focus on the SNOMED CT ontology, and their reliance on small benchmarks not powered to detect significant differences between measure accuracy. There have been few evaluations of the relative performance of these measures on other biomedical knowledge sources such as the UMLS, and on larger, recently developed semantic similarity benchmarks.
We evaluated knowledge based and corpus IC based semantic similarity measures derived from SNOMED CT, MeSH, and the UMLS on recently developed semantic similarity benchmarks. Semantic similarity measures based on the UMLS, which contains SNOMED CT and MeSH, significantly outperformed those based solely on SNOMED CT or MeSH across evaluations. Intrinsic IC based measures significantly outperformed path-based and distributional measures. We released all code required to reproduce our results and all tools developed as part of this study as open source, available under We provide a publicly-accessible web service to compute semantic similarity, available under
Knowledge based semantic similarity measures are more practical to compute than distributional measures, as they do not require an external corpus. Furthermore, knowledge based measures significantly and meaningfully outperformed distributional measures on large semantic similarity benchmarks, suggesting that they are a practical alternative to distributional measures. Future evaluations of semantic similarity measures should utilize benchmarks powered to detect significant differences in measure accuracy.
PMCID: PMC3533586  PMID: 23046094
Semantic similarity; Information content; Information theory; Biomedical ontologies

Results 1-25 (784357)