REDCap (Research Electronic Data Capture) is a web-based software solution and tool set that allows biomedical researchers to create secure online forms for data capture, management, and analysis with minimal effort and training. The Shared Data Instrument Library (SDIL) is a relatively new component of REDCap that allows sharing of commonly used data collection instruments for immediate study use by research teams. Objectives of the SDIL project include: 1) facilitating reuse of data dictionaries and reducing duplication of effort; 2) promoting the use of validated data collection instruments, data standards, and best practices; and 3) promoting research collaboration and data sharing. Instruments submitted to the library are reviewed by a library oversight committee, with rotating membership from multiple institutions, which ensures the quality, relevance, and legality of shared instruments. The design allows researchers to download the instruments in a consumable electronic format within the REDCap environment. At the time of this writing, the SDIL contains over 128 data collection instruments, and over 2,500 instances of instruments have been downloaded by researchers at multiple institutions. In this paper we describe the library platform, detail the experience gained during the first 25 months of sharing public domain instruments, and provide evidence of the SDIL's impact across the REDCap consortium research community. We postulate that the shared library of instruments reduces the burden of adhering to sound data collection principles while promoting best practices.
Validated instruments; Data Collection; Data Sharing; Informatics; Translational Research; REDCap
The Pharmacogenomics Research Network (PGRN) is a collaborative partnership of research groups funded by the NIH to discover and understand how the genome contributes to an individual's response to medication. Since traditional biomedical research studies and clinical trials are often conducted independently, common and standardized representations for data are seldom used. This leads to heterogeneity in data representation, which hinders data reuse, data integration, and meta-analyses.
This study demonstrates harmonization and semantic annotation of pharmacogenomics data dictionaries collected from PGRN research groups. A semi-automated system was developed to support the harmonization and annotation process, which comprises four steps: 1) pre-processing PGRN variables; 2) decomposing and normalizing variable descriptions; 3) semantically annotating words and phrases using controlled terminologies; and 4) grouping PGRN variables into categories based on the annotation results and semantic types. In total, 1,514 PGRN variables were processed.
Our results demonstrate that there is a significant amount of variability in how pharmacogenomics data are represented and that additional standardization efforts are needed. This represents a critical first step toward identifying and creating data standards for pharmacogenomics studies.
Data harmonization; semantic annotation; Pharmacogenomics
Though genome-wide technologies, such as microarrays, are widely used, data from these methods are considered noisy, and success in downstream biological validation remains variable. We report a method that increases the likelihood of successfully validating microarray findings using real-time RT-PCR, including genes at low expression levels and with small differences. We use a Bayesian network to identify the most relevant sources of noise, based on the successes and failures in validation for an initial set of selected genes, and then improve our subsequent selection of genes for validation by eliminating these sources of noise. The network displays the significant sources of noise in an experiment and scores the likelihood of validation for every gene. We show how the method can significantly increase validation success rates. In conclusion, in this study we have successfully added a new automated step to determine the contributory sources of noise that determine successful or unsuccessful downstream biological validation.
Bioinformatics; Bayesian network; Microarray; RT-PCR; Microarray data
To identify Common Data Elements (CDEs) in eligibility criteria of multiple clinical trials studying the same disease using a human-computer collaborative approach.
A set of free-text eligibility criteria from clinical trials on two representative diseases, breast cancer and cardiovascular diseases, was sampled to identify disease-specific eligibility criteria CDEs. In this proposed approach, a semantic annotator is used to recognize Unified Medical Language Systems (UMLS) terms within the eligibility criteria text. The Apriori algorithm is applied to mine frequent disease-specific UMLS terms, which are then filtered by a list of preferred UMLS semantic types, grouped by similarity based on the Dice coefficient, and, finally, manually reviewed.
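The grouping step above can be sketched with the standard Dice coefficient over token sets; the example terms and the threshold below are illustrative, not the study's actual parameters.

```python
def dice(a, b):
    """Dice coefficient between two strings treated as token sets:
    2*|A∩B| / (|A|+|B|)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def group_terms(terms, threshold=0.6):
    """Greedily group terms whose Dice similarity to a group's seed term
    meets the threshold (a simplification of similarity-based grouping)."""
    groups = []
    for t in terms:
        for g in groups:
            if dice(t, g[0]) >= threshold:
                g.append(t)
                break
        else:
            groups.append([t])
    return groups

groups = group_terms(["breast carcinoma", "carcinoma of breast",
                      "myocardial infarction"])
```

Here the two breast-cancer phrasings collapse into one group while the unrelated term starts its own, mirroring how near-duplicate UMLS terms would be merged before manual review.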
Standard precision, recall, and F-score of the CDEs recommended by the proposed approach were measured with respect to manually identified CDEs.
Average precision and recall of the recommended CDEs for the two diseases were 0.823 and 0.797, respectively, leading to an average F-score of 0.810. In addition, the machine-powered CDEs covered 80% of the cardiovascular CDEs published by The American Heart Association and assigned by human experts.
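The reported F-score follows from the standard harmonic-mean formula, which can be checked directly:

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f = f_score(0.823, 0.797)  # ≈ 0.810, matching the reported average
```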
It is feasible and effort saving to use a human-computer collaborative approach to augment domain experts for identifying disease-specific CDEs from free-text clinical trial eligibility criteria.
Clinical Research Informatics; Clinical Trial Eligibility Criteria; Common Data Elements; Knowledge Management; Human-Computer Collaboration; Text Mining
Terminologies and ontologies are increasingly prevalent in health care and biomedicine. However, they suffer from inconsistent renderings, distribution formats, and syntax that make building applications on common terminology services challenging. To address the problem, one could posit a shared representation syntax, associated schema, and tags. We identified a set of commonly used elements in biomedical ontologies and terminologies based on our experience with the Common Terminology Services 2 (CTS2) Specification as well as the Lexical Grid (LexGrid) project. We propose guidelines for precisely such a shared terminology model, and recommend tags assembled from SKOS, OWL, Dublin Core, RDF Schema, and DCMI meta-terms. We divide these guidelines into lexical information (e.g., synonyms and definitions) and semantic information (e.g., hierarchies), the latter distinguished for use by informal terminologies vs. formal ontologies. We then evaluate the guidelines with a spectrum of widely used terminologies and ontologies to examine how the lexical guidelines are implemented, and whether our proposed guidelines would enhance interoperability.
Biomedical Ontology; Terminology; W3C; OWL; RDF; Ontology Representation Guidelines
When new concepts are inserted into the UMLS, they are assigned one or several semantic types from the UMLS Semantic Network by the UMLS editors. However, not every combination of semantic types is permissible. It was observed that many concepts with rare combinations of semantic types have erroneous semantic type assignments or prohibited combinations of semantic types. The correction of such errors is resource-intensive.
We design a computational system to inform UMLS editors as to whether a specific combination of two, three, four, or five semantic types is permissible, prohibited, or questionable.
We identify a set of inclusion and exclusion instructions in the UMLS Semantic Network documentation and derive corresponding rule-categories, along with additional rule-categories derived from UMLS concept content. We then design an algorithm, adviseEditor, based on these rule-categories. The algorithm specifies how an editor should proceed when considering a tuple (pair, triple, quadruple, or quintuple) of semantic types to be assigned to a concept.
Eight rule-categories were identified. A Web-based system was developed to implement the adviseEditor algorithm, which returns, for an input combination of semantic types, whether it is permitted, prohibited, or (in a few cases) requires more research. The number of semantic type pairs assigned to each rule-category is reported, and illustrative examples are given for each. Cases of semantic type assignments that contradict the rules, including recently introduced ones, are listed.
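At its core, such a system maps a combination of semantic types to a verdict. The rule entries below are invented for illustration only and are not the actual adviseEditor rule base:

```python
# Hypothetical rule table: a combination of semantic types maps to a verdict.
# frozenset keys make the lookup order-insensitive, matching the idea that
# a combination (not a sequence) of types is being judged.
RULES = {
    frozenset({"Disease or Syndrome", "Neoplastic Process"}): "prohibited",
    frozenset({"Pharmacologic Substance", "Organic Chemical"}): "permitted",
}

def advise(semantic_types):
    """Return the verdict for a combination of semantic types, defaulting
    to 'requires review' when no rule covers the combination."""
    return RULES.get(frozenset(semantic_types), "requires review")
```

An editor tool would consult this lookup before committing a new combination, flagging prohibited or unreviewed tuples.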
The adviseEditor system implements explicit and implicit knowledge available in the UMLS in a system that informs UMLS editors about the permissibility of a desired combination of semantic types. Using adviseEditor might help accelerate the work of the UMLS editors and prevent erroneous semantic type assignments.
UMLS; Semantic Network; Metathesaurus; UMLS Editing; Semantic types; Semantic type definitions; Usage notes; Concept Insertion
We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management decisions using past patient cases stored in electronic health records (EHRs). Our hypothesis is that a patient-management decision that is unusual with respect to past patient care may be due to an error and that it is worthwhile to generate an alert if such a decision is encountered. We evaluate this hypothesis using data obtained from EHRs of 4,486 post-cardiac surgical patients and a subset of 222 alerts generated from the data. We base the evaluation on the opinions of a panel of experts. The results of the study support our hypothesis that the outlier-based alerting can lead to promising true alert rates. We observed true alert rates that ranged from 25% to 66% for a variety of patient-management actions, with 66% corresponding to the strongest outliers.
machine learning; clinical alerting; conditional outlier detection; medical errors
To measure the rate of non-publication and assess possible publication bias in clinical trials of electronic health records.
We searched ClinicalTrials.gov to identify registered clinical trials of electronic health records and searched the biomedical literature and contacted trial investigators to determine whether the results of the trials were published. Publications were judged as positive, negative, or neutral according to the primary outcome.
76% of trials had publications describing trial results; of these, 74% were positive, 21% were neutral, and 4% were negative (harmful). Of unpublished studies for which the investigator responded, 43% were positive, 57% were neutral, and none were negative; the lower rate of positive results was significant (p<0.001).
The rate of non-publication in electronic health record studies is similar to that in other biomedical studies. There appears to be a bias toward publication of positive trials in this domain.
In the modern world, people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable, well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well-formed, high-quality biomedical phrases in MEDLINE documents. The main approaches previously used for detecting such phrases are syntactic, statistical, and a hybrid approach combining the two. In this paper we propose a supervised learning approach for identifying high-quality phrases. First we obtain a set of known well-formed, useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that contain no stop words or punctuation. We believe this unlabeled set contains many well-formed phrases, and our goal is to identify these additional high-quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. A proper choice of machine learning methods and features identifies, within the large collection, strings that are likely to be high-quality phrases. We evaluate our approach by obtaining human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of the extracted phrase candidates are judged to be of high quality.
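The candidate-extraction step, collecting multiword strings free of stop words and punctuation, can be sketched as follows; the stop-word list and window size here are illustrative, not the study's.

```python
import re

# Abbreviated stop-word list for illustration; a real system would use a
# much larger list.
STOP_WORDS = {"the", "of", "and", "in", "for", "with", "a", "an"}

def candidate_phrases(text, n=2):
    """Extract n-word strings containing no stop words or punctuation,
    as candidates for the supervised phrase classifier."""
    tokens = text.lower().split()
    out = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Keep only purely alphabetic, non-stop-word tokens.
        if all(re.fullmatch(r"[a-z]+", w) and w not in STOP_WORDS for w in window):
            out.append(" ".join(window))
    return out

cands = candidate_phrases("Expression of tumor necrosis factor in mice.")
```

The classifier would then score candidates such as these, separating genuine phrases from incidental word co-occurrences.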
machine learning; imbalanced data; biomedical phrases; statistical phrase identification; Unified Medical Language System; abbreviation full forms
Emphasis on participatory medicine requires that patients and consumers participate in tasks traditionally reserved for healthcare providers. This includes reading and comprehending medical documents, often but not necessarily in the context of interacting with Personal Health Records (PHRs). Research suggests that while giving patients access to medical documents has many benefits (e.g., improved patient-provider communication), lay people often have difficulty understanding medical information. Informatics can address the problem by developing tools that support comprehension; this requires in-depth understanding of the nature and causes of errors that lay people make when comprehending clinical documents. The objective of this study was to develop a classification scheme of comprehension errors, based on lay individuals’ retellings of two documents containing clinical text: a description of a clinical trial and a typical office visit note. While not comprehensive, the scheme can serve as a foundation of further development of a taxonomy of patients’ comprehension errors. Eighty participants, all healthy volunteers, read and retold two medical documents. A data-driven content analysis procedure was used to extract and classify retelling errors. The resulting hierarchical classification scheme contains nine categories and twenty-three subcategories. The most common error made by the participants involved incorrectly recalling brand names of medications. Other common errors included misunderstanding clinical concepts, misreporting the objective of a clinical research study and physician’s findings during a patient’s visit, and confusing and misspelling clinical terms. A combination of informatics support and health education is likely to improve the accuracy of lay comprehension of medical documents.
Information Literacy; Clinical Documentation; Consumer Health; Patients; Classification; Content Analysis
Auditing healthcare terminologies for errors requires human experts. In this paper, we present a study of the performance of auditors looking for errors in the semantic type assignments of complex UMLS concepts. In this study, concepts are considered complex whenever they are assigned combinations of semantic types. Past research has shown that complex concepts have a higher likelihood of errors. The results of this study indicate that individual auditors are not reliable when auditing such concepts and their performance is low, according to various metrics. These results confirm the outcomes of an earlier pilot study. They imply that to achieve an acceptable level of reliability and performance, when auditing such concepts of the UMLS, several auditors need to be assigned the same task. A mechanism is then needed to combine the possibly differing opinions of the different auditors into a final determination. In the current study, in contrast to our previous work, we used a majority mechanism for this purpose. For a sample of 232 complex UMLS concepts, the majority opinion was found reliable and its performance for accuracy, recall, precision and the F-measure was found statistically significantly higher than the average performance of individual auditors.
Auditing of terminologies; UMLS auditing; Semantic type assignments; Auditor performance; Auditor reliability; Auditing process; UMLS curators; Auditing errors; Quality assurance; Quality assurance evaluation; Auditing performance study
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing (PSI), is utilized to empirically identify sequences of relationships known as "discovery patterns", such as "drug x INHIBITS substance y, substance y CAUSES disease z", that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held-out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role for these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues.
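The bind-and-release operations underlying this style of inference can be sketched with binary hypervectors, where elementwise XOR serves as a self-inverse binding operator. Note that PSI itself uses different vector symbolic operators, so this is only a minimal illustration of the idea.

```python
import random

D = 10_000  # dimensionality of the hypervectors

def rand_hv(rng):
    """Random binary hypervector; random vectors in high dimensions are
    nearly orthogonal (similarity ≈ 0.5)."""
    return [rng.randint(0, 1) for _ in range(D)]

def bind(a, b):
    """Elementwise XOR binding; it is its own inverse, so
    bind(bind(a, b), b) recovers a exactly."""
    return [x ^ y for x, y in zip(a, b)]

def hamming_sim(a, b):
    """Similarity as the fraction of matching bits (1.0 = identical)."""
    return sum(x == y for x, y in zip(a, b)) / D

rng = random.Random(42)
inhibits, causes = rand_hv(rng), rand_hv(rng)
# A two-step "INHIBITS ... CAUSES" pattern can be encoded as one bound vector:
pattern = bind(inhibits, causes)
# Releasing one relation from the pattern recovers the other:
recovered = bind(pattern, causes)
```

This is what makes pattern-constrained search cheap: a multi-relation path collapses into a single vector, and candidate relations can be tested by release and similarity comparison rather than graph traversal.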
Distributional semantics; Literature-based discovery; Predication-based Semantic Indexing; Vector symbolic architectures
Abbreviations are widely used in clinical documents and are often ambiguous. Building a list of possible senses (also called a sense inventory) for each ambiguous abbreviation is the first step toward automatically identifying the correct meanings of abbreviations in given contexts. Clustering-based methods have been used to detect senses of abbreviations from a clinical corpus. However, rare senses remain challenging, and existing algorithms are not good enough to detect them. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation at two different levels of annotation cost (10 vs. 20 instances per abbreviation). Our results showed that the TCRS-based method detected 85% of senses on average, while the EM-based method found only 75%, given similar annotation effort (about 20 instances). Further analysis demonstrated that the improvement from the TCRS method came mainly from additionally detected rare senses, indicating its usefulness for building more complete sense inventories of clinical abbreviations.
Natural language processing; Word sense discrimination; Clustering; Clinical abbreviations
We introduce two concepts: the Query Web as a layer of interconnected queries over the document web and the semantic web, and a Query Web Integrator and Manager (QI) that enables the Query Web to evolve. QI permits users to write, save and reuse queries over any web accessible source, including other queries saved in other installations of QI. The saved queries may be in any language (e.g. SPARQL, XQuery); the only condition for interconnection is that the queries return their results in some form of XML. This condition allows queries to chain off each other, and to be written in whatever language is appropriate for the task. We illustrate the potential use of QI for several biomedical use cases, including ontology view generation using a combination of graph-based and logical approaches, value set generation for clinical data management, image annotation using terminology obtained from an ontology web service, ontology-driven brain imaging data integration, small-scale clinical data integration, and wider-scale clinical data integration. Such use cases illustrate the current range of applications of QI and lead us to speculate about the potential evolution from smaller groups of interconnected queries into a larger query network that layers over the document and semantic web. The resulting Query Web could greatly aid researchers and others who now have to manually navigate through multiple information sources in order to answer specific questions.
Ontologies; data integration; semantic web; query web
The complexity and rapid growth of genetic data demand investment in information technology to support effective use of this information. Creating infrastructure to communicate genetic information to health care providers and enable them to manage that data can positively affect a patient's care in many ways. However, genetic data are complex and present many challenges. We report on the usability of a novel application designed to assist providers in receiving and managing a patient's genetic profile, including ongoing updated interpretations of the genetic variants in those patients. Because these interpretations are constantly evolving, managing them represents a challenge. We conducted usability tests with potential users of this application and reported findings to the application development team, many of which were addressed in subsequent versions. Clinicians were excited about the value this tool provides in pushing out variant updates to providers and overall gave the application high usability ratings, but had some difficulty interpreting elements of the interface. Many of the issues identified required relatively little development effort to fix, suggesting that consistently incorporating this type of analysis in the development process can be highly beneficial. For genetic decision support applications, our findings suggest the importance of designing a system that can deliver the most current knowledge and highlight the significance of new genetic information for clinical care. Our results demonstrate that using a user-focused development and design process helped optimize the value of this application for personalized medicine.
clinical decision support; electronic health records; genomics; personalized medicine
In this study we present novel feature engineering techniques that leverage the biomedical domain knowledge encoded in the Unified Medical Language System (UMLS) to improve machine-learning based clinical text classification. Critical steps in clinical text classification include identification of features and passages relevant to the classification task, and representation of clinical text to enable discrimination between documents of different classes. We developed novel information-theoretic techniques that utilize the taxonomical structure of the UMLS to improve feature ranking, and we developed a semantic similarity measure that projects clinical text into a feature space that improves classification. We evaluated these methods on the 2008 Informatics for Integrating Biology and the Bedside (i2b2) obesity challenge. The methods we developed improve upon the results of this challenge's top machine-learning based system, and may improve the performance of other machine-learning based clinical text classification systems. We have released all tools developed as part of this study as open source, available at http://code.google.com/p/ytex.
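One common way to build a taxonomy-based similarity measure of the kind described is information content (IC) combined with Lin's formula. The toy is-a taxonomy below is invented and merely stands in for the UMLS; the study's actual measure may differ.

```python
import math

# Hypothetical concept frequencies in a toy is-a taxonomy rooted at "entity".
# Each concept's count includes its descendants', as IC computation requires.
FREQ = {"entity": 100, "disease": 40, "diabetes": 10, "obesity": 8}
PARENT = {"disease": "entity", "diabetes": "disease", "obesity": "disease"}

def ic(concept):
    """Information content: -log of the concept's relative frequency.
    Rare (specific) concepts carry more information."""
    return -math.log(FREQ[concept] / FREQ["entity"])

def ancestors(c):
    """The concept itself plus all its taxonomy ancestors."""
    out = {c}
    while c in PARENT:
        c = PARENT[c]
        out.add(c)
    return out

def lin_sim(a, b):
    """Lin similarity: twice the IC of the most informative common ancestor,
    normalized by the concepts' own IC. Ranges from 0 to 1."""
    common = ancestors(a) & ancestors(b)
    ic_lcs = max(ic(c) for c in common)
    return 2 * ic_lcs / (ic(a) + ic(b))

s = lin_sim("diabetes", "obesity")
```

Projecting documents onto such pairwise concept similarities yields a feature space in which taxonomically related concepts reinforce each other instead of being treated as unrelated tokens.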
Natural Language Processing; Document Classification; Semantic Similarity; Feature Selection; Kernel Methods; Information Gain; Information Content
The main objective of this study was to investigate the feasibility of using PharmGKB, a pharmacogenomic database, as a source of training data in combination with the text of MEDLINE abstracts for a text mining approach to identification of potential gene targets for pathway-driven pharmacogenomics research. We used the manually curated relations between drugs and genes in the PharmGKB database to train a support vector machine predictive model and applied this model prospectively to MEDLINE abstracts. The gene targets suggested by this approach were subsequently manually reviewed. Our quantitative analysis showed that a support vector machine classifier trained on MEDLINE abstracts, with single words (unigrams) used as features and PharmGKB relations used for supervision, achieves an overall sensitivity of 85% and specificity of 69%. The subsequent qualitative analysis showed that gene targets "suggested" by the automatic classifier were not anticipated by expert reviewers but were subsequently found to be relevant to the three drugs investigated: carbamazepine, lamivudine, and zidovudine. Our results show that this approach is not only feasible but may also find new gene targets not identifiable by other methods, making it a valuable tool for pathway-driven pharmacogenomics research.
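The classification setup can be sketched with unigram features and a simple linear learner (a perceptron stands in here for the SVM used in the study); the toy abstracts and labels are invented for illustration.

```python
def featurize(text):
    """Unigram (bag-of-words) features as a set of lowercased tokens."""
    return set(text.lower().split())

def train_perceptron(docs, labels, epochs=20):
    """Train a linear model on unigram features; the perceptron update rule
    stands in for SVM training in this sketch."""
    w = {}
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            feats = featurize(doc)
            score = sum(w.get(f, 0.0) for f in feats)
            pred = 1 if score > 0 else 0
            if pred != y:  # misclassified: nudge weights toward the label
                for f in feats:
                    w[f] = w.get(f, 0.0) + (1 if y == 1 else -1)
    return w

def predict(w, text):
    return 1 if sum(w.get(f, 0.0) for f in featurize(text)) > 0 else 0

# Invented training data: 1 = abstract describes a drug-gene relation.
docs = [
    "carbamazepine metabolism involves cyp3a4 induction",
    "zidovudine resistance linked to reverse transcriptase mutation",
    "study of dietary fiber intake in adults",
    "lamivudine response varies with polymerase gene variants",
]
labels = [1, 1, 0, 1]
w = train_perceptron(docs, labels)
```

With supervision derived from curated relations rather than manual annotation, the same pipeline scales to the full set of MEDLINE abstracts.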
pharmacogenomics; text mining; support vector machine; pathway-driven analysis; gene-drug associations; PharmGKB
Recent progress in high-throughput genomic technologies has shifted pharmacogenomic research from candidate gene pharmacogenetics to clinical pharmacogenomics (PGx). Many clinically oriented questions may be asked, such as 'what drug should be prescribed for a patient with mutant alleles?' Typically, answers to such questions can be found in publications mentioning the gene–drug–disease relationships of interest. In this work, we hypothesize that ClinicalTrials.gov is a comparably rich source of PGx-related information. We therefore developed a systematic approach to automatically identify PGx relationships between genes, drugs, and diseases from trial records in ClinicalTrials.gov. In our evaluation, we found that our extracted relationships overlap significantly with the curated factual knowledge in a literature-based PGx database, and that most relationships appear on average five years earlier in clinical trials than in their corresponding publications, suggesting that clinical trials may be valuable both for validating known PGx information and for capturing new information in a more timely manner. Furthermore, two human reviewers judged a portion of the computer-generated relationships and found an overall accuracy of 74% for our text-mining approach. This work has practical implications for enriching existing knowledge of PGx gene–drug–disease relationships as well as suggesting crosslinks between ClinicalTrials.gov and other PGx knowledge bases.
Text mining; Clinical outcome; Pharmacogenomics; Clinical trial
Despite the existence of multiple standards for the coding of biomedical data and the known benefits of doing so, there remain a myriad of biomedical information domain spaces that are essentially un-coded and unstandardized. An arguably worse situation arises when the same or similar information in a given domain is coded to a variety of different standards. Such is the case with cephalometrics: standardized measurements of angles and distances between specified landmarks on X-ray film used for orthodontic treatment planning and a variety of research applications. We describe how we unified the cephalometric definitions from ten existing cephalometric standards into one unifying terminology set using an existing standard (LOINC). Using our example of an open, web-based orthodontic case file system, we describe how this work benefited our project and discuss how adopting or expanding established standards can benefit other similar projects in specialized domains.
Cephalometry; Logical Observation Identifiers Names and Codes; Terminology as Topic; Systems Integration; Information Storage and Retrieval
To support clinical decision-making, computerized information retrieval tools known as "infobuttons" deliver contextually relevant knowledge resources into clinical information systems. The Health Level Seven International (HL7) Context-Aware Knowledge Retrieval (Infobutton) Standard specifies a standard mechanism to enable infobuttons on a large scale.
To examine the experience of organizations in the course of implementing the HL7 Infobutton Standard.
Cross-sectional online survey and in-depth phone interviews.
A total of 17 organizations participated in the study. Analysis of the in-depth interviews revealed 20 recurrent themes. Implementers underscored the benefits, simplicity, and flexibility of the HL7 Infobutton Standard. Yet participants voiced the need for easier access to standard specifications and improved guidance for beginners. Implementers predicted that the Infobutton Standard will be widely or at least fairly well adopted in the next five years, but that uptake will depend largely on adoption among electronic health record (EHR) vendors. To accelerate EHR adoption of the Infobutton Standard, implementers recommended that HL7-compliant infobutton capabilities be included in the United States Meaningful Use Certification Criteria for EHR systems.
Opinions and predictions should be interpreted with caution, since all the participant organizations had successfully implemented the Standard and over half of the organizations were actively engaged in the development of the Standard.
Overall, implementers reported a very positive experience with the HL7 Infobutton Standard. Despite indications of increasing uptake, measures should be taken to stimulate adoption of the Infobutton Standard among EHR vendors. Widespread adoption of the Infobutton Standard has the potential to bring contextually relevant clinical decision support content into the healthcare provider workflow.
information need; health information technology; standard; clinical decision support; electronic health record system; knowledge resource
Mapping medical test names into a standardized vocabulary is a prerequisite to sharing test-related data between healthcare entities. One major barrier in this process is the inability to describe tests in sufficient detail to assign the appropriate name in Logical Observation Identifiers, Names, and Codes (LOINC®). Approaches to address mapping of test names with incomplete information have not been well described. We developed a process of "enhancing" local test names by incorporating information required for LOINC mapping into the test names themselves. When using the Regenstrief LOINC Mapping Assistant (RELMA) we found that 73/198 (37%) of "enhanced" test names were successfully mapped to LOINC, compared to 41/191 (21%) of original names (p=0.001). Our approach led to a significantly higher proportion of test names with successful mapping to LOINC, but further efforts are required to achieve more satisfactory results.
Standardized terminology mapping; LOINC
Interoperable health information exchange depends on adoption of terminology standards, but international use of such standards can be challenging because of language differences between local concept names and the standard terminology. To address this important barrier, we describe the evolution of an efficient process for constructing translations of LOINC term names, the foreign language functions in RELMA, and the current state of translations in LOINC. We also present the development of the Italian translation to illustrate how translation is enabling adoption in international contexts. We built a tool that finds the unique list of LOINC Parts that make up a given set of LOINC terms. This list enables translation of smaller pieces, like the core component "hepatitis c virus", separately from all the suffixes that could appear with it, such as "Ab.IgG", "DNA", and "RNA". We built another tool that generates a translation of a full LOINC name from all of these atomic pieces. As of version 2.36 (June 2011), LOINC terms have been translated from the native English into nine languages across 15 linguistic variants. The five largest linguistic variants have all used the Part-based translation mechanism. However, even with efficient tools and processes, translation of standard terminology is a complex undertaking. Two of the prominent linguistic challenges that translators have faced include the approach to handling acronyms and abbreviations, and the differences in linguistic syntax (e.g., word order) between languages. LOINC's open and customizable approach has enabled many different groups to create translations that met their needs and matched their resources. Distributing the standard and its many language translations at no cost worldwide accelerates LOINC adoption globally, and is an important enabler of interoperable health information exchange.
LOINC; Vocabulary, Controlled; Multilingualism; Translating; Clinical Laboratory Information Systems/standards; Medical Records Systems; Computerized/standards