
Results 1-25 (69)

1.  Evaluating the Impact of Conceptual Knowledge Engineering on the Design and Usability of a Clinical and Translational Science Collaboration Portal 
With the growing prevalence of large-scale, team science endeavors in the biomedical and life science domains, the impetus to implement platforms capable of supporting asynchronous interaction among multidisciplinary groups of collaborators has increased commensurately. However, there is a paucity of literature describing systematic approaches to identifying the information needs of targeted end-users for such platforms, and the translation of such requirements into practicable software component design criteria. In previous studies, we have reported upon the efficacy of employing conceptual knowledge engineering (CKE) techniques to systematically address both of the preceding challenges in the context of complex biomedical applications. In this manuscript we evaluate the impact of CKE approaches on the design of a clinical and translational science collaboration portal, and report preliminary qualitative user-satisfaction findings for the resulting system.
PMCID: PMC3041529  PMID: 21347146
2.  Ontology Mapping and Data Discovery for the Translational Investigator 
An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure. But most available data repositories are designed using standard data warehouse architecture that employs arbitrary data encoding standards, making queries across disparate repositories difficult. In response to these shortcomings we have designed a Health Ontology Mapper (HOM) that translates terminologies into formal data encoding standards without altering the underlying source data. We believe the HOM system promotes inter-institutional data sharing and research collaboration, and will ultimately lower the barrier to developing and using an IDR.
PMCID: PMC3041530  PMID: 21347152
3.  VISAGE: A Query Interface for Clinical Research 
We present the design and implementation of VISAGE (VISual AGgregator and Explorer), a query interface for clinical research. We follow a user-centered development approach and incorporate visual, ontological, searchable and explorative features in three interrelated components: Query Builder, Query Manager and Query Explorer. The Query Explorer provides novel on-line data mining capabilities for purposes such as hypothesis generation or cohort identification. The VISAGE query interface has been implemented as a significant component of Physio-MIMI, an NCRR-funded, multi-CTSA-site pilot project. Preliminary evaluation results show that VISAGE is more efficient for query construction than the i2b2 web-client.
PMCID: PMC3041531  PMID: 21347154
4.  User Requirements for Exploring a Resource Inventory for Clinical Research 
The CTSA Inventory of Resources Explorer facilitates searching and finding relevant biomedical resources in this rich, federated inventory. We used efficient and non-traditional formal usability methods to define requirements and to design the Explorer, which may be extended to similar web-based tools.
PMCID: PMC3041532  PMID: 21347144
5.  Analysis of False Positive Errors of an Acute Respiratory Infection Text Classifier due to Contextual Features 
Text classifiers have been used for biosurveillance tasks to identify patients with diseases or conditions of interest. When compared to a clinical reference standard of 280 cases of Acute Respiratory Infection (ARI), a text classifier consisting of simple rules and NegEx plus string matching for specific concepts of interest produced 569 (4%) false positive (FP) cases. Using instance-level manual annotation we estimate the prevalence of contextual attributes and error types leading to FP cases. Errors were due to (1) Deletion errors from abbreviations, spelling mistakes and missing synonyms (57%); (2) Insertion errors from templated document structures such as check boxes, and lists of signs and symptoms (36%); and (3) Substitution errors from irrelevant concepts and alternate meanings for the same word (6%). We demonstrate that specific concept attributes contribute to false positive cases. These results will inform modifications and adaptations to improve text classifier performance.
PMCID: PMC3041533  PMID: 21347150
6.  Secondary Use of EHR: Data Quality Issues and Informatics Opportunities 
Given the large-scale deployment of Electronic Health Records (EHR), secondary use of EHR data will be increasingly needed in all kinds of health services or clinical research. This paper reports some data quality issues we encountered in a survival analysis of pancreatic cancer patients. Using the clinical data warehouse at Columbia University Medical Center in the City of New York, we mined EHR data elements collected between 1999 and 2009 for a cohort of pancreatic cancer patients. Of the 3068 patients who had ICD-9-CM diagnoses for pancreatic cancer, only 1589 had corresponding disease documentation in pathology reports. Incompleteness was the leading data quality issue; many study variables had missing values to various degrees. Inaccuracy and inconsistency were the next common problems. In this paper, we present the manifestations of these data quality issues and discuss some strategies for using emerging informatics technologies to solve these problems.
PMCID: PMC3041534  PMID: 21347133
7.  Social Network Analysis of an Online Melanoma Discussion Group 
We have developed tools to explore social networks that share information in medical forums to better understand the unmet informational needs of patients and family members facing cancer treatments. We define metrics demonstrating that members discussing interleukin-2 receive a stronger response from the melanoma discussion group than a typical topic does. The interleukin-2 network has a different topology than the melanoma network, has a higher density, and its members are more likely to have a higher intimacy level with another member and a lower inquisitiveness level than a typical melanoma user. Members are more likely to join the interleukin-2 network to answer a question than in the melanoma network (probability = 0.2 ± 0.05, p = .001). Within the melanoma network 20% of the questions posed to the community do not get an answer. In the interleukin-2 network, 1.3% of the questions (one question) do not get a response.
PMCID: PMC3041535  PMID: 21347134
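Several of the measures above (density, response strength) are standard social-network metrics. As a minimal illustration of one of them, the following Python sketch computes the density of an undirected reply network; the member and edge data are invented for the example, not drawn from the study.

```python
def density(nodes, edges):
    """Fraction of possible undirected edges that are present:
    2E / (N * (N - 1)) for N nodes and E edges."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))

# Hypothetical forum members and who has replied to whom:
members = {"a", "b", "c", "d"}
replies = {("a", "b"), ("b", "c"), ("a", "c")}
print(density(members, replies))  # 3 of 6 possible edges -> 0.5
```

A denser subgroup (such as the interleukin-2 network described above) would score closer to 1.0 on this measure.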
8.  Shared Genomics: Developing an accessible integrated analysis platform for Genome-Wide Association Studies 
Increasingly, genome-wide association studies are being used to identify positions within the human genome that have a link with a disease condition. The number of genomic locations studied means that computation-intensive and bioinformatics-intensive solutions will have to be used in the analysis of these data sets. In this paper we present an integrated Workbench that provides user-friendly access to parallelized statistical genetics analysis codes for clinical researchers. In addition, we biologically annotate statistical analysis results by reusing existing bioinformatics Taverna workflows.
PMCID: PMC3041536  PMID: 21347139
9.  Distributed Cognition Artifacts on Clinical Research Data Collection Forms 
Medical record abstraction, a primary mode of data collection in secondary data use, is associated with high error rates. Cognitive factors have not been studied as a possible explanation for medical record abstraction errors. We employed the theory of distributed representation and representational analysis to systematically evaluate cognitive demands in medical record abstraction and the extent of external cognitive support employed in a sample of clinical research data collection forms.
We show that the cognitive load required for abstraction in 61% of the sampled data elements was high, exceedingly so in 9%. Further, the data collection forms did not support external cognition for the most complex data elements. High working memory demands are a possible explanation for the association of data errors with data elements requiring abstractor interpretation, comparison, mapping or calculation. The representational analysis used here can be used to identify data elements with high cognitive demands.
PMCID: PMC3041537  PMID: 21347145
10.  Automated Ontological Gene Annotation for Computing Disease Similarity 
The annotation of gene/gene products with information on associated diseases is useful as an aid to clinical diagnosis and drug discovery. Several supervised and unsupervised methods exist that automate the association of genes with diseases, but relatively little work has been done to map protein sequence data to disease terminologies. This paper augments an existing open-disease terminology, the Disease Ontology (DO), and uses it for automated annotation of Swissprot records. In addition to the inherent benefits of mapping data to a rich ontology, we demonstrate a gain of 36.1% in gene-disease associations compared to that in DO. Further, we measure disease similarity by exploiting the co-occurrence of annotation among proteins and the hierarchical structure of DO. This makes it possible to find related diseases or signs, with the potential to find previously unknown relationships.
PMCID: PMC3041538  PMID: 21347137
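The abstract describes measuring disease similarity by exploiting co-occurrence of annotation among proteins. One simple way to operationalize that idea, sketched here under our own assumptions (the authors additionally exploit the DO hierarchy, which this sketch omits), is a Jaccard similarity over the protein sets annotated to each disease term:

```python
def disease_similarity(annotations, d1, d2):
    """Jaccard similarity of the protein sets annotated to two disease
    terms. 'annotations' is a list of (protein, disease) pairs; the
    measure itself is our illustrative choice, not the paper's."""
    p1 = {p for p, d in annotations if d == d1}
    p2 = {p for p, d in annotations if d == d2}
    if not p1 or not p2:
        return 0.0
    return len(p1 & p2) / len(p1 | p2)

# Toy annotation set (hypothetical proteins and disease terms):
annot = [("P1", "diabetes"), ("P2", "diabetes"),
         ("P2", "obesity"), ("P3", "obesity")]
print(disease_similarity(annot, "diabetes", "obesity"))  # 1/3
```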
11.  Analysis of Eligibility Criteria Complexity in Clinical Trials 
Formal, computer-interpretable representations of eligibility criteria would allow computers to better support key clinical research and care use cases such as eligibility determination. To inform the development of such formal representations for eligibility criteria, we conducted this study to characterize and quantify the complexity present in 1000 eligibility criteria randomly selected from studies in ClinicalTrials.gov. We classified the criteria by their complexity, semantic patterns, clinical content, and data sources. Our analyses revealed significant semantic and clinical content variability. We found that 93% of criteria were comprehensible, with 85% of these criteria having significant semantic complexity, including 40% relying on temporal data. We also identified several domains of clinical content. Using the findings of the study as requirements for computer-interpretable representations of eligibility, we discuss the challenges for creating such representations for use in clinical research and practice.
PMCID: PMC3041539  PMID: 21347148
12.  An R Package for Simulation Experiments Evaluating Clinical Trial Designs 
This paper presents an open-source application for evaluating competing clinical trial (CT) designs using simulations. The S4 system of classes and methods is utilized. Using object-oriented programming provides extensibility through careful, clear interface specification; using R, an open-source, widely used statistical language, makes the application extendible by the people who design CTs: biostatisticians. Four key classes define the specifications of the population models, CT designs, outcome models and evaluation criteria. Five key methods define the interfaces for generating patient baseline characteristics, applying stopping rules, assigning treatments, generating patient outcomes and calculating the evaluation criteria. Documentation of their connections with the user input screens, with the central simulation loop, and with each other facilitates the extensibility. New subclasses and instances of existing classes meeting these interfaces can integrate immediately into the application. To illustrate the application, we evaluate the effect of patient pharmacokinetic heterogeneity on the performance of a common Phase I “3+3” design.
PMCID: PMC3041540  PMID: 21347151
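The authors' package is written in R using S4 classes; as a language-neutral illustration of the kind of design being simulated, here is a minimal Python sketch of the standard "3+3" dose-escalation rule mentioned at the end of the abstract. The escalation logic follows the common textbook formulation, not the authors' code, and the toxicity probabilities are invented.

```python
import random

def run_3plus3(tox_probs, seed=0):
    """Simulate one trial under the standard '3+3' rule. tox_probs[i] is
    the true dose-limiting-toxicity (DLT) probability at dose level i.
    Returns the selected MTD level (-1 if even the lowest dose is too
    toxic)."""
    rng = random.Random(seed)
    level = 0
    while level < len(tox_probs):
        dlts = sum(rng.random() < tox_probs[level] for _ in range(3))
        if dlts == 0:                    # 0/3 DLTs: escalate
            level += 1
        elif dlts == 1:                  # 1/3: expand cohort to 6
            dlts += sum(rng.random() < tox_probs[level] for _ in range(3))
            if dlts <= 1:
                level += 1
            else:
                return level - 1         # too toxic: MTD is previous level
        else:                            # >= 2/3 DLTs: stop
            return level - 1
    return len(tox_probs) - 1

# Evaluate operating characteristics over many simulated trials:
probs = [0.05, 0.15, 0.30, 0.50]         # hypothetical dose levels
mtds = [run_3plus3(probs, seed=s) for s in range(1000)]
```

Repeating such runs over heterogeneous patient models is exactly the kind of experiment the described application is built to organize.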
13.  Evaluation of an Ontology-anchored Natural Language-based Approach for Asserting Multi-scale Biomolecular Networks for Systems Medicine 
The ability to adequately and efficiently integrate unstructured, heterogeneous datasets, which are incumbent to systems biology and medicine, is one of the primary limitations to their comprehensive analysis. Natural language processing (NLP) and biomedical ontologies are automated methods for capturing, standardizing and integrating information across diverse sources, including narrative text. We have utilized the BioMedLEE NLP system to extract and encode, using standard ontologies (e.g., Cell Type Ontology, Mammalian Phenotype, Gene Ontology), biomolecular mechanisms and clinical phenotypes from the scientific literature. We subsequently applied semantic processing techniques to the structured BioMedLEE output to determine the relationships between these biomolecular and clinical phenotype concepts. Our evaluation shows that the average precision and recall of BioMedLEE for annotating phrases composed of cell type, anatomy/disease, and gene/protein concepts were 86% and 78%, respectively. The precision of the asserted phenotype-molecular relationships was 75%.
PMCID: PMC3041541  PMID: 21347135
14.  Concept Discovery for Pathology Reports using an N-gram Model 
A large amount of valuable information is available in plain-text clinical reports. New techniques and technologies are applied to extract information from these reports. One of the leading systems in the cancer community is the Cancer Text Information Extraction System (caTIES), which was developed with caBIG-compliant data structures. caTIES embeds two key components for extracting data: MMTx and GATE. In this paper, an n-gram-based framework is shown to be capable of discovering concepts from text reports. MetaMap is used to map medical terms to the National Cancer Institute (NCI) Metathesaurus and the Unified Medical Language System (UMLS) Metathesaurus for verifying legitimate medical data. The final concepts from our framework and caTIES are weighted based on our scoring model. The scores show that, on average, our framework scores higher than caTIES on 848 (36.9%) of reports. Furthermore, for 1388 (60.5%) of reports the two systems perform similarly.
PMCID: PMC3041542  PMID: 21347147
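As a minimal illustration of the n-gram step in such a framework, the sketch below generates word-level n-grams from a report; in the actual system, candidate phrases like these would then be mapped to Metathesaurus concepts by MetaMap. The tokenization rule and sample report are our own simplifications.

```python
import re

def ngrams(text, n):
    """All word-level n-grams of a report, lowercased, with punctuation
    stripped by a simple alphanumeric tokenizer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

report = "Invasive ductal carcinoma, left breast."
print(ngrams(report, 2))
# ['invasive ductal', 'ductal carcinoma', 'carcinoma left', 'left breast']
```

Candidates such as "ductal carcinoma" would survive concept mapping, while spans like "carcinoma left" would fail to resolve and be discarded.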
15.  Facilitating Health Data Sharing Across Diverse Practices and Communities 
Health data sharing with and among practices is a method for engaging rural and underserved populations, often with strong histories of marginalization, in health research. The Institute of Translational Health Sciences, funded by a National Institutes of Health Clinical and Translational Science Award, is engaged in the LC Data QUEST project to build practice and community based research networks with the ability to share semantically aligned electronic health data. We visited ten practices and communities to assess the feasibility of and barriers to developing data sharing networks. We found that these sites had very different approaches and expectations for data sharing. In order to support practices and communities and foster the acceptance of data sharing in these settings, informaticists must take these diverse views into account. Based on these findings, we discuss system design implications and the need for flexibility in the development of community-based data sharing networks.
PMCID: PMC3041543  PMID: 21347138
16.  A Collaborative Framework for Representation and Harmonization of Clinical Study Data Elements Using Semantic MediaWiki 
Semantic interoperability among terminologies, data elements, and information models is fundamental and critical for sharing information from the scientific bench to the clinical bedside and back among systems. To meet this need, the vision for CDISC is to build a global, accessible electronic library, which enables precise and standardized data element definitions that can be used in applications and studies to improve biomedical research and its link with health care. As a pilot study, we propose a representation and harmonization framework for clinical study data elements and implement a prototype CDISC Shared Health and Research Electronic Library (CSHARE) using Semantic MediaWiki. We report the preliminary observations of how the components worked and the lessons learnt. In summary, the wiki provided a useful prototyping tool from a process standpoint.
PMCID: PMC3041544  PMID: 21347136
17.  Representing Multi-Database Study Schemas for Reusability 
The need for easy, non-technical interfaces to clinical databases for research preceded translational research activities but is made more important because of them. The utility of such interfaces can be improved by the presence of a persistent, reusable and modifiable structure that holds the decisions made in extraction of data from one or more data sources for a study, including the filtering of records, selection of the fields within those records, renaming of fields, and classification of data. This paper demonstrates use of the Web Ontology Language (OWL) as a data representation of these decisions, which define a study schema.
PMCID: PMC3041545  PMID: 21347140
18.  The Human Studies Database Project: Federating Human Studies Design Data Using the Ontology of Clinical Research 
Human studies, encompassing interventional and observational studies, are the most important source of evidence for advancing our understanding of health, disease, and treatment options. To promote discovery, the design and results of these studies should be made machine-readable for large-scale data mining, synthesis, and re-analysis. The Human Studies Database Project aims to define and implement an informatics infrastructure for institutions to share the design of their human studies. We have developed the Ontology of Clinical Research (OCRe) to model study features such as design type, interventions, and outcomes to support scientific query and analysis. We are using OCRe as the reference semantics for federated data sharing of human studies over caGrid, and are piloting this implementation with several Clinical and Translational Science Award (CTSA) institutions.
PMCID: PMC3041546  PMID: 21347149
19.  Biomolecular Systems of Disease Buried Across Multiple GWAS Unveiled by Information Theory and Ontology 
A key challenge for genome-wide association studies (GWAS) is to understand how single nucleotide polymorphisms (SNPs) mechanistically underpin complex diseases. While this challenge has been addressed partially by Gene Ontology (GO) enrichment of large lists of host genes of SNPs prioritized in GWAS, these enrichments have not been formally evaluated. Here, we develop a novel computational approach anchored in information-theoretic similarity, by systematically mining lists of host genes of SNPs prioritized in three adult-onset diabetes mellitus GWAS. The “gold standard” is based on GO associated with 20 published diabetes SNPs’ host genes and on our own evaluation. We computationally identify 69 similarity-predicted GO terms independently validated in all three GWAS (FDR<5%), enriched with those of the gold standard (odds ratio=5.89, P=4.81e-05), and these terms can be organized by similarity criteria into 11 groupings termed “biomolecular systems”. Six biomolecular systems were corroborated by the gold standard and the remaining five were previously uncharacterized.
PMCID: PMC3041547  PMID: 21347143
20.  An Automated Approach to Calculating the Daily Dose of Tacrolimus in Electronic Health Records 
Clinical research often requires extracting detailed drug information, such as medication names and dosages, from Electronic Health Records (EHR). Since medication information is often recorded in both structured and unstructured formats in the EHR, extracting all the relevant drug mentions and determining the daily dose of a medication for a selected patient at a given date can be a challenging and time-consuming task. In this paper, we present an automated approach using natural language processing to calculate daily doses of medications mentioned in clinical text, using tacrolimus as a test case. We evaluated this method using data sets from four different types of unstructured clinical data. Our results showed that the system achieved precisions of 0.90–1.00 and recalls of 0.81–1.00.
PMCID: PMC3041548  PMID: 21347153
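A minimal sketch of the daily-dose arithmetic such a system performs, assuming a dose-and-frequency pattern and a frequency lookup table that are our own illustrations rather than the authors' NLP pipeline:

```python
import re

# Hypothetical mapping of Latin frequency abbreviations to doses per day.
FREQ_PER_DAY = {"qd": 1, "daily": 1, "bid": 2, "tid": 3, "qid": 4}

def daily_dose(mention):
    """Parse '<strength> mg <frequency>' from a free-text medication
    mention and return total mg per day, or None if no dose is found."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*mg\s+(\w+)", mention.lower())
    if not m:
        return None
    dose, freq = float(m.group(1)), m.group(2)
    return dose * FREQ_PER_DAY.get(freq, 1)

print(daily_dose("tacrolimus 2 mg bid"))  # 2 mg twice daily -> 4.0
```

A production system must additionally handle split doses (e.g., different morning and evening strengths), tapers, and mentions spread across multiple sentences, which is where most of the real difficulty lies.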
21.  A Knowledge Extraction Framework for Biomedical Pathways 
In this paper we present a novel knowledge extraction framework that is based on semantic parsing. The semantic information originates in a variety of resources, but one in particular, namely BioFrameNet, is central to the characterization of complex events and processes that form biomedical pathways. The paper discusses the promising results of semantic parsing and explains how these results can be used for capturing complex medical knowledge.
PMCID: PMC3041549  PMID: 21347132
22.  PositionMatcher: A Fast Custom-Annotation Tool for Short DNA Sequences 
Microarray probes and reads from massively parallel sequencing technologies are the two most widely used genomic tags for transcriptome studies. Names and underlying technologies might differ, but expression technologies share a common objective—to obtain mRNA abundance values at the gene level, with high sensitivity and specificity. However, the initial tag annotation becomes obsolete as more insight is gained into biological references (genome, transcriptome, SNP, etc.). While novel alignment algorithms for short reads are being released every month, solutions for rapid annotation of tags are rare. We have developed a generic matching algorithm that uses genomic positions for rapid custom annotation of tags with a time complexity of O(n log n). We demonstrate our algorithm on the custom annotation of Illumina massively parallel sequencing reads and Affymetrix microarray probes and identification of alternatively spliced regions.
PMCID: PMC3041550  PMID: 21347141
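The core idea, matching tag positions against genomic feature intervals after sorting, can be sketched as follows. The data model (point positions, non-overlapping intervals) is a simplification of ours, but the sort-plus-binary-search structure gives the O(n log n) complexity the abstract cites.

```python
from bisect import bisect_right

def annotate(tags, features):
    """Assign each (name, position) tag the label of the feature interval
    containing it. Features are (start, end, label) with end exclusive
    and assumed non-overlapping. Sorting once then binary-searching per
    tag is O(n log n) overall."""
    feats = sorted(features)
    starts = [f[0] for f in feats]
    out = {}
    for name, pos in tags:
        i = bisect_right(starts, pos) - 1   # rightmost interval starting <= pos
        if i >= 0 and feats[i][0] <= pos < feats[i][1]:
            out[name] = feats[i][2]
    return out

# Hypothetical gene intervals and read start positions:
genes = [(100, 200, "GENE_A"), (300, 450, "GENE_B")]
reads = [("read1", 150), ("read2", 320), ("read3", 250)]
print(annotate(reads, genes))  # {'read1': 'GENE_A', 'read2': 'GENE_B'}
```

Because only positions are compared, re-annotating tags against a new genome or transcriptome release is just a re-run with updated intervals, which is the "custom annotation" use case the paper targets.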
23.  Corpus-based Approach to Creating a Semantic Lexicon for Clinical Research Eligibility Criteria from UMLS 
We describe a corpus-based approach to creating a semantic lexicon using UMLS knowledge sources. We extracted 10,000 sentences from the eligibility criteria sections of clinical trial summaries contained in ClinicalTrials.gov. The UMLS Metathesaurus and SPECIALIST Lexical Tools were used to extract and normalize UMLS-recognizable terms. When annotated with Semantic Network types, the corpus had a lexical ambiguity of 1.57 (= total types for unique lexemes / total unique lexemes) and a word occurrence ambiguity of 1.96 (= total type occurrences / total word occurrences). A set of semantic preference rules was developed and applied to completely eliminate ambiguity in semantic type assignment. The lexicon covered 95.95% of UMLS-recognizable terms in our corpus. A total of 20 UMLS semantic types, representing about 17% of all the distinct semantic types assigned to corpus lexemes, covered about 80% of the vocabulary of our corpus.
PMCID: PMC3041551  PMID: 21347142
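The two ambiguity metrics are defined explicitly in the abstract, so they can be computed directly; the toy token annotations below are our own illustration:

```python
from collections import defaultdict

def ambiguity(annotated):
    """Compute the two corpus metrics from the abstract:
    lexical ambiguity    = total types for unique lexemes / unique lexemes
    occurrence ambiguity = total type occurrences / total word occurrences
    'annotated' is a list of (lexeme, set_of_semantic_types) per token."""
    types_of = defaultdict(set)
    occ_types = occ_words = 0
    for lexeme, stypes in annotated:
        types_of[lexeme].update(stypes)
        occ_types += len(stypes)
        occ_words += 1
    lexical = sum(len(s) for s in types_of.values()) / len(types_of)
    occurrence = occ_types / occ_words
    return lexical, occurrence

corpus = [("cold", {"Disease", "Temperature"}), ("cold", {"Disease"}),
          ("aspirin", {"Drug"})]
print(ambiguity(corpus))  # lexical 1.5, occurrence ~1.33
```

On the paper's corpus these formulas yield 1.57 and 1.96 respectively; the semantic preference rules then drive both values to 1 by forcing a single type per assignment.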
24.  Conceptual Dissonance: Evaluating the Efficacy of Natural Language Processing Techniques for Validating Translational Knowledge Constructs 
The conduct of large-scale translational studies presents significant challenges related to the storage, management and analysis of integrative data sets. Ideally, the application of methodologies such as conceptual knowledge discovery in databases (CKDD) provides a means for moving beyond intuitive hypothesis discovery and testing in such data sets, and towards the high-throughput generation and evaluation of knowledge-anchored relationships between complex bio-molecular and phenotypic variables. However, the induction of such high-throughput hypotheses is non-trivial, and requires correspondingly high-throughput validation methodologies. In this manuscript, we describe an evaluation of the efficacy of a natural language processing-based approach to validating such hypotheses. As part of this evaluation, we will examine a phenomenon that we have labeled as “Conceptual Dissonance” in which conceptual knowledge derived from two or more sources of comparable scope and granularity cannot be readily integrated or compared using conventional methods and automated tools.
PMCID: PMC3041552  PMID: 21347178
25.  Bayesian Combinatorial Partitioning For Detecting Interactions Among Genetic Variants 
Detecting epistatic (nonlinear) interactions among single nucleotide polymorphisms (SNPs) at multiple loci is important in the analysis of genomic data in association studies. We developed a Bayesian combinatorial partitioning (BCP) method for detecting such interactions among SNPs that are predictive of disease. When compared with multifactor dimensionality reduction (MDR), a widely used combinatorial partitioning method for detecting interactions, BCP has significantly greater power and is computationally more efficient.
PMCID: PMC3041553  PMID: 21347185