1.  Conceptual Dissonance: Evaluating the Efficacy of Natural Language Processing Techniques for Validating Translational Knowledge Constructs 
The conduct of large-scale translational studies presents significant challenges related to the storage, management and analysis of integrative data sets. Ideally, the application of methodologies such as conceptual knowledge discovery in databases (CKDD) provides a means for moving beyond intuitive hypothesis discovery and testing in such data sets, and towards the high-throughput generation and evaluation of knowledge-anchored relationships between complex bio-molecular and phenotypic variables. However, the induction of such high-throughput hypotheses is non-trivial, and requires correspondingly high-throughput validation methodologies. In this manuscript, we describe an evaluation of the efficacy of a natural language processing-based approach to validating such hypotheses. As part of this evaluation, we will examine a phenomenon that we have labeled as “Conceptual Dissonance” in which conceptual knowledge derived from two or more sources of comparable scope and granularity cannot be readily integrated or compared using conventional methods and automated tools.
PMCID: PMC3041552  PMID: 21347178
2.  Bayesian Combinatorial Partitioning For Detecting Interactions Among Genetic Variants 
Detecting epistatic (nonlinear) interactions among single nucleotide polymorphisms (SNPs) at multiple loci is important in the analysis of genomic data in association studies. We developed a Bayesian combinatorial partitioning (BCP) method for detecting such interactions among SNPs that are predictive of disease. When compared with multifactor dimensionality reduction (MDR), a widely used combinatorial partitioning method for detecting interactions, BCP has significantly greater power and is computationally more efficient.
PMCID: PMC3041553  PMID: 21347185
3.  Comorbidity of Bipolar Disorder with Substance Abuse: Selection of Prioritized Genes for Translational Research 
Bipolar disorder is a highly heritable mental illness. The global burden of bipolar disorder is complicated by its comorbidity with substance abuse. Several genome-wide linkage/association studies on bipolar disorder as well as substance abuse have focused on the identification and/or prioritization of candidate disease genes. A useful step for translational research of these identified/prioritized genes is to identify sets of genes that have particular kinds of publicly available data. Therefore, we have leveraged the availability of links to related resources in the Entrez Gene database to develop a web-based resource for selecting genes based on presence or absence in particular biological data resources. The utility of our approach is demonstrated using a set of 3,399 genes from multiple eukaryotes that have been studied in the context of bipolar disorder and/or substance abuse. A web resource to automate the selection of genes that contain certain database links is available at
PMCID: PMC3041554  PMID: 21347170
4.  Literature Mapping with PubAtlas — extending PubMed with a ‘BLASTing interface’ * 
PubAtlas is a web service and standalone program providing literature maps for the biomedical research literature. It accepts user-defined sets of terms (PubMed queries) as input, and permits ‘BLASTing’ of one set against another: for all terms x and y in these sets, deriving the results of the pairwise intersections x AND y. This all-vs-all capability extends PubMed with a literature analysis interface. Correspondingly, the basic form of literature map that PubAtlas provides for exploring associations among sets of terms is an interactive tabular display, in heatmap/microarray format.
PubAtlas supports development of specialized lexica -- hierarchies of controlled terminology that can represent sets of related concepts or a ‘user-defined query language’. PubAtlas also provides historical perspectives on the literature, with temporal query features that highlight historical patterns. Generally, it is a framework for extending the PubMed interface, and an extensible platform for producing interactive literature maps.
PMCID: PMC3041555  PMID: 21347177
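The all-vs-all pairing described in the PubAtlas abstract can be illustrated with a short sketch. This is not PubAtlas code: the function names are hypothetical, and `count_hits` stands in for whatever search backend the caller supplies (nothing here contacts PubMed).

```python
from itertools import product


def pairwise_queries(terms_x, terms_y):
    """Build the all-vs-all grid of conjunctive queries 'x AND y'."""
    return {(x, y): f"({x}) AND ({y})" for x, y in product(terms_x, terms_y)}


def count_grid(terms_x, terms_y, count_hits):
    """Return a table of hit counts with one row per term in terms_x.

    count_hits is a caller-supplied function (hypothetical here) that maps
    a query string to the number of matching records.
    """
    queries = pairwise_queries(terms_x, terms_y)
    return [[count_hits(queries[(x, y)]) for y in terms_y] for x in terms_x]
```

The resulting row-by-column table is the kind of structure that a heatmap-style display could render, with one cell per term pair.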
5.  A Comparative Study of Metabolic Network Topology between a Pathogenic and a Non-Pathogenic Bacterium for Potential Drug Target Identification 
A metabolic network provides a unified platform to integrate all the biological information on genes, proteins, metabolites, drugs and drug targets for a comprehensive system-level study of the relationship between metabolism and disease. In recent times, drug-target identification by in silico methods has emerged as a major advance in the field of drug discovery. This paper focuses on describing how microbial drug target identification can be carried out using bioinformatic tools. Specifically, it highlights the use of metabolic ‘choke point’ and ‘load point’ analyses to understand the local and global properties of metabolic networks in Pseudomonas aeruginosa and to identify potential drug targets. We also list the top 10 choke point enzymes based on the load point values and the number of shortest paths. A non-pathogenic bacterial strain, Pseudomonas putida KT2440, and a related pathogenic bacterium, P. aeruginosa PA01, were selected for the network analysis. A comparative study of the metabolic networks of these two microbes highlights the analogies and differences between their respective pathways. System analysis of metabolic networks will help us identify new drug targets, which in turn will generate a more in-depth understanding of the mechanisms of disease and thus provide better guidance for drug discovery.
PMCID: PMC3041556  PMID: 21347179
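The abstract above does not define a choke point. The sketch below assumes the commonly used definition, namely a reaction that is the only consumer or the only producer of some metabolite in the network. The function name and the toy reaction format are hypothetical, and load point scoring is not attempted.

```python
from collections import defaultdict


def choke_points(reactions):
    """Return reactions that uniquely consume or uniquely produce a metabolite.

    reactions maps a reaction name to a (substrates, products) pair.
    The definition used here is an assumption, not taken from the abstract.
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    result = set()
    for reaction_set in list(consumers.values()) + list(producers.values()):
        if len(reaction_set) == 1:  # only one reaction touches this metabolite
            result |= reaction_set
    return result
```

In a toy network where R1 and R2 both convert A to B and R3 alone converts B to C, only R3 is flagged, since blocking it would strand B and starve C.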
6.  Developing a Manually Annotated Clinical Document Corpus to Identify Phenotypic Information for Inflammatory Bowel Disease 
Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free-text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) focusing and clarifying NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers.
Using a set of clinical documents from the VA EMR for a particular use case of interest, we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types.
Annotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield.
Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.
PMCID: PMC3041557  PMID: 21347157
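For readers unfamiliar with the agreement measures reported in this abstract, here is a minimal sketch of how such figures are typically computed. It assumes two annotators and Cohen's kappa; the abstract says only "Kappa statistic", so the exact variant used by the authors is not known, and the function names are illustrative.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

F-measure is often preferred for concept-level agreement because the number of true negatives (spans that neither annotator marked) is not well defined, which makes kappa hard to apply there.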
7.  A Controlled Vocabulary to Represent Sonographic Features of the Thyroid and its application in a Bayesian Network to Predict Thyroid Nodule Malignancy 
It is challenging to distinguish benign from malignant thyroid nodules on high resolution ultrasound. Many ultrasound features have been studied individually as predictors for thyroid malignancy, none with a high degree of accuracy, and there is no consistent vocabulary used to describe the features. Our hypothesis is that a standard vocabulary will advance accuracy. We performed a systematic literature review and identified all the sonographic features that have been well studied in thyroid cancers. We built a controlled vocabulary to describe sonographic features and to unify data in the literature on the predictive power of each feature. We used this terminology to build a Bayesian network to predict thyroid malignancy. Our Bayesian network performed similarly to or slightly better than experienced radiologists. Controlled terminology for describing thyroid radiology findings could be useful to characterize thyroid nodules and could enable decision support applications.
PMCID: PMC3041558  PMID: 21347173
8.  Extraction of Conditional Probabilities of the Relationships Between Drugs, Diseases, and Genes from PubMed Guided by Relationships in PharmGKB 
Guided by curated associations between genes, treatments (i.e., drugs), and diseases in PharmGKB, we constructed n-way Bayesian networks based on conditional probability tables (CPTs) extracted from co-occurrence statistics over the entire PubMed corpus, producing a broad-coverage analysis of the relationships between these biological entities. The networks suggest hypotheses regarding drug mechanisms, treatment biomarkers, and/or potential markers of genetic disease. The CPTs enable Trio, an inferential database, to query indirect (inferred) relationships via an SQL-like query language.
PMCID: PMC3041559  PMID: 21347183
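As a rough illustration of how a conditional probability can be estimated from co-occurrence statistics, the sketch below counts entity mentions per document and derives pairwise conditionals. It is an assumption-laden simplification: the abstract does not describe the authors' actual extraction pipeline, and the function names and toy entity labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count single and pairwise entity mentions.

    documents is an iterable of collections of entity names, one per document.
    """
    single = Counter()
    pair = Counter()
    for entities in documents:
        unique = set(entities)
        single.update(unique)
        pair.update(frozenset(p) for p in combinations(sorted(unique), 2))
    return single, pair


def conditional_probability(target, given, single, pair):
    """Estimate P(target is mentioned | given is mentioned) from the counts."""
    if single[given] == 0:
        return 0.0
    return pair[frozenset((target, given))] / single[given]
```

Each row of a pairwise conditional probability table would then hold such estimates for one conditioning entity; n-way tables would condition on combinations of entities instead.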
9.  Ontology Based Clinical Query Extraction 
Knowledge about human anatomy, radiology and diseases that is essential for interpreting medical images can be acquired from medical ontology terms and relations. These can then be analyzed using domain corpora to identify the statistically most relevant term-relation-term patterns. We argue that such patterns are the basis for more complex clinical search queries and describe our approach for deriving them. These patterns can then be used to support the knowledge elicitation process between the domain expert and the knowledge engineer by providing a common vocabulary for the communication.
PMCID: PMC3041560  PMID: 21347186
10.  Analysis of AML Genes in Dysregulated Molecular Networks 
Identifying disease causing genes and understanding their molecular mechanisms are essential to developing effective therapeutics. Thus, several computational methods have been proposed to prioritize candidate disease genes by integrating different data types, including sequence information, biomedical literature, and pathway information. Recently, molecular interaction networks have been incorporated to predict disease genes, but most of those methods do not utilize invaluable disease-specific information available in mRNA expression profiles of patient samples.
Through the integration of protein-protein interaction networks and gene expression profiles of acute myeloid leukemia (AML) patients, we identified subnetworks of interacting proteins dysregulated in AML and characterized known mutation genes causally implicated in AML embedded in the subnetworks. The analysis shows that the set of extracted subnetworks is a reservoir rich in AML genes reflecting key leukemogenic processes such as myeloid differentiation,
We showed that the integrative approach, utilizing both gene expression profiles and molecular networks, could identify AML-causing genes, most of which were not detectable with gene expression analysis alone because of their minor changes in mRNA.
PMCID: PMC3041561  PMID: 21347161
11.  Integrating Automated Workflows, Human Intelligence and Collaboration 
Many methods and tools have evolved for microarray analysis, such as single probe evaluation, promoter module modeling and pathway analysis. Little is known, however, about optimizing this flow of analysis for the flexible reasoning biomedical researchers need for hypothesizing about disease mechanisms. In developing and implementing a workflow, we found that workflows are not complete or valuable unless automation is well integrated with human intelligence. We present our workflow for the translational problem of classifying new sub-types of renal diseases. Using our workflow as an example, we explain opportunities and limitations in achieving this necessary integration and propose approaches to guide such integration for the next great frontier: facilitating exploratory analysis of candidate genes.
PMCID: PMC3041562  PMID: 21347175
12.  Development of an Agile Knowledge Engineering Framework in Support of Multi-Disciplinary Translational Research 
In October 2006, the National Institutes of Health launched a new national consortium, funded through Clinical and Translational Science Awards (CTSA), with the primary objective of improving the conduct and efficiency of the inherently multi-disciplinary field of translational research. To help meet this goal, the Ohio State University Center for Clinical and Translational Science has launched a knowledge management initiative that is focused on facilitating widespread semantic interoperability among administrative, basic science, clinical and research computing systems, both internally and among the translational research community at large, through the integration of domain-specific standard terminologies and ontologies with local annotations. This manuscript describes an agile framework that builds upon prevailing knowledge engineering and semantic interoperability methods, and will be implemented as part of this initiative.
PMCID: PMC3041563  PMID: 21347164
13.  Automatically Classifying Sentences in Full-Text Biomedical Articles into Introduction, Methods, Results and Discussion 
Biomedical texts can typically be represented by four rhetorical categories: introduction, methods, results and discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied approaches to automatically classify sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We explored different approaches to automatically classify a sentence in a full-text biomedical article into the IMRAD categories. Our best system is a support vector machine classifier that achieved 81.30% accuracy, which is significantly higher than baseline systems.
PMCID: PMC3041564  PMID: 21347163
14.  Agent-Based Modeling Supporting the Migration of Registry Systems to Grid Based Architectures 
With the increasing age and operating cost of the core technologies of the existing NCI SEER platform, such essential resources in the fight against cancer will eventually have to be migrated to Grid-based systems. In order to model this migration, a simulation is proposed based upon an agent modeling technology. This modeling technique allows for simulation of complex and distributed services provided by a large-scale Grid computing platform such as the caBIG™ project’s caGRID. In order to investigate such a migration to a Grid-based platform technology, this paper proposes using agent-based modeling simulations to predict the performance of current and Grid configurations of the NCI SEER system integrated with the existing translational opportunities afforded by caGRID. The model illustrates how the use of Grid technology can potentially improve system response time as systems under test are scaled. In modeling SEER nodes accessing multiple registry silos, we show that the performance of SEER applications re-implemented in a Grid-native manner exhibits a nearly constant user response time with increasing numbers of distributed registry silos, compared with the current application architecture, which exhibits a linear increase in response time for increasing numbers of silos.
PMCID: PMC3041565  PMID: 21347166
15.  Pattern Discovery in Breast Cancer Specific Protein Interaction Network 
Interest in identifying novel biomarkers for early stage breast cancer (BRCA) detection has grown significantly in recent years. From the viewpoint of network biology, one of the emerging themes today is to re-characterize a protein’s biological functions in its molecular network. Although many methods have been presented, including network-based gene ranking for molecular biomarker discovery and graph clustering for functional module discovery, it is still hard to find systems-level properties hidden in disease-specific molecular networks. We reconstructed a BRCA-related protein interaction network by using BRCA-associated genes/proteins as seeds and expanding them in an integrated protein interaction database. We further developed a computational framework based on Ant Colony Optimization to rank network nodes. The task of ranking nodes is represented as the problem of finding optimal density distributions of “ant colonies” on all nodes of the network. Our results revealed some interesting systems-level patterns in the BRCA-related protein interaction network.
PMCID: PMC3041566  PMID: 21347162
16.  Mining to find the lipid interaction networks involved in Ovarian Cancers 
The role of lipids in cancer during the genesis, progression and subsequent metastasis stages is increasingly discussed in the scientific literature. This information is spread across a wide range of journals, making it difficult for researchers to track the latest developments. A comprehensive assessment and translation of the lipidome of ovarian cancer, originating from the literature, has yet to be made. We illustrate the deployment of semantic technologies, namely a lipid ontology and text mining, in the aggregation and coordination of lipid literature. We provide the first report on the roles and types of lipids involved in ovarian cancer based on the mining of literature and identify key lipid-protein interactions that may point to potential drug discovery targets.
PMCID: PMC3041567  PMID: 21347172
17.  Using the Weighted Keyword Model to Improve Information Retrieval for Answering Biomedical Questions 
Physicians ask many complex questions during the patient encounter. Information retrieval systems that can provide immediate and relevant answers to these questions can be invaluable aids to the practice of evidence-based medicine. In this study, we first automatically identify topic keywords from ad hoc clinical questions with a Conditional Random Field model that is trained over thousands of manually annotated clinical questions. We then report on a linear model that assigns query weights based on their automatically identified semantic roles: topic keywords, domain-specific terms, and their synonyms. Our evaluation shows that this weighted keyword model improves information retrieval from the Text Retrieval Conference Genomics track data.
PMCID: PMC3041568  PMID: 21347188
18.  Using Personal Health Records for Automated Clinical Trials Recruitment: the ePaIRing Model 
We describe the development of a model of the use of patient information to improve patient recruitment in clinical trials. This model, named ePaIRing (electronic Participant Identification and Recruitment Model), describes variations in how information flows between stakeholders, and how personal health records can specifically facilitate patient recruitment.
PMCID: PMC3041569  PMID: 21347187
19.  Evaluation of cardiovascular risk assessment models with respect to the clinical interpretation of atherosclerosis in a different type II diabetes cohort 
Epidemiological studies of cardiovascular risk have produced many assessment models that are widely available for public use. Because many arterial occlusive diseases develop from atherosclerosis in their early stages, it is meaningful to evaluate such models with respect to the clinical interpretation of atherosclerosis so as to promote preventive care for vascular diseases. Our study uses data collected from a Hong Kong Chinese type II diabetes cohort to evaluate and compare the performance of the ARIC, FHS, and UKPDS risk assessment models using ROC curves.
We found that ARIC’s Stroke model for Black Americans gives the best performance, with an AUC of 0.646, whereas UKPDS’s Stroke model has the lowest AUC, 0.497. Based on the ROC analysis, the ARIC model for Black Americans has superior performance in the Hong Kong cohort.
PMCID: PMC3041570  PMID: 21347160
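To make the AUC figures in this abstract concrete, the following sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case, with ties counted as one half. The function name and the 0/1 label encoding are assumptions for illustration; the abstract does not say how the authors computed AUC.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    labels uses 1 for cases (event occurred) and 0 for non-cases.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one non-case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))
```

Under this reading, an AUC near 0.5 (such as the 0.497 reported for the UKPDS Stroke model) means the scores rank cases above non-cases about as often as chance would.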
20.  The potential for automated question answering in the context of genomic medicine: An assessment of existing resources and properties of answers 
Knowledge gained in studies of genetic disorders is reported in a growing body of biomedical literature containing reports of genetic variation in individuals that map to medical conditions and/or response to therapy. These scientific discoveries need to be translated into practical applications to optimize patient care. Translating research into practice can be facilitated by supplying clinicians with research evidence. We assessed the role of existing tools in extracting answers to translational research questions in the area of genomic medicine. We evaluated the coverage of translational research terms in the Unified Medical Language System (UMLS) Metathesaurus, determined where answers are most often found in full-text articles, and determined common answer patterns. Findings suggest that we will be able to leverage the UMLS in the development of natural language processing algorithms for automated extraction of answers to translational research questions from biomedical text in the area of genomic medicine.
PMCID: PMC3041571  PMID: 21347155
21.  Multi-Criteria Decision Making Approaches for Quality Control of Genome-Wide Association Studies 
Experimental errors in the genotyping phases of a Genome-Wide Association Study (GWAS) can lead to false positive findings and to spurious associations. An appropriate quality control phase could minimize the effects of this kind of error. Several filtering criteria can be used to perform quality control. Currently, no formal methods have been proposed for taking into account both these criteria and the experimenter’s preferences at the same time. In this paper we propose two strategies for setting appropriate genotyping rate thresholds for GWAS quality control. These two approaches are based on Multi-Criteria Decision Making theory. We have applied our method to a real dataset composed of 734 individuals affected by Arterial Hypertension (AH) and 486 nonagenarians without a history of AH. The proposed strategies appear to deal with GWAS quality control in a sound way, as they rationalize and make explicit the experimenter’s choices, thus providing more reproducible results.
PMCID: PMC3041572  PMID: 21347174
22.  Artificial Intelligence in Prediction of Secondary Protein Structure Using CB513 Database 
In this paper we describe CB513, a non-redundant dataset suitable for the development of algorithms for prediction of secondary protein structure. A program was written in Borland Delphi to transform data from our dataset into a form suitable for training a neural network for prediction of secondary protein structure, implemented in the MATLAB Neural Network Toolbox. Learning (training and testing) of the neural network was investigated with different window sizes, different numbers of neurons in the hidden layer, and different numbers of training epochs, using the CB513 dataset.
PMCID: PMC3041573  PMID: 21347158
23.  A Semantic Image Annotation Model to Enable Integrative Translational Research 
Integrating and relating images with clinical and molecular data is a crucial activity in translational research, but challenging because the information in images is not explicit in standard computer-accessible formats. We have developed an ontology-based representation of the semantic contents of radiology images called AIM (Annotation and Image Markup). AIM specifies the quantitative and qualitative content that researchers extract from images. The AIM ontology enables semantic image annotation and markup, specifying the entities and relations necessary to describe images. AIM annotations, represented as instances in the ontology, enable key use cases for images in translational research such as disease status assessment, query, and inter-observer variation analysis. AIM will enable ontology-based query and mining of images, and integration of images with data in other ontology-annotated bioinformatics databases. Our ultimate goal is to enable researchers to link images with related scientific data so they can learn the biological and physiological significance of the image content.
PMCID: PMC3041574  PMID: 21347180
24.  Consistent visualizations of changing knowledge 
Networks are increasingly used in biology to represent complex data in uncomplicated symbolic form. However, as biological knowledge is continually evolving, so must the networks representing this knowledge. Capturing and presenting this type of knowledge change over time is particularly challenging because of the intimate manner in which researchers customize the networks they come into contact with. The effective visualization of this knowledge is important as it creates insight into complex systems and stimulates hypothesis generation and biological discovery. Here we highlight how the retention of user customizations, and the collection and visualization of knowledge-associated provenance, support effective and productive network exploration. We also present an extension of the Hanalyzer system, ReOrient, which supports network exploration and analysis in the presence of knowledge change.
PMCID: PMC3041575  PMID: 21347184
25.  The Open Biomedical Annotator 
The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle in extracting the data they need from the large amounts of data that are available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. In addition, there is a lack of easy-to-use systems that facilitate the use of ontologies for annotation. This paper presents the Open Biomedical Annotator (OBA), an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata. The biomedical community can use the annotator service to tag datasets automatically with ontology terms (from UMLS and NCBO BioPortal ontologies). Such annotations facilitate translational discoveries by integrating annotated data.
PMCID: PMC3041576  PMID: 21347171

