1.  An automated tool for detecting medication overuse based on the electronic health records 
Purpose
Medication overuse is a serious concern in healthcare as it leads to increased expenditures, side effects and morbidities. Identifying overuse is only possible through excluding appropriate indications that are primarily mentioned in unstructured notes. We developed a framework for automatic identification of medication overuse and applied it to proton pump inhibitors (PPIs).
Methods
We first created an indications knowledgebase using data from drug labels, clinical guidelines, expert opinion and other sources. We also obtained the list of current problems for 200 randomly selected inpatients who received PPIs using a natural language processing system and the discharge summaries of those patients. These problems were checked against the indications knowledge base to identify overuse candidates. Two gastroenterologists manually reviewed the notes and identified cases of overuse. Results from the automated framework were compared to the manual review.
Results
Reviewers had high inter-rater reliability in finding indications (agreement = 92.1%, Cohen’s κ = 0.773). In the 137 notes included in the final analysis, our system identified indications with a sensitivity of 74% (95% CI = 59% – 86%) and a specificity of 95% (95% CI = 87% – 98%). In cases of appropriate use where the automated system also found one or more indications, it always included the correct indication.
Conclusions
We created an automated system that can identify established indications of medication use in electronic health records with high accuracy. It can provide clinical decision support for identifying potential overuse of PPIs, and could be useful for reducing overuse and also to encourage better documentation of indications.
doi:10.1002/pds.3387
PMCID: PMC3566345  PMID: 23233423
Overuse; Electronic Health Records; Drug Utilization; Indications; Proton Pump Inhibitors; Natural Language Processing
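The matching step described in the Methods above (checking each patient's extracted problem list against an indications knowledge base) can be sketched as follows. This is a minimal illustration, not the study's implementation: the `INDICATIONS` table, its entries, and the function names are hypothetical, and exact string matching stands in for whatever concept normalization the NLP system performs.

```python
# Hypothetical knowledge base: drug class -> accepted indications.
# Entries are illustrative only and are not taken from the study.
INDICATIONS = {
    "ppi": {
        "gastroesophageal reflux disease",
        "peptic ulcer disease",
        "upper gi bleeding",
    },
}


def find_indications(drug, problems, kb=INDICATIONS):
    """Return the documented problems that are known indications for the drug."""
    known = kb.get(drug.lower(), set())
    normalized = (p.strip().lower() for p in problems)
    return sorted(p for p in normalized if p in known)


def is_overuse_candidate(drug, problems, kb=INDICATIONS):
    """Flag a patient as an overuse candidate when no indication is documented."""
    return not find_indications(drug, problems, kb)
```

A patient whose problem list contains no known indication is only a candidate for overuse; in the study such cases were compared against manual review by gastroenterologists.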
2.  A new clustering method for detecting rare senses of abbreviations in clinical notes 
Journal of biomedical informatics  2012;45(6):1075-1083.
Abbreviations are widely used in clinical documents and they are often ambiguous. Building a list of possible senses (also called a sense inventory) for each ambiguous abbreviation is the first step to automatically identify correct meanings of abbreviations in given contexts. Clustering-based methods have been used to detect senses of abbreviations from a clinical corpus [1]. However, rare senses remain challenging and existing algorithms often fail to detect them. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation, at two different levels of annotation cost (10 vs. 20 instances for each abbreviation). Our results showed that the TCRS-based method could detect 85% of senses on average, while the EM-based method found only 75% of senses, when similar annotation effort (about 20 instances) was used. Further analysis demonstrated that the improvement by the TCRS method was mainly from additionally detected rare senses, thus indicating its usefulness for building more complete sense inventories of clinical abbreviations.
doi:10.1016/j.jbi.2012.06.003
PMCID: PMC3729222  PMID: 22742938
Natural language processing; Word sense discrimination; Clustering; Clinical abbreviations
3.  Drug–drug interaction through molecular structure similarity analysis 
Background
Drug–drug interactions (DDIs) are responsible for many serious adverse events; their detection is crucial for patient safety but is very challenging. Currently, the US Food and Drug Administration and pharmaceutical companies are showing great interest in the development of improved tools for identifying DDIs.
Methods
We present a new methodology applicable on a large scale that identifies novel DDIs based on molecular structural similarity to drugs involved in established DDIs. The underlying assumption is that if drug A and drug B interact to produce a specific biological effect, then drugs similar to drug A (or drug B) are likely to interact with drug B (or drug A) to produce the same effect. DrugBank was used as a resource for collecting 9454 established DDIs. The structural similarity of all pairs of drugs in DrugBank was computed to identify DDI candidates.
Results
The methodology was evaluated using as a gold standard the interactions retrieved from the initial DrugBank database. Results demonstrated an overall sensitivity of 0.68, specificity of 0.96, and precision of 0.26. Additionally, the methodology was also evaluated in an independent test using the Micromedex/Drugdex database.
Conclusion
The proposed methodology is simple, efficient, allows the investigation of large numbers of drugs, and helps highlight the etiology of DDI. A database of 58 403 predicted DDIs with structural evidence is provided as an open resource for investigators seeking to analyze DDIs.
doi:10.1136/amiajnl-2012-000935
PMCID: PMC3534468  PMID: 22647690
Drug-drug interaction; adverse drug event; structure similarity; molecular fingerprints; QSAR; molecular modeling; drug design; automated learning; statistical analysis of large datasets; discovery; and text and data mining methods
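The inference rule stated in the Methods above (drugs structurally similar to drug A are likely to interact with drug B when A and B are known to interact) can be sketched with a Tanimoto coefficient over fingerprint bit sets. This is a minimal sketch under stated assumptions: the abstract does not name the similarity measure or cutoff, so the Tanimoto coefficient, the `threshold` value, and the data layout are all hypothetical choices for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_ddis(fingerprints, known_ddis, threshold=0.7):
    """Propose new DDIs by structural similarity to drugs in known DDIs.

    fingerprints: dict mapping drug name -> set of on-bit indices (assumed layout)
    known_ddis:   iterable of (drug_a, drug_b, effect) tuples
    threshold:    assumed similarity cutoff, not taken from the paper
    Returns a dict mapping ((drug_x, drug_y), effect) -> best similarity score.
    """
    known_ddis = list(known_ddis)
    known_pairs = {frozenset((a, b)) for a, b, _ in known_ddis}
    candidates = {}
    for a, b, effect in known_ddis:
        # Apply the rule in both directions: similar-to-A with B, similar-to-B with A.
        for anchor, partner in ((a, b), (b, a)):
            for drug, fp in fingerprints.items():
                if drug in (a, b):
                    continue
                score = tanimoto(fp, fingerprints[anchor])
                pair = frozenset((drug, partner))
                if score >= threshold and pair not in known_pairs:
                    key = (tuple(sorted(pair)), effect)
                    candidates[key] = max(score, candidates.get(key, 0.0))
    return candidates
```

Keeping the similarity score and the originating effect with each candidate mirrors the abstract's point that the predictions come with structural evidence.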
4.  Predicting Monoamine Oxidase Inhibitory Activity through Ligand-Based Models 
Current topics in medicinal chemistry  2012;12(20):2258-2274.
The evolution of bio- and cheminformatics associated with the development of specialized software and increasing computer power has produced a great interest in theoretical in silico methods applied in rational drug design. These techniques apply the concept that “similar molecules have similar biological properties” that has been exploited in Medicinal Chemistry for years to design new molecules with desirable pharmacological profiles. Ligand-based methods are not dependent on receptor structural data and take into account two and three-dimensional molecular properties to assess similarity of new compounds with regard to the set of molecules with the biological property under study. Depending on the complexity of the calculation, there are different types of ligand-based methods, such as QSAR (Quantitative Structure-Activity Relationship) with 2D and 3D descriptors, CoMFA (Comparative Molecular Field Analysis) or pharmacophoric approaches. This work provides a description of a series of ligand-based models applied in the prediction of the inhibitory activity of monoamine oxidase (MAO) enzymes. The controlled regulation of the enzymes’ function through the use of MAO inhibitors is used as a treatment in many psychiatric and neurological disorders, such as depression, anxiety, Alzheimer’s and Parkinson’s disease. For this reason, multiple scaffolds, such as substituted coumarins, indolylmethylamine or pyridazine derivatives were synthesized and assayed toward MAO-A and MAO-B inhibition. Our intention is to focus on the description of ligand-based models to provide new insights into the relationship between the MAO inhibitory activity and the molecular structure of the different inhibitors, and further study enzyme selectivity and possible mechanisms of action.
PMCID: PMC3762258  PMID: 23231398
Alzheimer’s; CoMFA; Ligand-based models; MAO; Molecular Descriptors; Parkinson’s; Pharmacophore; QSAR
5.  Novel Data Mining Methodologies for Adverse Drug Event Discovery and Analysis 
Introduction
Discovery of new adverse drug events (ADEs) in the post-approval period is an important goal of the health system. Data mining methods that can transform data into meaningful knowledge to inform patient safety have proven to be essential. New opportunities have emerged to harness data sources that have not been used within the traditional framework. This article provides an overview of recent methodological innovations and data sources used in support of ADE discovery and analysis.
doi:10.1038/clpt.2012.50
PMCID: PMC3675775  PMID: 22549283
Pharmacovigilance; Adverse Drug Events; Data Mining
6.  Evaluation considerations for EHR-based phenotyping algorithms: A case study for drug-induced liver injury 
Developing electronic health record (EHR) phenotyping algorithms involves generating queries that run across the EHR data repository. Algorithms are commonly assessed within demonstration studies. There remains, however, little emphasis on assessing the precision and accuracy of measurement methods during the evaluation process. Depending on the complexity of an algorithm, interim refinements may be required to improve measurement methods. Therefore, we develop an evaluation framework that incorporates both measurement and demonstration studies. We evaluate a baseline EHR phenotyping algorithm for drug induced liver injury (DILI) developed in collaboration with electronic Medical Records Genomics (eMERGE) network participants. We conduct a measurement study and report qualitative (i.e., perceptions of evaluation approach effectiveness) and quantitative (i.e., inter-rater reliability) measures. We also conduct a demonstration study and report qualitative (i.e., appropriateness of results) and quantitative (i.e., positive predictive value) measures. Given results from the measurement study, our evaluation approach underwent multiple changes including the addition of laboratory value visualization and an expanded review of clinical notes. Results from the demonstration study informed changes to our algorithm. For example, given the goal of eMERGE to identify patients who may have a genetic susceptibility to DILI, we excluded overdose patients.
PMCID: PMC3814479  PMID: 24303321
7.  Detection of Drug-Drug Interactions by Modeling Interaction Profile Fingerprints 
PLoS ONE  2013;8(3):e58321.
Drug-drug interactions (DDIs) constitute an important problem in postmarketing pharmacovigilance and in the development of new drugs. The effectiveness or toxicity of a medication could be affected by the co-administration of other drugs that share pharmacokinetic or pharmacodynamic pathways. For this reason, a great effort is being made to develop new methodologies to detect and assess DDIs. In this article, we present a novel method based on drug interaction profile fingerprints (IPFs) with successful application to DDI detection. IPFs were generated based on the DrugBank database, which provided 9,454 well-established DDIs as a primary source of interaction data. The model uses IPFs to measure the similarity of pairs of drugs and generates new putative DDIs from the non-intersecting interactions of a pair. We described as part of our analysis the pharmacological and biological effects associated with the putative interactions; for example, the interaction between haloperidol and dicyclomine can cause increased risk of psychosis and tardive dyskinesia. First, we evaluated the method through hold-out validation and then by using four independent test sets that did not overlap with DrugBank. Precision for the test sets ranged from 0.4 to 0.5, with a more than twofold enhancement of the enrichment factor. In conclusion, we demonstrated the usefulness of the method in pharmacovigilance as a DDI predictor, and created a dataset of potential DDIs, highlighting the etiology or pharmacological effect of the DDI, and providing an exploratory tool to facilitate decision support in DDI detection and patient safety.
doi:10.1371/journal.pone.0058321
PMCID: PMC3592896  PMID: 23520498
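The idea in the abstract above (compare drugs by their interaction profiles, then propose the non-intersecting interactions of a similar pair as putative DDIs) can be sketched as below. This is an illustration under assumptions: profiles are modeled as plain sets of known interaction partners, similarity is taken to be the Tanimoto coefficient, and the `threshold` value is hypothetical; none of these details is specified in the abstract.

```python
from itertools import combinations


def build_profiles(known_ddis):
    """Map each drug to the set of drugs it is known to interact with."""
    profiles = {}
    for a, b in known_ddis:
        profiles.setdefault(a, set()).add(b)
        profiles.setdefault(b, set()).add(a)
    return profiles


def putative_ddis(known_ddis, threshold=0.5):
    """Propose DDIs from the non-shared interactions of drugs with similar profiles.

    known_ddis: iterable of (drug_a, drug_b) pairs
    threshold:  assumed Tanimoto cutoff on interaction profiles
    Returns a dict mapping a sorted (drug, partner) pair -> best profile similarity.
    """
    profiles = build_profiles(known_ddis)
    predicted = {}
    for x, y in combinations(sorted(profiles), 2):
        px, py = profiles[x], profiles[y]
        score = len(px & py) / len(px | py)
        if score < threshold:
            continue
        # Interactions known for one drug but not the other become candidates.
        for source, target in ((px, y), (py, x)):
            for partner in source - profiles[target]:
                if partner == target:
                    continue
                pair = tuple(sorted((target, partner)))
                predicted[pair] = max(score, predicted.get(pair, 0.0))
    return predicted
```

In this sketch a candidate inherits the similarity score of the most similar drug pair that produced it, which gives a simple way to rank predictions.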
8.  Facilitating adverse drug event detection in pharmacovigilance databases using molecular structure similarity: application to rhabdomyolysis 
Background
Adverse drug events (ADE) cause considerable harm to patients, and consequently their detection is critical for patient safety. The US Food and Drug Administration maintains an adverse event reporting system (AERS) to facilitate the detection of ADE in drugs. Various data mining approaches have been developed that use AERS to detect signals identifying associations between drugs and ADE. The signals must then be monitored further by domain experts, which is a time-consuming task.
Objective
To develop a new methodology that combines existing data mining algorithms with chemical information by analysis of molecular fingerprints to enhance initial ADE signals generated from AERS, and to provide a decision support mechanism to facilitate the identification of novel adverse events.
Results
The method achieved a significant improvement in precision in identifying known ADE, and a more than twofold signal enhancement when applied to the ADE rhabdomyolysis. The simplicity of the method assists in highlighting the etiology of the ADE by identifying structurally similar drugs. A set of drugs with strong evidence from both AERS and molecular fingerprint-based modeling is constructed for further analysis.
Conclusion
The results demonstrate that the proposed methodology could be used as a pharmacovigilance decision support tool to facilitate ADE detection.
doi:10.1136/amiajnl-2011-000417
PMCID: PMC3241177  PMID: 21946238
Adverse drug event; AERS; FDA; molecular fingerprints; rhabdomyolysis; spontaneous reporting system; structure similarity
9.  Combining Corpus-derived Sense Profiles with Estimated Frequency Information to Disambiguate Clinical Abbreviations 
AMIA Annual Symposium Proceedings  2012;2012:1004-1013.
Abbreviations are widely used in clinical notes and are often ambiguous. Word sense disambiguation (WSD) for clinical abbreviations therefore is a critical task for many clinical natural language processing (NLP) systems. Supervised machine learning based WSD methods are known for their high performance. However, it is time consuming and costly to construct annotated samples for supervised WSD approaches and sense frequency information is often ignored by these methods. In this study, we proposed a profile-based method that used dictated discharge summaries as an external source to automatically build sense profiles and applied them to disambiguate abbreviations in hospital admission notes via the vector space model. Our evaluation using a test set containing 2,386 annotated instances from 13 ambiguous abbreviations in admission notes showed that the profile-based method performed better than two baseline methods and achieved a best average precision of 0.792. Furthermore, we developed a strategy to combine sense frequency information estimated from a clustering analysis with the profile-based method. Our results showed that the combined approach largely improved the performance and achieved a highest precision of 0.875 on the same test set, indicating that integrating sense frequency information with local context is effective for clinical abbreviation disambiguation.
PMCID: PMC3540457  PMID: 23304376
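The profile-based disambiguation described above can be sketched with a vector space model: each sense has a bag-of-words profile, and the sense whose profile has the highest cosine similarity to the context is chosen. The abstract does not say how sense frequency is combined with the profile score, so multiplying the cosine score by an estimated sense frequency is an assumption made here for illustration, and all names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def disambiguate(context_words, sense_profiles, sense_freq=None):
    """Pick the sense whose profile best matches the context.

    sense_profiles: dict mapping sense -> Counter of profile terms
    sense_freq:     optional dict mapping sense -> estimated relative frequency;
                    weighting by multiplication is an assumed combination rule
    """
    context = Counter(context_words)
    scores = {}
    for sense, profile in sense_profiles.items():
        score = cosine(context, profile)
        if sense_freq is not None:
            score *= sense_freq.get(sense, 0.0)
        scores[sense] = score
    return max(scores, key=scores.get)
```

The frequency weight matters most when the local context is equally compatible with several senses, which is the situation the combined approach in the abstract is meant to address.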
10.  Deriving a probabilistic syntacto-semantic grammar for biomedicine based on domain-specific terminologies 
Journal of biomedical informatics  2011;44(5):805-814.
Biomedical natural language processing (BioNLP) is a useful technique that unlocks valuable information stored in textual data for practice and/or research. Syntactic parsing is a critical component of BioNLP applications that rely on correctly determining the sentence and phrase structure of free text. In addition to dealing with the vast amount of domain-specific terms, a robust biomedical parser needs to model the semantic grammar to obtain viable syntactic structures. With either a rule-based or corpus-based approach, the grammar engineering process requires substantial time and knowledge from experts, and does not always yield a semantically transferable grammar. To reduce the human effort and to promote semantic transferability, we propose an automated method for deriving a probabilistic grammar based on a training corpus consisting of concept strings and semantic classes from the Unified Medical Language System (UMLS), a comprehensive terminology resource widely used by the community. The grammar is designed to specify noun phrases only, due to the nominal nature of the majority of biomedical terminological concepts. Evaluated on manually parsed clinical notes, the derived grammar achieved a recall of 0.644, precision of 0.737, and average cross-bracketing of 0.61, which demonstrated better performance than a control grammar with the semantic information removed. Error analysis revealed shortcomings that could be addressed to improve performance. The results indicated the feasibility of an approach which automatically incorporates terminology semantics in the building of an operational grammar. Although the current performance of the unsupervised solution does not adequately replace manual engineering, we believe that once the performance issues are addressed, it could serve as an aid in a semi-supervised solution.
doi:10.1016/j.jbi.2011.04.006
PMCID: PMC3172402  PMID: 21549857
Natural language processing; Biomedical terminology; Semantic grammar; Probabilistic parsing
11.  Enhancing Adverse Drug Event Detection in Electronic Health Records Using Molecular Structure Similarity: Application to Pancreatitis 
PLoS ONE  2012;7(7):e41471.
Background
Adverse drug events (ADEs) detection and assessment is at the center of pharmacovigilance. Data mining of systems, such as FDA’s Adverse Event Reporting System (AERS) and more recently, Electronic Health Records (EHRs), can aid in the automatic detection and analysis of ADEs. Although different data mining approaches have been shown to be valuable, it is still crucial to improve the quality of the generated signals.
Objective
To leverage structural similarity by developing molecular fingerprint-based models (MFBMs) to strengthen ADE signals generated from EHR data.
Methods
A reference standard of drugs known to be causally associated with the adverse event pancreatitis was used to create a MFBM. Electronic Health Records (EHRs) from the New York Presbyterian Hospital were mined to generate structured data. Disproportionality Analysis (DPA) was applied to the data, and 278 possible signals related to the ADE pancreatitis were detected. Candidate drugs associated with these signals were then assessed using the MFBM to find the most promising candidates based on structural similarity.
Results
The use of MFBM as a means to strengthen or prioritize signals generated from the EHR significantly improved the detection accuracy of ADEs related to pancreatitis. MFBM also highlights the etiology of the ADE by identifying structurally similar drugs, which could follow a similar mechanism of action.
Conclusion
The method proposed in this paper provides evidence of being a promising adjunct to existing automated ADE detection and analysis approaches.
doi:10.1371/journal.pone.0041471
PMCID: PMC3404072  PMID: 22911794
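The abstract above applies Disproportionality Analysis (DPA) without naming a specific statistic. As an illustration only, the proportional reporting ratio (PRR) is one commonly used disproportionality measure computed from a 2x2 contingency table; whether the study used PRR or another measure is not stated, so the choice here is an assumption, as are the function and parameter names.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Proportional reporting ratio from a 2x2 drug/event contingency table.

    a: records with the drug and the event
    b: records with the drug and without the event
    c: records without the drug and with the event
    d: records without the drug and without the event
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("each row of the 2x2 table needs at least one record")
    if c == 0:
        raise ValueError("PRR is undefined when no comparison records have the event")
    # Event rate among exposed records divided by the rate among all other records.
    return (a / (a + b)) / (c / (c + d))
```

A value well above 1 indicates that the event is recorded disproportionately often with the drug; in the workflow described above, such signals would then be prioritized by structural similarity to drugs already known to cause the event.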
12.  Biclustering of Adverse Drug Events in FDA’s Spontaneous Reporting System 
In this paper we present a new pharmacovigilance data mining technique based on the biclustering paradigm, which is designed to identify drug groups that share a common set of adverse events in FDA’s spontaneous reporting system. A taxonomy of biclusters is developed, revealing that a significant number of bona fide adverse drug event (ADE) biclusters are identified. Statistical tests indicate that it is extremely unlikely that the discovered bicluster structures as well as their content arose by chance. Some of the biclusters classified as indeterminate provide support for previously unrecognized and potentially novel ADEs. In addition, we demonstrate the importance of the proposed methodology to several important aspects of pharmacovigilance, such as providing insight into the etiology of ADEs, facilitating the identification of novel ADEs, suggesting methods and rationale for aggregating terminologies, highlighting areas of focus, and serving as a data exploratory tool.
doi:10.1038/clpt.2010.285
PMCID: PMC3282185  PMID: 21191383
Pharmacovigilance; Adverse Drug Events; Biclustering; Clustering; FDA Adverse Event Reporting System
13.  Selecting Information in Electronic Health Records for Knowledge Acquisition 
Journal of biomedical informatics  2010;43(4):595-601.
Knowledge acquisition of relations between biomedical entities is critical for many automated biomedical applications, including pharmacovigilance and decision support. Automated acquisition of statistical associations from biomedical and clinical documents has shown some promise. However, acquisition of clinically meaningful relations (i.e. specific associations) remains challenging because textual information is noisy and co-occurrence does not typically determine specific relations. In this work, we focus on acquisition of two types of relations from clinical reports: disease-manifestation related symptom (MRS) and drug-adverse drug event (ADE), and explore the use of filtering by sections of the report to improve performance. Evaluation indicated that applying the filters improved recall (disease-MRS: from 0.85 to 0.90; drug-ADE: from 0.43 to 0.75) and precision (disease-MRS: from 0.82 to 0.92; drug-ADE: from 0.16 to 0.31). This preliminary study demonstrates that selecting information in narrative electronic reports based on the section improves the detection of disease-MRS and drug-ADE types of relations. Further investigation of complementary methods, such as more sophisticated statistical methods, more complex temporal models and use of information from other knowledge sources, is needed.
doi:10.1016/j.jbi.2010.03.011
PMCID: PMC2902678  PMID: 20362071
knowledge acquisition; natural language processing (NLP); text mining; pharmacovigilance; decision support; electronic health record (EHR)
14.  Discovering Disease Associations by Integrating Electronic Clinical Data and Medical Literature 
PLoS ONE  2011;6(6):e21132.
Electronic health record (EHR) systems offer an exceptional opportunity for studying many diseases and their associated medical conditions within a population. The increasing number of clinical record entries that have become available electronically provides access to rich, large sets of patients' longitudinal medical information. By integrating and comparing relations found in the EHRs with those already reported in the literature, we are able to verify existing and to identify rare or novel associations. Of particular interest is the identification of rare disease co-morbidities, where the small numbers of diagnosed patients make robust statistical analysis difficult. Here, we introduce ADAMS, an Application for Discovering Disease Associations using Multiple Sources, which contains various statistical and language processing operations. We apply ADAMS to the New York-Presbyterian Hospital's EHR to combine the information from the relational diagnosis tables and textual discharge summaries with those from PubMed and Wikipedia in order to investigate the co-morbidities of the rare diseases Kaposi sarcoma, toxoplasmosis, and Kawasaki disease. In addition to finding well-known characteristics of diseases, ADAMS can identify rare or previously unreported associations. In particular, we report a statistically significant association between Kawasaki disease and diagnosis of autistic disorder.
doi:10.1371/journal.pone.0021132
PMCID: PMC3121722  PMID: 21731656
15.  Community Vaccinators in the Workplace 
Emerging Infectious Diseases  2011;17(6):1134-1135.
doi:10.3201/eid1706.101763
PMCID: PMC3358212  PMID: 21749793
Vaccination; immunization; human influenza; workplace; disease outbreaks; whooping cough; diphtheria; tetanus; vaccine; letter
16.  A Drug-Adverse Event Extraction Algorithm to Support Pharmacovigilance Knowledge Mining from PubMed Citations 
AMIA Annual Symposium Proceedings  2011;2011:1464-1470.
Adverse drug events (ADEs) create a serious problem causing substantial harm to patients. An executable standardized knowledgebase of drug-ADE relations which is publicly available would be valuable so that it could be used for ADE detection. The literature is an important source that could be used to generate a knowledgebase of drug-ADE pairs. In this paper, we report on a method that automatically determines whether a specific adverse event (AE) is caused by a specific drug based on the content of PubMed citations. A drug-ADE classification method was initially developed to detect neutropenia based on a pre-selected set of drugs. This method was then applied to a different set of 76 drugs to determine if they caused neutropenia. For further proof of concept this method was applied to 48 drugs to determine whether they caused another AE, myocardial infarction. Results showed that AUROC was 0.93 and 0.86 respectively.
PMCID: PMC3243206  PMID: 22195210
17.  Determining the Reasons for Medication Prescriptions in the EHR using Knowledge and Natural Language Processing 
Knowledge of medication indications is significant for automatic applications aimed at improving patient safety, such as computerized physician order entry and clinical decision support systems. The Electronic Health Record (EHR) contains pertinent information related to patient safety such as information related to appropriate prescribing. However, the reasons for medication prescriptions are usually not explicitly documented in the patient record. This paper describes a method that determines the reasons for medication uses based on information occurring in outpatient notes. The method utilizes drug-indication knowledge that we acquired, and natural language processing. Evaluation showed the method obtained a sensitivity of 62.8%, specificity of 93.9%, precision of 90% and F-measure of 73.9%. This pilot study demonstrated that linking external drug indication knowledge to the EHR for determining the reasons for medication use was promising, but also revealed some challenges. Future work will focus on increasing the accuracy and coverage of the indication knowledge and evaluating its performance using a much larger set of drugs frequently used in the outpatient population.
PMCID: PMC3243251  PMID: 22195134
18.  Mining multi-item drug adverse effect associations in spontaneous reporting systems 
BMC Bioinformatics  2010;11(Suppl 9):S7.
Background
Multi-item adverse drug event (ADE) associations are associations relating multiple drugs to possibly multiple adverse events. The current standard in pharmacovigilance is bivariate association analysis, where each single drug-adverse effect combination is studied separately. The importance and difficulty of detecting multi-item ADE associations was noted in several prominent pharmacovigilance studies. In this paper we examine the application of a well-established data mining method known as association rule mining, which we tailored to the above problem, and demonstrate its value. The method was applied to the FDA’s spontaneous adverse event reporting system (AERS) with minimal restrictions and expectations on its output, an experiment that has not been previously done on the scale and generality proposed in this work.
Results
Based on a set of 162,744 reports of suspected ADEs reported to AERS and published in the year 2008, our method identified 1167 multi-item ADE associations. A taxonomy that characterizes the associations was developed based on a representative sample. A significant number (67% of the total) of potential multi-item ADE associations identified were characterized and clinically validated by a domain expert as previously recognized ADE associations. Several potentially novel ADEs were also identified. A smaller proportion (4%) of associations were characterized and validated as known drug-drug interactions.
Conclusions
Our findings demonstrate that multi-item ADEs are present and can be extracted from the FDA’s adverse effect reporting system using our methodology, suggesting that our method is a valid approach for the initial identification of multi-item ADEs. The study also revealed several limitations and challenges that can be attributed to both the method and quality of data.
doi:10.1186/1471-2105-11-S9-S7
PMCID: PMC2967748  PMID: 21044365
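Association rule mining, as applied above, scores rules of the form {drugs} -> {adverse events} by how often they hold across reports. The two standard measures, support and confidence, can be computed as in the sketch below. The report layout (a dict with `drugs` and `events` sets) and the function name are hypothetical, and the study's tailoring of the method and its thresholds are not reproduced here.

```python
def rule_metrics(reports, drugs, events):
    """Support and confidence of the rule {drugs} -> {events}.

    reports: list of dicts with "drugs" and "events" sets (assumed layout)
    Returns (support, confidence).
    """
    drugs, events = frozenset(drugs), frozenset(events)
    total = len(reports)
    # Reports mentioning every drug in the rule's antecedent.
    with_drugs = sum(1 for r in reports if drugs <= r["drugs"])
    # Reports mentioning every drug and every adverse event in the rule.
    with_both = sum(
        1 for r in reports if drugs <= r["drugs"] and events <= r["events"]
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_drugs if with_drugs else 0.0
    return support, confidence
```

Because the antecedent and consequent are sets, the same function covers the bivariate case (one drug, one event) and the multi-item case that the study targets.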
19.  Information Visualization Techniques in Bioinformatics during the Postgenomic Era 
Drug discovery today. Biosilico  2004;2(6):237-245.
Information visualization techniques, which take advantage of the bandwidth of human vision, are powerful tools for organizing and analyzing a large amount of data. In the postgenomic era, information visualization tools are indispensable for biomedical research. This paper aims to present an overview of current applications of information visualization techniques in bioinformatics for visualizing different types of biological data, such as from genomics, proteomics, expression profiling and structural studies. Finally, we discuss the challenges of information visualization in bioinformatics related to dealing with more complex biological information in the emerging fields of systems biology and systems medicine.
doi:10.1016/S1741-8364(04)02423-0
PMCID: PMC2957900  PMID: 20976032
Information visualization; Bioinformatics
20.  PHENOGO: ASSIGNING PHENOTYPIC CONTEXT TO GENE ONTOLOGY ANNOTATIONS WITH NATURAL LANGUAGE PROCESSING 
Natural language processing (NLP) is a high throughput technology because it can process vast quantities of text within a reasonable time period. It has the potential to substantially facilitate biomedical research by extracting, linking, and organizing massive amounts of information that occur in biomedical journal articles as well as in textual fields of biological databases. Until recently, much of the work in biological NLP and text mining has revolved around recognizing the occurrence of biomolecular entities in articles, and in extracting particular relationships among the entities. Now, researchers have recognized a need to link the extracted information to ontologies or knowledge bases, which is a more difficult task. One such knowledge base is Gene Ontology annotations (GOA), which significantly increases semantic computations over the function, cellular components and processes of genes. For multicellular organisms, these annotations can be refined with phenotypic context, such as the cell type, tissue, and organ because establishing phenotypic contexts in which a gene is expressed is a crucial step for understanding the development and the molecular underpinning of the pathophysiology of diseases. In this paper, we propose a system, PhenoGO, which automatically augments annotations in GOA with additional context. PhenoGO utilizes an existing NLP system, called BioMedLEE, an existing knowledge-based phenotype organizer system (PhenOS) in conjunction with MeSH indexing and established biomedical ontologies. More specifically, PhenoGO adds phenotypic contextual information to existing associations between gene products and GO terms as specified in GOA. The system also maps the context to identifiers that are associated with different biomedical ontologies, including the UMLS, Cell Ontology, Mouse Anatomy, NCBI taxonomy, GO, and Mammalian Phenotype Ontology. 
In addition, PhenoGO was evaluated for coding of anatomical and cellular information and assigning the coded phenotypes to the correct GOA; results obtained show that PhenoGO has a precision of 91% and recall of 92%, demonstrating that the PhenoGO NLP system can accurately encode a large number of anatomical and cellular ontologies to GO annotations. The PhenoGO Database may be accessed at the following URL: http://www.phenoGO.org
PMCID: PMC2906243  PMID: 17094228
21.  Visualizing Information across Multidimensional Post-Genomic Structured and Textual Databases 
Bioinformatics (Oxford, England)  2004;21(8):1659-1667.
Motivation
Visualizing relations among biological information to facilitate understanding is crucial to biological research during the post-genomic era. Although different systems have been developed to view gene-phenotype relations for specific databases, very few have been designed specifically as a general flexible tool for visualizing multidimensional genotypic and phenotypic information together. Our goal is to develop a method for visualizing multidimensional genotypic and phenotypic information and a model that unifies different biological databases in order to present the integrated knowledge using a uniform interface.
Results
We developed a novel, flexible and generalizable visualization tool, called PhenoGenesviewer (PGviewer), which in this paper was used to display gene-phenotype relations from a human-curated database (OMIM) and from an automatic method using a Natural Language Processing tool called BioMedLEE. Data obtained from multiple databases were first integrated into a uniform structure and then organized by PGviewer. PGviewer provides a flexible query interface that allows dynamic selection and ordering of any desired dimension in the databases. Based on users’ queries, results can be visualized using hierarchical expandable trees that present views specified by users according to their research interests. We believe that this method, which allows users to dynamically organize and visualize multiple dimensions, is a potentially powerful and promising tool that should substantially facilitate biological research.
doi:10.1093/bioinformatics/bti210
PMCID: PMC2901923  PMID: 15598839
22.  DISCOVERY OF PROTEIN INTERACTION NETWORKS SHARED BY DISEASES
The study of protein-protein interactions is essential to define the molecular networks that contribute to maintaining homeostasis of an organism’s body functions. Disruptions in protein interaction networks have been shown to result in diseases in both humans and animals. Monogenic diseases disrupting biochemical pathways, such as hereditary coagulopathies (e.g. hemophilia), have provided deep insight into the biochemical pathways of acquired coagulopathies of complex diseases. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia. Similarly, more complex diseases such as different cancers have been shown to result from malfunctions of common protein pathways. In order to discover, in high throughput, the molecular underpinnings of poorly characterized diseases, we present a statistical method to identify shared protein interaction network(s) between diseases. Integrating (i) a protein interaction network with (ii) disease-to-protein relationships derived from mining Gene Ontology annotations and the biomedical literature with natural language understanding (PhenoGO), we identified protein-protein interactions that were associated with pairs of diseases and calculated the statistical significance of the occurrence of interactions in the protein interaction knowledgebase. Significant correlations between diseases and shared protein networks were identified and evaluated in this study, demonstrating the high precision of the approach and correct non-trivial predictions, signifying the potential for discovery. In conclusion, we demonstrate that the associations between diseases are directly correlated to their underlying protein-protein interaction networks, possibly providing insight into the underlying molecular mechanisms of phenotypes and biological processes disrupted in related diseases.
PMCID: PMC2886192  PMID: 17992746
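The abstract above does not say which statistical test was used to score interactions shared by a pair of diseases. As an illustration only, a hypergeometric tail probability is one common way to ask whether an overlap is larger than chance. The sketch below assumes a knowledgebase of `N` interactions, of which `K` are associated with one disease and `n` with another, with `k` shared; the function name and all parameters are hypothetical.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """Probability of observing at least k shared items by chance.

    Draw n items without replacement from N, of which K are marked;
    return P(X >= k), where X is the number of marked items drawn.
    """
    upper = min(K, n)        # X cannot exceed either set size
    total = comb(N, n)       # number of equally likely draws
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)  # i marked, n - i unmarked
        for i in range(k, upper + 1)
    )
    return favourable / total
```

For example, `hypergeom_sf(3, 10, 3, 4)` evaluates to 7/210, the chance that all three marked items appear in a draw of four from ten. A real analysis would also need a correction for testing many disease pairs.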
23.  Information theory applied to the sparse gene ontology annotation network to predict novel gene function 
Bioinformatics (Oxford, England)  2007;23(13):i529-i538.
Motivation
Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomics approaches.
Results
We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. Using 10-fold cross-validation, we demonstrated that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, this method can generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than the generally used cross-validation evaluations. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003.
Availability
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt.
doi:10.1093/bioinformatics/btm195
PMCID: PMC2882681  PMID: 17646340
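The abstract above does not give the details of the ITSS algorithm. To show what an information-theoretic semantic similarity over an annotation hierarchy can look like, the sketch below implements the standard Resnik-style measure: a term's information content is the negative log of the fraction of genes annotated to it or its descendants, and two terms are as similar as their most informative common ancestor. This is not the authors' method; the toy ontology, gene names, and function names are all hypothetical.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "g1": {"dna_binding"},
    "g2": {"rna_binding"},
    "g3": {"catalysis"},
    "g4": {"binding"},
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content():
    """IC(t) = -log2(fraction of genes annotated to t or a descendant)."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set()
        for t in terms:
            covered |= ancestors(t)  # an annotation also counts for ancestors
        for t in covered:
            counts[t] += 1
    n = len(ANNOTATIONS)
    return {t: -math.log2(c / n) for t, c in counts.items() if c > 0}

def resnik(t1, t2, ic):
    """Similarity = IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)
```

In this toy data, `dna_binding` and `rna_binding` share the ancestor `binding`, which covers three of four genes, so their similarity is log2(4/3); `dna_binding` and `catalysis` share only `root`, so their similarity is zero.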
24.  Bio-Ontologies and Text: Bridging the Modeling Gap Between 
Bioinformatics (Oxford, England)  2006;22(19):2421-2429.
Motivation
Natural language processing (NLP) techniques are increasingly being used in biology to automate the capture of new biological discoveries in text, which are being reported at a rapid rate. To facilitate the computational reuse and integration of information buried in unstructured text, we propose a schema that represents a comprehensive set of biological entities and relations as expressed in natural language. In addition, the schema connects different scales of biological information, and provides links from the textual information to existing ontologies, which are essential in biology for integration, organization, dissemination, and knowledge management of heterogeneous information. A comprehensive representation for otherwise heterogeneous datasets, such as the one proposed, is critical for advancing systems biology because it allows for acquisition and reuse of unprecedented volumes of diverse types of knowledge and information from text.
Results
A novel representational schema, PGschema, was developed that enables translation of information in textual narratives to a well-defined data structure comprising genotypic and phenotypic concepts from established ontologies along with modifiers and relationships. Initial evaluation for coverage of a selected set of entities showed that 85% of the information could be represented. Moreover, PGschema can be realized automatically in an XML format by using natural language techniques to process the text.
doi:10.1093/bioinformatics/btl405
PMCID: PMC2879055  PMID: 16870928
25.  Evaluation of an Ontology-anchored Natural Language-based Approach for Asserting Multi-scale Biomolecular Networks for Systems Medicine 
The ability to adequately and efficiently integrate unstructured, heterogeneous datasets, which are inherent to systems biology and medicine, is one of the primary limitations to their comprehensive analysis. Natural language processing (NLP) and biomedical ontologies are automated methods for capturing, standardizing and integrating information across diverse sources, including narrative text. We have utilized the BioMedLEE NLP system to extract and encode, using standard ontologies (e.g., Cell Type Ontology, Mammalian Phenotype, Gene Ontology), biomolecular mechanisms and clinical phenotypes from the scientific literature. We subsequently applied semantic processing techniques to the structured BioMedLEE output to determine the relationships between these biomolecular and clinical phenotype concepts. Our evaluation showed that the average precision and recall of BioMedLEE in annotating phrases comprising cell type, anatomy/disease, and gene/protein concepts were 86% and 78%, respectively. The precision of the asserted phenotype-molecular relationships was 75%.
PMCID: PMC3041541  PMID: 21347135