To describe a collaborative approach for developing an electronic health record (EHR) phenotyping algorithm for drug-induced liver injury (DILI).
We analyzed the types and causes of differences in DILI case definitions provided by two institutions, Columbia University and Mayo Clinic; harmonized two EHR phenotyping algorithms; and assessed the performance of the resulting algorithm at three institutions. Performance was measured by sensitivity, specificity, positive predictive value, and negative predictive value, except that sensitivity was measured only at Columbia University.
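As an illustration of the four performance measures named above, the following minimal sketch computes them from a 2x2 confusion table. The counts are hypothetical and are not taken from the study; the names `ConfusionCounts` and `diagnostic_metrics` are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from comparing algorithm output with chart review."""
    tp: int  # algorithm-selected, confirmed case
    fp: int  # algorithm-selected, not a case
    fn: int  # missed by the algorithm, confirmed case
    tn: int  # not selected, not a case


def _ratio(numerator, denominator):
    # Return None when a measure is undefined (empty denominator).
    return numerator / denominator if denominator else None


def diagnostic_metrics(c):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Hypothetical counts, for illustration only.
example = ConfusionCounts(tp=8, fp=2, fn=2, tn=88)
metrics = diagnostic_metrics(example)
print(metrics)
```

Sensitivity and NPV need the false negatives, which requires reviewing patients the algorithm did not select; this may be one reason a measure such as sensitivity is assessed at fewer sites than PPV.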
Although these sites had the same case definition, their phenotyping methods differed by selection of liver injury diagnoses, inclusion of drugs cited in DILI cases, laboratory tests assessed, laboratory thresholds for liver injury, exclusion criteria, and approaches to validating phenotypes. We reached consensus on a DILI phenotyping algorithm and implemented it at three institutions. The algorithm was adapted locally to account for differences in populations and data access. Implementations collectively yielded 117 algorithm-selected cases and 23 confirmed true positive cases.
Phenotyping for rare conditions benefits significantly from pooling data across institutions. Despite the heterogeneity of EHRs and varied algorithm implementations, we demonstrated the portability of this algorithm across three institutions. The performance of this algorithm for identifying DILI was comparable with other computerized approaches to identify adverse drug events.
Phenotyping algorithms developed for rare and complex conditions are likely to require adaptive implementation at multiple institutions. Better approaches are also needed to share algorithms. Early agreement on goals, data sources, and validation methods may improve the portability of the algorithms.
Electronic health records; Phenotyping; Pharmacovigilance; Drug-induced liver injury; Rare diseases
Extracting comorbidity information is crucial for phenotypic studies because of the confounding effect of comorbidities. We developed an automated method that accurately determines comorbidities from electronic medical records. Using a modified version of the Charlson comorbidity index (CCI), two physicians created a reference standard of comorbidities by manual review of 100 admission notes. We processed the notes using the MedLEE natural language processing system, and wrote queries to extract comorbidities automatically from its structured output. Interrater agreement for the reference set was very high (97.7%). Our method yielded an F1 score of 0.761 and the summed CCI score was not different from the reference standard (p=0.329, power 80.4%). In comparison, obtaining comorbidities from claims data yielded an F1 score of 0.741, due to lower sensitivity (66.1%). Because CCI has previously been validated as a predictor of mortality and readmission, our method could allow automated prediction of these outcomes.
Comorbidity; Confounding Factors; Natural Language Processing
Observational studies suggest that proton pump inhibitors (PPIs) are a risk factor for incident Clostridium difficile infection (CDI). Data also suggest an association between PPIs and recurrent CDI, although large-scale studies focusing solely on hospitalized patients are lacking. We therefore performed a retrospective cohort analysis of inpatients with incident CDI to assess receipt of PPIs as a risk factor for CDI recurrence in this population.
Using electronic medical records, we identified hospitalized adult patients between December 1, 2009 and June 30, 2012 with incident CDI, defined as a first positive stool test for C. difficile toxin B and who received appropriate treatment. Electronic records were parsed for clinical factors including receipt of PPIs, other acid suppression, non-CDI antibiotics, and comorbidities. The primary exposure was in-hospital PPIs given concurrently with C. difficile treatment. Recurrence was defined as a second positive stool test 15 to 90 days after the initial positive test. C. difficile recurrence rates in the PPI exposed and unexposed groups were compared with the log-rank test. Multivariable Cox proportional hazards modeling was performed to control for demographics, comorbidities, and other clinical factors.
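The recurrence definition above (a second positive stool test 15 to 90 days after the initial positive test) can be sketched as a simple date-window check. This is a minimal sketch: treating both window boundaries as inclusive is an assumption, and the function names and example dates are hypothetical.

```python
from datetime import date

# Window from the definition above; inclusive boundaries are an assumption.
MIN_DAYS = 15
MAX_DAYS = 90


def is_recurrence(initial_positive, later_positive):
    """True if the later positive test falls 15 to 90 days after the first."""
    delta_days = (later_positive - initial_positive).days
    return MIN_DAYS <= delta_days <= MAX_DAYS


def first_recurrence(initial_positive, later_positives):
    """Return the earliest qualifying positive test date, or None."""
    for test_date in sorted(later_positives):
        if is_recurrence(initial_positive, test_date):
            return test_date
    return None


# Hypothetical patient: positives at day 9, day 31 and day 120.
initial = date(2010, 1, 1)
later = [date(2010, 5, 1), date(2010, 1, 10), date(2010, 2, 1)]
print(first_recurrence(initial, later))
```

A positive test before day 15 is ignored here on the assumption that it more likely reflects the initial episode than a recurrence.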
We identified 894 inpatients with incident CDI. The cumulative incidence of CDI recurrence in the cohort was 23%. Receipt of PPIs concurrent with CDI treatment was not associated with C. difficile recurrence (HR 0.82; 95% CI 0.58–1.16). Black race (HR 1.66, 95% CI 1.05–2.63), increased age (HR 1.02, 95% CI 1.01–1.03), and increased comorbidities (HR 1.09, 95% CI 1.04–1.14) were associated with CDI recurrence. In light of a higher 90-day mortality seen among those who received PPIs (log-rank p = 0.02), we also analyzed the subset of patients who survived to 90 days of follow-up. Again, there was no association between PPIs and CDI recurrence (HR 0.87; 95% CI 0.60–1.28). Finally, there was no association between recurrent CDI and increased duration or dose of PPIs.
Among hospitalized adults with C. difficile, receipt of PPIs concurrent with C. difficile treatment was not associated with CDI recurrence. Black race, increased age, and increased comorbidities significantly predicted recurrence. Future studies should test interventions to prevent CDI recurrence among high risk inpatients.
Data-mining algorithms that can produce accurate signals of potentially novel adverse drug reactions (ADRs) are a central component of pharmacovigilance. We propose a signal-detection strategy that combines the adverse event reporting system (AERS) of the Food and Drug Administration and electronic health records (EHRs) by requiring signaling in both sources. We claim that this approach leads to improved accuracy of signal detection when the goal is to produce a highly selective ranked set of candidate ADRs.
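The idea of requiring a signal in both sources can be sketched as a filter followed by a ranking. This is a minimal sketch under stated assumptions: the score dictionaries, thresholds, and the choice to rank surviving candidates by their AERS score are all hypothetical and are not specified in the text above.

```python
def replicated_signals(aers_scores, ehr_scores, aers_threshold, ehr_threshold):
    """Keep drug-event pairs that signal in BOTH sources.

    Each argument maps a (drug, event) pair to a signal score.
    Ranking by AERS score is an assumption made for this sketch.
    """
    kept = []
    for pair, aers_score in aers_scores.items():
        ehr_score = ehr_scores.get(pair)
        if ehr_score is None:
            continue  # pair never observed in the EHR source
        if aers_score >= aers_threshold and ehr_score >= ehr_threshold:
            kept.append((pair, aers_score))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores, for illustration only.
aers = {
    ("drug_a", "event_x"): 3.1,
    ("drug_b", "event_x"): 5.4,
    ("drug_c", "event_x"): 2.2,
}
ehr = {
    ("drug_a", "event_x"): 2.5,
    ("drug_b", "event_x"): 0.4,  # does not replicate in the EHR
    ("drug_c", "event_x"): 2.9,
}
print(replicated_signals(aers, ehr, aers_threshold=2.0, ehr_threshold=2.0))
```

The filter trades recall for precision: the strongest AERS-only signal (`drug_b`) is dropped because it does not replicate, which matches the stated goal of a highly selective ranked set.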
Our investigation was based on over 4 million AERS reports and information extracted from 1.2 million EHR narratives. Well-established methodologies were used to generate signals from each source. The study focused on ADRs related to three high-profile serious adverse reactions. A reference standard of over 600 established and plausible ADRs was created and used to evaluate the proposed approach against a comparator.
The combined signaling system achieved a statistically significant large improvement over AERS (baseline) in the precision of top ranked signals. The average improvement ranged from 31% to almost threefold for different evaluation categories. Using this system, we identified a new association between the agent, rasburicase, and the adverse event, acute pancreatitis, which was supported by clinical review.
The results provide promising initial evidence that combining AERS with EHRs via the framework of replicated signaling can improve the accuracy of signal detection for certain operating scenarios. The use of additional EHR data is required to further evaluate the capacity and limits of this system and to extend the generalizability of these results.
Pharmacovigilance; Adverse Drug Reactions; Integration; Electronic Health Records
Medication overuse is a serious concern in healthcare as it leads to increased expenditures, side effects and morbidities. Identifying overuse is only possible through excluding appropriate indications that are primarily mentioned in unstructured notes. We developed a framework for automatic identification of medication overuse and applied it to proton pump inhibitors (PPIs).
We first created an indications knowledgebase using data from drug labels, clinical guidelines, expert opinion and other sources. We also obtained the list of current problems for 200 randomly selected inpatients who received PPIs using a natural language processing system and the discharge summaries of those patients. These problems were checked against the indications knowledge base to identify overuse candidates. Two gastroenterologists manually reviewed the notes and identified cases of overuse. Results from the automated framework were compared to the manual review.
Reviewers had high inter-rater reliability in finding indications (agreement = 92.1%, Cohen’s κ = 0.773). In the 137 notes included in the final analysis, our system identified indications with a sensitivity of 74% (95% CI = 59% – 86%) and a specificity of 95% (95% CI = 87% – 98%). In cases of appropriate use where the automated system also found one or more indications, it always included the correct indication.
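Raw agreement and Cohen's κ, as reported above, differ because κ corrects for agreement expected by chance. The following minimal sketch computes both from two reviewers' labels; the label lists are hypothetical and do not reproduce the study data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label rates.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = indication found, 0 = no indication found.
reviewer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

In this toy example the raters agree on 80% of items, but half of that agreement is expected by chance, so κ is noticeably lower than the raw agreement.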
We created an automated system that can identify established indications of medication use in electronic health records with high accuracy. It can provide clinical decision support for identifying potential overuse of PPIs, and could be useful for reducing overuse and also to encourage better documentation of indications.
Overuse; Electronic Health Records; Drug Utilization; Indications; Proton Pump Inhibitors; Natural Language Processing
Abbreviations are widely used in clinical documents and they are often ambiguous. Building a list of possible senses (also called a sense inventory) for each ambiguous abbreviation is the first step toward automatically identifying the correct meanings of abbreviations in given contexts. Clustering-based methods have been used to detect senses of abbreviations from a clinical corpus. However, rare senses remain challenging, and existing algorithms do not detect them reliably. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation, at two different levels of annotation cost (10 vs. 20 instances for each abbreviation). Our results showed that the TCRS-based method could detect 85% of senses on average, while the EM-based method found only 75% of senses, when similar annotation effort (about 20 instances) was used. Further analysis demonstrated that the improvement by the TCRS method was mainly from additionally detected rare senses, thus indicating its usefulness for building more complete sense inventories of clinical abbreviations.
Natural language processing; Word sense discrimination; Clustering; Clinical abbreviations
Drug–drug interactions (DDIs) are responsible for many serious adverse events; their detection is crucial for patient safety but is very challenging. Currently, the US Food and Drug Administration and pharmaceutical companies are showing great interest in the development of improved tools for identifying DDIs.
We present a new methodology applicable on a large scale that identifies novel DDIs based on molecular structural similarity to drugs involved in established DDIs. The underlying assumption is that if drug A and drug B interact to produce a specific biological effect, then drugs similar to drug A (or drug B) are likely to interact with drug B (or drug A) to produce the same effect. DrugBank was used as a resource for collecting 9454 established DDIs. The structural similarity of all pairs of drugs in DrugBank was computed to identify DDI candidates.
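The similarity assumption above can be sketched as follows: for each established interaction (A, B), any drug sufficiently similar to A becomes a candidate partner for B, and vice versa. This is a minimal sketch, not the published implementation. The Tanimoto coefficient over fingerprint bit sets is an assumed similarity measure, and the fingerprints, drug names, and threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


def infer_ddis(known_ddis, fingerprints, threshold):
    """Propose candidate DDIs from structural similarity.

    known_ddis:   list of (drug_a, drug_b, effect) tuples
    fingerprints: dict mapping drug name -> set of on-bit indices
    Returns {((drug_1, drug_2), effect): best supporting similarity}.
    """
    known_pairs = {frozenset((a, b)) for a, b, _ in known_ddis}
    candidates = {}
    for a, b, effect in known_ddis:
        # Try replacing each side of the known pair with a similar drug.
        for anchor, partner in ((a, b), (b, a)):
            for other, fp in fingerprints.items():
                if other in (a, b):
                    continue
                sim = tanimoto(fingerprints[anchor], fp)
                pair = frozenset((other, partner))
                if sim >= threshold and pair not in known_pairs:
                    key = (tuple(sorted(pair)), effect)
                    candidates[key] = max(sim, candidates.get(key, 0.0))
    return candidates


# Hypothetical fingerprints and one hypothetical established interaction.
fps = {
    "drug_a": {1, 2, 3, 4},
    "drug_b": {10, 11},
    "drug_c": {1, 2, 3, 5},  # structurally close to drug_a
    "drug_d": {20, 21},
}
known = [("drug_a", "drug_b", "effect_x")]
print(infer_ddis(known, fps, threshold=0.5))
```

Keeping the best supporting similarity for each candidate gives a natural ranking score, and the established pair that produced the candidate points to the likely effect.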
The methodology was evaluated using as a gold standard the interactions retrieved from the initial DrugBank database. Results demonstrated an overall sensitivity of 0.68, specificity of 0.96, and precision of 0.26. Additionally, the methodology was also evaluated in an independent test using the Micromedex/Drugdex database.
The proposed methodology is simple, efficient, allows the investigation of large numbers of drugs, and helps highlight the etiology of DDI. A database of 58 403 predicted DDIs with structural evidence is provided as an open resource for investigators seeking to analyze DDIs.
Drug-drug interaction; adverse drug event; structure similarity; molecular fingerprints; QSAR; molecular modeling; drug design; automated learning; statistical analysis of large datasets; discovery; and text and data mining methods
The evolution of bio- and cheminformatics associated with the development of specialized software and increasing computer power has produced a great interest in theoretical in silico methods applied in rational drug design. These techniques apply the concept that “similar molecules have similar biological properties” that has been exploited in Medicinal Chemistry for years to design new molecules with desirable pharmacological profiles. Ligand-based methods are not dependent on receptor structural data and take into account two- and three-dimensional molecular properties to assess similarity of new compounds with regard to the set of molecules with the biological property under study. Depending on the complexity of the calculation, there are different types of ligand-based methods, such as QSAR (Quantitative Structure-Activity Relationship) with 2D and 3D descriptors, CoMFA (Comparative Molecular Field Analysis) or pharmacophoric approaches. This work provides a description of a series of ligand-based models applied in the prediction of the inhibitory activity of monoamine oxidase (MAO) enzymes. The controlled regulation of the enzymes’ function through the use of MAO inhibitors is used as a treatment in many psychiatric and neurological disorders, such as depression, anxiety, Alzheimer’s and Parkinson’s disease. For this reason, multiple scaffolds, such as substituted coumarins, indolylmethylamine or pyridazine derivatives were synthesized and assayed toward MAO-A and MAO-B inhibition. Our intention is to focus on the description of ligand-based models to provide new insights in the relationship between the MAO inhibitory activity and the molecular structure of the different inhibitors, and further study enzyme selectivity and possible mechanisms of action.
Alzheimer’s; CoMFA; Ligand-based models; MAO; Molecular Descriptors; Parkinson’s; Pharmacophore; QSAR
Discovery of new adverse drug events (ADEs) in the post-approval period is an important goal of the health system. Data mining methods that can transform data into meaningful knowledge to inform patient safety have proven to be essential. New opportunities have emerged to harness data sources that have not been used within the traditional framework. This article provides an overview of recent methodological innovations and data sources used in support of ADE discovery and analysis.
Pharmacovigilance; Adverse Drug Events; Data Mining
Developing electronic health record (EHR) phenotyping algorithms involves generating queries that run across the EHR data repository. Algorithms are commonly assessed within demonstration studies. There remains, however, little emphasis on assessing the precision and accuracy of measurement methods during the evaluation process. Depending on the complexity of an algorithm, interim refinements may be required to improve measurement methods. Therefore, we develop an evaluation framework that incorporates both measurement and demonstration studies. We evaluate a baseline EHR phenotyping algorithm for drug-induced liver injury (DILI) developed in collaboration with electronic Medical Records Genomics (eMERGE) network participants. We conduct a measurement study and report qualitative (i.e., perceptions of evaluation approach effectiveness) and quantitative (i.e., inter-rater reliability) measures. We also conduct a demonstration study and report qualitative (i.e., appropriateness of results) and quantitative (i.e., positive predictive value) measures. Given results from the measurement study, our evaluation approach underwent multiple changes including the addition of laboratory value visualization and an expanded review of clinical notes. Results from the demonstration study informed changes to our algorithm. For example, given the goal of eMERGE to identify patients who may have a genetic susceptibility to DILI, we excluded overdose patients.
Drug-drug interactions (DDIs) constitute an important problem in postmarketing pharmacovigilance and in the development of new drugs. The effectiveness or toxicity of a medication could be affected by the co-administration of other drugs that share pharmacokinetic or pharmacodynamic pathways. For this reason, a great effort is being made to develop new methodologies to detect and assess DDIs. In this article, we present a novel method based on drug interaction profile fingerprints (IPFs) with successful application to DDI detection. IPFs were generated based on the DrugBank database, which provided 9,454 well-established DDIs as a primary source of interaction data. The model uses IPFs to measure the similarity of pairs of drugs and generates new putative DDIs from the non-intersecting interactions of a pair. We described as part of our analysis the pharmacological and biological effects associated with the putative interactions; for example, the interaction between haloperidol and dicyclomine can cause increased risk of psychosis and tardive dyskinesia. First, we evaluated the method through hold-out validation and then by using four independent test sets that did not overlap with DrugBank. Precision for the test sets ranged from 0.4–0.5 with more than two fold enrichment factor enhancement. In conclusion, we demonstrated the usefulness of the method in pharmacovigilance as a DDI predictor, and created a dataset of potential DDIs, highlighting the etiology or pharmacological effect of the DDI, and providing an exploratory tool to facilitate decision support in DDI detection and patient safety.
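The step of generating putative DDIs from the non-intersecting interactions of a similar pair can be sketched as below. This is a minimal sketch under stated assumptions: each interaction profile is represented as a set of known interaction partners, the Tanimoto coefficient is an assumed similarity measure, and the drug names and threshold are hypothetical.

```python
from itertools import combinations


def putative_ddis(profiles, threshold):
    """Propose new DDIs from pairs of drugs with similar interaction profiles.

    profiles: dict mapping drug -> set of drugs it is known to interact with
    Returns {(drug_1, drug_2): similarity of the pair that suggested it}.
    """
    proposals = {}
    for x, y in combinations(sorted(profiles), 2):
        px, py = profiles[x], profiles[y]
        union = px | py
        sim = len(px & py) / len(union) if union else 0.0
        if sim < threshold:
            continue
        # Each drug inherits the partners found only in the other's profile.
        for drug, own, other in ((x, px, py), (y, py, px)):
            for partner in other - own - {drug}:
                pair = tuple(sorted((drug, partner)))
                proposals[pair] = max(sim, proposals.get(pair, 0.0))
    return proposals


# Hypothetical interaction profiles.
ipf = {
    "drug_1": {"a", "b", "c"},
    "drug_2": {"a", "b", "d"},
    "drug_3": {"z"},
}
print(putative_ddis(ipf, threshold=0.5))
```

Here `drug_1` and `drug_2` share two of four partners, so each inherits the other's unshared partner as a candidate interaction, while `drug_3` contributes nothing.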
Adverse drug events (ADE) cause considerable harm to patients, and consequently their detection is critical for patient safety. The US Food and Drug Administration maintains an adverse event reporting system (AERS) to facilitate the detection of ADE in drugs. Various data mining approaches have been developed that use AERS to detect signals identifying associations between drugs and ADE. The signals must then be monitored further by domain experts, which is a time-consuming task.
To develop a new methodology that combines existing data mining algorithms with chemical information by analysis of molecular fingerprints to enhance initial ADE signals generated from AERS, and to provide a decision support mechanism to facilitate the identification of novel adverse events.
The method achieved a significant improvement in precision in identifying known ADE, and a more than twofold signal enhancement when applied to the ADE rhabdomyolysis. The simplicity of the method assists in highlighting the etiology of the ADE by identifying structurally similar drugs. A set of drugs with strong evidence from both AERS and molecular fingerprint-based modeling is constructed for further analysis.
The results demonstrate that the proposed methodology could be used as a pharmacovigilance decision support tool to facilitate ADE detection.
Adverse drug event; AERS; FDA; molecular fingerprints; rhabdomyolysis; spontaneous reporting system; structure similarity
Abbreviations are widely used in clinical notes and are often ambiguous. Word sense disambiguation (WSD) for clinical abbreviations therefore is a critical task for many clinical natural language processing (NLP) systems. Supervised machine learning based WSD methods are known for their high performance. However, it is time consuming and costly to construct annotated samples for supervised WSD approaches and sense frequency information is often ignored by these methods. In this study, we proposed a profile-based method that used dictated discharge summaries as an external source to automatically build sense profiles and applied them to disambiguate abbreviations in hospital admission notes via the vector space model. Our evaluation using a test set containing 2,386 annotated instances from 13 ambiguous abbreviations in admission notes showed that the profile-based method performed better than two baseline methods and achieved a best average precision of 0.792. Furthermore, we developed a strategy to combine sense frequency information estimated from a clustering analysis with the profile-based method. Our results showed that the combined approach largely improved the performance and achieved a highest precision of 0.875 on the same test set, indicating that integrating sense frequency information with local context is effective for clinical abbreviation disambiguation.
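The vector space step above can be sketched as follows: each sense gets a profile vector built from example contexts, and an ambiguous instance is assigned the sense whose profile is closest by cosine similarity. This is a minimal sketch. Raw term counts, whitespace tokenization, and the example sentences are assumptions, and the sketch omits the sense frequency component.

```python
import math
from collections import Counter


def build_profile(contexts):
    """Sum bag-of-words counts over all example contexts for one sense."""
    profile = Counter()
    for text in contexts:
        profile.update(text.lower().split())
    return profile


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context, profiles):
    """Return the sense whose profile best matches the context."""
    vector = Counter(context.lower().split())
    return max(profiles, key=lambda sense: cosine(vector, profiles[sense]))


# Hypothetical sense profiles for the abbreviation "pt".
profiles = {
    "patient": build_profile([
        "the patient was admitted with chest pain",
        "patient denies fever",
    ]),
    "physical therapy": build_profile([
        "physical therapy for gait training",
        "physical therapy exercises after surgery",
    ]),
}
print(disambiguate("pt referred for gait training exercises", profiles))
```

Because the profiles are built from text in which the sense is already known, no manually annotated instances of the abbreviation are needed to train this step.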
Biomedical natural language processing (BioNLP) is a useful technique that unlocks valuable information stored in textual data for practice and/or research. Syntactic parsing is a critical component of BioNLP applications that rely on correctly determining the sentence and phrase structure of free text. In addition to dealing with the vast amount of domain-specific terms, a robust biomedical parser needs to model the semantic grammar to obtain viable syntactic structures. With either a rule-based or corpus-based approach, the grammar engineering process requires substantial time and knowledge from experts, and does not always yield a semantically transferable grammar. To reduce the human effort and to promote semantic transferability, we propose an automated method for deriving a probabilistic grammar based on a training corpus consisting of concept strings and semantic classes from the Unified Medical Language System (UMLS), a comprehensive terminology resource widely used by the community. The grammar is designed to specify noun phrases only due to the nominal nature of the majority of biomedical terminological concepts. Evaluated on manually parsed clinical notes, the derived grammar achieved a recall of 0.644, precision of 0.737, and average cross-bracketing of 0.61, which demonstrated better performance than a control grammar with the semantic information removed. Error analysis revealed shortcomings that could be addressed to improve performance. The results indicated the feasibility of an approach which automatically incorporates terminology semantics in the building of an operational grammar. Although the current performance of the unsupervised solution does not adequately replace manual engineering, we believe once the performance issues are addressed, it could serve as an aide in a semi-supervised solution.
Natural language processing; Biomedical terminology; Semantic grammar; Probabilistic parsing
Detection and assessment of adverse drug events (ADEs) is at the center of pharmacovigilance. Data mining of systems such as FDA’s Adverse Event Reporting System (AERS) and, more recently, Electronic Health Records (EHRs) can aid in the automatic detection and analysis of ADEs. Although different data mining approaches have been shown to be valuable, it is still crucial to improve the quality of the generated signals.
To leverage structural similarity by developing molecular fingerprint-based models (MFBMs) to strengthen ADE signals generated from EHR data.
A reference standard of drugs known to be causally associated with the adverse event pancreatitis was used to create a MFBM. Electronic Health Records (EHRs) from the New York Presbyterian Hospital were mined to generate structured data. Disproportionality Analysis (DPA) was applied to the data, and 278 possible signals related to the ADE pancreatitis were detected. Candidate drugs associated with these signals were then assessed using the MFBM to find the most promising candidates based on structural similarity.
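Disproportionality analysis compares how often an event is recorded with a drug against how often it is recorded without it. The text above does not say which measure was used, so the following minimal sketch shows two common ones, the proportional reporting ratio (PRR) and the reporting odds ratio (ROR), computed from a 2x2 table with hypothetical counts.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table.

    a: drug and event      b: drug, no event
    c: no drug, event      d: no drug, no event
    """
    return (a / (a + b)) / (c / (c + d))


def ror(a, b, c, d):
    """Reporting odds ratio from the same 2x2 table."""
    return (a * d) / (b * c)


# Hypothetical counts for one drug and the event of interest.
a, b, c, d = 20, 80, 100, 9800
print(round(prr(a, b, c, d), 2), round(ror(a, b, c, d), 2))
```

Values well above 1 indicate that the event is recorded disproportionately often with the drug; a signal threshold on such a measure is what produces the list of candidates that a second model can then prioritize.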
The use of MFBM as a means to strengthen or prioritize signals generated from the EHR significantly improved the detection accuracy of ADEs related to pancreatitis. MFBM also highlights the etiology of the ADE by identifying structurally similar drugs, which could follow a similar mechanism of action.
The method proposed in this paper provides evidence of being a promising adjunct to existing automated ADE detection and analysis approaches.
In this paper we present a new pharmacovigilance data mining technique based on the biclustering paradigm, which is designed to identify drug groups that share a common set of adverse events in FDA’s spontaneous reporting system. A taxonomy of biclusters is developed, revealing that a significant number of bona fide adverse drug event (ADE) biclusters are identified. Statistical tests indicate that it is extremely unlikely that the discovered bicluster structures as well as their content arose by chance. Some of the biclusters classified as indeterminate provide support for previously unrecognized and potentially novel ADEs. In addition, we demonstrate the importance of the proposed methodology to several important aspects of pharmacovigilance, such as providing insight into the etiology of ADEs, facilitating the identification of novel ADEs, suggesting methods and rationale for aggregating terminologies, highlighting areas of focus, and serving as a data exploratory tool.
Pharmacovigilance; Adverse Drug Events; Biclustering; Clustering; FDA Adverse Event Reporting System
Knowledge acquisition of relations between biomedical entities is critical for many automated biomedical applications, including pharmacovigilance and decision support. Automated acquisition of statistical associations from biomedical and clinical documents has shown some promise. However, acquisition of clinically meaningful relations (i.e. specific associations) remains challenging because textual information is noisy and co-occurrence does not typically determine specific relations. In this work, we focus on acquisition of two types of relations from clinical reports: disease-manifestation related symptom (MRS) and drug-adverse drug event (ADE), and explore the use of filtering by sections of the report to improve performance. Evaluation indicated that applying the filters improved recall (disease-MRS: from 0.85 to 0.90; drug-ADE: from 0.43 to 0.75) and precision (disease-MRS: from 0.82 to 0.92; drug-ADE: from 0.16 to 0.31). This preliminary study demonstrates that selecting information in narrative electronic reports based on the section improves the detection of disease-MRS and drug-ADE types of relations. Further investigation of complementary methods, such as more sophisticated statistical methods, more complex temporal models and use of information from other knowledge sources, is needed.
knowledge acquisition; natural language processing (NLP); text mining; pharmacovigilance; decision support; electronic health record (EHR)
Electronic health record (EHR) systems offer an exceptional opportunity for studying many diseases and their associated medical conditions within a population. The increasing number of clinical record entries that have become available electronically provides access to rich, large sets of patients' longitudinal medical information. By integrating and comparing relations found in the EHRs with those already reported in the literature, we are able to verify existing and to identify rare or novel associations. Of particular interest is the identification of rare disease co-morbidities, where the small numbers of diagnosed patients make robust statistical analysis difficult. Here, we introduce ADAMS, an Application for Discovering Disease Associations using Multiple Sources, which contains various statistical and language processing operations. We apply ADAMS to the New York-Presbyterian Hospital's EHR to combine the information from the relational diagnosis tables and textual discharge summaries with those from PubMed and Wikipedia in order to investigate the co-morbidities of the rare diseases Kaposi sarcoma, toxoplasmosis, and Kawasaki disease. In addition to finding well-known characteristics of diseases, ADAMS can identify rare or previously unreported associations. In particular, we report a statistically significant association between Kawasaki disease and diagnosis of autistic disorder.
Vaccination; immunization; human influenza; workplace; disease outbreaks; whooping cough; diphtheria; tetanus; vaccine; letter
Adverse drug events (ADEs) create a serious problem causing substantial harm to patients. An executable standardized knowledgebase of drug-ADE relations which is publicly available would be valuable so that it could be used for ADE detection. The literature is an important source that could be used to generate a knowledgebase of drug-ADE pairs. In this paper, we report on a method that automatically determines whether a specific adverse event (AE) is caused by a specific drug based on the content of PubMed citations. A drug-ADE classification method was initially developed to detect neutropenia based on a pre-selected set of drugs. This method was then applied to a different set of 76 drugs to determine if they caused neutropenia. For further proof of concept this method was applied to 48 drugs to determine whether they caused another AE, myocardial infarction. Results showed that AUROC was 0.93 and 0.86 respectively.
Knowledge of medication indications is significant for automatic applications aimed at improving patient safety, such as computerized physician order entry and clinical decision support systems. The Electronic Health Record (EHR) contains pertinent information related to patient safety such as information related to appropriate prescribing. However, the reasons for medication prescriptions are usually not explicitly documented in the patient record. This paper describes a method that determines the reasons for medication uses based on information occurring in outpatient notes. The method utilizes drug-indication knowledge that we acquired, and natural language processing. Evaluation showed the method obtained a sensitivity of 62.8%, specificity of 93.9%, precision of 90% and F-measure of 73.9%. This pilot study demonstrated that linking external drug indication knowledge to the EHR for determining the reasons for medication use was promising, but also revealed some challenges. Future work will focus on increasing the accuracy and coverage of the indication knowledge and evaluating its performance using a much larger set of drugs frequently used in the outpatient population.
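The F-measure reported above is the harmonic mean of precision and recall (sensitivity). The following minimal sketch recomputes it from the reported sensitivity of 62.8% and precision of 90%; the function name is introduced here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract above.
reported_precision = 0.90
reported_sensitivity = 0.628
f1 = f_measure(reported_precision, reported_sensitivity)
print(round(f1, 4))
```

The result is close to the reported 73.9%; the small difference is within what rounding of the published inputs would explain. Specificity does not enter the F-measure, which is why it is reported separately.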
Multi-item adverse drug event (ADE) associations are associations relating multiple drugs to possibly multiple adverse events. The current standard in pharmacovigilance is bivariate association analysis, where each single drug-adverse effect combination is studied separately. The importance and difficulty of detecting multi-item ADE associations have been noted in several prominent pharmacovigilance studies. In this paper we examine the application of a well-established data mining method known as association rule mining, which we tailored to the above problem, and demonstrate its value. The method was applied to the FDA’s spontaneous adverse event reporting system (AERS) with minimal restrictions and expectations on its output, an experiment that has not been previously done on the scale and generality proposed in this work.
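Association rule mining scores a rule such as {drug A, drug B} → {event X} by its support and confidence. The following minimal sketch computes both over a handful of hypothetical reports; the item naming scheme (`drug:` and `ae:` prefixes) and the reports are assumptions, and the sketch does not include the tailoring described in the paper.

```python
def support(itemset, reports):
    """Fraction of reports that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= report for report in reports) / len(reports)


def confidence(antecedent, consequent, reports):
    """Estimated P(consequent | antecedent) over the reports."""
    base = support(antecedent, reports)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), reports) / base


# Hypothetical spontaneous reports, each a set of drugs and adverse events.
reports = [
    {"drug:A", "drug:B", "ae:X"},
    {"drug:A", "drug:B", "ae:X"},
    {"drug:A", "ae:Y"},
    {"drug:B", "ae:Y"},
    {"drug:A", "drug:B", "ae:Y"},
]

drugs = {"drug:A", "drug:B"}
events = {"ae:X"}
print(support(drugs | events, reports), confidence(drugs, events, reports))
```

In this toy data, event X appears only when both drugs are reported together, which is the kind of pattern a bivariate analysis of single drug-event pairs would dilute.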
Based on a set of 162,744 reports of suspected ADEs reported to AERS and published in the year 2008, our method identified 1167 multi-item ADE associations. A taxonomy that characterizes the associations was developed based on a representative sample. A significant number (67% of the total) of potential multi-item ADE associations identified were characterized and clinically validated by a domain expert as previously recognized ADE associations. Several potentially novel ADEs were also identified. A smaller proportion (4%) of associations were characterized and validated as known drug-drug interactions.
Our findings demonstrate that multi-item ADEs are present and can be extracted from the FDA’s adverse event reporting system using our methodology, suggesting that our method is a valid approach for the initial identification of multi-item ADEs. The study also revealed several limitations and challenges that can be attributed to both the method and the quality of the data.
Information visualization techniques, which take advantage of the bandwidth of human vision, are powerful tools for organizing and analyzing large amounts of data. In the postgenomic era, information visualization tools are indispensable for biomedical research. This paper presents an overview of current applications of information visualization techniques in bioinformatics for visualizing different types of biological data, such as data from genomics, proteomics, expression profiling, and structural studies. Finally, we discuss the challenges of information visualization in bioinformatics related to dealing with more complex biological information in the emerging fields of systems biology and systems medicine.
Information visualization; Bioinformatics
Natural language processing (NLP) is a high-throughput technology because it can process vast quantities of text within a reasonable time period. It has the potential to substantially facilitate biomedical research by extracting, linking, and organizing massive amounts of information that occur in biomedical journal articles as well as in textual fields of biological databases. Until recently, much of the work in biological NLP and text mining has revolved around recognizing the occurrence of biomolecular entities in articles and extracting particular relationships among the entities. Now, researchers have recognized a need to link the extracted information to ontologies or knowledge bases, which is a more difficult task. One such knowledge base is Gene Ontology annotations (GOA), which significantly facilitates semantic computations over the functions, cellular components, and processes of genes. For multicellular organisms, these annotations can be refined with phenotypic context, such as the cell type, tissue, and organ, because establishing the phenotypic contexts in which a gene is expressed is a crucial step for understanding the development and the molecular underpinning of the pathophysiology of diseases. In this paper, we propose a system, PhenoGO, which automatically augments annotations in GOA with additional context. PhenoGO utilizes an existing NLP system called BioMedLEE and an existing knowledge-based phenotype organizer system (PhenOS), in conjunction with MeSH indexing and established biomedical ontologies. More specifically, PhenoGO adds phenotypic contextual information to existing associations between gene products and GO terms as specified in GOA. The system also maps the context to identifiers that are associated with different biomedical ontologies, including the UMLS, Cell Ontology, Mouse Anatomy, NCBI taxonomy, GO, and Mammalian Phenotype Ontology.
In addition, PhenoGO was evaluated for coding of anatomical and cellular information and assigning the coded phenotypes to the correct GOA; results obtained show that PhenoGO has a precision of 91% and recall of 92%, demonstrating that the PhenoGO NLP system can accurately encode a large number of anatomical and cellular ontologies to GO annotations. The PhenoGO Database may be accessed at the following URL: http://www.phenoGO.org
Visualizing relations among biological information to facilitate understanding is crucial to biological research in the post-genomic era. Although different systems have been developed to view gene-phenotype relations for specific databases, very few have been designed specifically as a general, flexible tool for visualizing multidimensional genotypic and phenotypic information together. Our goal is to develop a method for visualizing multidimensional genotypic and phenotypic information and a model that unifies different biological databases in order to present the integrated knowledge using a uniform interface.
We developed a novel, flexible and generalizable visualization tool, called PhenoGenesviewer (PGviewer), which in this paper was used to display gene-phenotype relations from a human-curated database (OMIM) and from an automatic method using a Natural Language Processing tool called BioMedLEE. Data obtained from multiple databases were first integrated into a uniform structure and then organized by PGviewer. PGviewer provides a flexible query interface that allows dynamic selection and ordering of any desired dimension in the databases. Based on users’ queries, results can be visualized using hierarchical expandable trees that present views specified by users according to their research interests. We believe that this method, which allows users to dynamically organize and visualize multiple dimensions, is a potentially powerful and promising tool that should substantially facilitate biological research.
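The dynamic selection and ordering of dimensions described above can be sketched as a recursive grouping of uniform records into a nested tree. This is only an illustration of the general idea, not PGviewer's implementation; the function `build_tree`, the field names, and the placeholder records are all hypothetical.

```python
def build_tree(records, dimension_order):
    """Nest records into dicts keyed by each dimension in the requested order."""
    if not dimension_order:
        return list(records)  # leaves hold the matching records
    first, rest = dimension_order[0], dimension_order[1:]
    groups = {}
    for record in records:
        groups.setdefault(record[first], []).append(record)
    # Sort keys so that sibling nodes appear in a stable order.
    return {key: build_tree(group, rest) for key, group in sorted(groups.items())}


# Hypothetical records already integrated into one uniform structure.
records = [
    {"gene": "GENE1", "phenotype": "phenotype X", "source": "curated"},
    {"gene": "GENE1", "phenotype": "phenotype Y", "source": "text-mined"},
    {"gene": "GENE2", "phenotype": "phenotype X", "source": "text-mined"},
]

# The same data viewed with two different user-chosen dimension orders.
by_gene = build_tree(records, ["gene", "phenotype"])
by_phenotype = build_tree(records, ["phenotype", "gene"])

print(list(by_gene))                      # top-level nodes when grouping by gene first
print(list(by_phenotype["phenotype X"]))  # genes under one phenotype node
```

Because the tree is rebuilt from the flat records for each query, changing the order of dimensions only changes the argument passed to `build_tree`; an expandable view would then render one level of the nested dictionary at a time.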