The evolution of bio- and cheminformatics associated with the development of specialized software and increasing computer power has produced a great interest in theoretical in silico methods applied in drug rational design. These techniques apply the concept that “similar molecules have similar biological properties” that has been exploited in Medicinal Chemistry for years to design new molecules with desirable pharmacological profiles. Ligand-based methods are not dependent on receptor structural data and take into account two and three-dimensional molecular properties to assess similarity of new compounds in regards to the set of molecules with the biological property under study. Depending on the complexity of the calculation, there are different types of ligand-based methods, such as QSAR (Quantitative Structure-Activity Relationship) with 2D and 3D descriptors, CoMFA (Comparative Molecular Field Analysis) or pharmacophoric approaches. This work provides a description of a series of ligand-based models applied in the prediction of the inhibitory activity of monoamine oxidase (MAO) enzymes. The controlled regulation of the enzymes’ function through the use of MAO inhibitors is used as a treatment in many psychiatric and neurological disorders, such as depression, anxiety, Alzheimer’s and Parkinson’s disease. For this reason, multiple scaffolds, such as substituted coumarins, indolylmethylamine or pyridazine derivatives were synthesized and assayed toward MAO-A and MAO-B inhibition. Our intention is to focus on the description of ligand-based models to provide new insights in the relationship between the MAO inhibitory activity and the molecular structure of the different inhibitors, and further study enzyme selectivity and possible mechanisms of action.
Alzheimer’s; CoMFA; Ligand-based models; MAO; Molecular Descriptors; Parkinson’s; Pharmacophore; QSAR
Discovery of new adverse drug events (ADEs) in the post-approval period is an important goal of the health system. Data mining methods that can transform data into meaningful knowledge to inform patient safety have proven to be essential. New opportunities have emerged to harness data sources that have not been used within the traditional framework. This article provides an overview of recent methodological innovations and data sources used in support of ADE discovery and analysis.
Pharmacovigilance; Adverse Drug Events; Data Mining
Drug-drug interactions (DDIs) constitute an important problem in postmarketing pharmacovigilance and in the development of new drugs. The effectiveness or toxicity of a medication could be affected by the co-administration of other drugs that share pharmacokinetic or pharmacodynamic pathways. For this reason, a great effort is being made to develop new methodologies to detect and assess DDIs. In this article, we present a novel method based on drug interaction profile fingerprints (IPFs) with successful application to DDI detection. IPFs were generated based on the DrugBank database, which provided 9,454 well-established DDIs as a primary source of interaction data. The model uses IPFs to measure the similarity of pairs of drugs and generates new putative DDIs from the non-intersecting interactions of a pair. We described as part of our analysis the pharmacological and biological effects associated with the putative interactions; for example, the interaction between haloperidol and dicyclomine can cause increased risk of psychosis and tardive dyskinesia. First, we evaluated the method through hold-out validation and then by using four independent test sets that did not overlap with DrugBank. Precision for the test sets ranged from 0.4–0.5 with more than two fold enrichment factor enhancement. In conclusion, we demonstrated the usefulness of the method in pharmacovigilance as a DDI predictor, and created a dataset of potential DDIs, highlighting the etiology or pharmacological effect of the DDI, and providing an exploratory tool to facilitate decision support in DDI detection and patient safety.
Adverse drug events (ADE) cause considerable harm to patients, and consequently their detection is critical for patient safety. The US Food and Drug Administration maintains an adverse event reporting system (AERS) to facilitate the detection of ADE in drugs. Various data mining approaches have been developed that use AERS to detect signals identifying associations between drugs and ADE. The signals must then be monitored further by domain experts, which is a time-consuming task.
To develop a new methodology that combines existing data mining algorithms with chemical information by analysis of molecular fingerprints to enhance initial ADE signals generated from AERS, and to provide a decision support mechanism to facilitate the identification of novel adverse events.
The method achieved a significant improvement in precision in identifying known ADE, and a more than twofold signal enhancement when applied to the ADE rhabdomyolysis. The simplicity of the method assists in highlighting the etiology of the ADE by identifying structurally similar drugs. A set of drugs with strong evidence from both AERS and molecular fingerprint-based modeling is constructed for further analysis.
The results demonstrate that the proposed methodology could be used as a pharmacovigilance decision support tool to facilitate ADE detection.
Adverse drug event; AERS; FDA; molecular fingerprints; rhabdomyolysis; spontaneous reporting system; structure similarity
Abbreviations are widely used in clinical notes and are often ambiguous. Word sense disambiguation (WSD) for clinical abbreviations therefore is a critical task for many clinical natural language processing (NLP) systems. Supervised machine learning based WSD methods are known for their high performance. However, it is time consuming and costly to construct annotated samples for supervised WSD approaches and sense frequency information is often ignored by these methods. In this study, we proposed a profile-based method that used dictated discharge summaries as an external source to automatically build sense profiles and applied them to disambiguate abbreviations in hospital admission notes via the vector space model. Our evaluation using a test set containing 2,386 annotated instances from 13 ambiguous abbreviations in admission notes showed that the profile-based method performed better than two baseline methods and achieved a best average precision of 0.792. Furthermore, we developed a strategy to combine sense frequency information estimated from a clustering analysis with the profile-based method. Our results showed that the combined approach largely improved the performance and achieved a highest precision of 0.875 on the same test set, indicating that integrating sense frequency information with local context is effective for clinical abbreviation disambiguation.
Biomedical natural language processing (BioNLP) is a useful technique that unlocks valuable information stored in textual data for practice and/or research. Syntactic parsing is a critical component of BioNLP applications that rely on correctly determining the sentence and phrase structure of free text. In addition to dealing with the vast amount of domain-specific terms, a robust biomedical parser needs to model the semantic grammar to obtain viable syntactic structures. With either a rule-based or corpus-based approach, the grammar engineering process requires substantial time and knowledge from experts, and does not always yield a semantically transferable grammar. To reduce the human effort and to promote semantic transferability, we propose an automated method for deriving a probabilistic grammar based on a training corpus consisting of concept strings and semantic classes from the Unified Medical Language System (UMLS), a comprehensive terminology resource widely used by the community. The grammar is designed to specify noun phrases only due to the nominal nature of the majority of biomedical terminological concepts. Evaluated on manually parsed clinical notes, the derived grammar achieved a recall of 0.644, precision of 0.737, and average cross-bracketing of 0.61, which demonstrated better performance than a control grammar with the semantic information removed. Error analysis revealed shortcomings that could be addressed to improve performance. The results indicated the feasibility of an approach which automatically incorporates terminology semantics in the building of an operational grammar. Although the current performance of the unsupervised solution does not adequately replace manual engineering, we believe once the performance issues are addressed, it could serve as an aide in a semi-supervised solution.
Natural language processing; Biomedical terminology; Semantic grammar; Probabilistic parsing
Adverse drug events (ADEs) detection and assessment is at the center of pharmacovigilance. Data mining of systems, such as FDA’s Adverse Event Reporting System (AERS) and more recently, Electronic Health Records (EHRs), can aid in the automatic detection and analysis of ADEs. Although different data mining approaches have been shown to be valuable, it is still crucial to improve the quality of the generated signals.
To leverage structural similarity by developing molecular fingerprint-based models (MFBMs) to strengthen ADE signals generated from EHR data.
A reference standard of drugs known to be causally associated with the adverse event pancreatitis was used to create a MFBM. Electronic Health Records (EHRs) from the New York Presbyterian Hospital were mined to generate structured data. Disproportionality Analysis (DPA) was applied to the data, and 278 possible signals related to the ADE pancreatitis were detected. Candidate drugs associated with these signals were then assessed using the MFBM to find the most promising candidates based on structural similarity.
The use of MFBM as a means to strengthen or prioritize signals generated from the EHR significantly improved the detection accuracy of ADEs related to pancreatitis. MFBM also highlights the etiology of the ADE by identifying structurally similar drugs, which could follow a similar mechanism of action.
The method proposed in this paper provides evidence of being a promising adjunct to existing automated ADE detection and analysis approaches.
In this paper we present a new pharmacovigilance data mining technique based on the biclustering paradigm, which is designed to identify drug groups that share a common set of adverse events in FDA’s spontaneous reporting system. A taxonomy of biclusters is developed, revealing that a significant number of bone fide adverse drug event (ADE) biclusters are identified. Statistical tests indicate that it is extremely unlikely that the discovered bicluster structures as well as their content arose by chance. Some of the biclusters classified as indeterminate provide support for previously unrecognized and potentially novel ADEs. In addition, we demonstrate the importance of the proposed methodology to several important aspects of pharmacovigilance such as: providing insight into the etiology of ADEs, facilitating the identification of novel ADEs, suggesting methods and rational for aggregating terminologies, highlighting areas of focus, and as a data exploratory tool.
Pharmacovigilance; Adverse Drug Events; Biclustering; Clustering; FDA Adverse Event Reporting System
Knowledge acquisition of relations between biomedical entities is critical for many automated biomedical applications, including pharmacovigilance and decision support. Automated acquisition of statistical associations from biomedical and clinical documents has shown some promise. However, acquisition of clinically meaningful relations (i.e. specific associations) remains challenging because textual information is noisy and co-occurrence does not typically determine specific relations. In this work, we focus on acquisition of two types of relations from clinical reports: disease-manifestation related symptom (MRS) and drug-adverse drug event (ADE), and explore the use of filtering by sections of the report to improve performance. Evaluation indicated that applying the filters improved recall (disease-MRS: from 0.85 to 0.90; drug-ADE: from 0.43 to 0.75) and precision (disease-MRS: from 0.82 to 0.92; drug-ADE: from 0.16 to 0.31). This preliminary study demonstrates that selecting information in narrative electronic reports based on the section improves the detection of disease-MRS and drug-ADE types of relations. Further investigation of complementary methods, such as more sophisticated statistical methods, more complex temporal models and use of information from other knowledge sources, is needed.
knowledge acquisition; natural language processing (NLP); text mining; pharmacovigilance; decision support; electronic health record (EHR)
Electronic health record (EHR) systems offer an exceptional opportunity for studying many diseases and their associated medical conditions within a population. The increasing number of clinical record entries that have become available electronically provides access to rich, large sets of patients' longitudinal medical information. By integrating and comparing relations found in the EHRs with those already reported in the literature, we are able to verify existing and to identify rare or novel associations. Of particular interest is the identification of rare disease co-morbidities, where the small numbers of diagnosed patients make robust statistical analysis difficult. Here, we introduce ADAMS, an Application for Discovering Disease Associations using Multiple Sources, which contains various statistical and language processing operations. We apply ADAMS to the New York-Presbyterian Hospital's EHR to combine the information from the relational diagnosis tables and textual discharge summaries with those from PubMed and Wikipedia in order to investigate the co-morbidities of the rare diseases Kaposi sarcoma, toxoplasmosis, and Kawasaki disease. In addition to finding well-known characteristics of diseases, ADAMS can identify rare or previously unreported associations. In particular, we report a statistically significant association between Kawasaki disease and diagnosis of autistic disorder.
Vaccination; immunization; human influenza; workplace; disease outbreaks; whooping cough; diphtheria; tetanus; vaccine; letter
Adverse drug events (ADEs) create a serious problem causing substantial harm to patients. An executable standardized knowledgebase of drug-ADE relations which is publicly available would be valuable so that it could be used for ADE detection. The literature is an important source that could be used to generate a knowledgebase of drug-ADE pairs. In this paper, we report on a method that automatically determines whether a specific adverse event (AE) is caused by a specific drug based on the content of PubMed citations. A drug-ADE classification method was initially developed to detect neutropenia based on a pre-selected set of drugs. This method was then applied to a different set of 76 drugs to determine if they caused neutropenia. For further proof of concept this method was applied to 48 drugs to determine whether they caused another AE, myocardial infarction. Results showed that AUROC was 0.93 and 0.86 respectively.
Knowledge of medication indications is significant for automatic applications aimed at improving patient safety, such as computerized physician order entry and clinical decision support systems. The Electronic Health Record (EHR) contains pertinent information related to patient safety such as information related to appropriate prescribing. However, the reasons for medication prescriptions are usually not explicitly documented in the patient record. This paper describes a method that determines the reasons for medication uses based on information occurring in outpatient notes. The method utilizes drug-indication knowledge that we acquired, and natural language processing. Evaluation showed the method obtained a sensitivity of 62.8%, specificity of 93.9%, precision of 90% and F-measure of 73.9%. This pilot study demonstrated that linking external drug indication knowledge to the EHR for determining the reasons for medication use was promising, but also revealed some challenges. Future work will focus on increasing the accuracy and coverage of the indication knowledge and evaluating its performance using a much larger set of drugs frequently used in the outpatient population.
Multi-item adverse drug event (ADE) associations are associations relating multiple drugs to possibly multiple adverse events. The current standard in pharmacovigilance is bivariate association analysis, where each single drug-adverse effect combination is studied separately. The importance and difficulty in the detection of multi-item ADE associations was noted in several prominent pharmacovigilance studies. In this paper we examine the application of a well established data mining method known as association rule mining, which we tailored to the above problem, and demonstrate its value. The method was applied to the FDAs spontaneous adverse event reporting system (AERS) with minimal restrictions and expectations on its output, an experiment that has not been previously done on the scale and generality proposed in this work.
Based on a set of 162,744 reports of suspected ADEs reported to AERS and published in the year 2008, our method identified 1167 multi-item ADE associations. A taxonomy that characterizes the associations was developed based on a representative sample. A significant number (67% of the total) of potential multi-item ADE associations identified were characterized and clinically validated by a domain expert as previously recognized ADE associations. Several potentially novel ADEs were also identified. A smaller proportion (4%) of associations were characterized and validated as known drug-drug interactions.
Our findings demonstrate that multi-item ADEs are present and can be extracted from the FDA’s adverse effect reporting system using our methodology, suggesting that our method is a valid approach for the initial identification of multi-item ADEs. The study also revealed several limitations and challenges that can be attributed to both the method and quality of data.
Information visualization techniques, which take advantage of the bandwidth of human vision, are powerful tools for organizing and analyzing a large amount of data. In the postgenomic era, information visualization tools are indispensable for biomedical research. This paper aims to present an overview of current applications of information visualization techniques in bioinformatics for visualizing different types of biological data, such as from genomics, proteomics, expression profiling and structural studies. Finally, we discuss the challenges of information visualization in bioinformatics related to dealing with more complex biological information in the emerging fields of systems biology and systems medicine.
Information visualization; Bioinformatics
Natural language processing (NLP) is a high throughput technology because it can process vast quantities of text within a reasonable time period. It has the potential to substantially facilitate biomedical research by extracting, linking, and organizing massive amounts of information that occur in biomedical journal articles as well as in textual fields of biological databases. Until recently, much of the work in biological NLP and text mining has revolved around recognizing the occurrence of biomolecular entities in articles, and in extracting particular relationships among the entities. Now, researchers have recognized a need to link the extracted information to ontologies or knowledge bases, which is a more difficult task. One such knowledge base is Gene Ontology annotations (GOA), which significantly increases semantic computations over the function, cellular components and processes of genes. For multicellular organisms, these annotations can be refined with phenotypic context, such as the cell type, tissue, and organ because establishing phenotypic contexts in which a gene is expressed is a crucial step for understanding the development and the molecular underpinning of the pathophysiology of diseases. In this paper, we propose a system, PhenoGO, which automatically augments annotations in GOA with additional context. PhenoGO utilizes an existing NLP system, called BioMedLEE, an existing knowledge-based phenotype organizer system (PhenOS) in conjunction with MeSH indexing and established biomedical ontologies. More specifically, PhenoGO adds phenotypic contextual information to existing associations between gene products and GO terms as specified in GOA. The system also maps the context to identifiers that are associated with different biomedical ontologies, including the UMLS, Cell Ontology, Mouse Anatomy, NCBI taxonomy, GO, and Mammalian Phenotype Ontology. In addition, PhenoGO was evaluated for coding of anatomical and cellular information and assigning the coded phenotypes to the correct GOA; results obtained show that PhenoGO has a precision of 91% and recall of 92%, demonstrating that the PhenoGO NLP system can accurately encode a large number of anatomical and cellular ontologies to GO annotations. The PhenoGO Database may be accessed at the following URL: http://www.phenoGO.org
Visualizing relations among biological information to facilitate understanding is crucial to biological research during the post-genomic era. Although different systems have been developed to view gene-phenotype relations for specific databases, very few have been designed specifically as a general flexible tool for visualizing multidimensional genotypic and phenotypic information together. Our goal is to develop a method for visualizing multidimensional genotypic and phenotypic information and a model that unifies different biological databases in order to present the integrated knowledge using a uniform interface.
We developed a novel, flexible and generalizable visualization tool, called PhenoGenesviewer (PGviewer), which in this paper was used to display gene-phenotype relations from a human-curated database (OMIM) and from an automatic method using a Natural Language Processing tool called BioMedLEE. Data obtained from multiple databases were first integrated into a uniform structure and then organized by PGviewer. PGviewer provides a flexible query interface that allows dynamic selection and ordering of any desired dimension in the databases. Based on users’ queries, results can be visualized using hierarchical expandable trees that present views specified by users according to their research interests. We believe that this method, which allows users to dynamically organize and visualize multiple dimensions, is a potentially powerful and promising tool that should substantially facilitate biological research.
The study of protein-protein interactions is essential to define the molecular networks that contribute to maintain homeostasis of an organism’s body functions. Disruptions in protein interaction networks have been shown to result in diseases in both humans and animals. Monogenic diseases disrupting biochemical pathways such as hereditary coagulopathies (e.g. hemophilia), provided a deep insight in the biochemical pathways of acquired coagulopathies of complex diseases. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia. Similarly, more complex diseases such as different cancers have been shown to result from malfunctions of common proteins pathways. In order to discover, in high throughput, the molecular underpinnings of poorly characterized diseases, we present a statistical method to identify shared protein interaction network(s) between diseases. Integrating (i) a protein interaction network with (ii) disease to protein relationships derived from mining Gene Ontology annotations and the biomedical literature with natural language understanding (PhenoGO), we identified protein-protein interactions that were associated with pairs of diseases and calculated the statistical significance of the occurrence of interactions in the protein interaction knowledgebase. Significant correlations between diseases and shared protein networks were identified and evaluated in this study, demonstrating the high precision of the approach and correct non-trivial predictions, signifying the potential for discovery. In conclusion, we demonstrate that the associations between diseases are directly correlated to their underlying protein-protein interaction networks, possibly providing insight into the underlying molecular mechanisms of phenotypes and biological processes disrupted in related diseases.
Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or function genomics approaches.
We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. We have demonstrated using a 10-fold cross-validation that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, such method can generate highly accurate predictions in sparsely annotated portions of GO, in which previous algorithm failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than generally used cross-validations type of evaluations. By manually assessing a random sample of 100 predictions conducted in a historical roll-back evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003.
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt.
Natural language processing (NLP) techniques are increasingly being used in biology to automate the capture of new biological discoveries in text, which are being reported at a rapid rate. To facilitate the computational reuse and integration of information buried in unstructured text, we propose a schema that represents a comprehensive set of biological entities and relations as expressed in natural language. In addition, the schema connects different scales of biological information, and provides links from the textual information to existing ontologies, which are essential in biology for integration, organization, dissemination, and knowledge management of heterogeneous information. A comprehensive representation for otherwise heterogeneous datasets, such as the one proposed, are critical for advancing systems biology because they allow for acquisition and reuse of unprecedented volumes of diverse types of knowledge and information from text.
A novel representational schema, PGschema, was developed that enables translation of information in textual narratives to a well-defined data structure comprising genotypic and phenotypic concepts from established ontologies along with modifiers and relationships. Initial evaluation for coverage of a selected set of entities showed that 85% of the information could be represented. Moreover, PGschema can be realized automatically in an XML format by using natural language techniques to process the text.
The ability to adequately and efficiently integrate unstructured, heterogeneous datasets, which are incumbent to systems biology and medicine, is one of the primary limitations to their comprehensive analysis. Natural language processing (NLP) and biomedical ontologies are automated methods for capturing, standardizing and integrating information across diverse sources, including narrative text. We have utilized the BioMedLEE NLP system to extract and encode, using standard ontologies (e.g., Cell Type Ontology, Mammalian Phenotype, Gene Ontology), biomolecular mechanisms and clinical phenotypes from the scientific literature. We subsequently applied semantic processing techniques to the structured BioMedLEE output to determine the relationships between these biomolecular and clinical phenotype concepts. We conducted an evaluation that shows an average precision and recall of BioMedLEE with respect to annotating phrases comprised of cell type, anatomy/disease, and gene/protein concepts were 86% and 78%, respectively. The precision of the asserted phenotype-molecular relationships was 75%.
Knowledge of medical entities, such as drug-related information is critical for many automated biomedical applications, such as decision support and pharmacovigilance. In this work, heterogeneous information sources were integrated automatically to obtain drug-related knowledge. We focus on one type of knowledge, drug-treats-condition, in the study and propose a framework for integrating disparate knowledge sources. Evaluation based on a random sample of drug-condition pairs indicated an overall coverage of 96%, recall of 98% and a precision of 87%. In conclusion, the preliminary study demonstrated that the knowledge generated from this study was comparable to the manually curated gold standard and that this method of automatically integrating knowledge sources is effective. The automated method should also be applicable to integrate other clinical knowledge, such as drug-related knowledge with omics information.
Many adverse drug effects (ADEs) can be attributed to drug interactions. Spontaneous reporting systems (SRS) provide a rich opportunity to detect novel post-marketed drug interaction adverse effects (DIAEs), as they include populations not well represented in clinical trials. However, their identification in SRS is nontrivial. Most existing research have addressed the statistical issues used to test or verify DIAEs, but not their identification as part of a systematic large scale database-wide mining process as discussed in this work. This paper examines the application of a highly optimized and tailored implementation of the Apriori algorithm, as well as methods addressing data quality issues, to the identification of DIAEs in FDAs SRS.
It is vital to detect the full safety profile of a drug throughout its market life. Current pharmacovigilance systems still have substantial limitations, however. The objective of our work is to demonstrate the feasibility of using natural language processing (NLP), the comprehensive Electronic Health Record (EHR), and association statistics for pharmacovigilance purposes.
Narrative discharge summaries were collected from the Clinical Information System at New York Presbyterian Hospital (NYPH). MedLEE, an NLP system, was applied to the collection to identify medication events and entities which could be potential adverse drug events (ADEs). Co-occurrence statistics with adjusted volume tests were used to detect associations between the two types of entities, to calculate the strengths of the associations, and to determine their cutoff thresholds. Seven drugs/drug classes (ibuprofen, morphine, warfarin, bupropion, paroxetine, rosiglitazone, ACE inhibitors) with known ADEs were selected to evaluate the system.
One hundred thirty-two potential ADEs were found to be associated with the 7 drugs. Overall recall and precision were 0.75 and 0.31 for known ADEs respectively. Importantly, qualitative evaluation using historic roll back design suggested that novel ADEs could be detected using our system.
This study provides a framework for the development of active, high-throughput and prospective systems which could potentially unveil drug safety profiles throughout their entire market life. Our results demonstrate that the framework is feasible although there are some challenging issues. To the best of our knowledge, this is the first study using comprehensive unstructured data from the EHR for pharmacovigilance.
To assess the performance of electronic health record data for syndromic surveillance and to assess the feasibility of broadly distributed surveillance.
Two systems were developed to identify influenza-like illness and gastrointestinal infectious disease in ambulatory electronic health record data from a network of community health centers. The first system used queries on structured data and was designed for this specific electronic health record. The second used natural language processing of narrative data, but its queries were developed independently from this health record. Both were compared to influenza isolates and to a verified emergency department chief complaint surveillance system.
Lagged cross-correlation and graphs of the three time series.
For influenza-like illness, both the structured and narrative data correlated well with the influenza isolates and with the emergency department data, achieving cross-correlations of 0.89 (structured) and 0.84 (narrative) for isolates and 0.93 and 0.89 for emergency department data, and having similar peaks during influenza season. For gastrointestinal infectious disease, the structured data correlated fairly well with the emergency department data (0.81) with a similar peak, but the narrative data correlated less well (0.47).
It is feasible to use electronic health records for syndromic surveillance. The structured data performed best but required knowledge engineering to match the health record data to the queries. The narrative data illustrated the potential performance of a broadly disseminated system and achieved mixed results.