Routine integration of genotype data into drug decision-making could improve patient safety, particularly if many relevant genetic variants can be assayed simultaneously before target drug prescribing. The frequency of pharmacogenetic prescribing opportunities and the potential adverse events (AE) mitigated are unknown. We examined the frequency with which 56 medications with known outcomes influenced by variant alleles were prescribed in a cohort of 52,942 medical home patients at Vanderbilt University Medical Center. Within a five-year window, we estimated that 64.8% (95% CI: 64.4%-65.2%) of individuals were exposed to at least one medication with an established pharmacogenetic association. Using previously published results for six medications with well-characterized, severe genetically-linked AEs, we estimated that 398 events (95% CI, 225 - 583) could have been prevented with an effective preemptive genotyping program. Our results suggest that multiplexed, preemptive genotyping may represent an efficient alternative approach to current single use (“reactive”) methods and may improve safety.
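As a rough check on the interval quoted above, the following minimal sketch computes a normal-approximation (Wald) confidence interval from the reported proportion and cohort size. The abstract does not state which interval method the authors used, so the choice of the Wald formula and the function name `wald_interval` are assumptions made for illustration only.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: the abstract does not say how its interval was computed;
    the Wald formula is used here only as an illustration.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Figures reported in the abstract: 64.8% of 52,942 patients.
low, high = wald_interval(0.648, 52942)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the result rounds to the 64.4%-65.2% range reported in the abstract.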
The clinical research data sets exchanged in international epidemiology research often lack the elements needed to assess their suitability for use in multi-region meta-analyses. While the missing information is generally known to local investigators, it is not contained in the files exchanged between sites. Instead, such content must be solicited by the study coordinating center through a series of lengthy phone and electronic communications: an informal process whose reproducibility and accuracy decay over time. This report describes a set of supplemental information needed to assess whether clinical research data from diverse research sites are truly comparable, and what metadata (“data about the data”) should be preserved when a data set is archived for future use. We propose a structured Extensible Markup Language (XML) model that captures this information. The authors hope this model will be a first step towards preserving the metadata associated with clinical research data sets, thereby improving the quality of international data exchange, data archiving, and merged-data research using data collected in many different countries, languages and care settings.
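To make the idea of an XML metadata record concrete, the sketch below builds a small record with Python's standard `xml.etree.ElementTree` module. The abstract does not give the schema, so every element name, attribute name, and example value here (`datasetMetadata`, `site`, `variable`, and so on) is hypothetical and should not be read as the authors' model.

```python
import xml.etree.ElementTree as ET

def build_dataset_metadata(site, country, language, variables):
    """Build a metadata record; all element and attribute names are hypothetical."""
    root = ET.Element("datasetMetadata")
    ET.SubElement(root, "site").text = site
    ET.SubElement(root, "country").text = country
    ET.SubElement(root, "language").text = language
    vars_el = ET.SubElement(root, "variables")
    for name, unit, definition in variables:
        # One <variable> per data element, with its unit and local definition.
        var_el = ET.SubElement(vars_el, "variable", name=name, unit=unit)
        var_el.text = definition
    return root

# Illustrative values only; none of these come from the source.
record = build_dataset_metadata(
    site="Example Clinic",
    country="Example Country",
    language="en",
    variables=[
        ("cd4_count", "cells/uL", "Most recent CD4 count"),
        ("weight", "kg", "Body weight at visit"),
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping units and local definitions next to each variable is one way the information that is "generally known to local investigators" could travel with the exchanged file.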
Programming Languages; Software Design; Knowledge Representation (Computer); Database Management Systems
To identify novel genetic loci influencing interindividual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n = 1904) and replication (n = 411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million single-nucleotide polymorphisms on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria—HBB (11p15.4), HBA1/HBA2 (16p13.3), and G6PD (Xq28)—were associated (P ≤ 1 × 10⁻⁶) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P < 5 × 10⁻⁸) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We were not able to replicate loci identified by a previous GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture of RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
red blood cell (RBC) traits; genome-wide association study; African-Americans; natural selection; informatics; electronic medical record
The era of “Personalized Medicine,” guided by individual molecular variation in DNA, RNA, expressed proteins and other forms of high volume molecular data brings new requirements and challenges to the design and implementation of Electronic Health Records (EHRs). In this article we describe the characteristics of biomolecular data that differentiate it from other classes of data commonly found in EHRs, enumerate a set of technical desiderata for its management in healthcare settings, and offer a candidate technical approach to its compact and efficient representation in operational systems.
Electronic Health Records; Genomics; Knowledge representation; Data compression
The promise of “personalized medicine” guided by an understanding of each individual’s genome has been fostered by increasingly powerful and economical methods to acquire clinically relevant features. We describe operational implementation of prospective genotyping linked to an advanced clinical decision support system to guide individualized healthcare in a large academic health center. This approach to personalized medicine includes patient and healthcare provider engagement, identifying relevant genetic variation for implementation, assay reliability, point-of-care decision support, and necessary institutional investments. In one year, approximately 3,000 patients, most scheduled for cardiac catheterization, were genotyped on a multiplexed platform including CYP2C19 variants that modulate response to the widely-used antiplatelet drug clopidogrel. These data are deposited into the Electronic Medical Record and point-of-care decision support is deployed when clopidogrel is prescribed for those with variant genotypes. The establishment of programs such as this is a first step toward implementing and evaluating strategies for personalized medicine.
Drug-Drug Interactions; Personalized Medicine; Pharmacogenetics; Translational Medicine; Adverse Drug Reactions
In 2008, 11 new fellows were elected to the American College of Medical Informatics, and were inducted into the College at a ceremony held in conjunction with the American Medical Informatics Association conference in Washington, DC on Nov 9, 2008. A brief synopsis of the background and accomplishments of each of the new fellows is provided here, in alphabetical order.
To identify common genetic variants influencing red blood cell (RBC) traits.
Patients and Methods
We performed a genomewide association study from June 2008 through July 2011 of hemoglobin, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration in 12,486 patients of European ancestry from the electronic MEdical Records and Genomics (eMERGE) network. We developed an electronic medical record–based algorithm that included individuals who had RBC measurements obtained for clinical care and excluded values measured in the setting of hematopoietic disorders, comorbid conditions, or medications known to affect RBC production or a recent history of blood loss.
We identified 4 new genetic loci and replicated 11 loci previously reported to be associated with one or more RBC traits in individuals of European ancestry. Notably, genes present in 3 of the 4 newly identified loci (THRB, PTPLAD1, CDT1) and in 6 of the 11 replicated loci (KLF1, ALDH8A1, CCND3, SPTA1, FBXO7, TFR2/EPO) are implicated in erythroid differentiation and regulation of cell cycle in hematopoietic stem cells.
Genes in the erythroid differentiation and cell cycle regulation pathways influence interindividual variation in RBC indices. Our results provide insights into the molecular basis underlying variation in RBC traits.
eMERGE, electronic MEdical Records and GEnomics; EMMAX, mixed-model association-expedited; EMR, electronic medical record; eQTL, expression quantitative trait locus; GHC, Group Health Cooperative–University of Washington; GWAS, genomewide association study; HCT, hematocrit; HGB, hemoglobin; IBS, identity-by-state; LD, linkage disequilibrium; MC, Marshfield Clinic; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MIM, Mendelian Inheritance in Man; NU, Northwestern University; RBC, red blood cell; SNP, single-nucleotide polymorphism; VUMC, Vanderbilt University Medical Center
StarBRITE is a one-stop, web-based research portal designed to meet the day-to-day needs of the Vanderbilt University and Meharry Medical College research community during the planning and conduct of research studies. StarBRITE serves as the main online location for research support addressing issues such as identification and location of resources, identification of experts, guidance for regulatory applications and approvals, regulatory assistance, funding requests, and research data planning and collection, and it serves as a central repository for educational offerings. To date, there have been more than 590,038 StarBRITE hits by more than 6582 cumulative users. We present here StarBRITE design objectives, details about technical infrastructure and system components, a status report and activity metrics for the first 2.75 years of operation, and a report of lessons learned during organizing, launching and refining the portal.
Biomedical Informatics; Clinical Research; Translational Research; Scientific Portfolio Management; Researcher Portal; Research Services
Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and more effective gene-based disease management. A key aspect in facilitating such studies requires standardized representation of the phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, as well as how their adoption in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis.
The authors mapped phenotype data dictionaries from five different eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases such as peripheral arterial disease and type 2 diabetes. For mapping, standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine), were used. The mapping process comprised both lexical (via searching for relevant pre-coordinated concepts and data elements) and semantic (via post-coordination) techniques. Where feasible, new data elements were curated to enhance the coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies.
Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts before any additional curation of terminology and metadata resources was initiated by eMERGE investigators. After curation of 54 new caDSR CDEs and nine new NCI thesaurus concepts and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT. A web-based tool was also implemented to assist in semi-automatic mapping of data elements.
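The lexical step described above can be pictured as normalized string matching between local data element names and vocabulary terms. The sketch below is a minimal illustration under that assumption; it is not the eMERGE tool, and the function names, concept codes, and example labels are all hypothetical.

```python
import re

def normalize(label):
    """Lowercase, turn non-alphanumeric runs into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())

def lexical_map(data_elements, vocabulary):
    """Map element names to concept codes by exact match on normalized labels."""
    index = {normalize(term): code for code, term in vocabulary.items()}
    mapped, unmapped = {}, []
    for element in data_elements:
        code = index.get(normalize(element))
        if code is None:
            unmapped.append(element)  # candidate for curation or post-coordination
        else:
            mapped[element] = code
    return mapped, unmapped

# Hypothetical vocabulary codes and local element names, for illustration only.
vocabulary = {"C001": "Body Mass Index", "C002": "Systolic Blood Pressure"}
elements = ["body_mass_index", "Systolic blood pressure", "ankle_brachial_index"]
mapped, unmapped = lexical_map(elements, vocabulary)
print(mapped, unmapped)

# Coverage figure reported in the abstract: 95 of 157 elements.
print(f"{95 / 157:.1%}")
```

Elements left in `unmapped` correspond to the cases that, in the study, required newly curated data elements or post-coordination.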
This study emphasizes the requirement for standardized representation of clinical research data using existing metadata and terminology resources and provides simple techniques and software for data element mapping using experiences from the eMERGE Network.
informatics; ontologies; knowledge representations; controlled terminologies and vocabularies; machine learning; terminologies; metadata; mapping; harmonization; eMERGE Network
The 1999 debate of the American College of Medical Informatics focused on the proposition that medical informatics and nursing informatics are distinctive disciplines that require their own core curricula, training programs, and professional identities. Proponents of this position emphasized that informatics training, technology applications, and professional identities are closely tied to the activities of the health professionals they serve and that, as nursing and medicine differ, so do the corresponding efforts in information science and technology. Opponents of the proposition asserted that informatics is built on a re-usable and widely applicable set of methods that are common to all health science disciplines, and that “medical informatics” continues to be a useful name for a composite core discipline that should be studied by all students, regardless of their health profession orientation.
Observational studies of health conditions and outcomes often combine clinical care data from many sites without explicitly assessing the accuracy and completeness of these data. In order to improve the quality of data in an international multi-site observational cohort of HIV-infected patients, the authors conducted on-site, Good Clinical Practice-based audits of the clinical care datasets submitted by participating HIV clinics. Discrepancies between data submitted for research and data in the clinical records were categorized using the audit codes published by the European Organization for the Research and Treatment of Cancer. Five of seven sites had error rates >10% in key study variables, notably laboratory data, weight measurements, and antiretroviral medications. All sites had significant discrepancies in medication start and stop dates. Clinical care data, particularly antiretroviral regimens and associated dates, are prone to substantial error. Verifying data against source documents through audits will improve the quality of databases and research and can be a technique for retraining staff responsible for clinical data collection. The authors recommend that all participants in observational cohorts use data audits to assess and improve the quality of data and to guide future data collection and abstraction efforts at the point of care.
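The per-variable error rates mentioned above amount to comparing each submitted value with the value found in the source record. The sketch below illustrates that comparison under simple assumptions: records are keyed by a shared identifier, any inequality counts as a discrepancy, and the variable names and values are invented for the example. It does not reproduce the published audit codes.

```python
def error_rates(submitted, source, variables):
    """Fraction of audited records whose submitted value differs from the source."""
    rates = {}
    for var in variables:
        compared = mismatched = 0
        for record_id, source_row in source.items():
            submitted_row = submitted.get(record_id)
            if submitted_row is None:
                continue  # record was not submitted, so it cannot be compared
            compared += 1
            if submitted_row.get(var) != source_row.get(var):
                mismatched += 1
        rates[var] = mismatched / compared if compared else float("nan")
    return rates

# Invented example records; the field names are hypothetical.
source = {
    1: {"weight_kg": 70, "art_start": "2005-03-01"},
    2: {"weight_kg": 82, "art_start": "2006-07-15"},
}
submitted = {
    1: {"weight_kg": 70, "art_start": "2005-03-10"},
    2: {"weight_kg": 28, "art_start": "2006-07-15"},
}
rates = error_rates(submitted, source, ["weight_kg", "art_start"])
flagged = [var for var, rate in rates.items() if rate > 0.10]
print(rates, flagged)
```

The 10% threshold in `flagged` mirrors the cut-off quoted in the abstract; a real audit would also need rules for missing values and near-matches such as dates that differ by a few days.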
We describe a two-stage analytical approach for characterizing morbidity profile dissimilarity among patient cohorts using electronic medical records. We capture morbidities using the International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes. In the first stage of the approach separate logistic regression analyses for ICD-9 sections (e.g., “hypertensive disease” or “appendicitis”) are conducted, and the odds ratios that describe adjusted differences in prevalence between two cohorts are displayed graphically. In the second stage, the results from ICD-9 section analyses are combined into a general morbidity dissimilarity index (MDI). For illustration, we examine nine cohorts of patients representing six phenotypes (or controls) derived from five institutions, each a participant in the electronic MEdical REcords and GEnomics (eMERGE) network. The phenotypes studied include type II diabetes and type II diabetes controls, peripheral arterial disease and peripheral arterial disease controls, normal cardiac conduction as measured by electrocardiography, and senile cataracts.
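To show what a first-stage odds ratio expresses, the sketch below computes an unadjusted odds ratio with a Woolf (log-scale) 95% interval from a 2x2 table for a single ICD-9 section. This is a deliberate simplification: the approach described above uses logistic regression to obtain adjusted odds ratios, and the MDI formula is not given in the abstract, so neither is reproduced here. The counts are invented.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Woolf interval for a 2x2 table.

    a, b: cohort A patients with / without a code in the ICD-9 section
    c, d: cohort B patients with / without a code in the ICD-9 section
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - z * se_log)
    high = math.exp(math.log(estimate) + z * se_log)
    return estimate, low, high

# Invented counts for one ICD-9 section in two cohorts.
estimate, low, high = odds_ratio(a=30, b=70, c=15, d=85)
print(f"OR = {estimate:.2f} ({low:.2f} to {high:.2f})")
```

An odds ratio above 1 indicates that the section is more prevalent in cohort A than in cohort B; plotting one such estimate per section gives the kind of graphical display the first stage describes.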
Electronic medical records; ICD-9; dissimilarity index; comorbidity index; population comparison; morbidity dissimilarity index
Recent genome-wide association studies (GWAS) using selected community populations have identified genomic signals in SCN10A influencing PR duration. The extent to which this can be demonstrated in cohorts derived from electronic medical records is unknown.
Methods and Results
We performed a GWAS on 2,334 European-American patients with normal ECGs without evidence of prior heart disease from the Vanderbilt DNA databank, BioVU, which accrues subjects from routine patient care. Subjects were identified using combinations of natural language processing, laboratory, and billing code queries of de-identified medical record data. Subjects were 58% female, mean (±SD) age 54±15 years, and had mean PR intervals of 158±18 milliseconds. Genotyping was performed using the Illumina Human660W-Quad platform. Our results identify four single nucleotide polymorphisms (rs6800541, rs6795970, rs6798015, rs7430477) linked to SCN10A associated with PR interval (p=5.73×10⁻⁷ to 1.78×10⁻⁶).
This GWAS confirms a gene heretofore unimplicated in cardiac pathophysiology as a modulator of PR interval in humans. This study is one of the first replication GWAS performed using an electronic medical record-derived cohort, supporting the further use of such cohorts for genotype-phenotype analyses.
electronic medical records; atrioventricular conduction; genome-wide association study; natural language processing
Combining genome-wide association studies (GWAS) data with clinical information from the electronic medical record (EMR) provides unprecedented opportunities to identify genetic variants that influence susceptibility to common, complex diseases. While mining the vastness of EMR greatly expands the potential for conducting GWAS, non-standardized representation and wide variability of clinical data and phenotypes pose a major challenge to data integration and analysis. To address this challenge, we present experiences and methods developed to map phenotypic data elements from eMERGE (Electronic Medical Record and Genomics) to PhenX (Consensus Measures for Phenotypes and Exposures) and NCI’s Cancer Data Standards Registry and Repository (caDSR). Our results suggest that adopting multiple standards and biomedical terminologies will expose studies to a broader user community and enhance interoperability with a wider range of studies, in turn promoting cross-study pooling of data to detect both more subtle and more complex genotype-phenotype associations.
Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10⁻⁸ (eMERGE) and p=2.5×10⁻²⁰ (CHARGE) and rs6795970 in SCN10A with p=6×10⁻⁶ (eMERGE) and p=5×10⁻²⁷ (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10⁻⁶ (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
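A false discovery rate threshold such as the one quoted above is commonly obtained with the Benjamini-Hochberg step-up procedure. The sketch below shows that procedure on a handful of made-up p-values. The abstract does not say which FDR method the authors applied, so the use of Benjamini-Hochberg here is an assumption, as are the function name and the example values.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return a list of booleans marking which tests are declared significant.

    Assumption: Benjamini-Hochberg is used for illustration; the abstract
    does not name the FDR procedure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Made-up p-values for five SNP-phenotype tests.
p_values = [0.001, 0.2, 0.03, 0.04, 0.5]
significant = benjamini_hochberg(p_values, fdr=0.1)
print(significant)
```

With millions of SNP-phenotype pairs, the same rule yields a single p-value cut-off below which all associations are reported, which is how a statement of the form "P below a threshold at FDR < 0.1" can arise.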
Significant research has been devoted to predicting diagnosis, prognosis, and response to treatment using high-throughput assays. Rapid translation into clinical results hinges upon efficient access to up-to-date and high-quality molecular medicine modalities.
We first explain why this goal is inadequately supported by existing databases and portals and then introduce a novel semantic indexing and information retrieval model for clinical bioinformatics. The formalism provides the means for indexing a variety of relevant objects (e.g. papers, algorithms, signatures, datasets) and includes a model of the research processes that creates and validates these objects in order to support their systematic presentation once retrieved.
We test the applicability of the model by constructing proof-of-concept encodings and visual presentations of evidence and modalities in molecular profiling and prognosis of: (a) diffuse large B-cell lymphoma (DLBCL) and (b) breast cancer.
information retrieval; molecular medicine; semantic model; clinical bioinformatics; predictive computational models