The era of “Personalized Medicine,” guided by individual molecular variation in DNA, RNA, expressed proteins and other forms of high-volume molecular data, brings new requirements and challenges to the design and implementation of Electronic Health Records (EHRs). In this article we describe the characteristics of biomolecular data that differentiate it from other classes of data commonly found in EHRs, enumerate a set of technical desiderata for its management in healthcare settings, and offer a candidate technical approach to its compact and efficient representation in operational systems.
Electronic Health Records; Genomics; Knowledge representation; Data compression
Platelets are enucleated cell fragments derived from megakaryocytes that play key roles in hemostasis and in the pathogenesis of atherothrombosis and cancer. Platelet traits are highly heritable, and identifying genetic variants associated with platelet traits and assessing their pleiotropic effects may help to clarify the role of the underlying biological pathways. We conducted an electronic medical record (EMR)-based study to identify common variants that influence inter-individual variation in the number of circulating platelets (PLT) and mean platelet volume (MPV), by performing a genome-wide association study (GWAS). We characterized variants influencing MPV and PLT using functional, pathway and disease enrichment analyses, and assessed the pleiotropic effects of such variants by performing a phenome-wide association study (PheWAS) with a wide range of EMR-derived phenotypes. A total of 13,582 participants in the electronic MEdical Records and GEnomics (eMERGE) network had data for PLT and 6,291 participants had data for MPV. We identified 5 chromosomal regions associated with PLT and 8 associated with MPV at genome-wide significance (P<5E-8). In addition, we replicated 20 SNPs (out of 56 SNPs (α: 0.05/56=9E-4)) influencing PLT and 22 SNPs (out of 29 SNPs (α: 0.05/29=2E-3)) influencing MPV in a meta-analysis of GWAS of PLT and MPV. While our GWAS did not reveal any novel associations, our functional analyses revealed that genes in these regions influence thrombopoiesis and encode kinases, membrane proteins, proteins involved in cellular trafficking, transcription factors, proteasome complex subunits, proteins of signal transduction pathways, proteins involved in megakaryocyte development and platelet production and hemostasis. PheWAS using a single-SNP Bonferroni correction for 1368 diagnoses (0.05/1368=3.6E-5) revealed that several variants in these genes have pleiotropic associations with myocardial infarction, autoimmune and hematologic disorders.
We conclude that multiple genetic loci influence interindividual variation in platelet traits and also have significant pleiotropic effects; the related genes are in multiple functional pathways including those relevant to thrombopoiesis.
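The Bonferroni thresholds quoted in the platelet-trait abstract above follow from dividing a family-wise alpha of 0.05 by the number of tests. The short Python sketch below reproduces that arithmetic; the function name and dictionary labels are illustrative only and do not come from the study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Test counts and alpha are the ones stated in the abstract.
thresholds = {
    "PLT replication (56 SNPs)": bonferroni_threshold(0.05, 56),
    "MPV replication (29 SNPs)": bonferroni_threshold(0.05, 29),
    "PheWAS (1368 diagnoses)": bonferroni_threshold(0.05, 1368),
}

for label, value in thresholds.items():
    print(f"{label}: {value:.2e}")
```

The unrounded values are about 8.9E-4, 1.7E-3 and 3.65E-5, which the abstract reports as 9E-4, 2E-3 and 3.6E-5.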
Thyroid-stimulating hormone (TSH) levels are normally tightly regulated within an individual; thus, relatively small variations may indicate thyroid disease. Genome-wide association studies (GWAS) have identified variants in PDE8B and FOXE1 that are associated with TSH levels. However, prior studies lacked racial/ethnic diversity, limiting the generalization of these findings to individuals of non-European ethnicities. The Electronic Medical Records and Genomics (eMERGE) Network is a collaboration across institutions with biobanks linked to electronic medical records (EMRs). The eMERGE Network uses EMR-derived phenotypes to perform GWAS in diverse populations for a variety of phenotypes. In this report, we identified serum TSH levels from 4,501 European American and 351 African American euthyroid individuals in the eMERGE Network with existing GWAS data. Tests of association were performed using linear regression and adjusted for age, sex, body mass index (BMI), and principal components, assuming an additive genetic model. Our results replicate the known association of PDE8B with serum TSH levels in European Americans (rs2046045 p = 1.85×10−17, β = 0.09). FOXE1 variants, associated with hypothyroidism, were not genome-wide significant (rs10759944: p = 1.08×10−6, β = −0.05). No SNPs reached genome-wide significance in African Americans. However, multiple known associations with TSH levels in European ancestry were nominally significant in African Americans, including PDE8B (rs2046045 p = 0.03, β = −0.09), VEGFA (rs11755845 p = 0.01, β = −0.13), and NFIA (rs334699 p = 1.50×10−3, β = −0.17). We found little evidence that SNPs previously associated with other thyroid-related disorders were associated with serum TSH levels in this study. These results support the previously reported association between PDE8B and serum TSH levels in European Americans and emphasize the need for additional genetic studies in more diverse populations.
Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10−8 (eMERGE) and p=2.5×10−20 (CHARGE) and rs6795970 in SCN10A with p=6×10−6 (eMERGE) and p=5×10−27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10−6 (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
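The PheWAS abstract above pairs a P-value cutoff with a false discovery rate below 0.1 but does not say which FDR procedure was used. As a rough illustration of how such a cutoff can arise, the sketch below implements the Benjamini-Hochberg step-up procedure; the choice of procedure is an assumption, and the p-values are toy numbers, not study data.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return sorted indices of hypotheses rejected at FDR level q.

    Benjamini-Hochberg step-up: find the largest rank k such that the
    k-th smallest p-value is <= (k / m) * q, then reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_passing_rank = rank
    return sorted(order[:largest_passing_rank])


# Toy p-values for illustration only.
toy_p = [0.001, 0.04, 0.03, 0.2, 0.5]
rejected = benjamini_hochberg(toy_p, q=0.1)
print(rejected)
```

With many more tests, as in a scan of thousands of SNPs against more than a thousand phenotypes, the same rule yields a far smaller per-test cutoff.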
In 2008, 11 new fellows were elected to the American College of Medical Informatics, and were inducted into the College at a ceremony held in conjunction with the American Medical Informatics Association conference in Washington, DC on Nov 9, 2008. A brief synopsis of the background and accomplishments of each of the new fellows is provided here, in alphabetical order.
Routine integration of genotype data into drug decision-making could improve patient safety, particularly if many relevant genetic variants can be assayed simultaneously before target drug prescribing. The frequency of pharmacogenetic prescribing opportunities and the potential adverse events (AE) mitigated are unknown. We examined the frequency with which 56 medications with known outcomes influenced by variant alleles were prescribed in a cohort of 52,942 medical home patients at Vanderbilt University Medical Center. Within a five-year window, we estimated that 64.8% (95% CI: 64.4%-65.2%) of individuals were exposed to at least one medication with an established pharmacogenetic association. Using previously published results for six medications with well-characterized, severe genetically-linked AEs, we estimated that 398 events (95% CI, 225 - 583) could have been prevented with an effective preemptive genotyping program. Our results suggest that multiplexed, preemptive genotyping may represent an efficient alternative approach to current single use (“reactive”) methods and may improve safety.
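The confidence interval around the 64.8% exposure estimate in the abstract above can be approximated from the cohort size alone. The sketch below uses a Wald (normal-approximation) interval; the abstract does not state which interval method the authors used, so that choice is an assumption.

```python
import math
from typing import Tuple


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Proportion and cohort size as reported in the abstract.
low, high = wald_interval(0.648, 52_942)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Under this assumption the interval agrees with the reported 64.4% to 65.2% after rounding to one decimal place.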
Antiretroviral therapy (ART) decreases mortality risk in HIV-infected tuberculosis patients, but the effect on mortality of the duration of anti-tuberculosis therapy, and of the timing of its initiation relative to ART initiation, is unclear.
We conducted a retrospective observational multi-center cohort study among HIV-infected persons concomitantly treated with rifamycin-based anti-tuberculosis therapy and ART in Latin America. The study population included persons for whom 6 months of anti-tuberculosis therapy is recommended.
Of 253 patients who met inclusion criteria, median CD4+ lymphocyte count at ART initiation was 64 cells/mm3, 171 (68%) received >180 days of anti-tuberculosis therapy, 168 (66%) initiated anti-tuberculosis therapy before ART, and 43 (17%) died. In a multivariate Cox proportional hazards model that adjusted for CD4+ lymphocytes and HIV-1 RNA, tuberculosis diagnosed after ART initiation was associated with an increased risk of death compared to tuberculosis diagnosis before ART initiation (HR 2.40; 95% CI 1.15, 5.02; P = 0.02). In a separate model among patients surviving >6 months after tuberculosis diagnosis, after adjusting for CD4+ lymphocytes, HIV-1 RNA, and timing of ART initiation relative to tuberculosis diagnosis, receipt of >6 months of anti-tuberculosis therapy was associated with a decreased risk of death (HR 0.23; 95% CI 0.08, 0.66; P=0.007).
The increased risk of death among persons diagnosed with tuberculosis after ART initiation highlights the importance of screening for tuberculosis before ART initiation. The decreased risk of death among persons receiving > 6 months of anti-tuberculosis therapy suggests that current anti-tuberculosis treatment duration guidelines should be re-evaluated.
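A reader can check that the hazard ratios, confidence intervals and P values reported in the tuberculosis abstract above are mutually consistent. The sketch below backs out the standard error on the log scale from a 95% interval and computes a two-sided Wald P value. It assumes the intervals are symmetric on the log scale, which the abstract does not state, and the function name is illustrative.

```python
import math


def wald_p_from_hr_ci(hr, lower, upper, z_crit=1.96):
    """Approximate z statistic and two-sided P value from a hazard ratio and its CI."""
    # Width of the CI on the log scale is 2 * z_crit * SE.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Values as reported in the abstract.
z_timing, p_timing = wald_p_from_hr_ci(2.40, 1.15, 5.02)
z_duration, p_duration = wald_p_from_hr_ci(0.23, 0.08, 0.66)
print(f"timing: z={z_timing:.2f}, p={p_timing:.3f}")
print(f"duration: z={z_duration:.2f}, p={p_duration:.3f}")
```

Both recovered values land close to the reported P = 0.02 and P = 0.007; small differences are expected because the published interval bounds are rounded.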
The clinical research data sets exchanged in international epidemiology research often lack the elements needed to assess their suitability for use in multi-region meta-analyses. While the missing information is generally known to local investigators, it is not contained in the files exchanged between sites. Instead, such content must be solicited by the study coordinating center through a series of lengthy phone and electronic communications: an informal process whose reproducibility and accuracy decay over time. This report describes a set of supplemental information needed to assess whether clinical research data from diverse research sites are truly comparable, and what metadata (“data about the data”) should be preserved when a data set is archived for future use. We propose a structured Extensible Markup Language (XML) model that captures this information. The authors hope this model will be a first step towards preserving the metadata associated with clinical research data sets, thereby improving the quality of international data exchange, data archiving, and merged-data research using data collected in many different countries, languages and care settings.
Programming Languages; Software Design; Knowledge Representation (Computer); Database Management Systems
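The metadata abstract above proposes a structured XML model but does not list its elements. The sketch below shows one way site-level context might be attached to an exchanged data set using Python's standard library. Every element and attribute name here (datasetMetadata, site, variable, collectionNote and so on) is hypothetical, as are the sample values.

```python
import xml.etree.ElementTree as ET


def build_metadata(site_id, country, language, variables):
    """Serialize site-level context for an exchanged data set as XML.

    All element and attribute names are hypothetical placeholders,
    not the schema proposed in the report.
    """
    root = ET.Element("datasetMetadata")
    site = ET.SubElement(root, "site", id=site_id)
    ET.SubElement(site, "country").text = country
    ET.SubElement(site, "language").text = language
    var_list = ET.SubElement(root, "variables")
    for name, unit, note in variables:
        var = ET.SubElement(var_list, "variable", name=name, unit=unit)
        ET.SubElement(var, "collectionNote").text = note
    return ET.tostring(root, encoding="unicode")


# Illustrative values only.
xml_text = build_metadata(
    "site-01",
    "ExampleCountry",
    "es",
    [("cd4_count", "cells/mm3", "Illustrative note: measured at enrollment visit")],
)
print(xml_text)
```

Keeping such a file alongside the data set means the context travels with the data instead of being reconstructed later from phone calls and e-mail.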
To identify novel genetic loci influencing interindividual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n = 1904) and replication (n = 411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million single-nucleotide polymorphisms on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria—HBB (11p15.4), HBA1/HBA2 (16p13.3), and G6PD (Xq28)—were associated (P ≤ 1 × 10−6) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P < 5 × 10−8) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We were not able to replicate loci identified by a previous GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture of RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
red blood cell (RBC) traits; genome-wide association study; African-Americans; natural selection; informatics; electronic medical record
The 1999 debate of the American College of Medical Informatics focused on the proposition that medical informatics and nursing informatics are distinctive disciplines that require their own core curricula, training programs, and professional identities. Proponents of this position emphasized that informatics training, technology applications, and professional identities are closely tied to the activities of the health professionals they serve and that, as nursing and medicine differ, so do the corresponding efforts in information science and technology. Opponents of the proposition asserted that informatics is built on a re-usable and widely applicable set of methods that are common to all health science disciplines, and that “medical informatics” continues to be a useful name for a composite core discipline that should be studied by all students, regardless of their health profession orientation.
The promise of “personalized medicine” guided by an understanding of each individual’s genome has been fostered by increasingly powerful and economical methods to acquire clinically relevant features. We describe operational implementation of prospective genotyping linked to an advanced clinical decision support system to guide individualized healthcare in a large academic health center. This approach to personalized medicine includes patient and healthcare provider engagement, identifying relevant genetic variation for implementation, assay reliability, point-of-care decision support, and necessary institutional investments. In one year, approximately 3,000 patients, most scheduled for cardiac catheterization, were genotyped on a multiplexed platform including CYP2C19 variants that modulate response to the widely-used antiplatelet drug clopidogrel. These data are deposited into the Electronic Medical Record and point-of-care decision support is deployed when clopidogrel is prescribed for those with variant genotypes. The establishment of programs such as this is a first step toward implementing and evaluating strategies for personalized medicine.
Drug-Drug Interactions; Personalized Medicine; Pharmacogenetics; Translational Medicine; Adverse Drug Reactions
Warfarin pharmacogenomic algorithms reduce dosing error, but perform poorly in non-European–Americans. Electronic health record (EHR) systems linked to biobanks may allow for pharmacogenomic analysis, but they have not yet been used for this purpose.
Patients & methods
We used BioVU, the Vanderbilt EHR-linked DNA repository, to identify European–Americans (n = 1022) and African–Americans (n = 145) on stable warfarin therapy and evaluated the effect of 15 pharmacogenetic variants on stable warfarin dose.
Variants in VKORC1, CYP2C9 and CYP4F2 were associated with weekly dose in European–Americans, as were additional variants in CYP2C9 and CALU in African–Americans. Compared with traditional 5 mg/day dosing, implementing the US FDA recommendations or the International Warfarin Pharmacogenomics Consortium (IWPC) algorithm reduced error in weekly dose in European–Americans (from 13.5 to 12.4 and 9.5 mg/week, respectively) but less so in African–Americans (from 15.2 to 15.0 and 13.8 mg/week, respectively). By further incorporating associated variants specific for European–Americans and African–Americans in an expanded algorithm, dose-prediction error was reduced to 9.1 mg/week (95% CI: 8.4–9.6) in European–Americans and 12.4 mg/week (95% CI: 10.0–13.2) in African–Americans. The expanded algorithm explained 41 and 53% of dose variation in African–Americans and European–Americans, respectively, compared with 29 and 50%, respectively, for the IWPC algorithm. Implementing these predictions via dispensable pill regimens similarly reduced dosing error.
These results validate EHR-linked DNA biorepositories as real-world resources for pharmacogenomic validation and discovery.
anticoagulants; bioinformatics; electronic health record; genes; pharmacogenomics; warfarin
To identify common genetic variants influencing red blood cell (RBC) traits.
Patients and Methods
We performed a genomewide association study from June 2008 through July 2011 of hemoglobin, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration in 12,486 patients of European ancestry from the electronic MEdical Records and Genomics (eMERGE) network. We developed an electronic medical record–based algorithm that included individuals who had RBC measurements obtained for clinical care and excluded values measured in the setting of hematopoietic disorders, comorbid conditions, or medications known to affect RBC production or a recent history of blood loss.
We identified 4 new genetic loci and replicated 11 loci previously reported to be associated with one or more RBC traits in individuals of European ancestry. Notably, genes present in 3 of the 4 newly identified loci (THRB, PTPLAD1, CDT1) and in 6 of the 11 replicated loci (KLF1, ALDH8A1, CCND3, SPTA1, FBXO7, TFR2/EPO) are implicated in erythroid differentiation and regulation of cell cycle in hematopoietic stem cells.
Genes in the erythroid differentiation and cell cycle regulation pathways influence interindividual variation in RBC indices. Our results provide insights into the molecular basis underlying variation in RBC traits.
eMERGE, electronic MEdical Records and GEnomics; EMMAX, mixed-model association-expedited; EMR, electronic medical record; eQTL, expression quantitative trait locus; GHC, Group Health Cooperative–University of Washington; GWAS, genomewide association study; HCT, hematocrit; HGB, hemoglobin; IBS, identity-by-state; LD, linkage disequilibrium; MC, Marshfield Clinic; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MIM, Mendelian Inheritance in Man; NU, Northwestern University; RBC, red blood cell; SNP, single-nucleotide polymorphism; VUMC, Vanderbilt University Medical Center
New computer technologies have made it feasible to represent, store, and communicate high resolution biomedical images via electronic means. Traditional two dimensional medical images such as those on printed pages have been supplemented by three dimensional images which can be rendered, rotated, and “dissected” from any point of view. The library of the future will provide electronic access not only to words and numbers, but to pictures, sounds, and other nontextual information. There currently exist few widely-accepted standards for the representation and communication of complex images, yet such standards will be critical to the feasibility and usefulness of digital image collections in the life sciences. The National Library of Medicine is embarked on a project to develop a complete digital volumetric representation of an adult human male and female. This “Visible Human Project” will address the issue of standards for computer representation of biological structure.
StarBRITE is a one-stop, web-based research portal designed to meet the day-to-day needs of the Vanderbilt University and Meharry Medical College research community during the planning and conduct of research studies. StarBRITE serves as the main online location for research support addressing issues such as identification and location of resources, identification of experts, guidance for regulatory applications and approvals, regulatory assistance, funding requests, research data planning and collection, and serves as a central repository for educational offerings. To date, there have been more than 590,038 StarBRITE hits by more than 6582 cumulative users. We present here StarBRITE design objectives, details about technical infrastructure and system components, a status report and activity metrics for the first 2.75 years of operation, and a report of lessons learned during organizing, launching and refining the portal.
Biomedical Informatics; Clinical Research; Translational Research; Scientific Portfolio Management; Researcher Portal; Research Services
Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and more effective gene-based disease management. A key aspect in facilitating such studies requires standardized representation of the phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, as well as how their adoption in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis.
The authors mapped phenotype data dictionaries from five different eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases such as peripheral arterial disease and type 2 diabetes. For mapping, standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine), were used. The mapping process comprised both lexical (via searching for relevant pre-coordinated concepts and data elements) and semantic (via post-coordination) techniques. Where feasible, new data elements were curated to enhance the coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies.
Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts before any additional curation of terminology and metadata resources was initiated by eMERGE investigators. After curation of 54 new caDSR CDEs and nine new NCI Thesaurus concepts and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT. A web-based tool was also implemented to assist in semi-automatic mapping of data elements.
This study emphasizes the requirement for standardized representation of clinical research data using existing metadata and terminology resources and provides simple techniques and software for data element mapping using experiences from the eMERGE Network.
informatics; ontologies; knowledge representations; controlled terminologies and vocabularies; machine learning; terminologies; metadata; mapping; harmonization; eMERGE Network
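The lexical step of the data-element mapping described above (searching for pre-coordinated concepts whose labels match a data element) can be sketched as a normalize-then-look-up pass. This is a minimal illustration, not the authors' software: the function names, the toy data elements, and the placeholder concept identifiers are all hypothetical.

```python
import re


def normalize(label: str) -> str:
    """Lowercase a label and collapse punctuation and underscores to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())


def lexical_map(data_elements, concept_index):
    """Map data elements to concepts by exact match on normalized labels."""
    lookup = {normalize(name): cid for name, cid in concept_index.items()}
    mapped, unmapped = {}, []
    for element in data_elements:
        key = normalize(element)
        if key in lookup:
            mapped[element] = lookup[key]
        else:
            unmapped.append(element)  # candidates for curation or post-coordination
    return mapped, unmapped


# Placeholder identifiers, not real caDSR or SNOMED CT codes.
concepts = {"Body Mass Index": "CONCEPT-001", "Systolic blood pressure": "CONCEPT-002"}
elements = ["body_mass_index", "SYSTOLIC BLOOD PRESSURE", "ankle-brachial index"]

mapped, unmapped = lexical_map(elements, concepts)
coverage = len(mapped) / len(elements)
print(f"lexically mapped: {coverage:.0%}; needs curation: {unmapped}")
```

In the study itself, the analogous lexical coverage was 95 of 157 elements (about 60%), with the remainder handled through curation and post-coordination.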
DNA biobanks linked to comprehensive electronic health records systems are potentially powerful resources for pharmacogenetic studies. This study sought to develop natural-language-processing algorithms to extract drug-dose information from clinical text, and to assess the capabilities of such tools to automate the data-extraction process for pharmacogenetic studies.
Materials and methods
A manually validated warfarin pharmacogenetic study identified a cohort of 1125 patients with a stable warfarin dose, of whom 776 were managed by Coumadin Clinic physicians and the remaining 349 were managed by their providers. The authors developed two algorithms to extract weekly warfarin doses from both data sets: a regular expression-based program for semistructured Coumadin Clinic notes; and an advanced weekly dose calculator based on an existing medication information extraction system (MedEx) for narrative providers' notes. The authors then conducted an association analysis between an automatically extracted stable weekly dose of warfarin and four genetic variants of VKORC1 and CYP2C9 genes. The performance of the weekly dose-extraction program was evaluated by comparing it with a gold standard containing manually curated weekly doses. Precision, recall, F-measure, and overall accuracy were reported. Associations between known variants in VKORC1 and CYP2C9 and warfarin stable weekly dose were tested with linear regression adjusted for age, gender, and body mass index.
The authors' evaluation showed that the MedEx-based system could determine patients' warfarin weekly doses with 99.7% recall, 90.8% precision, and 93.8% accuracy. Using the automatically extracted weekly doses of warfarin, the authors successfully replicated the previous known associations between warfarin stable dose and genetic variants in VKORC1 and CYP2C9.
Automated learning; knowledge representations; discovery; text and data-mining methods; other methods of information extraction; natural-language processing; NLP; warfarin; Genetics; translational research—application of biological knowledge to clinical care; improving the education and skills training of health professionals; linking the genotype and phenotype
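The warfarin dose-extraction abstract above reports precision and recall for the MedEx-based system and says an F-measure was computed, without quoting it. Assuming the balanced F-measure (the harmonic mean of precision and recall, often written F1), the value implied by the reported figures can be computed as follows; the function name is illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
f1 = f_measure(precision=0.908, recall=0.997)
print(f"F-measure: {f1:.3f}")
```

Under that assumption the implied F-measure is about 0.95.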
Observational studies of health conditions and outcomes often combine clinical care data from many sites without explicitly assessing the accuracy and completeness of these data. In order to improve the quality of data in an international multi-site observational cohort of HIV-infected patients, the authors conducted on-site, Good Clinical Practice-based audits of the clinical care datasets submitted by participating HIV clinics. Discrepancies between data submitted for research and data in the clinical records were categorized using the audit codes published by the European Organization for the Research and Treatment of Cancer. Five of seven sites had error rates >10% in key study variables, notably laboratory data, weight measurements, and antiretroviral medications. All sites had significant discrepancies in medication start and stop dates. Clinical care data, particularly antiretroviral regimens and associated dates, are prone to substantial error. Verifying data against source documents through audits will improve the quality of databases and research and can be a technique for retraining staff responsible for clinical data collection. The authors recommend that all participants in observational cohorts use data audits to assess and improve the quality of data and to guide future data collection and abstraction efforts at the point of care.