Related Articles

1.  Pitfalls of Merging GWAS Data: Lessons Learned in the eMERGE Network and Quality Control Procedures to Maintain High Data Quality 
Genetic epidemiology  2011;35(8):887-898.
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. 
These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II and also serve as a starting point for investigators merging multiple genotype data sets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
doi:10.1002/gepi.20639
PMCID: PMC3592376  PMID: 22125226
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
2.  The Electronic Medical Records and Genomics (eMERGE) Network: Past, Present and Future 
The Electronic Medical Records and Genomics (eMERGE) Network is a National Human Genome Research Institute (NHGRI)-funded consortium engaged in the development of methods and best-practices for utilizing the Electronic Medical Record (EMR) as a tool for genomic research. Now in its sixth year, its second funding cycle and comprising nine research groups and a coordinating center, the network has played a major role in validating the concept that clinical data derived from EMRs can be used successfully for genomic research. Current work is advancing knowledge in multiple disciplines at the intersection of genomics and healthcare informatics, particularly electronic phenotyping, genome-wide association studies, genomic medicine implementation and the ethical and regulatory issues associated with genomics research and returning results to study participants. Here we describe the evolution, accomplishments, opportunities and challenges of the network since its inception as a five-group consortium focused on genotype-phenotype associations for genomic discovery to its current form as a nine-group consortium pivoting towards implementation of genomic medicine.
doi:10.1038/gim.2013.72
PMCID: PMC3795928  PMID: 23743551
electronic medical records; personalized medicine; genome-wide association studies; genetics and genomics; collaborative research
3.  Genome- and Phenome-Wide Analysis of Cardiac Conduction Identifies Markers of Arrhythmia Risk 
Circulation  2013;127(13):1377-1385.
Background
Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10^-8 (eMERGE) and p=2.5×10^-20 (CHARGE) and rs6795970 in SCN10A with p=6×10^-6 (eMERGE) and p=5×10^-27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
Conclusions
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
doi:10.1161/CIRCULATIONAHA.112.000604
PMCID: PMC3713791  PMID: 23463857
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
4.  Quality Control Procedures for Genome Wide Association Studies 
Genome-wide association studies (GWAS) are being conducted at an unprecedented rate in population-based cohorts and have increased our understanding of the pathophysiology of complex disease. The recent application of GWAS to clinic-based cohorts has also yielded genetic predictors of clinical outcomes. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. With each new dataset, new realities are discovered about GWAS data and best practices continue to be developed. The Genomics Workgroup of the National Human Genome Research Institute (NHGRI) funded electronic Medical Records and Genomics (eMERGE) network has invested considerable effort in developing strategies for QC of these data. The lessons learned by this group will be valuable for other investigators dealing with large scale genomic datasets. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the eMERGE network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. In this protocol we discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.
doi:10.1002/0471142905.hg0119s68
PMCID: PMC3066182  PMID: 21234875
5.  eMERGEing progress in genomics—the first seven years 
Frontiers in Genetics  2014;5:184.
The electronic MEdical Records & GEnomics (eMERGE) network was established in 2007 by the National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) in part to explore the utility of electronic medical records (EMRs) in genome science. The initial focus was on discovery primarily using the genome-wide association paradigm, but more recently, the network has begun evaluating mechanisms to implement new genomic information coupled to clinical decision support into EMRs. Herein, we describe this evolution including the development of the individual and merged eMERGE genomic datasets, the contribution the network has made toward genomic discovery and human health, and the steps taken toward the next generation genotype-phenotype association studies and clinical implementation.
doi:10.3389/fgene.2014.00184
PMCID: PMC4060012  PMID: 24987407
biobanks; genome-wide association studies; pharmacogenomics; electronic medical records
6.  Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network 
Background
Genetic studies require precise phenotype definitions, but electronic medical record (EMR) phenotype data are recorded inconsistently and in a variety of formats.
Objective
To present lessons learned about validation of EMR-based phenotypes from the Electronic Medical Records and Genomics (eMERGE) studies.
Materials and methods
The eMERGE network created and validated 13 EMR-derived phenotype algorithms. Network sites are Group Health, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University.
Results
By validating EMR-derived phenotypes we learned that: (1) multisite validation improves phenotype algorithm accuracy; (2) targets for validation should be carefully considered and defined; (3) specifying time frames for review of variables eases validation time and improves accuracy; (4) using repeated measures requires defining the relevant time period and specifying the most meaningful value to be studied; (5) patient movement in and out of the health plan (transience) can result in incomplete or fragmented data; (6) the review scope should be defined carefully; (7) particular care is required in combining EMR and research data; (8) medication data can be assessed using claims, medications dispensed, or medications prescribed; (9) algorithm development and validation work best as an iterative process; and (10) validation by content experts or structured chart review can provide accurate results.
Conclusions
Despite the diverse structure of the five EMRs of the eMERGE sites, we developed, validated, and successfully deployed 13 electronic phenotype algorithms. Validation is a worthwhile process that not only measures phenotype performance but also strengthens phenotype algorithm definitions and enhances their inter-institutional sharing.
doi:10.1136/amiajnl-2012-000896
PMCID: PMC3715338  PMID: 23531748
electronic medical record; electronic health record; genomics; phenotype; validation studies
7.  Genetic Variants Associated with Serum Thyroid Stimulating Hormone (TSH) Levels in European Americans and African Americans from the eMERGE Network 
PLoS ONE  2014;9(12):e111301.
Thyroid stimulating hormone (TSH) levels are normally tightly regulated within an individual; thus, relatively small variations may indicate thyroid disease. Genome-wide association studies (GWAS) have identified variants in PDE8B and FOXE1 that are associated with TSH levels. However, prior studies lacked racial/ethnic diversity, limiting the generalization of these findings to individuals of non-European ethnicities. The Electronic Medical Records and Genomics (eMERGE) Network is a collaboration across institutions with biobanks linked to electronic medical records (EMRs). The eMERGE Network uses EMR-derived phenotypes to perform GWAS in diverse populations for a variety of phenotypes. In this report, we identified serum TSH levels from 4,501 European American and 351 African American euthyroid individuals in the eMERGE Network with existing GWAS data. Tests of association were performed using linear regression and adjusted for age, sex, body mass index (BMI), and principal components, assuming an additive genetic model. Our results replicate the known association of PDE8B with serum TSH levels in European Americans (rs2046045 p = 1.85×10^-17, β = 0.09). FOXE1 variants, associated with hypothyroidism, were not genome-wide significant (rs10759944: p = 1.08×10^-6, β = −0.05). No SNPs reached genome-wide significance in African Americans. However, multiple known associations with TSH levels in European ancestry were nominally significant in African Americans, including PDE8B (rs2046045 p = 0.03, β = −0.09), VEGFA (rs11755845 p = 0.01, β = −0.13), and NFIA (rs334699 p = 1.50×10^-3, β = −0.17). We found little evidence that SNPs previously associated with other thyroid-related disorders were associated with serum TSH levels in this study. These results support the previously reported association between PDE8B and serum TSH levels in European Americans and emphasize the need for additional genetic studies in more diverse populations.
doi:10.1371/journal.pone.0111301
PMCID: PMC4249871  PMID: 25436638
8.  NEXT-GENERATION ANALYSIS OF CATARACTS: DETERMINING KNOWLEDGE DRIVEN GENE-GENE INTERACTIONS USING BIOFILTER, AND GENE-ENVIRONMENT INTERACTIONS USING THE PHENX TOOLKIT* 
Investigating the association between biobank derived genomic data and the information of linked electronic health records (EHRs) is an emerging area of research for dissecting the architecture of complex human traits, where cases and controls for study are defined through the use of electronic phenotyping algorithms deployed in large EHR systems. For our study, 2580 cataract cases and 1367 controls were identified within the Marshfield Personalized Medicine Research Project (PMRP) Biobank and linked EHR, which is a member of the NHGRI-funded electronic Medical Records and Genomics (eMERGE) Network. Our goal was to explore potential gene-gene and gene-environment interactions within these data for 529,431 single nucleotide polymorphisms (SNPs) with minor allele frequency > 1%, in order to explore higher level associations with cataract risk beyond investigations of single SNP-phenotype associations. To build our SNP-SNP interaction models we utilized a prior-knowledge driven filtering method called Biofilter to minimize the multiple testing burden of exploring the vast array of interaction models possible from our extensive number of SNPs. Using the Biofilter, we developed 57,376 prior-knowledge directed SNP-SNP models to test for association with cataract status. We selected models that required 6 sources of external domain knowledge. We identified 5 statistically significant models with an interaction term with p-value < 0.05, as well as an overall model with p-value < 0.05 associated with cataract status. We also conducted gene-environment interaction analyses for all GWAS SNPs and a set of environmental factors from the PhenX Toolkit: smoking, UV exposure, and alcohol use; these environmental factors have been previously associated with the formation of cataracts. We found a total of 288 models that exhibit an interaction term with a p-value ≤ 1×10^-4 associated with cataract status.
Our results show these approaches enable advanced searches for epistasis and gene-environment interactions beyond GWAS, and that the EHR based approach provides an additional source of data for seeking these advanced explanatory models of the etiology of complex disease/outcome such as cataracts.
PMCID: PMC3615413  PMID: 23424120
9.  Electronic Medical Records for Genetic Research: Results of the eMERGE Consortium 
Science translational medicine  2011;3(79):79re1.
Clinical data in Electronic Medical Records (EMRs) is a potential source of longitudinal clinical data for research. The Electronic Medical Records and Genomics Network or eMERGE investigates whether data captured through routine clinical care using EMRs can identify disease phenotypes with sufficient positive and negative predictive values for use in genome wide association studies (GWAS). Using data from five different sets of EMRs, we have identified five disease phenotypes with positive predictive values of 73–98% and negative predictive values of 98–100%. A majority of EMRs captured key information (diagnoses, medications, laboratory tests) used to define phenotypes in a structured format. We identified natural language processing as an important tool to improve case identification rates. Efforts and incentives to increase the implementation of interoperable EMRs will markedly improve the availability of clinical data for genomics research.
doi:10.1126/scitranslmed.3001807
PMCID: PMC3690272  PMID: 21508311
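The positive and negative predictive values quoted in the abstract above are simple ratios over chart-review counts. A minimal sketch, assuming a validation sample summarized as true/false positives and negatives; the counts below are hypothetical and are not taken from the eMERGE study:

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) for a phenotype algorithm checked against chart review.

    tp/fp: algorithm-flagged cases confirmed / not confirmed by review
    tn/fn: algorithm-flagged controls confirmed / not confirmed by review
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("need at least one predicted case and one predicted control")
    ppv = tp / (tp + fp)  # fraction of predicted cases that are true cases
    npv = tn / (tn + fn)  # fraction of predicted controls that are true controls
    return ppv, npv


# Hypothetical validation counts, for illustration only.
ppv, npv = predictive_values(tp=98, fp=2, tn=99, fn=1)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```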
10.  EMR-linked GWAS study: investigation of variation landscape of loci for body mass index in children 
Frontiers in Genetics  2013;4:268.
Common variations at the loci harboring the fat mass and obesity gene (FTO), MC4R, and TMEM18 are consistently reported as being associated with obesity and body mass index (BMI), especially in adult populations. To confirm this effect in a pediatric population, five European ancestry cohorts from the pediatric eMERGE-II network (CCHMC-BCH) were evaluated.
Method: Data on 5049 samples of European ancestry were obtained from the Electronic Medical Records (EMRs) of two large academic centers in five different genotyped cohorts. For all available samples, gender, age, height, and weight were collected and BMI was calculated. To account for age and sex differences in BMI, BMI z-scores were generated using the 2000 Centers for Disease Control and Prevention (CDC) growth charts. A genome-wide association study (GWAS) was performed with BMI z-score. After removing missing data and outliers based on principal components (PC) analyses, 2860 samples were used for the GWAS. The association between each single nucleotide polymorphism (SNP) and BMI was tested using linear regression adjusting for age, gender, and PC by cohort. The effects of SNPs were modeled assuming additive, recessive, and dominant effects of the minor allele. Meta-analysis was conducted using a weighted z-score approach.
Results: The mean age of subjects was 9.8 years (range 2–19). The proportion of male subjects was 56%. In these cohorts, 14% of samples had a BMI at or above the 95th percentile and 28% at or above the 85th percentile. Meta-analyses produced a signal at the 16q12 genomic region with the best result of p = 1.43 × 10^-7 [p(rec) = 7.34 × 10^-8] for the SNP rs8050136 in the first intron of the FTO gene (z = 5.26) and with no heterogeneity between cohorts (p = 0.77). Under a recessive model, another published SNP at this locus, rs1421085, generates the best result [z = 5.782, p(rec) = 8.21 × 10^-9]. Imputation in this region using dense 1000-Genome and Hapmap CEU samples revealed 71 SNPs with p < 10^-6, all in the first intron of the FTO locus. When heterogeneity was permitted between cohorts, signals were also obtained in other previously identified loci, including MC4R (rs12964056, p = 6.87 × 10^-7, z = -4.98), cholecystokinin CCK (rs8192472, p = 1.33 × 10^-6, z = -4.85), Interleukin 15 (rs2099884, p = 1.27 × 10^-5, z = 4.34), low density lipoprotein receptor-related protein 1B [LRP1B (rs7583748, p = 0.00013, z = -3.81)] and near transmembrane protein 18 (TMEM18) (rs7561317, p = 0.001, z = -3.17). We also detected a novel locus on chromosome 3 at COL6A5 [best SNP = rs1542829, minor allele frequency (MAF) of 5%, p = 4.35 × 10^-9, z = 5.89].
Conclusion: An EMR-linked cohort study demonstrates that BMI z-score measurements can be successfully extracted and linked to genomic data with meaningful confirmatory results. We verified the high prevalence of childhood overweight and obesity in our cohort (28%). In addition, our data indicate that genetic variants in the first intron of FTO, a known adult genetic risk factor for BMI, are also robustly associated with BMI in the pediatric population.
doi:10.3389/fgene.2013.00268
PMCID: PMC3847941  PMID: 24348519
BMI; obesity; polymorphism; GWAS
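The abstract above combines per-cohort results with a weighted z-score approach. A minimal sketch of one common form of that method follows, assuming square-root-of-sample-size weights (the abstract does not state which weights were used) and hypothetical per-cohort z-scores and sample sizes:

```python
import math


def weighted_z(z_scores, sample_sizes):
    """Combine per-cohort z-scores using sqrt(n) weights (an assumed weighting)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical cohorts: (z-score, sample size). Not the study's data.
cohort_z = [2.1, 2.6, 1.9, 2.8, 2.3]
cohort_n = [400, 700, 350, 900, 510]

z_meta = weighted_z(cohort_z, cohort_n)
print(f"combined z = {z_meta:.2f}, p = {two_sided_p(z_meta):.2e}")
```

As a sanity check on the p-value conversion, a z-score of 5.26 corresponds to a two-sided p of roughly 1.4 × 10^-7 under the normal approximation, which is in line with the value reported for rs8050136.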
11.  Evaluating Phenotypic Data Elements for Genetics and Epidemiological Research: Experiences from the eMERGE and PhenX Network Projects 
Abstract*
Combining genome-wide association studies (GWAS) data with clinical information from the electronic medical record (EMR) provides unprecedented opportunities to identify genetic variants that influence susceptibility to common, complex diseases. While mining the vastness of EMR greatly expands the potential for conducting GWAS, non-standardized representation and wide variability of clinical data and phenotypes pose a major challenge to data integration and analysis. To address this requirement, we present experiences and methods developed to map phenotypic data elements from eMERGE (Electronic Medical Record and Genomics) to PhenX (Consensus Measures for Phenotypes and Exposures) and NCI’s Cancer Data Standards Registry and Repository (caDSR). Our results suggest that adopting multiple standards and biomedical terminologies will expose studies to a broader user community and enhance interoperability with a wider range of studies, in turn promoting cross-study pooling of data to detect both more subtle and more complex genotype-phenotype associations.
PMCID: PMC3248754  PMID: 22211178
12.  Electronic medical records and genomics (eMERGE) network exploration in cataract: Several new potential susceptibility loci 
Molecular Vision  2014;20:1281-1295.
Purpose
Cataract is the leading cause of blindness in the world, and in the United States accounts for approximately 60% of Medicare costs related to vision. The purpose of this study was to identify genetic markers for age-related cataract through a genome-wide association study (GWAS).
Methods
In the electronic medical records and genomics (eMERGE) network, we ran an electronic phenotyping algorithm on individuals in each of five sites with electronic medical records linked to DNA biobanks. We performed a GWAS using 530,101 SNPs from the Illumina 660W-Quad in a total of 7,397 individuals (5,503 cases and 1,894 controls). We also performed an age-at-diagnosis case-only analysis.
Results
We identified several statistically significant associations with age-related cataract (45 SNPs) as well as age at diagnosis (44 SNPs). The 45 SNPs associated with cataract at p<1×10^-5 are in several interesting genes, including ALDOB, MAP3K1, and MEF2C. All have potential biologic relationships with cataracts.
Conclusions
This is the first genome-wide association study of age-related cataract, and several regions of interest have been identified. The eMERGE network has pioneered the exploration of genomic associations in biobanks linked to electronic health records, and this study is another example of the utility of such resources. Explorations of age-related cataract including validation and replication of the association results identified herein are needed in future studies.
PMCID: PMC4168835  PMID: 25352737
13.  A Genome- and Phenome-Wide Association Study to Identify Genetic Variants Influencing Platelet Count and Volume and their Pleiotropic Effects 
Human genetics  2013;133(1):10.1007/s00439-013-1355-7.
Platelets are enucleated cell fragments derived from megakaryocytes that play key roles in hemostasis and in the pathogenesis of atherothrombosis and cancer. Platelet traits are highly heritable, and identifying genetic variants associated with platelet traits and assessing their pleiotropic effects may help to understand the role of underlying biological pathways. We conducted an electronic medical record (EMR)-based study to identify common variants that influence inter-individual variation in the number of circulating platelets (PLT) and mean platelet volume (MPV), by performing a genome-wide association study (GWAS). We characterized variants influencing MPV and PLT using functional, pathway, and disease enrichment analyses, and assessed the pleiotropic effects of such variants by performing a phenome-wide association study (PheWAS) with a wide range of EMR-derived phenotypes. A total of 13,582 participants in the electronic MEdical Records and GEnomics (eMERGE) network had data for PLT and 6,291 participants had data for MPV. We identified 5 chromosomal regions associated with PLT and 8 associated with MPV at genome-wide significance (P<5E-8). In addition, we replicated 20 SNPs (out of 56 SNPs (α: 0.05/56=9E-4)) influencing PLT and 22 SNPs (out of 29 SNPs (α: 0.05/29=2E-3)) influencing MPV in a meta-analysis of GWAS of PLT and MPV. While our GWAS did not reveal any novel associations, our functional analyses revealed that genes in these regions influence thrombopoiesis and encode kinases, membrane proteins, proteins involved in cellular trafficking, transcription factors, proteasome complex subunits, proteins of signal transduction pathways, proteins involved in megakaryocyte development and platelet production and hemostasis. PheWAS using a single-SNP Bonferroni correction for 1368 diagnoses (0.05/1368=3.6E-5) revealed that several variants in these genes have pleiotropic associations with myocardial infarction, autoimmune and hematologic disorders.
We conclude that multiple genetic loci influence interindividual variation in platelet traits and also have significant pleiotropic effects; the related genes are in multiple functional pathways including those relevant to thrombopoiesis.
doi:10.1007/s00439-013-1355-7
PMCID: PMC3880605  PMID: 24026423
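The significance thresholds quoted in the abstract above are Bonferroni corrections: the family-wise α of 0.05 divided by the number of tests. A minimal sketch that recomputes them from the test counts given in the abstract; the function name and labels are illustrative only:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Test counts as stated in the abstract; labels are illustrative.
families = [
    ("PLT replication SNPs", 56),
    ("MPV replication SNPs", 29),
    ("PheWAS diagnoses", 1368),
]

for label, n_tests in families:
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{label}: 0.05 / {n_tests} = {threshold:.2e}")
```

The abstract reports these thresholds to one significant figure (9E-4, 2E-3) or truncated (3.6E-5), so the printed values agree only up to rounding.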
14.  Cataract research using electronic health records 
BMC Ophthalmology  2011;11:32.
Background
The eMERGE (electronic MEdical Records and Genomics) network, funded by the National Human Genome Research Institute, is a national consortium formed to develop, disseminate, and apply approaches to research that combine DNA biorepositories with electronic health record (EHR) systems for large-scale, high-throughput genetic research. Marshfield Clinic is one of five sites in the eMERGE network and primarily studied: 1) age-related cataract and 2) HDL-cholesterol levels. The purpose of this paper is to describe the approach to electronic evaluation of the epidemiology of cataract using the EHR for a large biobank and to assess previously identified epidemiologic risk factors in cases identified by electronic algorithms.
Methods
Electronic algorithms were used to select individuals with cataracts in the Personalized Medicine Research Project database. These were analyzed for cataract prevalence, age at cataract, and previously identified risk factors.
Results
Cataract diagnoses and surgeries, though not type of cataract, were successfully identified using electronic algorithms. Age-specific prevalences of both cataract (22% compared to 17.2%) and cataract surgery (11% compared to 5.1%) were higher than those reported by the Eye Diseases Prevalence Research Group. The risk factors of age, gender, diabetes, and steroid use were confirmed.
Conclusions
Using electronic health records can be a viable and efficient tool to identify cataracts for research. However, using retrospective data from this source can be confounded by historical limits on data availability, differences in the utilization of healthcare, and changes in exposures over time.
doi:10.1186/1471-2415-11-32
PMCID: PMC3226674  PMID: 22078460
Cataract; prevalence; risk factors; epidemiology; electronic health record
15.  Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE Network experience 
Background
Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and more effective gene-based disease management. A key aspect in facilitating such studies requires standardized representation of the phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, as well as how their adoption in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis.
Methods
The authors mapped phenotype data dictionaries from five different eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases such as peripheral arterial disease and type 2 diabetes. For mapping, standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine), were used. The mapping process comprised both lexical (via searching for relevant pre-coordinated concepts and data elements) and semantic (via post-coordination) techniques. Where feasible, new data elements were curated to enhance the coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies.
Results
Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts before any additional curation of terminology and metadata resources was initiated by eMERGE investigators. After curation of 54 new caDSR CDEs and nine new NCI thesaurus concepts and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT. A web-based tool was also implemented to assist in semi-automatic mapping of data elements.
Conclusion
This study emphasizes the requirement for standardized representation of clinical research data using existing metadata and terminology resources and provides simple techniques and software for data element mapping using experiences from the eMERGE Network.
doi:10.1136/amiajnl-2010-000061
PMCID: PMC3128396  PMID: 21597104
informatics; ontologies; knowledge representations; controlled terminologies and vocabularies; machine learning; terminologies; metadata; mapping; harmonization; eMERGE Network
16.  Imputation and quality control steps for combining multiple genome-wide datasets 
Frontiers in Genetics  2014;5:370.
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
doi:10.3389/fgene.2014.00370
PMCID: PMC4263197  PMID: 25566314
imputation; genome-wide association; eMERGE; electronic health records
17.  High Density GWAS for LDL Cholesterol in African Americans Using Electronic Medical Records Reveals a Strong Protective Variant in APOE 
Only one LDL-C GWAS has been reported in African Americans. We performed a GWAS of LDL-C in African Americans using data extracted from electronic medical records (EMR) in the eMERGE network. African Americans were genotyped on the Illumina 1M chip. All LDL-C measurements, prescriptions, and diagnoses of concomitant disease were extracted from EMR. We created two analytic datasets: one dataset having median LDL-C calculated after the exclusion of some lab values based on co-morbidities and medication (n = 618) and another dataset having median LDL-C calculated without any exclusions (n = 1249). Rs7412 in APOE was strongly associated with LDL-C at levels of GWAS significance in both datasets (p < 5 × 10⁻⁸). In the dataset with exclusions, a decrease of 20.0 mg/dl per minor allele was observed. The effect size was attenuated (12.3 mg/dl) in the dataset without any lab values excluded. Although other signals in APOE have been detected in previous GWAS, this large and important SNP association has not been well detected in large GWAS because rs7412 was not included on many genotyping arrays. Use of median LDL-C extracted from EMR after exclusions for medications and co-morbidities increased the percentage of trait variance explained by genetic variation.
doi:10.1111/j.1752-8062.2012.00446.x
PMCID: PMC3521536  PMID: 23067351
GWAS; LDL; electronic medical records
18.  Return of Individual Research Results from Genome-wide Association Studies: Experience of the Electronic Medical Records & Genomics (eMERGE) Network 
Purpose
Return of individual genetic results to research participants, including participants in archives and biorepositories, is receiving increased attention. However, few groups have deliberated on specific results or weighed deliberations against relevant local contextual factors.
Methods
The Electronic Medical Records and GEnomics (eMERGE) network, which includes five biorepositories conducting genome-wide association studies, convened a Return of Results Oversight Committee (RROC) to identify potentially returnable results. Network-wide deliberations were then brought to local constituencies for final decision-making.
Results
Defining results that should be considered for return required input from clinicians with relevant expertise and much deliberation. The RROC identified two sex chromosomal anomalies, Klinefelter Syndrome and Turner Syndrome, as well as homozygosity for Factor V Leiden, as findings that could warrant reporting. Views about returning HFE gene mutations associated with hemochromatosis were mixed due to low penetrance. Review of EMRs suggested that most participants with detected abnormalities were unaware of these findings. Local considerations relevant to return varied and, to date, four sites have elected not to return findings (return was not possible at one site).
Conclusion
The eMERGE experience reveals the complexity of return of results decision-making and provides a potential deliberative model for adoption in other collaborative contexts.
doi:10.1038/gim.2012.15
PMCID: PMC3723451  PMID: 22361898
Result return; biorepository; electronic medical records; deliberation; context
19.  Return of results in the genomic medicine projects of the eMERGE network 
The electronic Medical Records and Genomics (eMERGE) (Phase I) network was established in 2007 to further genomic discovery using biorepositories linked to the electronic health record (EHR). In Phase II, which began in 2011, genomic discovery efforts continue and in addition the network is investigating best practices for implementing genomic medicine, in particular, the return of genomic results in the EHR for use by physicians at point-of-care. To develop strategies for addressing the challenges of implementing genomic medicine in the clinical setting, the eMERGE network is conducting studies that return clinically-relevant genomic results to research participants and their health care providers. These genomic medicine pilot studies include returning individual genetic variants associated with disease susceptibility or drug response, as well as genetic risk scores for common “complex” disorders. Additionally, as part of a network-wide pharmacogenomics-related project, targeted resequencing of 84 pharmacogenes is being performed and select genotypes of pharmacogenetic relevance are being placed in the EHR to guide individualized drug therapy. Individual sites within the eMERGE network are exploring mechanisms to address incidental findings generated by resequencing of the 84 pharmacogenes. In this paper, we describe studies being conducted within the eMERGE network to develop best practices for integrating genomic findings into the EHR, and the challenges associated with such work.
doi:10.3389/fgene.2014.00050
PMCID: PMC3972474  PMID: 24723935
genomics; electronic health records; incidental findings; implementation; genetic counseling; next generation sequencing; pharmacogenetics
20.  The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future 
Genetics in Medicine  2013;15(10):761-771.
The Electronic Medical Records and Genomics Network is a National Human Genome Research Institute–funded consortium engaged in the development of methods and best practices for using the electronic medical record as a tool for genomic research. Now in its sixth year and second funding cycle, and comprising nine research groups and a coordinating center, the network has played a major role in validating the concept that clinical data derived from electronic medical records can be used successfully for genomic research. Current work is advancing knowledge in multiple disciplines at the intersection of genomics and health-care informatics, particularly for electronic phenotyping, genome-wide association studies, genomic medicine implementation, and the ethical and regulatory issues associated with genomics research and returning results to study participants. Here, we describe the evolution, accomplishments, opportunities, and challenges of the network from its inception as a five-group consortium focused on genotype–phenotype associations for genomic discovery to its current form as a nine-group consortium pivoting toward the implementation of genomic medicine.
doi:10.1038/gim.2013.72
PMCID: PMC3795928  PMID: 23743551
collaborative research; electronic medical records; genetics and genomics; genome-wide association studies; personalized medicine
21.  The Autism Simplex Collection: an international, expertly phenotyped autism sample for genetic and phenotypic analyses 
Molecular Autism  2014;5:34.
Background
There is an urgent need for expanding and enhancing autism spectrum disorder (ASD) samples, in order to better understand causes of ASD.
Methods
In a unique public-private partnership, 13 sites with extensive experience in both the assessment and diagnosis of ASD embarked on an ambitious, 2-year program to collect samples for genetic and phenotypic research and begin analyses on these samples. The program was called The Autism Simplex Collection (TASC). TASC sample collection began in 2008 and was completed in 2010, and included nine sites from North America and four sites from Western Europe, as well as a centralized Data Coordinating Center.
Results
Over 1,700 trios are part of this collection, with DNA from transformed cells now available through the National Institute of Mental Health (NIMH). Autism Diagnostic Interview-Revised (ADI-R) and Autism Diagnostic Observation Schedule-Generic (ADOS-G) measures are available for all probands, as are standardized IQ measures, Vineland Adaptive Behavioral Scales (VABS), the Social Responsiveness Scale (SRS), Peabody Picture Vocabulary Test (PPVT), and physical measures (height, weight, and head circumference). At almost every site, additional phenotypic measures were collected, including the Broad Autism Phenotype Questionnaire (BAPQ) and Repetitive Behavior Scale-Revised (RBS-R), as well as the non-word repetition scale, Communication Checklist (Children’s or Adult), and Aberrant Behavior Checklist (ABC). Moreover, for nearly 1,000 trios, the Autism Genome Project Consortium (AGP) has carried out Illumina 1 M SNP genotyping and called copy number variation (CNV) in the samples, with data being made available through the National Institutes of Health (NIH). Whole exome sequencing (WES) has been carried out in over 500 probands, together with ancestry matched controls, and this data is also available through the NIH. Additional WES is being carried out by the Autism Sequencing Consortium (ASC), where the focus is on sequencing complete trios. ASC sequencing for the first 1,000 samples (all from whole-blood DNA) is complete and data will be released in 2014. Data is being made available through NIH databases (database of Genotypes and Phenotypes (dbGaP) and National Database for Autism Research (NDAR)) with DNA released in Dist 11.0. Primary funding for the collection, genotyping, sequencing and distribution of TASC samples was provided by Autism Speaks and the NIH, including the National Institute of Mental Health (NIMH) and the National Human Genome Research Institute (NHGRI).
Conclusions
TASC represents an important sample set that leverages expert sites. Similar approaches, leveraging expert sites and ongoing studies, represent an important path towards further enhancing available ASD samples.
doi:10.1186/2040-2392-5-34
PMCID: PMC4228819  PMID: 25392729
22.  Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study 
Objective
Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype–phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems.
Materials and Methods
An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions.
Results
The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D.
Discussion
By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS.
Conclusions
An algorithm using commonly available data from five different EMR systems can accurately identify T2D cases and controls for genetic study across multiple institutions.
doi:10.1136/amiajnl-2011-000439
PMCID: PMC3277617  PMID: 22101970
Analytics; application of biological knowledge to clinical care; bioinformatics; biomedical informatics; clinical phenotyping; controlled terminologies and vocabularies; data mining; EHR; EMR secondary and meaningful use; genetic epidemiology; genetics; genome-wide association studies; genomics; HIT data standards; improving the education and skills training of health professionals; infection control; information retrieval; knowledge representations; linking the genotype and phenotype; medical informatics; modeling; natural-language processing; ontologies; pharmacogenomics; phenotyping; reuseability; translational research
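The positive predictive values reported in entry 22 (98% for cases, 100% for controls) are the fraction of algorithm-flagged records that clinician review confirmed. The abstract does not give the underlying review counts, so the numbers below are hypothetical and chosen only to illustrate the calculation; the function name is likewise an assumption, not part of the published algorithm.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of algorithm-flagged records confirmed by clinician review."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no flagged records to evaluate")
    return true_positives / flagged


# Hypothetical chart-review counts (NOT taken from the study):
# 50 records flagged as T2D cases, 49 of them confirmed on review.
ppv_cases = positive_predictive_value(true_positives=49, false_positives=1)
print(f"PPV for cases = {ppv_cases:.0%}")
```

PPV depends on how common the phenotype is in the reviewed sample, so a value measured at one institution does not automatically carry over to another.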
23.  Genetic Variants That Confer Resistance to Malaria Are Associated with Red Blood Cell Traits in African-Americans: An Electronic Medical Record-based Genome-Wide Association Study 
G3: Genes|Genomes|Genetics  2013;3(7):1061-1068.
To identify novel genetic loci influencing interindividual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n = 1904) and replication (n = 411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million single-nucleotide polymorphisms on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria—HBB (11p15.4), HBA1/HBA2 (16p13.3), and G6PD (Xq28)—were associated (P ≤ 1 × 10⁻⁶) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P < 5 × 10⁻⁸) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We were not able to replicate loci identified by a previous GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture of RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
doi:10.1534/g3.113.006452
PMCID: PMC3704235  PMID: 23696099
red blood cell (RBC) traits; genome-wide association study; African-Americans; natural selection; informatics; electronic medical record
24.  Replication of genetic loci for ages at menarche and menopause in the multi-ethnic Population Architecture using Genomics and Epidemiology (PAGE) study 
Human Reproduction (Oxford, England)  2013;28(6):1695-1706.
STUDY QUESTION
Do genetic associations identified in genome-wide association studies (GWAS) of age at menarche (AM) and age at natural menopause (ANM) replicate in women of diverse race/ancestry from the Population Architecture using Genomics and Epidemiology (PAGE) Study?
SUMMARY ANSWER
We replicated GWAS reproductive trait single nucleotide polymorphisms (SNPs) in our European descent population and found that many SNPs were also associated with AM and ANM in populations of diverse ancestry.
WHAT IS KNOWN ALREADY
Menarche and menopause mark the reproductive lifespan in women and are important risk factors for chronic diseases including obesity, cardiovascular disease and cancer. Both events are believed to be influenced by environmental and genetic factors, and vary in populations differing by genetic ancestry and geography. Most genetic variants associated with these traits have been identified in GWAS of European-descent populations.
STUDY DESIGN, SIZE, DURATION
A total of 42 251 women of diverse ancestry from PAGE were included in cross-sectional analyses of AM and ANM.
MATERIALS, SETTING, METHODS
SNPs previously associated with ANM (n = 5 SNPs) and AM (n = 3 SNPs) in GWAS were genotyped in American Indians, African Americans, Asians, European Americans, Hispanics and Native Hawaiians. To test SNP associations with ANM or AM, we used linear regression models stratified by race/ethnicity and PAGE sub-study. Results were then combined in race-specific fixed effect meta-analyses for each outcome. For replication and generalization analyses, significance was defined at P < 0.01 for ANM analyses and P < 0.017 for AM analyses.
MAIN RESULTS AND THE ROLE OF CHANCE
We replicated findings for AM SNPs in the LIN28B locus and an intergenic region on 9q31 in European Americans. The LIN28B SNPs (rs314277 and rs314280) were also significantly associated with AM in Asians, but not in other race/ethnicity groups. Linkage disequilibrium (LD) patterns at this locus varied widely among the ancestral groups. With the exception of an intergenic SNP at 13q34, all ANM SNPs replicated in European Americans. Three were significantly associated with ANM in other race/ethnicity populations: rs2153157 (6p24.2/SYCP2L), rs365132 (5q35/UIMC1) and rs16991615 (20p12.3/MCM8). While rs1172822 (19q13/BRSK1) was not significant in the populations of non-European descent, effect sizes showed similar trends.
LIMITATIONS, REASONS FOR CAUTION
Lack of association for the GWAS SNPs in the non-European American groups may be due to differences in locus LD patterns between these groups and the European-descent populations included in the GWAS discovery studies; and in some cases, lower power may also contribute to non-significant findings.
WIDER IMPLICATIONS OF THE FINDINGS
The discovery of genetic variants associated with the reproductive traits provides an important opportunity to elucidate the biological mechanisms involved with normal variation and disorders of menarche and menopause. In this study we replicated most, but not all reported SNPs in European descent populations and examined the epidemiologic architecture of these early reported variants, describing their generalizability and effect size across differing ancestral populations. Such data will be increasingly important for prioritizing GWAS SNPs for follow-up in fine-mapping and resequencing studies, as well as in translational research.
STUDY FUNDING/COMPETING INTEREST(S)
The Population Architecture Using Genomics and Epidemiology (PAGE) program is funded by the National Human Genome Research Institute (NHGRI), supported by U01HG004803 (CALiCo), U01HG004798 (EAGLE), U01HG004802 (MEC), U01HG004790 (WHI) and U01HG004801 (Coordinating Center), and their respective NHGRI ARRA supplements. The authors report no conflicts of interest.
doi:10.1093/humrep/det071
PMCID: PMC3657124  PMID: 23508249
menopause; menarche; genome-wide association study; race/ethnicity; single nucleotide polymorphism
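The significance thresholds quoted in entry 24 (P < 0.01 for the 5 ANM SNPs, P < 0.017 for the 3 AM SNPs) are consistent with a Bonferroni correction at a family-wise error rate of 0.05. The abstract does not name the correction method or the overall alpha, so both are assumptions in this minimal sketch, and the function name is illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


FAMILY_ALPHA = 0.05  # assumed family-wise error rate; not stated in the abstract

anm_threshold = bonferroni_threshold(FAMILY_ALPHA, 5)  # 5 ANM SNPs tested
am_threshold = bonferroni_threshold(FAMILY_ALPHA, 3)   # 3 AM SNPs tested

print(f"ANM: P < {anm_threshold:.3f}")
print(f"AM:  P < {am_threshold:.3f}")
```

Dividing by the number of SNPs tested per trait keeps the chance of any false positive within each trait at or below the assumed 0.05.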
25.  Stakeholder engagement: a key component of integrating genomic information into electronic health records 
Integrating genomic information into clinical care and the electronic health record can facilitate personalized medicine through genetically guided clinical decision support. Stakeholder involvement is critical to the success of these implementation efforts. Prior work on implementation of clinical information systems provides broad guidance to inform effective engagement strategies. To this guidance we add evidence-based recommendations that are specific to issues at the intersection of genomics and the electronic health record. We describe stakeholder engagement strategies employed by the Electronic Medical Records and Genomics Network, a national consortium of US research institutions funded by the National Human Genome Research Institute to develop, disseminate, and apply approaches that combine genomic and electronic health record data. Through select examples drawn from sites of the Electronic Medical Records and Genomics Network, we illustrate a continuum of engagement strategies to inform genomic integration into commercial and homegrown electronic health records across a range of health-care settings. We frame engagement as activities to consult, involve, and partner with key stakeholder groups throughout specific phases of health information technology implementation. Our aim is to provide insights into engagement strategies to guide genomic integration based on our unique network experiences and lessons learned within the broader context of implementation research in biomedical informatics. On the basis of our collective experience, we describe key stakeholder practices, challenges, and considerations for successful genomic integration to support personalized medicine.
doi:10.1038/gim.2013.127
PMCID: PMC3909653  PMID: 24030437
electronic health records; genomics; health information technology; personalized medicine; stakeholder engagement; translational medical research
