Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10−8 (eMERGE) and p=2.5×10−20 (CHARGE) and rs6795970 in SCN10A with p=6×10−6 (eMERGE) and p=5×10−27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10−6 (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
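The scale of the multiple-testing problem in the PheWAS described above can be illustrated with a short sketch. The abstract reports a false discovery rate below 0.1 but does not say which procedure was used; the Benjamini-Hochberg step-up procedure shown here is one common choice and is an assumption, as are the function name and the toy p-values.

```python
# Number of SNP-phenotype tests implied by the abstract: 3,144 SNPs x 1,358 phenotypes.
n_tests = 3144 * 1358

# Replication rate reported in the abstract: 51 of 77 sufficiently powered associations.
replication_rate = 51 / 77


def benjamini_hochberg(p_values, fdr=0.1):
    """Return the largest p-value rejected by the Benjamini-Hochberg
    step-up procedure at the given FDR, or None if nothing is rejected.

    Assumption: BH is used purely for illustration; the source does not
    name its FDR method.
    """
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest p with p <= (rank / m) * fdr.
        if p <= fdr * rank / m:
            threshold = p
    return threshold


# Hypothetical toy p-values, not taken from the study.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.07, 0.09, 0.205, 0.212, 0.216]
cutoff = benjamini_hochberg(toy_p, fdr=0.1)
rejected = [p for p in toy_p if cutoff is not None and p <= cutoff]

print(n_tests, round(replication_rate, 2), cutoff, len(rejected))
```

In the toy data the third and fourth p-values miss their own rank thresholds but are still rejected, because the fifth passes; this is the step-up behaviour that distinguishes the procedure from a fixed per-test cutoff.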
Type 2 diabetes (T2D) is a complex metabolic disease that disproportionately affects African Americans. Genome-wide association studies (GWAS) have identified several loci that contribute to T2D in European Americans, but few studies have been performed in admixed populations. We first performed a GWAS of 1,563 African Americans from the Vanderbilt Genome-Electronic Records Project and Northwestern University NUgene Project as part of the electronic Medical Records and Genomics (eMERGE) network. In this African American dataset, we successfully replicated an association in TCF7L2 that was previously identified by GWAS. We were unable to identify novel associations at p<5.0×10−8 by GWAS. Using admixture mapping as an alternative method for discovery, we performed a genome-wide admixture scan that suggests multiple candidate genes associated with T2D. One finding, TCIRG1, is a T-cell immune regulator expressed in the pancreas and liver that has not been previously implicated in T2D. We performed subsequent fine-mapping to further assess the association between TCIRG1 and T2D in >5,000 African Americans. We identified 13 independent associations between TCIRG1, CHKA, and ALDH3B1 genes on chromosome 11 and T2D. Our results suggest that a novel region on chromosome 11 identified by admixture mapping is associated with T2D in African Americans.
Little is known about pre-mRNA splicing in Dictyostelium discoideum although its genome has been completely sequenced. Our analysis suggests that pre-mRNA splicing plays an important role in D. discoideum gene expression as two thirds of its genes contain at least one intron. Ongoing curation of the genome to date has revealed 40 genes in D. discoideum with clear evidence of alternative splicing, supporting the existence of alternative splicing in this unicellular organism. We identified 160 candidate U2-type spliceosomal proteins and related factors in D. discoideum based on 264 known human genes involved in splicing. Spliceosomal small ribonucleoproteins (snRNPs), PRP19 complex proteins and late-acting proteins are highly conserved in D. discoideum and throughout the metazoa. In non-snRNP and hnRNP families, D. discoideum orthologs are closer to those in A. thaliana, D. melanogaster and H. sapiens than to their counterparts in S. cerevisiae. Several splicing regulators, including SR proteins and CUG-binding proteins, were found in D. discoideum, but not in yeast. Our comprehensive catalog of spliceosomal proteins provides useful information for future studies of splicing in D. discoideum where the efficient genetic and biochemical manipulation will also further our general understanding of pre-mRNA splicing.
pre-mRNA splicing; spliceosomal genes; Dictyostelium discoideum; comparative genomics; splicing regulators
The Electronic Medical Records and Genomics (eMERGE) Network is a National Human Genome Research Institute (NHGRI)-funded consortium engaged in the development of methods and best-practices for utilizing the Electronic Medical Record (EMR) as a tool for genomic research. Now in its sixth year, its second funding cycle and comprising nine research groups and a coordinating center, the network has played a major role in validating the concept that clinical data derived from EMRs can be used successfully for genomic research. Current work is advancing knowledge in multiple disciplines at the intersection of genomics and healthcare informatics, particularly electronic phenotyping, genome-wide association studies, genomic medicine implementation and the ethical and regulatory issues associated with genomics research and returning results to study participants. Here we describe the evolution, accomplishments, opportunities and challenges of the network since its inception as a five-group consortium focused on genotype-phenotype associations for genomic discovery to its current form as a nine-group consortium pivoting towards implementation of genomic medicine.
electronic medical records; personalized medicine; genome-wide association studies; genetics and genomics; collaborative research
Only one LDL-C GWAS has been reported in African Americans. We performed a GWAS of LDL-C in African Americans using data extracted from electronic medical records (EMR) in the eMERGE network. African Americans were genotyped on the Illumina 1M chip. All LDL-C measurements, prescriptions, and diagnoses of concomitant disease were extracted from EMR. We created two analytic datasets: one with median LDL-C calculated after the exclusion of some lab values based on co-morbidities and medication (n = 618), and another with median LDL-C calculated without any exclusions (n = 1249). Rs7412 in APOE was strongly associated with LDL-C at levels of GWAS significance in both datasets (p < 5 × 10−8). In the dataset with exclusions, a decrease of 20.0 mg/dl per minor allele was observed. The effect size was attenuated (12.3 mg/dl) in the dataset without any lab values excluded. Although other signals in APOE have been detected in previous GWAS, this large and important SNP association has not been well detected in large GWAS because rs7412 was not included on many genotyping arrays. Use of median LDL-C extracted from EMR after exclusions for medications and co-morbidities increased the percentage of trait variance explained by genetic variation.
GWAS; LDL; electronic medical records
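The two analytic datasets in the LDL-C abstract above differ only in whether confounded lab values are dropped before the per-person median is taken. A minimal sketch of that step follows; the record layout, the `excluded` flag, and all values are hypothetical, and the actual exclusion criteria for co-morbidities and medication are not given in the abstract.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (patient_id, ldl_mg_dl, excluded).
# `excluded` stands in for "measurement flagged because of a co-morbidity
# or medication"; the real criteria are not stated in the source.
records = [
    ("p1", 130.0, False),
    ("p1", 95.0, True),
    ("p1", 140.0, False),
    ("p2", 110.0, False),
    ("p2", 100.0, False),
    ("p2", 120.0, False),
    ("p3", 88.0, True),
]


def median_ldl(rows, apply_exclusions):
    """Return {patient_id: median LDL-C}, optionally dropping flagged values."""
    by_patient = defaultdict(list)
    for patient_id, value, excluded in rows:
        if apply_exclusions and excluded:
            continue
        by_patient[patient_id].append(value)
    # Patients whose every measurement was excluded drop out entirely.
    return {pid: median(values) for pid, values in by_patient.items()}


with_exclusions = median_ldl(records, apply_exclusions=True)
without_exclusions = median_ldl(records, apply_exclusions=False)
print(with_exclusions)
print(without_exclusions)
```

In the toy data, patient p3 disappears once exclusions are applied, which mirrors why the dataset with exclusions is smaller than the one without.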
Nonmuscle myosin II plays a crucial role in a variety of cellular processes (e.g., polarity formation, cell motility, and cytokinesis). It is composed of two heavy chains, two regulatory light chains and two essential light chains. The ATPase activity of the myosin II motor domain is regulated through phosphorylation of the regulatory light chain (RLC) by myosin light chain kinase. To study myosin function and localization in cellular processes, GFP-fused RLCs are widely used; however, the exact kinetic properties of myosins with bound GFP-RLC are poorly described. More importantly, it has not been shown that a regulatory light chain fused at its N-terminus with GFP can maintain the normal phosphorylation-dependent regulation of nonmuscle myosin or serve as a substrate for myosin light chain kinase. We coexpressed N-terminal GFP-RLC with a heavy meromyosin (HMM)-like fragment of nonmuscle myosin IIA and essential light chain to characterize the phosphorylation dynamics and in vitro kinetic properties of the resulting HMM. Myosin light chain kinase phosphorylates the GFP-RLC bound to HMM IIA with the same Vmax as it does the wild type RLC bound to HMM IIA, but the Km is about twofold higher for the GFP fusion protein, meaning that it is a somewhat poorer substrate. The steady-state actin-activated MgATPase activity of the GFP-RLC HMM is very low in the absence of phosphorylation, demonstrating that the GFP moiety does not prevent formation of the off state. The actin-activated MgATPase activity of phosphorylated GFP-RLC-HMM is about half that of wild type phosphorylated HMM. The ability of phosphorylated GFP-RLC-HMM to move actin filaments in the actin gliding assay is also slightly compromised. These data indicate that despite some kinetic differences the N-terminal GFP fusion to the regulatory light chain is a reasonable model system for studying myosin function in vivo.
GFP; Nonmuscle myosin; Regulatory light chain; Enzymatic activity; In vitro motility
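The statement above that an unchanged Vmax with a roughly twofold higher Km makes GFP-RLC "a somewhat poorer substrate" can be made concrete under the assumption of simple Michaelis-Menten kinetics. The sketch below uses arbitrary units; the numeric values are illustrative and are not measurements from the study.

```python
def mm_rate(substrate, vmax, km):
    """Michaelis-Menten rate: v = Vmax * [S] / (Km + [S])."""
    return vmax * substrate / (km + substrate)


# Arbitrary units (assumption): wild-type Km = 1, GFP-RLC Km = 2, same Vmax.
VMAX = 1.0
KM_WILD_TYPE = 1.0
KM_GFP_RLC = 2.0 * KM_WILD_TYPE

for s in (0.1, 1.0, 10.0, 100.0):
    v_wt = mm_rate(s, VMAX, KM_WILD_TYPE)
    v_gfp = mm_rate(s, VMAX, KM_GFP_RLC)
    # The ratio approaches 0.5 at low [S] and 1.0 at saturating [S].
    print(f"[S]={s:6.1f}  wild type={v_wt:.3f}  GFP-RLC={v_gfp:.3f}  ratio={v_gfp / v_wt:.3f}")
```

Under this model the penalty from the higher Km is largest when substrate is scarce and vanishes at saturation, which is consistent with describing the fusion as only somewhat poorer rather than defective.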
dictyBase (http://dictybase.org), the model organism database for Dictyostelium discoideum, includes the complete genome sequence and expression data for this organism. Relevant literature is integrated into the database, and gene models and functional annotation are manually curated from experimental results and comparative multigenome analyses. dictyBase has recently expanded to include the genome sequences of three additional Dictyostelids, and has added new software tools to facilitate multigenome comparisons. The Dicty Stock Center, a strain and plasmid repository for Dictyostelium research, relocated to Northwestern University in 2009. This allowed us to integrate all Dictyostelium resources to better serve the research community. In this chapter, we describe how to navigate the website and highlight some of our newer improvements.
Dictyostelium discoideum; database; genomic sequence; multigenome; genome browser; Blast; gene page; functional annotation; strains; phenotypes
Clinical data in Electronic Medical Records (EMRs) is a potential source of longitudinal clinical data for research. The Electronic Medical Records and Genomics Network or eMERGE investigates whether data captured through routine clinical care using EMRs can identify disease phenotypes with sufficient positive and negative predictive values for use in genome wide association studies (GWAS). Using data from five different sets of EMRs, we have identified five disease phenotypes with positive predictive values of 73–98% and negative predictive values of 98–100%. A majority of EMRs captured key information (diagnoses, medications, laboratory tests) used to define phenotypes in a structured format. We identified natural language processing as an important tool to improve case identification rates. Efforts and incentives to increase the implementation of interoperable EMRs will markedly improve the availability of clinical data for genomics research.
Little is known about cell–substrate adhesion and how motile and adhesive forces work together in moving cells. The ability to rapidly screen a large number of insertional mutants prompted us to perform a genetic screen in Dictyostelium to isolate adhesion-deficient mutants. The resulting substrate adhesion–deficient (sad) mutants grew in plastic dishes without attaching to the substrate. The cells were often larger than their wild-type parents and displayed a rough surface with many apparent blebs. One of these mutants, sadA−, completely lacked substrate adhesion in growth medium. The sadA− mutant also showed slightly impaired cytokinesis, an aberrant F-actin organization, and a phagocytosis defect. Deletion of the sadA gene by homologous recombination recreated the original mutant phenotype. Expression of sadA–GFP in sadA-null cells restored the wild-type phenotype. In sadA–GFP-rescued mutant cells, sadA–GFP localized to the cell surface, appropriate for an adhesion molecule. SadA contains nine putative transmembrane domains and three conserved EGF-like repeats in a predicted extracellular domain. The EGF repeats are similar to corresponding regions in proteins known to be involved in adhesion, such as tenascins and integrins. Our data combined suggest that sadA is the first substrate adhesion receptor to be identified in Dictyostelium.
Dictyostelium; cell–substrate adhesion; EGF-like repeats; phagocytosis; cytokinesis
Approaches with high spatial and temporal resolution are required to understand the regulation of nonmuscle myosin II in vivo. Using fluorescence resonance energy transfer we have produced a novel biosensor allowing simultaneous determination of myosin light chain kinase (MLCK) localization and its [Ca2+]4/calmodulin-binding state in living cells. We observe transient recruitment of diffuse MLCK to stress fibers and its in situ activation before contraction. MLCK is highly active in the lamella of migrating cells, but not at the retracting tail. This unexpected result highlights a potential role for MLCK-mediated myosin contractility in the lamella as a driving force for migration. During cytokinesis, MLCK was enriched at the spindle equator during late metaphase, and was maximally activated just before cleavage furrow constriction. As furrow contraction was completed, active MLCK was redistributed to the poles of the daughter cells. These results show that MLCK is a myosin regulator in the lamella and contractile ring, and pinpoint sites where myosin function may be mediated by other kinases.
myosin light chain kinase; myosin light chains; phosphorylation; cell division; FRET
Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype–phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems.
Materials and Methods
An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions.
The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D.
By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS.
An algorithm using commonly available data from five different EMRs can accurately identify T2D cases and controls for genetic study across multiple institutions.
Analytics; application of biological knowledge to clinical care; bioinformatics; biomedical informatics; clinical phenotyping; controlled terminologies and vocabularies; data mining; EHR; EMR secondary and meaningful use; genetic epidemiology; genetics; genome-wide association studies; genomics; HIT data standards; improving the education and skills training of health professionals; infection control; information retrieval; knowledge representations; linking the genotype and phenotype; medical informatics; modeling; natural-language processing; ontologies; pharmacogenomics; phenotyping; reuseability; translational research
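The positive predictive values reported above for the T2D case and control definitions come from comparing the algorithm's labels against clinician review. As a minimal sketch, the arithmetic looks like this; the counts are hypothetical, chosen only so that the results match the reported 98% and 100%, and the actual validation sample sizes are not given in the abstract.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP): fraction of algorithm-flagged subjects confirmed on review."""
    return true_positives / (true_positives + false_positives)


# Hypothetical review counts (assumption, not from the source).
# Cases: 50 algorithm-identified cases reviewed, 49 confirmed as T2D.
case_ppv = positive_predictive_value(true_positives=49, false_positives=1)

# Controls: "positive" here means "identified as a control";
# 50 reviewed, all 50 confirmed as non-diabetic.
control_ppv = positive_predictive_value(true_positives=50, false_positives=0)

print(f"case PPV = {case_ppv:.0%}, control PPV = {control_ppv:.0%}")
```

A high PPV for both labels matters for GWAS because misclassified cases or controls dilute a true association signal.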
dictyBase (http://dictybase.org) is the model organism database for the social amoeba Dictyostelium discoideum. This contribution provides an update on dictyBase that has been previously presented. During the past 3 years, dictyBase has taken significant strides toward becoming a genome portal for the whole Amoebozoa clade. In its latest release, dictyBase has scaled up to host multiple Dictyostelids, including Dictyostelium purpureum [Sucgang, Kuo, Tian, Salerno, Parikh, Feasley, Dalin, Tu, Huang, Barry et al.(2011) (Comparative genomics of the social amoebae Dictyostelium discoideum and Dictyostelium purpureum. Genome Biol., 12, R20)], Dictyostelium fasciculatum and Polysphondylium pallidum [Heidel, Lawal, Felder, Schilde, Helps, Tunggal, Rivero, John, Schleicher, Eichinger et al. (2011) (Phylogeny-wide analysis of social amoeba genomes highlights ancient origins for complex intercellular communication. Genome Res., 21, 1882–1891)]. The new release includes a new Genome Browser with RNAseq expression, interspecies Basic Local Alignment Search Tool alignments and a unified Basic Local Alignment Search Tool search for cross-species comparisons.
Previous work from our laboratory showed that the Dictyostelium discoideum SadA protein plays a central role in cell-substrate adhesion. SadA null cells exhibit a loss of adhesion, a disrupted actin cytoskeleton, and a cytokinesis defect. How SadA mediates these phenotypes is unknown. This work addresses the mechanism of SadA function, demonstrating an important role for the C-terminal cytoplasmic tail in SadA function. We found that a SadA tailless mutant was unable to rescue the sadA adhesion deficiency, and overexpression of the SadA tail domain reduced adhesion in wild-type cells. We also show that SadA is closely associated with the actin cytoskeleton. Mutagenesis studies suggested that four serine residues in the tail, S924/S925 and S940/S941, may regulate association of SadA with the actin cytoskeleton. Glutathione S-transferase pull-down assays identified at least one likely interaction partner of the SadA tail, cortexillin I, a known actin bundling protein. Thus, our data demonstrate an important role for the carboxy-terminal cytoplasmic tail in SadA function and strongly suggest that a phosphorylation event in this tail regulates an interaction with cortexillin I. Based on our data, we propose a model for the function of SadA.
The social amoebae (Dictyostelia) are a diverse group of Amoebozoa that achieve multicellularity by aggregation and undergo morphogenesis into fruiting bodies with terminally differentiated spores and stalk cells. There are four groups of dictyostelids, with the most derived being a group that contains the model species Dictyostelium discoideum.
We have produced a draft genome sequence of another group dictyostelid, Dictyostelium purpureum, and compare it to the D. discoideum genome. The assembly (8.41 × coverage) comprises 799 scaffolds totaling 33.0 Mb, comparable to the D. discoideum genome size. Sequence comparisons suggest that these two dictyostelids shared a common ancestor approximately 400 million years ago. In spite of this divergence, most orthologs reside in small clusters of conserved synteny. Comparative analyses revealed a core set of orthologous genes that illuminate dictyostelid physiology, as well as differences in gene family content. Interesting patterns of gene conservation and divergence are also evident, suggesting function differences; some protein families, such as the histidine kinases, have undergone little functional change, whereas others, such as the polyketide synthases, have undergone extensive diversification. The abundant amino acid homopolymers encoded in both genomes are generally not found in homologous positions within proteins, so they are unlikely to derive from ancestral DNA triplet repeats. Genes involved in the social stage evolved more rapidly than others, consistent with either relaxed selection or accelerated evolution due to social conflict.
The findings from this new genome sequence and comparative analysis shed light on the biology and evolution of the Dictyostelia.
The eMERGE (electronic MEdical Records and GEnomics) Network is an NHGRI-supported consortium of five institutions to explore the utility of DNA repositories coupled to Electronic Medical Record (EMR) systems for advancing discovery in genome science. eMERGE also includes a special emphasis on the ethical, legal and social issues related to these endeavors.
The five sites are supported by an Administrative Coordinating Center. Setting of network goals is initiated by working groups: (1) Genomics; (2) Informatics; (3) Consent & Community Consultation, which also includes active participation by investigators outside the eMERGE-funded sites; and (4) the Return of Results Oversight Committee. The Steering Committee, composed of site PIs and representatives and NHGRI staff, meets three times per year, once per year with the External Scientific Panel.
The primary site-specific phenotypes for which samples have undergone genome-wide association study (GWAS) genotyping are cataract and HDL, dementia, electrocardiographic QRS duration, peripheral arterial disease, and type 2 diabetes. A GWAS is also being undertaken for resistant hypertension in ≈2,000 additional samples identified across the network sites, to be added to data available for samples already genotyped. Funded by ARRA supplements, secondary phenotypes have been added at all sites to leverage the genotyping data, and hypothyroidism is being analyzed as a cross-network phenotype. Results are being posted in dbGaP. Other key eMERGE activities include evaluation of the issues associated with cross-site deployment of common algorithms to identify cases and controls in EMRs, data privacy of genomic and clinically-derived data, developing approaches for large-scale meta-analysis of GWAS data across five sites, and a community consultation and consent initiative at each site.
Plans are underway to expand the network in diversity of populations and incorporation of GWAS findings into clinical care.
By combining advanced clinical informatics, genome science, and community consultation, eMERGE represents a first step in the development of data-driven approaches to incorporate genomic information into routine healthcare delivery.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources, and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
dictyBase (http://www.dictybase.org), the model organism database for Dictyostelium, aims to provide the broad biomedical research community with well integrated, high quality data and tools for Dictyostelium discoideum and related species. dictyBase houses the complete genome sequence, ESTs, and the entire body of literature relevant to Dictyostelium. This information is curated to provide accurate gene models and functional annotations, with the goal of fully annotating the genome to provide a ‘reference genome’ in the Amoebozoa clade. We highlight several new features in the present update: (i) new annotations; (ii) improved interface with web 2.0 functionality; (iii) the initial steps towards a genome portal for the Amoebozoa; (iv) ortholog display; and (v) the complete integration of the Dicty Stock Center with dictyBase.
Background Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately ‘harmonized’. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.
Methods This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P3G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).
Results The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the ‘DataSchema’ and ‘Harmonization Platforms’, together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both ‘prospective’ and ‘retrospective’ harmonization.
Conclusion It is hoped that this article will encourage readers to investigate the project further: the more the research groups and studies are actively involved, the more effective the DataSHaPER programme will ultimately be.
Data synthesis; data quality; data pooling; harmonization; meta-analysis; DataSHaPER; prospective harmonization; retrospective harmonization
The human genome has been extensively annotated with Gene Ontology for biological functions, but minimally computationally annotated for diseases.
We used the Unified Medical Language System (UMLS) MetaMap Transfer tool (MMTx) to discover gene-disease relationships from the GeneRIF database. We utilized a comprehensive subset of UMLS, which is disease-focused and structured as a directed acyclic graph (the Disease Ontology), to filter and interpret results from MMTx. The results were validated against the Homayouni gene collection using recall and precision measurements. We compared our results with the widely used Online Mendelian Inheritance in Man (OMIM) annotations.
The validation data set suggests a 91% recall rate and 97% precision rate of disease annotation using GeneRIF, in contrast with a 22% recall and 98% precision using OMIM. Our thesaurus-based approach allows for comparisons to be made between disease-containing databases and allows for increased accuracy in disease identification through synonym matching. The much higher recall rate of our approach demonstrates that annotating the human genome with Disease Ontology and GeneRIF for diseases dramatically increases the coverage of the disease annotation of the human genome.
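The recall and precision figures above compare a set of predicted gene-disease annotations with a curated reference set. The sketch below shows that comparison on toy data; the gene and disease labels are hypothetical placeholders and do not come from the Homayouni collection, GeneRIF, or OMIM.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two sets of (gene, disease) pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical curated reference annotations (assumption).
reference = {
    ("GENE_A", "disease_1"),
    ("GENE_A", "disease_2"),
    ("GENE_B", "disease_1"),
    ("GENE_C", "disease_3"),
    ("GENE_D", "disease_4"),
}

# Hypothetical annotations produced by an automated pipeline (assumption).
predicted = {
    ("GENE_A", "disease_1"),
    ("GENE_B", "disease_1"),
    ("GENE_C", "disease_3"),
    ("GENE_C", "disease_9"),  # not in the reference: a false positive
}

precision, recall = precision_recall(predicted, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A source can score high on precision while missing most known annotations, which is the pattern the abstract describes for OMIM: nearly every annotation it makes is correct, but its coverage is low.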
dictyBase (http://dictybase.org) is the model organism database for Dictyostelium discoideum. It houses the complete genome sequence, ESTs and the entire body of literature relevant to Dictyostelium. This information is curated to provide accurate gene models and functional annotations, with the goal of fully annotating the genome. This dictyBase update describes the annotations and features implemented since 2006, including improved strain and phenotype representation, integration of predicted transcriptional regulatory elements, protein domain information, biochemical pathways, improved searching and a wiki tool that allows members of the research community to provide annotations.
Dictyostelium discoideum is a model system for studying many important physiological processes including chemotaxis, phagocytosis, and signal transduction. The recent sequencing of the genome has revealed the presence of over 12,500 protein-coding genes. The model organism database dictyBase hosts the genome sequence as well as a large amount of manually curated information.
We present here an anatomy ontology for Dictyostelium based upon the life cycle of the organism.
Anatomy ontologies are necessary to annotate species-specific events such as phenotypes, and the Dictyostelium anatomy ontology provides an essential tool for curation of the Dictyostelium genome.