Jeff Sloan and colleagues describe the development of the Patient-Reported Outcomes Quality of Life (PROQOL) instrument, which captures and stores patient-recorded outcomes in the medical record for patients with diabetes.
Please see later in the article for the Editors' Summary
There are lots of question-based data elements from the pharmacogenomics research network (PGRN) studies. Many data elements contain temporal information. To semantically represent these elements so that they can be machine processiable is a challenging problem for the following reasons: (1) the designers of these studies usually do not have the knowledge of any computer modeling and query languages, so that the original data elements usually are represented in spreadsheets in human languages; and (2) the time aspects in these data elements can be too complex to be represented faithfully in a machine-understandable way.
In this paper, we introduce our efforts on representing these data elements using semantic web technologies. We have developed an ontology, CNTRO, for representing clinical events and their temporal relations in the web ontology language (OWL). Here we use CNTRO to represent the time aspects in the data elements. We have evaluated 720 time-related data elements from PGRN studies. We adapted and extended the knowledge representation requirements for EliXR-TIME to categorize our data elements. A CNTRO-based SPARQL query builder has been developed to customize users’ own SPARQL queries for each knowledge representation requirement. The SPARQL query builder has been evaluated with a simulated EHR triple store to ensure its functionalities.
A large amount of medication information resides in the unstructured text found in electronic medical records, which requires advanced techniques to be properly mined. In clinical notes, medication information follows certain semantic patterns (eg, medication, dosage, frequency, and mode). Some medication descriptions contain additional word(s) between medication attributes. Therefore, it is essential to understand the semantic patterns as well as the patterns of the context interspersed among them (ie, context patterns) to effectively extract comprehensive medication information. In this paper we examined both semantic and context patterns, and compared those found in Mayo Clinic and i2b2 challenge data. We found that some variations exist between the institutions but the dominant patterns are common.
medication extraction; electronic medical record; natural language processing
Clinical data in Electronic Medical Records (EMRs) is a potential source of longitudinal clinical data for research. The Electronic Medical Records and Genomics Network or eMERGE investigates whether data captured through routine clinical care using EMRs can identify disease phenotypes with sufficient positive and negative predictive values for use in genome wide association studies (GWAS). Using data from five different sets of EMRs, we have identified five disease phenotypes with positive predictive values of 73–98% and negative predictive values of 98–100%. A majority of EMRs captured key information (diagnoses, medications, laboratory tests) used to define phenotypes in a structured format. We identified natural language processing as an important tool to improve case identification rates. Efforts and incentives to increase the implementation of interoperable EMRs will markedly improve the availability of clinical data for genomics research.
To develop an algorithm for the discovery of drug treatment patterns for endocrine breast cancer therapy within an electronic medical record and to test the hypothesis that information extracted using it is comparable to the information found by traditional methods.
The electronic medical charts of 1507 patients diagnosed with histologically confirmed primary invasive breast cancer.
The automatic drug treatment classification tool consisted of components for: (1) extraction of drug treatment-relevant information from clinical narratives using natural language processing (clinical Text Analysis and Knowledge Extraction System); (2) extraction of drug treatment data from an electronic prescribing system; (3) merging information to create a patient treatment timeline; and (4) final classification logic.
Agreement between results from the algorithm and from a nurse abstractor is measured for categories: (0) no tamoxifen or aromatase inhibitor (AI) treatment; (1) tamoxifen only; (2) AI only; (3) tamoxifen before AI; (4) AI before tamoxifen; (5) multiple AIs and tamoxifen cycles in no specific order; and (6) no specific treatment dates. Specificity (all categories): 96.14%–100%; sensitivity (categories (0)–(4)): 90.27%–99.83%; sensitivity (categories (5)–(6)): 0–23.53%; positive predictive values: 80%–97.38%; negative predictive values: 96.91%–99.93%.
Our approach illustrates a secondary use of the electronic medical record. The main challenge is event temporality.
We present an algorithm for automated treatment classification within an electronic medical record to combine information extracted through natural language processing with that extracted from structured databases. The algorithm has high specificity for all categories, high sensitivity for five categories, and low sensitivity for two categories.
Clinical natural language processing; electronic medical record; phenotype extraction; endocrine therapy; breast cancer treatment; human-computer interaction and human-centered computing; providing just-in-time access to the biomedical literature and other health information; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); linking the genotype and phenotype; discovery; NLP; asdf d; pet; biostatistics; ontologies; knowledge representations; controlled terminologies and vocabularies; information retrieval; HIT data standards; pharmacogenomics
The beta phase of the 11th revision of International Classification of Diseases (ICD-11) intends to accept public input through a distributed model of authoring. One of the core use cases is to create textual definitions for the ICD categories. The objective of the present study is to design, develop, and evaluate approaches to support ICD-11 textual definitions authoring using Semantic Web technology. We investigated a number of heterogeneous resources related to the definitions of diseases, including the linked open data (LOD) from DBpedia, the textual definitions from the Unified Medical Language System (UMLS) and the formal definitions of the Systematized Nomenclature of Medicine—Clinical Terms (SNOMED CT). We integrated them in a Semantic Web framework (i.e., the Linked Data in a Resource Description Framework [RDF] triple store), which is being proposed as a backend in a prototype platform for collaborative authoring of ICD-11 beta. We performed a preliminary evaluation on the usefulness of our approaches and discussed the potential challenges from both technical and clinical perspectives.
Semantic Web Technology; RDF; SPARQL; ICD-11; SNOMED CT; DBpedia
Knowledge-driven text mining is becoming an important research area for identifying pharmacogenomics target genes. However, few of such studies have been focused on the pharmacogenomics targets of adverse drug events (ADEs). The objective of the present study is to build a framework of knowledge integration and discovery that aims to support pharmacogenomics target predication of ADEs. We integrate a semantically annotated literature corpus Semantic MEDLINE with a semantically coded ADE knowledgebase known as ADEpedia using a semantic web based framework. We developed a knowledge discovery approach combining a network analysis of a protein-protein interaction (PPI) network and a gene functional classification approach. We performed a case study of drug-induced long QT syndrome for demonstrating the usefulness of the framework in predicting potential pharmacogenomics targets of ADEs.
The Pharmacogenomics Knowledge Base (PharmGKB)  is a publicly available central resource for pharmacogenomics data and knowledge, which is being widely employed into clinical practice and thus requires using clinical terminologies. Hence, harmonizing PharmGKB drug data with well annotated drug terminologies will facilitate its integration with other related resources and support data representation, interpretation, and exchange within and across heterogeneous sources and applications. In this study, we extracted drug and drug class from PharmGKB, and mapped them to National Drug File Reference Terminology (NDF-RT) , which is integrated into RxNorm specified by U.S. Meaningful Use regulations. And the evaluated mapping results will be provided to PharmGKB for further investigation and collaboration.
Prevalence of abdominal aortic aneurysm (AAA) is increasing due to longer life expectancy and implementation of screening programs. Patient-specific longitudinal measurements of AAA are important to understand pathophysiology of disease development and modifiers of abdominal aortic size. In this paper, we applied natural language processing (NLP) techniques to process radiology reports and developed a rule-based algorithm to identify AAA patients and also extract the corresponding aneurysm size with the examination date. AAA patient cohorts were determined by a hierarchical approach that: 1) selected potential AAA reports using keywords; 2) classified reports into AAA-case vs. non-case using rules; and 3) determined the AAA patient cohort based on a report-level classification. Our system was built in an Unstructured Information Management Architecture framework that allows efficient use of existing NLP components. Our system produced an F-score of 0.961 for AAA-case report classification with an accuracy of 0.984 for aneurysm size extraction.
Information extraction (IE), a natural language processing (NLP) task that automatically extracts structured or semi-structured information from free text, has become popular in the clinical domain for supporting automated systems at point-of-care and enabling secondary use of electronic health records (EHRs) for clinical and translational research. However, a high performance IE system can be very challenging to construct due to the complexity and dynamic nature of human language. In this paper, we report an IE framework for cohort identification using EHRs that is a knowledge-driven framework developed under the Unstructured Information Management Architecture (UIMA). A system to extract specific information can be developed by subject matter experts through expert knowledge engineering of the externalized knowledge resources used in the framework.
Gene Wiki Plus (GeneWiki+) and the Online Mendelian Inheritance in Man (OMIM) are publicly available resources for sharing information about disease-gene and gene-SNP associations in humans. While immensely useful to the scientific community, both resources are manually curated, thereby making the data entry and publication process time-consuming, and to some degree, error-prone. To this end, this study investigates Semantic Web technologies to validate existing and potentially discover new genotype-phenotype associations in GWP and OMIM. In particular, we demonstrate the applicability of SPARQL queries for identifying associations not explicitly stated for commonly occurring chronic diseases in GWP and OMIM, and report our preliminary findings for coverage, completeness, and validity of the associations. Our results highlight the benefits of Semantic Web querying technology to validate existing disease-gene associations as well as identify novel associations although further evaluation and analysis is required before such information can be applied and used effectively.
Standardized representations for pharmacogenomics data are seldom used, which leads to data heterogeneity and hinders data reuse and integration. In this study, we attempted to represent data elements from the Pharmacogenomics Research Network (PGRN) that are related to four categories, patient, drug, disease and laboratory, in a standard way using Clinical Element Models (CEMs), which have been adopted in the Strategic Health IT Advanced Research Project, secondary use of EHR (SHARPn) as a library of common logical models that facilitate consistent data representation, interpretation, and exchange within and across heterogeneous sources and applications. This was accomplished by grouping PGRN data elements into categories based on UMLS semantic type, then mapping each to one or more CEM attributes using a web-based tool that was developed to support curation activities. This study demonstrates the successful application of SHARPn CEMs to the pharmacogenomic domain. It also identified several categories of data elements that are not currently supported by SHARPn CEMs, which represent opportunities for further development and collaboration.
The increasing adoption of electronic health records (EHRs) due to Meaningful Use is providing unprecedented opportunities to enable secondary use of EHR data. Significant emphasis is being given to the development of algorithms and methods for phenotype extraction from EHRs to facilitate population-based studies for clinical and translational research. While preliminary work has shown demonstrable progress, it is becoming increasingly clear that developing, implementing and testing phenotyping algorithms is a time- and resource-intensive process. To this end, in this manuscript we propose an efficient machine learning technique—distributional associational rule mining (ARM)—for semi-automatic modeling of phenotyping algorithms. ARM provides a highly efficient and robust framework for discovering the most predictive set of phenotype definition criteria and rules from large datasets, and compared to other machine learning techniques, such as logistic regression and support vector machines, our preliminary results indicate not only significantly improved performance, but also generation of rule patterns that are amenable to human interpretation
A standardized Adverse Drug Events (ADEs) knowledge base that encodes known ADE knowledge can be very useful in improving ADE detection for drug safety surveillance. In our previous study, we developed the ADEpedia that is a standardized knowledge base of ADEs based on drug product labels. The objectives of the present study are 1) to integrate normalized ADE knowledge from the Unified Medical Language System (UMLS) into the ADEpedia; and 2) to enrich the knowledge base with the drug-disorder co-occurrence data from a 51-million-document electronic medical records (EMRs) system. We extracted 266,832 drug-disorder concept pairs from the UMLS, covering 14,256 (1.69%) distinct drug concepts and 19,006 (3.53%) distinct disorder concepts. Of them, 71,626 (26.8%) concept pairs from UMLS co-occurred in the EMRs. We performed a preliminary evaluation on the utility of the UMLS ADE data. In conclusion, we have built an ADEpedia 2.0 framework that intends to integrate known ADE knowledge from disparate sources. The UMLS is a useful source for providing standardized ADE knowledge relevant to indications, contraindications and adverse effects, and complementary to the ADE data from drug product labels. The statistics from EMRs would enable the meaningful use of ADE data for drug safety surveillance.
Mayo Clinic's Enterprise Data Trust is a collection of data from patient care, education, research, and administrative transactional systems, organized to support information retrieval, business intelligence, and high-level decision making. Structurally it is a top-down, subject-oriented, integrated, time-variant, and non-volatile collection of data in support of Mayo Clinic's analytic and decision-making processes. It is an interconnected piece of Mayo Clinic's Enterprise Information Management initiative, which also includes Data Governance, Enterprise Data Modeling, the Enterprise Vocabulary System, and Metadata Management. These resources enable unprecedented organization of enterprise information about patient, genomic, and research data. While facile access for cohort definition or aggregate retrieval is supported, a high level of security, retrieval audit, and user authentication ensures privacy, confidentiality, and respect for the trust imparted by our patients for the respectful use of information about their conditions.
The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.
Collaborative technologies; knowledge representations; knowledge acquisition and knowledge management; controlled terminologies and vocabularies; ontologies; knowledge bases; applications that link biomedical knowledge from diverse primary sources (includes automated indexing); statistical analysis of large datasets; methods for integration of information from disparate sources; discovery; and text and data mining methods; automated learning; information retrieval; HIT data standards; representing; identifying; and modeling biological structures; developing and refining ehr data standards (including image standards)
To evaluate data fragmentation across healthcare centers with regard to the accuracy of a high-throughput clinical phenotyping (HTCP) algorithm developed to differentiate (1) patients with type 2 diabetes mellitus (T2DM) and (2) patients with no diabetes.
Materials and methods
This population-based study identified all Olmsted County, Minnesota residents in 2007. We used provider-linked electronic medical record data from the two healthcare centers that provide >95% of all care to County residents (ie, Olmsted Medical Center and Mayo Clinic in Rochester, Minnesota, USA). Subjects were limited to residents with one or more encounter January 1, 2006 through December 31, 2007 at both healthcare centers. DM-relevant data on diagnoses, laboratory results, and medication from both centers were obtained during this period. The algorithm was first executed using data from both centers (ie, the gold standard) and then from Mayo Clinic alone. Positive predictive values and false-negative rates were calculated, and the McNemar test was used to compare categorization when data from the Mayo Clinic alone were used with the gold standard. Age and sex were compared between true-positive and false-negative subjects with T2DM. Statistical significance was accepted as p<0.05.
With data from both medical centers, 765 subjects with T2DM (4256 non-DM subjects) were identified. When single-center data were used, 252 T2DM subjects (1573 non-DM subjects) were missed; an additional false-positive 27 T2DM subjects (215 non-DM subjects) were identified. The positive predictive values and false-negative rates were 95.0% (513/540) and 32.9% (252/765), respectively, for T2DM subjects and 92.6% (2683/2898) and 37.0% (1573/4256), respectively, for non-DM subjects. Age and sex distribution differed between true-positive (mean age 62.1; 45% female) and false-negative (mean age 65.0; 56.0% female) T2DM subjects.
The findings show that application of an HTCP algorithm using data from a single medical center contributes to misclassification. These findings should be considered carefully by researchers when developing and executing HTCP algorithms.
Algorithms; electronic medical record; research techniques; type 2 diabetes mellitus; EMR secondary and meaningful use; EHR; information retrieval; modeling; data mining; medical informatics; infection control; phenotyping; biomedical informatics; ontologies; knowledge representations; controlled terminologies and vocabularies; information retrieval; HIT data standards
The U.S. FDA/CDC Vaccine Adverse Event Reporting System (VAERS) provides a valuable data source for post-vaccination adverse event analyses. The structured data in the system has been widely used, but the information in the write-up narratives is rarely included in these kinds of analyses. In fact, the unstructured nature of the narratives makes the data embedded in them difficult to be used for any further studies.
We developed an ontology-based approach to represent the data in the narratives in a “machine-understandable” way, so that it can be easily queried and further analyzed. Our focus is the time aspect in the data for time trending analysis. The Time Event Ontology (TEO), Ontology of Adverse Events (OAE), and Vaccine Ontology (VO) are leveraged for the semantic representation of this purpose. A VAERS case report is presented as a use case for the ontological representations. The advantages of using our ontology-based Semantic web representation and data analysis are emphasized.
We believe that representing both the structured data and the data from write-up narratives in an integrated, unified, and “machine-understandable” way can improve research for vaccine safety analyses, causality assessments, and retrospective studies.
Structured Product Labeling (SPL) is a document markup standard approved by Health Level Seven (HL7) and adopted by United States Food and Drug Administration (FDA) as a mechanism for exchanging drug product information. The SPL drug labels contain rich information about FDA approved clinical drugs. However, the lack of linkage to standard drug ontologies hinders their meaningful use. NDF-RT (National Drug File Reference Terminology) and NLM RxNorm as standard drug ontology were used to standardize and profile the product labels.
In this paper, we present a framework that intends to map SPL drug labels with existing drug ontologies: NDF-RT and RxNorm. We also applied existing categorical annotations from the drug ontologies to classify SPL drug labels into corresponding classes. We established the classification and relevant linkage for SPL drug labels using the following three approaches. First, we retrieved NDF-RT categorical information from the External Pharmacologic Class (EPC) indexing SPLs. Second, we used the RxNorm and NDF-RT mappings to classify and link SPLs with NDF-RT categories. Third, we profiled SPLs using RxNorm term type information. In the implementation process, we employed a Semantic Web technology framework, in which we stored the data sets from NDF-RT and SPLs into a RDF triple store, and executed SPARQL queries to retrieve data from customized SPARQL endpoints. Meanwhile, we imported RxNorm data into MySQL relational database.
In total, 96.0% SPL drug labels were mapped with NDF-RT categories whereas 97.0% SPL drug labels are linked to RxNorm codes. We found that the majority of SPL drug labels are mapped to chemical ingredient concepts in both drug ontologies whereas a relatively small portion of SPL drug labels are mapped to clinical drug concepts.
The profiling outcomes produced by this study would provide useful insights on meaningful use of FDA SPL drug labels in clinical applications through standard drug ontologies such as NDF-RT and RxNorm.
The ability to conduct genome-wide association studies (GWAS) has enabled new exploration of how genetic variations contribute to health and disease etiology. However, historically GWAS have been limited by inadequate sample size due to associated costs for genotyping and phenotyping of study subjects. This has prompted several academic medical centers to form “biobanks” where biospecimens linked to personal health information, typically in electronic health records (EHRs), are collected and stored on a large number of subjects. This provides tremendous opportunities to discover novel genotype-phenotype associations and foster hypotheses generation.
In this work, we study how emerging Semantic Web technologies can be applied in conjunction with clinical and genotype data stored at the Mayo Clinic Biobank to mine the phenotype data for genetic associations. In particular, we demonstrate the role of using Resource Description Framework (RDF) for representing EHR diagnoses and procedure data, and enable federated querying via standardized Web protocols to identify subjects genotyped for Type 2 Diabetes and Hypothyroidism to discover gene-disease associations. Our study highlights the potential of Web-scale data federation techniques to execute complex queries.
This study demonstrates how Semantic Web technologies can be applied in conjunction with clinical data stored in EHRs to accurately identify subjects with specific diseases and phenotypes, and identify genotype-phenotype associations.
To extract physician-asserted drug side effects from electronic medical record clinical narratives.
Materials and methods
Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System.
The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction.
The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction.
Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.
Natural language processing; machine learning; information extraction; electronic medical record; Information storage and retrieval (text and images); discovery; and text and data mining methods; Other methods of information extraction; Natural-language processing; bioinformatics; Ontologies; Knowledge representations, Controlled terminologies and vocabularies; Information Retrieval; HIT Data Standards; Human-computer interaction and human-centered computing; Providing just-in-time access to the biomedical literature and other health information; Applications that link biomedical knowledge from diverse primary sources (includes automated indexing); Linking the genotype and phenotype
The ability to conduct genome-wide association studies (GWAS) has enabled new exploration of how genetic variations contribute to health and disease etiology. However, historically GWAS have been limited by inadequate sample size due to associated costs for genotyping and phenotyping of study subjects. This has prompted several academic medical centers to form “biobanks” where biospecimens linked to personal health information, typically in electronic health records (EHRs), are collected and stored on large number of subjects. This provides tremendous opportunities to discover novel genotype-phenotype associations and foster hypothesis generation. In this work, we study how emerging Semantic Web technologies can be applied in conjunction with clinical and genotype data stored at the Mayo Clinic Biobank to mine the phenotype data for genetic associations. In particular, we demonstrate the role of using Resource Description Framework (RDF) for representing EHR diagnoses and procedure data, and enable federated querying via standardized Web protocols to identify subjects genotyped with Type 2 Diabetes for discovering gene-disease associations. Our study highlights the potential of Web-scale data federation techniques to execute complex queries.
With increasing adoption of electronic health records (EHRs), the need for formal representations for EHR-driven phenotyping algorithms has been recognized for some time. The recently proposed Quality Data Model from the National Quality Forum (NQF) provides an information model and a grammar that is intended to represent data collected during routine clinical care in EHRs as well as the basic logic required to represent the algorithmic criteria for phenotype definitions. The QDM is further aligned with Meaningful Use standards to ensure that the clinical data and algorithmic criteria are represented in a consistent, unambiguous and reproducible manner. However, phenotype definitions represented in QDM, while structured, cannot be executed readily on existing EHRs. Rather, human interpretation, and subsequent implementation is a required step for this process. To address this need, the current study investigates open-source JBoss® Drools rules engine for automatic translation of QDM criteria into rules for execution over EHR data. In particular, using Apache Foundation’s Unstructured Information Management Architecture (UIMA) platform, we developed a translator tool for converting QDM defined phenotyping algorithm criteria into executable Drools rules scripts, and demonstrated their execution on real patient data from Mayo Clinic to identify cases for Coronary Artery Disease and Diabetes. To the best of our knowledge, this is the first study illustrating a framework and an approach for executing phenotyping criteria modeled in QDM using the Drools business rules management system.
A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.
To identify common genetic variants influencing red blood cell (RBC) traits.
Patients and Methods
We performed a genomewide association study from June 2008 through July 2011 of hemoglobin, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration in 12,486 patients of European ancestry from the electronic MEdical Records and Genomics (eMERGE) network. We developed an electronic medical record–based algorithm that included individuals who had RBC measurements obtained for clinical care and excluded values measured in the setting of hematopoietic disorders, comorbid conditions, or medications known to affect RBC production or a recent history of blood loss.
We identified 4 new genetic loci and replicated 11 loci previously reported to be associated with one or more RBC traits in individuals of European ancestry. Notably, genes present in 3 of the 4 newly identified loci (THRB, PTPLAD1, CDT1) and in 6 of the 11 replicated loci (KLF1, ALDH8A1, CCND3, SPTA1, FBXO7, TFR2/EPO) are implicated in erythroid differentiation and regulation of cell cycle in hematopoietic stem cells.
Genes in the erythroid differentiation and cell cycle regulation pathways influence interindividual variation in RBC indices. Our results provide insights into the molecular basis underlying variation in RBC traits.
eMERGE, electronic MEdical Records and GEnomics; EMMAX, mixed-model association-expedited; EMR, electronic medical record; eQTL, expression quantitative trait locus; GHC, Group Health Cooperative--University of Washington; GWAS, genomewide association study; HCT, hematocrit; HGB, hemoglobin; IBS, identity-by-state; LD, linkage disequilibrium; MC, Marshfield Clinic; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MIM, Mendelian Inheritance of Man; NU, Northwestern University; RBC, red blood cell; SNP, single-nucleotide polymorphism; VUMC, Vanderbilt University Medical Center