In our study, data captured in EMRs for routine clinical care proved adequate to define five disease phenotypes across five different study sites with robust positive and negative predictive values. Encouragingly, several recent reports (30) demonstrate that GWAS based on EMR-derived phenotypes successfully replicated identification of genetic sequences associated with increased disease risk. Although we could achieve high PPVs using case identification algorithms based on data captured through routine clinical care, we note some attrition in the number of cases identified by this approach compared with disease-focused prospective case identification. In our study, electronic algorithms identified 71% and 90% of the possible cases within two prospectively collected disease cohorts. Reduction in case identification rates may be compensated by the efficiency and scalability of electronic algorithms across EMRs.
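The two measures discussed above, PPV and the case identification rate relative to a prospectively collected cohort, can be made concrete with a minimal sketch. The function names and the counts in the usage lines are illustrative assumptions, not the study's chart review data.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of algorithm-flagged cases that chart review confirms."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no flagged cases to evaluate")
    return true_positives / flagged


def case_identification_rate(identified: int, known_cases: int) -> float:
    """Fraction of a reference cohort's cases that the algorithm also found."""
    if known_cases == 0:
        raise ValueError("reference cohort is empty")
    return identified / known_cases


# Hypothetical counts chosen only to illustrate the arithmetic.
ppv = positive_predictive_value(true_positives=95, false_positives=5)
rate = case_identification_rate(identified=71, known_cases=100)
print(f"PPV = {ppv:.2f}, case identification rate = {rate:.2f}")
```

An algorithm can score well on the first measure while missing cases on the second, which is the attrition described above.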
Across the five unique EMRs, diagnosis codes, medications, and laboratory tests were readily extracted to identify phenotypes for GWAS. Race/ethnicity, family history, exposure history (e.g., smoking), and environmental exposures were documented less frequently across all EMRs and, where present, were often captured in free-text form (e.g., clinicians' notes) without consistent or standard nomenclature. Capturing interpreted test results that are typically not recorded as structured data elements (e.g., arterial Doppler and electrocardiogram data) and clinician diagnoses (such as those found on a problem list) generally required NLP. As a result, significant informatics efforts were required to tailor algorithms to each institution’s EMR to accurately identify each phenotype.
Both “home-grown” and commercial EMRs yielded high PPVs across the primary phenotypes. Given the far wider population using commercial EMRs in routine clinical care, this finding suggests potential for broad dissemination of our approach to identify cases and controls for genetic analyses to achieve well-powered studies, although the impact of differences among commercial EMR systems is unclear. Regardless of EMR type, study sites leveraged strengths in EMR data quality and site-specific data extraction methods to optimize phenotyping algorithms, often using data categories with a high proportion of structured data at sites without NLP capacity.
Historically, institutions with significant free text documentation in their EMRs developed or adapted robust NLP tools to extract data for further analysis (20,33,343-). NLP enabled sites to improve case finding by searching across a wider range of EMR data categories. The observation that NLP tools identified 129% more cases than structured data and string matching alone emphasizes the value of information captured in free text and is consistent with prior studies (35). As a consortium, eMERGE identified use of NLP to extract data from text documents as a critical tool to improve data quality for phenotyping. Sites with NLP experience shared best practices with other consortium sites to develop NLP capacity at all sites. However, in our study, even sites without NLP tools successfully identified their primary phenotype, and one site successfully replicated previously identified genotype-phenotype associations for five diseases, including type 2 diabetes (31). Certain phenotype identification algorithms, such as those for type 2 diabetes, were implemented without use of sophisticated NLP; other algorithms, such as those for identifying cardiac conduction problems, were implemented with a combination of NLP and structured data extraction. This variation reflected institutional informatics capacity and a bias towards selection of phenotypes using data captured in structured formats at sites without NLP capacity. Sites without NLP capacity may be limited to identifying phenotypes using only data categories captured in structured fields. Approaches using only structured data could still achieve comparable PPVs, but would have lower case identification rates. However, efficient access to data across the entire spectrum of clinical EMRs can compensate for lower identification rates and still yield adequate numbers of cases for genetic studies.
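To illustrate how a phenotype can be identified from structured fields alone, the following is a minimal sketch of a rule that combines diagnosis codes, medications, and laboratory values. It is not the eMERGE type 2 diabetes algorithm: the record layout, the code and medication lists, the HbA1c cutoff, and the rule itself are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PatientRecord:
    """Hypothetical container for structured EMR fields."""
    diagnosis_codes: Set[str] = field(default_factory=set)   # ICD-9 codes
    medications: Set[str] = field(default_factory=set)       # generic drug names
    hba1c_values: List[float] = field(default_factory=list)  # percent


# Illustrative value sets; a real algorithm would use curated lists.
T2D_DIAGNOSIS_CODES = {"250.00", "250.02"}
T2D_MEDICATIONS = {"metformin"}
HBA1C_THRESHOLD = 6.5  # assumed cutoff for this sketch


def is_structured_case(record: PatientRecord) -> bool:
    """Flag a case when a diagnosis code is supported by a medication or lab."""
    has_diagnosis = bool(record.diagnosis_codes & T2D_DIAGNOSIS_CODES)
    has_medication = bool({m.lower() for m in record.medications} & T2D_MEDICATIONS)
    has_abnormal_lab = any(v >= HBA1C_THRESHOLD for v in record.hba1c_values)
    return has_diagnosis and (has_medication or has_abnormal_lab)


patient = PatientRecord(
    diagnosis_codes={"250.00"},
    medications={"Metformin"},
    hba1c_values=[7.1],
)
print(is_structured_case(patient))
```

Requiring supporting evidence in addition to a diagnosis code favors PPV over case identification rate, which mirrors the trade-off described above for sites limited to structured data.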
Some data categories consistently reflected low rates of structured data capture. The EMRs in this study used Office of Management and Budget categories for race/ethnicity (38). In this study, low rates of documentation of race and ethnicity in the EMRs are consistent with prior studies of routine physician practice (39). However, lower rates of race and ethnicity documentation in EMRs may not significantly impact subsequent genetic studies. For genetic studies, ancestry estimates derived from genotype data are often used in primary association analyses rather than self-reported race/ethnicity, though the latter clearly adds important sociocultural information independent of genetic ancestry that may be useful in more refined analyses (40). Similarly, in our study, family history was primarily documented in clinician notes and was not readily extracted even with NLP tools. One site with a vendor-based EMR featured a family history section enabling a mixture of structured and unstructured data capture, but it attracted low rates of physician documentation. Our findings are consistent with prior studies, although efforts are underway to promote standardized collection of key elements of family history within EMRs (41).
Environmental exposures play a significant role in expression of disease in genetically susceptible populations (44). Unfortunately, environmental factors, such as exposure to environmental toxins or contaminants, are rarely captured in existing EMRs, with the notable exception of smoking status. Substantial improvements in methods to collect and link environmental data to clinical data in EMRs may enable future studies of the association between disease and environment (48).
In our chart review, we identified a number of common data quality issues. Foremost, the absence of information may not reflect the absence of a condition. Depending on the institution, significant care might be rendered at outside institutions and therefore would not appear in the study site’s EMR. To address this limitation, we defined minimum data requirements (e.g., two documented clinical visits) to enhance the opportunity for clinical documentation beyond a single visit. We encountered instances of structured results violating acceptable ranges of possibility (e.g., a weight of 1000 kg, a height of 6 inches), requiring post-extraction censoring of impossible values. Lack of data equivalency posed challenges in merging data within a single EMR and across EMRs. Data are often imprecisely labeled, such that different measures might be inappropriately mixed together. For example, laboratory tests with similar names (e.g., glucose) might represent different tests (e.g., blood glucose concentration vs. urine glucose concentration). Similarly, diagnostic certainty differed depending on whether the diagnoses were entered in clinical notes or for billing purposes and differed across sites due to local billing practices (49). We identified use of data standards for EMR documentation as a necessary foundation to improve data quality and achieve data equivalence across sites. As a consortium, we used the federally endorsed Consolidated Health Informatics (CHI) standards (LOINC, ICD9/SNOMED, RxNorm) to promote data equivalency and facilitate data sharing between sites (50). Phenotyping algorithms most commonly included diagnosis codes, medications, and laboratory tests, which are well covered by the CHI standards ICD9, RxNorm, and LOINC, respectively.
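Two of the data quality safeguards described above, a minimum data requirement and post-extraction censoring of impossible values, can be sketched as follows. The plausibility bounds, field names, and date format are assumptions made for illustration; the study does not report the limits that were actually applied.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical plausibility bounds; real limits would be set by clinical review.
PLAUSIBLE_RANGES: Dict[str, Tuple[float, float]] = {
    "weight_kg": (0.2, 500.0),
    "height_cm": (20.0, 275.0),
}

INCHES_TO_CM = 2.54


def censor_implausible(measure: str, value: float) -> Optional[float]:
    """Return the value if it lies within the assumed bounds, otherwise None."""
    low, high = PLAUSIBLE_RANGES[measure]
    return value if low <= value <= high else None


def meets_minimum_data(visit_dates: List[str], min_visits: int = 2) -> bool:
    """Require at least `min_visits` distinct documented visit dates."""
    return len(set(visit_dates)) >= min_visits


# The two impossible values cited in the text are both censored.
print(censor_implausible("weight_kg", 1000.0))
print(censor_implausible("height_cm", 6 * INCHES_TO_CM))
print(meets_minimum_data(["2009-01-05", "2009-06-12"]))
```

Censoring to a missing value, rather than dropping the patient, keeps the rest of the record available to the phenotyping algorithm.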
Our study sites represented academic medical centers or institutions with significant research programs and may have a greater focus on rigorous data collection for potential future research, limiting the generalizability of our findings to non-research oriented clinical care settings. However, recent national initiatives may promote more complete and standardized data collection across EMR-enabled clinical care settings. Greater adherence to standardized data collection may facilitate the role of EMRs in research and enable the sharing of phenotype definitions across EMR systems. The Centers for Medicare and Medicaid Services and the Office of the National Coordinator have written regulations defining “Meaningful Use” of EMRs that promote the recording of structured data and define coding standards for data categories such as diagnoses, laboratory tests, and medications. Clear documentation in EMRs is a necessary goal to achieve “Meaningful Use” and enables measurement and improvement in quality of care. Achieving this goal likewise improves the quality and volume of data available for research. Significant financial incentives for achieving meaningful use of an EMR (up to $63,750 per provider over 4 years) may increase the future availability of structured and standardized data from EMRs. Although EMR data may not capture the nuance of the human-human interaction between patient and provider, accurate and structured capture of diagnosis, laboratory test, and medication data, supplemented with text mining tools, has proved useful for identifying disease phenotypes for GWAS within the eMERGE network.
Widespread adoption of EMRs creates the potential for a major advance in the availability of longitudinal, real-world clinical data for genetics research. Our study suggests that current EMRs used for routine clinical care can be used to identify phenotypes for genetic studies. Future investment in the dissemination, standardization, and comprehensive capture of phenotypic and environmental data in EMRs will help to achieve rapidly scalable phenotyping efforts to match the proliferation of genomics data.