In this study, we developed and validated an algorithm to identify cases with T2D and controls using standardized data elements captured through routine clinical care across five different EMR systems. Despite variations in data capture and completeness across the different systems, by applying stringent minimum criteria and clearly defining data elements through an iterative process, we developed a final algorithm with a 98% PPV for cases and a 100% PPV for controls. We subsequently used the identified samples, pooled across sites, to perform a GWAS.
The association tests between rs7903146 and T2D in the five EMR-derived cohorts yielded results similar to those from purposefully collected T2D case and control cohorts. In a recent meta-analysis of 29 195 T2D control subjects and 17 202 T2D case subjects from 27 populations spanning the globe, Cauchi et al27 found that the OR for developing T2D was 1.46 per copy of the rs7903146 T allele. We generated the same OR point estimate (1.46) for our EMR-derived samples using meta-analytical techniques in the pooled samples. Perhaps more importantly, our work demonstrates the power that can be achieved by combining samples across sites, as evidenced by the highly significant p values from the cross-cohort analyses.
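As an illustration of how per-cohort odds ratios can be combined into a single point estimate, the following is a minimal sketch of a fixed-effect, inverse-variance meta-analysis on the log-OR scale. The text does not specify which meta-analytical technique was used, so this method is an assumption chosen for illustration; the function name `pooled_odds_ratio` and its input format are hypothetical.

```python
import math
from typing import List, Tuple


def pooled_odds_ratio(
    estimates: List[Tuple[float, float]]
) -> Tuple[float, float, float]:
    """Combine per-cohort odds ratios with fixed-effect inverse-variance weights.

    Each input tuple is (odds_ratio, standard_error_of_log_odds_ratio).
    Returns (pooled_or, ci_lower, ci_upper) with an approximate 95% CI.
    This is a generic textbook method, not necessarily the one used in the study.
    """
    if not estimates:
        raise ValueError("at least one cohort estimate is required")

    # Weight each cohort by the inverse of the variance of its log OR.
    weights = [1.0 / (se ** 2) for _, se in estimates]
    log_ors = [math.log(odds_ratio) for odds_ratio, _ in estimates]

    total_weight = sum(weights)
    pooled_log_or = sum(w * l for w, l in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # 1.96 is the normal quantile for a two-sided 95% interval.
    lower = math.exp(pooled_log_or - 1.96 * pooled_se)
    upper = math.exp(pooled_log_or + 1.96 * pooled_se)
    return math.exp(pooled_log_or), lower, upper
```

Under this scheme, larger cohorts (smaller standard errors) contribute more to the pooled estimate, and the pooled confidence interval is narrower than that of any single cohort, which is consistent with the gain in power from combining samples across sites.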
Our work expands on earlier studies to identify patients with diabetes from EMR. Previously, Wilke et al17 at Marshfield Clinic developed an effective algorithm to identify diabetes mellitus patients, but did not specifically differentiate between T1D and T2D. Other studies have used laboratory values alone, or a combination of diagnoses, laboratory tests, and natural language processing, to achieve high specificity for the identification of T1D and T2D.16 A related study used diagnoses and medications to identify patients with conditions that are risk factors for T2D, which were in turn used to identify patients with undiagnosed diabetes.29
We identified and addressed a number of specific challenges when developing the algorithms. We created specific definitions for cases and controls to avoid confounding by the inclusion of cases with T1D and, as much as possible, of controls at risk of T2D that had not yet manifested. In EMR, fasting status at the time of a patient's blood draw was frequently not available. We therefore assumed that all glucose laboratory test results were not taken during the fasting state, so we used a lower glucose cut-off for controls, which resulted in lower sensitivity but higher specificity. In developing the final algorithm, we recognized a potential source of bias: initially, T2D subjects who were treated with insulin alone were excluded, although subjects with diabetes on insulin together with one of the diabetes medications listed above were eligible for inclusion. This approach would select against T2D subjects with significant pancreatic β-cell failure. Another problem presented by patients on insulin alone with an ICD-9-CM code for T2D is that some of them could be patients with T1D who were misclassified as having T2D because of the age of onset or other issues. To address this, we identified patients on insulin alone as cases if they had been on a T2D medication in the past, or if they had at least two visits (on different dates) with a clinician who entered T2D diagnoses (ie, in the problem list or the encounter diagnosis).
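The rule for patients on insulin alone can be sketched as follows. This is a minimal illustration of that single rule only, not the full case algorithm; the record type `InsulinOnlyPatient`, its field names, and the function name are hypothetical and assume that upstream steps have already established that the patient has an ICD-9-CM code for T2D and is currently treated with insulin alone.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class InsulinOnlyPatient:
    """Hypothetical record: a patient with a T2D code who is on insulin alone."""

    # True if any T2D medication other than insulin appears in the past record.
    past_t2d_medication: bool
    # Dates of visits at which a clinician entered a T2D diagnosis
    # (problem list or encounter diagnosis).
    t2d_diagnosis_visit_dates: List[date] = field(default_factory=list)


def insulin_only_patient_is_case(patient: InsulinOnlyPatient) -> bool:
    """Apply the insulin-alone rule described in the text.

    The patient counts as a T2D case if they were previously on a T2D
    medication, or if clinician-entered T2D diagnoses appear on at least
    two different dates.
    """
    if patient.past_t2d_medication:
        return True
    # A set collapses repeated entries from the same date into one visit.
    distinct_dates = set(patient.t2d_diagnosis_visit_dates)
    return len(distinct_dates) >= 2
```

Counting distinct dates rather than diagnosis entries reflects the "on different dates" requirement: several entries recorded during one visit do not satisfy the rule on their own.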
Identifying controls was challenging because we needed to ensure that the control group was not ‘contaminated’ with cases, which would reduce power in genetic studies. We operated on the principle that absence of a diagnosis, prescribed medications, laboratory results, or other data in the EMR did not necessarily correlate with true patient status, but may reflect the selective capture of data within the EMR. Particularly at tertiary care centers, some patients receive only a portion of their care at the center. To address this challenge, we required that controls have a minimal amount of data represented in the EMR. In particular, we required controls to have had glucose testing with normal results at least once and to have at least two in-person clinician encounters. Moreover, to eliminate younger patients at increased risk of T2D but in whom the disease was not manifest, potential controls with a family history of diabetes were excluded. Another potential confounder was patients with diet-controlled diabetes, although our algorithm was developed with the assumption that these patients would have either an ICD-9-CM code for T2D or an abnormal laboratory test result, which would exclude them from the control group.
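The control criteria named in this paragraph can be sketched as a simple eligibility check. This covers only the criteria stated here and is not the complete control definition; the record type `ControlCandidate`, its field names, and the function name are hypothetical. The glucose cut-off is left as a caller-supplied parameter because its value is not given in this section, and treating a result as normal only when it is strictly below the cut-off is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlCandidate:
    """Hypothetical summary of one patient's EMR data."""

    glucose_results: List[float]      # all recorded glucose values, same units
    in_person_encounters: int         # count of in-person clinician encounters
    has_t2d_icd9_code: bool           # any ICD-9-CM code for T2D on record
    family_history_of_diabetes: bool  # any recorded family history of diabetes


def is_eligible_control(candidate: ControlCandidate, glucose_cutoff: float) -> bool:
    """Check the minimum-data and exclusion criteria described in the text.

    glucose_cutoff must be in the same units as glucose_results; its value
    is deliberately not hard-coded here.
    """
    # Exclusions: a T2D diagnosis code or a family history of diabetes.
    if candidate.has_t2d_icd9_code or candidate.family_history_of_diabetes:
        return False
    # Minimum data: at least two in-person clinician encounters.
    if candidate.in_person_encounters < 2:
        return False
    # Minimum data: glucose must have been tested at least once.
    if not candidate.glucose_results:
        return False
    # Any abnormal result excludes the patient (assumed: normal means < cut-off).
    return all(value < glucose_cutoff for value in candidate.glucose_results)
```

Requiring a recorded normal result, rather than merely the absence of an abnormal one, follows the principle stated above that missing data in the EMR does not establish true patient status.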
Lack of standardization across EMR posed a challenge for cross-site implementation, and even within a given site where different EMR were in use. As a consortium, we identified the consolidated health informatics standards as the lingua franca to achieve comparability of data across sites. For medications, we mapped medications to RxNorm codes at the generic name level as the common link between sites.30 To make cross-institution sharing easier, we identified ingredient-level RxNorm codes (included in the supplementary appendix, available online only) to reduce the total number of codes. We used LOINC codes specifically to define tests for glucose and HbA1c levels and ICD-9-CM codes for diagnoses. Despite these efforts, the portability of algorithms across diverse sites poses a significant challenge, and our future work is focused on developing methods to scale phenotyping more broadly. For example, we noted significant differences across sites in algorithm computing time, ranging from less than 10 s at a site using an optimized commercial data warehouse to 40 h at a site sequentially extracting categories of data using statistical software on their data warehouse. To this end, we include a link to our data dictionary, sample SQL code, and a data workflow built on an open source data mining tool for other investigators to explore: https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Library_of_Phenotype_Algorithms#Type_II_Diabetes
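The idea of collapsing site-specific medication entries onto a small shared set of ingredient-level codes can be sketched as follows. This is an illustration only: the mapping table, the local medication strings, and the identifiers such as `INGREDIENT_CODE_METFORMIN` are placeholders and are not real RxNorm codes; the actual code lists are those in the supplementary appendix and the linked data dictionary.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Placeholder mapping from local medication strings to ingredient-level
# identifiers. The identifiers are NOT real RxNorm codes.
LOCAL_TO_INGREDIENT: Dict[str, str] = {
    "metformin 500 mg tablet": "INGREDIENT_CODE_METFORMIN",
    "metformin er 750 mg tablet": "INGREDIENT_CODE_METFORMIN",
    "glipizide 5 mg tablet": "INGREDIENT_CODE_GLIPIZIDE",
}


def to_ingredient_codes(
    local_entries: Iterable[str],
    mapping: Dict[str, str] = LOCAL_TO_INGREDIENT,
) -> Tuple[Set[str], List[str]]:
    """Map local medication entries to shared ingredient-level codes.

    Returns the set of ingredient codes found and the list of entries that
    could not be mapped, so that gaps in the mapping are visible for review.
    """
    codes: Set[str] = set()
    unmapped: List[str] = []
    for entry in local_entries:
        key = entry.strip().lower()  # tolerate case and whitespace differences
        if key in mapping:
            codes.add(mapping[key])
        else:
            unmapped.append(entry)
    return codes, unmapped
```

Because many strengths and formulations map to one ingredient, the list of codes that sites need to exchange is much shorter than a list of every local medication entry.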
Our study had a number of limitations. Study sites represent institutions with a significant research focus, and this may affect how data are routinely captured within the EMR. Study sites varied in the number of years of data available in the EMR and the degree of care fragmentation. Preliminary evidence suggests that the absence of longitudinal data and fragmentation of care across sites may decrease the specificity of our algorithm. Additional studies are under way to quantify these effects in greater detail. Rates of T2D varied across sites from 1.0% of the total available biorepository to 14.8% at Mayo, compared with an approximate rate of 8% for diabetes (all types) for the general population.31 Rate differences are likely to be due to bias in sample selection for genotyping: only NU selected all possible T2D cases and controls for genotyping. Other sites ran the T2D case and control algorithms on their already genotyped cohorts, which had been selected for genotyping based on their suitability for other phenotypes (eg, QRS duration at VU, cataracts at Marshfield, vascular disease at Mayo Clinic and dementia at Group Health Cooperative/UW). Other sources of bias include variation in biorepository recruitment (eg, Mayo Clinic's biorepository focused on patients with vascular disease, which is strongly associated with T2D) and variation in local coding practices.32
While the Mayo EA, VU EA, and NU AA results do not reach nominal significance, they do approach significance (p=0.11, 0.06, and 0.08, respectively) and all trend in the same direction as the remaining subcohorts. The most likely explanation for the lack of significance in VU EA and NU AA is reduced power from a relatively small sample size for a GWAS. The lack of significance in Mayo EA may be due to selection bias, as these samples were not selected for genotyping based on the T2D case and control algorithm, but rather on an algorithm designed to identify cardiovascular disease phenotypes. As noted, 14.8% of this biased Mayo cohort were identified as T2D cases, significantly higher than the national population prevalence of this disease. We suspect that increased co-occurrence of cardiovascular and metabolic diseases may contribute to the reduction in significance through an increased prevalence of undiagnosed T2D among the controls. Importantly, despite the failure to achieve significance for replication of TCF7L2 at individual sites, pooling samples across sites achieved highly significant results, supporting our collective approach.
In conclusion, we describe a practical approach to the identification of T2D cases and controls for GWAS using data captured in routine clinical care across five distinct EMR. To achieve the high specificity required for GWAS, we refined an algorithm over multiple iterations, and applied stringent criteria and nationally recognized coding standards to facilitate portability across different EMR. Although the overall number of cases and controls decreased with the increased specificity needed for GWAS, by generalizing the algorithm across diverse EMR we identified the large number of cases and controls needed for a well-powered GWAS, and reproduced the OR point estimate reported in the literature. Applying this approach across a large number of institutions offers an alternative way to generate a large cohort of T2D cases and controls to better understand the associations between genetics and expressions of disease.