Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10−8 (eMERGE) and p=2.5×10−20 (CHARGE) and rs6795970 in SCN10A with p=6×10−6 (eMERGE) and p=5×10−27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
Electrocardiographic (ECG) parameters of cardiac conduction and repolarization, PR, QRS, and QT intervals, are widely used in clinical medicine and display substantial variability when measured across large populations. QRS duration represents activation time in the cardiac ventricle, and prolongation of this interval – representing global or regional slow conduction – has been associated with adverse outcomes such as sudden cardiac death.1 This variability reflects modulators such as abnormal electrolytes, underlying heart disease, or concomitant drug therapy, as well as heritable components; published estimates suggest that up to 40% of variability in the QRS interval is heritable.2–4 Using automated methods described further below, we have shown that 99% of QRS durations in over 30,000 normal subjects not receiving confounding medications fall between 65 to 108 msec.5
DNA repositories linked to Electronic Medical Record (EMR) systems have been proposed as one potential source of subjects for analyzing the relationship between genetic variation and a range of human traits.6–10 The advantages of this approach may include rapid generation of patient sets for study (since electronic data are already in place), and the ability to study large numbers of subjects accrued without bias with respect to factors such as disease or age. The National Human Genome Research Institute’s electronic MEdical Records and GEnomics (eMERGE) Network11 has, as one of its primary goals, the evaluation of the utility of EMR systems coupled to DNA repositories as a tool for genome science. Initial studies from eMERGE sites support the potential utility of EMR systems for discovery and validation of genotype-phenotype associations.12–16
We report here a GWAS of QRS duration in European-descent subjects whose first ECG in an EMR system was normal, and who at the time of the ECG lacked evidence of heart disease, potentially confounding medications, and electrolyte abnormalities. We then used a phenome-wide association study (PheWAS)16,17 to demonstrate that polymorphisms associated with QRS variability in subjects without cardiac disease were also associated with subsequent diagnoses of cardiac arrhythmias. This coupling of GWAS and PheWAS further validates the relationship between abnormal conduction and arrhythmias, and points to the development of genome-based predictors of arrhythmia susceptibility.
We developed and deployed an algorithm to identify individuals with normal ECGs and without any cardiac disease, abnormal electrolyte values, or QRS-active medications. Applied across the five eMERGE-I sites, it identified 5,272 Caucasian patients (2,488 males and 2,784 females; Table 1). The algorithm was developed and validated in the Synthetic Derivative (SD), a de-identified image of the Vanderbilt EMR that currently contains over 120 million documents on about 2 million patients.7 The SD is refreshed regularly to add new clinical information from the EMR as it is accrued.
The study population consisted of subjects with a normal ECG without evidence of cardiac disease any time before or within one month following the ECG, concurrent use of medications that interfere with ventricular conduction, and who did not have abnormal potassium, calcium, or magnesium lab values at the time of the ECG. The algorithm has been described in detail previously.13 The algorithm used natural language processing (NLP)18,19 to analyze narrative text, billing code queries, and lab queries to exclude any subjects with evidence of arrhythmia, heart failure, cardiomyopathy, myocardial ischemia/infarct, or cardiac conduction defect. The algorithm considered all physician-generated clinical documentation, including clinical notes and cardiologist-generated ECG impressions. Patients with family histories of cardiac disease were allowed by the NLP algorithm. In addition, ECGs had normal Bazett’s corrected QT intervals (<450ms), heart rates (between 50-100 bpm), and QRS (60-120 ms). The algorithm was reviewed by two physicians not involved in algorithm development, and achieved a PPV of 97% to identify patients with normal ECGs who did not have known exclusions on a random selection of 100 subjects.13 Analysis of clinical covariates in ~30,000 records with algorithm-defined normal ECGs identified gender and ancestry as modulators of QRS duration.5 Complete details of the algorithm are available from PheKB (http://phekb.org/).
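The numeric ECG criteria above can be illustrated with a short sketch. It is not the eMERGE algorithm itself, which also relies on NLP, billing codes, and lab queries. The function names are hypothetical, and treating the heart rate and QRS bounds as inclusive is an assumption, since the text does not say.

```python
import math


def bazett_qtc_ms(qt_ms: float, heart_rate_bpm: float) -> float:
    """Bazett's correction: QTc = QT / sqrt(RR), with RR in seconds."""
    rr_seconds = 60.0 / heart_rate_bpm
    return qt_ms / math.sqrt(rr_seconds)


def ecg_within_normal_ranges(qt_ms: float, heart_rate_bpm: float, qrs_ms: float) -> bool:
    """Apply the numeric ECG thresholds quoted in the text.

    Assumption: the 50-100 bpm and 60-120 ms bounds are inclusive.
    """
    # Heart rate is checked first, which also guards the division in Bazett's formula.
    if not 50 <= heart_rate_bpm <= 100:
        return False
    if not 60 <= qrs_ms <= 120:
        return False
    return bazett_qtc_ms(qt_ms, heart_rate_bpm) < 450
```

At a heart rate of 60 bpm the RR interval is exactly one second, so the corrected and uncorrected QT intervals coincide.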
The final algorithm was applied in BioVU, the Vanderbilt DNA databank that links DNA extracted from discarded blood samples to the SD.7 Patients at Vanderbilt were genotyped specifically for the purpose of studying normal QRS duration. The algorithm was then deployed across the DNA repositories at the other four eMERGE-I sites (Marshfield Clinic, Northwestern University, Mayo Clinic, Group Health Research Institute) to identify subjects with extant eMERGE-based genotyping data (based on other phenotypes; as shown in Table 1) who met algorithm-defined criteria for normal QRS. The eMERGE cohorts are described in more detail in McCarty et al. 201111 and at the Phenotype Knowledge Base (PheKB.org). Thus, all eMERGE individuals used in the analysis underwent the same algorithm to select those with normal ECGs and without prior heart disease, interfering medications, and abnormal electrolytes. To assess the performance of the algorithm when applied within external EMR systems, trained chart abstracters at Northwestern and Marshfield reviewed randomly-selected subsets of 100 subjects at Marshfield and 45 subjects at Northwestern. Northwestern’s evaluation also included an independent review by a board-certified internal medicine physician, with discrepancies resolved by consensus. This study included only subjects designated as “non-Hispanic white” European American in the EMR from each site. We have previously shown that EMR-recorded ancestry performs similarly to self-report.20
This study was approved by each site’s Institutional Review Board. Because BioVU is de-identified and accrues individuals through left-over blood remaining after routine clinical testing, it operates as non-human subjects research according to the provisions of 45 CFR 46, as described previously.7 Individuals at other eMERGE sites were consented as part of each site’s DNA biobank.11
Genotyping was performed at the Center for Genotyping and Analysis at the Broad Institute and the Center for Inherited Disease Research (CIDR) at Johns Hopkins University. Samples of European ancestry or unknown ancestry were analyzed using the Illumina Human660W-Quadv1_A genotyping platform, consisting of 561,490 SNPs and 95,876 intensity-only probes. Data were cleaned using the quality control (QC) pipeline developed by the eMERGE Genomics Working Group.21 This process includes evaluation of sample and marker call rate, gender mismatch and anomalies, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium (HWE), sample relatedness, and population stratification. After QC, 528,508 SNPs were used for analysis based on the following QC criteria: SNP call rate >99%, sample call rate >99%, minor allele frequency > 0.0001, unrelated samples only (removing all parent-offspring, full and half siblings), and individuals of European-descent only (based on STRUCTURE22 analysis of >90% probability of being in the CEU cluster).
Each eMERGE site used the QC pipeline to clean their initial datasets prior to merging all the samples. QC procedures were then performed on the merged eMERGE dataset in which data from all five sites were combined, and no significant differences across sites or genotyping center were identified. As well, all sites had comparable QC results including similar SNP and sample call rates, HWE p-values overall, and minor allele frequencies. The detailed QC report on the merged dataset will be deposited in dbGaP along with the merged dataset.
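The SNP-level thresholds quoted above (SNP call rate >99%, minor allele frequency >0.0001) can be sketched as a simple filter. This is an illustration only, not the eMERGE QC pipeline. The function names and the genotype encoding (0, 1, or 2 copies of one allele, with None for a missing call) are assumptions made for the example.

```python
def snp_call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call at this SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency computed from called genotypes coded 0/1/2."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    coded_allele_freq = sum(called) / (2 * len(called))
    return min(coded_allele_freq, 1 - coded_allele_freq)


def passes_snp_qc(genotypes, min_call_rate=0.99, min_maf=0.0001):
    """Apply the two SNP-level thresholds stated in the text (strict inequalities)."""
    return (snp_call_rate(genotypes) > min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)
```

Sample-level steps such as relatedness pruning and ancestry clustering are outside the scope of this sketch.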
Single-locus tests of association were performed using linear regression assuming an additive genetic model for all 528,508 SNPs in a total of 5,272 individuals with a normal QRS duration. Our studies of ECG intervals in 32,949 normal individuals identified sex as a major modulator of normal QRS duration, with minor effects of age and ancestry.5 All analyses were performed unadjusted and then adjusted for age, sex, BMI, and the first principal component from Eigenstrat23 to adjust for potential population stratification; these adjustments did not significantly change the key results. Because sex is the only one of these covariates with a well-established association with QRS duration in the literature, we report the sex-adjusted results here. Analyses were also performed adjusting for height and/or BMI, but these did not change the results.
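As an illustration of the additive model, the sketch below fits QRS = b0 + b1·(allele count) + b2·(sex) by ordinary least squares for one SNP, using only the standard library. It is not the software used in the study. The function names, the 0/1 coding of sex, and the omission of standard errors and p-values are simplifications assumed for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def additive_model_fit(qrs_ms, allele_counts, sex):
    """Return [intercept, per-allele beta, sex beta] from the normal equations.

    allele_counts: 0, 1, or 2 copies of the coded allele (additive model).
    sex: coded 0/1 (assumed coding for this sketch).
    """
    rows = [[1.0, float(g), float(s)] for g, s in zip(allele_counts, sex)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, qrs_ms)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

The second coefficient is the per-allele effect in msec, which is the quantity reported as beta in the results. A genome-wide scan repeats this fit once per SNP.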
Associations with the lowest P values (p<10−4) were then submitted to the recent QRS Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) meta-analysis group24 and a table of P-values from that analysis was generated. The CHARGE meta-analysis of QRS duration has been described in detail previously24; briefly, it involved 40,407 individuals selected from 15 sites restricted to those of European ancestry. Individuals with prior myocardial infarction, heart failure, arrhythmias, pacemakers, antiarrhythmic medication use, or whose QRS durations >120 ms were excluded.
To calculate the variance explained by all SNPs in the dataset, we analyzed the data using GCTA.25 Only those SNPs with minor allele frequency > 0.01, genotyping efficiency > 99.9% and HWE p-value > 0.001 were included in the analysis (n=505,502 SNPs). The genetic relationship matrix (GRM) was computed for all 5,272 subjects and all SNPs using GCTA. In order to eliminate possible cryptic relationships, subjects with a pairwise GRM value >0.25 were pruned from the analysis, which removed 310 subjects. The proportion of variance explained by either all SNPs or all SNPs excluding the subset of 23 SNPs significant in the CHARGE GWAS was computed on the remaining subjects for QRS duration. We compared this to a linear regression analysis using the five SNPs in Table 2 to estimate the proportion of variance explained by these loci. All analyses were adjusted for age, gender and the first principal component (previously computed).
We selected the most significant SNP associations for analysis by PheWAS.16,17 For this analysis, we combined the entire eMERGE cohort of European American individuals (n=13,859) identified across the five eMERGE sites. These individuals represent a superset of the 5,272 individuals with normal ECGs and without heart disease used for the GWAS. To define diseases, we queried all International Classification of Disease (ICD), 9th edition, codes from the respective EMRs of the five eMERGE sites.
The PheWAS software uses occurrences of ICD codes to classify each person as having one or more of 778 possible clinical phenotypes (typically diseases). For each disease, the PheWAS algorithm constructs a control population by selecting all patients that do not have the case disease or closely related diseases (e.g., a patient with a bundle branch block cannot serve as a control for complete heart block). The PheWAS methodology has previously been validated through rediscovery of known associations.16,17 Analysis of each phenotype then proceeds using a pairwise analysis of all case and control groups for each tested SNP (n=23). We have observed that positive predictive values increase when individual codes are present more than once in the EMR, and here we required each case to have at least four instances of the same ICD code in a PheWAS case group. In addition, we did not analyze phenotypes occurring in fewer than 50 patients (a prevalence of 0.36% in the dataset). After identification of PheWAS case and control groups using the PheWAS software, association analyses were performed with PLINK26 using logistic regression adjusted for age, gender, and the first three principal components as calculated by Eigenstrat,23 since in this larger population the third principal component was statistically significant. Analysis with and without adjustment for principal components did not substantively change the results.
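The case and control rules described above can be sketched as follows. This is a simplified illustration, not the PheWAS software. The function names and the way code groupings are passed in are hypothetical, and the ICD-9 codes in the usage example were chosen arbitrarily. The sketch also assumes that a patient with one to three instances of a case code is neither a case nor a control, which the text does not state.

```python
MIN_CODE_INSTANCES = 4  # from the text: at least four instances of the same ICD code
MIN_CASES = 50          # from the text: phenotypes in fewer than 50 patients are skipped


def classify_patients(code_counts_by_patient, case_codes, related_codes=()):
    """Split patients into cases and controls for one phenotype.

    code_counts_by_patient: {patient_id: {icd9_code: number_of_occurrences}}
    case_codes: codes defining the phenotype.
    related_codes: codes for closely related diseases that disqualify controls.
    """
    excluded_codes = set(case_codes) | set(related_codes)
    cases, controls = set(), set()
    for patient_id, counts in code_counts_by_patient.items():
        if any(counts.get(code, 0) >= MIN_CODE_INSTANCES for code in case_codes):
            cases.add(patient_id)
        elif not any(counts.get(code, 0) > 0 for code in excluded_codes):
            controls.add(patient_id)
        # Assumption: patients with 1-3 case codes, or only related codes, are left out.
    return cases, controls


def phenotype_is_analyzable(cases):
    """Apply the minimum case-count rule from the text."""
    return len(cases) >= MIN_CASES
```

For example, with records `{"p1": {"427.31": 5}, "p2": {"427.31": 2}, "p4": {"250.00": 9}}` and case code `427.31`, patient p1 is a case, p4 is a control, and p2 is excluded from both groups.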
Following PheWAS analysis, we tested whether the SCN5A rs1805126 and SCN10A rs6795970 SNPs were associated with subsequent development of atrial fibrillation and cardiac arrhythmias in the original set of 5,272 patients who met our algorithm definition for normal cardiac conduction/normal heart. Phenotype definitions were drawn from the PheWAS analysis using billing codes. Kaplan-Meier analysis and Cox proportional hazard models were calculated as time-to-event analyses, with the initial normal ECG as the starting time. Cox proportional hazard models were adjusted for age, sex, principal components as calculated above, and QRS duration.
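For readers unfamiliar with the Kaplan-Meier estimator, a minimal standard-library sketch is given below. It is not the code used in the study, and the function name and input format are assumptions. Time is measured from the initial normal ECG, and subjects without an event are treated as censored at their last follow-up.

```python
def kaplan_meier(times, events):
    """Return [(time, event_free_probability)] at each distinct event time.

    times: follow-up time from the baseline ECG, in any consistent unit.
    events: 1 if the arrhythmia was observed at that time, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve
```

Computing one curve per genotype group gives the kind of comparison plotted in a Kaplan-Meier figure. Covariate adjustment requires a Cox model, which this sketch does not cover.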
We identified 5,272 Caucasian patients (2,488 males and 2,784 females; Table 1) across the five eMERGE-I sites. At the development site, Vanderbilt, the positive predictive value (PPV) of the automated phenotype algorithm for identifying subjects with normal ECGs and without exclusions was 97% (95% confidence interval [CI] 91-99%).13 The PPVs at Northwestern University and Marshfield Clinic were 97% (95% CI 83%-100%) and 100% (95% CI 96%-100%), respectively. Combining all reviewed samples across the three sites, the PPV would be 98% (95% CI 96%-100%). The mean QRS duration was 87.9 msec (standard deviation 9.5 msec; median 88.0 msec; Figure 1A).
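A confidence interval for a PPV estimated from a chart review can be sketched as below. The text does not state which interval method was used, so the Wilson score interval here is an assumption, and its bounds may differ slightly from the published ones. The function name is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Example: 97 confirmed subjects out of 100 reviewed charts (a PPV of 97%).
lower, upper = wilson_interval(97, 100)
print(f"PPV 95% CI: {lower:.3f} to {upper:.3f}")
```

An exact (Clopper-Pearson) interval is a common alternative for small review samples and proportions near 100%.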
A total of 528,508 SNPs passed quality control of eMERGE-supported Illumina 660Quad genotyping data in these subjects. Figure 1B shows the genome-wide association analysis for QRS duration adjusted for sex; the findings were near-identical for the unadjusted analysis. A single association survived Bonferroni correction: a SNP (rs1805126) in SCN5A, the gene encoding the cardiac sodium channel, was associated with QRS duration (beta=1.002 msec per copy of the T allele, p=1.45 × 10−8).
The set taken forward to the CHARGE QRS meta-analysis consortium included 108 SNPs with P-values <10−4. The retrieved P-values for this set divided into two distinct groups: 23 SNPs with P-values in the CHARGE set from 10−8 to 10−27, and 85 with P-values > 0.003. These 23 associations (Supplementary Table 1) are located in the five loci with the lowest P values reported by the CHARGE consortium: 18/23 are in the chromosome 3 locus that includes SCN5A and SCN10A, as well as other genes (e.g. EXOG and XYLB1). The other three loci are near SLC35F1 and C6orf204 (chromosome 6), near CDKN1A (chromosome 6) and in NFIA (chromosome 1). The most significant SNP for each locus is presented in Table 2. The locus zoom plot (Supplementary Figure 1) shows little linkage disequilibrium (LD) in the chromosome 3 region in HapMap Phase III (CEU), consistent with the suggestion that the SCN5A-10A finding may actually indicate multiple independent associations.24 Specifically, the most significant variants in SCN5A (rs1805126) and SCN10A (rs6795970) are not in LD (r2<0.20).
Using the GCTA25 approach, we estimated heritability for QRS at 31.1% (standard error [SE] 6.9%, p=5.7 × 10−7) using all SNPs in the dataset. Conducting the analysis without the 23 SNPs significant in CHARGE decreased the estimated heritability to 30.3%, a decrease of 0.8%. This was somewhat conservative compared to a linear regression model, which estimated an adjusted r-square value of 1.6% for the five loci in Table 2.
The PheWAS dataset consisted of 13,859 European-American subjects in the entire genotyped eMERGE cohort. The analysis focused on the most significant SNPs in each of the five loci associated with QRS (Table 3). While no associations survived a strict Bonferroni correction for significance (p=0.05/778/5=1.3×10−5), the most significant associations were particularly relevant to cardiac disease and demonstrated significantly different patterns of associations for the five QRS-associated loci. The strongest associations for the SNPs in both SCN5A and SCN10A were with the diagnoses of cardiac arrhythmias (p=7.21×10−4 for SCN10A and p=1.1×10−3 for SCN5A) and, for SCN10A, atrial fibrillation (p=8.5×10−4, Figure 2). Table 3 lists associations for the most significant SCN5A (rs1805126) and SCN10A (rs6795970) SNPs, as well as those at the other QRS-associated loci, chromosome 1 (rs2207790) and the two chromosome 6 loci (rs6906287 and rs1321313), also graphed in Supplementary Figures 2-4. The CDKN1A and C6orf204 loci were not associated with cardiac arrhythmias (p>0.3, with 80% power to detect an OR>1.12 at p=0.05), and NFIA was weakly associated with cardiac arrhythmias (OR=0.91, p=0.02). Supplementary Table 2 presents PheWAS association data for all 23 SNPs significant in CHARGE. While most SNPs in a given gene displayed similar patterns of PheWAS associations, rs11129801, the strongest SCN10A SNP in our adjusted analysis but a lesser association in CHARGE, had a very different PheWAS pattern, with the strongest associations being epilepsy, uterine cancer, and migraines; atrial fibrillation was not associated (OR=1.07, p=0.18). In agreement with these data, rs11129801 was only in weak linkage disequilibrium with the other SCN10A SNPs, such as rs6800541 (r2=0.20).
Likewise, the EXOG locus had a very different PheWAS pattern of associations (prostatic hyperplasia, sexual and gender identity disorders, liver disease, kidney disease, cerebral degenerations, diarrhea) despite being nearby the SCN5A-10A region.
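The Bonferroni threshold quoted for the PheWAS results (0.05 divided by 778 phenotypes and 5 loci) can be reproduced with a few lines of arithmetic. The genome-wide threshold in the second call is an assumption: it presumes the correction counted all 528,508 analyzed SNPs, which the text implies but does not state.

```python
def bonferroni_threshold(alpha, *test_counts):
    """Divide alpha by the product of the given numbers of tests."""
    n_tests = 1
    for count in test_counts:
        n_tests *= count
    return alpha / n_tests


# PheWAS: 778 phenotypes tested for each of 5 loci, as stated in the text.
print(f"{bonferroni_threshold(0.05, 778, 5):.1e}")   # 1.3e-05

# GWAS: assumed to be corrected for all 528,508 SNPs passing QC.
print(f"{bonferroni_threshold(0.05, 528508):.1e}")   # 9.5e-08
```

Under that assumption the reported SCN5A rs1805126 p-value of 1.45×10−8 falls below the genome-wide threshold, while the strongest PheWAS p-values (on the order of 10−4 to 10−3) do not reach the PheWAS threshold.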
After PheWAS analysis, we analyzed the original set of 5,272 patients that met our algorithm definition for normal cardiac conduction/normal heart for subsequent development of atrial fibrillation and cardiac arrhythmias in relation to the SCN5A rs1805126 and SCN10A rs6795970 SNPs. In this population, 173 (3%) developed atrial fibrillation or atrial flutter at some point at least one month following the normal ECG, and 605 (11%) were coded as having any arrhythmia. QRS duration itself was associated with future development of atrial fibrillation (p=0.015) by logistic regression. As with the eMERGE PheWAS, SCN10A rs6795970 was associated with both arrhythmias (hazard ratio [HR]=0.81 per copy of the A allele, p=0.002) and atrial fibrillation/flutter (HR=0.67 per copy of the A allele, p=0.001). Moreover, this association was essentially unchanged when QRS was also included in the model (HR=0.68), indicating that the association between rs6795970 and atrial fibrillation is independent of the association between QRS and rs6795970. Similarly, the association between rs6795970 and cardiac arrhythmias was independent of QRS (HR=0.80 without QRS). Our analysis did not demonstrate an association between SCN5A rs1805126 and either atrial fibrillation (HR=1.2, p=0.14) or cardiac arrhythmias (HR=1.03, p=0.66) in the normal QRS population. Figure 3 presents a Kaplan-Meier plot for the relationship between rs6795970 and development of atrial fibrillation.
The current study demonstrates that common variants in the SCN5A-SCN10A locus are associated with QRS duration in subjects without clinical evidence of prior heart disease. These patients were derived from clinical practice settings, adding to the growing body of evidence supporting the utility of EMR-based genomic analysis.12–16,27 The data replicate findings from a large meta-analysis24 drawn from multiple community populations, where information on potential QRS modulators such as heart disease status and medications was not as precisely controlled or excluded from all included studies. The major new finding here is that using the PheWAS study paradigm, we were able to examine the longitudinal associations of these genomic variants with disease in a hypothesis-free manner. This analysis revealed that SNPs in SCN10A (rs6795970) and SCN5A (rs1805126) that are strongly associated with QRS duration are also associated with subsequent cardiac arrhythmias. SCN10A specifically is associated with atrial fibrillation. These associations between rs6795970 and atrial fibrillation and cardiac arrhythmias were also seen specifically in the original “heart healthy” study population, and were independent of the SNP’s association with QRS duration. The latter finding suggests that while variants at the SCN5A-SCN10A locus determine QRS and subsequent arrhythmia susceptibility, they may do so by divergent (pleiotropic) pathways, or that conduction slowing occurs not only in the ventricle but also in the atrium where it contributes to susceptibility to atrial fibrillation. Importantly, the selection logic for the case selection algorithm in our GWAS required the absence of cardiovascular disease at the time of the ECG. Therefore, these associations represent subsequent development of cardiac arrhythmias in subjects with these variants.
This result highlights the potential of the EMR, with multiple diagnoses and longitudinal follow-up, to identify not only variants associated with disease susceptibility or trait variability, but also subsequent outcomes associated with these variants.
Drugs that block SCN5A-encoded sodium channels slow ventricular conduction and prolong QRS duration,28 and they increased mortality following myocardial infarction in the Cardiac Arrhythmia Suppression Trial (CAST).29,30 Available evidence supports the view that slow conduction, particularly in the setting of scarred or ischemic myocardium, promotes reentrant excitation that leads to fatal arrhythmias,31–33 and in CAST, longer QRS durations also predicted increased mortality among patients treated with placebo.1 In addition, a genetic disease caused by SCN5A loss-of-function mutations (Brugada Syndrome) is characterized by slowed ventricular conduction and an increased risk for fatal arrhythmias.34 Interestingly, previous analyses of variable PR duration have also identified strong associations with variants in SCN10A,13,35–37 and multiple mechanisms are currently being examined to explain this effect: expression of SCN10A-encoded channels in cardiomyocytes and/or cardiac neurons, or regulation of SCN5A-10A expression.38–40
Our PheWAS analysis suggests that SCN10A, SCN5A, and EXOG variants are associated with cardiac arrhythmia billing codes (entered either by physicians or professional coders); these include atrial fibrillation and flutter, supraventricular and ventricular tachycardia, cardiac arrest, and other unspecified arrhythmias. SCN10A rs6795970 was specifically associated with atrial fibrillation and flutter, which was noted by Pfeufer et al.36 but not Holm et al.35 Chambers et al.37 previously demonstrated associations between SCN10A rs6795970 and heart block and ventricular fibrillation; we also noted an association with first-degree atrioventricular block (p=0.009, Supplementary Table 2). Mouse studies have demonstrated expression of Nav1.8 in vagal and spinal afferents in gastrointestinal mucosa and myenteric plexuses,41 and interestingly SCN10A was also associated with cholecystitis in the PheWAS. Variants in CDKN1A and C6orf204, however, were not associated with cardiac arrhythmias, although we cannot exclude that weak associations may exist. In contrast, the C6orf204 locus seems most associated with neoplastic disorders (colorectal cancer, prostatic hypertrophy, and melanoma); its strongest cardiovascular disease association was atherosclerosis (odds ratio 1.094, p=0.03). Thus, PheWAS suggests that although all these regions may be associated with QRS interval, only those SNPs in the SCN5A-10A region seem significantly associated with subsequent development of arrhythmias.
Recent growth in large GWAS meta-analyses has shown the power of large numbers to find genomic influences on given traits and diseases. In this study, the development of the phenotype algorithm at one site was followed by its use at four other sites with different EMR systems. Algorithm performance in finding patients who met inclusion and exclusion criteria was similar at the three sites that evaluated the algorithm, providing further validation of the transportability of EMR phenotype algorithms.16,42 Estimates of heritability of QRS using all SNPs are consistent with other reports; the heritability associated with the very limited subset of 23 SNPs implicated here appears to explain a surprisingly large proportion of this heritability estimate. This analysis also indicates that, as with other traits, there is extensive “missing heritability” when QRS duration is analyzed as a function of common genomic variation.
This report highlights both limitations and real and potential advantages of EMR-based genomic research. The present study involved analysis of subjects accrued in the initial stages of the eMERGE network, and as such a major limitation was the relatively small size of the study set. Large consortia such as CHARGE have used meta-analysis to aggregate individual datasets across many sites and have demonstrated the power of the large numbers to generate highly significant results by this approach. One of the key lessons in the eMERGE experience to date has been that algorithms to identify cases and controls for genomic or other study can be successfully deployed across multiple EMR systems.16,43,44 Thus, as the number of subjects with dense genomic information across multiple EMR systems grows, this and other EMR-based studies highlight the potential for accrual of increasingly large sample sets to identify genomic predictors of variability in phenotypes such as physiologic traits or disease susceptibility.
Further, EMRs hold the promise, as suggested here, of examining longitudinal healthcare outcomes, such as disease complications or response to drug therapies. Identifying cases for such studies requires especially large resources, since subsets of subsets (e.g. drug response X in disease Y) are required. Current efforts demonstrate the feasibility of accrual of DNA collections coupled to EMRs large enough to support statistically valid analyses of rare variants and/or rare clinical events. In the current eMERGE network, there are 9 sites with EMR-based biobanks that include >250,000 subjects, and >50,000 have been genotyped on a genome-wide platform. Other resources that should expand the reach of EMR-based genomic research include the Kaiser Northern California biobank (>100,000 individuals),45 the Million Veteran’s Project (currently >100,000 individuals),46 and biobanks that will be coupled to national healthcare systems (e.g. the UK Biobank47).
Studies in the EMR environment enable PheWAS analyses: a PheWAS experiment cannot be executed in the absence of diverse diagnoses across many diagnostic classes in study subjects. The PheWAS-based associations reported here did not achieve significance using a strict Bonferroni correction; however, SCN10A rs6795970 was strongly associated with both atrial fibrillation and cardiac arrhythmias in the normal QRS population as well. The PheWAS approach is still in development and it is likely that the strategy of using only standard diagnostic codes is a limitation. The refinement of disease classifications and abstraction methods will enable more granular phenotypic subsetting, particularly when applied to increasingly large datasets described above.
A fundamental limitation of the PheWAS technique is that every diagnosis is not explicitly included or excluded in each record. Another potential limitation of EMR-based phenotyping is the accuracy of the information contained in the record. Extracting phenotypes from the EMR can result in errors if the data in the source EMR are incorrect (e.g., the EMR specifies a disease the patient does not have). Gender and ancestry testing suggest these demographic features are rarely incorrect in an EMR.7,20 Similarly, the electronic phenotyping experience in eMERGE indicates that recurrent mentions of specific diagnoses or combinations of diverse data types (such as medications plus free text plus diagnostic codes) greatly improve diagnostic accuracy of electronic phenotyping algorithms when assessed as positive predictive value using hand curation as the gold standard.48 EMRs contain data on subjects exposed to a healthcare system and as such EMR-based studies may not be generalizable to a broad population. The conduct of EMR-based studies across a variety of geographical locations and practice settings (as in eMERGE) potentially mitigates this issue.
In summary, a genome-wide association study conducted across multiple EMR systems replicated known associations for a readily available index of cardiac conduction, the QRS duration. The algorithm deployed allowed us to analyze subjects with normal electrocardiograms and no evidence of heart disease, confounding drugs, or electrolyte abnormalities, and the phenome-wide association study established that genomic variants predicting slower conduction in this population were also associated with subsequent development of arrhythmias. Thus, the present findings are consonant with a view that individual susceptibility to serious arrhythmias is determined in part by genetically-determined variability in cardiac electrophysiologic behaviors. Furthermore, this study highlights the advantages of a genotyped EMR population to explore the subsequent emergence of clinically-important phenotypes not ascertained in the original study design.
QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias. We identified 5,272 individuals with normal ECGs and no evidence of cardiac disease from electronic medical records at five institutions and performed genome-wide association analysis. We found variants in 5 loci associated with QRS duration in normal individuals, including SCN5A, SCN10A, NFIA, near CDKN1A, and near SLC35F1, that were replicated in the CHARGE consortium QRS meta-analysis. Subsequently, we performed phenome-wide association studies on associated SNPs. These analyses demonstrated that SCN5A-10A variants were associated with future development of atrial fibrillation and cardiac arrhythmias. We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but also for broadly interrogating the longitudinal incidence of disease associated with genetic variants.
Supplemental Figure 1: LocusZoom plot of the region in chromosome 3 in which most of the replicated associations (18/23) were identified. All variants with p<10−4 in the eMERGE data set were replicated at p<5×10−8 in the CHARGE data set.
Supplemental Figure 2: Phenome-wide association study plot for CDKN1A (rs1321313).
Supplemental Figure 3: Phenome-wide association study plot for C6orf204 (rs6906287).
Supplemental Figure 4: Phenome-wide association study plot for NFIA (rs2207790).
Supplemental Table 1: Replicated SNPs in the QRS GWAS analysis. Reported betas and pvalues for eMERGE analysis are adjusted for sex, and all betas for CHARGE and eMERGE analyses are for the coded allele.
Supplemental Table 2: PheWAS associations for replicated SNPs (N=23). Data for all associations with P<0.01 are shown. ICD9 codes for given associations can be downloaded from http://knowledgemap.mc.vanderbilt.edu/research/content/phewas.
Funding Sources: This work was supported by the electronic MEdical Records and GEnomics (eMERGE) Network, initiated and funded by the National Human Genome Research Institute, with additional funding from the National Institute of General Medical Sciences, through the following grants: U01-HG-004610 (Group Health Cooperative); U01-HG-004608 (Marshfield Clinic); U01-HG-04599 (Mayo Clinic); U01-HG-004609 (Northwestern University); U01-HG-04603 (Vanderbilt University, also serving as the Administrative Coordinating Center). BioVU also receives support through the National Center for Research Resources UL1 RR024975, which is now at the National Center for Advancing Translational Sciences, 2 UL1 TR000445. Dr. Sotoodehnia was supported by R01-HL088456. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Conflict of Interest Disclosures: None.