Primary open-angle glaucoma (POAG) is a complex disease and one of the leading causes of blindness worldwide. Genome-wide association studies have successfully identified several common variants associated with glaucoma; however, most of these variants explain only a small proportion of the genetic risk. Apart from the standard approach of identifying main effects of variants across the genome, it is believed that gene-gene interactions can help elucidate part of the missing heritability, because testing for interactions between genetic variants better reflects the complex nature of biology. To explain the etiology of glaucoma, we first performed a genome-wide association study (GWAS) on glaucoma case-control samples obtained from electronic medical records (EMR) to establish the utility of EMR data in detecting non-spurious and relevant associations; this analysis was aimed at confirming already known associations with glaucoma and validating the EMR-derived glaucoma phenotype. Our GWAS findings provide consistent evidence for several known associations in POAG. We then performed an interaction analysis for variants found to be marginally associated with glaucoma (SNPs with main effect p-value <0.01) and observed interesting findings in the electronic MEdical Records and GEnomics (eMERGE) Network dataset. Genes from the top epistatic interactions in the eMERGE data (likelihood ratio test [LRT] p-value <1e-05) were then tested for replication in the NEIGHBOR consortium dataset. To replicate our findings, we performed a gene-based SNP-SNP interaction analysis in NEIGHBOR and observed significant gene-gene interactions (p-value <0.001) among the top 17 gene-gene models identified in the discovery phase.
Variants from the gene-gene interaction analysis that we found to be associated with POAG explain an additional 3.5% of genetic variance in the eMERGE dataset beyond what is explained by SNPs in genes replicated from previous GWAS (which explained only 2.1% of variance in the eMERGE dataset); in the NEIGHBOR dataset, adding replicated SNPs from the gene-gene interaction analysis explains 3.4% of total variance, whereas GWAS SNPs alone explain only 2.8% of variance. Gene-gene interactions may provide additional insights into many complex traits when explored in properly designed and powered association studies.
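The likelihood ratio test used to screen SNP-SNP models can be illustrated with a minimal sketch. It assumes additive genotype coding (0, 1, 2), a logistic model without covariates, and a one-degree-of-freedom chi-square reference distribution; the function names (solve, max_loglik, interaction_lrt) and the toy genotype counts are hypothetical and are not taken from the study.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def max_loglik(design, y, iters=25):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(design, y):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for x, yi in zip(design, y):
        eta = sum(b * v for b, v in zip(beta, x))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return loglik

def interaction_lrt(snp_a, snp_b, y):
    """Compare main effects only (reduced) against main effects plus A*B (full)."""
    reduced = [[1.0, a, b] for a, b in zip(snp_a, snp_b)]
    full = [[1.0, a, b, a * b] for a, b in zip(snp_a, snp_b)]
    stat = max(2.0 * (max_loglik(full, y) - max_loglik(reduced, y)), 0.0)
    # Survival function of a chi-square with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical toy data: (genotype A, genotype B, cases, controls) per genotype cell.
cells = [(0, 0, 20, 40), (0, 1, 22, 38), (0, 2, 24, 36),
         (1, 0, 21, 39), (1, 1, 30, 30), (1, 2, 38, 22),
         (2, 0, 23, 37), (2, 1, 39, 21), (2, 2, 50, 10)]
snp_a, snp_b, y = [], [], []
for a, b, n_case, n_ctrl in cells:
    for status, n in ((1, n_case), (0, n_ctrl)):
        snp_a += [a] * n
        snp_b += [b] * n
        y += [status] * n

stat, p_value = interaction_lrt(snp_a, snp_b, y)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.3g}")
```

A real analysis would add covariates such as ancestry components to both models and would use an established statistics package rather than a hand-written optimizer.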
The complex nature of primary open-angle glaucoma (POAG) has left researchers exploring the genetic architecture and searching for the missing heritability using a number of different study designs. Over the past decade, many studies have been conducted to explain the etiology of POAG; however, a high proportion of the estimated heritability remains unexplained. GWA studies for POAG have identified significant associations, but these associations explain only a small proportion of the genetic risk (odds ratios range between 1–3). In this paper, we sought to confirm the primary genome-wide significant associations discovered so far for glaucoma in phenotypes developed from EMR data, in an effort to show that EMR data can be a powerful resource for finding genetic variants influencing POAG susceptibility. Next, we tested for statistical interactions, which can serve as an important tool for explaining POAG heritability. We used a reduced list of variants filtered by marginal main effect analysis to look for epistatic interactions. We present our results from replication of gene-based interaction analyses performed in eMERGE and the NEIGHBOR consortium data. According to expression data and annotations from various publicly available databases, the most significant genes that replicated in our analyses show expression in the eye and trabecular meshwork. Estimating the genetic variance explained by significant associations from previous GWAS together with replicated variants from gene-based interactions suggests that these explain 5.6% of variance in the eMERGE dataset and 3.4% of variance in the NEIGHBOR dataset.
We explored premature stop-gain variants to test the hypothesis that variants, which are likely to have a consequence on protein structure and function, will reveal important insights with respect to the phenotypes associated with them. We performed a phenome-wide association study (PheWAS) exploring the association between a selected list of functional stop-gain genetic variants (variation resulting in truncated proteins or in nonsense-mediated decay) and an extensive group of diagnoses to identify novel associations and uncover potential pleiotropy.
In this study, we selected 25 stop-gain variants: 5 stop-gain variants with previously reported phenotypic associations, and a set of 20 putative stop-gain variants identified using dbSNP. For the PheWAS, we used data from the electronic MEdical Records and GEnomics (eMERGE) Network across 9 sites with a total of 41,057 unrelated patients. We divided all these samples into two datasets by equal proportion of eMERGE site, sex, race, and genotyping platform. We calculated single effect associations between these 25 stop-gain variants and ICD-9 defined case-control diagnoses. We also performed stratified analyses for samples of European and African ancestry. Associations were adjusted for sex, site, genotyping platform and the first three principal components to account for global ancestry. We identified previously known associations, such as variants in LPL associated with hyperglyceridemia, indicating that our approach was robust. We also found a total of three significant associations with p < 0.01 in both datasets, with the most significant replicating result being LPL SNP rs328 and ICD-9 code 272.1 “Disorder of Lipoid metabolism” (pdiscovery = 2.59x10-6, preplicating = 2.7x10-4). The other two significant replicated associations identified by this study are: variant rs1137617 in the KCNH2 gene associated with ICD-9 code category 244 “Acquired Hypothyroidism” (pdiscovery = 5.31x10-3, preplicating = 1.15x10-3) and variant rs12060879 in the DPT gene associated with ICD-9 code category 996 “Complications peculiar to certain specified procedures” (pdiscovery = 8.65x10-3, preplicating = 4.16x10-3).
In conclusion, this PheWAS revealed novel associations of stop-gained variants with interesting phenotypes (ICD-9 codes) along with pleiotropic effects.
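The replication criterion described above (p < 0.01 in both datasets) amounts to a simple filter over per-dataset results. The sketch below assumes each association is stored as a dictionary; the field names and the second record are hypothetical placeholders, and only the rs328 p-values come from the text.

```python
def replicated(results, alpha=0.01):
    """Keep associations whose p-value is below alpha in both datasets,
    ordered by discovery p-value."""
    kept = [r for r in results
            if r["p_discovery"] < alpha and r["p_replication"] < alpha]
    return sorted(kept, key=lambda r: r["p_discovery"])

results = [
    # Reported in the text: LPL rs328 with ICD-9 code 272.1.
    {"snp": "rs328", "icd9": "272.1",
     "p_discovery": 2.59e-6, "p_replication": 2.7e-4},
    # Hypothetical record that passes discovery but fails replication.
    {"snp": "rs_hypothetical", "icd9": "000",
     "p_discovery": 4.0e-3, "p_replication": 0.20},
]

for hit in replicated(results):
    print(hit["snp"], hit["icd9"], hit["p_discovery"], hit["p_replication"])
```

Whether a consistent direction of effect was also required is not stated in the abstract, so the sketch checks p-values only.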
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-016-0191-8) contains supplementary material, which is available to authorized users.
We performed a Phenome-Wide Association Study (PheWAS) to identify interrelationships between the immune system genetic architecture and a wide array of phenotypes from two de-identified electronic health record (EHR) biorepositories. We selected variants within genes encoding critical factors in the immune system and variants with known associations with autoimmunity. To define case/control status for EHR diagnoses, we used International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes from 3,024 Geisinger Clinic MyCode® subjects (470 diagnoses) and 2,899 Vanderbilt University Medical Center BioVU biorepository subjects (380 diagnoses). A pooled-analysis was also carried out for the replicating results of the two data sets. We identified new associations with potential biological relevance including SNPs in tumor necrosis factor (TNF) and ankyrin-related genes associated with acute and chronic sinusitis and acute respiratory tract infection. The two most significant associations identified were for the C6orf10 SNP rs6910071 and “rheumatoid arthritis” (ICD-9 code category 714) (pMETAL = 2.58 x 10−9) and the ATN1 SNP rs2239167 and “diabetes mellitus, type 2” (ICD-9 code category 250) (pMETAL = 6.39 x 10−9). This study highlights the utility of using PheWAS in conjunction with EHRs to discover new genotypic-phenotypic associations for immune-system related genetic loci.
The most common side effect of angiotensin converting enzyme inhibitor drugs (ACEi) is a cough. We conducted a genome wide association study (GWAS) of ACEi-induced cough among 7,080 subjects of diverse ancestries in the eMERGE network. Cases were subjects diagnosed with ACEi-induced cough. Controls were subjects with at least 6 months of ACEi use and no cough. A GWAS (1,595 cases and 5,485 controls) identified associations on chromosome 4 in an intron of KCNIP4. The strongest association was at rs145489027 (MAF=0.33, OR=1.3 [95%CI: 1.2–1.4], p=1.0×10−8). Replication for six SNPs in KCNIP4 was tested in a second eMERGE population (n=926) and in the GoDARTS cohort (n=4,309). Replication was observed at rs7675300 (OR=1.32 [1.01–1.70], p=0.04) in eMERGE and rs16870989 and rs1495509 (OR=1.15 [1.01–1.30], p=0.03 for both) in GoDARTS. The combined association at rs1495509 was significant (OR=1.23 [1.15–1.32], p=1.9×10−9). These results indicate that SNPs in KCNIP4 may modulate ACEi-induced cough risk.
ACE inhibitor; angiotensin converting enzyme inhibitor; GWAS; KCNIP4; Drug Related Side Effects and Adverse Reactions; pharmacogenetics
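A combined association across discovery and replication cohorts is commonly obtained with a fixed-effect, inverse-variance meta-analysis on the log odds ratio scale. The sketch below assumes that approach (the abstract does not state which method was used) and recovers each standard error from the reported 95% confidence interval; the input values in the example are hypothetical.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover log(OR) and its standard error from a 95% confidence interval."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z95)
    return math.log(odds_ratio), se

def fixed_effect_meta(studies):
    """Pool (OR, CI low, CI high) tuples by inverse-variance weighting.

    Returns (pooled OR, CI low, CI high, two-sided p-value)."""
    weight_sum = 0.0
    weighted_beta = 0.0
    for odds_ratio, ci_low, ci_high in studies:
        beta, se = log_or_and_se(odds_ratio, ci_low, ci_high)
        weight = 1.0 / se ** 2
        weight_sum += weight
        weighted_beta += weight * beta
    beta = weighted_beta / weight_sum
    se = math.sqrt(1.0 / weight_sum)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))
    return (math.exp(beta), math.exp(beta - Z95 * se),
            math.exp(beta + Z95 * se), p_value)

# Hypothetical per-cohort estimates for a single SNP.
pooled = fixed_effect_meta([(1.10, 0.95, 1.27), (1.30, 1.10, 1.54)])
print("pooled OR = %.2f [%.2f, %.2f], p = %.3g" % pooled)
```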
Medicine is moving toward precision medicine, with the goal of preventing and treating diseases by taking inter-individual variability into account. A large part of that variability lies in our genetic makeup. With the fast-paced improvement of high-throughput methods for genome sequencing, a tremendous amount of genetic data has already been generated. The next hurdle for precision medicine is to have sufficient computational tools for analyzing large sets of data. Genome-Wide Association Studies (GWAS) have been the primary method to assess the relationship between single nucleotide polymorphisms (SNPs) and disease traits. While GWAS is sufficient for finding individual SNPs with strong main effects, it does not capture potential interactions among multiple SNPs. In many traits, a large proportion of variation remains unexplained by main effects alone, leaving the door open for exploring the role of genetic interactions. However, identifying genetic interactions in large-scale genomics data poses a challenge even for modern computing.
For this study, we present a new algorithm, Grammatical Evolution Bayesian Network (GEBN), that uses Bayesian Networks to identify interactions in the data and, at the same time, uses an evolutionary algorithm to reduce the computational cost associated with network optimization. GEBN excelled in simulation studies where the data contained main effects and interaction effects. We also applied GEBN to a Type 2 diabetes (T2D) dataset obtained from the Marshfield Personalized Medicine Research Project (PMRP). We were able to identify genetic interactions for T2D cases and controls and use information from those interactions to classify T2D samples. We obtained an average testing area under the curve (AUC) of 86.8%. We also identified several interacting genes such as INADL and LPP that are known to be associated with T2D.
Developing the computational tools to explore genetic associations beyond main effects remains a critically important challenge in human genetics. Methods, such as GEBN, demonstrate the utility of considering genetic interactions, as they likely explain some of the missing heritability.
Evolution algorithm; Bayesian Network; Genetic interactions; Discriminant analysis; Type 2 diabetes
Recent studies on copy number variation (CNV) have suggested that an increasing burden of CNVs is associated with susceptibility or resistance to disease. A large number of genes or genomic loci contribute to complex diseases such as autism. Thus, total genomic copy number burden, as an accumulation of copy number change, is a meaningful measure of genomic instability to identify the association between global genetic effects and phenotypes of interest. However, no systematic annotation pipeline has been developed to interpret biological meaning based on the accumulation of copy number change across the genome associated with a phenotype of interest. In this study, we develop a comprehensive and systematic pipeline for annotating copy number variants into genes/genomic regions and subsequently pathways and other gene groups using Biofilter – a bioinformatics tool that aggregates over a dozen publicly available databases of prior biological knowledge. Next we conduct enrichment tests of biologically defined groupings of CNVs including genes, pathways, Gene Ontology, or protein families. We applied the proposed pipeline to a CNV dataset from the Marshfield Clinic Personalized Medicine Research Project (PMRP) in a quantitative trait phenotype derived from the electronic health record – total cholesterol. We identified several significant pathways such as toll-like receptor signaling pathway and hepatitis C pathway, gene ontologies (GOs) of nucleoside triphosphatase activity (NTPase) and response to virus, and protein families such as cell morphogenesis that are associated with the total cholesterol phenotype based on CNV profiles (permutation p-value < 0.01). Based on the copy number burden analysis, it follows that the more and larger the copy number changes, the more likely that one or more target genes that influence disease risk and phenotypic severity will be affected. 
Thus, our study suggests the proposed enrichment pipeline could improve the interpretability of copy number burden analysis where hundreds of loci or genes contribute toward disease susceptibility via biological knowledge groups such as pathways. This CNV annotation pipeline with Biofilter can be used for CNV data from any genotyping or sequencing platform and to explore CNV enrichment for any traits or phenotypes. Biofilter continues to be a powerful bioinformatics tool for annotating, filtering, and constructing biologically informed models for association analysis – now including copy number variants.
Copy number burden; functional annotation; electronic medical record; precision medicine
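The permutation-based enrichment test can be sketched as follows. This is an illustration under stated assumptions, not the Biofilter pipeline itself: burden is taken as the number of CNV-affected genes a sample carries within a gene group, the test statistic is the absolute Pearson correlation between burden and the quantitative trait, and the null distribution comes from shuffling trait values. Gene names, trait values, and function names are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def group_burden(sample_cnv_genes, gene_group):
    """Per sample, count the CNV-affected genes that belong to the gene group."""
    return [len(set(genes) & gene_group) for genes in sample_cnv_genes]

def permutation_p(burden, trait, n_perm=999, seed=0):
    """Empirical p-value for the burden-trait correlation under trait shuffling."""
    rng = random.Random(seed)
    observed = abs(pearson(burden, trait))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(burden, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical pathway and per-sample lists of genes overlapped by CNVs.
pathway = {"G1", "G2", "G3"}
samples = ([["X1"]] * 3 + [["G1"]] * 3 + [["G1", "G2"]] * 3
           + [["G1", "G2", "G3", "X2"]] * 3)
cholesterol = [150, 155, 160, 170, 175, 180, 190, 195, 200, 210, 215, 220]

burden = group_burden(samples, pathway)
p_value = permutation_p(burden, cholesterol)
print("burden:", burden, "permutation p =", p_value)
```

With 999 permutations the smallest attainable p-value is 0.001, so a threshold such as p < 0.01 requires at least a few hundred permutations per gene group.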
Vancomycin, a commonly used antibiotic, can be nephrotoxic. Known risk factors such as age, creatinine clearance, vancomycin dose / dosing interval, and concurrent nephrotoxic medications fail to accurately predict nephrotoxicity. To identify potential genomic risk factors, we performed a genome-wide association study (GWAS) of serum creatinine levels while on vancomycin in 489 European American individuals and validated findings in three independent cohorts totaling 439 European American individuals. In primary analyses, the chromosome 6q22.31 locus was associated with increased serum creatinine levels while on vancomycin therapy (most significant variant rs2789047, risk allele A, β = -0.06, p = 1.1 x 10-7). SNPs in this region had consistent directions of effect in the validation cohorts, with a meta-p of 1.1 x 10-7. Variation in this region on chromosome 6, which includes the genes TBC1D32/C6orf170 and GJA1 (encoding connexin43), may modulate risk of vancomycin-induced kidney injury.
The QT interval, an electrocardiographic measure reflecting myocardial repolarization, is a heritable trait. QT prolongation is a risk factor for ventricular arrhythmias and sudden cardiac death (SCD) and could indicate the presence of the potentially lethal Mendelian Long QT Syndrome (LQTS). Using a genome-wide association and replication study in up to 100,000 individuals, we identified 35 common variant QT interval loci that collectively explain ∼8-10% of QT variation and highlight the importance of calcium regulation in myocardial repolarization. Rare variant analysis of 6 novel QT loci in 298 unrelated LQTS probands identified coding variants not found in controls but of uncertain causality and therefore requiring validation. Several newly identified loci encode proteins that physically interact with other recognized repolarization proteins. Our integration of common variant association, expression and orthogonal protein-protein interaction screens provides new insights into cardiac electrophysiology and identifies novel candidate genes for ventricular arrhythmias, LQTS, and SCD.
genome-wide association study; QT interval; Long QT Syndrome; sudden cardiac death; myocardial repolarization; arrhythmias
The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
imputation; genome-wide association; eMERGE; electronic health records
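One way to relate allelic R2 to minor allele frequency is to mask genotypes that were actually typed, impute them, and compare. The sketch below assumes that masked-genotype setup and computes R2 as the squared Pearson correlation between imputed dosages and true allele counts; this differs from the model-based estimate that imputation software reports when true genotypes are unavailable. The function names and example values are hypothetical.

```python
def allelic_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true allele counts."""
    n = len(dosages)
    md, mg = sum(dosages) / n, sum(genotypes) / n
    sdg = sum((d - md) * (g - mg) for d, g in zip(dosages, genotypes))
    sdd = sum((d - md) ** 2 for d in dosages)
    sgg = sum((g - mg) ** 2 for g in genotypes)
    if sdd == 0 or sgg == 0:
        return 0.0  # undefined for monomorphic data; reported as 0 here
    return sdg * sdg / (sdd * sgg)

def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1.0 - freq)

# Hypothetical masked variant: true genotypes and imputed dosages.
true_genotypes = [0, 1, 2, 1, 0, 0, 1, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.0, 0.3, 0.8, 1.9]

print("MAF =", minor_allele_frequency(true_genotypes))
print("allelic R2 =", round(allelic_r2(imputed_dosages, true_genotypes), 3))
```

Binning variants by minor allele frequency and averaging R2 within each bin gives the relationship between the two metrics.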
Thyroid stimulating hormone (TSH) levels are normally tightly regulated within an individual; thus, relatively small variations may indicate thyroid disease. Genome-wide association studies (GWAS) have identified variants in PDE8B and FOXE1 that are associated with TSH levels. However, prior studies lacked racial/ethnic diversity, limiting the generalization of these findings to individuals of non-European ethnicities. The Electronic Medical Records and Genomics (eMERGE) Network is a collaboration across institutions with biobanks linked to electronic medical records (EMRs). The eMERGE Network uses EMR-derived phenotypes to perform GWAS in diverse populations for a variety of phenotypes. In this report, we identified serum TSH levels from 4,501 European American and 351 African American euthyroid individuals in the eMERGE Network with existing GWAS data. Tests of association were performed using linear regression and adjusted for age, sex, body mass index (BMI), and principal components, assuming an additive genetic model. Our results replicate the known association of PDE8B with serum TSH levels in European Americans (rs2046045 p = 1.85×10−17, β = 0.09). FOXE1 variants, associated with hypothyroidism, were not genome-wide significant (rs10759944: p = 1.08×10−6, β = −0.05). No SNPs reached genome-wide significance in African Americans. However, multiple known associations with TSH levels in European ancestry were nominally significant in African Americans, including PDE8B (rs2046045 p = 0.03, β = −0.09), VEGFA (rs11755845 p = 0.01, β = −0.13), and NFIA (rs334699 p = 1.50×10−3, β = −0.17). We found little evidence that SNPs previously associated with other thyroid-related disorders were associated with serum TSH levels in this study. These results support the previously reported association between PDE8B and serum TSH levels in European Americans and emphasize the need for additional genetic studies in more diverse populations.
Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, and has the advantage of fewer covariates and degrees of freedom.
principal component analysis; ancestry; biobank; loadings; genetic association study
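Deriving components from reference-sample loadings can be pictured as a projection: each study subject's genotypes are standardized with the reference sample's per-variant mean and standard deviation, then multiplied by the reference loadings. The sketch below assumes that the loadings have already been computed from a PCA of the reference sample; all numbers and names are hypothetical, and the details of the published protocol may differ.

```python
def project(genotypes, ref_mean, ref_sd, loadings):
    """Project one subject onto components defined by a reference sample.

    genotypes: allele counts (0/1/2), one per variant
    ref_mean, ref_sd: per-variant mean and standard deviation in the reference
    loadings: one list of per-variant weights for each component
    """
    z = [(g - m) / s for g, m, s in zip(genotypes, ref_mean, ref_sd)]
    return [sum(zi * w for zi, w in zip(z, component)) for component in loadings]

# Hypothetical reference summary for three variants and two components.
ref_mean = [0.4, 1.0, 1.6]
ref_sd = [0.8, 0.5, 0.8]
loadings = [
    [0.6, 0.0, -0.8],  # component 1
    [0.0, 1.0, 0.0],   # component 2
]

subject = [2, 1, 0]
print(project(subject, ref_mean, ref_sd, loadings))
```

Because the loadings are fixed by the reference sample, study-specific artifacts such as site or platform cannot reshape the components, which is consistent with the motivation given above.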
The objective of this study was to identify genetic variants associated with angiotensin-converting enzyme (ACE) inhibitor-associated angioedema.
Participants and methods
We carried out a genome-wide association study in 175 individuals with ACE inhibitor-associated angioedema and 489 ACE inhibitor-exposed controls from Nashville (Tennessee) and Marshfield (Wisconsin). We tested for replication in 19 cases and 57 controls who participated in the Ongoing Telmisartan Alone and in Combination with Ramipril Global Endpoint Trial (ONTARGET).
There were no genome-wide significant associations of any single-nucleotide polymorphism (SNP) with angioedema. Sixteen SNPs in African Americans and 41 SNPs in European Americans were associated moderately with angioedema (P<10−4) and evaluated for association in ONTARGET. The T allele of rs500766 in PRKCQ was associated with a reduced risk, whereas the G allele of rs2724635 in ETV6 was associated with an increased risk of ACE inhibitor-associated angioedema in the Nashville/Marshfield sample and ONTARGET. In a candidate gene analysis, rs989692 in the gene encoding neprilysin (MME), an enzyme that degrades bradykinin and substance P, was significantly associated with angioedema in ONTARGET and Nashville/Marshfield African Americans.
Unlike other serious adverse drug effects, ACE inhibitor-associated angioedema is not associated with a variant with a large effect size. Variants in MME and genes involved in immune regulation may be associated with ACE inhibitor-associated angioedema.
adverse drug event; angioedema; angiotensin-converting enzyme; neprilysin
Phenome-wide association studies (PheWAS) have demonstrated utility in validating genetic associations derived from traditional genetic studies as well as identifying novel genetic associations. Here we used an electronic health record (EHR)-based PheWAS to explore pleiotropy of genetic variants in the fat mass and obesity associated gene (FTO), some of which have been previously associated with obesity and type 2 diabetes (T2D). We used a population of 10,487 individuals of European ancestry with genome-wide genotyping from the Electronic Medical Records and Genomics (eMERGE) Network and another population of 13,711 individuals of European ancestry from the BioVU DNA biobank at Vanderbilt genotyped using Illumina HumanExome BeadChip. A meta-analysis of the two study populations replicated the well-described associations between FTO variants and obesity (odds ratio [OR] = 1.25, 95% Confidence Interval = 1.11–1.24, p = 2.10 × 10−9) and FTO variants and T2D (OR = 1.14, 95% CI = 1.08–1.21, p = 2.34 × 10−6). The meta-analysis also demonstrated that FTO variant rs8050136 was significantly associated with sleep apnea (OR = 1.14, 95% CI = 1.07–1.22, p = 3.33 × 10−5); however, the association was attenuated after adjustment for body mass index (BMI). Novel phenotype associations with obesity-associated FTO variants included fibrocystic breast disease (rs9941349, OR = 0.81, 95% CI = 0.74–0.91, p = 5.41 × 10−5) and trends toward associations with non-alcoholic liver disease and gram-positive bacterial infections. FTO variants not associated with obesity demonstrated other potential disease associations including non-inflammatory disorders of the cervix and chronic periodontitis. These results suggest that genetic variants in FTO may have pleiotropic associations, some of which are not mediated by obesity.
PheWAS; genetic association; pleiotropy; Exome chip; FTO; BMI
Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10−8 (eMERGE) and p=2.5×10−20 (CHARGE) and rs6795970 in SCN10A with p=6×10−6 (eMERGE) and p=5×10−27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
Genetic variants of the enzyme that metabolizes warfarin, cytochrome P-450 2C9 (CYP2C9), and of a key pharmacologic target of warfarin, vitamin K epoxide reductase (VKORC1), contribute to differences in patients’ responses to various warfarin doses, but the role of these variants during initial anticoagulation is not clear.
In 297 patients starting warfarin therapy, we assessed CYP2C9 genotypes (CYP2C9 *1, *2, and *3), VKORC1 haplotypes (designated A and non-A), clinical characteristics, response to therapy (as determined by the international normalized ratio [INR]), and bleeding events. The study outcomes were the time to the first INR within the therapeutic range, the time to the first INR of more than 4, the time above the therapeutic INR range, the INR response over time, and the warfarin dose requirement.
As compared with patients with the non-A/non-A haplotype, patients with the A/A haplotype of VKORC1 had a decreased time to the first INR within the therapeutic range (P = 0.02) and to the first INR of more than 4 (P = 0.003). In contrast, the CYP2C9 genotype was not a significant predictor of the time to the first INR within the therapeutic range (P = 0.57) but was a significant predictor of the time to the first INR of more than 4 (P = 0.03). Both the CYP2C9 genotype and VKORC1 haplotype had a significant influence on the required warfarin dose after the first 2 weeks of therapy.
Initial variability in the INR response to warfarin was more strongly associated with genetic variability in the pharmacologic target of warfarin, VKORC1, than with CYP2C9.
Alzheimer disease (AD) is a devastating neurodegenerative disease affecting more than five million Americans. In this study, we have used updated genetic linkage data from chromosome 10 in combination with expression data from serial analysis of gene expression to choose a new set of thirteen candidate genes for genetic analysis in late onset Alzheimer disease (LOAD). Results in this study identify the KIAA1462 locus as a candidate locus for LOAD in APOE4 carriers. Two genes exist at this locus: KIAA1462, a gene associated with coronary artery disease, and “rokimi”, encoding an untranslated spliced RNA. The genetic architecture at this locus suggests that the gene product important in this association is either “rokimi” or a different isoform of KIAA1462 than the isoform that is important in cardiovascular disease. Expression data suggest that isoform f of KIAA1462 is a more attractive candidate for association with LOAD in APOE4 carriers than “rokimi”, which had no detectable expression in brain.
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. 
These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II and also serve as a starting point for investigators merging multiple genotype data sets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
Drug-induced long QT syndrome (diLQTS) is an adverse drug effect that has an important impact on drug use, development, and regulation. Here, we tested the hypothesis that common variants in key genes controlling cardiac electrical properties modify the risk of diLQTS.
Methods and Results
In a case-control setting, we included 176 patients of European descent from North America and Europe with diLQTS, defined as documented torsades de pointes during treatment with a QT prolonging drug. Control samples were obtained from 207 patients of European ancestry who displayed <50 msec QT lengthening during initiation of therapy with a QT-prolonging drug, and 837 controls from the population based KORA study. Subjects were successfully genotyped at 1,424 single nucleotide polymorphisms (SNPs) in 18 candidate genes including 1,386 SNPs tagging common haplotype blocks, and 38 non-synonymous ion channel gene SNPs. For validation we used a set of cases (n=57) and population-based controls of European descent. The SNP KCNE1 D85N (rs1805128), known to modulate an important potassium current in the heart, predicted diLQTS with an odds ratio of 9.0 (95% confidence interval: 3.5–22.9). The variant allele was present in 8.6% of cases, 2.9% of drug-exposed controls, and 1.8% of population controls. In the validation cohort the variant allele was present in 3.5% of cases, and in 1.4% of controls.
This high-density candidate SNP approach identified a key potassium channel susceptibility allele that may be associated with the rare adverse drug reaction torsades de pointes.
candidate genes; death, sudden; SNP; torsade de pointes; adverse drug events
Genome-wide association studies (GWAS) are being conducted at an unprecedented rate in population-based cohorts and have increased our understanding of the pathophysiology of complex disease. The recent application of GWAS to clinic-based cohorts has also yielded genetic predictors of clinical outcomes. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. With each new dataset, new realities are discovered about GWAS data and best practices continue to be developed. The Genomics Workgroup of the National Human Genome Research Institute (NHGRI)-funded electronic Medical Records and Genomics (eMERGE) network has invested considerable effort in developing strategies for QC of these data. The lessons learned by this group will be valuable for other investigators dealing with large-scale genomic datasets. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the eMERGE network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. In this protocol we discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.
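As a concrete illustration of the marker-quality step listed above, the following minimal sketch filters markers by call rate and minor allele frequency. It is not the eMERGE protocol: the thresholds, the function names, and the genotype coding (0, 1, or 2 copies of one allele, with `None` for a missing call) are all assumptions made for the example.

```python
def marker_call_rate(genotypes):
    """Fraction of samples with a non-missing genotype at this marker."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)

def minor_allele_frequency(genotypes):
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)

def filter_markers(genotype_table, min_call_rate=0.98, min_maf=0.01):
    """Return names of markers passing both filters.

    genotype_table maps marker name -> list of per-sample genotypes.
    The default thresholds are illustrative assumptions only.
    """
    return [
        name
        for name, genotypes in genotype_table.items()
        if marker_call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    ]
```

Dedicated tools handle these filters at genome-wide scale; the sketch only makes the two definitions explicit.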
Multiple sclerosis (MS) is a debilitating neuroimmunological and neurodegenerative disease affecting more than 400,000 individuals in the United States. Population and family-based studies have suggested that there is a strong genetic component. Numerous genomic linkage screens have identified regions of interest for MS loci. Our own second-generation genome-wide linkage study identified a handful of non-MHC regions with suggestive linkage. Several of these regions were further examined using single-nucleotide polymorphisms (SNPs) with average spacing between SNPs of approximately 1.0 Mb in a dataset of 173 multiplex families. The results of that study provided further evidence for the involvement of the chromosome 1q43 region. This region is of particular interest given linkage evidence in studies of other autoimmune and inflammatory diseases, including rheumatoid arthritis and systemic lupus erythematosus. In this follow-up study, we saturated the region with ~700 SNPs (average spacing of 10 kb per SNP) in search of disease-associated variation. We found preliminary evidence to suggest that common variation within the RGS7 locus may be involved in disease susceptibility.
multiple sclerosis; linkage; association; 1q43; RGS7
A substantial body of research supports a genetic involvement in autism. Furthermore, results from various genomic screens implicate a region on chromosome 7q31 as harboring an autism susceptibility variant. We previously narrowed this 34 cM region to a 3 cM critical region (located between D7S496 and D7S2418) using the Collaborative Linkage Study of Autism (CLSA) chromosome 7 linked families. This interval encompasses about 4.5 Mb of genomic DNA and encodes over fifty known and predicted genes. Four candidate genes (NRCAM, LRRN3, KIAA0716, and LAMB1) in this region were chosen for examination based on their proximity to the marker most consistently cosegregating with autism in these families (D7S1817), their tissue expression patterns, and likely biological relevance to autism.
Thirty-six intronic and exonic single nucleotide polymorphisms (SNPs) and one microsatellite marker within and around these four candidate genes were genotyped in 30 chromosome 7q31 linked families. Multiple SNPs were used to provide as complete coverage as possible since linkage disequilibrium can vary dramatically across even very short distances within a gene. Analyses of these data used the Pedigree Disequilibrium Test for single markers and a multilocus likelihood ratio test.
As expected, linkage disequilibrium (LD) occurred within each of these genes, but we did not observe significant LD across genes. None of the polymorphisms in NRCAM, LRRN3, or KIAA0716 gave p < 0.05, suggesting that none of these genes is associated with autism susceptibility in this subset of chromosome 7-linked families. However, with LAMB1, the allelic association analysis revealed suggestive evidence for a positive association, including one individual SNP (p = 0.02) and three separate two-SNP haplotypes across the gene (p = 0.007, 0.012, and 0.012).
NRCAM, LRRN3, and KIAA0716 are unlikely to be involved in autism. There is some evidence that variation in or near the LAMB1 gene may be involved in autism.