The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
imputation; genome-wide association; eMERGE; electronic health records
Gene–gene interactions may contribute to the genetic variation underlying complex traits but have not always been taken fully into account. Statistical analyses that consider gene–gene interaction may increase the power of detecting associations, especially for low-marginal-effect markers, and may explain in part the “missing heritability.” Detecting pair-wise and higher-order interactions genome-wide requires enormous computational power. Filtering pipelines increase the computational speed by limiting the number of tests performed. We summarize existing filtering approaches to detect epistasis, after distinguishing the purposes that lead us to search for epistasis. Statistical filtering includes quality control on the basis of single marker statistics to avoid the analysis of bad and least informative data, and limits the search space for finding interactions. Biological filtering includes targeting specific pathways, integrating various databases based on known biological and metabolic pathways, gene function ontology and protein–protein interactions. It is increasingly possible to target single-nucleotide polymorphisms that have defined functions on gene expression, though not belonging to protein-coding genes. Filtering can improve the power of an interaction association study, but also increases the chance of missing important findings.
epistasis; genetic interaction; biological interaction; filtering pipeline; optimal search
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II and also serve as a starting point for investigators merging multiple genotype data sets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
Research in human genetics and genetic epidemiology has grown significantly over the previous decade, particularly in the field of pharmacogenomics. Pharmacogenomics presents an opportunity for rapid translation of associated genetic polymorphisms into diagnostic measures or tests to guide therapy as part of a move towards personalized medicine. Expansion in genotyping technology has cleared the way for widespread use of whole-genome genotyping in the effort to identify novel biology and new genetic markers associated with pharmacokinetic and pharmacodynamic endpoints. With new technology and methodology regularly becoming available for use in genetic studies, a discussion on the application of such tools becomes necessary. In particular, quality control criteria have evolved with the use of GWAS as we have come to understand potential systematic errors which can be introduced into the data during genotyping. There have been several replicated pharmacogenomic associations, some of which have moved to the clinic to enact change in treatment decisions. These examples of translation illustrate the strength of evidence necessary to successfully and effectively translate a genetic discovery. In this review, the design of pharmacogenomic association studies is examined with the goal of optimizing the impact and utility of this research. Issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
Epistasis; genotyping; personalized medicine; pharmacogenomics; quality control; statistics; study design
Since publication of the human genome in 2003, geneticists have been interested in risk variant associations to resolve the etiology of traits and complex diseases. The International HapMap Consortium undertook an effort to catalog all common variation across the genome (variants with a minor allele frequency (MAF) of at least 5% in one or more ethnic groups). HapMap along with advances in genotyping technology led to genome-wide association studies which have identified common variants associated with many traits and diseases. In 2008 the 1000 Genomes Project aimed to sequence 2500 individuals and identify rare variants and 99% of variants with a MAF of <1%.
To determine whether the 1000 Genomes Project includes all the variants in HapMap, we examined the overlap between single nucleotide polymorphisms (SNPs) genotyped in the two resources using merged phase II/III HapMap data and low coverage pilot data from 1000 Genomes.
Comparison of the two data sets showed that approximately 72% of HapMap SNPs were also found in 1000 Genomes Project pilot data. After filtering out HapMap variants with a MAF of <5% (separately for each population), 99% of HapMap SNPs were found in 1000 Genomes data.
Not all variants cataloged in HapMap are also cataloged in 1000 Genomes. This could affect decisions about which resource to use for SNP queries, rare variant validation, or imputation. Both the HapMap and 1000 Genomes Project databases are useful resources for human genetics, but it is important to understand the assumptions made and filtering strategies employed by these projects.
Genome-wide association studies (GWAS) are being conducted at an unprecedented rate in population-based cohorts and have increased our understanding of the pathophysiology of complex disease. The recent application of GWAS to clinic-based cohorts has also yielded genetic predictors of clinical outcomes. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. With each new dataset, new realities are discovered about GWAS data and best practices continue to be developed. The Genomics Workgroup of the National Human Genome Research Institute (NHGRI) funded electronic Medical Records and Genomics (eMERGE) network has invested considerable effort in developing strategies for QC of these data. The lessons learned by this group will be valuable for other investigators dealing with large scale genomic datasets. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the eMERGE network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. In this protocol we discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.
Gene-gene interactions are proposed as one important component of the genetic architecture of complex diseases, and are just beginning to be evaluated in the context of genome wide association studies (GWAS). In addition to detecting epistasis, a benefit to interaction analysis is that it also increases power to detect weak main effects. We conducted a knowledge-driven interaction analysis of a GWAS of 931 multiple sclerosis trios to discover gene-gene interactions within established biological contexts. We identify heterogeneous signals, including a gene-gene interaction between CHRM3 and MYLK (joint p = 0.0002), an interaction between two phospholipase-β isoforms, PLCβ1 & PLCβ4 (joint p = 0.0098), and a modest interaction between ACTN1 and MYH9 (joint p = 0.0326), all localized to calcium-signaled cytoskeletal regulation. Furthermore, we discover a main effect (joint p = 5.2E-5) previously unidentified by single-locus analysis within another related gene, SCIN, a calcium-binding cytoskeleton regulatory protein. This work illustrates that knowledge-driven interaction analysis of GWAS data is a feasible approach to identify new genetic effects. The results of this study are among the first gene-gene interactions and non-immune susceptibility loci for multiple sclerosis. Further, the implicated genes cluster within inter-related biological mechanisms that suggest a neurodegenerative component to multiple sclerosis.
As genetic epidemiology looks beyond mapping single disease susceptibility loci, interest in detecting epistatic interactions between genes has grown. The dimensionality and comparisons required to search the epistatic space and the inference for a significant result pose challenges for testing epistatic disease models. The Multifactor Dimensionality Reduction Pedigree Disequilibrium Test (MDR-PDT) was developed to test for multilocus models in pedigree data. In the present study we rigorously tested MDR-PDT with new cross-validation (CV) (both 5- and 10-fold) and omnibus model selection algorithms by simulating a range of heritabilities, odds ratios, minor allele frequencies, sample sizes, and numbers of interacting loci. Power was evaluated using 100, 500, and 1000 families, with minor allele frequencies 0.2 and 0.4 and broad-sense heritabilities of 0.005, 0.01, 0.03, 0.05, and 0.1 for 2 and 3-locus purely epistatic penetrance models. We also compared the prediction error measure of effect with a predicted matched odds ratio for final model selection and testing. We report that the CV procedure is valid with the permutation test, MDR-PDT performs similarly with 5 and 10- fold CV, and that the matched odds ratio is more powerful than prediction error as the fitness metric for MDR-PDT.
Epistasis; MDR-PDT; complex disease; family-based association; bioinformatics
In genetic studies of complex disease a consideration for the investigator is detection of joint effects. The Multifactor Dimensionality Reduction (MDR) algorithm searches for these effects with an exhaustive approach. Previously unknown aspects of MDR performance were the power to detect interactive effects given large numbers of non-model loci or varying degrees of heterogeneity among multiple epistatic disease models.
To address the performance with many non-model loci, datasets of 500 cases and 500 controls with 100 to 10,000 SNPs were simulated for two-locus models, and one hundred 500-case/500-control datasets with 100 and 500 SNPs were simulated for three-locus models. Multiple levels of locus heterogeneity were simulated in several sample sizes.
These results show MDR is robust to locus heterogeneity when the definition of power is not as conservative as in previous simulation studies where all model loci were required to be found by the method. The results also indicate that MDR performance is related more strongly to broad-sense heritability than sample size and is not greatly affected by non-model loci.
A study in which a population with high heritability estimates is sampled predisposes the MDR study to success more than a larger ascertainment in a population with smaller estimates.
Epistasis; MDR; Heterogeneity
This editorial introduces BioData Mining, a new journal which publishes research articles related to advances in computational methods and techniques for the extraction of useful knowledge from heterogeneous biological data. We outline the aims and scope of the journal, introduce the publishing model and describe the open peer review policy, which fosters interaction within the research community.
The QT interval, an electrocardiographic measure reflecting myocardial repolarization, is a heritable trait. QT prolongation is a risk factor for ventricular arrhythmias and sudden cardiac death (SCD) and could indicate the presence of the potentially lethal Mendelian Long QT Syndrome (LQTS). Using a genome-wide association and replication study in up to 100,000 individuals we identified 35 common variant QT interval loci, that collectively explain ∼8-10% of QT variation and highlight the importance of calcium regulation in myocardial repolarization. Rare variant analysis of 6 novel QT loci in 298 unrelated LQTS probands identified coding variants not found in controls but of uncertain causality and therefore requiring validation. Several newly identified loci encode for proteins that physically interact with other recognized repolarization proteins. Our integration of common variant association, expression and orthogonal protein-protein interaction screens provides new insights into cardiac electrophysiology and identifies novel candidate genes for ventricular arrhythmias, LQTS,and SCD.
genome-wide association study; QT interval; Long QT Syndrome; sudden cardiac death; myocardial repolarization; arrhythmias
Environment-wide association studies (EWAS) provide a way to uncover the environmental mechanisms involved in complex traits in a high-throughput manner. Genome-wide association studies have led to the discovery of genetic variants associated with many common diseases but do not take into account the environmental component of complex phenotypes. This EWAS assesses the comprehensive association between environmental variables and the outcome of type 2 diabetes (T2D) in the Marshfield Personalized Medicine Research Project Biobank (Marshfield PMRP). We sought replication in two National Health and Nutrition Examination Surveys (NHANES). The Marshfield PMRP currently uses four tools for measuring environmental exposures and outcome traits: 1) the PhenX Toolkit includes standardized exposure and phenotypic measures across several domains, 2) the Diet History Questionnaire (DHQ) is a food frequency questionnaire, 3) the Measurement of a Person's Habitual Physical Activity scores the level of an individual's physical activity, and 4) electronic health records (EHR) employs validated algorithms to establish T2D case-control status. Using PLATO software, 314 environmental variables were tested for association with T2D using logistic regression, adjusting for sex, age, and BMI in over 2,200 European Americans. When available, similar variables were tested with the same methods and adjustment in samples from NHANES III and NHANES 1999-2002. Twelve and 31 associations were identified in the Marshfield samples at p<0.01 and p<0.05, respectively. Seven and 13 measures replicated in at least one of the NHANES at p<0.01 and p<0.05, respectively, with the same direction of effect. The most significant environmental exposures associated with T2D status included decreased alcohol use as well as increased smoking exposure in childhood and adulthood. The results demonstrate the utility of the EWAS method and survey tools for identifying environmental components of complex diseases like type 2 diabetes. These high-throughput and comprehensive investigation methods can easily be applied to investigate the relation between environmental exposures and multiple phenotypes in future analyses.
Platelets are enucleated cell fragments derived from megakaryocytes that play key roles in hemostasis and in the pathogenesis of atherothrombosis and cancer. Platelet traits are highly heritable and identification of genetic variants associated with platelet traits and assessing their pleiotropic effects may help to understand the role of underlying biological pathways. We conducted an electronic medical record (EMR)-based study to identify common variants that influence inter-individual variation in the number of circulating platelets (PLT) and mean platelet volume (MPV), by performing a genome-wide association study (GWAS). We characterized association of variants influencing MPV and PLT using functional, pathway and disease enrichment analysis assess pleiotropic effects of such variants by performing a phenome-wide association study (PheWAS) with a wide range of EMR-derived phenotypes. A total of 13,582 participants in the electronic MEdical Records and GEnomic (eMERGE) network had data for PLT and 6,291 participants had data for MPV. We identified 5 chromosomal regions associated with PLT and 8 associated with MPV at genome-wide significance (P<5E-8). In addition, we replicated 20 SNPs (out of 56 SNPs (α: 0.05/56=9E-4)) influencing PLT and 22 SNPs (out of 29 SNPs (α: 0.05/29=2E-3)) influencing MPV in a meta-analysis of GWAS of PLT and MPV. While our GWAS did not reveal any novel associations, our functional analyses revealed that genes in these regions influence thrombopoiesis and encode kinases, membrane proteins, proteins involved in cellular trafficking, transcription factors, proteasome complex subunits, proteins of signal transduction pathways, proteins involved in megakaryocyte development and platelet production and hemostasis. PheWAS using a single-SNP Bonferroni correction for 1368 diagnoses (0.05/1368=3.6E-5) revealed that several variants in these genes have pleiotropic associations with myocardial infarction, autoimmune and hematologic disorders. We conclude that multiple genetic loci influence interindividual variation in platelet traits and also have significant pleiotropic effects; the related genes are in multiple functional pathways including those relevant to thrombopoiesis.
We performed a Phenome-wide association study (PheWAS) utilizing diverse genotypic and phenotypic data existing across multiple populations in the National Health and Nutrition Examination Surveys (NHANES), conducted by the Centers for Disease Control and Prevention (CDC), and accessed by the Epidemiological Architecture for Genes Linked to Environment (EAGLE) study. We calculated comprehensive tests of association in Genetic NHANES using 80 SNPs and 1,008 phenotypes (grouped into 184 phenotype classes), stratified by race-ethnicity. Genetic NHANES includes three surveys (NHANES III, 1999–2000, and 2001–2002) and three race-ethnicities: non-Hispanic whites (n = 6,634), non-Hispanic blacks (n = 3,458), and Mexican Americans (n = 3,950). We identified 69 PheWAS associations replicating across surveys for the same SNP, phenotype-class, direction of effect, and race-ethnicity at p<0.01, allele frequency >0.01, and sample size >200. Of these 69 PheWAS associations, 39 replicated previously reported SNP-phenotype associations, 9 were related to previously reported associations, and 21 were novel associations. Fourteen results had the same direction of effect across more than one race-ethnicity: one result was novel, 11 replicated previously reported associations, and two were related to previously reported results. Thirteen SNPs showed evidence of pleiotropy. We further explored results with gene-based biological networks, contrasting the direction of effect for pleiotropic associations across phenotypes. One PheWAS result was ABCG2 missense SNP rs2231142, associated with uric acid levels in both non-Hispanic whites and Mexican Americans, protoporphyrin levels in non-Hispanic whites and Mexican Americans, and blood pressure levels in Mexican Americans. Another example was SNP rs1800588 near LIPC, significantly associated with the novel phenotypes of folate levels (Mexican Americans), vitamin E levels (non-Hispanic whites) and triglyceride levels (non-Hispanic whites), and replication for cholesterol levels. The results of this PheWAS show the utility of this approach for exposing more of the complex genetic architecture underlying multiple traits, through generating novel hypotheses for future research.
The Epidemiological Architecture for Genes Linked to Environment (EAGLE) study performed a Phenome-Wide Association Study (PheWAS) to investigate comprehensive associations between a wide range of phenotypes and single-nucleotide polymorphisms using the diverse genotypic and phenotypic data that exists across multiple populations in the National Health and Nutrition Examination Surveys (NHANES), conducted by the Centers for Disease Control and Prevention (CDC). In this study, we replicated known genotype-phenotype associations, identified genotypes associated with phenotypes related to previously reported associations, and most importantly, identified a series of novel genotype-phenotype associations. We also identified potential pleiotropy; that is, SNPs associated with more than one phenotype. We explored the features of these PheWAS results, characterizing any potential functionality of the SNPs of this study, determining association results that were found in more than one racial/ethnic group for the same SNP and phenotype, identifying novel direction of effect relationships for SNPs demonstrating potential pleiotropy, and investigating the association results in the context of gene-based biological networks. Through considering the SNP associations on multiple phenotypic outcomes, as well as through exploring pleiotropy, we may be able to leverage the results of PheWAS to uncover more of the complex underlying genomic architecture of complex traits.
Thyroid stimulating hormone (TSH) hormone levels are normally tightly regulated within an individual; thus, relatively small variations may indicate thyroid disease. Genome-wide association studies (GWAS) have identified variants in PDE8B and FOXE1 that are associated with TSH levels. However, prior studies lacked racial/ethnic diversity, limiting the generalization of these findings to individuals of non-European ethnicities. The Electronic Medical Records and Genomics (eMERGE) Network is a collaboration across institutions with biobanks linked to electronic medical records (EMRs). The eMERGE Network uses EMR-derived phenotypes to perform GWAS in diverse populations for a variety of phenotypes. In this report, we identified serum TSH levels from 4,501 European American and 351 African American euthyroid individuals in the eMERGE Network with existing GWAS data. Tests of association were performed using linear regression and adjusted for age, sex, body mass index (BMI), and principal components, assuming an additive genetic model. Our results replicate the known association of PDE8B with serum TSH levels in European Americans (rs2046045 p = 1.85×10−17, β = 0.09). FOXE1 variants, associated with hypothyroidism, were not genome-wide significant (rs10759944: p = 1.08×10−6, β = −0.05). No SNPs reached genome-wide significance in African Americans. However, multiple known associations with TSH levels in European ancestry were nominally significant in African Americans, including PDE8B (rs2046045 p = 0.03, β = −0.09), VEGFA (rs11755845 p = 0.01, β = −0.13), and NFIA (rs334699 p = 1.50×10−3, β = −0.17). We found little evidence that SNPs previously associated with other thyroid-related disorders were associated with serum TSH levels in this study. These results support the previously reported association between PDE8B and serum TSH levels in European Americans and emphasize the need for additional genetic studies in more diverse populations.
Objective: We report the first pediatric specific Phenome-Wide Association Study (PheWAS) using electronic medical records (EMRs). Given the early success of PheWAS in adult populations, we investigated the feasibility of this approach in pediatric cohorts in which associations between a previously known genetic variant and a wide range of clinical or physiological traits were evaluated. Although computationally intensive, this approach has potential to reveal disease mechanistic relationships between a variant and a network of phenotypes.
Method: Data on 5049 samples of European ancestry were obtained from the EMRs of two large academic centers in five different genotyped cohorts. Recently, these samples have undergone whole genome imputation. After standard quality controls, removing missing data and outliers based on principal components analyses (PCA), 4268 samples were used for the PheWAS study. We scanned for associations between 2476 single-nucleotide polymorphisms (SNP) with available genotyping data from previously published GWAS studies and 539 EMR-derived phenotypes. The false discovery rate was calculated and, for any new PheWAS findings, a permutation approach (with up to 1,000,000 trials) was implemented.
Results: This PheWAS found a variety of common variants (MAF > 10%) with prior GWAS associations in our pediatric cohorts including Juvenile Rheumatoid Arthritis (JRA), Asthma, Autism and Pervasive Developmental Disorder (PDD) and Type 1 Diabetes with a false discovery rate < 0.05 and power of study above 80%. In addition, several new PheWAS findings were identified including a cluster of association near the NDFIP1 gene for mental retardation (best SNP rs10057309, p = 4.33 × 10−7, OR = 1.70, 95%CI = 1.38 − 2.09); association near PLCL1 gene for developmental delays and speech disorder [best SNP rs1595825, p = 1.13 × 10−8, OR = 0.65(0.57 − 0.76)]; a cluster of associations in the IL5-IL13 region with Eosinophilic Esophagitis (EoE) [best at rs12653750, p = 3.03 × 10−9, OR = 1.73 95%CI = (1.44 − 2.07)], previously implicated in asthma, allergy, and eosinophilia; and association of variants in GCKR and JAZF1 with allergic rhinitis in our pediatric cohorts [best SNP rs780093, p = 2.18 × 10−5, OR = 1.39, 95%CI = (1.19 − 1.61)], previously demonstrated in metabolic disease and diabetes in adults.
Conclusion: The PheWAS approach with re-mapping ICD-9 structured codes for our European-origin pediatric cohorts, as with the previous adult studies, finds many previously reported associations as well as presents the discovery of associations with potentially important clinical implications.
PheWAS; ICD-9 code; genetic polymorphism
Multimodal brain image analysis : third International Workshop, MBIA 2013, held in conjunction with MICCAI 2013, Nagoya, Japan, September 22, 2013 : proceedings / Li Shen, Tianming Liu, Pew-Thian Yap, Heng Huang, Dinggang Shen, Carl-Fre.
Alzheimer's disease (AD) is the most common cause of dementia in older adults. By the time an individual has been diagnosed with AD, it may be too late for potential disease modifying therapy to strongly influence outcome. Therefore, it is critical to develop better diagnostic tools that can recognize AD at early symptomatic and especially pre-symptomatic stages. Mild cognitive impairment (MCI), introduced to describe a prodromal stage of AD, is presently classified into early and late stages (E-MCI, L-MCI) based on severity. Using a graph-based semi-supervised learning (SSL) method to integrate multimodal brain imaging data and select valid imaging-based predictors for optimizing prediction accuracy, we developed a model to differentiate E-MCI from healthy controls (HC) for early detection of AD. Multimodal brain imaging scans (MRI and PET) of 174 E-MCI and 98 HC participants from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort were used in this analysis. Mean targeted region-of-interest (ROI) values extracted from structural MRI (voxel-based morphometry (VBM) and FreeSurfer V5) and PET (FDG and Florbetapir) scans were used as features. Our results show that the graph-based SSL classifiers outperformed support vector machines for this task and the best performance was obtained with 66.8% cross-validated AUC (area under the ROC curve) when FDG and FreeSurfer datasets were integrated. Valid imaging-based phenotypes selected from our approach included ROI values extracted from temporal lobe, hippocampus, and amygdala. Employing a graph-based SSL approach with multimodal brain imaging data appears to have substantial potential for detecting E-MCI for early detection of prodromal AD warranting further investigation.
Mild Cognitive Impairment; Multimodal Brain Imaging Data; Data Integration; Graph-based Semi-Supervised Learning; Alzheimer's Disease
Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, has the advantage of fewer covariates and degrees of freedom.
principal component analysis; ancestry; biobank; loadings; genetic association study
Studies in persons of European descent have suggested that mitochondrial DNA (mtDNA) haplogroups influence antiretroviral therapy (ART) toxicity. We explored associations between mtDNA variants and changes in endothelial function and biomarkers among non-Hispanic white, ART-naive subjects starting ART. A5152s was a substudy of A5142, a randomized trial of initial class-sparing ART regimens that included efavirenz or lopinavir/ritonavir with nucleoside reverse transcriptase inhibitors (NRTIs), or both without NRTIs. Brachial artery flow-mediated dilation (FMD) and cardiovascular biomarker assessments were performed at baseline and at weeks 4 and 24. Ten haplogroup-defining mtDNA polymorphisms were determined. FMD and biomarker changes from baseline to week 24 by mtDNA variant were assessed using Wilcoxon rank-sum tests. Thirty-nine non-Hispanic white participants had DNA and 24-week data. The nonsynonymous m.10398A>G mtDNA polymorphism (N=8) was associated with higher median baseline adiponectin (5.0 vs. 4.2 μg/ml; p=0.003), greater absolute (−1.9 vs. −0.2 μg/ml) and relative (−33% vs. −3%) adiponectin decreases (p<0.001 for both), and lower week 24 brachial artery FMD (3.6% vs. 5.4%; p=0.04). Individual mtDNA haplogroups, including haplogroups H (N=13) and U (N=6), were not associated with adiponectin or FMD changes. In this small pilot study, adiponectin and brachial artery FMD on ART differed in non-Hispanic whites with a nonsynonymous mtDNA variant associated with several human diseases. These preliminary findings support the hypothesis that mtDNA variation influences metabolic ART effects. Validation studies in larger populations and in different racial/ethnic groups that include m.10398G carriers are needed.
Effective cancer clinical outcome prediction for understanding of the mechanism of various types of cancer has been pursued using molecular-based data such as gene expression profiles, an approach that has promise for providing better diagnostics and supporting further therapies. However, clinical outcome prediction based on gene expression profiles varies between independent data sets. Further, single-gene expression outcome prediction is limited for cancer evaluation since genes do not act in isolation, but rather interact with other genes in complex signaling or regulatory networks. In addition, since pathways are more likely to co-operate together, it would be desirable to incorporate expert knowledge to combine pathways in a useful and informative manner.
Thus, we propose a novel approach for identifying knowledge-driven genomic interactions and applying it to discover models associated with cancer clinical phenotypes using grammatical evolution neural networks (GENN). In order to demonstrate the utility of the proposed approach, an ovarian cancer data from the Cancer Genome Atlas (TCGA) was used for predicting clinical stage as a pilot project.
We identified knowledge-driven genomic interactions associated with cancer stage from single knowledge bases such as sources of pathway-pathway interaction, but also knowledge-driven genomic interactions across different sets of knowledge bases such as pathway-protein family interactions by integrating different types of information. Notably, an integration model from different sources of biological knowledge achieved 78.82% balanced accuracy and outperformed the top models with gene expression or single knowledge-based data types alone. Furthermore, the results from the models are more interpretable because they are framed in the context of specific biological pathways or other expert knowledge.
The success of the pilot study we have presented herein will allow us to pursue further identification of models predictive of clinical cancer survival and recurrence. Understanding the underlying tumorigenesis and progression in ovarian cancer through the global view of interactions within/between different biological knowledge sources has the potential for providing more effective screening strategies and therapeutic targets for many types of cancer.
Knowledge-driven genomic interaction; Integrative analysis; Grammatical evolution neural network; Clinical outcome prediction; Ovarian cancer
Common obesity risk variants have been associated with macronutrient intake; however, these associations' generalizability across populations has not been demonstrated. We investigated the associations between 6 obesity risk variants in (or near) the NEGR1, TMEM18, BDNF, FTO, MC4R, and KCTD15 genes and macronutrient intake (carbohydrate, protein, ethanol, and fat) in 3 Population Architecture using Genomics and Epidemiology (PAGE) studies: the Multiethnic Cohort Study (1993–2006) (n = 19,529), the Atherosclerosis Risk in Communities Study (1987–1989) (n = 11,114), and the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) Study, which accesses data from the Third National Health and Nutrition Examination Survey (1991–1994) (n = 6,347). We used linear regression, with adjustment for age, sex, and ethnicity, to estimate the associations between obesity risk genotypes and macronutrient intake. A fixed-effects meta-analysis model showed that the FTO rs8050136 A allele (n = 36,973) was positively associated with percentage of calories derived from fat (βmeta = 0.2244 (standard error, 0.0548); P = 4 × 10−5) and inversely associated with percentage of calories derived from carbohydrate (βmeta = −0.2796 (standard error, 0.0709); P = 8 × 10−5). In the Multiethnic Cohort Study, percentage of calories from fat assessed at baseline was a partial mediator of the rs8050136 effect on body mass index (weight (kg)/height (m)2) obtained at 10 years of follow-up (mediation of effect = 0.0823 kg/m2, 95% confidence interval: 0.0559, 0.1128). Our data provide additional evidence that the association of FTO with obesity is partially mediated by dietary intake.
energy intake; fat mass and obesity-associated (FTO) gene; obesity; percent calories from fat; race/ethnicity
Phenome-wide association studies (PheWAS) have demonstrated utility in validating genetic associations derived from traditional genetic studies as well as identifying novel genetic associations. Here we used an electronic health record (EHR)-based PheWAS to explore pleiotropy of genetic variants in the fat mass and obesity associated gene (FTO), some of which have been previously associated with obesity and type 2 diabetes (T2D). We used a population of 10,487 individuals of European ancestry with genome-wide genotyping from the Electronic Medical Records and Genomics (eMERGE) Network and another population of 13,711 individuals of European ancestry from the BioVU DNA biobank at Vanderbilt genotyped using Illumina HumanExome BeadChip. A meta-analysis of the two study populations replicated the well-described associations between FTO variants and obesity (odds ratio [OR] = 1.25, 95% Confidence Interval = 1.11–1.24, p = 2.10 × 10−9) and FTO variants and T2D (OR = 1.14, 95% CI = 1.08–1.21, p = 2.34 × 10−6). The meta-analysis also demonstrated that FTO variant rs8050136 was significantly associated with sleep apnea (OR = 1.14, 95% CI = 1.07–1.22, p = 3.33 × 10−5); however, the association was attenuated after adjustment for body mass index (BMI). Novel phenotype associations with obesity-associated FTO variants included fibrocystic breast disease (rs9941349, OR = 0.81, 95% CI = 0.74–0.91, p = 5.41 × 10−5) and trends toward associations with non-alcoholic liver disease and gram-positive bacterial infections. FTO variants not associated with obesity demonstrated other potential disease associations including non-inflammatory disorders of the cervix and chronic periodontitis. These results suggest that genetic variants in FTO may have pleiotropic associations, some of which are not mediated by obesity.
PheWAS; genetic association; pleiotropy; Exome chip; FTO; BMI
Electrocardiographic (ECG) measurements vary by ancestry. Genome-wide association studies (GWAS) have identified loci that contribute to ECG measurements; however most are performed in Europeans collected from population-based cohorts or surveys. The strongest associations reported are in NOS1AP with QT interval and SCN10A with PR and QRS durations. The extent to which these associations can be generalized to African Americans has yet to be determined. Using electronic medical records, PR and QT intervals, QRS duration, and heart rate were determined in 455 African Americans as part of the Vanderbilt Genome-Electronic Records Project and Northwestern University NUgene Project. We tested for an association between these ECG traits and >930K SNPs. We identified a total 36 novel associations with PR interval, QRS duration, QT interval, and heart rate at p< 1.0 ×10−6. Using published GWAS data, we compared our results with those previously identified in other populations. Five associations originally identified in other populations generalized with respect to statistical significance and direction of effect. A total of 43 associations have a consistent direction of effect with European and/or Asian populations. This work provides a catalogue of generalized versus non-generalized associations, a necessary step in prioritizing GWAS-identified regions for further fine-mapping in diverse populations.
Electrocardiography; African Americans; GWAS; Generalization; Electronic Medical Records
In omic research, such as genome wide association studies, researchers seek to repeat their results in other datasets to reduce false positive findings and thus provide evidence for the existence of true associations. Unfortunately this standard validation approach cannot completely eliminate false positive conclusions, and it can also mask many true associations that might otherwise advance our understanding of pathology. These issues beg the question: How can we increase the amount of knowledge gained from high throughput genetic data? To address this challenge, we present an approach that complements standard statistical validation methods by drawing attention to both potential false negative and false positive conclusions, as well as providing broad information for directing future research. The Diverse Convergent Evidence approach (DiCE) we propose integrates information from multiple sources (omics, informatics, and laboratory experiments) to estimate the strength of the available corroborating evidence supporting a given association. This process is designed to yield an evidence metric that has utility when etiologic heterogeneity, variable risk factor frequencies, and a variety of observational data imperfections might lead to false conclusions. We provide proof of principle examples in which DiCE identified strong evidence for associations that have established biological importance, when standard validation methods alone did not provide support. If used as an adjunct to standard validation methods this approach can leverage multiple distinct data types to improve genetic risk factor discovery/validation, promote effective science communication, and guide future research directions.
Replication; Validation; Complex disease; Heterogeneity; GWAS; Omics; Type 2 error; Type 1 error; False negatives; False positives