Gene–gene interactions may contribute to the genetic variation underlying complex traits but have not always been taken fully into account. Statistical analyses that consider gene–gene interaction may increase the power of detecting associations, especially for low-marginal-effect markers, and may explain in part the “missing heritability.” Detecting pair-wise and higher-order interactions genome-wide requires enormous computational power. Filtering pipelines increase the computational speed by limiting the number of tests performed. We summarize existing filtering approaches to detect epistasis, after distinguishing the purposes that lead us to search for epistasis. Statistical filtering includes quality control on the basis of single marker statistics to avoid the analysis of bad and least informative data, and limits the search space for finding interactions. Biological filtering includes targeting specific pathways, integrating various databases based on known biological and metabolic pathways, gene function ontology and protein–protein interactions. It is increasingly possible to target single-nucleotide polymorphisms that have defined functions on gene expression, though not belonging to protein-coding genes. Filtering can improve the power of an interaction association study, but also increases the chance of missing important findings.
epistasis; genetic interaction; biological interaction; filtering pipeline; optimal search
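The computational burden described above follows directly from the combinatorics of the search space. As a minimal sketch, the counts below show how the number of pairwise and three-way tests scales with the number of markers, and how much a filter that retains a subset of markers reduces it. The marker counts (500,000 genome-wide, 5,000 after filtering) are hypothetical values chosen for illustration, not figures from the text.

```python
from math import comb


def interaction_tests(n_markers: int, order: int = 2) -> int:
    """Number of distinct marker combinations of the given interaction order."""
    return comb(n_markers, order)


# Hypothetical marker counts, for illustration only.
genome_wide = 500_000
after_filtering = 5_000

pairs_all = interaction_tests(genome_wide)
pairs_filtered = interaction_tests(after_filtering)
triples_all = interaction_tests(genome_wide, order=3)

print(f"pairwise tests, unfiltered: {pairs_all:,}")
print(f"pairwise tests, filtered:   {pairs_filtered:,}")
print(f"three-way tests, unfiltered: {triples_all:,}")
print(f"reduction factor (pairwise): {pairs_all / pairs_filtered:,.0f}")
```

Reducing the marker set by a factor of 100 reduces the pairwise search by roughly a factor of 10,000, which is the trade-off between speed and the risk of discarding a true interaction that the abstract describes.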
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. 
These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II and also serve as a starting point for investigators merging multiple genotype data sets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
Research in human genetics and genetic epidemiology has grown significantly over the previous decade, particularly in the field of pharmacogenomics. Pharmacogenomics presents an opportunity for rapid translation of associated genetic polymorphisms into diagnostic measures or tests to guide therapy as part of a move towards personalized medicine. Expansion in genotyping technology has cleared the way for widespread use of whole-genome genotyping in the effort to identify novel biology and new genetic markers associated with pharmacokinetic and pharmacodynamic endpoints. With new technology and methodology regularly becoming available for use in genetic studies, a discussion on the application of such tools becomes necessary. In particular, quality control criteria have evolved with the use of GWAS as we have come to understand potential systematic errors which can be introduced into the data during genotyping. There have been several replicated pharmacogenomic associations, some of which have moved to the clinic to enact change in treatment decisions. These examples of translation illustrate the strength of evidence necessary to successfully and effectively translate a genetic discovery. In this review, the design of pharmacogenomic association studies is examined with the goal of optimizing the impact and utility of this research. Issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
Epistasis; genotyping; personalized medicine; pharmacogenomics; quality control; statistics; study design
Since publication of the human genome in 2003, geneticists have been interested in risk variant associations to resolve the etiology of traits and complex diseases. The International HapMap Consortium undertook an effort to catalog all common variation across the genome (variants with a minor allele frequency (MAF) of at least 5% in one or more ethnic groups). HapMap along with advances in genotyping technology led to genome-wide association studies which have identified common variants associated with many traits and diseases. In 2008 the 1000 Genomes Project aimed to sequence 2500 individuals and identify rare variants and 99% of variants with a MAF of <1%.
To determine whether the 1000 Genomes Project includes all the variants in HapMap, we examined the overlap between single nucleotide polymorphisms (SNPs) genotyped in the two resources using merged phase II/III HapMap data and low coverage pilot data from 1000 Genomes.
Comparison of the two data sets showed that approximately 72% of HapMap SNPs were also found in 1000 Genomes Project pilot data. After filtering out HapMap variants with a MAF of <5% (separately for each population), 99% of HapMap SNPs were found in 1000 Genomes data.
Not all variants cataloged in HapMap are also cataloged in 1000 Genomes. This could affect decisions about which resource to use for SNP queries, rare variant validation, or imputation. Both the HapMap and 1000 Genomes Project databases are useful resources for human genetics, but it is important to understand the assumptions made and filtering strategies employed by these projects.
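The overlap comparison can be illustrated with a small sketch. It assumes each resource is reduced to a set of SNP identifiers and that per-population minor allele frequencies are available for the reference set. The identifiers, population labels, and frequencies below are invented toy data, and the function names are hypothetical.

```python
def overlap_fraction(reference_snps, other_snps):
    """Fraction of reference SNPs that also appear in the other resource."""
    reference = set(reference_snps)
    if not reference:
        return 0.0
    return len(reference & set(other_snps)) / len(reference)


def common_snps(maf_by_snp, threshold=0.05):
    """SNPs whose MAF in one population is at least the threshold."""
    return {snp for snp, maf in maf_by_snp.items() if maf >= threshold}


# Toy data: MAF per SNP within each (hypothetical) population.
reference_maf = {
    "pop1": {"snp1": 0.30, "snp2": 0.02, "snp3": 0.12, "snp4": 0.01},
    "pop2": {"snp1": 0.25, "snp2": 0.08, "snp3": 0.03, "snp4": 0.00},
}
other_resource = {"snp1", "snp3", "snp5"}

for population, maf_by_snp in reference_maf.items():
    unfiltered = overlap_fraction(maf_by_snp, other_resource)
    filtered = overlap_fraction(common_snps(maf_by_snp), other_resource)
    print(f"{population}: all SNPs {unfiltered:.0%}, MAF >= 5% {filtered:.0%}")
```

The filter is applied separately within each population, mirroring the per-population filtering described in the results.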
Genome-wide association studies (GWAS) are being conducted at an unprecedented rate in population-based cohorts and have increased our understanding of the pathophysiology of complex disease. The recent application of GWAS to clinic-based cohorts has also yielded genetic predictors of clinical outcomes. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. With each new dataset, new realities are discovered about GWAS data and best practices continue to be developed. The Genomics Workgroup of the National Human Genome Research Institute (NHGRI)-funded electronic Medical Records and Genomics (eMERGE) network has invested considerable effort in developing strategies for QC of these data. The lessons learned by this group will be valuable for other investigators dealing with large-scale genomic datasets. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the eMERGE network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. In this protocol we discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.
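As one illustration of the marker-quality step, the sketch below computes a per-marker call rate and minor allele frequency and keeps markers that pass both. The genotype coding (0, 1, 2 copies of one allele, `None` for a missing call), the thresholds, and the function names are assumptions made for this example; they are not the eMERGE pipeline's actual settings or code.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from allele counts coded 0/1/2, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def passing_markers(genotype_table, min_call_rate=0.98, min_maf=0.01):
    """Markers meeting both thresholds (illustrative values only)."""
    return [
        marker
        for marker, genotypes in genotype_table.items()
        if call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    ]


# Toy genotype table: marker name -> genotypes for ten samples.
toy_table = {
    "m1": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1],        # well called, common
    "m2": [0, 0, None, None, 0, 1, 0, 0, 0, 0],  # low call rate
    "m3": [0] * 10,                              # monomorphic
}
print(passing_markers(toy_table))
```

Sample-level checks such as sex anomalies, relatedness, and batch effects need additional data and are outside the scope of this sketch.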
Gene-gene interactions are proposed as one important component of the genetic architecture of complex diseases, and are just beginning to be evaluated in the context of genome wide association studies (GWAS). In addition to detecting epistasis, a benefit to interaction analysis is that it also increases power to detect weak main effects. We conducted a knowledge-driven interaction analysis of a GWAS of 931 multiple sclerosis trios to discover gene-gene interactions within established biological contexts. We identify heterogeneous signals, including a gene-gene interaction between CHRM3 and MYLK (joint p = 0.0002), an interaction between two phospholipase-β isoforms, PLCβ1 & PLCβ4 (joint p = 0.0098), and a modest interaction between ACTN1 and MYH9 (joint p = 0.0326), all localized to calcium-signaled cytoskeletal regulation. Furthermore, we discover a main effect (joint p = 5.2E-5) previously unidentified by single-locus analysis within another related gene, SCIN, a calcium-binding cytoskeleton regulatory protein. This work illustrates that knowledge-driven interaction analysis of GWAS data is a feasible approach to identify new genetic effects. The results of this study are among the first gene-gene interactions and non-immune susceptibility loci for multiple sclerosis. Further, the implicated genes cluster within inter-related biological mechanisms that suggest a neurodegenerative component to multiple sclerosis.
A number of genetic variants have been discovered by recent genome-wide association studies for their associations with clinical coronary heart disease (CHD). However, it is unclear whether these variants are also associated with the development of CHD as measured by subclinical atherosclerosis phenotypes, ankle brachial index (ABI), carotid artery intima-media thickness (cIMT) and carotid plaque.
Ten CHD risk single nucleotide polymorphisms (SNPs) were genotyped in individuals of European American (EA), African American (AA), American Indian (AI), and Mexican American (MA) ancestry in the Population Architecture using Genomics and Epidemiology (PAGE) study. In each individual study, we performed linear or logistic regression to examine population-specific associations between SNPs and ABI, common and internal cIMT, and plaque. The results from individual studies were meta-analyzed using a fixed effect inverse variance weighted model.
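A fixed-effect inverse-variance-weighted meta-analysis combines per-study effect estimates by weighting each with the reciprocal of its variance. The sketch below shows the standard calculation; the input values are invented and the function name is hypothetical. For logistic regression results the effect estimates would be log odds ratios.

```python
from math import erfc, sqrt


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted pooled estimate, SE, z score, two-sided p."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal null distribution.
    p_value = erfc(abs(z) / sqrt(2.0))
    return pooled, pooled_se, z, p_value


# Invented per-study estimates for one SNP (not results from the study).
pooled, pooled_se, z, p_value = fixed_effect_meta(
    betas=[0.10, 0.30], standard_errors=[0.10, 0.10]
)
print(f"pooled beta = {pooled:.3f}, SE = {pooled_se:.3f}, p = {p_value:.4f}")
```

With equal standard errors the pooled estimate reduces to the simple mean of the study estimates, which gives a quick sanity check on the weighting.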
None of the ten SNPs was significantly associated with ABI or with common or internal cIMT after Bonferroni correction. In the sample of 13,337 EA, 3,809 AA, and 5,353 AI individuals with carotid plaque measurement, the GCKR SNP rs780094 was significantly associated with the presence of plaque in AI only (OR = 1.32, 95% confidence interval: 1.17, 1.49, P = 1.08 × 10^−5), but not in the other populations (P = 0.90 in EA and P = 0.99 in AA). A 9p21 region SNP, rs1333049, was nominally associated with plaque in EA (OR = 1.07, P = 0.02) and in AI (OR = 1.10, P = 0.05).
We identified a significant association between rs780094 and plaque in AI populations, which needs to be replicated in future studies. There was little evidence that the index CHD risk variants identified through genome-wide association studies in EA influence the development of CHD through subclinical atherosclerosis as assessed by cIMT and ABI across ancestries.
ankle brachial index; carotid artery intima-media thickness; carotid plaque; coronary heart disease; genetic association study; multiethnic populations; subclinical atherosclerosis
While accurate measures of heritability are needed to understand the pharmacogenetic basis of drug treatment response, these are generally not available, since it is infeasible to give medications to individuals for whom treatment is not indicated. Using a polygenic linear mixed modeling (LMM) approach, we estimated lower bounds on asthma heritability and the heritability of two related drug-response phenotypes, bronchodilator response and airway hyperresponsiveness, using genome-wide SNP data from existing asthma cohorts. Our estimate of the heritability for bronchodilator response is 28.5% (se 16%, p = 0.043) and for airway hyperresponsiveness is 51.1% (se 34%, p = 0.064), while we estimate asthma genetic liability at 61.5% (se 16%, p < 0.001). Our results agree with previously published estimates of the heritability of these traits, suggesting that the LMM method is useful for computing the heritability of other pharmacogenetic traits. Furthermore, our results indicate that multiple SNP main effects, including SNPs as yet unidentified by GWAS methods, together explain a sizable portion of the heritability of these traits.
Asthma; Pharmacogenetics; Heritability; Bronchodilator Response; Airway Hyperresponsiveness
As genetic epidemiology looks beyond mapping single disease susceptibility loci, interest in detecting epistatic interactions between genes has grown. The dimensionality and comparisons required to search the epistatic space and the inference for a significant result pose challenges for testing epistatic disease models. The Multifactor Dimensionality Reduction Pedigree Disequilibrium Test (MDR-PDT) was developed to test for multilocus models in pedigree data. In the present study we rigorously tested MDR-PDT with new cross-validation (CV) (both 5- and 10-fold) and omnibus model selection algorithms by simulating a range of heritabilities, odds ratios, minor allele frequencies, sample sizes, and numbers of interacting loci. Power was evaluated using 100, 500, and 1000 families, with minor allele frequencies of 0.2 and 0.4 and broad-sense heritabilities of 0.005, 0.01, 0.03, 0.05, and 0.1 for 2- and 3-locus purely epistatic penetrance models. We also compared the prediction error measure of effect with a predicted matched odds ratio for final model selection and testing. We report that the CV procedure is valid with the permutation test, that MDR-PDT performs similarly with 5- and 10-fold CV, and that the matched odds ratio is more powerful than prediction error as the fitness metric for MDR-PDT.
Epistasis; MDR-PDT; complex disease; family-based association; bioinformatics
With white blood cell count emerging as an important risk factor for chronic inflammatory diseases, genetic associations of differential leukocyte types, specifically monocyte count, are providing novel candidate genes and pathways to further investigate. Circulating monocytes play a critical role in vascular diseases such as in the formation of atherosclerotic plaque. We performed joint and ancestry-stratified genome-wide association analyses to identify variants specifically associated with monocyte count in 11 014 subjects in the electronic Medical Records and Genomics Network. In the joint and European ancestry samples, we identified novel associations in the chromosome 16 interferon regulatory factor 8 (IRF8) gene (P-value = 2.78×10^−16, β = −0.22). Other monocyte associations include novel missense variants in the chemokine-binding protein 2 (CCBP2) gene (P-value = 1.88×10^−7, β = 0.30) and a region of replication found in ribophorin I (RPN1) (P-value = 2.63×10^−16, β = −0.23) on chromosome 3. The CCBP2 and RPN1 region is located near the GATA binding protein 2 gene, which has been previously shown to be associated with coronary heart disease. On chromosome 9, we found a novel association in the prostaglandin reductase 1 gene (P-value = 2.29×10^−7, β = 0.16), which is downstream from lysophosphatidic acid receptor 1. This region has previously been shown to be associated with monocyte count. We also replicated monocyte associations of genome-wide significance (P-value = 5.68×10^−17, β = −0.23) at the integrin, alpha 4 gene on chromosome 2. The novel IRF8 results and further replications provide supporting evidence of genetic regions associated with monocyte count.
In genetic studies of complex disease a consideration for the investigator is detection of joint effects. The Multifactor Dimensionality Reduction (MDR) algorithm searches for these effects with an exhaustive approach. Previously unknown aspects of MDR performance were the power to detect interactive effects given large numbers of non-model loci or varying degrees of heterogeneity among multiple epistatic disease models.
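The exhaustive search can be sketched for the two-locus case. In the general MDR idea, each multilocus genotype cell is labeled high or low risk by comparing its case-to-control ratio with the overall ratio, and every locus combination is scored by how well that labeling classifies the data. The code below is a simplified illustration under stated assumptions: it scores pairs by in-sample balanced accuracy, omits the cross-validation and permutation testing used in practice, and uses invented toy genotypes. It is not the implementation evaluated in the study.

```python
from collections import Counter
from itertools import combinations


def mdr_pair_accuracy(geno_a, geno_b, status):
    """In-sample balanced accuracy of the high/low-risk labeling for one pair."""
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    cases, controls = Counter(), Counter()
    for a, b, s in zip(geno_a, geno_b, status):
        (cases if s == 1 else controls)[(a, b)] += 1
    # A cell is high risk when its case:control ratio exceeds the overall ratio
    # (cross-multiplied to avoid dividing by an empty control count).
    high_risk = {
        cell for cell in cases
        if cases[cell] * n_controls > controls[cell] * n_cases
    }
    true_pos = sum(cases[cell] for cell in high_risk)
    true_neg = sum(n for cell, n in controls.items() if cell not in high_risk)
    return 0.5 * (true_pos / n_cases + true_neg / n_controls)


def best_pair(genotypes, status):
    """Exhaustively score every locus pair and return the best one."""
    scored = [
        ((a, b), mdr_pair_accuracy(genotypes[a], genotypes[b], status))
        for a, b in combinations(sorted(genotypes), 2)
    ]
    return max(scored, key=lambda item: item[1])


# Toy data: status depends on snpA and snpB jointly; snpC is a non-model locus.
toy_genotypes = {
    "snpA": [0, 0, 1, 1, 0, 0, 1, 1],
    "snpB": [0, 1, 0, 1, 0, 1, 0, 1],
    "snpC": [0, 0, 0, 0, 1, 1, 1, 1],
}
toy_status = [0, 1, 1, 0, 0, 1, 1, 0]  # 1 = case, 0 = control
print(best_pair(toy_genotypes, toy_status))
```

In the toy data neither snpA nor snpB has a marginal effect, yet the pair classifies every individual correctly, which is the kind of purely epistatic signal an exhaustive pairwise search is meant to recover.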
To address the performance with many non-model loci, datasets of 500 cases and 500 controls with 100 to 10,000 SNPs were simulated for two-locus models, and one hundred 500-case/500-control datasets with 100 and 500 SNPs were simulated for three-locus models. Multiple levels of locus heterogeneity were simulated in several sample sizes.
These results show MDR is robust to locus heterogeneity when the definition of power is not as conservative as in previous simulation studies where all model loci were required to be found by the method. The results also indicate that MDR performance is related more strongly to broad-sense heritability than sample size and is not greatly affected by non-model loci.
A study in which a population with high heritability estimates is sampled predisposes the MDR study to success more than a larger ascertainment in a population with smaller estimates.
Epistasis; MDR; Heterogeneity
Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10^−8 (eMERGE) and p=2.5×10^−20 (CHARGE) and rs6795970 in SCN10A with p=6×10^−6 (eMERGE) and p=5×10^−27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
This editorial introduces BioData Mining, a new journal which publishes research articles related to advances in computational methods and techniques for the extraction of useful knowledge from heterogeneous biological data. We outline the aims and scope of the journal, introduce the publishing model and describe the open peer review policy, which fosters interaction within the research community.
Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10^−6 (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
Type 2 diabetes (T2D) is a complex metabolic disease that disproportionately affects African Americans. Genome-wide association studies (GWAS) have identified several loci that contribute to T2D in European Americans, but few studies have been performed in admixed populations. We first performed a GWAS of 1,563 African Americans from the Vanderbilt Genome-Electronic Records Project and Northwestern University NUgene Project as part of the electronic Medical Records and Genomics (eMERGE) network. In this African American dataset, we successfully replicated an association in TCF7L2 that had previously been identified by GWAS. We were unable to identify novel associations at p<5.0×10^−8 by GWAS. Using admixture mapping as an alternative method for discovery, we performed a genome-wide admixture scan that suggests multiple candidate genes associated with T2D. One finding, TCIRG1, is a T-cell immune regulator expressed in the pancreas and liver that has not been previously implicated in T2D. We performed subsequent fine-mapping to further assess the association between TCIRG1 and T2D in >5,000 African Americans. We identified 13 independent associations between TCIRG1, CHKA, and ALDH3B1 genes on chromosome 11 and T2D. Our results suggest that a novel region on chromosome 11 identified by admixture mapping is associated with T2D in African Americans.
Genetic variants of the enzyme that metabolizes warfarin, cytochrome P-450 2C9 (CYP2C9), and of a key pharmacologic target of warfarin, vitamin K epoxide reductase (VKORC1), contribute to differences in patients’ responses to various warfarin doses, but the role of these variants during initial anticoagulation is not clear.
In 297 patients starting warfarin therapy, we assessed CYP2C9 genotypes (CYP2C9 *1, *2, and *3), VKORC1 haplotypes (designated A and non-A), clinical characteristics, response to therapy (as determined by the international normalized ratio [INR]), and bleeding events. The study outcomes were the time to the first INR within the therapeutic range, the time to the first INR of more than 4, the time above the therapeutic INR range, the INR response over time, and the warfarin dose requirement.
As compared with patients with the non-A/non-A haplotype, patients with the A/A haplotype of VKORC1 had a decreased time to the first INR within the therapeutic range (P = 0.02) and to the first INR of more than 4 (P = 0.003). In contrast, the CYP2C9 genotype was not a significant predictor of the time to the first INR within the therapeutic range (P = 0.57) but was a significant predictor of the time to the first INR of more than 4 (P = 0.03). Both the CYP2C9 genotype and VKORC1 haplotype had a significant influence on the required warfarin dose after the first 2 weeks of therapy.
Initial variability in the INR response to warfarin was more strongly associated with genetic variability in the pharmacologic target of warfarin, VKORC1, than with CYP2C9.
The purpose of this paper is to describe the data collection efforts and validation of PhenX measures in the Personalized Medicine Research Project (PMRP) cohort.
Thirty-six measures were chosen from the PhenX Toolkit within the following domains: demographics; anthropometrics; alcohol, tobacco and other substances; cardiovascular; environmental exposures; cancer; psychiatric; neurology; and physical activity and physical fitness. Eligible subjects for the current study were living PMRP participants with known addresses who had consented to future contact, were not currently living in a nursing home, and had GWAS data available from eMERGE I, in which age-related cataract, HDL, dementia and resistant hypertension were the primary phenotypes; this biased the sample toward the older PMRP participants. The questionnaires were mailed twice. Data from the PhenX measures were compared with information from PMRP questionnaires and data from Marshfield Clinic electronic medical records.
Completed PhenX questionnaires were returned by 2271 subjects for a final response rate of 70%. The mean age reported on the PhenX questionnaire (73.1 years) was greater than that reported on the PMRP questionnaire (64.8 years) because the data were collected at different time points. The mean self-reported weight, and subsequently calculated BMI, were lower on the PhenX survey than the measured values at the time of enrollment into PMRP (PhenX means 173.5 pounds and BMI 28.2 kg/m² versus PMRP 182.9 pounds and BMI 29.6 kg/m²). There was 95.3% agreement between the two questionnaires about having ever smoked at least 100 cigarettes. A total of 139 subjects (6.2%) indicated on the PhenX questionnaire that they had been told they had a stroke. Of these, only 15 (10.8%) had no electronic indication of a prior stroke or TIA. All of the age- and gender-specific 95% confidence limits around point estimates for major depressive episodes overlap and show that 31% of women aged 50–64 reported symptoms associated with a major depressive episode.
The approach employed resulted in a high response rate and valuable data for future gene/environment analyses. These results and high response rate highlight the utility of the PhenX Toolkit to collect valid phenotypic data that can be shared across groups to facilitate gene/environment studies.
The ever-growing wealth of biological information available through multiple comprehensive database repositories can be leveraged for advanced analysis of data. We have now extensively revised and updated the multi-purpose software tool Biofilter, which allows researchers to annotate and/or filter data as well as generate gene-gene interaction models based on existing biological knowledge. Biofilter now has the Library of Knowledge Integration (LOKI) for accessing and integrating existing comprehensive database information, including more flexibility in how ambiguity of gene identifiers is handled. We have also updated the way importance scores for interaction models are generated. In addition, Biofilter 2.0 now works with a range of types and formats of data, including single nucleotide polymorphism (SNP) identifiers, rare variant identifiers, base pair positions, gene symbols, genetic regions, and copy number variant (CNV) location information.
Biofilter provides a convenient single interface for accessing multiple publicly available human genetic data sources that have been compiled in the supporting database of LOKI. Information within LOKI includes genomic locations of SNPs and genes, as well as known relationships among genes and proteins such as interaction pairs, pathways and ontological categories.
Via Biofilter 2.0 researchers can:
Annotate genomic location or region based data, such as results from association studies or CNV analyses, with relevant biological knowledge for deeper interpretation
Filter genomic location or region based data on biological criteria, such as filtering a series of SNPs to retain only SNPs present in specific genes within specific pathways of interest
Generate predictive models for gene-gene, SNP-SNP, or CNV-CNV interactions based on biological information, with priority for models to be tested based on biological relevance, thus narrowing the search space and reducing multiple hypothesis-testing.
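The filtering and model-generation ideas listed above can be sketched with a toy in-memory knowledge base. Everything in the example is hypothetical: the SNP, gene, and pathway names are invented, the functions are not part of Biofilter or LOKI, and the simple count of supporting pathways only stands in for an importance score rather than reproducing Biofilter's scoring.

```python
from itertools import combinations, product

# Invented knowledge base: SNP -> gene, and pathway -> member genes.
snp_to_gene = {
    "snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B",
    "snp4": "GENE_C", "snp5": "GENE_D",
}
pathways = {
    "pathway_x": {"GENE_A", "GENE_B"},
    "pathway_y": {"GENE_A", "GENE_B", "GENE_C"},
}


def filter_snps(snps, wanted_pathways):
    """Keep only SNPs located in genes that belong to the wanted pathways."""
    genes = set().union(*(pathways[name] for name in wanted_pathways))
    return [snp for snp in snps if snp_to_gene.get(snp) in genes]


def snp_snp_models(snps):
    """SNP pairs across genes that share a pathway, most supported first."""
    support = {}
    for name, genes in pathways.items():
        for gene_pair in combinations(sorted(genes), 2):
            support.setdefault(gene_pair, set()).add(name)
    by_gene = {}
    for snp in snps:
        gene = snp_to_gene.get(snp)
        if gene is not None:
            by_gene.setdefault(gene, []).append(snp)
    models = [
        (s1, s2, len(names))
        for (g1, g2), names in support.items()
        for s1, s2 in product(by_gene.get(g1, []), by_gene.get(g2, []))
    ]
    return sorted(models, key=lambda m: (-m[2], m[0], m[1]))


all_snps = sorted(snp_to_gene)
print(filter_snps(all_snps, ["pathway_x"]))
for model in snp_snp_models(all_snps):
    print(model)
```

In this toy case five knowledge-supported SNP pairs are proposed instead of the ten possible pairs among five SNPs, illustrating how prior knowledge narrows the search space.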
Biofilter is a software tool that provides a flexible way to use the ever-expanding expert biological knowledge that exists to direct filtering, annotation, and complex predictive model development for elucidating the etiology of complex phenotypic outcomes.
Data mining; Bioinformatics; Expert knowledge; Modeling; Pathway analyses; Epistasis
Analyses investigating low frequency variants have the potential for explaining additional genetic heritability of many complex human traits. However, the natural frequencies of rare variation between human populations strongly confound genetic analyses. We have applied a novel collapsing method to identify biological features with low frequency variant burden differences in thirteen populations sequenced by the 1000 Genomes Project. Our flexible collapsing tool utilizes expert biological knowledge from multiple publicly available database sources to direct feature selection. Variants were collapsed according to genetically driven features, such as evolutionarily conserved regions, regulatory regions, genes, and pathways. We have conducted an extensive comparison of low frequency variant burden differences (MAF<0.03) between populations from 1000 Genomes Project Phase I data. We found that on average 26.87% of gene bins, 35.47% of intergenic bins, 42.85% of pathway bins, 14.86% of ORegAnno regulatory bins, and 5.97% of evolutionarily conserved regions show statistically significant differences in low frequency variant burden across populations from the 1000 Genomes Project. The proportion of bins with significant differences in low frequency burden depends on the ancestral similarity of the two populations compared and the types of features tested. Even closely related populations had notable differences in low frequency burden, but fewer differences than populations from different continents. Furthermore, conserved or functionally relevant regions had fewer significant differences in low frequency burden than regions under less evolutionary constraint. This degree of low frequency variant differentiation across diverse populations and feature elements highlights the critical importance of considering population stratification in the new era of DNA sequencing and low frequency variant genomic analyses.
Low frequency variants are likely to play an important role in uncovering complex trait heritability; however, they are often continent or population specific. This specificity complicates genetic analyses investigating low frequency variants for two reasons: low frequency variant signals in an association test are often difficult to generalize beyond a single population or continental group, and there is an increase in false positive results in association analyses due to underlying population stratification. In order to reveal the magnitude of low frequency population stratification, we performed pairwise population comparisons using the 1000 Genomes Project Phase I data to investigate differences in low frequency variant burden across multiple biological features. We found that low frequency variant confounding is much more prevalent than one might expect, even within continental groups. The proportion of significant differences in low frequency variant burden was also dependent on the region of interest; for example, annotated regulatory regions showed fewer low frequency burden differences between populations than intergenic regions. Knowledge of population structure and the genomic landscape in a region of interest are important factors in determining the extent of confounding due to population stratification in a low frequency genomic analysis.
Gene expression profiles have been broadly used in cancer research as a diagnostic or prognostic signature for predicting clinical outcomes such as stage, grade, metastatic status, recurrence, and patient survival, as well as to potentially improve patient management. However, emerging evidence shows that gene expression-based prediction varies between independent data sets. One possible explanation of this effect is that previous studies were focused on identifying genes with large main effects associated with clinical outcomes. Thus, non-linear interactions without large individual main effects would be missed. The other possible explanation is that gene expression as a single level of genomic data is insufficient to explain the clinical outcomes of interest since cancer can be dysregulated by multiple alterations through genome, epigenome, transcriptome, and proteome levels. In order to overcome the variability of diagnostic or prognostic predictors from gene expression alone and to increase its predictive power, we need to integrate multiple levels of genomic data and identify interactions between them associated with clinical outcomes.
Here, we proposed an integrative framework for identifying interactions within/between multi-levels of genomic data associated with cancer clinical outcomes using the Grammatical Evolution Neural Networks (GENN). In order to demonstrate the validity of the proposed framework, ovarian cancer data from TCGA was used as a pilot task. We found not only interactions within a single genomic level but also interactions between multi-levels of genomic data associated with survival in ovarian cancer. Notably, the integration model from different levels of genomic data achieved 72.89% balanced accuracy and outperformed the top models with any single level of genomic data.
Understanding the underlying tumorigenesis and progression in ovarian cancer through the global view of interactions within/between different levels of genomic data is expected to provide guidance for improved prognostic biomarkers and individualized therapies.
Integrative analysis; Multi-omics data; Grammatical evolution neural network; Ovarian cancer
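The balanced accuracy reported for the integration model can be read as the mean of sensitivity and specificity, assuming the usual definition for a binary outcome. The sketch below is illustrative only: the function name and the labels are hypothetical and the data are not from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of positives correctly predicted
    specificity = tn / (tn + fp)  # fraction of negatives correctly predicted
    return (sensitivity + specificity) / 2


# Made-up labels: 4 positives (3 predicted correctly), 2 negatives (1 predicted correctly).
example_true = [1, 1, 1, 1, 0, 0]
example_pred = [1, 1, 1, 0, 0, 1]
example_score = balanced_accuracy(example_true, example_pred)
```

Unlike plain accuracy, this measure is not inflated when one outcome class is much larger than the other, which is a common reason to report it for survival groups of unequal size.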
A single mutation can alter cellular and global homeostatic mechanisms and give rise to multiple clinical diseases. We hypothesized that these disease mechanisms could be identified using low minor allele frequency (MAF<0.1) non-synonymous SNPs (nsSNPs) associated with “mechanistic phenotypes”, comprised of collections of related diagnoses. We studied two mechanistic phenotypes: (1) thrombosis, evaluated in a population of 1,655 African Americans; and (2) four groupings of cancer diagnoses, evaluated in 3,009 white European Americans. We tested associations between nsSNPs represented on GWAS platforms and mechanistic phenotypes ascertained from electronic medical records (EMRs), and sought enrichment in functional ontologies across the top-ranked associations. We used a two-step analytic approach whereby nsSNPs were first sorted by the strength of their association with a phenotype. We tested associations using two reverse genetic models and standard additive and recessive models. In the second step, we employed a hypothesis-free ontological enrichment analysis using the sorted nsSNPs to identify functional mechanisms underlying the diagnoses comprising the mechanistic phenotypes. The thrombosis phenotype was solely associated with ontologies related to blood coagulation (Fisher's p = 0.0001, FDR p = 0.03), driven by the F5, P2RY12 and F2RL2 genes. For the cancer phenotypes, the reverse genetics models were enriched in DNA repair functions (p = 2×10−5, FDR p = 0.03) (POLG/FANCI, SLX4/FANCP, XRCC1, BRCA1, FANCA, CHD1L) while the additive model showed enrichment related to chromatid segregation (p = 4×10−6, FDR p = 0.005) (KIF25, PINX1). We were able to replicate nsSNP associations for POLG/FANCI, BRCA1, FANCA and CHD1L in independent data sets. Mechanism-oriented phenotyping using collections of EMR-derived diagnoses can elucidate fundamental disease mechanisms.
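The enrichment step can be illustrated with a one-sided Fisher's exact test, computed here as a hypergeometric upper tail. This is a sketch under stated assumptions rather than the authors' pipeline: the function name, the choice of a one-sided test, and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_top, top_size, hits_total, universe_size):
    """One-sided Fisher's exact p-value for over-representation.

    Probability of seeing at least `hits_in_top` annotated items among the
    `top_size` top-ranked items, when `hits_total` of the `universe_size`
    items carry the annotation.
    """
    max_hits = min(top_size, hits_total)
    tail = sum(
        comb(hits_total, k) * comb(universe_size - hits_total, top_size - k)
        for k in range(hits_in_top, max_hits + 1)
    )
    return tail / comb(universe_size, top_size)


# Made-up counts: 20 genes tested, 5 annotated to an ontology term,
# and 3 of those 5 fall among the 5 top-ranked genes.
example_p = enrichment_p_value(hits_in_top=3, top_size=5, hits_total=5, universe_size=20)
```

In practice one p-value is computed per ontology term, so a multiple-testing correction such as the FDR values quoted above is applied across terms.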
Prior candidate gene studies have associated CYP2B6 516G→T [rs3745274] and 983T→C [rs28399499] with increased plasma efavirenz exposure. We sought to identify novel variants associated with efavirenz pharmacokinetics.
Materials and methods
Antiretroviral therapy-naive AIDS Clinical Trials Group studies A5202, A5095, and ACTG 384 included plasma sampling for efavirenz pharmacokinetics. Log-transformed trough efavirenz concentrations (Cmin) were previously estimated by population pharmacokinetic modeling. Stored DNA was genotyped with Illumina HumanHap 650Y or 1MDuo platforms, complemented by additional targeted genotyping of CYP2B6 and CYP2A6 with MassARRAY iPLEX Gold. Associations were identified by linear regression, which included principal component vectors to adjust for genetic ancestry.
Among 856 individuals, CYP2B6 516G→T was associated with efavirenz estimated Cmin (P = 8.5 × 10−41). After adjusting for CYP2B6 516G→T, CYP2B6 983T→C was associated (P = 9.9 × 10−11). After adjusting for both CYP2B6 516G→T and 983T→C, a CYP2B6 variant (rs4803419) in intron 3 was associated (P = 4.4 × 10−15). After adjusting for all three variants, non-CYP2B6 polymorphisms were associated at P-value less than 5 × 10−8. In a separate cohort of 240 individuals, only the three CYP2B6 polymorphisms replicated. These three polymorphisms explained 34% of interindividual variability in efavirenz estimated Cmin. The extensive metabolizer phenotype was best defined by the absence of all three polymorphisms.
Three CYP2B6 polymorphisms were independently associated with efavirenz estimated Cmin at genome-wide significance, and explained one-third of interindividual variability. These data will inform continued efforts to translate pharmacogenomic knowledge into optimal efavirenz utilization.
CYP2B6; efavirenz; HIV; pharmacogenomics; pharmacokinetics
Marked prolongation of the QT interval on the electrocardiogram associated with the polymorphic ventricular tachycardia Torsades de Pointes is a serious adverse event during treatment with antiarrhythmic drugs and other culprit medications, and is a common cause for drug relabeling and withdrawal. Although clinical risk factors have been identified, the syndrome remains unpredictable in an individual patient. Here we used genome-wide association analysis to search for common predisposing genetic variants. Cases of drug-induced Torsades de Pointes (diTdP), treatment-tolerant controls, and general population controls were ascertained across multiple sites using common definitions, and genotyped on the Illumina 610k or 1M-Duo BeadChips. Principal Components Analysis was used to select 216 Northwestern European diTdP cases and 771 ancestry-matched controls, including treatment-tolerant and general population subjects. With these sample sizes, there is 80% power to detect a variant at genome-wide significance with minor allele frequency of 10% and conferring an odds ratio of ≥2.7. Tests of association were carried out for each single nucleotide polymorphism (SNP) by logistic regression adjusting for gender and population structure. No SNP reached genome-wide significance; the variant with the lowest P value was rs2276314, a non-synonymous coding variant in C18orf21 (p = 3×10−7, odds ratio = 2, 95% confidence intervals: 1.5–2.6). The haplotype formed by rs2276314 and a second SNP, rs767531, was significantly more frequent in controls than cases (p = 3×10−9). Expanding the number of controls and a gene-based analysis did not yield significant associations. This study argues that common genomic variants do not contribute importantly to risk for drug-induced Torsades de Pointes across multiple drugs.
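To show how an odds ratio and its 95% confidence interval relate, here is a simplified sketch that uses an unadjusted 2×2 table with a Wald interval on the log odds ratio. This is a deliberately swapped-in simplification: the study used logistic regression adjusted for gender and population structure, which this code does not do. The function name and the counts are hypothetical and are not the study's data.

```python
from math import exp, log, sqrt


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    a: carrier cases       b: non-carrier cases
    c: carrier controls    d: non-carrier controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts for illustration only.
example_or, example_lower, example_upper = odds_ratio_wald_ci(a=20, b=10, c=10, d=20)
```

Because the interval is symmetric on the log scale, the point estimate is the geometric mean of the two bounds. An interval that excludes 1 corresponds to nominal significance only, which is far less stringent than the genome-wide threshold discussed above.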
With the abundance of information and analysis results being collected for genetic loci, user-friendly and flexible data visualization approaches can inform and improve the analysis and dissemination of these data. A chromosomal ideogram is an idealized graphic representation of chromosomes. Ideograms can be combined with overlaid points, lines, and/or shapes, to provide summary information from studies of various kinds, such as genome-wide association studies or phenome-wide association studies, coupled with genomic location information. To facilitate visualizing varied data in multiple ways using ideograms, we have developed a flexible software tool called PhenoGram which exists as a web-based tool and also a command-line program.
With PhenoGram, researchers can create chromosomal ideograms annotated with lines in color at specific base-pair locations, or colored base-pair to base-pair regions, with or without other annotation. PhenoGram allows for annotation of chromosomal locations and/or regions with shapes in different colors, gene identifiers, or other text. PhenoGram also allows for creation of plots showing expanded chromosomal locations, providing a way to show results for specific chromosomal regions in greater detail. We have now used PhenoGram to produce a variety of different plots, and provide these as examples herein. These plots include visualization of the genomic coverage of SNPs from a genotyping array, highlighting the chromosomal coverage of imputed SNPs, copy-number variation region coverage, as well as plots similar to the NHGRI GWA Catalog of genome-wide association results.
PhenoGram is a versatile, user-friendly software tool fostering the exploration and sharing of genomic information. Through visualization of data, researchers can both explore and share complex results, facilitating a greater understanding of these data.
Data visualization; Bioinformatics; Genome-wide association study; GWAS; Copy-number variants; CNV; SNP; Ideogram
The Electronic Medical Records and Genomics (eMERGE) Network is a National Human Genome Research Institute (NHGRI)-funded consortium engaged in the development of methods and best-practices for utilizing the Electronic Medical Record (EMR) as a tool for genomic research. Now in its sixth year, its second funding cycle and comprising nine research groups and a coordinating center, the network has played a major role in validating the concept that clinical data derived from EMRs can be used successfully for genomic research. Current work is advancing knowledge in multiple disciplines at the intersection of genomics and healthcare informatics, particularly electronic phenotyping, genome-wide association studies, genomic medicine implementation and the ethical and regulatory issues associated with genomics research and returning results to study participants. Here we describe the evolution, accomplishments, opportunities and challenges of the network since its inception as a five-group consortium focused on genotype-phenotype associations for genomic discovery to its current form as a nine-group consortium pivoting towards implementation of genomic medicine.
electronic medical records; personalized medicine; genome-wide association studies; genetics and genomics; collaborative research