Chronic Obstructive Pulmonary Disease (COPD) is a complex disease. Genetic, epigenetic, and environmental factors are known to contribute to COPD risk and disease progression. Therefore, we developed a systematic approach to identify key regulators of COPD that integrates genome-wide DNA methylation, gene expression, and phenotype data in lung tissue from COPD and control samples. Our integrative analysis identified 126 key regulators of COPD. We identified EPAS1 as the only key regulator whose downstream genes significantly overlapped with multiple gene sets associated with COPD disease severity. EPAS1 is distinct in comparison with other key regulators in terms of methylation profile and downstream target genes. Genes predicted to be regulated by EPAS1 were enriched for biological processes including signaling, cell communication, and system development. We confirmed that EPAS1 protein levels are lower in human COPD lung tissue compared to non-disease controls and that Epas1 gene expression is reduced in mice chronically exposed to cigarette smoke. As EPAS1 downstream genes were significantly enriched for hypoxia responsive genes in endothelial cells, we tested EPAS1 function in human endothelial cells. EPAS1 knockdown by siRNA in endothelial cells impacted genes that significantly overlapped with EPAS1 downstream genes in lung tissue including hypoxia responsive genes, and genes associated with emphysema severity. Our first integrative analysis of genome-wide DNA methylation and gene expression profiles illustrates that not only does DNA methylation play a ‘causal’ role in the molecular pathophysiology of COPD, but it can be leveraged to directly identify novel key mediators of this pathophysiology.
Chronic Obstructive Pulmonary Disease (COPD) is a common lung disease. It is the fourth leading cause of death in the world and is expected to be the third by 2020. COPD is a heterogeneous and complex disease consisting of obstruction in the small airways, emphysema, and chronic bronchitis. COPD is generally caused by exposure to noxious particles or gases, most commonly from cigarette smoking. However, only 20–25% of smokers develop clinically significant airflow obstruction. Smoking is known to cause epigenetic changes in lung tissues. Thus, genetic and epigenetic factors, and their interaction with environmental factors, play an important role in COPD pathogenesis and progression. Currently, there are no therapeutics that can reverse COPD progression. In order to identify new targets that may lead to the development of therapeutics for curing COPD, we developed a systematic approach to identify key regulators of COPD that integrates genome-wide DNA methylation, gene expression, and phenotype data in lung tissue from COPD and control samples. Our integrative analysis identified 126 key regulators of COPD. We identified EPAS1 as the only key regulator whose downstream genes significantly overlapped with multiple gene sets associated with COPD disease severity.
A disruptive approach to therapeutic discovery and development is required in order to significantly improve the success rate of drug discovery for central nervous system (CNS) disorders. In this review, we first assess the key factors contributing to the frequent clinical failures for novel drugs. Second, we discuss cancer translational research paradigms that addressed key issues in drug discovery and development and have resulted in delivering drugs with significantly improved outcomes for patients. Finally, we discuss two emerging technologies that could improve the success rate of CNS therapies: human induced pluripotent stem cell (hiPSC)-based studies and multiscale biology models. Coincident with advances in cellular technologies that enable the generation of hiPSCs directly from patient blood or skin cells, together with methods to differentiate these hiPSC lines into specific neural cell types relevant to neurological disease, it is also now possible to combine data from large-scale forward genetics and post-mortem global epigenetic and expression studies in order to generate novel predictive models. The application of systems biology approaches to account for the multiscale nature of different data types, from genetic to molecular and cellular to clinical, can lead to new insights into human diseases that are emergent properties of biological networks, not the result of changes to single genes. Such studies have demonstrated the heterogeneity in etiological pathways and the need for studies on model systems that are patient-derived and thereby recapitulate neurological disease pathways with higher fidelity. 
In the context of two common and presumably representative neurological diseases, the neurodegenerative disease Alzheimer’s Disease, and the psychiatric disorder schizophrenia, we propose the need for, and exemplify the impact of, a multiscale biology approach that can integrate panomic, clinical, imaging, and literature data in order to construct predictive disease network models that can (i) elucidate subtypes of syndromic diseases, (ii) provide insights into disease networks and targets and (iii) facilitate a novel drug screening strategy using patient-derived hiPSCs to discover novel therapeutics for CNS disorders.
stem cell-based screening; systems biology and network biology; drug discovery screening; complex disease mechanism; high throughput biology
Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are inter-connected by cis-regulations. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, which can be used in further integrative analysis. Our results indicate that inspection of sample annotation and labeling error is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased statistically significant genetic associations and genomic correlations by more than two-fold. In a simulation study, MODMatcher provided more robust results by using three types of omics data than two types of omics data. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.
Many human diseases are complex with multiple genetic and environmental causal factors interacting together to give rise to disease phenotypes. Such factors affect biological systems through many layers of regulations, including transcriptional and epigenetic regulation, and protein changes. To fully understand their molecular mechanisms, complex diseases are often studied in diverse dimensions including genetics (genotype variations by single nucleotide polymorphism (SNP) arrays or whole exome sequencing), transcriptomics, epigenetics, and proteomics. However, errors in sample annotation or labeling often occur in large-scale genetic and genomic studies and are difficult to avoid completely during data generation and management. Identifying and correcting these errors are critical for integrative genomic studies. In this study, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors based on multiple types of molecular data before further integrative analysis. Our results indicate that signals increased more than 100% after correction of sample labeling errors in a large lung genomic study. Our method can be broadly applied to large genomic data sets with multiple types of omics data, such as TCGA (The Cancer Genome Atlas) data sets.
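The matching idea described above can be illustrated with a toy sketch. It assumes that every sample has two profiles over the same cis-linked features (for example, observed expression and expression predicted from another data type) and pairs each sample with its best-correlated counterpart. The function names, the toy data, and the simple best-match rule are illustrative assumptions, not the MODMatcher implementation.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no matching information
    return sxy / (sxx * syy) ** 0.5

def best_matches(type_a, type_b):
    """For each sample label in type_a, find the best-correlated label in type_b."""
    return {
        id_a: max(type_b, key=lambda id_b: pearson(prof_a, type_b[id_b]))
        for id_a, prof_a in type_a.items()
    }

def flag_label_errors(type_a, type_b):
    """Return {label: suggested label} where the best match carries a different label."""
    return {a: b for a, b in best_matches(type_a, type_b).items() if a != b}

# Hypothetical toy data: the S2 and S3 labels are swapped in the second data type.
observed = {
    "S1": [1.0, 2.0, 3.0, 4.0],
    "S2": [4.0, 1.0, 3.0, 2.0],
    "S3": [2.0, 4.0, 1.0, 3.0],
}
predicted = {
    "S1": [1.1, 2.1, 2.9, 4.2],
    "S2": [2.1, 3.9, 1.2, 3.0],
    "S3": [3.9, 1.1, 3.2, 1.8],
}

suspect = flag_label_errors(observed, predicted)
print(suspect)
```

A real data set would need a significance threshold on the correlations and a one-to-one assignment step, since a greedy best match can map two samples to the same partner.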
Posttraumatic stress disorder (PTSD) and other deployment-related outcomes originate from a complex interplay between constellations of changes in DNA, environmental traumatic exposures, and other biological risk factors. These factors affect not only individual genes or bio-molecules but also the entire biological networks that in turn increase or decrease the risk of illness or affect illness severity. This review focuses on recent developments in the field of systems biology which use multidimensional data to discover biological networks affected by combat exposure and post-deployment disease states. By integrating large-scale, high-dimensional molecular, physiological, clinical, and behavioral data, the molecular networks that directly respond to perturbations that can lead to PTSD can be identified and causally associated with PTSD, providing a path to identify key drivers. Reprogrammed neural progenitor cells from fibroblasts from PTSD patients could be established as an in vitro assay for high throughput screening of approved drugs to determine which drugs reverse the abnormal expression of the pathogenic biomarkers or neuronal properties.
PTSD; genomics; gene expression; proteomics; Computational Biology; risk factors
Allergic rhinitis is a common disease whose genetic basis is incompletely explained. We report an integrated genomic analysis of allergic rhinitis.
We performed genome-wide association studies (GWAS) of allergic rhinitis in 5633 ethnically diverse North American subjects. Next, we profiled gene expression in disease-relevant tissue (peripheral blood CD4+ lymphocytes) collected from subjects who had been genotyped. We then integrated the GWAS and gene expression data using expression single nucleotide polymorphism (eSNP), coexpression network, and pathway approaches to identify the biologic relevance of our GWAS.
GWAS revealed ethnicity-specific findings, with 4 genome-wide significant loci among Latinos and 1 genome-wide significant locus in the GWAS meta-analysis across ethnic groups. To identify biologic context for these results, we constructed a coexpression network to define modules of genes with similar patterns of CD4+ gene expression (coexpression modules) that could serve as constructs of broader gene expression. Six of the 22 GWAS loci with P-value ≤ 1 × 10−6 tagged one particular coexpression module (4.0-fold enrichment, P-value 0.0029), and this module also had the greatest enrichment (3.4-fold enrichment, P-value 2.6 × 10−24) for allergic rhinitis-associated eSNPs (genetic variants associated with both gene expression and allergic rhinitis). The integrated GWAS, coexpression network, and eSNP results therefore supported this coexpression module as an allergic rhinitis module. Pathway analysis revealed that the module was enriched for mitochondrial pathways (8.6-fold enrichment, P-value 4.5 × 10−72).
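Fold enrichment and an enrichment P-value of the kind reported above can be computed from four counts. The sketch below assumes a one-sided hypergeometric test, which the text does not confirm as the test actually used, and the module and background sizes are hypothetical because they are not given here.

```python
from math import comb

def fold_enrichment(overlap, set_size, module_size, background):
    """Observed overlap divided by the overlap expected by chance."""
    expected = set_size * module_size / background
    return overlap / expected

def hypergeom_upper_tail(overlap, set_size, module_size, background):
    """P(X >= overlap) when set_size items are drawn without replacement
    from a background that contains module_size module members."""
    total = comb(background, set_size)
    upper = min(set_size, module_size)
    tail = sum(
        comb(module_size, k) * comb(background - module_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 6 of 22 loci fall in a 1,500-gene module
# drawn from a 20,000-gene background (both sizes are assumptions).
fold = fold_enrichment(6, 22, 1500, 20000)
p_value = hypergeom_upper_tail(6, 22, 1500, 20000)
print(f"fold = {fold:.2f}, P = {p_value:.3g}")
```

Exact integer binomial coefficients avoid floating-point underflow for the very small P-values that large modules produce.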
Our results highlight mitochondrial pathways as a target for further investigation of allergic rhinitis mechanism and treatment. Our integrated approach can be applied to provide biologic context for GWAS of other diseases.
Genome-wide association study; Allergic rhinitis; Coexpression network; Expression single-nucleotide polymorphism; Coexpression module; Pathway; Mitochondria; Hay fever; Allergy
Using expression profiles from postmortem prefrontal cortex samples of 624 dementia patients and non-demented controls, we investigated global disruptions in the co-regulation of genes in two neurodegenerative diseases, late-onset Alzheimer's disease (AD) and Huntington's disease (HD). We identified networks of differentially co-expressed (DC) gene pairs that either gained or lost correlation in disease cases relative to the control group, with the former dominant for both AD and HD and both patterns replicating in independent human cohorts of AD and aging. When aligning networks of DC patterns and physical interactions, we identified a 242-gene subnetwork enriched for independent AD/HD signatures. This subnetwork revealed a surprising dichotomy of gained/lost correlations among two inter-connected processes, chromatin organization and neural differentiation, and included DNA methyltransferases, DNMT1 and DNMT3A, of which we predicted the former but not latter as a key regulator. To validate the inter-connection of these two processes and our key regulator prediction, we generated two brain-specific knockout (KO) mice and show that Dnmt1 KO signature significantly overlaps with the subnetwork (P = 3.1 × 10−12), while Dnmt3a KO signature does not (P = 0.017).
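A gain or loss of correlation for one gene pair can be tested by comparing the two correlation coefficients. The sketch below assumes a Fisher z-transformation test, which is a standard choice but is not stated in the abstract; the function names and the 0.05 threshold are illustrative.

```python
from math import atanh, erfc, sqrt

def diff_coexpression_z(r_case, n_case, r_ctrl, n_ctrl):
    """Z statistic for the difference between two Pearson correlations.
    Requires |r| < 1 and more than 3 samples in each group."""
    se = sqrt(1.0 / (n_case - 3) + 1.0 / (n_ctrl - 3))
    return (atanh(r_case) - atanh(r_ctrl)) / se

def two_sided_p(z):
    """Two-sided P-value under a standard normal null."""
    return erfc(abs(z) / sqrt(2.0))

def classify_pair(r_case, n_case, r_ctrl, n_ctrl, alpha=0.05):
    """Label a gene pair as gaining, losing, or keeping its correlation in cases."""
    z = diff_coexpression_z(r_case, n_case, r_ctrl, n_ctrl)
    if two_sided_p(z) >= alpha:
        return "unchanged"
    return "gain" if abs(r_case) > abs(r_ctrl) else "loss"

# Hypothetical correlations for three gene pairs, 100 samples per group.
print(classify_pair(0.8, 100, 0.1, 100))
print(classify_pair(0.1, 100, 0.8, 100))
print(classify_pair(0.5, 100, 0.5, 100))
```

Applied genome-wide, the per-pair P-values would need multiple-testing correction before pairs are assembled into a network.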
differential co-expression; dysregulatory gene networks; epigenetic regulation of neural differentiation; network alignment; neurodegenerative diseases
The outbreak of diarrhea and hemolytic uremic syndrome that occurred in Germany in 2011 was caused by a Shiga toxin-producing enteroaggregative Escherichia coli (EAEC) strain. The strain was classified as EAEC due to the presence of a plasmid (pAA) that mediates a characteristic pattern of aggregative adherence on cultured cells, the defining feature of EAEC that has classically been associated with virulence. Here, we describe an infant rabbit-based model of intestinal colonization and diarrhea caused by the outbreak strain, which we use to decipher the factors that mediate the pathogen’s virulence. Shiga toxin is the key factor required for diarrhea. Unexpectedly, we observe that pAA is dispensable for intestinal colonization and development of intestinal pathology. Instead, chromosome-encoded autotransporters are critical for robust colonization and diarrheal disease in this model. Our findings suggest that conventional wisdom linking aggregative adherence to EAEC intestinal colonization is false for at least a subset of strains.
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
Genome-wide association studies (GWAS) have found a large number of genetic regions (“loci”) affecting clinical end-points and phenotypes, many outside coding intervals. One approach to understanding the biological basis of these associations has been to explore whether GWAS signals from intermediate cellular phenotypes, in particular gene expression, are located in the same loci (“colocalise”) and are potentially mediating the disease signals. However, it is not clear how to assess whether the same variants are responsible for the two GWAS signals or whether distinct causal variants close to each other are responsible. In this paper, we describe a statistical method that needs only single-variant summary statistics to test for colocalisation of GWAS signals. We describe one application of our method to a meta-analysis of blood lipids and liver expression, although any two datasets resulting from association studies can be used. Our method is able to detect the subset of GWAS signals explained by regulatory effects and identify candidate genes affected by the same GWAS variants. As summary GWAS data are increasingly available, applications of colocalisation methods to integrate the findings will be essential for functional follow-up, and will also be particularly useful to identify tissue-specific signals in eQTL datasets.
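A colocalisation test built from single-variant summary statistics can be sketched as follows. The sketch assumes a Bayesian formulation with per-variant approximate Bayes factors and five hypotheses (H0 no association, H1 and H2 association with one trait only, H3 two distinct causal variants, H4 one shared causal variant). The prior probabilities, the prior standard deviation, and the function names are assumptions for illustration, not the published implementation.

```python
from math import exp, log

def log_abf(beta, se, prior_sd=0.15):
    """Log approximate Bayes factor for one variant from its effect and standard error."""
    v = se ** 2
    r = prior_sd ** 2 / (prior_sd ** 2 + v)  # shrinkage factor
    z = beta / se
    return 0.5 * (log(1.0 - r) + r * z * z)

def log_sum(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + log(sum(exp(v - m) for v in values))

def coloc_posteriors(stats1, stats2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probability of H0..H4 for one region.
    stats1 and stats2 hold (beta, se) per variant, same variants in the same order."""
    l1 = [log_abf(b, s) for b, s in stats1]
    l2 = [log_abf(b, s) for b, s in stats2]
    n = len(l1)
    same = log_sum([l1[i] + l2[i] for i in range(n)])
    pairs = [l1[i] + l2[j] for i in range(n) for j in range(n) if i != j]
    different = log_sum(pairs) if pairs else float("-inf")
    log_w = {
        "H0": 0.0,
        "H1": log(p1) + log_sum(l1),
        "H2": log(p2) + log_sum(l2),
        "H3": log(p1) + log(p2) + different,
        "H4": log(p12) + same,
    }
    norm = log_sum(list(log_w.values()))
    return {h: exp(w - norm) for h, w in log_w.items()}

# Hypothetical three-variant region: both traits peak at the second variant.
trait_a = [(0.01, 0.05), (0.50, 0.05), (0.02, 0.05)]
trait_b = [(0.00, 0.05), (0.40, 0.05), (0.01, 0.05)]
shared = coloc_posteriors(trait_a, trait_b)

# Same region, but the second trait peaks at the third variant instead.
trait_c = [(0.00, 0.05), (0.01, 0.05), (0.40, 0.05)]
distinct = coloc_posteriors(trait_a, trait_c)
print(shared, distinct)
```

The explicit loop over variant pairs is quadratic in the number of variants; it is used here for clarity and would be replaced by an algebraic shortcut in a region with thousands of variants.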
The genetics of complex disease produce alterations in the molecular interactions of cellular pathways whose collective effect may become clear through the organized structure of molecular networks. To characterize molecular systems associated with late-onset Alzheimer’s disease (LOAD), we constructed gene regulatory networks in 1647 post-mortem brain tissues from LOAD patients and non-demented subjects, and demonstrate that LOAD reconfigures specific portions of the molecular interaction structure. Through an integrative network-based approach, we rank-ordered these network structures for relevance to LOAD pathology, highlighting an immune and microglia-specific module dominated by genes involved in pathogen phagocytosis, containing TYROBP as a key regulator and up-regulated in LOAD. Mouse microglia cells over-expressing intact or truncated TYROBP revealed expression changes that significantly overlapped the human brain TYROBP network. Thus the causal network structure is a useful predictor of response to gene perturbations and presents a novel framework to test models of disease mechanisms underlying LOAD.
Whole exome and genome sequencing (WES/WGS) is now routinely offered as a clinical test by a growing number of laboratories. As part of the test design process each laboratory must determine the performance characteristics of the platform, test and informatics pipeline. This report documents one such characterization of WES/WGS.
Whole exome and whole genome sequencing was performed on multiple technical replicates of five reference samples using the Illumina HiSeq 2000/2500. The sequencing data was processed with a GATK-based genome analysis pipeline to evaluate: intra-run, inter-run, inter-mode, inter-machine and inter-library consistency, concordance with orthogonal technologies (microarray, Sanger) and sensitivity and accuracy relative to known variant sets.
Concordance to high-density microarrays consistently exceeds 97% (and typically exceeds 99%) and concordance between sequencing replicates also exceeds 97%, with no observable differences between different flow cells, runs, machines or modes. Sensitivity relative to high-density microarray variants exceeds 95%. In a detailed study of a 129 kb region, sensitivity was lower, with some validated single-base insertions and deletions “not called”. Different variants are “not called” in each replicate: of all variants identified in WES data from the NA12878 reference sample, 74% of indels and 89% of SNVs were called in all seven replicates; in NA12878 WGS, 52% of indels and 88% of SNVs were called in all six replicates. Key sources of non-uniformity are variance in depth of coverage, artifactual variants resulting from repetitive regions and larger structural variants.
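The "called in all replicates" figures above are fractions of the union of calls. A minimal sketch of that calculation, assuming each replicate is a set of (chromosome, position, reference, alternate) tuples; the variant records and function names are hypothetical.

```python
def fraction_called_in_all(replicates):
    """Share of all observed variants that were called in every replicate."""
    if not replicates:
        return 0.0
    union = set().union(*replicates)
    if not union:
        return 0.0
    shared = set.intersection(*replicates)
    return len(shared) / len(union)

def pairwise_concordance(calls_a, calls_b):
    """Share of variants seen in either replicate that both replicates called."""
    either = calls_a | calls_b
    return len(calls_a & calls_b) / len(either) if either else 0.0

# Hypothetical calls from three technical replicates of one sample.
a = ("chr1", 1000, "A", "G")
b = ("chr1", 2000, "C", "T")
c = ("chr2", 3000, "G", "GA")
d = ("chr2", 4000, "T", "C")
e = ("chr3", 5000, "CT", "C")
rep1, rep2, rep3 = {a, b, c, d}, {a, b, c}, {a, b, d, e}

all_fraction = fraction_called_in_all([rep1, rep2, rep3])
pair_fraction = pairwise_concordance(rep1, rep2)
print(all_fraction, pair_fraction)
```

Splitting the input sets by variant type before calling these functions gives separate figures for indels and SNVs.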
Blood pressure (BP) is a heritable determinant of risk for cardiovascular disease (CVD). To investigate genetic associations with systolic BP (SBP), diastolic BP (DBP), mean arterial pressure (MAP) and pulse pressure (PP), we genotyped ∼50 000 single-nucleotide polymorphisms (SNPs) that capture variation in ∼2100 candidate genes for cardiovascular phenotypes in 61 619 individuals of European ancestry from cohort studies in the USA and Europe. We identified novel associations between rs347591 and SBP (chromosome 3p25.3, in an intron of HRH1) and between rs2169137 and DBP (chromosome 1q32.1 in an intron of MDM4) and between rs2014408 and SBP (chromosome 11p15 in an intron of SOX6), previously reported to be associated with MAP. We also confirmed 10 previously known loci associated with SBP, DBP, MAP or PP (ADRB1, ATP2B1, SH2B3/ATXN2, CSK, CYP17A1, FURIN, HFE, LSP1, MTHFR, SOX6) at array-wide significance (P < 2.4 × 10−6). We then replicated these associations in an independent set of 65 886 individuals of European ancestry. The findings from expression QTL (eQTL) analysis showed associations of SNPs in the MDM4 region with MDM4 expression. We did not find any evidence of association of the two novel SNPs in MDM4 and HRH1 with sequelae of high BP including coronary artery disease (CAD), left ventricular hypertrophy (LVH) or stroke. In summary, we identified two novel loci associated with BP and confirmed multiple previously reported associations. Our findings extend our understanding of genes involved in BP regulation, some of which may eventually provide new targets for therapeutic intervention.
Approaches exploiting extremes of the trait distribution may reveal novel loci for common traits, but it is unknown whether such loci are generalizable to the general population. In a genome-wide search for loci associated with upper vs. lower 5th percentiles of body mass index, height and waist-hip ratio, as well as clinical classes of obesity including up to 263,407 European individuals, we identified four new loci (IGFBP4, H6PD, RSRC1, PPP2R2A) influencing height detected in the tails and seven new loci (HNF4G, RPTOR, GNAT2, MRPS33P4, ADCY9, HS6ST3, ZZZ3) for clinical classes of obesity. Further, we show that there is large overlap in terms of genetic structure and distribution of variants between traits based on extremes and the general population and little etiologic heterogeneity between obesity subgroups.
Genetic variation at the chromosome 9p21 risk locus promotes cardiovascular disease; however, it is unclear how or which proteins encoded at this locus contribute to disease. We have previously demonstrated that loss of one candidate gene at this locus, cyclin-dependent kinase inhibitor 2B (Cdkn2b), in mice promotes vascular SMC apoptosis and aneurysm progression. Here, we investigated the role of Cdkn2b in atherogenesis and found that in a mouse model of atherosclerosis, deletion of Cdkn2b promoted advanced development of atherosclerotic plaques composed of large necrotic cores. Furthermore, human carriers of the 9p21 risk allele had reduced expression of CDKN2B in atherosclerotic plaques, which was associated with impaired expression of calreticulin, a ligand required for activation of engulfment receptors on phagocytic cells. As a result of decreased calreticulin, CDKN2B-deficient apoptotic bodies were resistant to efferocytosis and not efficiently cleared by neighboring macrophages. These uncleared SMCs elicited a series of proatherogenic juxtacrine responses associated with increased foam cell formation and inflammatory cytokine elaboration. The addition of exogenous calreticulin reversed defects associated with loss of Cdkn2b and normalized engulfment of Cdkn2b-deficient cells. Together, these data suggest that loss of CDKN2B promotes atherosclerosis by increasing the size and complexity of the lipid-laden necrotic core through impaired efferocytosis.
Single-molecule real-time (SMRT) DNA sequencing allows the systematic detection of chemical modifications such as methylation but has not previously been applied on a genome-wide scale. We used this approach to detect 49,311 putative 6-methyladenine (m6A) residues and 1,407 putative 5-methylcytosine (m5C) residues in the genome of a pathogenic Escherichia coli strain. We obtained strand-specific information for methylation sites and a quantitative assessment of the frequency of methylation at each modified position. We deduced the sequence motifs recognized by the methyltransferase enzymes present in this strain without prior knowledge of their specificity. Furthermore, we found that deletion of a phage-encoded methyltransferase-endonuclease (restriction-modification; RM) system induced global transcriptional changes and led to gene amplification, suggesting that the role of RM systems extends beyond protecting host genomes from foreign DNA.
Genome-wide association studies have implicated allelic variation at 9p21.3 in multiple forms of vascular disease, including atherosclerotic coronary heart disease and abdominal aortic aneurysm. As for other genes at 9p21.3, human eQTL studies have associated expression of the tumor suppressor gene CDKN2B with the risk haplotype, but its potential role in vascular pathobiology remains unclear.
Methods and Results
Here we employed vascular injury models and found that Cdkn2b knockout mice displayed the expected increase in proliferation after injury, but developed reduced neointimal lesions and larger aortic aneurysms. In situ and in vitro studies suggested that these effects were due to increased smooth muscle cell apoptosis. Adoptive bone marrow transplant studies confirmed that the observed effects of Cdkn2b were mediated through intrinsic vascular cells and were not dependent on bone marrow-derived inflammatory cells. Mechanistic studies suggested that the observed increase in apoptosis was due to a reduction in MDM2 and an increase in p53 signaling, possibly due in part to compensation by other genes at the 9p21.3 locus. Dual inhibition of both Cdkn2b and p53 led to a reversal of the vascular phenotype in each model.
These results suggest that reduced CDKN2B expression and increased SMC apoptosis may be one mechanism underlying the 9p21.3 association with aneurysmal disease.
CDKN2B; apoptosis; smooth muscle; remodeling; abdominal aortic aneurysm; genome wide association studies; p53
Retrospective studies have demonstrated that nearly 50% of patients with ovarian cancer with normal cancer antigen 125 (CA125) levels have persistent disease; however, prospectively distinguishing between patients is currently impossible. Here, we demonstrate that for one patient, with the first reported fibroblast growth factor receptor 2 (FGFR2) fusion transcript in ovarian cancer, circulating tumor DNA (ctDNA) is a more sensitive and specific biomarker than CA125, and it can also inform on a candidate therapeutic. For a 4-year period, during which the patient underwent primary debulking surgery and chemotherapy, tumor recurrences, and multiple chemotherapeutic regimens, blood samples were longitudinally collected and stored. Whereas postsurgical CA125 levels were elevated only three times for 28 measurements, the FGFR2 fusion ctDNA biomarker was readily detectable by quantitative real-time reverse transcription-polymerase chain reaction (PCR) in all of these same blood samples and in the tumor recurrences. Given the persistence of the FGFR2 fusion, we treated tumor cells derived from this patient and others with the FGFR2 inhibitor BGJ398. Only tumor cells derived from this patient were sensitive to FGFR2 inhibitor treatment. Using the same methodologic approach, we demonstrate in a second patient with a different fusion that PCR and agarose gel electrophoresis can also be used to identify tumor-specific DNA in the circulation. Taken together, we demonstrate that a relatively inexpensive, PCR-based ctDNA surveillance assay can outperform CA125 in identifying occult disease.
Multiple laboratories now offer clinical whole genome sequencing (WGS). We anticipate WGS becoming routinely used in research and clinical practice. Many institutions are exploring how best to educate geneticists and other professionals about WGS. Providing students in WGS courses with the option to analyze their own genome sequence is one strategy that might enhance students’ engagement and motivation to learn about personal genomics. However, if this option is presented to students, it is vital they make informed decisions, do not feel pressured into analyzing their own genomes by their course directors or peers, and feel free to analyze a third-party genome if they prefer. We therefore developed a 26-hour introductory genomics course in part to help students make informed decisions about whether to receive personal WGS data in a subsequent advanced genomics course. In the advanced course, they had the option to receive their own personal genome data, or an anonymous genome, at no financial cost to them. Our primary aims were to examine whether students made informed decisions regarding analyzing their personal genomes, and whether there was evidence that the introductory course enabled the students to make a more informed decision.
This was a longitudinal cohort study in which students (N = 19) completed questionnaires assessing their intentions, informed decision-making, attitudes and knowledge before (T1) and after (T2) the introductory course, and before the advanced course (T3). Informed decision-making was assessed using the Decisional Conflict Scale.
At the start of the introductory course (T1), most (17/19) students intended to receive their personal WGS data in the subsequent course, but many expressed conflict around this decision. Decisional conflict decreased after the introductory course (T2) indicating there was an increase in informed decision-making, and did not change before the advanced course (T3). This suggests that it was the introductory course content rather than simply time passing that had the effect. In the advanced course, all (19/19) students opted to receive their personal WGS data. No changes in technical knowledge of genomics were observed. Overall attitudes towards WGS were broadly positive.
Providing students with intensive introductory education about WGS may help them make informed decisions about whether or not to work with their personal WGS data in an educational setting.
Genome-wide association studies (GWAS) have identified 36 loci associated with body mass index (BMI), predominantly in populations of European ancestry. We conducted a meta-analysis to examine the association of >3.2 million SNPs with BMI in 39,144 men and women of African ancestry, and followed up the most significant associations in an additional 32,268 individuals of African ancestry. We identified one novel locus at 5q33 (GALNT10, rs7708584, p=3.4×10−11) and another at 7p15 when combined with data from the GIANT consortium (MIR148A/NFE2L3, rs10261878, p=1.2×10−10). We also found suggestive evidence of an association at a third locus at 6q16 in the African ancestry sample (KLHL32, rs974417, p=6.9×10−8). Thirty-two of the 36 previously established BMI variants displayed directionally consistent effect estimates in our GWAS (binomial p=9.7×10−7), of which five reached genome-wide significance. These findings provide strong support for shared BMI loci across populations as well as for the utility of studying ancestrally diverse populations.
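The binomial P-value for directional consistency can be checked directly. The sketch below assumes a one-sided test of 32 or more consistent directions out of 36 against a null probability of 0.5; under that assumption the result is about 9.7 × 10−7, in line with the reported value. The function name is illustrative.

```python
from math import comb

def binomial_upper_tail(successes, trials, prob=0.5):
    """P(X >= successes) for X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, k) * prob ** k * (1.0 - prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# 32 of 36 established variants had the same direction of effect.
sign_test_p = binomial_upper_tail(32, 36)
print(f"{sign_test_p:.2e}")
```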
Sequence-based variation in gene expression is a key driver of disease risk. Common variants regulating expression in cis have been mapped in many eQTL studies typically in single tissues from unrelated individuals. Here, we present a comprehensive analysis of gene expression across multiple tissues conducted in a large set of mono- and dizygotic twins that allows systematic dissection of genetic (cis and trans) and non-genetic effects on gene expression. Using identity-by-descent estimates, we show that at least 40% of the total heritable cis-effect on expression cannot be accounted for by common cis-variants, a finding which exposes the contribution of low frequency and rare regulatory variants with respect to both transcriptional regulation and complex trait susceptibility. We show that a substantial proportion of gene expression heritability is trans to the structural gene and identify several replicating trans-variants which act predominantly in a tissue-restricted manner and may regulate the transcription of many genes.
Association studies have identified several signals at the LRRK2 locus for Parkinson's disease (PD), Crohn's disease (CD) and leprosy. However, little is known about the molecular mechanisms mediating these effects. To further characterize this locus, we fine-mapped the risk association in 5,802 PD cases and 5,556 controls using a dense genotyping array (ImmunoChip). Using samples from 134 post-mortem control adult human brains (UK Human Brain Expression Consortium), where up to ten brain regions were available per individual, we studied the regional variation, splicing and regulation of LRRK2. We found convincing evidence for a common variant PD association located outside of the LRRK2 protein coding region (rs117762348, A>G, P = 2.56×10⁻⁸, case/control MAF 0.083/0.074, odds ratio 0.86 for the minor allele with 95% confidence interval [0.80–0.91]). We show that mRNA expression levels are highest in cortical regions and lowest in cerebellum. We find an exon quantitative trait locus (QTL) in brain samples that localizes to exons 32–33 and investigate the molecular basis of this eQTL using RNA-Seq data in n = 8 brain samples. The genotype underlying this eQTL is in strong linkage disequilibrium with the CD-associated non-synonymous SNP rs3761863 (M2397T). We found two additional QTLs in liver and monocyte samples, but none of these explained the common variant PD association at rs117762348. Our results characterize the LRRK2 locus, and highlight the importance and difficulties of fine-mapping and integration of multiple datasets to delineate pathogenic variants and thus develop an understanding of disease mechanisms.
Dramatic improvements in DNA sequencing technology have revolutionized our ability to characterize most genomic diversity. However, accurate resolution of large structural events has remained challenging due to the comparatively shorter read lengths of second-generation technologies. Emerging third-generation sequencing technologies, which yield markedly increased read length on rapid time scales and for low cost, have the potential to address assembly limitations. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically significant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 reference, we obtain 14 and 8 scaffolds greater than 1 kb, respectively, correcting several errors in the underlying source data. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.
Coronary heart disease (CHD) is the leading cause of mortality in both developed and developing countries worldwide. Genome-wide association studies (GWAS) have now identified 46 independent susceptibility loci for CHD; however, the biological and disease-relevant mechanisms for these associations remain elusive. A large-scale meta-analysis of GWAS recently identified a CHD-associated locus in Caucasians at chromosome 6q23.2, a region containing the transcription factor gene TCF21. TCF21 (Capsulin/Pod1/Epicardin) is a member of the basic-helix-loop-helix (bHLH) transcription factor family, and regulates cell fate decisions and differentiation in the developing coronary vasculature. Herein, we characterize a cis-regulatory mechanism by which the lead polymorphism rs12190287 disrupts an atypical activator protein 1 (AP-1) element, as demonstrated by allele-specific transcriptional regulation, transcription factor binding, and chromatin organization, leading to altered TCF21 expression. Further, this element is shown to mediate signaling through platelet-derived growth factor receptor beta (PDGFR-β) and Wilms tumor 1 (WT1) pathways. A second disease allele identified in East Asians also appears to disrupt an AP-1-like element. Thus, both disease-related growth factor and embryonic signaling pathways may regulate CHD risk through two independent alleles at TCF21.
As much as half of the risk of developing coronary heart disease is genetically predetermined. Genome-wide association studies in human populations have now uncovered multiple sites of common genetic variation associated with heart disease. However, the biological mechanisms responsible for linking the disease associations with changes in gene expression are still underexplored. One of these variants occurs within the vascular developmental factor, TCF21, leading to dysregulated gene expression. Using various in silico and molecular approaches, we identify an intricate allele-specific regulatory mechanism underlying altered expression of TCF21. Notably, we observe that two apparently independent risk alleles identified in distinct populations function through a similar regulatory mechanism. Together these data suggest that conserved upstream pathways may organize the complex genetic etiology of coronary heart disease and potentially lead to new treatment opportunities.
Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease and different breast cancer subtypes are treated differently. Understanding the difference in prognosis for breast cancer based on its molecular and phenotypic features is one avenue for improving treatment by matching the proper treatment with molecular subtypes of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis using large datasets containing genomic and clinical information and an online real-time leaderboard program used to speed feedback to the modeling team and to encourage each modeler to work towards achieving a higher ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.
We developed an extensible software framework for sharing molecular prognostic models of breast cancer survival in a transparent collaborative environment and subjecting each model to automated evaluation using objective metrics. The computational framework presented in this study, our detailed post-hoc analysis of hundreds of modeling approaches, and the use of a novel cutting-edge data resource together represent one of the largest-scale systematic studies to date assessing the factors influencing accuracy of molecular-based prognostic models in breast cancer. Our results demonstrate the ability to infer prognostic models with accuracy on par with or greater than previously reported studies, with significant performance improvements by using state-of-the-art machine learning approaches trained on clinical covariates. Our results also demonstrate the difficulty in incorporating molecular data to achieve substantial performance improvements over clinical covariates alone. However, improvement was achieved by combining clinical feature data with intelligent selection of important molecular features based on domain-specific prior knowledge. We observe that ensemble models aggregating the information across many diverse models achieve among the highest scores of all models and systematically outperform individual models within the ensemble, suggesting a general strategy for leveraging the wisdom of crowds to develop robust predictive models.
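To illustrate why an ensemble can outperform each of its member models, the following minimal sketch averages the rank-transformed risk predictions of three imperfect models and scores them with a concordance index. Everything here is an assumption for illustration: the summary does not name the scoring metric or the aggregation rule, the helper names (`concordance_index`, `ranks`, `rank_average`) are hypothetical, and the five-patient data set is invented toy data, not study data.

```python
from itertools import combinations


def concordance_index(times, events, risk):
    """Fraction of comparable patient pairs ranked correctly by predicted risk.

    A pair is comparable when the patient with the shorter follow-up time
    had an observed event. Ties in predicted risk count as half.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        a, b = (i, j) if times[i] < times[j] else (j, i)  # a has the earlier time
        if times[a] == times[b] or not events[a]:
            continue  # not comparable
        comparable += 1
        if risk[a] > risk[b]:
            concordant += 1.0
        elif risk[a] == risk[b]:
            concordant += 0.5
    return concordant / comparable


def ranks(scores):
    """1-based ranks, lowest score first (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_average(models):
    """Ensemble risk: mean rank of each patient across all models."""
    ranked = [ranks(m) for m in models]
    return [sum(col) / len(col) for col in zip(*ranked)]


# Invented toy data: five patients, all with observed events.
times = [2, 4, 6, 8, 10]
events = [1, 1, 1, 1, 1]
# Three hypothetical models, each misordering a different pair of patients.
model_a = [5, 3, 4, 2, 1]
model_b = [4, 5, 3, 2, 1]
model_c = [5, 4, 3, 1, 2]

individual = [concordance_index(times, events, m) for m in (model_a, model_b, model_c)]
ensemble = concordance_index(times, events, rank_average([model_a, model_b, model_c]))
print("individual:", individual, "ensemble:", ensemble)
```

In this toy case each model makes a different mistake, so the errors cancel when ranks are averaged. Real models share correlated errors, so the gain from an ensemble is typically smaller.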
Crohn’s disease (CD) and ulcerative colitis (UC), the two common forms of inflammatory bowel disease (IBD), affect over 2.5 million people of European ancestry with rising prevalence in other populations [1]. Genome-wide association studies (GWAS) and subsequent meta-analyses of CD and UC [2,3] as separate phenotypes implicated previously unsuspected mechanisms, such as autophagy [4], in pathogenesis and showed that some IBD loci are shared with other inflammatory diseases [5]. Here we expand knowledge of relevant pathways by undertaking a meta-analysis of CD and UC genome-wide association scans, with validation of significant findings in more than 75,000 cases and controls. We identify 71 new associations, for a total of 163 IBD loci that meet genome-wide significance thresholds. Most loci contribute to both phenotypes, and both directional and balancing selection effects are evident. Many IBD loci are also implicated in other immune-mediated disorders, most notably with ankylosing spondylitis and psoriasis. We also observe striking overlap between susceptibility loci for IBD and mycobacterial infection. Gene co-expression network analysis emphasizes this relationship, with pathways shared between host responses to mycobacteria and those predisposing to IBD.