Transcription factors (TFs) are fundamental controllers of cellular regulation that function in a complex and combinatorial manner. Accurate identification of a transcription factor's targets is essential to understanding the role that factors play in disease biology. However, due to a high false positive rate, identifying coherent functional target sets is difficult. We have created an improved mapping of targets by integrating ChIP-Seq data with 423 functional modules derived from 9,395 human expression experiments. We identified 5,002 TF-module relationships, significantly improved TF target prediction, and found 30 high-confidence TF-TF associations, of which 14 are known. Importantly, we also connected TFs to diseases through these functional modules and identified 3,859 significant TF-disease relationships. As an example, we found a link between MEF2A and Crohn's disease, which we validated in an independent expression dataset. These results show the power of combining expression data and ChIP-Seq data to remove noise and better extract the associations between TFs, functional modules, and disease.
Transcription factors (TFs) are crucial to the precise regulation of many cellular processes and are thus responsible for many human phenotypes and diseases. Now that the ENCODE project has mapped hundreds of TFs to their genomic binding locations, extracting functional biological signals is the next step in understanding their role in disease. In this paper, we present a novel approach to identifying TF targets and use these targets to find regulatory relationships between TFs and diseases. We present a large open dataset of putative TF-TF interactions and TF-disease associations which includes known connections as well as novel ones. We validate one of our novel TF-disease associations, MEF2A and Crohn's disease, suggesting that our approach generates testable disease association hypotheses. Integrating these datasets will be crucial for understanding phenotypes and complex diseases.
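One common way to score a TF-module relationship is to ask whether a TF's ChIP-Seq targets are over-represented among a module's genes. The sketch below uses a one-sided hypergeometric test for that purpose. It is an illustrative assumption, not necessarily the statistic used in the study, and the function names and toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws)."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible overlaps contribute nothing to the sum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def tf_module_enrichment(tf_targets, module_genes, background):
    """Return (overlap size, one-sided enrichment p-value) for one TF and one module."""
    background = set(background)
    targets = set(tf_targets) & background   # ChIP-Seq targets seen in the background
    module = set(module_genes) & background  # module genes seen in the background
    overlap = len(targets & module)
    p_value = hypergeom_sf(overlap, len(background), len(targets), len(module))
    return overlap, p_value


# Toy example with made-up gene identifiers.
background = [f"g{i}" for i in range(10)]
overlap, p_value = tf_module_enrichment(
    ["g0", "g1", "g2", "g3"], ["g0", "g1", "g2"], background
)
print(overlap, p_value)
```

In a real analysis the p-values from every TF-module pair would need multiple-testing correction before any relationship is called significant.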
Identifying environmentally-specific genetic effects is a key challenge in understanding the structure of complex traits. Model organisms play a crucial role in the identification of such gene-by-environment interactions, as a result of the unique ability to observe genetically similar individuals across multiple distinct environments. Many model organism studies examine the same traits but under varying environmental conditions. For example, knock-out or diet-controlled studies are often used to examine cholesterol in mice. These studies, when examined in aggregate, provide an opportunity to identify genomic loci exhibiting environmentally-dependent effects. However, the straightforward application of traditional methodologies to aggregate separate studies suffers from several problems. First, environmental conditions are often variable and do not fit the standard univariate model for interactions. Additionally, applying a multivariate model results in increased degrees of freedom and low statistical power. In this paper, we jointly analyze multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. We apply our new method to combine 17 mouse studies containing in aggregate 4,965 distinct animals. We identify 26 significant loci involved in high-density lipoprotein (HDL) cholesterol, many of which are consistent with previous findings. Several of these loci show significant evidence of involvement in gene-by-environment interactions.
An additional advantage of our meta-analysis approach is that our combined study has significantly higher power and improved resolution compared to any single study, thus explaining the large number of loci discovered in the combined study.
Identifying gene-by-environment interactions is important for understanding the architecture of a complex trait. Discovering gene-by-environment interactions requires the observation of the same phenotype in individuals under different environments. Model organism studies are often conducted under different environments. These studies provide an unprecedented opportunity for researchers to identify gene-by-environment interactions. A difference in the effect size of a genetic variant between two studies conducted in different environments may suggest the presence of a gene-by-environment interaction. In this paper, we propose to employ a random-effect-based meta-analysis approach to identify gene-by-environment interactions, which assumes different or heterogeneous effect sizes between studies. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional approaches for discovery of gene-by-environment interactions, which treat the gene-by-environment interactions as covariates in the analysis. We provide an intuitive way to visualize the results of the meta-analysis at a locus, which allows us to obtain biological insights into gene-by-environment interactions. We demonstrate our method by searching for gene-by-environment interactions by combining 17 mouse genetic studies totaling 4,965 distinct animals.
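The idea that an interaction shows up as between-study heterogeneity can be illustrated with textbook meta-analysis quantities. The sketch below computes the inverse-variance fixed-effect estimate, Cochran's Q, the DerSimonian-Laird estimate of between-study variance, and I-squared for one variant. These are standard substitutes chosen for illustration, not the specific random effects test used in the study, and the effect sizes are invented.

```python
def heterogeneity(betas, ses):
    """Summarize between-study heterogeneity for one variant.

    betas: per-study effect size estimates
    ses:   per-study standard errors
    """
    weights = [1.0 / se ** 2 for se in ses]  # inverse-variance weights
    w_sum = sum(weights)
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / w_sum

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1

    # DerSimonian-Laird moment estimator of between-study variance.
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c)

    # I^2: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"beta_fixed": beta_fixed, "Q": q, "df": df, "tau2": tau2, "I2": i2}


# Two hypothetical studies run in different environments with different effects.
result = heterogeneity(betas=[0.0, 2.0], ses=[0.5, 0.5])
print(result)
```

A large Q relative to its degrees of freedom suggests that the effect differs between studies, which is the signal a gene-by-environment interaction would produce when the studies differ in environment.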
Recent high-throughput efforts such as ENCODE have generated a large body of genome-scale transcriptional data in multiple conditions (e.g., cell-types and disease states). Leveraging these data is especially important for network-based approaches to human disease, for instance to identify coherent transcriptional modules (subnetworks) that can inform functional disease mechanisms and pathological pathways. Yet, genome-scale network analysis across conditions is significantly hampered by the paucity of robust and computationally-efficient methods. Building on the Higher-Order Generalized Singular Value Decomposition, we introduce a new algorithmic approach for efficient, parameter-free and reproducible identification of network-modules simultaneously across multiple conditions. Our method can accommodate weighted (and unweighted) networks of any size and can similarly use co-expression or raw gene expression input data, without hinging upon the definition and stability of the correlation used to assess gene co-expression. In simulation studies, we demonstrated distinctive advantages of our method over existing methods: it accurately recovered both common and condition-specific network-modules without the ad-hoc input parameters required by other approaches. We applied our method to genome-scale and multi-tissue transcriptomic datasets from rats (microarray-based) and humans (mRNA-sequencing-based) and identified several common and tissue-specific subnetworks with functional significance, which were not detected by other methods. In humans we recapitulated the crosstalk between cell-cycle progression and cell-extracellular matrix interaction processes in ventricular zones during neocortex expansion and, further, we uncovered previously unappreciated pathways related to development of later cognitive functions in the cortical plate of the developing brain.
Analyses of seven rat tissues identified a multi-tissue subnetwork of co-expressed heat shock protein (Hsp) and cardiomyopathy genes (Bag3, Cryab, Kras, Emd, Plec), which was significantly replicated using separate failing heart and liver gene expression datasets in humans, thus revealing a conserved functional role for Hsp genes in cardiovascular disease.
Complex biological interactions and processes can be modelled as networks, for instance metabolic pathways or protein-protein interactions. The growing availability of large high-throughput data in several experimental conditions now permits the full-scale analysis of biological interactions and processes. However, no reliable and computationally efficient methods for simultaneous analysis of multiple large-scale interaction datasets (networks) have been developed to date. To overcome this shortcoming, we have developed a new computational framework that is parameter-free, computationally efficient and highly reliable. We showed how these distinctive properties make it a useful tool for real genomic data exploration and analyses. Indeed, in extensive simulation studies and real-data analyses we have demonstrated that our method outperformed existing approaches in terms of efficiency and, most importantly, reproducibility of the results. Beyond the computational advantages, we illustrated how our method can be effectively applied to leverage the vast stream of genome-scale transcriptional data that has risen exponentially over the last years. In contrast with existing approaches, using our method we were able to identify and replicate multi-tissue gene co-expression networks that were associated with specific functional processes relevant to phenotypic variation and disease in rats and humans.
Personal genome analysis is now being considered for evaluation of disease risk in healthy individuals, utilizing both rare and common variants. Multiple scores have been developed to predict the deleteriousness of amino acid substitutions, using information on the allele frequencies, level of evolutionary conservation, and averaged structural evidence. However, agreement among these scores is limited and they likely over-estimate the fraction of the genome that is deleterious.
This study proposes an integrative approach to identify a subset of homozygous non-synonymous single nucleotide polymorphisms (nsSNPs). An 8-level classification scheme is constructed from the presence/absence of deleterious predictions combined with evidence of association with disease or complex traits. Detailed literature searches and structural validations are then performed for a subset of 826 homozygous mis-sense mutations in 575 proteins found in the genomes of 12 healthy adults.
Implementation of the Association-Adjusted Consensus Deleterious Scheme (AACDS) classifies 11% of all predicted highly deleterious homozygous variants as most likely to influence disease risk. The number of such variants per genome ranges from 0 to 8 with no significant difference between African and Caucasian Americans. Detailed analysis of mutations affecting the APOE, MTMR2, THSB1, CHIA, αMyHC, and AMY2A proteins shows how the protein structure is likely to be disrupted, even though the associated phenotypes have not been documented in the corresponding individuals.
The classification system for homozygous nsSNPs provides an opportunity to systematically rank nsSNPs based on suggestive evidence from annotations and sequence-based predictions. The ranking scheme, in-depth literature searches, and structural validations of highly prioritized mis-sense mutations complement traditional sequence-based approaches and should have particular utility for the development of individualized health profiles. An online tool reporting the AACDS score for any variant is provided at the authors’ website.
Homozygous variant; Non-synonymous single nucleotide polymorphism; Personal genome interpretation; Variant prioritization; Protein structure analysis
The major histocompatibility complex (MHC) region is strongly associated with multiple sclerosis (MS) susceptibility. HLA-DRB1*15:01 has the strongest effect, and several other alleles have been reported at different levels of validation. Using SNP data from genome-wide studies, we imputed and tested classical alleles and amino acid polymorphisms in 8 classical human leukocyte antigen (HLA) genes in 5,091 cases and 9,595 controls. We identified 11 statistically independent effects overall: 6 HLA-DRB1 and one DPB1 alleles in class II, one HLA-A and two B alleles in class I, and one signal in a region spanning from MICB to LST1. This genomic segment does not contain any HLA class I or II genes and provides robust evidence for the involvement of a non-HLA risk allele within the MHC. Interestingly, this region contains the TNF gene, the cognate ligand of the well-validated TNFRSF1A MS susceptibility gene. The classical HLA effects can be explained to some extent by polymorphic amino acid positions in the peptide-binding grooves. This study dissects the independent effects in the MHC, a critical region for MS susceptibility that harbors multiple risk alleles.
Multiple sclerosis (MS) is an inflammatory and neurodegenerative disease with a heritable component. Although it has been known for a long time that the strongest MS risk factor maps to the major histocompatibility complex (MHC) on chromosome 6, there are still many unresolved questions as to the identity and the nature of the risk variants within the MHC. Because the MHC has a complex structure, systematic investigation across this region has been challenging. In this study, we used state-of-the-art imputation methods coupled to statistical regression to query variants in the human leukocyte antigen (HLA) class I and II genes for a role in MS risk. Starting from available SNP genotype data, we replicated the strongest risk factor, the HLA-DRB1*15:01 allele, and were able to identify 11 independent effects in total. Functional studies are now needed to understand their mechanism in MS etiology.
Interactions between HLA class I molecules and killer-cell immunoglobulin-like receptors (KIR) control natural killer cell (NK) functions in immunity and reproduction. Encoded by genes on different chromosomes, these polymorphic ligands and receptors correlate highly with disease resistance and susceptibility. Although studied at low resolution in many populations, high-resolution analysis of combinatorial diversity of HLA class I and KIR is limited to Asian and Amerindian populations with low genetic diversity. At the other end of the spectrum is the West African population investigated here: we studied 235 individuals, including 104 mother-child pairs, from the Ga-Adangbe of Ghana. This population has a rich diversity of 175 KIR variants forming 208 KIR haplotypes, and 81 HLA-A, -B and -C variants forming 190 HLA class I haplotypes. Each individual we studied has a unique compound genotype of HLA class I and KIR, forming 1–14 functional ligand-receptor interactions. Maintaining this exceptionally high polymorphism is balancing selection. The centromeric region of the KIR locus, encoding HLA-C receptors, is highly diverse, whereas the telomeric region, encoding Bw4-specific KIR3DL1, lacks diversity in Africans. Present in the Ga-Adangbe are high frequencies of Bw4-bearing HLA-B*53:01 and Bw4-lacking HLA-B*35:01, which otherwise are identical. Balancing selection at key residues maintains numerous HLA-B allotypes having and lacking Bw4, and also those of stronger and weaker interaction with LILRB1, a KIR-related receptor. Correspondingly, there is a balance at key residues of KIR3DL1 that modulate its level of cell-surface expression. Thus, capacity to interact with NK cells synergizes with peptide binding diversity to drive HLA-B allele frequency distribution. These features of KIR and HLA are consistent with ongoing co-evolution and selection imposed by a pathogen endemic to West Africa.
Because of the prevalence of malaria in the Ga-Adangbe and previous associations of cerebral malaria with HLA-B*53:01 and KIR, Plasmodium falciparum is a candidate pathogen.
Natural killer cells are white blood cells with critical roles in human health that deliver front-line immunity against pathogens and nurture placentation in early pregnancy. Controlling these functions are cell-surface receptors called KIR that interact with HLA class I ligands expressed on most cells of the body. KIR and HLA are both products of complex families of variable genes, but present on separate chromosomes. Many HLA and KIR variants and their combinations associate with resistance to specific infections and pregnancy syndromes. Previously we identified basic components of the system necessary for individual and population survival. Here, we explore the system at its most genetically diverse by studying the Ga-Adangbe population from Ghana in West Africa. Co-evolution of KIR receptors with their HLA targets is ongoing in the Ga-Adangbe, with every one of 235 individuals studied having a unique set of KIR receptors and HLA class I ligands. In addition, one critical combination of receptor and ligand maintains alternative forms that either can or cannot interact with their ‘partner.’ This balance resembles that induced by malfunctioning variants of hemoglobin that confer resistance to malaria, a candidate disease for driving diversity and co-evolution of KIR and HLA class I in the Ga-Adangbe.
The improved characterisation of risk factors for rheumatoid arthritis (RA) suggests they could be combined to identify individuals at increased disease risks in whom preventive strategies may be evaluated. We aimed to develop an RA prediction model capable of generating clinically relevant predictive data and to determine if it better predicted younger onset RA (YORA). Our novel modelling approach combined odds ratios for 15 four-digit/10 two-digit HLA-DRB1 alleles, 31 single nucleotide polymorphisms (SNPs) and ever-smoking status in males to determine risk using computer simulation and confidence interval based risk categorisation. Only males were evaluated in our models incorporating smoking as ever-smoking is a significant risk factor for RA in men but not women. We developed multiple models to evaluate each risk factor's impact on prediction. Each model's ability to discriminate anti-citrullinated protein antibody (ACPA)-positive RA from controls was evaluated in two cohorts: Wellcome Trust Case Control Consortium (WTCCC: 1,516 cases; 1,647 controls); UK RA Genetics Group Consortium (UKRAGG: 2,623 cases; 1,500 controls). HLA and smoking provided strongest prediction with good discrimination evidenced by an HLA-smoking model area under the curve (AUC) value of 0.813 in both WTCCC and UKRAGG. SNPs provided minimal prediction (AUC 0.660 WTCCC/0.617 UKRAGG). Whilst high individual risks were identified, with some cases having estimated lifetime risks of 86%, only a minority overall had substantially increased odds for RA. High risks from the HLA model were associated with YORA (P<0.0001); ever-smoking associated with older onset disease. This latter finding suggests smoking's impact on RA risk manifests later in life. Our modelling demonstrates that combining risk factors provides clinically informative RA prediction; additionally HLA and smoking status can be used to predict the risk of younger and older onset RA, respectively.
Rheumatoid arthritis (RA) is a common, incurable disease with major individual and health service costs. Preventing its development is therefore an important goal. Being able to predict who will develop RA would allow researchers to look at ways to prevent it. Many factors have been found that increase someone's risk of RA. These are divided into genetic and environmental (such as smoking) factors. The risk of RA associated with each factor has previously been reported. Here, we demonstrate a method that combines these risk factors in a process called “prediction modelling” to estimate someone's lifetime risk of RA. We show that, firstly, our prediction models can identify people with very high risks of RA and, secondly, they can be used to identify people at risk of developing RA at a younger age. Although these findings are an important first step towards preventing RA, as only a minority of people tested had substantially increased disease risks, our models could not be used to screen the general population. Instead, they need testing in people already at risk of RA, such as relatives of affected patients. In this context they could identify sufficient numbers of high-risk people to allow preventive methods to be evaluated.
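Two ideas from the text above, combining per-factor odds ratios into one score and measuring discrimination with the area under the curve (AUC), can be sketched in a few lines. The sketch multiplies odds ratios under an assumption of independent, multiplicative effects, which is a simplification and not the simulation-based procedure described in the study. All odds ratios, factor names, and scores are hypothetical.

```python
from math import prod


def combined_odds_ratio(carried_factors, odds_ratios):
    """Multiply the odds ratios of the risk factors a person carries.

    Assumes independent, multiplicative effects (a simplifying assumption).
    """
    return prod(odds_ratios[f] for f in carried_factors)


def auc(case_scores, control_scores):
    """Probability that a random case scores above a random control (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical odds ratios; not values from the study.
odds_ratios = {"hla_risk_allele": 3.0, "ever_smoker": 2.0, "snp_a": 1.2}

score = combined_odds_ratio({"hla_risk_allele", "ever_smoker"}, odds_ratios)
discrimination = auc(case_scores=[6.0, 3.0, 1.0], control_scores=[1.0, 2.0, 1.2])
print(score, discrimination)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from controls.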
A multi-ethnic study demonstrates that the extrapolation of genetic disease risk models from European populations to other ethnicities is compromised more strongly by genetic structure than by environmental or global genetic background in differential genetic risk associations across ethnicities.
The vast majority of genome-wide association study (GWAS) findings reported to date are from populations with European Ancestry (EA), and it is not yet clear how broadly the genetic associations described will generalize to populations of diverse ancestry. The Population Architecture Using Genomics and Epidemiology (PAGE) study is a consortium of multi-ancestry, population-based studies formed with the objective of refining our understanding of the genetic architecture of common traits emerging from GWAS. In the present analysis of five common diseases and traits, including body mass index, type 2 diabetes, and lipid levels, we compare direction and magnitude of effects for GWAS-identified variants in multiple non-EA populations against EA findings. We demonstrate that, in all populations analyzed, a significant majority of GWAS-identified variants have allelic associations in the same direction as in EA, with none showing a statistically significant effect in the opposite direction, after adjustment for multiple testing. However, 25% of tagSNPs identified in EA GWAS have significantly different effect sizes in at least one non-EA population, and these differential effects were most frequent in African Americans where all differential effects were diluted toward the null. We demonstrate that differential LD between tagSNPs and functional variants within populations contributes significantly to dilute effect sizes in this population. Although most variants identified from GWAS in EA populations generalize to all non-EA populations assessed, genetic models derived from GWAS findings in EA may generate spurious results in non-EA populations due to differential effect sizes. Regardless of the origin of the differential effects, caution should be exercised in applying any genetic risk prediction model based on tagSNPs outside of the ancestry group in which it was derived. 
Models based directly on functional variation may generalize more robustly, but the identification of functional variants remains challenging.
The number of known associations between human diseases and common genetic variants has grown dramatically in the past decade, most being identified in large-scale genetic studies of people of Western European origin. But because the frequencies of genetic variants can differ substantially between continental populations, it's important to assess how well these associations can be extended to populations with different continental ancestry. Are the correlations between genetic variants, disease endpoints, and risk factors consistent enough for genetic risk models to be reliably applied across different ancestries? Here we describe a systematic analysis of disease outcome and risk-factor–associated variants (tagSNPs) identified in European populations, in which we test whether the effect size of a tagSNP is consistent across six populations with significant non-European ancestry. We demonstrate that although nearly all such tagSNPs have effects in the same direction across all ancestries (i.e., variants associated with higher risk in Europeans will also be associated with higher risk in other populations), roughly a quarter of the variants tested have significantly different magnitude of effect (usually lower) in at least one non-European population. We therefore advise caution in the use of tagSNP-based genetic disease risk models in populations that have a different genetic ancestry from the population in which original associations were first made. We then show that this differential strength of association can be attributed to population-dependent variations in the correlation between tagSNPs and the variant that actually determines risk—the so-called functional variant. Risk models based on functional variants are therefore likely to be more robust than tagSNP-based models.
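The two comparisons described above, agreement in direction of effect and difference in magnitude of effect, can each be checked with a simple test. The sketch below uses a one-sided binomial sign test for direction and a z-test on the difference between two effect estimates for magnitude. Both are generic choices made for illustration and are not claimed to be the exact tests used in the study. All numbers are invented.

```python
from math import comb, erfc, sqrt


def sign_test_p(n_same_direction, n_total):
    """One-sided P(X >= n_same_direction) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, i) for i in range(n_same_direction, n_total + 1))
    return tail / 2 ** n_total


def effect_difference(beta_a, se_a, beta_b, se_b):
    """z statistic and two-sided p-value for a difference between two effect sizes.

    Assumes the two estimates come from independent samples.
    """
    z = (beta_a - beta_b) / sqrt(se_a ** 2 + se_b ** 2)
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical: 9 of 10 variants share the direction seen in the discovery population.
p_direction = sign_test_p(9, 10)

# Hypothetical effect sizes in the discovery and a second population.
z, p_magnitude = effect_difference(0.3, 0.03, 0.1, 0.04)
print(p_direction, z, p_magnitude)
```

When many variants and populations are tested, the resulting p-values would need adjustment for multiple testing, as the text notes.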
Genome-wide association studies and follow-up meta-analyses in Crohn's disease (CD) and ulcerative colitis (UC) have recently identified 163 disease-associated loci that meet genome-wide significance for these two inflammatory bowel diseases (IBD). These discoveries have already had a tremendous impact on our understanding of the genetic architecture of these diseases and have directed functional studies that have revealed some of the biological functions that are important to IBD (e.g. autophagy). Nonetheless, these loci can only explain a small proportion of disease variance (∼14% in CD and 7.5% in UC), suggesting not only that additional loci are to be found but also that the known loci may contain high effect rare risk variants that have gone undetected by GWAS. To test this, we have used a targeted sequencing approach in 200 UC cases and 150 healthy controls (HC), all of French Canadian descent, to study 55 genes in regions associated with UC. We performed follow-up genotyping of 42 rare non-synonymous variants in independent case-control cohorts (totaling 14,435 UC cases and 20,204 HC). Our results confirmed significant association to rare non-synonymous coding variants in both IL23R and CARD9, previously identified from sequencing of CD loci, and identified a novel association in RNF186. With the exception of CARD9 (OR = 0.39), the rare non-synonymous variants identified were of moderate effect (OR = 1.49 for RNF186 and OR = 0.79 for IL23R). RNF186 encodes a protein with a RING domain having predicted E3 ubiquitin-protein ligase activity and two transmembrane domains. Importantly, the disease-coding variant is located in the ubiquitin ligase domain. Finally, our results suggest that rare variants in genes identified by genome-wide association in UC are unlikely to contribute significantly to the overall variance for the disease. Rather, these are expected to help focus functional studies of the corresponding disease loci.
Genetic studies of common diseases have seen tremendous progress in the last half-decade primarily due to recent technologies that enable a systematic examination of genetic markers across the entire genome in large numbers of patients and healthy controls. The studies, while identifying genomic regions that influence a person's risk for developing disease, often do not pinpoint the actual gene or gene variants that account for this risk (called a causal gene/variant). A prime example of this can be seen with the 163 genetic risk factors that have recently been associated with the chronic inflammatory bowel diseases known as Crohn's disease and ulcerative colitis. For less than a handful of these 163 is the causative change in the genetic code known. The current study used an approach to directly look at the genetic code for a subset of these and identified a causative change in the genetic code for eight risk factors for ulcerative colitis. This finding is particularly important because it directs biological studies to understand the mechanisms that lead to this chronic life-long inflammatory disease.
Genome-wide association studies (GWAS) have yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS, where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex linkage disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n>100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants.
This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Nowadays, the availability of cheaper and accurate assays to quantify multiple (endo)phenotypes in large population cohorts allows multi-trait studies. However, these studies are limited by the lack of flexible models integrated with efficient computational tools for genome-wide multi SNPs-traits analyses. To overcome this problem, we propose a novel Bayesian analysis strategy and a new algorithmic implementation which exploits parallel processing architecture for fully multivariate modeling of groups of correlated phenotypes at the genome-wide scale. In addition to increased power of our algorithm over alternative Bayesian and well-established non-Bayesian multi-phenotype methods, we provide an application to a real case study of several blood lipid traits, and show how our method recovered most of the major associations and is better at refining multi-trait polygenic associations than alternative methods. We reveal and replicate in independent cohorts new associations with two phenotypic groups that were not detected by competing multivariate approaches and not noticed by a large meta-GWAS. We also discuss the applicability of the proposed method to large meta-analyses involving hundreds of thousands of individuals and to diverse genomic datasets where complex dependencies in the predictor space are present.
Genetic variants in cis-regulatory elements or trans-acting regulators frequently influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To fully exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and causal mechanisms of cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in three parts: first we identified eQTLs from eleven studies on seven cell types; then we integrated eQTL data with cis-regulatory element (CRE) data from the ENCODE project; finally we built a set of classifiers to predict the cell type specificity of eQTLs. The cell type specificity of eQTLs is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. These associations provide insight into the molecular mechanisms generating the cell type specificity of eQTLs and the mode of regulation of corresponding eQTLs. Using a random forest classifier with cell specific CRE-SNP overlap as features, we demonstrate the feasibility of predicting the cell type specificity of eQTLs. We then demonstrate that CREs from a trait-associated cell type can be used to annotate GWAS associations in the absence of eQTL data for that cell type. 
We anticipate that such integrative, predictive modeling of cell specificity will improve our ability to understand the mechanistic basis of human complex phenotypic variation.
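The feature construction described above (eQTL SNP overlap with cell type specific CRE classes) can be illustrated with a minimal sketch. The cell type labels, CRE class names, coordinates, and function names below are hypothetical and not taken from the study; the sketch only shows how a 0/1 overlap vector could be built for one SNP before being passed to a classifier such as a random forest.

```python
from bisect import bisect_right

def build_index(intervals):
    """Sort and merge half-open [start, end) intervals; return parallel start/end lists."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [s for s, _ in merged], [e for _, e in merged]

def overlaps(pos, index):
    """True if position pos falls inside any merged interval of the index."""
    starts, ends = index
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]

def feature_vector(snp_pos, cre_tracks):
    """One 0/1 feature per (cell type, CRE class) track, in sorted key order."""
    return [int(overlaps(snp_pos, cre_tracks[key])) for key in sorted(cre_tracks)]

# Hypothetical toy tracks on a single chromosome (assumed labels and coordinates).
tracks = {
    ("LCL", "enhancer"): build_index([(100, 200), (150, 300)]),
    ("LCL", "open_chromatin"): build_index([(500, 600)]),
    ("liver", "promoter"): build_index([(250, 400)]),
}
features = feature_vector(275, tracks)
```

Merging intervals up front keeps each lookup to a single binary search, which matters when hundreds of CRE tracks are queried for every SNP.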
When interpreting genome-wide association studies showing that specific genetic variants are associated with disease risk, scientists look for a link between the genetic variant and a biological mechanism behind that disease. One functional mechanism is that the genetic variant may influence gene transcription via a co-localized genomic regulatory element, such as a transcription factor binding site within an open chromatin region. Often this type of regulation occurs in some cell types but not others. In this study, we look across eleven gene expression studies with seven cell types and consider how genetic transcription regulators, or eQTLs, replicate within and between cell types. We identify pervasive allelic heterogeneity, or transcriptional control of a single gene by multiple, independent eQTLs. We integrate extensive data on cell type specific regulatory elements from ENCODE to identify general methods of transcription regulation through enrichment of eQTLs within regulatory elements. We also build a classifier to predict eQTL replication across cell types. The results in this paper present a path to an integrative, predictive approach to improve our ability to understand the mechanistic basis of human phenotypic variation.
Coronary heart disease (CHD) is the leading cause of mortality in both developed and developing countries worldwide. Genome-wide association studies (GWAS) have now identified 46 independent susceptibility loci for CHD; however, the biological and disease-relevant mechanisms for these associations remain elusive. A large-scale meta-analysis of GWAS recently identified a CHD-associated locus in Caucasians at chromosome 6q23.2, a region containing the transcription factor TCF21 gene. TCF21 (Capsulin/Pod1/Epicardin) is a member of the basic-helix-loop-helix (bHLH) transcription factor family, and regulates cell fate decisions and differentiation in the developing coronary vasculature. Herein, we characterize a cis-regulatory mechanism by which the lead polymorphism rs12190287 disrupts an atypical activator protein 1 (AP-1) element, as demonstrated by allele-specific transcriptional regulation, transcription factor binding, and chromatin organization, leading to altered TCF21 expression. Further, this element is shown to mediate signaling through platelet-derived growth factor receptor beta (PDGFR-β) and Wilms tumor 1 (WT1) pathways. A second disease allele identified in East Asians also appears to disrupt an AP-1-like element. Thus, both disease-related growth factor and embryonic signaling pathways may regulate CHD risk through two independent alleles at TCF21.
As much as half of the risk of developing coronary heart disease is genetically predetermined. Genome-wide association studies in human populations have now uncovered multiple sites of common genetic variation associated with heart disease. However, the biological mechanisms responsible for linking the disease associations with changes in gene expression are still underexplored. One of these variants occurs within the vascular developmental factor, TCF21, leading to dysregulated gene expression. Using various in silico and molecular approaches, we identify an intricate allele-specific regulatory mechanism underlying altered expression of TCF21. Notably, we observe that two apparently independent risk alleles identified in distinct populations function through a similar regulatory mechanism. Together these data suggest that conserved upstream pathways may organize the complex genetic etiology of coronary heart disease and potentially lead to new treatment opportunities.
Autism Spectrum Disorders (ASD) are highly heritable and characterised by impairments in social interaction and communication, and restricted and repetitive behaviours. Considering four sets of de novo copy number variants (CNVs) identified in 181 individuals with autism and exploiting mouse functional genomics and known protein-protein interactions, we identified a large and significantly interconnected interaction network. This network contains 187 genes affected by CNVs drawn from 45% of the patients we considered and 22 genes previously implicated in ASD, of which 192 form a single interconnected cluster. On average, those patients with copy number changed genes from this network possess changes in 3 network genes, suggesting that epistasis mediated through the network is extensive. Correspondingly, genes that are highly connected within the network, and thus whose copy number change is predicted by the network to be more phenotypically consequential, are significantly enriched among patients that possess only a single ASD-associated network copy number changed gene (p = 0.002). Strikingly, deleted or disrupted genes from the network are significantly enriched in GO-annotated positive regulators (2.3-fold enrichment, corrected p = 2×10−5), whereas duplicated genes are significantly enriched in GO-annotated negative regulators (2.2-fold enrichment, corrected p = 0.005). The direction of copy change is highly informative in the context of the network, providing the means through which perturbations arising from distinct deletions or duplications can yield a common outcome. These findings reveal an extensive ASD-associated molecular network, whose topology indicates ASD-relevant mutational deleteriousness and that mechanistically details how convergent aetiologies can result extensively from CNVs affecting pathways causally implicated in ASD.
Autism Spectrum Disorders (ASD) are characterised by impairments in social interaction and communication, and restricted and repetitive behaviours. ASD are highly heritable and many different stretches of DNA have been found to be duplicated or deleted in individuals with ASD. We found that an unusually high number of genes affected by these DNA deletions/duplications are associated with the functioning of synaptic transmission between nerve cells. The proteins made by many of these genes are known to interact with each other and, together with proteins from other deleted/duplicated genes, form a large interlinked biological network. This network was affected by almost 50% of the deletions/duplications in the ASD patients considered. Many individual ASD patients had deletions or duplications of multiple genes within this network, but for those patients with just a single gene from the network changed, that single gene appeared to play an important role. Furthermore, the network predicts that the effects arising from the genes in the deletions are similar to the effects arising from the genes in the duplications. Thus, the way that this ASD-associated network is wired together contributes to the understanding of the impact of these DNA deletions and duplications.
An under-appreciated aspect of the genetic analysis of gene expression is the impact of post-probe level normalization on biological inference. Here we contrast nine different methods for normalization of an Illumina bead-array gene expression profiling dataset consisting of peripheral blood samples from 189 individual participants in the Center for Health Discovery and Well Being study in Atlanta, quantifying differences in the inference of global variance components and covariance of gene expression, as well as the detection of variants that affect transcript abundance (eSNPs). The normalization strategies, all relative to raw log2 measures, include simple mean centering, two modes of transcript-level linear adjustment for technical factors, and for differential immune cell counts, variance normalization by interquartile range and by quantile, fitting the first 16 Principal Components, and supervised normalization using the SNM procedure with adjustment for cell counts. Robustness of genetic associations as a consequence of Pearson and Spearman rank correlation is also reported for each method, and it is shown that the normalization strategy has a far greater impact than correlation method. We describe similarities among methods, discuss the impact on biological interpretation, and make recommendations regarding appropriate strategies.
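Two of the strategies named above, simple mean centering and quantile normalization, can be sketched in a few lines. This is a generic textbook version under stated assumptions (each array is a list of log2 values, ties are broken by input order); it is not the study's actual pipeline, and the function names are illustrative.

```python
def mean_center(sample):
    """Subtract the array-wide mean from every log2 measurement."""
    m = sum(sample) / len(sample)
    return [x - m for x in sample]

def quantile_normalize(samples):
    """Force every array to share the same empirical distribution.

    samples: list of equal-length lists, one per array.
    The reference distribution is the mean of the sorted arrays, rank by rank.
    """
    n = len(samples[0])
    sorted_arrays = [sorted(s) for s in samples]
    reference = [sum(a[r] for a in sorted_arrays) / len(samples) for r in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

centered = mean_center([1.0, 2.0, 3.0])
qn = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

Mean centering only shifts each array, whereas quantile normalization also equalizes variance and shape, which is one reason the choice of strategy can change downstream inference.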
microarray analysis; normalization; variance component analysis; eSNP
There is increasing evidence that heritable variation in gene expression underlies genetic variation in susceptibility to disease. Therefore, a comprehensive understanding of the similarity between relatives for transcript variation is warranted—in particular, dissection of phenotypic variation into additive and non-additive genetic factors and shared environmental effects. We conducted a gene expression study in blood samples of 862 individuals from 312 nuclear families containing MZ or DZ twin pairs using both pedigree and genotype information. From a pedigree analysis we show that the vast majority of genetic variation across 17,994 probes is additive, although non-additive genetic variation is identified for 960 transcripts. For 180 of the 960 transcripts with non-additive genetic variation, we identify expression quantitative trait loci (eQTL) with dominance effects in a sample of 339 unrelated individuals and replicate 31% of these associations in an independent sample of 139 unrelated individuals. Over-dominance was detected and replicated for a trans association between rs12313805 and ETV6, located 4MB apart on chromosome 12. Surprisingly, only 17 probes exhibit significant levels of common environmental effects, suggesting that environmental and lifestyle factors common to a family do not affect expression variation for most transcripts, at least those measured in blood. Consistent with the genetic architecture of common diseases, gene expression is predominantly additive, but a minority of transcripts display non-additive effects.
Gene expression levels are known to influence common disease susceptibility in humans, with GWAS significant SNPs frequently found in regulatory regions. The expression levels of most genes are influenced by genetic variants, often located close to the gene itself. Expression Quantitative Trait Loci (eQTL) mapping studies have been very successful in identifying SNPs associated with expression levels; however, little is currently known about the extent of additive and non-additive genetic variance and the role of common environment on gene expression. Here we report a comprehensive study of the sources of genetic and non-genetic variation for gene expression levels using both pedigree and genotype information. We show that the majority of transcripts exhibit only additive genetic variance with congruence from independent methods using pedigree and genotype approaches. However, there are a small number of probes whose expression levels are influenced by non-additive genetic variance. For some of these probes we identify SNPs acting in a dominant and over-dominant manner that replicate in an independent sample. Surprisingly, only 17 probes exhibit significant levels of common environmental effects, suggesting that environmental and lifestyle factors common to a family do not affect expression variation for most transcripts, at least those measured in blood.
Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues, this framework increases power to detect eQTLs that are present in more than one tissue compared with “tissue-by-tissue” analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells, and fibroblasts, we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR = 0.05). Further, the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.
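The idea of modeling each eQTL as active in some tissues and inactive in others can be illustrated with a small Bayesian model averaging sketch. The abstract does not give the priors or how the per-configuration Bayes factors are computed, so the inputs below (Bayes factors against the null, prior weights, and the prior probability of the null) are assumed to be supplied, and all names are hypothetical.

```python
from itertools import product

def configurations(n_tissues):
    """All activity patterns with at least one active tissue (1 = active)."""
    return [c for c in product((0, 1), repeat=n_tissues) if any(c)]

def posterior_over_configs(bayes_factors, prior_weights, prior_null):
    """Combine per-configuration evidence into posterior probabilities.

    bayes_factors: dict config -> Bayes factor versus the no-eQTL null (assumed given).
    prior_weights: dict config -> prior probability of the config given a non-null eQTL.
    Returns (posterior probability of the null, dict config -> posterior probability).
    """
    weighted = {c: (1 - prior_null) * prior_weights[c] * bayes_factors[c]
                for c in bayes_factors}
    norm = prior_null + sum(weighted.values())
    return prior_null / norm, {c: w / norm for c, w in weighted.items()}

def prob_active(posterior, tissue):
    """Posterior probability that the eQTL is active in the given tissue index."""
    return sum(p for c, p in posterior.items() if c[tissue])

# Toy example with three tissues, uniform weights, and uninformative Bayes factors.
configs = configurations(3)
bfs = {c: 1.0 for c in configs}
weights = {c: 1.0 / len(configs) for c in configs}
post_null, post = posterior_over_configs(bfs, weights, prior_null=0.5)
p_tissue0 = prob_active(post, 0)
```

With three tissues there are seven non-null configurations, and summing posterior mass over the configurations in which a tissue is active gives a per-tissue activity probability.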
Genetic variants that are associated with gene expression are known as expression Quantitative Trait Loci, or eQTLs. Many studies have been conducted to identify eQTLs, and they have proven an effective tool for identifying putative regulatory variants and linking them to specific genes. Up to now most studies have been conducted in a single tissue or cell type, but moving forward this is changing, and ongoing studies are collecting data aimed at mapping eQTLs in dozens of tissues. Current statistical methods are not able to fully exploit the richness of these kinds of data, taking account of both the sharing and differences in eQTLs among tissues. In this paper we develop a statistical framework to address this problem, to improve power to detect eQTLs when they are shared among multiple tissues, and to allow for differences among tissues to be estimated. Applying these methods to data from three tissues suggests that sharing of eQTLs among tissues may be substantially more common than it appeared in previous analyses of the same data.
Inflammation, which is directly regulated by interleukin-6 (IL-6) signaling, is implicated in the etiology of several chronic diseases. Although a common, non-synonymous variant in the IL-6 receptor gene (IL6R Asp358Ala; rs2228145 A>C) is associated with the risk of several common diseases, with the 358Ala allele conferring protection from coronary heart disease (CHD), rheumatoid arthritis (RA), atrial fibrillation (AF), abdominal aortic aneurysm (AAA), and increased susceptibility to asthma, the variant's effect on IL-6 signaling is not known. Here we provide evidence for the association of this non-synonymous variant with the risk of type 1 diabetes (T1D) in two independent populations and confirm that rs2228145 is the major determinant of the concentration of circulating soluble IL-6R (sIL-6R) levels (34.6% increase in sIL-6R per copy of the minor allele 358Ala; rs2228145 [C]). To further investigate the molecular mechanism of this variant, we analyzed expression of IL-6R in peripheral blood mononuclear cells (PBMCs) in 128 volunteers from the Cambridge BioResource. We demonstrate that, although 358Ala increases transcription of the soluble IL6R isoform (P = 8.3×10−22) and not the membrane-bound isoform, 358Ala reduces surface expression of IL-6R on CD4+ T cells and monocytes (up to 28% reduction per allele; P≤5.6×10−22). Importantly, reduced expression of membrane-bound IL-6R resulted in impaired IL-6 responsiveness, as measured by decreased phosphorylation of the transcription factors STAT3 and STAT1 following stimulation with IL-6 (P≤5.2×10−7). Our findings elucidate the regulation of IL-6 signaling by IL-6R, which is causally relevant to several complex diseases, identify mechanisms for new approaches to target the IL-6/IL-6R axis, and anticipate differences in treatment response to IL-6 therapies based on this common IL6R variant.
Interleukin-6 (IL-6) is a complex cytokine, which plays a critical role in the regulation of inflammatory responses. Genetic variation in the IL-6 receptor gene is associated with the risk of several human diseases with an inflammatory component, including coronary heart disease, rheumatoid arthritis, and asthma. A common non-synonymous single nucleotide polymorphism in this gene (Asp358Ala) has been suggested to be the causal variant in this region by affecting the circulatory concentrations of soluble IL-6R (sIL-6R). In this study we extend the genetic association of this variant to type 1 diabetes and provide evidence that this variant exerts its functional mechanism by regulating the balance between sIL-6R (generated through cleavage of the surface receptor and by alternative splicing of a soluble IL6R isoform) and membrane-bound IL-6R. These data show for the first time that the minor allele of this non-synonymous variant (Ala358) directly controls the surface levels of IL-6R on individual immune cells and that these differences in protein levels translate into a functional impairment in IL-6R signaling. These findings may have implications for clinical trials targeting inflammatory mechanisms involving IL-6R signaling and may provide tools for identifying patients with specific benefit from therapeutic intervention in the IL-6R signaling pathway.
Many genetic variants that are significantly correlated to gene expression changes across human individuals have been identified, but the ability of these variants to predict expression of unseen individuals has rarely been evaluated. Here, we devise an algorithm that, given training expression and genotype data for a set of individuals, predicts the expression of genes of unseen test individuals given only their genotype in the local genomic vicinity of the predicted gene. Notably, the resulting predictions are remarkably robust in that they agree well between the training and test sets, even when the training and test sets consist of individuals from distinct populations. Thus, although the overall number of genes that can be predicted is relatively small, as expected from our choice to ignore effects such as environmental factors and trans sequence variation, the robust nature of the predictions means that the identity and quantitative degree to which genes can be predicted is known in advance. We also present an extension that incorporates heterogeneous types of genomic annotations to differentially weigh the importance of the various genetic variants, and we show that assigning higher weights to variants with particular annotations such as proximity to genes and high regional G/C content can further improve the predictions. Finally, genes that are successfully predicted have, on average, higher expression and more variability across individuals, providing insight into the characteristics of the types of genes that can be predicted from their cis genetic variation.
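The train-then-predict setup described above can be sketched with a deliberately simple stand-in model: pick the single cis SNP whose least-squares fit has the lowest training error, then predict expression for unseen individuals from that SNP alone. This is not the algorithm devised in the study, whose details are not given here; the SNP identifiers, data, and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    # A monomorphic SNP carries no information, so its slope is set to zero.
    slope = 0.0 if sxx == 0 else sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def train(genotypes, expression):
    """genotypes: dict snp_id -> list of 0/1/2 dosages for the training individuals.

    Returns (snp_id, slope, intercept) for the cis SNP with the lowest training error.
    """
    best = None
    for snp, dosages in genotypes.items():
        slope, intercept = fit_line(dosages, expression)
        sse = sum((y - (intercept + slope * g)) ** 2 for g, y in zip(dosages, expression))
        if best is None or sse < best[0]:
            best = (sse, snp, slope, intercept)
    _, snp, slope, intercept = best
    return snp, slope, intercept

def predict(model, test_genotypes):
    """Predict expression of unseen individuals from their genotype at the chosen SNP."""
    snp, slope, intercept = model
    return [intercept + slope * g for g in test_genotypes[snp]]

# Hypothetical toy data: two cis SNPs, four training and two test individuals.
train_geno = {"snpA": [0, 1, 2, 1], "snpB": [0, 0, 1, 2]}
train_expr = [1.0, 2.0, 3.0, 2.0]
model = train(train_geno, train_expr)
predictions = predict(model, {"snpA": [2, 0], "snpB": [0, 0]})
```

Comparing prediction accuracy on the training individuals with accuracy on held-out individuals is what allows the robustness claim in the text to be checked.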
Variation in gene expression across different individuals has been found to play a role in susceptibility to different diseases. In addition, many genetic variants that are linked to changes in expression have been found to date. However, their joint ability to accurately predict these changes is not well understood and has rarely been evaluated. Here, we devise a method that uses multiple genetic variants to explain the variation in expression of genes across individuals. One important aspect of our method is its robustness, in that our predictions agree well between training and test sets. Thus, although the number of genes that could be explained is relatively small, the identity and quantitative degree to which genes can be predicted is known in advance. We also present an extension to our method that integrates different genomic annotations such as location of the genetic variant or its context to differentially weigh the genetic variants in our model and improve predictions. Finally, genes that are successfully predicted have, on average, higher expression and more variability across individuals, providing insight into the characteristics of the types of genes that can be predicted by our method.
Physical activity and molecular ageing presumably interact to precipitate musculoskeletal decline in humans with age. Herein, we have delineated molecular networks for these two major components of sarcopenic risk using multiple independent clinical cohorts. We generated genome-wide transcript profiles from individuals (n = 44) who then undertook 20 weeks of supervised resistance-exercise training (RET). Expectedly, our subjects exhibited a marked range of hypertrophic responses (3% to +28%), and when applying Ingenuity Pathway Analysis (IPA) up-stream analysis to ∼580 genes that co-varied with gain in lean mass, we identified rapamycin (mTOR) signaling associating with growth (P = 1.4×10−30). Paradoxically, those displaying most hypertrophy exhibited an inhibited mTOR activation signature, including the striking down-regulation of 70 rRNAs. Differential analysis found networks mimicking developmental processes (activated all-trans-retinoic acid (ATRA, Z-score = 4.5; P = 6×10−13) and inhibited aryl-hydrocarbon receptor signaling (AhR, Z-score = −2.3; P = 3×10−7)) with RET. Intriguingly, as ATRA and AhR gene-sets were also a feature of endurance exercise training (EET), they appear to represent “generic” physical activity responsive gene-networks. For age, we found that differential gene-expression methods do not produce consistent molecular differences between young versus old individuals. Instead, utilizing two independent cohorts (n = 45 and n = 52), with a continuum of subject ages (18–78 y), the first reproducible set of age-related transcripts in human muscle was identified. This analysis identified ∼500 genes highly enriched in post-transcriptional processes (P = 1×10−6) and with negligible links to the aforementioned generic exercise regulated gene-sets and some overlap with ribosomal genes. 
The RNA signatures from multiple compounds all targeting serotonin, DNA topoisomerase antagonism, and RXR activation were significantly related to the muscle age-related genes. Finally, a number of specific chromosomal loci, including 1q12 and 13q21, contributed by more than chance to the age-related gene list (P = 0.01–0.005), implying possible epigenetic events. We conclude that human muscle age-related molecular processes appear distinct from the processes regulated by those of physical activity.
A fundamental challenge for modern medicine is to generate new strategies to cope with the rising proportion of older people within society because, if unaddressed, this trend will make many health care systems financially unviable. Ageing impacts both quality of life and longevity through reduced musculoskeletal function. What is unknown in humans is whether the decline with age, referred to as “sarcopenia,” represents a molecular ageing process or whether it is primarily driven by alterations in lifestyle, e.g. reduced physical activity and poor nutrition. Because the details of such interactions will be uniquely human, we aimed to produce the first reproducible global molecular profile of human muscle age, one that could be validated across independent clinical cohorts to ensure its general applicability. We combined this analysis with extensive data on the impact of exercise training on human muscle phenotype to then identify the processes predominantly associated with age and not environment. We were able to identify unique gene pathways associated with human muscle growth and age and were able to conclude that human muscle age-related molecular processes appear distinct from the processes directly regulated by those of physical activity.
Human growth has an estimated heritability of about 80%–90%. Nevertheless, the underlying cause of shortness of stature remains unknown in the majority of individuals. Genome-wide association studies (GWAS) showed that both common single nucleotide polymorphisms and copy number variants (CNVs) contribute to height variation under a polygenic model, although explaining only a small fraction of overall genetic variability in the general population. Under the hypothesis that severe forms of growth retardation might also be caused by major gene effects, we searched for rare CNVs in 200 families, 92 sporadic and 108 familial, with idiopathic short stature compared to 820 control individuals. Although similar in number, patients had overall significantly larger CNVs (p-value<1×10−7). In a gene-based analysis of all non-polymorphic CNVs>50 kb for gene function, tissue expression, and murine knock-out phenotypes, we identified 10 duplications and 10 deletions ranging in size from 109 kb to 14 Mb, of which 7 were de novo (p<0.03) and 13 inherited from the likewise affected parent but absent in controls. Patients with these likely disease causing 20 CNVs were smaller than the remaining group (p<0.01). Eleven (55%) of these CNVs either overlapped with known microaberration syndromes associated with short stature or contained GWAS loci for height. Haploinsufficiency (HI) score and further expression profiling suggested dosage sensitivity of major growth-related genes at these loci. Overall 10% of patients carried a disease-causing CNV indicating that, like in neurodevelopmental disorders, rare CNVs are a frequent cause of severe growth retardation.
With a frequency of 3%, shortness of stature is a common medical concern. Although family studies have clearly shown that gene defects play a pivotal role in the development of short stature, the underlying genetic variants involved remain unknown in about 80% of cases. In contrast to recent studies which aimed at the identification of common genetic variants to explain minor differences in the height variation in the general population, we targeted rare genomic variants where we expected a major gene effect on growth. By examining 200 patients clinically evaluated for short stature, we show that rare structural chromosomal aberrations (CNVs) are associated with shortness of stature in 10% of the cases. The identified CNVs were either de novo or segregated with short stature in the families and include genes that are functionally involved in growth regulation in humans or mice. We furthermore demonstrate an overlap of these CNVs with known microdeletion syndromes. Interestingly, 3 CNVs contain positions of common variants and confirm the localization of major growth-related genes. These findings are particularly important for identification of biological pathways leading to short stature, but also for further therapeutic approaches.
We describe a novel approach to capturing the covariance structure of peripheral blood gene expression that relies on the identification of highly conserved Axes of variation. Starting with a comparison of microarray transcriptome profiles for a new dataset of 189 healthy adult participants in the Emory-Georgia Tech Center for Health Discovery and Well-Being (CHDWB) cohort, with a previously published study of 208 adult Moroccans, we identify nine Axes each with between 99 and 1,028 strongly co-regulated transcripts in common. Each axis is enriched for gene ontology categories related to sub-classes of blood and immune function, including T-cell and B-cell physiology and innate, adaptive, and anti-viral responses. Conservation of the Axes is demonstrated in each of five additional population-based gene expression profiling studies, one of which is robustly associated with Body Mass Index in the CHDWB as well as Finnish and Australian cohorts. Furthermore, ten tightly co-regulated genes can be used to define each Axis as “Blood Informative Transcripts” (BITs), generating scores that define an individual with respect to the represented immune activity and blood physiology. We show that environmental factors, including lifestyle differences in Morocco and infection leading to active or latent tuberculosis, significantly impact specific axes, but that there is also significant heritability for the Axis scores. In the context of personalized medicine, reanalysis of the longitudinal profile of one individual during and after infection with two respiratory viruses demonstrates that specific axes also characterize clinical incidents. This mode of analysis suggests the view that, rather than unique subsets of genes marking each class of disease, differential expression reflects movement along the major normal Axes in response to environmental and genetic stimuli.
Gene expression profiling of human tissues typically reveals a complex structure of co-regulation of gene expression that has yet to be explored with regard to the genetic and environmental sources of covariance or its implications for quantitative and clinical traits. Here we show that peripheral blood samples from multiple studies can be described by nine common axes of variation that collectively explain up to one half of all transcriptional variance in blood. Specific axes diverge according to environmental variables such as lifestyle and infectious disease exposure, but a strong genetic component to axis regulation is also inferred. As few as 10 “blood-informative transcripts” (BITs) can be used to define each axis and potentially classify individuals with respect to multiple aspects of their blood and immune function. The analysis of longitudinal profiles of one individual shows how these change relative to clinical shifts in metabolic profile following viral infection. The notion that gene expression diverges along genetic paths of least resistance defined by these axes has important implications for interpreting differential expression in case-control studies of disease.
The phylogeographic population structure of Mycobacterium tuberculosis suggests local adaptation to sympatric human populations. We hypothesized that HIV infection, which induces immunodeficiency, will alter the sympatric relationship between M. tuberculosis and its human host. To test this hypothesis, we performed a nine-year nation-wide molecular-epidemiological study of HIV–infected and HIV–negative patients with tuberculosis (TB) between 2000 and 2008 in Switzerland. We analyzed 518 TB patients of whom 112 (21.6%) were HIV–infected and 233 (45.0%) were born in Europe. We found that among European-born TB patients, recent transmission was more likely to occur in sympatric compared to allopatric host–pathogen combinations (adjusted odds ratio [OR] 7.5, 95% confidence interval [95% CI] 1.21–infinity, p = 0.03). HIV infection was significantly associated with TB caused by an allopatric (as opposed to sympatric) M. tuberculosis lineage (OR 7.0, 95% CI 2.5–19.1, p<0.0001). This association remained when adjusting for frequent travelling, contact with foreigners, age, sex, and country of birth (adjusted OR 5.6, 95% CI 1.5–20.8, p = 0.01). Moreover, it became stronger with greater immunosuppression as defined by CD4 T-cell depletion and was not the result of increased social mixing in HIV–infected patients. Our observation was replicated in a second independent panel of 440 M. tuberculosis strains collected during a population-based study in the Canton of Bern between 1991 and 2011. In summary, these findings support a model for TB in which the stable relationship between the human host and its locally adapted M. tuberculosis is disrupted by HIV infection.
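For readers unfamiliar with the odds ratios quoted above, the following sketch computes an unadjusted odds ratio and a Wald-type 95% confidence interval from a 2×2 table. The counts are invented for illustration and are not the study's data; the adjusted estimates in the text would come from a regression model, and an interval with an infinite upper bound suggests an exact method rather than this approximation.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval on the log scale.

    2x2 table: a = exposed cases, b = exposed non-cases,
               c = unexposed cases, d = unexposed non-cases.
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts, e.g. allopatric vs. sympatric lineage by HIV status.
estimate, lower, upper = odds_ratio_ci(20, 10, 10, 20)
```

Because the interval is symmetric on the log scale, the product of its bounds equals the squared odds ratio, which is a quick sanity check on any reported Wald interval.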
Human tuberculosis (TB) caused by Mycobacterium tuberculosis kills 1.5 million people each year. M. tuberculosis has been affecting humans for millennia, suggesting that different strain lineages may be adapted to specific human populations. The combination of a particular strain lineage and its corresponding patient population can be classified as sympatric (e.g. Euro-American lineage in Europeans) or allopatric (e.g. East-Asian lineage in Europeans). We hypothesized that infection with the human immunodeficiency virus (HIV), which impairs the human immune system, will interfere with this host–pathogen relationship. We performed a nation-wide molecular-epidemiological study of HIV–infected and HIV–negative TB patients between 2000 and 2008 in Switzerland. We found that HIV infection was associated with the less adapted allopatric lineages among patients born in Europe, and this was not explained by social or other patient factors such as increased social mixing in HIV–infected individuals. Strikingly, the association between HIV infection and less adapted M. tuberculosis lineages was stronger in patients with more pronounced immunodeficiency. Our observation was replicated in a second independent panel of M. tuberculosis strains collected during a population-based study in the Canton of Bern. In summary, our study provides evidence that the sympatric host–pathogen relationship in TB is disrupted by HIV infection.
Myopia, or nearsightedness, is the most common eye disorder, resulting primarily from excess elongation of the eye. The etiology of myopia, although known to be complex, is poorly understood. Here we report the largest ever genome-wide association study (45,771 participants) on myopia in Europeans. We performed a survival analysis on age of myopia onset and identified 22 significant associations, two of which are replications of earlier associations with refractive error. Ten of the 20 novel associations identified replicate in a separate cohort of 8,323 participants who reported whether they had developed myopia before age 10. These 22 associations in total explain 2.9% of the variance in myopia age of onset and point toward a number of different mechanisms behind the development of myopia. One association is in the gene PRSS56, which has previously been linked to abnormally small eyes; one is in a gene that forms part of the extracellular matrix (LAMA2); two are in or near genes involved in the regeneration of 11-cis-retinal (RGR and RDH5); two are near genes known to be involved in the growth and guidance of retinal ganglion cells (ZIC2, SFRP1); and five are in or near genes involved in neuronal signaling or development. These novel findings point toward multiple genetic factors involved in the development of myopia and suggest that complex interactions between extracellular matrix remodeling, neuronal development, and visual signals from the retina may underlie the development of myopia in humans.
The genetic basis of myopia, or nearsightedness, is believed to be complex and affected by multiple genes. Two genetic association studies have each identified a single genetic region associated with myopia in European populations. Here we report the results of the largest ever genetic association study on myopia in over 45,000 people of European ancestry. We identified 22 genetic regions significantly associated with myopia age of onset. Two are replications of the previously identified associations, and 20 are novel. Ten of the novel associations replicate in a small separate cohort. Sixteen of the novel associations are in or near genes implicated in eye development, neuronal development and signaling, the visual cycle of the retina, and general morphology: BMP3, BMP4, DLG2, DLX1, KCNMA1, KCNQ5, LAMA2, LRRC4C, PRSS56, RBFOX1, RDH5, RGR, SFRP1, TJP2, ZBTB38, and ZIC2. These findings point to numerous biological pathways involved in the development of myopia and, in particular, suggest that early eye and neuronal development may lead to the eventual development of myopia in humans.