Biochemical indicators of gene regulatory regions have long been used to interpret heritable phenotypes in humans, starting with Mendelian traits. Early examples are the use of DHSs to understand the impact of large deletions in the complexes of genes encoding β-globins (5) and α-globins (60). DHSs distal to the genes were shown to be enhancers whose deletion led to thalassemias, which are inherited anemias resulting from inadequate production of one or more globin polypeptides.
Epigenetic features also provide insights into the interpretation of complex traits. A gene desert located upstream of the MYC gene contains genetic variants that are associated with breast and prostate cancers (61). The MYC gene is an intriguing candidate for the target of these SNPs, given its role in cell cycle control, but all of the phenotype-associated variants are distal to the gene, and position alone does not make clear how they might act. However, high-resolution mapping of several epigenetic features shows that some of the phenotype-associated variants lie in transcription factor-binding sites in enhancers. The binding affinity is allele-specific, and the different alleles affect chromatin looping to the presumptive target MYC. Thus, the genetic variants do affect regulated expression of a target gene, which could help explain cancer predisposition. Importantly, alignment of the ENCODE data in this region with the significant variants from the GWAS also reveals that key variants are found in the transcription factor-occupied DNA segments mapped by this consortium (42). This is true even though neither prostate nor breast tissue was used in the analysis at that time. Even without complete coverage of all tissues and factors, informative insights are gleaned from examining the GWAS results in the context of epigenetic features.
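The alignment of GWAS variants with transcription factor-occupied segments described above is, at its core, an interval-overlap query. The following minimal Python sketch illustrates the idea. All coordinates, variant names, and function names are hypothetical placeholders chosen for illustration, not real ENCODE or GWAS data, and the linear scan is an assumption made for brevity; genome-scale analyses would typically rely on dedicated interval tools.

```python
from collections import defaultdict

# Hypothetical, made-up coordinates for illustration only
# (not real ENCODE peaks or GWAS variants).
# Segments are half-open intervals: start <= position < end.
occupied_segments = [
    ("chr8", 1000, 1500),
    ("chr8", 4000, 4600),
    ("chr5", 200, 900),
]
gwas_variants = [
    ("snp_a", "chr8", 1200),
    ("snp_b", "chr8", 3000),
    ("snp_c", "chr8", 4599),
    ("snp_d", "chr5", 900),
]


def group_by_chromosome(segments):
    """Collect (start, end) pairs under their chromosome name."""
    by_chrom = defaultdict(list)
    for chrom, start, end in segments:
        by_chrom[chrom].append((start, end))
    return by_chrom


def variants_in_segments(variants, segments):
    """Return names of variants whose position falls inside any segment."""
    by_chrom = group_by_chromosome(segments)
    hits = []
    for name, chrom, pos in variants:
        # Simple linear scan over the segments on the same chromosome.
        if any(start <= pos < end for start, end in by_chrom.get(chrom, [])):
            hits.append(name)
    return hits


overlapping = variants_in_segments(gwas_variants, occupied_segments)
print(overlapping)  # ['snp_a', 'snp_c']
```

In this toy input, snp_d sits exactly at the end coordinate of a segment and is therefore excluded under the half-open convention; whichever convention is used must match that of the annotation files being compared.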
Recently, investigators have employed ENCODE epigenetic data as an initial guide to discover regulatory regions in which genetic variation affects a complex trait. For example, Farrell et al. used ENCODE data to help find likely causative variants in an enhancer in the HBS1L-MYB locus, one of three loci associated with quantitative levels of "fetal" hemoglobin in adult red blood cells. Their fine-mapping showed that the most strongly associated variants are clustered in the intergenic region (Fig. 2A), and a scan of ENCODE data showed that the variants are in DNA segments with epigenetic features expected for enhancers (Fig. 2B). Guided by the initial ENCODE data, the authors focused further analysis on patients and controls and showed that the variants affect a transcriptional enhancer (66). Other recent examples of the use of ENCODE or other epigenetic data as guides for functional studies of trait-associated variants are studies of the TCF7L2 intronic enhancer strongly associated with type 2 diabetes (67), the gene desert at chromosome 9p21 associated with coronary artery disease (68), and a locus associated with susceptibility to colorectal cancer (69).
FIGURE 2. GWAS variants associated with high levels of fetal hemoglobin in adults found in an enhancer marked by epigenetic features.
A, fine mapping of genetic variants between the genes HBS1L and MYB on human chromosome 6 (chr6), with the position of SNPs along ...
Some links between specific epigenetic features and trait-associated variants are found at multiple loci affecting a complex trait, suggesting that common regulatory mechanisms could be operating at those loci. Each phenotype in the GWAS Catalog can be associated with multiple loci, and in several cases, the loci affecting a given trait overlap more frequently than expected with a particular feature, such as occupancy by a particular transcription factor or the appearance of a DHS in a given cell line (46). For example, variants associated with Crohn disease are over-represented both in DNA segments bound by GATA2 in human umbilical vein endothelial cells (HUVECs) and in DNA segments that are sensitive to DNase in T-helper cells. One example is a 1.25-Mb gene desert on chromosome 5 (Fig. 3). High-resolution mapping reveals a cluster of variants in LD that are strongly associated with Crohn disease (Fig. 3A). Within this cluster are variants that affect the level of expression of PTGER4, a gene located ~300 kb away that encodes the EP4 prostaglandin receptor (70). Examination of selected ENCODE tracks within this region shows that the trait-associated variants are in or close to DHSs that are binding sites for a GATA transcription factor (Fig. 3B). The data from the T-helper cells are likely to be more relevant to autoimmunity than those from HUVECs, and one could hypothesize that the genetic variation affects affinity for GATA3, a related protein that regulates gene expression in T-cells (71). This is an example of a readily testable hypothesis grounded in the examination of the GWAS and ENCODE data (Fig. 3C).
FIGURE 3. GWAS variants associated with Crohn disease and other autoimmune diseases found in potential regulatory regions marked by epigenetic features.
A, fine mapping of genetic variants in a 2-Mb interval on human chromosome 5 (chr5). The red vertical lines ...
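The statement that trait-associated loci coincide with a feature "more frequently than expected" can be made concrete with a simple over-representation test. The sketch below uses a one-sided hypergeometric test, which is one common choice and is assumed here for illustration; the cited analysis (46) may have used a different statistical procedure. The counts and function names are hypothetical, and a careful analysis would also account for LD between variants and use a suitably matched background set.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: background variants in total
    K: background variants that overlap the feature
    n: trait-associated variants (the sample drawn without replacement)
    k: trait-associated variants that overlap the feature
    """
    total = comb(N, n)
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


def fold_enrichment(k, K, n, N):
    """Observed overlap fraction divided by the background overlap fraction."""
    return (k * N) / (n * K)


# Hypothetical counts for illustration only (not taken from any study).
N_background = 1000   # all tested variants
K_in_feature = 100    # of those, variants lying in the feature (e.g. a DHS set)
n_trait = 20          # variants associated with the trait
k_overlap = 8         # trait variants that lie in the feature

fold = fold_enrichment(k_overlap, K_in_feature, n_trait, N_background)
p_value = hypergeom_upper_tail(k_overlap, K_in_feature, n_trait, N_background)
print(f"fold enrichment: {fold:.1f}, one-sided p-value: {p_value:.2e}")
```

With these toy counts, 2 of the 20 trait variants would be expected in the feature by chance, so 8 observed overlaps correspond to a 4-fold enrichment; the p-value quantifies how surprising that excess is under random sampling from the background.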
These examples illustrate an important principle. The data from community projects such as ENCODE and the NIH Roadmap Epigenomics Mapping Consortium may not cover the tissues, developmental stages, or transcription factors of greatest relevance to a particular phenotype. However, in many cases, they provide initial insights that help guide more definitive experiments. These may be cases in which a regulatory region is active in multiple cell types or is bound by several different transcription factors.