Complex traits such as susceptibility to diseases are determined in part by variants at multiple genetic loci. Genome-wide association studies can identify these loci, but most phenotype-associated variants lie distal to protein-coding regions and are likely involved in regulating gene expression. Understanding how these genetic variants affect complex traits depends on the ability to predict and test the function of the genomic elements harboring them. Community efforts such as the ENCODE Project provide a wealth of data about epigenetic features associated with gene regulation. These data enable the prediction of testable functions for many phenotype-associated variants.
Genetics informs us about human disease in at least two major ways, one through Mendelian diseases and the other through complex traits. Mutations that lead to Mendelian inheritance of disease usually alter the function of single genes (1), reducing or modifying the function of the protein product by changing the encoded amino acid sequence (2, 3). In addition, some Mendelian diseases are caused by debilitating mutations in promoters or enhancers of a gene, resulting in a deficiency of the protein product and the consequent pathological phenotype (4, 5). Genetic variants causing Mendelian disease are rare in the human population (6) because selective pressure against their deleterious effects keeps their allele frequency low. Because the genetic variants causing monogenic diseases are rare, mapping studies are confined to detailed analyses of affected families and kindreds (6). Such studies have mapped genetic variants at the heart of many monogenic disorders. The Online Mendelian Inheritance in Man® (OMIM®) database (1, 7) currently lists almost 3400 phenotypes for which the molecular basis is known.
Susceptibilities to many common diseases such as coronary artery disease and many forms of cancer and type 2 diabetes have substantial genetic components, but in contrast to the Mendelian diseases, these phenotypes are affected by variants at multiple loci (6, 8, 9). Thus, susceptibility to a common disease is a complex trait. Mapping the multiple loci that contribute to these important traits usually follows a case-control design (Fig. 1A). The mapping experiments examine SNPs to ascertain the genotypes that are significantly more prevalent in the affected group than in the non-affected group; these genotypes are associated with the trait of interest. When genotypes are determined at SNPs throughout the genome of each individual, the study is called a genome-wide association study (GWAS).2
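The case-control comparison described above can be made concrete with a small worked example. The sketch below applies a Pearson chi-square test with one degree of freedom to the allele counts at a single SNP. The counts and function name are hypothetical illustrations. Published GWASs typically use more elaborate models, so this shows only the basic idea, not the method of any particular study.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 d.f.) on a 2x2 table of allele counts.

    Rows are cases and controls; columns are the two alleles at one SNP.
    Assumes every row and column total is non-zero.
    """
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical study: 2000 cases and 2000 controls, i.e. 4000 alleles per group.
chi2, p_value = allelic_chi_square(1200, 2800, 1000, 3000)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2e}")  # chi-square is about 25.08
```

In a genome-wide study, a test of this kind is repeated at every genotyped SNP, which is why the significance threshold applied to each individual test must be very stringent.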
Conducting a GWAS for complex traits has been a formidable challenge because the contribution of any one locus to the phenotype is expected to be small compared with the sizable effects of variants causing monogenic disorders. Furthermore, the mapping experiments need to cover the entire human genome at a sufficiently high resolution for discovery. Of course, the fact that the diseases are common means that large cohorts of individuals can be recruited for case-control studies, with thousands of affected and non-affected persons enrolled in a study, thus providing substantial power.
Recent advances such as the HapMap Project (10) have enabled effective mapping of multiple loci for complex traits in humans. A driving assumption for GWASs is that common diseases are likely caused by common variants (9, 11). Because the phenotypic effect of any one variant is expected to be small, these alleles may reach sufficiently high frequencies to be considered common (at least 5%). The HapMap Project determined combinations of allelic configurations of loci (haplotypes) that are common in several human populations. This allowed the development of high-throughput approaches to ascertain the genotype for individuals at ~1 million SNPs across the genome, giving a good resolution for GWASs (Fig. 1A). High-throughput high-resolution genotyping coupled with the large cohorts available for many complex traits enabled the completion of the first GWAS in 2005 (12), and the number of completed GWASs has increased each subsequent year. The National Human Genome Research Institute maintains a catalog of published GWAS results (13), and as of mid-2011, it contained the results of 1449 GWASs for 237 traits.
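The definition of a common variant mentioned above (an allele frequency of at least 5%) can be illustrated with a short calculation from genotype counts. The following is a minimal sketch. The function names and genotype counts are hypothetical, and the 5% threshold is taken from the text.

```python
COMMON_THRESHOLD = 0.05  # "at least 5%", as stated in the text

def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of the rarer allele at a biallelic SNP, from genotype counts.

    n_aa and n_bb count the two homozygous classes; n_ab counts heterozygotes.
    """
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    freq_b = (n_ab + 2 * n_bb) / total_alleles
    return min(freq_b, 1.0 - freq_b)

def is_common(maf, threshold=COMMON_THRESHOLD):
    """True if the minor allele frequency meets the 'common variant' threshold."""
    return maf >= threshold

# Hypothetical SNP genotyped in 1000 individuals.
maf = minor_allele_frequency(810, 180, 10)
print(maf, is_common(maf))
```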
A critical step in interpreting the results of the GWAS is moving from maps of loci associated with a trait to identifying the genetic variants that actually contribute to the trait (14, 15). In the case of Mendelian disorders, once a locus was strongly implicated in the disease, attention was rightly focused on the protein-coding genes in the region because many of the causative variants impact the structure of the encoded protein. However, it is likely that a substantial fraction of genetic variants contributing to complex traits in humans are involved in gene regulation, just as has been observed in model organisms (8, 16). Most phenotype-associated variants discovered in GWASs are far from protein-coding regions, even appearing in gene deserts (13), which is similar to the genomic distribution of most cis-regulatory modules (CRMs) such as promoters and enhancers. Strikingly, trait-associated variants from GWASs are more likely to be associated with quantitative variation in gene expression than are other variants on the genotyping arrays (17, 18), further supporting the expectation that many variants associated with complex traits affect gene expression.
Thus, what is needed is to match the high-resolution maps of phenotype-associated variants to reliable information about the locations of CRMs (Fig. 1, B and C). Enhancers can be located very far from their target genes (19), and virtually any noncoding sequence in the human genome could potentially be a CRM. Although both interspecies conservation and direct assays for biochemical features of chromatin associated with CRMs can be used to predict the locations of regulatory regions (15, 20, 21), interpretation of the results from GWASs requires an extensive mapping of CRMs in multiple human tissues.
This minireview will cover recent advances in building a more comprehensive catalog of CRMs in humans and the use of that catalog to predict functions that are altered by phenotype-associated SNPs discovered through GWASs.
The expression levels of genes in humans (and eukaryotes in general) are determined by both chromatin structure and the transcription factors that are bound to the CRMs (22–24). The molecular mechanisms regulating gene expression are a complex interplay among enzymes and factors that catalyze hundreds of reactions, including covalent modification of histones; alteration of nucleosomal structure and stability; binding of transcription factors to specific DNA sequences; recruitment of coactivators, repressors, and polymerases; and initiation, pausing, and elongation of transcription. The details of how these reactions lead to appropriate levels of expression at the right time and place are specific for each gene, and a full understanding of regulation requires intensive study of each locus.
However, some features of the biochemical machinery employed in regulated expression are common to most genes and their CRMs (25). The most obvious feature is the presence of transcripts in the steady-state RNA. Measurement of RNA levels using any of a variety of methods provides a good monitor of the expression of a gene (26–30).
CRMs have consistent features as well. Most are in regions of the chromatin that are accessible to macromolecules, reflecting the need for the CRM to interact with proteins such as transcription factors. These accessible regions can be mapped by treating nuclei with a DNase and determining the sites of cleavage. Such DNase-hypersensitive sites (DHSs) are a general feature of almost all active CRMs (31).
Particular histone modifications can distinguish categories of CRMs and expression states of genes (24, 32). Antibodies specific to individual histone modifications are used to immunoprecipitate chromatin (histones and DNA) bearing that modification. DNA isolated by ChIP is then assayed for the presence of segments of interest (33). Some features have substantial diagnostic importance. The chromatin around active promoters has high levels of trimethylation at lysine 4 of histone H3 (H3K4me3), whereas the chromatin around enhancers has high levels of monomethylation at the same position (H3K4me1) (25, 34, 35). Acetylation of histone H3 at lysine 27 is associated with active promoters and enhancers (32). The chromatin of transcribed regions is marked by di- and trimethylation at lysine 79 of H3 (H3K79me2/me3) in the initial portion of the transcription unit, followed by methylation at lysine 36 (H3K36me3) in the distal portion (36). Other histone H3 methylations mark distinct portions of the repressed chromatin, with trimethylation at lysine 27 (H3K27me3) or lysine 9 (H3K9me3) covering different sets of repressed genes (37, 38).
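The associations between histone modifications and chromatin categories listed above can be summarized as a simple lookup. The sketch below is a deliberately naive, rule-based illustration. The label wording and function name are our own. Actual genome segmentations rely on statistical models that integrate many marks quantitatively, not on one-to-one rules of this kind.

```python
# Mark-to-category associations as stated in the text; label wording is illustrative.
MARK_TO_ANNOTATION = {
    "H3K4me3": "active promoter",
    "H3K4me1": "enhancer",
    "H3K27ac": "active promoter or enhancer",
    "H3K79me2": "transcribed region (initial portion)",
    "H3K79me3": "transcribed region (initial portion)",
    "H3K36me3": "transcribed region (distal portion)",
    "H3K27me3": "repressed chromatin",
    "H3K9me3": "repressed chromatin",
}

def annotate_segment(marks):
    """Return the sorted, de-duplicated annotations suggested by the enriched marks.

    Marks that are not in the table are ignored.
    """
    return sorted({MARK_TO_ANNOTATION[m] for m in marks if m in MARK_TO_ANNOTATION})

# A hypothetical DNA segment enriched for H3K4me1 and H3K27ac.
print(annotate_segment({"H3K4me1", "H3K27ac"}))
```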
CRMs such as enhancers and promoters are clusters of binding sites for transcription factors (39), and thus, occupancy by transcription factors is a good indicator of potential regulatory regions. Using a ChIP approach but with antibodies against individual transcription factors, one can obtain reliable maps of transcription factor occupancy (40–42). However, because a distinct battery of factors is bound at each CRM and because many transcription factors are present in a limited number of cell types, maps of binding by many transcription factors are needed to find a broad range of CRMs. Conversely, once a region has been identified as a CRM, the set of bound proteins can be used to better understand regulation and the impact of genetic variation on that regulation (Fig. 1C).
Transcripts, DHSs, histone modifications, and transcription factor occupancy can all be considered epigenetic features (43). They are biochemical attributes that lie on top of (epi, “on” or “above”) the genetic material (DNA), and they reflect or influence the expression of genes. The epigenetic features are dynamic: RNA is made and degraded, histone modifications are added and removed, and transcription factors bind to and dissociate from DNA. However, the steady-state levels of these epigenetic features are characteristic of the chromatin containing a given segment of DNA in a given cell type, and that steady-state level can be inherited at least in somatic cells. This is thought to be a cellular memory for expression status (44), and the epigenetic features can be used as a monitor of gene activity and CRM location.
Detailed studies of the molecules and biochemical events that regulate expression of individual genes in chromatin led to the discovery of the connections between epigenetic features and regulation. Recent advances in genomic technology allow these features to be determined quantitatively throughout genomes. DNA that is highly enriched for the feature of interest can be mapped comprehensively, most commonly using second-generation sequencing methods (42, 45, 46). Transcriptomes can be determined by sequencing RNA after fragmentation and conversion to complementary DNA; this is called RNA-seq (30). DNA in chromatin with a certain modification or bound by a particular transcription factor can be determined by sequencing the DNA enriched by ChIP; this is called ChIP-seq (47, 48). DNA in exposed regions of chromatin, i.e. DHSs, can be identified by enriching for DNA cut by nucleases in chromatin and sequencing from the cleaved ends; this is called DNase-seq (49, 50).
A few community projects are assaying a broad collection of epigenetic features across a wide spectrum of cell types in humans and model organisms. In these consortia, complementary work in multiple laboratories is coordinated to cover a substantial portion of the matrix of features and cell types. Consistent data standards are established, and the data are released as soon as they are replicated. One of the major community projects is the ENCODE Project, which aims to establish an ENCyclopedia Of DNA Elements (51). The various branches of this consortium are determining transcriptomes, mapping histone modifications and transcription factor occupancy, and identifying accessible chromatin, in addition to manually curating the annotation of genes (42, 46). In the production phase culminating this year, all assays are being run on a set of human cell lines that represent some important human tissues, and some assays such as DNase-seq are being conducted on a wide range of cell types, including primary cells. Almost 3000 data sets have been released to date. Parallel work is being done in Caenorhabditis elegans (52) and Drosophila melanogaster (53) as the modENCODE Project.
Another community project is the NIH Roadmap Epigenomics Mapping Consortium (54), which is part of the International Human Epigenome Consortium. The NIH Roadmap Epigenomics Mapping Consortium is mapping histone modifications by ChIP-seq and accessible chromatin by DNase-seq in many human tissues and cell types, with an emphasis on primary cells from healthy individuals. Over 250 data sets have been released to date.
Several studies have shown that epigenetic data such as those being generated in these community projects can be highly effective at predicting CRMs. DNA segments in chromatin with the H3K4me1 modification are validated as enhancers in cell transfection assays at a high rate (55), and DNA segments bound by the coactivator p300 are frequently validated as enhancers in transgenic mice (56). Hence, it is reasonable to expect the epigenetic data from these consortia to be good predictors of gene regulatory function (46).
The extensive epigenetic data, although not comprehensive, are already proving to be useful for finding potential regulatory regions that could be affected by genetic variants (Fig. 1B). Several recent studies have shown that SNPs associated with complex traits are enriched in regions implicated in gene regulation based on epigenetic features. One study used statistical modeling to integrate information about several histone modifications in multiple cell lines, generating a segmentation, or partitioning, of the human genome into classes associated with different functions (37). The segmentation classes with properties of enhancers were significantly enriched for phenotype-associated SNPs. The integrative analysis of ENCODE data (46) showed that the phenotype-associated SNPs in the GWAS Catalog are enriched in DNase-sensitive regions and in DNA segments bound by transcription factors. These studies initially examined the lead SNPs from the GWAS, i.e. the SNPs on the genotyping arrays that are most highly associated with the trait of interest. Although these need not be the functional SNPs, a notable fraction of them (34%) are in DHSs (46). Of course, in many cases, the functional SNP is not the lead SNP, but rather another variant in linkage disequilibrium (LD) with the lead SNP is the functional one (Fig. 1B). When all SNPs in LD with the lead SNPs are included as phenotype-associated, then for a large fraction of the phenotype associations, at least one SNP is found in a DNA segment associated with regulatory function via the epigenetic data. For example, the lead or linked SNP is found in a DHS for 70–80% of the phenotype associations reported in the GWAS Catalog (46, 57). This strong correspondence between phenotype-associated variants and function-associated DNA indicates that current epigenetic data are already useful for interpretation of GWAS SNPs.
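The difference between scoring only the lead SNPs and scoring the lead SNPs together with their LD partners can be illustrated with a toy overlap calculation. The sketch below assumes half-open genomic intervals (start included, end excluded) and uses invented coordinates and hypothetical function names. It is not the procedure used in the cited studies, which is not described here.

```python
from bisect import bisect_right

def build_index(intervals):
    """Group (chrom, start, end) intervals by chromosome, sorted and merged."""
    by_chrom = {}
    for chrom, start, end in sorted(intervals):
        merged = by_chrom.setdefault(chrom, [])
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return {c: ([s for s, _ in m], [e for _, e in m]) for c, m in by_chrom.items()}

def in_interval(index, chrom, pos):
    """True if pos lies inside any indexed interval on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]

def fraction_in_dhs(associations, index, include_linked):
    """Fraction of associations with a lead (or, optionally, linked) SNP in a DHS."""
    hits = 0
    for assoc in associations:
        snps = [assoc["lead"]] + (assoc["linked"] if include_linked else [])
        if any(in_interval(index, chrom, pos) for chrom, pos in snps):
            hits += 1
    return hits / len(associations)

# Toy data: DHS intervals and four associations with their LD partners.
dhs = build_index([("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 500, 600)])
associations = [
    {"lead": ("chr1", 120), "linked": [("chr1", 400)]},  # lead SNP in a DHS
    {"lead": ("chr1", 350), "linked": [("chr1", 250)]},  # only the linked SNP
    {"lead": ("chr2", 700), "linked": [("chr2", 599)]},  # only the linked SNP
    {"lead": ("chr3", 10), "linked": []},                # no overlap
]
print(fraction_in_dhs(associations, dhs, include_linked=False))  # 0.25
print(fraction_in_dhs(associations, dhs, include_linked=True))   # 0.75
```

As in the real data, including the linked SNPs raises the fraction of associations that can be connected to a candidate regulatory region, although the toy numbers carry no biological meaning.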
Biochemical indicators of gene regulatory regions have long been used to interpret heritable phenotypes in humans, starting with Mendelian traits. Early examples are the use of DHSs to understand the impact of large deletions in the complex of genes encoding β-globins (5, 58, 59) and α-globins (60). DHSs distal to the genes were shown to be enhancers that when deleted led to thalassemias, which are inherited anemias resulting from inadequate production of one or more globin polypeptides.
Epigenetic features also provide insights into the interpretation of complex traits. A gene desert located upstream of the MYC gene contains genetic variants that are associated with breast and prostate cancers (61, 62). The MYC gene is an intriguing candidate for the target of the SNPs, given its role in cell cycle control, but all of the phenotype-associated variants are distal to the MYC gene, and from position alone, it is not clear how they may work. However, high-resolution mapping of several epigenetic features shows that some of the phenotype-associated variants are in transcription factor-binding sites in enhancers. The binding affinity is allele-specific, and the different alleles affect chromatin looping to the presumptive target MYC (63–65). Thus, the genetic variants do affect regulated expression of a target gene that could help explain cancer predisposition. Importantly, alignment of the ENCODE data in this region with the significant variants from the GWAS also reveals that key variants are found in the transcription factor-occupied DNA segments mapped by this consortium (42). This is true even though neither prostate nor breast tissue was used in the analysis at that time. Even without complete coverage of all tissues and factors, informative insights are gleaned from examining the GWAS results in the context of epigenetic features.
Recently, investigators employed ENCODE epigenetic data as an initial guide to discover regulatory regions in which genetic variation is affecting a complex trait. For example, Farrell et al. (66) used ENCODE data to help find likely causative variants in an enhancer in the HBS1L-MYB locus, one of three loci associated with quantitative levels of “fetal” hemoglobin in adult red blood cells. Their fine-mapping showed that the most strongly associated variants are clustered in the intergenic region (Fig. 2A), and a scan of ENCODE data showed that the variants are in DNA segments with epigenetic features expected for enhancers (Fig. 2, B and C). Guided by the initial ENCODE data, the authors focused further analysis in patients and controls and showed that the variants affect a transcriptional enhancer (66). Other recent examples of the use of ENCODE or other epigenetic data as guides for functional studies of trait-associated variants are studies of the TCF7L2 intronic enhancer strongly associated with type 2 diabetes (67), the gene desert at chromosome 9p21 associated with coronary artery disease (68), and a locus associated with susceptibility to colorectal cancer (69).
Some links between specific epigenetic features and trait-associated variants are found at multiple loci affecting a complex trait, suggesting that common regulatory mechanisms could be operating at multiple loci. Each phenotype in the GWAS Catalog can be associated with multiple loci, and in several cases, the loci affecting a given trait are associated more frequently than expected with a particular feature such as occupancy by a particular transcription factor or appearance of a DHS in a given cell line (46, 57). For example, variants associated with Crohn disease are over-represented in DNA segments bound by GATA2 in human umbilical vein endothelial cells (HUVECs) and in DNA segments that are sensitive to DNase in T-helper cells. One such locus is a 1.25-Mb gene desert on chromosome 5 (Fig. 3). High-resolution mapping reveals a cluster of variants in LD that are strongly associated with Crohn disease (Fig. 3A) (70). Within this cluster are variants that affect the level of expression of PTGER4, a gene located ~300 kb away that encodes the EP4 prostaglandin receptor (70). Examination of selected ENCODE tracks within this region shows that the trait-associated variants are in or close to DHSs that are binding sites for a GATA transcription factor (Fig. 3, B and C). The data from the T-helper cells are likely to be more relevant to autoimmunity than those from HUVECs, and one could hypothesize that the genetic variation could be affecting affinity for GATA3, a related protein that regulates gene expression in T-cells (71). This is an example of a readily testable hypothesis grounded in the examination of the GWAS and ENCODE data (Fig. 1C).
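The phrase "more frequently than expected" can be given a quantitative meaning with an enrichment test. The sketch below uses a one-sided hypergeometric test, which is a generic choice on our part. The statistical procedures of the cited studies are not described here, and a rigorous analysis would also need to account for LD and for the composition of the genotyping arrays. All counts and function names are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are sampled without replacement from a
    population containing `successes` items that carry the feature."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total

def fold_enrichment(k, population, successes, draws):
    """Observed fraction of trait SNPs in the feature over the background fraction."""
    return (k / draws) / (successes / population)

# Hypothetical numbers: 1000 array SNPs, 100 of which lie in segments bound by a
# given factor; 20 SNPs are associated with the trait, and 8 of those 20 lie in
# bound segments.
p_value = hypergeom_upper_tail(8, 1000, 100, 20)
fold = fold_enrichment(8, 1000, 100, 20)
print(f"fold enrichment = {fold:.1f}, p = {p_value:.2e}")
```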
These examples illustrate an important principle. The data from community projects such as ENCODE and NIH Roadmap Epigenomics Mapping Consortium may not cover the tissues, developmental stages, or transcription factors of greatest relevance to a particular phenotype. However, in many cases, they provide initial insights that help guide more definitive experiments. These may be cases in which a regulatory region is active in multiple cell types or is bound by several different transcription factors.
Application of genomic technologies continues to stimulate discovery in biochemistry and molecular biology as the networks of regulatory interactions begin to be understood not only in model organisms (72) but also in humans. This information is helping to translate molecular insights into settings of clinical relevance. The results from GWASs reveal genomic locations in which genetic variation impacts susceptibility to the most common diseases of humans. It is now clear that many of these loci are involved in gene regulation, and the deep knowledge of the biochemistry of gene regulation can be coupled with high-throughput genomic assays to help identify candidate regulatory regions in the loci identified by GWASs. No longer will finding a key genetic variant in a gene desert mean the end of a search for a molecular connection between genotype and phenotype.
This minireview has emphasized the use of epigenetic features mapped by ENCODE and other projects for interpreting GWAS results. This approach has considerable power now, and plans are in place to increase substantially the coverage of cell types, transcription factors, and other epigenetic features (described at www.genome.gov/10005107). The large scale of these community projects and their commitment to rapid data release will ensure that data sets of closer relevance to a wider range of phenotypes will become available. Assays with higher resolution for mapping regulatory elements will also be used more widely (73–75).
Of course, the genome-wide mapping of epigenetic features originated in individual laboratories, and these laboratories will continue to provide important data sets and insights into regulation. Indeed, it is likely that individual or small groups of laboratories will explore epigenetic features in the cell types most closely related to the phenotypes of interest. The capacity of second-generation sequencing machines is increasing, which means that many more laboratories will be generating and analyzing genome-wide epigenetic data. This diversity of investigator-initiated projects should complement the community projects and likely fill gaps in coverage that are most relevant to the phenotypes of interest.
Several challenges must be met to effectively harvest the insights from the plethora of genome-scale genetic and epigenetic results. One of the most exciting challenges is integration of the data. Initial efforts using statistical modeling (76, 77) are being applied to some of the data from the community projects. Opportunities abound for novel approaches that have the capacity for even larger numbers of data sets and that provide more accurate predictions. These opportunities should engage not only biochemists and geneticists but also statisticians and bioinformaticians. Strong collaborations among teams of these investigators should lead to insights into connections between multiple genotypes and complex phenotypes of even greater importance to medicine.
*This work was supported, in whole or in part, by National Institutes of Health Grants R01 DK065806, RC HG005573, and U01 HG004695. This is the sixth article in the Thematic Minireview Series on Results from the ENCODE Project: Integrative Global Analyses of Regulatory Regions in the Human Genome.
2The abbreviations used are: