|Home | About | Journals | Submit | Contact Us | Français|
To understand the molecular mechanisms that underlie global transcriptional regulation, it is essential to first identify all the transcriptional regulatory elements in the human genome. The advent of next-generation sequencing has provided a powerful platform for genome-wide analysis of different species and specific cell types; when combined with traditional techniques to identify regions of open chromatin [DNaseI hypersensitivity (DHS)] or specific binding locations of transcription factors [chromatin immunoprecipitation (ChIP)], and expression data from microarrays, we become uniquely poised to uncover the mysteries of the genome and its regulation. To this end, we have performed global meta-analysis of the relationship among data from DNaseI-seq, ChIP-seq and expression arrays, and found that specific correlations exist among regulatory elements and gene expression across different cell types. These correlations revealed four distinct modes of chromatin domain structure reflecting different functions: repressive, active, primed and bivalent. Furthermore, CCCTC-binding factor (CTCF) binding sites were identified based on these integrative data. Our findings uncovered a complex regulatory process involving by DNaseI HS sites and histone modifications, and suggest that these dynamic elements may be responsible for maintaining chromatin structure and integrity of the human genome. Our integrative approach provides an example by which data from diverse technology platforms may be integrated to provide more meaningful insights into global transcriptional regulation.
Consistent proper function of all biologic processes relies on the precise spatial and temporal expression of genes (1–3). Development, differentiation, proliferation, apoptosis, and even aging, are the culmination of cell type-specific and ubiquitous gene expression. Since transcription was first described researchers have sought to define the molecular mechanisms that regulate this phenomenon, driven by the belief that understanding the gene expression profiles of normal and disease states will facilitate discoveries of therapeutic targets to alleviate human and animal suffering. These works have defined several types of cis-acting transcriptional regulators, including promoters, enhancers, insulators and locus control regions (LCR) (1,4,5) and the trans-acting factors that bind to them. Nonetheless, the relative roles of these regulatory DNA elements have yet to be fully elucidated. The introduction of high-throughput sequencing (6) and its massive amounts of data spanning entire genomes of species has provided a platform from which we may begin to examine global patterns of gene expression and compare these patterns among different cell types to gain a clearer understanding of the molecular mechanisms underlying the dynamic and complex processes of life.
Next-generation sequencing (NGS) has become a popular approach to identifying gene regulatory elements and to performing accurate functional analysis (6). NGS of DNA–protein complexes isolated by chromatin immunoprecipitation, a procedure known as (ChIP-seq), has allowed for global localization of regulatory elements associated with a specific protein of interest (7–11). Unfortunately, this combined technique is only applicable to known (previously characterized) trans-acting factors and is limited by its requirement for a high quality ChIP-grade antibody to isolate the transcription factor (TF) to be analyzed (9). By coupling the NGS method with DNaseI hypersensitive (DNaseI HS) site mapping (long considered the gold-standard for comprehensively identifying the location of various classes of transcriptional regulatory elements), a particularly powerful high-resolution procedure, DNase-seq, was developed (12–18). Like the ChIP-seq procedure, though, DNase-seq suffers from some inherent limitations. DNase-seq provides only location data and is unable to directly characterize function or identify the particular TF(s) associated with the region.
The data obtained from each of these combined-NGS procedures may be analyzed in parallel (along with data obtained from gene expression arrays) to facilitate the identification of bona fide transcriptional regulatory elements. First, though, we must obtain a thorough understanding of the different types of cis-regulatory sequence elements and epigenetic modulatory mechanisms in order to accurately investigate their contributions to spatial and temporal gene expression.
The first genome-wide maps of histone methylation (7) and acetylation marks (19) were generated from human resting CD4+ T cells. Histone modifications associated with gene transcription were designated as active, while those associated with repressed transcription were designated as repressive. Intriguingly, some of the ‘active’ were identified in transcriptionally silent genes (7,20–23), suggesting that these modifications may act more as markers of genes primed for transcriptional activity. Not surprisingly, then, histone modification is not the sole mediator of expression level (24). By performing DNase-seq and DNase fragment, hybridization to microarray chip (DNase-chip), Boyle et al. (25) created a comprehensive genome-wide map of the open chromatin regions in CD4+ T cells. Their analysis of the resultant data sets did not identify a clear correlation between DNaseI HS and levels of gene expression. Shortly thereafter, Xi et al. (26) used DNase-chip to comparatively analyze six human cell types in order to identify functional cell type specific and ubiquitous DNaseI HS sites (DHSs). Their examination of 1% of the human genome revealed that cell type-specific DNaseI HS sites co-localized with cell type-specific gene expression. Recently, Stitzel et al. (27) conducted genome-wide analysis of DNaseI hypersensitive sites in human islets. Ling et al. (28) produced a set of detailed, high-quality, genome-wide DNaseI hypersensitivity maps in the mouse liver in vivo. These studies highlight the utility of DNase-seq for systematically uncovering cis-regulatory elements on a genome-wide scale.
In the study presented herein, we performed a genome-wide meta-analysis of DNaseI HS sites identified in 29 different cell types. We sought to determine the relationship between DNaseI HS, histone modifications and gene expression. We found that specific correlations exist between DNaseI HS, gene expression and the amounts of active and repressive histone modifications across different cell types. These correlations displayed four distinct modes (repressive, active, bivalent and primed), reflecting different functions of the chromatin domains. Furthermore, CCCTC binding factor (CTCF) binding sites were newly identified based on these integrative data. Our findings revealed a situation of complex regulation of gene expression mediated by DNaseI hypersensitive chromatin regions and their histone modifications.
All data used in this study were freely available for download from the University of California, Santa Cruz (UCSC) NCBI36/hg18 Genome Browser (http://genome.ucsc.edu/encode/) (29,30). The Open Chromatin (25) track was used to obtain DNase-seq data from 29 cell lines and ChIP-seq data of CTCF, c-Myc and RNA Pol II. The Broad Histone (22) track provided the ChIP-seq data of histone modifications from nine cell lines. Regions of enriched signal in either DNaseI HS or ChIP experiments were identified using F-Seq (31). The gene annotations presented herein were taken from the Gencode data (32) in the Gencode Genes track (Version 3c, October 2009). The detailed information of these data was presented in Supplementary Table S1. Finally, the 17 751 UCSC known human genes with expression data were obtained from the Gene Expression Omnibus database (series GSE15805 and GSE17793; http://www.ncbi.nlm.nih.gov/projects/geo/). All chromosome Y data were omitted from this study.
Considering the lineage specificity observed with DNaseI HS sites (Supplementary Figure S1) (26), we classified DNaseI HS sites according to their occurrence rates in 29 cell lines. A given DNaseI HS site is classified as cell type specific, if it does not overlap (Here, overlap between two binding sites represents that these two regions have at least one common base pair.) with any DNaseI HS site within other 28 cell lines. A given DNaseI HS site is classified ubiquitous, if it overlaps with any DNaseI HS site within all 28 cell lines. The remaining DNaseI HS sites are classified as common, which are present in two to 28 cell lines. The results showed that about one-half of DNaseI HS sites were found in two cell lines (Supplementary Table S2). The cell lines that had the highest DNaseI HS overlap (74.96%) were the two lymphocyte lines, GM12891 and GM12892; the lowest overlap (25.99%) was found between AoSMC (aortic smooth muscle) and Medullo (Medulloblastoma). This agreed with the clustering analysis that we performed on DNaseI HS sites (Supplementary Figure S1).
The profiles of tag density at DNaseI HS sites or at transcription start sites (TSSs) were generated as described by Wang et al. (19). Briefly, the DNaseI HS region was examined in 200bp windows that spanned the 5kb immediately upstream of the DNaseI HS start site (dxStart) and 5kb downstream from the end of the DNaseI HS site (dxEnd); each window was evaluated for content of uniquely mapped ChIP-seq data of histone modifications and TFs. The DNaseI HS site itself was divided equally among 10 windows for detailed analysis. All window tag counts were normalized to the total number of bases present in the window and to the total read number of the given library.
To plot the profiles of those DNaseI HSs associated with TSSs, the 17 751 UCSC known genes with expression information were categorized among broad groups according to their reported expression levels: high, median or mainly silent. One thousand genes were selected per group and corresponding DNase-seq data was analyzed after each was aligned by their TSS.
The entire genome was scanned by 2 Mb windows. Within each window, the number of genes, DNaseI HS sites, CTCF binding sites, Pol2 and c-Myc binding sites were quantitated. Linear regression was used to determine the correlation between gene density and binding site density.
To quantitatively measure histone modification enrichment or depletion at DNaseI HS sites, we evaluated the histone signal based on the profiles of histone modification tag density by using the formula: , where and wee tag density at dxStart and dxEnd, respectively. Enrichment or depletion of histone modifications at DNaseI HS sites were defined as , where ζ>1.0 indicated enrichment and 0≤ζ<1.0 indicated depletion.
Cell type specific, common and ubiquitous DNaseI HS sites were divided among 100 groups according to DNaseI hypersensitivity. The average tag density of DNaseI HS and of histone modifications was calculated for each group.
The 17751 UCSC known genes with expression information were divided into 100 groups, based on expression. The average tag densities of each modification and the DNaseI HS within the gene body were calculated for each group, respectively.
To carry out motif identification, we examined data of DNaseI HS sites that encompassed defined ChIP-enriched regions in each cell line. CUDA-MEME (Version 3.0) (33), which was programmed using hybrid CUDA (Compute Unified Device Architecture) based on MEME (Version 4.4.0) (34) was used to discover consensus motifs with default parameters.
To determine the number of peaks that could be explained by statistically significant motifs, the MEME tool MAST (35) was used to estimate the maximal difference between the total number of peaks containing a motif and the number that could be explained by chance within a range of stringencies (E-values). See Methods in Supplementary Data for generation of MAST curves.
We classified DNaseI HS sites according to their occurrence rates: cell type specific, found in one out of 29 cell lines; common, found in 2 to 28 cell lines or ubiquitous, found in all 29 cell lines (Supplementary Table S3).
In the erythroleukemia cell type K562, 9% of the DNaseI HS sites were found to be cell type specific, while the remaining 77% were common and 14% were ubiquitous (Figure 1A). In the strongest DNaseI HS sites (top 20%), 58% were found to be ubiquitous and only 1% of them to be cell type-specific (Figure 1A). In contrast, in the weakest DNaseI HS sites (bottom 20%), we found that 19% were cell type-specific and almost none were ubiquitous (Figure 1A). Similar observations were obtained from other cell lines (Supplementary Figure S2). Taken together, these findings indicated that ubiquitous DNaseI HS sites are extremely hypersensitive to DNaseI digestion, while cell type-specific DNaseI HS sites are less susceptible to digestion, although still significantly more susceptible than the genome average. This is expected since ubiquitous DNaseI HS sites are accessible in all 29 cell types analyzed, and cell type-specific DNaseI HS sites are only accessible in one cell type.
To determine whether the majority of DNaseI HS sites that exist in the human genome were represented in the data sets under examination, we computed the cumulative percentage of the genome covered by DNaseI HS sites and the cumulative numbers of DNaseI HS sites with respect to the number of cell lines tested [Methods in Supplementary Data; described in (26)]. Therefore, as additional cell lines were included in the analysis, the total number of DNaseI HS sites being investigated increased; ultimately, we examined approximately 900000 DNaseI HS sites from 29 cell lines (Figure 1B). Accordingly, the total percentage of base pairs (bps) in the human genome covered by DNaseI HS sites was calculated at 8.28%. A total of 18 943 ubiquitous and 10100 cell type-specific DNaseI HS sites were identified from the 29 cell types, which covered 0.67 and 0.07% of bp in the human genome, respectively. However, even after addition of the 29th cell line we were unable to reach a significant saturation level of total DNaseI HS sites, which would have been represented by equal levels of cell type specific, common and ubiquitous DNaseI HS sites. This finding was consistent with previously published results using only six cell types and examining 1% of the human genome (26), and supported the suggestion by those authors that substantially more cell lines/types must be included in future analysis to identify the majority of the DNaseI HS sites that exist in the entire human genome.
To obtain an overall picture of the DNaseI HS site distribution pattern relative to genes, we divided the human genome into four regions: proximal promoter, 1kb upstream and downstream of the TSS; exon; intron and intergenic regions. Groups were assigned based on the ‘GENCODE’ annotation published in the UCSC Genome Browser.
In the K562 cell line, only 14 and 9% of all DNaseI HS sites mapped to proximal promoters and exons, respectively, while 24% mapped to introns (Figure 1C). The majority (53%) mapped within intergenic regions. However, we discovered that over 20% of DNaseI HS sites located within intergenic region, on average, are associated with strong Pol II signals; this correlation indicated these DNaseI HS sites may in fact represent a promoter, exon or intron of an unannotated gene (Supplementary Data and Supplementary Figure S3). The genome-wide location results were well agreed with previously published results in human islets (27) and CD4+ T cells (25). Of note, the DNaseI HS sites found to be located within promoters and exons were over 2-fold larger in size and twice as hypersensitive than those located in the introns or intragenic regions (Supplementary Table S4). When we considered only the cell type-specific DNaseI HS sites, we found that ~9% were located in proximal promoters or exons. In stark contrast, nearly one-half of the ubiquitous DNaseI HS sites were located in proximal promoters or exons (Supplementary Table S5). Similar distribution patterns were observed in other cell lines (Supplementary Tables S4 and S5). These findings, when considered along with the results from our analysis of overlapping CpG islands and sequence conservation within those sites (Supplementary Figures S4 and S5; Methods in Supplementary Data), indicate that ubiquitous DNaseI HS sites are generally associated with promoters of so-called ‘housekeeping’ genes.
To examine the correlation of DNaseI HS sites with gene density in the surrounding regions and presence of TFBSs, we segmented each chromosome into 2Mb windows and counted the numbers of DNaseI HS sites and genes and binding sites within each. In general, the DNaseI HS sites were found to strongly correlate with genes in all of the cell lines examined (R2=0.9122 in K562; Figure 1D). Additionally, the DNaseI HS sites were highly correlated with Pol II in each cell line (R2=0.9553 in K562; Figure 1D). These results were consistent with observations by others that the open chromatin state is affiliated with gene dense regions in the genome (36). Interestingly, we found that DNaseI HS sites correlated with TFBSs, as evidenced by analysis of the CTCF and c-Myc binding sites (R2=0.9052 and R2=0.9383, respectively; Figure 1D). This finding confirmed the Ling et al. (28) study for a larger set of TFs that binding sites show up to 90% overlap with DNaseI HS sites. The property of DNaseI HS site distribution is consistent with its role at identification of TFs and suggests a widespread function of DNaseI HS sites in the genome.
To characterize the histone modification patterns at DNaseI HS sites, we aligned the DNaseI HS sites of each group (cell type specific, common, ubiquitous) and compared each with the histone modifications of that region. Methylation and acetylation were examined, as distinct forms of each have been associated with activation, repression or both according to context. For example, the H3K27me3 methylation modification has been reported as being present at sites of repressed a gene expression (19).
As shown in Figure 2, all three states of H3K4 methylation and acetylation of both H3K9 and H3K27 were sharply elevated in different types of DNaseI HS sites. H3K9me1 and H4K20me1 were modestly elevated in cell type specific and common DNaseI HS sites, and the elevated levels of these two marks were diminished in ubiquitous DNaseI HS sites (Figure 2). We did not observe elevated levels of trimethylation of either H3K27 or H3K36 in areas surrounding the DNaseI HS sites (Figure 2). The tag densities of these histone modifications in ubiquitous DNaseI HS sites were much higher than those in cell type-specific DNaseI HS sites, except for H3K4me1 and H3K27me3. The percentages of various histone modifications overlapping with DNaseI HS sites from each of the different types are shown in the inset of Figure 2. Only a small fraction of the cell type-specific DNaseI HS sites overlapped with H3K4me2, H3K4me3, H3K9ac and H3K27ac; in contrast, about one-half of cell type-specific DNaseI HS sites overlapped with H3K4me1, H3K9me1, H3K27me3 and H4K20me1 (Figure 2A). Nearly 70% of the ubiquitous DNaseI HS sites overlapped with all of the histone modifications (Figure 2C), and of these, about one-fourth were located in proximal promoters.
We next examined the enrichment/depletion of histone modification profiles at each type of the DNaseI HS sites. We were intrigued to discover that both active and repressive histone modifications corresponded to the depletion at the peak signal of ubiquitous DNaseI HS sites, albeit to differing extents (Figure 2 and Supplementary Table S6). However, we did not observe similar modified histone troughs for the cell type-specific DNaseI HS sites, except in the cases of H3K9 and H3K27 acetylation (Figure 2; Supplementary Table S6). This analysis provided supportive evidence that the DNaseI HS peaks within ubiquitous DNaseI HS sites are nucleosome depleted. Noticeably, we found that the enrichment/depletion at DNaseI HS sites was strongly correlated with the level of DNaseI HS (Supplementary Figure S6 and Supplementary Table S7). Moreover, the correlation of the enrichment/depletion at the ubiquitous DNaseI HS sites was much stronger than that found at the cell type-specific DNaseI HS sites. Trimethylation of H3K27 was the only outlier to this trend, suggesting that the degree of nucleosome depletion is related to the level of DNaseI HS.
To examine the correlation between DNaseI HS sites and the different histone modifications across the entire human genome, we separated cell type specific, common and ubiquitous DNaseI HS sites into 100 groups for each, based on level of DNaseI hypersensitivity. These groups were then plotted against their average modification levels in DNaseI HS sites (Figure 3; see ‘Materials and Methods’ section).
For the cell type-specific DNaseI HS sites, we observed a strongly negative correlation between H3K27me3 and DNaseI hypersensitivity, but a strongly positive correlation between DNaseI hypersensitivity and all other histone modifications (Figure 3A and Supplementary Table S8). Similar correlations between histone modifications and DNaseI hypersensitivity were found for the common DNaseI HS sites, with the exception of a weak correlation with H3K36me3 (Figure 3B and Supplementary Table S8). In ubiquitous DNaseI HS sites, however, modest positive correlations between DNaseI hypersensitivity and H3K4me2, H3K4me3 and H3K9ac were detected and a weak correlation was found with H3K27ac (Figure 3C and Supplementary Table S8). In this type of DNaseI HS sites, DNaseI hypersensitivity was negatively correlated with H3K4me1, H3K9me1, H3K27me3, H3K36me3 and H4K20me1; the H3K36me3 was most negatively correlated, followed in descending order by H4K20me1, H3K9me1, H3K27me3 and H3K4me1 (Figure 3C and Supplementary Table S8).
To identify the general distribution pattern of DNaseI HS sites near and around TSSs, we generated composite profiles (n=1000 each) of the 1000 most active, 1000 median and 1000 least active genes. The genomic region that was analyzed encompassed the entire defined gene body (exons and introns) and extended 5kb upstream and 5kb downstream of the 5′- and 3′-boundaries (Figure 4A). Numbers of tags in the gene body were quantitated in windows representing 10 equal parts, and in the 5′- and 3′-proximal regions in 0.2kb windows; total numbers for each window were then summed to obtain to the overall methylation level for each gene. These numbers were then normalized by the total number of base pairs in each region. It was notable that the DNaseI hypersensitivity signal peaked near both the 5′- and 3′-ends. As such, this signature may represent a useful method by which to confirm annotated TSSs, to identify novel TSSs or to determine alternative TSSs functioning in particular cell types (37). In addition, the DNaseI hypersensitivity level of active genes was consistently much higher than that of silent genes, suggesting that DNaseI hypersensitivity is associated with transcriptional activation.
We then examined the distribution of DNaseI HS sites that overlapped these genes. As shown in the inset of Figure 4A, 849 of the most active genes were found to be associated with DNaseI HS sites, whereas only 421 of the least active genes were similarly associated. Of those most active genes, 258 (>30%) associated with ubiquitous DNaseI HS sites, while only 22 (~5%) of the most silent genes associated with ubiquitous DNaseI HS sites. Association with cell type-specific DNaseI HS sites was found with 22% (189) of the most active genes and >30% (148) of the most silent genes, respectively. These results suggested that ubiquitous DNaseI HS sites are associated with gene activation, whereas cell type-specific DNaseI HS sites are associated with gene repression.
To investigate whether the DNaseI hypersensitivity level was correlated with gene expression level, we grouped the genes into 100 gene sets according to expression level. The sets were then plotted against the DNaseI hypersensitivity levels in the transcribed regions (Figure 4B). This analysis indicated that a strong positive correlation exists between the level of DNaseI hypersensitivity and gene expression (R=0.8713, P=4.73E-32). We then used the same method to assess the correlation of gene expression with the level of DNaseI hypersensitivity within the different types of DNaseI HS sites (Supplementary Figure S7), and found that the positive correlation remained for each. Cell type-specific DNaseI HS sites were more robustly correlated with gene expression than were the ubiquitous DNaseI HS sites, suggesting that cell type-specific DNaseI HS sites are involved in cell type-specific gene regulation (Supplementary Figure S7).
To further clarify whether the relationship between the DNaseI HS sites and histone modifications correlates with gene expression, we compared the DNaseI HS sites and histone modifications with gene expression levels of 17 751 genes. We generated an image plot to determine the average signals of active and repressive histone modifications relative to the tag density of DNaseI hypersensitivity and gene expression levels (Materials in Supplementary Data). Trimethylation of H3K4 and H3K27 were examined as representatives of active and repressive histone modifications, respectively. The results further confirmed previous observations that DNaseI hypersensitive signal is strongly positively correlated with active histone modifications and inversely correlated with repressive histone marks (Supplementary Figure S8). The genes with higher levels of the H3K4me3 signal tended to be expressed at higher levels, while the presence of H3K27me3 signals tended to correlate with decreased levels of expression. This was also consistent with previously published findings by others in which gene expression was significantly associated with presence of histone modifications (7,19). The interrelatedness of DNaseI hypersensitivity, gene expression and histone modifications indicated that active histone modifications generally correlate with the open chromatin state and active gene expression, whereas repressive histone marks indicate closed chromatin and gene silencing (Supplementary Figure S8).
H3K4me3 and H3K27me3 are referred to as a pair of ‘active-repressive’ modifications with regard to their effects on gene activity (7). The overlapping islands of H3K4me3 (‘active’) and H3K27me3 (‘repressive’) histone modifications are defined as ‘bivalent domains’, which have been implicated in the development and differentiation of mammalian embryonic stem cells and differentiated cells (7,21,22,38–41). By counting the total numbers of H3K4me3 and H3K27me3 modifications within each of the different types of DNaseI HS sites from the K562 cell line, we determined that, 15 and 45% of all DNaseI HS sites were associated with H3K4me3 and H3K27me3, respectively (Figure 5A). These numbers were consistent with other cell lines examined (Supplementary Figure S9). Intriguingly, >10% of all DNaseI HS sites were associated with both H3K4me3 and H3K27me3 modifications in multiple cell lines of different origins (Figure 5A and Supplementary Figure S9), indicating that bivalent domains are a widespread phenomenon in mammalian cells and suggesting that the ‘active-repressive’ switch is functionally relevant. However, recent studies suggested that bivalent marks, as described for mammalian embryonic stem cells, do not exist in Xenopus embryos (42,43). Future studies in different model organisms at various developmental stages are essential to elucidate the curious case of the occurrence or absence of bivalent marks (43).
Examining the composition of these DNaseI HS sites showed that over one-third of those with H3K4me3 alone or with both 3K4me3 and H3K27me3 were of the ubiquitous type (Figure 5A and Supplementary Figure S10). However, only 7% of the DNaseI HS sites overlapping with either H3K4me3 alone, both 3K4me3 and H3K27me3 or H3K27me3 alone, were of the cell type-specific type (Figure 5A and Supplementary Figure S10). We then sought to determine whether the signals of DNaseI hypersensitivity, gene expression and H3K4me3 and H3K27me3 significantly differed among the DNaseI HS sites that were associated with H3K4me3 alone, both H3K4me3 and H3K27me3 or H3K27me3 alone, or not associated with H3K4me3 or H3K27me3 at all. As shown in Figure 5B, the highest levels of gene expression, DNaseI hypersensitivity and H3K4me3, and the lowest levels of H3K27me3 were associated with DNaseI HS sites that overlapped with H3K4me3 alone. The lowest levels of gene expression, DNaseI hypersensitivity and H3K4me3, and the highest levels of H3K27me3 were associated with DNaseI HS sites that overlapped with H3K27me3 alone. The DNaseI HS sites that overlapped with both H3K4me3 and H3K27me3 were characterized by high levels of gene expression, DNaseI hypersensitivity, and H3K4me3 and H3K27me3.
Based on the relationship of DNaseI hypersensitivity with active and repressive histone modifications and gene expression, we theorized that at least four major functional modes existed for the different chromatin domain structures observed in the human genome across different cell types (Figure 5C). (i) Active chromatin domains were characterized by relatively higher levels of active histone modifications, DNaseI hypersensitivity and gene expression, and lower levels of repressive histone modifications (Figure 5B; Region I in Figure 5C), (ii) Silent chromatin domains were characterized by lower levels of active histone modifications, DNaseI hypersensitivity and gene expression and higher levels of repressive histone modifications (Figure 5B; Region II in Figure 5C), (iii) The bivalent chromatin domains were characterized by high levels of both active and repressive histone modifications, and high levels of DNaseI HS and gene expression (Figure 5B; Region III in Figure 5C) and (iv) The primed chromatin domains are characterized by patterns of active histone modifications and gene expression similar to the active chromatin domains, but also have similar patterns to the repressive chromatin domains of repressive histone modifications and DNaseI hypersensitivity (Figure 5B; Region IV in Figure 5C).
Next, we investigated whether the specific correlation pattern between the data from DNase-seq and ChIP-seq, in combination with gene expression from microarray analysis, could be used to predict transcription factor binding sites (TFBSs) in the human genome.
In vertebrates, the CTCF, a ubiquitously expressed 11 zinc finger (ZF) protein (44,45), is necessary for insulator element function (46–48). To characterize how the CTCF binding sites are distributed along the human genome, we performed computational meta-analysis of the DNase-seq and ChIP-seq data to identify potential CTCF binding sites. We examined the DNaseI HS sites associated with CTCF ChIP-enriched regions across cell lines (Supplementary Table S9). Of the 35307 DNaseI HS sites that overlapped with CTCF enriched regions in cell type K562, 13 007 were distal (>1kb) to an annotated TSS or RNA Pol II signal. Then, we applied the de novo motif finder MEME to identify CTCF binding sites within these 13 007 distal DNaseI HS sites.
Our MEME analysis revealed that the CTCF consensus DNA binding motif was enriched in these DNaseI HS sites (Figure 6A). Furthermore, the consensus motif was identical in each of the cell types studied (Supplementary Figure S11) and was consistent with previous findings from other cell lines (human islets, CD4+ T cells, HeLa cells and Jurkat cells) (27,49). Most of the distal DNaseI HS sites (10397 out of 13007, 79.9%) contained at least one consensus motif (Figure 6B). Of those, 149 (1.4%) were cell type specific, whereas 7969 (76.7%) and 2279 (21.9%) were common and ubiquitous, respectively. This agreed well with previous observations that CTCF binding in insulator regions is similar across diverse cell types (50). In a statistical context, the consensus motif explains 65% (8456 out of 13007) of the distal DNaseI HS sites after accounting for motifs that are expected to occur by chance. Compared to using DNaseI-Seq data (29%, 39951 out of 139121; Supplementary Figure S12A) or ChIP-seq data (57%, 46328 out of 81688; Supplementary Figure S12B) alone, our result illustrated the high accuracy of CTCF binding sites discovery based on the integrative data (Figure 6B). Most often, these distal DNaseI HS sites contained only a single motif (Figure 6C).
The canonical motif was highly located on the peak of the DNaseI HS signal within each DNaseI HS site (Supplementary Figure S13A), which was consist with previous finding (27). This indicated that these peaks serve as the point of contact by the protein in vivo. Not surprisingly, we also found that the identified motif is located far distally from the nearest upstream or downstream associated genes (Supplementary Figure S13B), a finding that would explain the widespread and fundamental role of CTCF. Although the identified motif appeared to represent the major CTCF binding sequence, a significant number of the sites lacked the consensus sequence. Another study recently found that CTCF can bind to genomic regions that apparently lack the defined motif (51); this may be a result of DNA–CTCF interactions mediated by contacts with distinct arrangements of CTCF's 11-ZFs (44,48,52–54).
The dynamic and complex regulatory elements of gene transcription that underlie every biological process, from cell type-specific functions to systemic response to the environment, remain to be completely defined or understood. Whole-genome mapping of DNaseI HS sites has provided crucial clues to regions of transcriptional regulation. By combining these data with data from ChIP-seq and gene expression microarray experiments, we may gain a better understanding of this process. To this end, we performed a meta-analysis using each of these datasets that are publically available. By classifying the DNaseI HS sites according to their characteristic of cell type specific, common or ubiquitous, we were able to identify approximately 900000 DNaseI HS sites from 29 diverse cell lines. These sites of presumed transcriptional regulatory function encompassed 8.28% of the human genome, which was in agreement with the proportion reported in a related study (26). Detailed examination of all the DNaseI HS sites in each of the 29 cell lines revealed that only approximately 10000 (~7%) are cell type specific and approximately 19000 (~14%) were ubiquitous, covering 0.67 and 0.07% of the human genome, respectively. The percentage of cell type-specific DNaseI HS sites was much smaller than that obtained previously by Xi et al. (26). This discrepancy may be due to the particular cell types examined in our study or the fact that we examined more different cell types in total. Nonetheless, even though 29 cell types were considered in our study, we were also unable to reach the statistical saturation point for the DNaseI HS sites that are likely to exist throughout the human genome. This finding indicated that functional TFBSs make up a more significant percentage of the genome.
By investigating the ubiquitous DNaseI HS sites, we hoped to gain insight into the function of housekeeping genes and overall chromatin structures. Our location analysis of DNaseI HS sites relative to annotated genes indicated that nearly 50% of the ubiquitous DNaseI HS sites in these data sets were located in proximal promoters or exons. This analysis, when considered in tandem with the overlapping with CpG islands between DNaseI HS sites and levels of sequence conservation among different cell lines (Supplementary Figures S4 and S5), indicated that ubiquitous DNaseI HS sites are generally associated with housekeeping promoters of genes. The strong correlation of DNaseI HS sites with TFBSs revealed that DNaseI HS sites are highly restricted to gene regions and their transcriptional regulatory elements. This provides further proof that DNaseI HS sites in the genome are valuable markers of TF binding regions.
Other groups have noticed sharp declines in the presence of active histone marks near TSSs, termed as nucleosome depleted regions (7,25,55). Interestingly, we observed that both active and repressive histone modifications are characterized by modified histone troughs, which are centered at DNaseI HS peaks of ubiquitous DNaseI HS sites, suggesting that these regions are nucleosome depleted. The degree of nucleosome depletion is undoubtedly related to chromatin accessibility and level of DNaseI hypersensitivity (Supplementary Figure S6; Supplementary Table S7). Our finding extends previous observations made in yeast, flies and mammalian systems, particularly the human, that nucleosome depletion is a general characteristic of active promoters. Furthermore, our finding provides supportive evidence that nucleosome depletion is associated with ubiquitous DNaseI HS sites and the ubiquitous DNaseI HS sites are nucleosome depleted.
Both cell type specific and ubiquitous DNaseI HS sites are generally positively correlated with active histone modifications H3K4me2/3, H3K9ac, H3K27ac and negatively correlated with the repressive modification H3K27me3. Interestingly, monomethylations of H3K4, H3K9, H3K36 and H4K20 display a more complex functional relationship with chromatin, as they are positively correlated with cell type-specific DNaseI HS sites and negatively correlated with ubiquitous sites. Correlation of DNaseI hypersensitivity with gene expression suggests that DNaseI hypersensitivity is highly correlated with transcriptional activation. Ubiquitous DNaseI HS sites were associated with active genes, while cell type-specific DNaseI HS sites were correlated with silent genes. Correlations of DNaseI hypersensitivity, histone modifications and gene expression indicated that active histone modifications generally correlated with active gene expression and the open (accessible) chromatin state, whereas repressive histone marks correlated with gene silencing and tightly packed (inaccessible) chromatin state (Figure 5C). These specific correlations were summarized to reveal four distinct modes of chromatin domain function: repressive, active, primed and bivalent. These modes of association, which are much more finely described than the traditional classification of heterochromatin (open) or euchromatin (closed), provide insights into the complex structure and function of chromatin.
After systematic exploration of the relationships between data emanating from DNase-seq, ChIP-seq and gene expression microarrays, we were able to identify transcriptional regulatory elements by using the de novo motif finder MEME. Our limited MEME analysis of the CTCF binding sites served to illustrate the high specificity and accuracy in identification of transcriptional regulatory elements using the integrated data sets (Figure 6B and Supplementary Figure S12). The advent of high-throughput methods, including DNase-chip/DNase-seq, ChIP-chip/ChIP-seq and gene expression arrays, has encouraged large storage databases of related data and public availability for more significant analysis. Combining distinct genome scale data in meta-analysis approaches represents the next era of genomic research.
Our integrative analysis presented herein can be considered as first-order data integration. The simplicity of the overlapping approach to determine regions and characteristics of enriched domains from DNase-seq and ChIP-seq data, along with the motif discovery and Gene Ontology strategies, represent an efficient means by which to perform integrative analysis. The next higher order, comprehensive, integrative analysis should integrate diverse high-throughput sequence tags directly. Additionally, interactome data (56) can be applied to facilitate identification of the target genes of DNaseI HS sites, since TFs and their three-dimensional interactions are crucial to gene regulation (57,58). Technologies based on chromosome conformation capture (3C) (59), circularized chromosome confirmation capture (4C) (60), carboncopy chromosome confirmation capture (5C) (61) and the recently developed methods of Hi-C (62) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) (63,64), can be used to provide a high-resolution genome-wide map of long-range genomic interactions. Integrating a wide variety of genomic data sets, including genomes, epigenomes (55), transcriptomes (65) and interactomes (56) will significantly enhance our understanding of the complex genomic systems of all life forms and enhance our efforts to understand and modulate health and optimize well-being (66).
Supplementary Data are available at NAR Online.
National High Technology Research and Development Program of China (No. 2007AA02Z311 to W.S.); National Nature Science Foundation of China (No. 30700139 and No. 31070639 to W.S.). Funding for open access charge: National Nature Science Foundation of China (No. 31070639 to W.S.).
Conflict of interest statement. None declared.
We wish to thank the ENCODE Project Consortium for making their data publicly available and the ENCODE Open Chromatin and Broad Histone groups for providing the DNase-seq, ChIP-seq data and gene expression data. We thank Yongchao Liu (School of Computer Engineering, Nanyang Technological University, Singapore) for sharing CUDA-MEME source codes for us. The authors would like to thank the anonymous reviewers for their constructive comments, which contributed to an improved presentation.