We present DNase-chip data from six cell lines and classify the DNaseI HS sites into cell type–specific, common (found in more than one but not all cell lines), and ubiquitous categories. Only 22% of all DNaseI HS sites are ubiquitous (present in all six cell lines), indicating that the majority of gene regulatory elements are involved in cell type–specific functions.
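The three-way classification described above can be sketched in a few lines of Python. This is a minimal illustration of the category definitions only, not the pipeline used in the study: it assumes that overlapping sites have already been matched across cell lines, and the function name `classify_site` and the cell line labels are hypothetical placeholders.

```python
def classify_site(cell_lines_with_site, all_cell_lines):
    """Assign a DNaseI HS site to one of the three categories.

    cell_lines_with_site: cell lines in which the site was detected.
    all_cell_lines: every cell line that was assayed.
    Assumes site matching across cell lines was done upstream.
    """
    assayed = set(all_cell_lines)
    n_present = len(set(cell_lines_with_site) & assayed)
    if n_present == 0:
        raise ValueError("site was not detected in any assayed cell line")
    if n_present == 1:
        return "cell type-specific"
    if n_present == len(assayed):
        return "ubiquitous"
    return "common"  # more than one but not all cell lines


# Hypothetical labels standing in for the six assayed cell lines.
CELL_LINES = ["line_A", "line_B", "line_C", "line_D", "line_E", "line_F"]

print(classify_site({"line_A"}, CELL_LINES))
print(classify_site({"line_A", "line_C"}, CELL_LINES))
print(classify_site(set(CELL_LINES), CELL_LINES))
```

Because the categories depend only on the number of cell lines in which a site is detected, adding a cell line can move a site from ubiquitous to common, which is why the classification is expected to sharpen as more cell types are assayed.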
The identification of ubiquitous DNaseI HS sites provides clues to the function of housekeeping chromatin structures that are maintained in most cell types. We detected 259 such ubiquitous sites in the ENCODE regions. Approximately 86% of ubiquitous DNaseI HS sites are proximal to TSSs and map to basal transcription factor binding sites, indicating that these regions function as housekeeping promoters. The majority of ubiquitous distal DNaseI HS sites are bound by CTCF, a protein with known enhancer-blocking insulator activity [7], indicating that CTCF is involved in maintaining stable chromatin structure and gene expression across many cell types. Because most ubiquitous sites are bound by either the basal transcription machinery or CTCF, we conclude that ubiquitous DNaseI HS sites function primarily as promoters and insulators, but not as enhancers. Cell type–specific DNaseI HS sites, in contrast, are more enriched for binding sites of proteins with known enhancer activity, cell type–specific histone modifications, and cell type–specific gene expression.
Although our DNaseI HS site data were limited to 1% of the human genome, the integration of multiple data types allowed us to conclude that we have uncovered many different types of functional regulatory elements. In the future, as additional cell types are analyzed using whole-genome DNase-chip, we will be able to better characterize DNaseI HS sites that are truly cell type specific, as well as those that are shared between cell types of similar lineages. We expect that genome-wide analysis of additional cell types, under different cellular conditions or at different developmental stages, will enable more powerful de novo motif discovery, similar to our identification of CTCF, for identifying and characterizing unknown factors that regulate temporal and spatial gene expression.
Our integrated approach combines the strengths of four high-throughput technologies (DNase-chip, ChIP-chip, expression array, and motif discovery). DNase-chip can identify all types of regulatory elements in a single experiment, and integration with other datasets has allowed us to delineate the functions of subsets of DNaseI HS sites. This approach will become increasingly powerful as more high-throughput datasets become available and will be an important part of ensuring that no regulatory element is missed. Nonetheless, our analysis is missing an important component: we cannot identify the target gene(s) of a DNaseI HS site. Technologies such as chromosome conformation capture carbon copy (5C) [35] are ideal for detecting large numbers of long-range interactions between genomic elements. Since 5C works best by anchoring to known regulatory elements, the DNaseI HS sites identified in our study can be used to significantly reduce the search space.
DNaseI HS sites can be used as a general tool for evaluating future ChIP-chip datasets that have been generated from only one of the cell types described here, to determine whether those factors bind genomic DNA in a cell type–specific or ubiquitous manner. This is illustrated for the p300 ChIP-chip data in Figure S9, which shows that the percentage of p300 binding (assayed in HeLa cells) is highest for HeLa-specific distal DNaseI HS sites. Other examples of cell line–specific marks are H3K4me1 and H3K4me2 (Figure S9). In contrast, H3K4me3, H3ac, H4ac, and CTCF show less cell line specificity (Figure S10). Our DNaseI HS data can also be used to help identify unknown transcription factor binding and unknown histone modification patterns. For example, only 60% of proximal DNaseI HS sites overlap with the five histone modifications we examined in this study. Future studies will be needed to identify the histone modification(s) associated with the remaining regions.
While DNaseI HS sites from each cell type cover only approximately 2%–3% of the genome, the combined DNase data from six cell types cover roughly 8% of the genome. Since the actual functional regulatory sequences (i.e., protein binding sites) may make up only a fraction of each DNaseI HS site, the actual percentage of functional DNA may be smaller. Because we have not detected a significant decrease in the number of new DNaseI HS sites identified with the addition of each cell type, a large percentage of the genome may be functional across all possible cell types, disease states, and responses to external stimuli. Whole-genome identification of all DNaseI HS sites using DNase-chip [4] or DNase-sequencing [36] methods will play a key role in identifying and ultimately understanding the function of all functional noncoding DNA sequences.
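The coverage argument above, in which each cell type covers a few percent of the genome while the union keeps growing as cell types are added, can be illustrated with a short sketch. It is not the analysis code from this study. It assumes half-open `(start, end)` intervals on a single shared coordinate system, and the names `merged_length` and `cumulative_coverage` as well as the toy intervals are hypothetical.

```python
def merged_length(intervals):
    """Total length covered by a set of possibly overlapping intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def cumulative_coverage(sites_by_cell_type, region_length):
    """Fraction of the region covered after adding each cell type in turn."""
    pooled = []
    fractions = []
    for intervals in sites_by_cell_type:
        pooled.extend(intervals)
        fractions.append(merged_length(pooled) / region_length)
    return fractions


# Toy data: three hypothetical cell types on a 1,000-bp region.
toy_sites = [[(0, 10)], [(5, 15)], [(100, 110)]]
print(cumulative_coverage(toy_sites, 1000))
```

If the increments in the returned list shrink toward zero as cell types are added, the set of sites is approaching saturation; increments that stay roughly constant correspond to the situation described above, where each new cell type continues to contribute new sites.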