The identification of regulatory elements from different cell types is necessary for understanding the mechanisms controlling cell type–specific and housekeeping gene expression. Mapping DNaseI hypersensitive (HS) sites is an accurate method for identifying the location of functional regulatory elements. We used a high-throughput method called DNase-chip to identify 3,904 DNaseI HS sites from six cell types across 1% of the human genome. A significant number (22%) of DNaseI HS sites from each cell type are ubiquitously present among all cell types studied. Surprisingly, nearly all of these ubiquitous DNaseI HS sites correspond to either promoters or insulator elements: 86% of them are located near annotated transcription start sites and 10% are bound by CTCF, a protein with known enhancer-blocking insulator activity. We also identified a large number of DNaseI HS sites that are cell type specific (only present in one cell type); these regions are enriched for enhancer elements and correlate with cell type–specific gene expression as well as cell type–specific histone modifications. Finally, we found that approximately 8% of the genome overlaps a DNaseI HS site in at least one of the six cell lines studied, indicating that a significant percentage of the genome is potentially functional.
There are many different types of gene regulatory elements that control gene expression. Identifying the location of these regulatory elements in the genome, as well as understanding how exactly they control gene expression in different cell types, has been a major challenge. Here, we use a relatively new strategy to identify all gene regulatory elements within a select 1% of the human genome from six diverse human cell types. We find that only 22% of gene regulatory elements are shared among all cell types studied. Among these, 86% are located near annotated transcription start sites and 10% are bound by CTCF, a protein with known enhancer-blocking insulator activity. The gene regulatory elements that are found to be cell type specific are highly correlated with cell type–specific gene expression as well as cell type–specific chromatin modifications. This indicates that we have made a significant step toward understanding why some genes are expressed in all different cell types within the human body, and why others are only expressed in certain cell types.
Biological processes such as proliferation, apoptosis, differentiation, development, and aging require carefully orchestrated spatial and temporal gene expression [1,2]. To understand the molecular mechanisms that underlie global transcriptional regulation, it is essential to identify all the DNA regulatory elements in the human genome. Three methods, DNaseI hypersensitive site (HS) mapping, chromatin immunoprecipitation followed by hybridization to tiled arrays (ChIP-chip), and expression arrays identify gene regulatory elements in different ways. DNaseI HS sites identify regions of open chromatin, which encompass all different types of regulatory elements, including promoters, enhancers, silencers, insulators, and locus control regions (LCR). However, DNaseI HS mapping does not directly reveal the transcription factor(s) that bind within each DNaseI HS site. ChIP-chip directly identifies the global locations of regulatory factors [4–6], but this method can only be used to study known factors and requires high quality ChIP-grade antibodies. In addition, expression arrays detect genes that are expressed in certain cell types, but do not provide information regarding the factors that cause the cell type–specific expression. Therefore, to completely understand how chromatin structure ultimately regulates gene expression, a multi-pronged integrated experimental approach using all three methods is needed.
We previously used DNase-chip to identify DNaseI HS sites from two cell types across the 1% of the human genome identified by the ENCODE consortium. DNase-chip is a method that works by capturing DNase-digested ends, labeling, and hybridizing the material to tiled microarrays. This method is highly sensitive and specific when used to identify valid DNaseI HS sites.
To identify the regulatory elements that control cell type–specific and housekeeping gene expression, we have now performed DNase-chip on the same 1% of the genome from six diverse human cell types: CD4+ T cells, GM06990 (B lymphoblastoid), K562 (erythroleukemia), H9 (undifferentiated embryonic stem cell), IMR90 (fetal lung fibroblast), and HeLa S3 (cervical carcinoma). In this study, we find that approximately 22% of all DNaseI HS sites from each cell type are ubiquitously present in all six cell types, while the remainder are a mixture of cell type specific (only present in one cell type) or common (present in two to five cell types). To identify the regulatory roles of these DNaseI HS sites, we performed computational analyses to integrate the DNase data with the ChIP-chip data for two distinct enhancer-binding proteins, one insulator-binding protein, and five histone modifications, as well as expression data from the same six cell lines. The majority (86%) of ubiquitous DNaseI HS sites are within 2 kb of a transcription start site (TSS). Surprisingly, of the remaining ubiquitous HS sites that are distal to TSS, the majority (70%) are bound by CTCF, a factor with known enhancer-blocking activity, suggesting that a major role of ubiquitously modified chromatin is to prevent misregulation by local enhancers. In contrast, cell type–specific HS sites are correlated with known enhancer elements and histone-modified regions in a cell type–specific manner. Cell type–specific DNaseI HS sites also contain overrepresented sequence motifs that are biologically relevant and often map near the TSS of genes that exhibit cell type–specific expression. Collectively, these results show that ubiquitous chromatin structures are predominantly associated with promoters and insulators while enhancers tend to associate with cell type–specific chromatin structures.
For each cell type, DNase-chip data was generated using three concentrations of DNase on each of three biological replicates (see Figure S1 for correlation plots). Averaged data from all replicates (Figure 1A) was used for subsequent analyses, because we have previously shown that averaging data from replicate datasets generates higher sensitivity and specificity. Similar numbers of DNaseI HS sites were identified from each cell type, indicating data consistency (Table 1). To determine specificity for each cell line, we determined the overlap of DNase signal with previously reported “gold standard” negative sets of DNaseI HS sites, established for CD4+ T cells and GM06990 cell lines using real-time PCR, and calculated >92% specificity for all six cell lines (Table 1). As a second measure of specificity, we determined the numbers of significant signals that are detected in two ENCODE regions (ENr112 and ENr313) that are depleted for TSS, DNaseI HS sites, active histone modifications, and ChIP-chip signals. Significant signals that map within these two regions are considered likely false positives. For each cell line, only a few significant signals were observed in these two regions, which also indicates high specificity (Table 1). We have previously shown that sensitivity of DNase-chip experiments from CD4+ T cells and GM06990 was >86%. To assess the sensitivity of these additional cell lines, we examined five well-characterized DNaseI HS sites that make up the globin locus control region [9,10]. We robustly detect all five DNaseI HS LCR sites in K562 cells, as well as the well-characterized 3′ DNaseI HS site (Figure 1B). In addition, results from all six cell lines show a significant enrichment for TSS and CpG islands, one of the hallmarks of active chromatin (Table 1 and unpublished data). Together, these results indicate that the sensitivity and specificity in the four newly studied cell lines are consistent with those in CD4+ T and GM06990 cells.
All DNase-chip data described here is publicly available on the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu).
DNaseI HS sites are classified as cell type specific (only found in one out of six cell lines), common (found in two to five cell lines), or ubiquitous (found in all six cell lines) (Table 1; Figure 2A). Between any two cell lines, fewer than 50% of DNaseI HS sites overlap. The highest overlapping datasets were from the two lymphocyte cell lines, CD4+ T and GM06990. On average for each cell type, 32% of DNaseI HS sites are cell type specific, 46% are common, and 22% are ubiquitous. A total of 3,904 distinct DNaseI HS sites were identified from the six cell types.
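The three-way classification can be illustrated with a short sketch. This is not the pipeline used in the study: it assumes that sites from all cell lines have already been merged into distinct regions and that each region carries the set of cell lines in which it was called. The function name `classify_site`, the cell line labels, and the example data are hypothetical.

```python
from collections import Counter

# Labels for the six cell lines (names chosen for illustration only).
CELL_LINES = ("CD4", "GM06990", "K562", "H9", "IMR90", "HeLa")

def classify_site(lines_present, n_total=len(CELL_LINES)):
    """Classify a site by the number of cell lines in which it was called."""
    n = len(set(lines_present))
    if n < 1 or n > n_total:
        raise ValueError("a site must be present in 1..n_total cell lines")
    if n == 1:
        return "cell type specific"
    if n == n_total:
        return "ubiquitous"
    return "common"  # two to five cell lines

# Hypothetical merged sites: region id -> cell lines with a significant signal.
sites = {
    "site_a": {"K562"},
    "site_b": {"CD4", "GM06990"},
    "site_c": set(CELL_LINES),
}
counts = Counter(classify_site(lines) for lines in sites.values())
```

Tallying `counts` per cell line, rather than over all distinct regions, would give per-cell-type percentages of the kind quoted above.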
To test whether we can determine cell type specificity from DNaseI HS sites, we compared cluster dendrograms from both DNaseI HS sites and expression data. Both dendrograms have the closest clustering occurring between CD4+ T cell and GM06990 B lymphoblastoid (Figures 2A and S2). This is to be expected, as these two cell types are derived from a common lymphoid progenitor. Interestingly, K562, which is an erythroleukemia cell line, does not cluster closely with CD4+ T and GM06990 using either DNase or expression data. However, other studies have shown that K562 cells have characteristics distinct from B and T cells.
To determine whether we have identified most DNaseI HS sites in the ENCODE regions, we computed the cumulative percentage of base pairs as a function of the number of cell lines tested. As additional cell lines are included, the total percentage of base pairs of the ENCODE regions covered by DNaseI HS sites increases steadily, reaching ~8% at six cell lines (Figure 2B). We wanted to know whether the new sites are predominantly those that are unique to one cell type. DNaseI HS sites that are observed in only one cell type tend to be less sensitive to DNaseI cleavage and hence may be more affected by microarray noise than sites observed in multiple cell types. Nonetheless, as discussed in the last section of Results, cell type–specific DNaseI HS sites are enriched in the regions bound by enhancer proteins and regions with modified histones; thus, they contain bona fide regulatory elements. We do not detect a significant leveling off after the addition of the sixth cell type, even if we only analyze DNaseI HS sites that are common in at least two cell types. This indicates that additional cell lines must be tested in the future to identify most DNaseI HS sites.
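A cumulative coverage curve of this kind can be computed by taking the union of site intervals as each cell line is added. The sketch below is an illustration under simplifying assumptions, not the authors' code: sites are half-open `(start, end)` intervals on a single coordinate axis, and the function names and toy data are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

def cumulative_coverage(sites_by_line, region_length):
    """Fraction of bases covered by the union of sites after adding each cell line."""
    pooled, fractions = [], []
    for line_sites in sites_by_line:
        pooled = merge_intervals(pooled + list(line_sites))
        covered = sum(end - start for start, end in pooled)
        fractions.append(covered / region_length)
    return fractions

# Toy example: three "cell lines" on a 100-bp region.
toy_sites = [[(0, 10)], [(5, 20)], [(50, 60)]]
fractions = cumulative_coverage(toy_sites, 100)
```

The order in which cell lines are added changes the shape of the curve but not its final value, so a plateau is best judged over several orderings.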
We calculated the distance of DNaseI HS sites to the nearest TSS, with the TSS set defined in a comprehensive way by the Integrated Analysis group of the ENCODE consortium (Figure 2C). Thirty-four percent of cell type–specific DNaseI HS sites are proximal to a TSS (<2 kb). In stark contrast, 86% of ubiquitous DNaseI HS sites are proximal to TSS. The distribution of proximal DNaseI HS sites in terms of the numbers of tissues in which they occur is significantly different from those of other DNaseI HS site categories in Figure 2C (p-value < 2.2 × 10⁻¹⁶ by Wilcoxon test). The dramatic increase in proximal DNaseI HS sites in six cell lines over five cell lines indicates the high quality of the DNase data, because one would expect a more gradual shift for data with low sensitivity and specificity. While proximal DNaseI HS sites are overrepresented in the genome, distal DNaseI HS sites are underrepresented and become increasingly underrepresented the further away from the TSS (Figure S3). Therefore, distal DNaseI HS sites are not uniformly distributed in the genome but instead are located closer to genes.
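The proximal/distal split depends only on the distance to the nearest TSS, which can be found efficiently with a binary search over sorted TSS positions. The sketch below is illustrative: measuring from the site midpoint is an assumption (the text does not state which point of a site was used), and the function names are hypothetical.

```python
from bisect import bisect_left

PROXIMAL_CUTOFF = 2000  # 2 kb, the threshold used in the text

def distance_to_nearest_tss(position, sorted_tss):
    """Distance from a position to the closest TSS in a sorted, non-empty list."""
    i = bisect_left(sorted_tss, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in candidates)

def is_proximal(site_start, site_end, sorted_tss):
    """True if the site midpoint (an assumption) lies within 2 kb of a TSS."""
    midpoint = (site_start + site_end) // 2
    return distance_to_nearest_tss(midpoint, sorted_tss) < PROXIMAL_CUTOFF

# Toy TSS positions on one chromosome.
tss_positions = [1000, 10000]
```

For example, `is_proximal(2400, 2600, tss_positions)` is true because the midpoint is 1.5 kb from the first TSS, whereas a site spanning 5000 to 5200 is distal to both.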
Because CpG islands are generally associated with housekeeping promoters of mammalian genes, we asked whether the percentage of CpG dinucleotides differs among unique, common, and ubiquitous DNaseI HS sites. Cell type–specific DNaseI HS sites tend to have similar percentages of CG dinucleotides regardless of whether the DNaseI HS sites are proximal (<2 kb) or distal (>2 kb) to a TSS (Figure 2D). DNaseI HS sites that are common to more cell types are more CpG rich than cell type–specific sites, and the slope of this increase is much greater for proximal sites than for distal sites (Figure 2D). A similar but more moderate trend is detected for G + C mononucleotide (Figure S4).
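As a concrete reading of "percentage of CpG dinucleotides", the sketch below counts CG among all overlapping dinucleotide positions of a sequence. This definition is an assumption made for illustration, since the exact formula used for Figure 2D is not given here, and `cpg_percentage` is a hypothetical name.

```python
def cpg_percentage(seq):
    """Percentage of overlapping dinucleotide positions in seq that are CG."""
    seq = seq.upper()
    n_positions = len(seq) - 1
    if n_positions < 1:
        return 0.0
    n_cpg = sum(1 for i in range(n_positions) if seq[i:i + 2] == "CG")
    return 100.0 * n_cpg / n_positions

def gc_percentage(seq):
    """Percentage of G + C mononucleotides, for comparison with the CpG value."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(1 for base in seq if base in "GC") / len(seq)
```

Averaging `cpg_percentage` over the sites in each category (number of cell lines, proximal or distal) would give a trend comparable in spirit to the one described above.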
For the 222 proximal DNaseI HS sites (<2 kb from TSS) that are ubiquitous in all six cell lines, 78% overlap recently published ChIP-chip data specific to basal promoter factors (RNA Pol II and TAF1), enhancers (p300 and TRAP220), or the insulator factor CTCF (Figure 3A and 3B). The majority (81%) of DNaseI HS sites bound by p300 or TRAP220 also bind Pol II or TAF1. However, only 22% of DNaseI HS sites bound by CTCF are also bound by TAF1 or Pol II, suggesting that CTCF binding to the promoter decreases the likelihood of binding by other factors examined. For the ubiquitous proximal DNaseI HS sites that do not overlap known promoter factors, 53% overlap with the ChIP hits of H3K4me3, a histone modification mark for active promoters (Figure 3A).
Of the 259 DNaseI HS sites that are ubiquitous in all six cell lines, 37 are distal (>2 kb) to a TSS. To identify the protein(s) that putatively bind to these regions, the minimal intersecting region for each ubiquitous site was analyzed by the de novo motif-finding algorithm MEME. The most significant motif (p-value < 2.2 × 10⁻¹⁶; Figure 4A) was not in the TRANSFAC or Jaspar databases, but is nearly identical to the motif recently discovered in a genome-wide ChIP-chip study with an antibody against CTCF in IMR90 cells. Using this ChIP-chip data, we find that 70% (26/37) of the ubiquitous distal DNaseI HS sites overlap with CTCF binding sites (Figure 4B; Tables S1 and S2). An additional four ubiquitous distal DNaseI HS sites that do not overlap with CTCF hits contain the CTCF motif. Some of these 26 distal ubiquitous sites that overlap CTCF are clustered in the genome. For example, three DNaseI HS sites are in the IGF2/H19 locus (ENCODE region ENm011), clearly isolating the H19, IGF2, and TH genes (Figure 4C). The well-characterized imprinting insulator between H19 and IGF2 is not one of these three; however, it overlaps with a DNaseI HS site that is present in five cell lines. In addition, four ubiquitous DNaseI HS sites overlap CTCF hits in the HoxA locus (Figure S5).
The CTCF motif can be found in 88 (55%) of the 160 DNaseI HS sites that overlap with CTCF hits (p-value cutoff of 10⁻⁵ as computed by the MAST algorithm; Table S2). Most (85%) of these 88 DNaseI HS sites contained only a single CTCF motif site. DNaseI HS sites that contain two (n = 23) or three motifs (n = 4) were not more enriched for CTCF ChIP-chip hits (unpublished data), indicating a single CTCF motif is sufficient to facilitate significant CTCF binding. CTCF motif sites in both distal and proximal DNaseI HS sites are significantly more conserved than neighboring genomic regions based on phastCons conservation scores (Figure S6). Approximately 19% of DNaseI HS sites in IMR90 that do not overlap CTCF ChIP-chip data (139/1084) contain the CTCF motif (Table S2). Although these CTCF motif sites are on average less conserved than those that overlap CTCF ChIP-chip hits (Figure S6), the subset in distal DNaseI HS sites are still significantly more conserved than neighboring genomic regions (leftmost bar in Figure S6), indicating that they may bind CTCF in living cells.
We performed cell culture enhancer-blocking assays on seven CTCF motif-containing DNaseI HS sites; six of these are ubiquitous and the other one is common in five cell types (Table S3). All seven clones display significant enhancer-blocking activity (Figure 4D; p-value = 0.002), including the DNaseI HS site that does not overlap a CTCF ChIP-chip hit (DHS4). Three of the DNaseI HS sites are proximal to TSS (DHS1, DHS4, and DHS6). Although we only tested a small number of DNaseI HS sites, our results indicate that DNaseI HS sites that occur in many cell types and contain the CTCF motif are likely functional insulators. In addition, proximal DNaseI HS sites near TSS can also function as insulators.
Of the 225 CTCF ChIP-chip hits that map within ENCODE regions, 160 (71%) overlap with DNaseI HS sites identified in IMR90 cells. The percentage of DNaseI HS sites that overlap CTCF ChIP-chip hits steadily increases for DNaseI HS sites that are more common, with the highest percentage occurring within ubiquitous DNaseI HS sites (Figure 5). This is in contrast to the binding sites for p300 and TRAP220, proteins with enhancer activity, which are preferentially detected in cell type–specific and less common DNaseI HS sites (Figure 5). This indicates that insulators, but not enhancers, comprise the majority of ubiquitous distal regulatory elements.
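Overlap percentages such as the 160 of 225 CTCF hits above reduce to a simple interval-overlap count. The following sketch shows one way to compute such a fraction; it assumes half-open `(start, end)` intervals on the same chromosome and a minimum overlap of one base, neither of which is stated in the text, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def fraction_overlapping(query, targets):
    """Fraction of query intervals that overlap at least one target interval."""
    if not query:
        raise ValueError("query must contain at least one interval")
    n_hit = sum(1 for q in query if any(overlaps(q, t) for t in targets))
    return n_hit / len(query)

# Toy example: ChIP-chip hits (query) against DNaseI HS sites (targets).
chip_hits = [(0, 10), (20, 30), (40, 50)]
dhs_sites = [(5, 8), (29, 35)]
fraction = fraction_overlapping(chip_hits, dhs_sites)
```

This quadratic scan is adequate for a few thousand intervals; a sorted sweep or an interval tree would be the usual choice at genome scale.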
Previously, ChIP-chip for four histone modifications (H3K4me2, H3K4me3, H3ac, and H4ac) was performed on three cell lines (HeLa, GM06990, and K562) and ChIP-chip for a fifth, H3K4me1, was performed on two cell lines (HeLa and GM06990). We calculated the number of DNaseI HS sites that overlap ChIP-chip hits for each histone modification in 3-by-3 cell line combinations (2-by-2 in the case of H3K4me1). Ubiquitous DNaseI HS sites often overlap with ubiquitous histone modification hits, in particular with H3K4me3 and H3ac, which are strong markers for the 5′ ends of active genes. This is consistent with our aforementioned results indicating that 86% of ubiquitous DNaseI HS sites are promoters. There are 78 and 103 ubiquitous H3K4me3 and H3ac hits in the ENCODE regions, respectively; 59 and 80 of them overlap with ubiquitous DNaseI HS sites. Ubiquitous DNaseI HS sites and ubiquitous histone modification hits were excluded from the remaining analysis in this section, because they merely increase all counts of overlap. The counts were divided by the corresponding row sum and column sum and multiplied by the matrix sum to obtain enrichment values, that is, the ratio of observed to expected counts as used in the χ² test (see Figure S7 for detailed explanation). In Figure 6A and 6B, we plot the enrichment factor for H3K4me2 in a 3-by-3 grid (see Figure S8 for other histone modifications). The diagonal matched cell line enrichment values (all >1) are much larger than off-diagonal mismatched cell line values (<1 for all comparisons except H3ac in the K562-HeLa comparison), indicating that DNaseI HS and ChIP-chip experiments are both detecting similar genomic regions that reflect cell type specificity. This agreement is particularly striking given that the DNaseI HS and ChIP-chip experiments were performed in different labs and on different microarray platforms (the histone modification experiments were on spotted PCR arrays).
We performed overlap analysis on DNaseI HS sites and p300 ChIP-chip hits in three cell types (HeLa, K562, and GM06990), in the same way as described above for histone modifications. Again, ubiquitous DNaseI HS sites and ubiquitous p300 hits were excluded from this analysis. The results are shown in Figure 6C and 6D, for proximal and distal DNaseI HS sites separately, both indicating strong colocalization.
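The enrichment calculation described in this paragraph (count divided by its row sum and column sum, multiplied by the matrix sum) can be written out directly. The sketch below follows that description; the function name and the toy counts are hypothetical, and returning NaN for empty rows or columns is a choice made here rather than something stated in the text.

```python
def enrichment_matrix(counts):
    """Enrichment = count * matrix_sum / (row_sum * column_sum) for each cell."""
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    total = sum(row_sums)
    return [
        [
            counts[i][j] * total / (row_sums[i] * col_sums[j])
            if row_sums[i] and col_sums[j] else float("nan")
            for j in range(len(col_sums))
        ]
        for i in range(len(row_sums))
    ]

# Toy 2-by-2 overlap counts: rows are DNaseI HS cell lines, columns are
# ChIP-chip cell lines; matched cell lines lie on the diagonal.
toy_counts = [[8, 2],
              [2, 8]]
toy_enrichment = enrichment_matrix(toy_counts)
```

In the toy matrix the diagonal values come out above 1 and the off-diagonal values below 1, which is the pattern the text interprets as cell type–specific agreement between the two data types.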
We hypothesized that cell type–specific DNaseI HS sites are involved in cell type–specific gene regulation, and therefore expected them to colocalize with (within 2 kb of) the TSS of genes active in the corresponding cell line. Because only a few genes were strictly expressed exclusively in one cell line within the ENCODE regions, the definition of cell type–specific genes was relaxed to include genes expressed in no more than two cell lines in addition to the cell line of interest. The 6-by-6 enrichment matrix (Figure 6E) was constructed in the same way as described above for histone modifications. The diagonal enrichment values (matched cell lines) are larger than off-diagonal values (mismatched cell lines), indicating that cell type–specific DNaseI HS sites tend to colocalize with genes that are expressed in the corresponding cell types (p-value = 1.15 × 10⁻⁴ by Wilcoxon rank sum test). The significance is maintained if we loosen the proximity criteria to DNaseI HS sites that are within 5 kb or 10 kb of a cell type–specific TSS (unpublished data).
To identify putative regulatory factors that bind cell type–specific DNaseI HS sites, we analyzed these sites with Clover, a motif-finding algorithm that identifies motifs from the TRANSFAC database that are enriched in a set of sequences, namely the DNaseI HS sites specific to a cell type. We used two sets of background sequences for computing the enrichment: the union set of all ChIP-chip hits generated by the ENCODE Transcription Regulation group at the 5% false discovery rate cutoff, and random dinucleotide shuffling of the input sequence set (DNaseI HS sites specific to a cell line). We obtained similar results from both sets of background sequences. Motifs enriched in each cell line were identified for DNaseI HS sites proximal or distal relative to the TSS (Table 2). Many of the overrepresented motifs are functionally relevant to the cell type from which the DNase data was generated. For example, the TAL1 motif [25,26], enriched in CD4+ T specific DNaseI HS sites, binds a well-known transcription activator involved in hematopoietic stem cell function and the development of T cell acute lymphoblastic leukemia. K562 DNaseI HS sites are enriched for the GATA1 motif, which is a factor known to be involved in erythroid maturation. H9 ES cell DNaseI HS sites are enriched for the Octamer, Sox, and STAT family motifs, which have been reported to be involved in pluripotency and early differentiation [30,31]. The AP-1 motif is enriched in HeLa DNaseI HS sites. AP-1 is especially enriched in those HeLa DNaseI HS sites that overlap with p300 ChIP-chip hits (p-value < 2.2 × 10⁻¹⁶ by χ² test; Table S4). The AP-1 components, c-jun and c-fos, are among the many proteins known to interact with p300. Because the AP-1 motif is the most enriched motif in p300 ChIP-chip hits (unpublished data), this suggests that AP-1 contributes to the DNA binding specificity of p300.
We present DNase-chip data from six cell lines and classify the DNaseI HS sites into cell type–specific, common (found in more than one but not all cell lines), and ubiquitous categories. Only 22% of all DNaseI HS sites are ubiquitous in all cell lines, indicating that the majority of gene regulatory elements are involved in cell type–specific function.
The identification of ubiquitous DNaseI HS sites provides clues to the function of housekeeping chromatin structures that are maintained in most cell types. We detected 259 such ubiquitous sites in the ENCODE regions. Approximately 86% of ubiquitous DNaseI HS sites are proximal to TSS and map to basal transcription factor binding sites, indicating that these regions function as housekeeping promoters. The majority of ubiquitous distal DNaseI HS sites bind to CTCF, a protein with known enhancer-blocking insulator activity, indicating that CTCF is involved in stable chromatin structure and gene expression maintenance across many cell types. Because most ubiquitous sites bind to either basal transcription machinery or CTCF, we conclude that ubiquitous DNaseI HS sites function primarily as promoters and insulators, but not enhancers. Cell type–specific DNaseI HS sites, however, are more enriched for protein binding sites with known enhancer activity, cell type–specific histone modifications, and cell type–specific gene expression.
Although our DNaseI HS site data was limited to 1% of the human genome, the integration over multiple data types allowed us to conclude that we have uncovered many different types of functional regulatory elements. In the future, as additional cell types are analyzed using whole genome DNase-chip, we will be able to better characterize DNaseI HS sites that are truly cell type specific, as well as those that are shared between cell types of similar lineages. We expect that genome-wide analysis from additional cell types, under different cellular conditions, or at different developmental stages, will provide for more powerful de novo motif discovery, similar to our identification of CTCF, for identifying and characterizing unknown factors that regulate temporal and spatial gene expression.
Our integrated approach combines the strengths of four high-throughput technologies (DNase-chip, ChIP-chip, expression array, and motif discovery). DNase-chip can identify all types of regulatory elements in a single experiment and integration with other datasets has allowed us to delineate the functions of subsets of DNaseI HS sites. This approach will become increasingly powerful as more high-throughput datasets become available and will be an important part of ensuring that no regulatory element is missed. Nonetheless, our analysis is missing an important component: we cannot identify the target gene(s) of a DNaseI HS site. Technologies such as chromosome conformation capture carbon copy (5C) are ideal for detecting large numbers of long-range interactions between genomic elements. Since 5C works best by anchoring to known regulatory elements, DNaseI HS sites identified in our study can be used to significantly reduce the search space.
DNaseI HS sites can be used as a general tool for evaluating future ChIP-chip datasets that have been performed on only one of the cell types described here, to determine whether those factors bind genomic DNA in a cell type–specific or ubiquitous manner. This is illustrated for the p300 ChIP-chip data in Figure S9, which shows that the percentages of p300 binding (performed in HeLa cells) are highest for HeLa-specific distal DNaseI HS sites. Other examples of cell line-specific marks are H3K4me1 and H3K4me2 (Figure S9). In contrast, H3K4me3, H3ac, H4ac, and CTCF show less cell line specificity (Figure S10). Our DNaseI HS data can also be used to help identify unknown transcription factor binding and unknown histone modification patterns. For example, only 60% of proximal DNaseI HS sites overlap with the five histone modifications we examined in this study. Future studies will be needed to identify the histone modification(s) that are associated with these regions.
While DNaseI HS sites from each cell type cover approximately only 2%–3% of the genome, the combined DNase data from six cell types covers roughly 8% of the genome. Since the actual functional regulatory sequences (i.e., protein binding sites) may make up a fraction of each DNaseI HS site, the actual percentage of functional DNA may be smaller. As we have not detected a significant decrease in the number of new DNaseI HS sites identified with the addition of each cell type, this indicates that a large percentage of the genome may be functional in all possible cell types, disease states, and responses to external stimuli. Whole genome identification of all DNaseI HS sites using DNase-chip or DNase-sequencing methods will play a key role in identifying and ultimately understanding the function of all functional noncoding DNA sequences.
DNase-chip was performed as previously described. Briefly, intact nuclei were digested with optimized amounts of DNase. DNase-digested ends were blunted, ligated to biotinylated linkers, sonicated, and enriched on a streptavidin column. Sheared ends were blunted and ligated to nonbiotinylated linkers. DNase-enriched material was amplified by linker-mediated PCR, labeled, and hybridized to NimbleGen ENCODE arrays. Randomly sheared DNA was used as a reference control. For each cell type, DNase-chip material was generated from three biological replicates and three different DNase concentrations (total of nine hybridizations per cell type). Raw ratio data from each cell type was averaged and significant signals were identified using ACME (p-value = 0.001). All DNase-chip data is publicly available on the UCSC genome browser (http://genome.ucsc.edu).
Human ES cell line H9 (WiCell Research Institute; National Institutes of Health Code WA09) was cultured on a feeder layer of mitotically inactivated mouse embryo fibroblasts in medium consisting of DMEM-F12 supplemented with 20% KSR (Invitrogen, http://www.invitrogen.com), 5 ng/ml FGF2 (R&D Systems, http://www.rndsystems.com/), 2 mM L-glutamine, 0.1 mM 2-mercaptoethanol, and 1× nonessential amino acids. For analysis, hES cell colonies were separated away from the feeder layer and processed for DNaseI hypersensitive site mapping. The undifferentiated state of the cultures was determined by morphology, immunohistochemistry, and Affymetrix expression array (http://www.affymetrix.com) analysis.
Total RNA was extracted from CD4+ T cells, GM06990, HeLa S3, K562, and H9 undifferentiated stem cells using Trizol (Invitrogen). RNA was analyzed by Bioanalyzer to confirm high-quality 18S and 28S ribosomal bands (Agilent, http://www.home.agilent.com/), labeled, and hybridized to Affymetrix U133 Plus 2.0 arrays. The expression data for IMR90 was publicly available (http://licr-renlab.ucsd.edu/download.html). All data was normalized together using RMA through the BioConductor project's Affymetrix package (http://www.bioconductor.org). Only genes expressed in the ENCODE regions were used for the analysis. Gene expression was categorized as expressed or not expressed by the Affymetrix A/P call.
The enhancer blocking assay was performed as described previously. Briefly, the β-globin DNaseI HS2 site, which is a known enhancer element, was cloned upstream of a NeoR gene. Putative insulators were cloned between the enhancer and NeoR gene. The previously described chicken insulator was used as a positive control. DNaseI HS sites proximal or distal to TSS that also overlapped CTCF binding sites were cloned into the enhancer block vector. All plasmids were purified from three independent bacterial cultures, linearized, and each DNA prep was electroporated independently into K562 cells. Each electroporation was plated in triplicate (total of nine experiments per plasmid). The next day, cells were transferred to soft agar media containing G418. After 16 days, plates were scanned and colonies were counted.
Publicly available ChIP-chip data for CTCF, p300, RNA Pol II, and TRAP220 were obtained from http://licr-renlab.ucsd.edu/download.html. ChIP-chip data for five histone modifications were obtained from the UCSC genome browser (http://genome.ucsc.edu). The coordinates for TSS were obtained from the ENCODE pilot study (TSS set ABCDE defined in supplement 3.5 therein). These data were mapped onto the DNaseI HS sites in each cell line based on their overlapping coordinates. To determine whether distal DNaseI HS sites are still statistically near genes, we binned all DNaseI HS sites according to their distances to the closest TSS. We computed an enrichment score for each bin, defined as the ratio between the number of DNaseI HS sites and the number of all possible positions in the ENCODE regions in the same distance bin.
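The per-bin enrichment score can be sketched as follows. The version shown normalizes both counts to fractions before taking the ratio, matching the description given in the legend of the distance-bin supplementary figure; the bin edges, the open-ended last bin, and the function names are assumptions made for illustration.

```python
from bisect import bisect_right

def bin_index(distance, bin_edges):
    """Index of the bin containing distance; bin k starts at bin_edges[k]."""
    return bisect_right(bin_edges, distance) - 1

def bin_enrichment(site_distances, background_distances, bin_edges):
    """Fraction of sites per bin divided by fraction of genomic positions per bin.

    bin_edges must be sorted and start at 0; the last bin is open-ended.
    Bins with no background positions are reported as None.
    """
    if not site_distances or not background_distances:
        raise ValueError("both distance lists must be non-empty")
    site_counts = [0] * len(bin_edges)
    background_counts = [0] * len(bin_edges)
    for d in site_distances:
        site_counts[bin_index(d, bin_edges)] += 1
    for d in background_distances:
        background_counts[bin_index(d, bin_edges)] += 1
    scores = []
    for s, b in zip(site_counts, background_counts):
        if b == 0:
            scores.append(None)
        else:
            scores.append((s / len(site_distances)) / (b / len(background_distances)))
    return scores

# Toy distances (bp) to the nearest TSS; edges give bins 0-2 kb, 2-10 kb, >10 kb.
edges = [0, 2000, 10000]
site_d = [100, 500, 3000, 20000]
background_d = [100, 3000, 4000, 20000, 30000, 40000, 50000, 60000]
scores = bin_enrichment(site_d, background_d, edges)
```

A score above 1 means sites are overrepresented at that distance relative to what a uniform placement over the regions would give; in the toy data the closest bin is enriched and the farthest is depleted.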
MEME with the “zoops” model option (zero or one motif occurrence per sequence) was used to identify the CTCF sites in the ubiquitous distal DNaseI HS sites. Clover was used to identify motifs overrepresented in cell type–specific DNaseI HS sites at each distance category: proximal (<2 kb), distal (between 2 kb and 10 kb) and far distal (>10 kb). Two background sets were used (union of ChIP-chip hits and random dinucleotide shuffling of input sequences). Overrepresented motifs (p-values < 0.01) from the TRANSFAC database were reported. Overlapping motifs were reported as groups.
Each correlation plot corresponds to raw ratio values from each DNase-chip replicate (x-axis) compared to the averaged raw ratio values of all DNase-chip replicates from each cell type (y-axis). The Pearson correlation coefficients (R) are shown at the bottom of each plot. Correlation coefficients are lower for CD4+ T cells and the GM06990 cell lines, but this data was previously shown by extensive quantitative PCR validation to have high sensitivity (88%) and specificity (97%).
(619 KB TIF)
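The replicate-versus-average comparison in the caption above can be sketched as follows. This is a plain-Python illustration of the Pearson correlation coefficient, not the code used to produce the figure; the function names and the toy ratio values are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def replicate_vs_average(replicates):
    """Correlate each replicate with the probe-wise average of all replicates.

    `replicates` is a list of equal-length lists of raw ratio values,
    one list per replicate and one value per array probe.
    """
    average = [sum(values) / len(values) for values in zip(*replicates)]
    return [pearson_r(rep, average) for rep in replicates]


# Toy ratio values (hypothetical, for illustration only).
toy_replicates = [[1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.3, 3.8], [0.8, 2.2, 2.9, 4.1]]
toy_r_values = replicate_vs_average(toy_replicates)
```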
Six cell lines were clustered using DNaseI HS site data in the same way as described in Figure 2A. For expression data, the Euclidean distances were calculated from the Robust Multichip Average (RMA)-normalized expression levels between each pair of cell lines for genes that map within ENCODE regions. Other algorithms also cluster CD4+ T cells and GM06990 cells first, suggesting that this grouping is robust and biological (unpublished data). To test whether this was a result of analyzing genes within 1% of the genome, we performed the same clustering using genome-wide expression data and found that CD4+ T cells and GM06990 cells remain the closest-clustering cell lines, while K562 remains more distantly clustered (unpublished data).
(635 KB TIF)
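The distance calculation underlying this clustering can be illustrated with a small Python sketch. It only computes pairwise Euclidean distances and picks the closest pair, which corresponds to the first merge of an agglomerative clustering; it does not reproduce the full clustering procedure. The sample labels and expression values are made up for the example and are not data from this study.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def pairwise_euclidean(profiles):
    """Euclidean distance between every pair of expression profiles.

    `profiles` maps a sample name to a list of normalized expression
    levels, with genes in the same order for every sample.
    """
    return {(a, b): dist(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)}


def closest_pair(profiles):
    """The pair of samples with the smallest distance (first to be merged)."""
    distances = pairwise_euclidean(profiles)
    return min(distances, key=distances.get)


# Hypothetical samples and values, for illustration only.
toy_profiles = {"A": [0.0, 0.0], "B": [3.0, 4.0], "C": [0.0, 1.0]}
toy_distances = pairwise_euclidean(toy_profiles)
toy_first_merge = closest_pair(toy_profiles)
```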
For each cell line, DNaseI HS sites were assigned to distance bins based on their distance to the nearest transcription start site (TSS). Similarly, each genomic position in the ENCODE regions was assigned to a bin based on its distance to the nearest TSS. To calculate the enrichment of DNaseI HS sites, the percentage of DNaseI HS sites in each distance bin was normalized by the percentage of positions in the same distance bin. DNaseI HS sites were enriched near TSS and were less likely to be found at large distances from a TSS.
(279 KB TIF)
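The enrichment calculation described in the caption above can be written out as a short Python sketch. The bin edges, function names, and toy distances are assumptions chosen for illustration; the actual bins used for the figure are not stated here.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(distance, edges):
    """Index of the distance bin; bin i covers [edges[i-1], edges[i])."""
    return bisect_right(edges, distance)


def enrichment_by_bin(site_distances, position_distances, edges):
    """Fraction of HS sites in each bin divided by fraction of positions in it.

    `site_distances` holds the distance to the nearest TSS for each HS site;
    `position_distances` holds the same quantity for every genomic position
    considered. Bins that contain no positions are skipped.
    """
    site_counts = Counter(bin_index(d, edges) for d in site_distances)
    pos_counts = Counter(bin_index(d, edges) for d in position_distances)
    n_sites = len(site_distances)
    n_positions = len(position_distances)

    scores = {}
    for b in range(len(edges) + 1):
        if pos_counts[b] == 0:
            continue  # enrichment is undefined without background positions
        site_fraction = site_counts[b] / n_sites
        position_fraction = pos_counts[b] / n_positions
        scores[b] = site_fraction / position_fraction
    return scores


# Hypothetical bin edges (bp) and toy distances, for illustration only.
toy_edges = [2000, 10000]
toy_scores = enrichment_by_bin([100, 500, 1500, 5000],
                               [100, 3000, 4000, 20000],
                               toy_edges)
```

A score above 1 means that a bin holds a larger share of DNaseI HS sites than expected from its share of genomic positions, and a score below 1 means the opposite.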
(272 KB TIF)
Arrows represent ubiquitous DNaseI HS sites.
(776 KB TIF)
The y-axis is the percentage of base pairs with phastCons scores in different ranges. PhastCons takes a multiple sequence alignment, a phylogenetic model for conserved regions, and a phylogenetic model for nonconserved regions as input. It scans along the alignment for regions that fit the conserved model better than the nonconserved model and outputs the probability that each base is in such a region as the conservation score for that base. The categories are: “Distal non-overlap,” CTCF motifs located in distal DNaseI HS sites (greater than 2 kb from any TSS) that do not overlap with CTCF ChIP-chip hit regions; “Distal overlap,” CTCF motifs located in distal DNaseI HS sites that overlap with CTCF hit regions; “Proximal non-overlap,” CTCF motifs located in proximal DNaseI HS sites (less than 2 kb from a TSS) that do not overlap with CTCF hit regions; and “Proximal overlap,” CTCF motifs located in proximal DNaseI HS sites that overlap with CTCF hit regions. For each category, the genomic regions 100 bp downstream from the motif coordinates were used as the control.
(1.0 MB TIF)
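The quantity on the y-axis, the percentage of bases falling in each score range, can be sketched in Python as follows. The score ranges used in the figure are not given in this caption, so the five equal-width ranges below are an assumption, as are the function name and the toy scores.

```python
from bisect import bisect_right


def percent_per_score_range(scores, edges):
    """Percentage of bases whose conservation score falls in each range.

    `scores` are per-base probabilities between 0 and 1. `edges` are the
    ascending boundaries between ranges; range i covers
    [edges[i-1], edges[i]), and the last range includes 1.0.
    """
    counts = [0] * (len(edges) + 1)
    for score in scores:
        counts[bisect_right(edges, score)] += 1
    return [100 * count / len(scores) for count in counts]


# Assumed range boundaries and toy per-base scores, for illustration only.
toy_range_edges = [0.2, 0.4, 0.6, 0.8]
toy_percentages = percent_per_score_range([0.0, 0.1, 0.5, 0.9, 1.0],
                                          toy_range_edges)
```

Computing the same percentages for the motif bases and for the control regions downstream of each motif would give the two sets of bars being compared in each category.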
Shown here is an example using H3K4me2 data (the final matrix is plotted in Figure 6B).
(667 KB TIF)
Each plot shows the enrichment of the overlap between proximal or distal DNaseI HS sites and each histone modification hit from three different cell types. See main text for the enrichment of H3K4me2 and the description of the enrichment calculation.
(3.5 MB TIF)
ChIP-chip data from HeLa cells are compared to proximal and distal DNaseI HS sites that are unique to HeLa cells, common, or ubiquitous to all six cell types.
(2.7 MB TIF)
ChIP-chip data from HeLa and IMR90 cells are compared to proximal and distal DNaseI HS sites that are unique to HeLa or IMR90 cells, common, or ubiquitous to all six cell types.
(3.2 MB TIF)
We thank Ross Hardison and Laura Elnitski for current β-globin LCR DNaseI HS site annotations and Tae Hoon Kim for sharing CTCF ChIP-chip data. In addition, we thank Holly Dressman and the Duke Microarray Core facility for expression analyses. We thank Soohyun Lee and Ulas Karaoz at Weng's lab for help with obtaining datasets.
Author contributions. HX, ZW, and GEC conceived and designed the experiments. TV, JGC, PJT, and GEC performed the experiments. HX, HPS, JML, YF, TSF, ZW, and GEC analyzed the data. DMB, RDGM, and BR contributed reagents/materials/analysis tools. HX, JML, YF, ZW, and GEC wrote the paper.
Funding. This project was partly funded by National Human Genome Research Institute intramural funds to DMB and by National Institutes of Health grant HG03110 to ZW and grant HG003169 to GEC.
Competing interests. The authors have declared that no competing interests exist.