Insulator elements affect gene expression by preventing the spread of heterochromatin and restricting transcriptional enhancers from activating unrelated promoters. In vertebrates, insulator function requires association with the CCCTC-binding factor (CTCF), a protein that recognizes long and diverse nucleotide sequences. While insulators are critical in gene regulation, only a few have been reported. Here, we describe 13,804 CTCF binding sites in potential insulators of the human genome, discovered experimentally in primary human fibroblasts. Most of these sequences are located far from transcriptional start sites, yet their distribution is strongly correlated with that of genes. The majority of them fit a highly conserved consensus motif that is suitable for predicting possible CTCF-driven insulators in other vertebrate genomes. In addition, CTCF localization is largely invariant across different cell types. Our results provide a resource for investigating insulator function and possible other general and evolutionarily conserved activities of CTCF sites.
CTCF plays a critical role in transcriptional regulation in vertebrates (for reviews, see Ohlsson et al., 2001; Klenova et al., 2002; Dunn and Davie, 2003). It was first identified by its ability to bind to a number of dissimilar regulatory sequences in the promoter-proximal regions of the chicken, mouse, and human MYC oncogenes (Filippova et al., 1996; Lobanenkov et al., 1990). CTCF is a ubiquitously expressed nuclear protein with an 11-zinc-finger (ZF) DNA-binding domain (Filippova et al., 1996; Klenova et al., 1993). It is essential (Fedoriw et al., 2004) and highly conserved from Drosophila to mice and man (Moon et al., 2005). Point mutations at distinct DNA-recognition amino acid positions in ZF3 and ZF7 of CTCF have been identified in a variety of cancers selected for LOH at 16q22, where CTCF maps, suggesting a role as a candidate tumor-suppressor gene (Filippova et al., 1998; Filippova et al., 2002).
Initial biochemical analyses revealed that CTCF contains two transcription repressor domains and can act as a transcriptional repressor (Baniahmad et al., 1990; Burcin et al., 1997; Klenova et al., 1993; Lobanenkov et al., 1990). However, others have found that it can also function as a transcriptional activator in a different sequence context (Vostrov and Quitschke, 1997). Subsequent studies identified CTCF as the vertebrate insulator protein (Bell et al., 1999). So far, CTCF remains the only major protein implicated in the establishment of insulators in vertebrates (Felsenfeld et al., 2004), including those involved in the regulation of gene imprinting and mono-allelic gene expression (Fedoriw et al., 2004; Ling et al., 2006), as well as in X-chromosome inactivation and in the escape from X-linked inactivation (Filippova et al., 2005; Lee, 2003).
There has been great interest in identifying where potential insulators are located in the eukaryotic genome, because knowledge of these elements can help explain how cis-regulatory elements coordinate expression of their target genes. Transcription of every eukaryotic gene begins with the assembly of an RNA polymerase preinitiation complex (PIC) at the promoter (Kadonaga, 2004), a process that is regulated by sequence-specific transcription factors and cis-regulatory elements. Genetic studies in Drosophila first identified the importance of insulators in ensuring proper enhancer/promoter interactions (Udvardy et al., 1985). More recent studies have implicated insulators in the establishment of euchromatin/heterochromatin boundaries in vertebrates (Felsenfeld et al., 2004; Gerasimova and Corces, 2001; Jeong and Pfeifer, 2004). In addition, it has been demonstrated that an insulator in the IGF2/H19 locus is critical for the imprinting of the locus (Bell and Felsenfeld, 2000; Hark et al., 2000; Kanduri et al., 2000).
The mechanism of insulator function remains unclear. One model proposes that insulators, by forming special chromatin structures, compete for enhancer-bound activators, preventing the activation of downstream promoters (Bulger and Groudine, 1999). Alternatively, insulators may facilitate the formation of loops, for example, via attachment of chromosomal regions to the nuclear membrane (Yusufzai et al., 2004), keeping the intermediate regions exposed for only local interactions between enhancers and promoters. Consistent with this model, it was recently shown that CTCF could mediate long-range chromosomal interactions in mammalian cells, providing a possible mechanism by which insulators establish regulatory domains (Kurukuti et al., 2006; Ling et al., 2006; Yusufzai et al., 2004). The extent to which each mechanism plays a role in shaping genome expression remains unresolved. Knowledge of insulators in the genome would provide a much-needed framework for understanding genome organization and function.
The effort to computationally identify potential insulators in the human genome has been hampered by an incomplete understanding of the DNA recognition sequence of CTCF. Biochemical assays have indicated that the 11-zinc-finger protein can use different combinations of its zinc-finger domains to bind different DNA target sequences (Filippova et al., 1996; Ohlsson et al., 2001). Thus, the CTCF binding sites identified from in vitro protein/DNA interaction assays and a limited number of known insulators exhibit extensive sequence variation and too little specificity for genome-wide prediction of CTCF binding (Ohlsson et al., 2001). Recently, an attempt has been made to systematically isolate insulators in the mouse genome through chromatin immunoprecipitation followed by cloning and sequencing (Mukhopadhyay et al., 2004). Unfortunately, owing to the limited scale of the sequencing effort, only about 200 DNA fragments with enhancer-blocking activity, each driven by various CTCF binding sites, were identified, and no consensus CTCF binding motif has so far been reported from this study.
As a first step towards understanding how insulators contribute to gene expression in human cells, we have located the sites of CTCF binding in the human genome using chromatin immunoprecipitation followed by detection with genome-tiling microarrays (Kim et al., 2005b; Kim and Ren, 2006). Our analyses have generated a high-resolution genomic map of CTCF binding, with on average 2.5 genes bounded by a pair of CTCF binding sites. We also identify a clear consensus CTCF binding motif shared by a majority of the experimentally determined in vivo CTCF binding sites. We show that the CTCF binding sequences in the human genome are highly conserved in other vertebrates, consistent with the widespread and fundamental role of CTCF in cellular function. In addition, we demonstrate that CTCF binding to DNA is largely invariant from cell type to cell type, with a subset of sites interacting with the protein in a cell type dependent manner. Our results offer a general resource for understanding the role of CTCF in insulator function, gene regulation, and genome organization in human cells.
Previously, we developed an improved genome-wide location analysis strategy to identify transcription factor binding sites throughout the genome in human cells (Kim et al., 2005b). This method, also known as ChIP-chip, involves the immunoprecipitation of transcription factor bound DNA from formaldehyde-crosslinked cells, followed by detection with genome-tiling arrays. To identify CTCF binding sites in the human genome, we performed the same analysis with monoclonal antibodies against CTCF and chromatin extract from primary human fibroblasts (IMR90 cells). The CTCF-bound DNA was identified using a series of 38 arrays containing a total of 14.6 million 50-mer oligonucleotides, evenly positioned every 100 basepairs (bp) along the non-repeat sequence of the human genome. By applying a simple statistical filter that requires the signals from four consecutive probes to be above a threshold (2.5 times the standard deviation of the average log ratios), we identified an initial list of 15,221 genomic regions bound by CTCF (Figure 1a,b). To verify the binding of CTCF to these putative CTCF binding sequences, we designed a new oligonucleotide microarray representing these regions and the surrounding sequences at 100 bp resolution. Using this array, we performed ChIP-chip analysis against CTCF with an independent chromatin sample of IMR90 cells and confirmed its binding to 13,804 regions.
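The consecutive-probe filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name `call_bound_regions`, the choice of the array-wide mean plus 2.5 standard deviations as the cutoff, and the toy data are all assumptions made for the example.

```python
from statistics import mean, stdev

def call_bound_regions(log_ratios, n_sd=2.5, min_run=4):
    """Return (start, end) probe-index pairs (end exclusive) for every run of
    at least `min_run` consecutive probes whose log ratio exceeds the cutoff."""
    # Assumed cutoff: mean + n_sd * SD of all log ratios on the array.
    cutoff = mean(log_ratios) + n_sd * stdev(log_ratios)
    regions, run_start = [], None
    for i, value in enumerate(log_ratios):
        if value > cutoff:
            if run_start is None:
                run_start = i          # a new run of high probes begins
        else:
            if run_start is not None and i - run_start >= min_run:
                regions.append((run_start, i))
            run_start = None           # run ended (or never started)
    # Close a run that extends to the last probe.
    if run_start is not None and len(log_ratios) - run_start >= min_run:
        regions.append((run_start, len(log_ratios)))
    return regions

# Toy data: background of zeros, one run of four high probes, one run of three.
toy = [0.0] * 100 + [5.0] * 4 + [0.0] * 50 + [5.0] * 3 + [0.0] * 50
regions = call_bound_regions(toy)
```

With probes spaced every 100 bp, a run of four probes corresponds to a region of roughly 300 to 400 bp; the run of three probes in the toy data is deliberately too short to be called.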
To assess the accuracy of these in vivo CTCF binding sites, we first randomly selected 84 of them (Supplemental Table 1) and performed conventional ChIP assays. This analysis validated the binding of CTCF to 80 (95%) of the tested sites (Supplemental Figure 2a), suggesting a high degree of specificity of our method.
Next, we examined CTCF binding on 60 previously characterized CTCF binding sites and insulators in the human genome, and found that 32 (~53%) were detected by our analysis (Supplemental Table 2). To determine whether the failure to detect CTCF binding at the remaining 28 sites was due to a moderate sensitivity of our method, we performed conventional ChIP assays and detected binding of CTCF to four of these sites (Supplemental Figure 2b, Supplemental Table 3). Since these four sites are bound in IMR90 cells but were missed by the arrays, they count as false negatives of our method, whose sensitivity was therefore estimated to be about 88% (32 out of 36).
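The validation arithmetic in the two paragraphs above can be reproduced directly. The sketch below only restates the counts given in the text; the variable names are illustrative.

```python
# Conventional ChIP confirmation of randomly selected array calls.
validated_fraction = 80 / 84

# Known sites bound in IMR90 cells: 32 detected by the arrays, 4 missed.
detected_known = 32
missed_but_bound = 4
sensitivity = detected_known / (detected_known + missed_but_bound)

print(f"validated: {validated_fraction:.3f}, sensitivity: {sensitivity:.3f}")
```

The 24 remaining known sites, for which conventional ChIP also showed no binding, are excluded from the denominator because they do not appear to be occupied in this cell type.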
Third, we examined a multiple-species sequence alignment score (phastCons) for each CTCF binding site (Siepel et al., 2005) to determine its sequence conservation. A significant fraction (55%, P < 2.2×10^−16) of the CTCF binding sites are conserved in vertebrates with a phastCons score of 0.8 or higher (Supplemental Figure 2c), suggesting that most CTCF binding sites identified in our analysis are likely functional.
To characterize how the CTCF binding sites are distributed along the human genome, we compared their localization to a total of 20,181 well-annotated human genes (Kent et al., 2002). We performed correlation analysis of CTCF binding sites with the number of genes or transcripts found on the chromosomes, or with the total nucleotide length of each chromosome (Figure 1c, d, Supplemental Figure 3a). As a control, we examined two enhancer binding proteins whose genomic binding sites were recently determined in human cells: estrogen receptor (ER) (Carroll et al., 2006) and p53 (Wei et al., 2006) (Supplemental Table 4). The results showed that CTCF binding correlates strongly with the number of genes on each chromosome (r² = 0.85), and the degree of correlation is much higher than for either ER or p53. In contrast, CTCF binding only weakly correlates with chromosomal length (r² = 0.42), and the degree of correlation is much lower than that of the two transcription activator proteins (Carroll et al., 2006) (Figure 1c, d). Based on this analysis, we conclude that the distribution of CTCF binding sites along the genome is closely correlated with genes, and distinct from that of other known sequence-specific transcription factors.
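A per-chromosome comparison of this kind reduces to a squared Pearson correlation between two lists of counts. The following is a minimal sketch; the function name and the per-chromosome numbers are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-chromosome counts, for illustration only.
genes_per_chrom = [2000, 1200, 1100, 750, 900]
sites_per_chrom = [1400, 800, 790, 500, 640]
r_squared = pearson_r(genes_per_chrom, sites_per_chrom) ** 2
```

Replacing `genes_per_chrom` with chromosome lengths gives the second comparison described in the text.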
An independent analysis of CTCF localization along each chromosome also confirmed a strong correlation between CTCF binding and gene density. We segmented each chromosome with a sliding 2 Mbp window and calculated the correlation between numbers of CTCF binding sites and genes within each window. In general, the CTCF binding sites correlate strongly with genes, with a correlation coefficient of 0.786. In contrast, the average correlation coefficient between randomly generated genomic sites and genes is only 0.32 (Figure 2a). The degree of correlation between the CTCF binding sites and genes is similar to that between the TAF1 binding sites, mapped previously in the same cells, and genes (correlation coefficient of 0.792). This analysis indicates that CTCF binding is highly restricted to genes, displaying the same property as a general transcription factor. This property of CTCF distribution is consistent with its role at insulators, and suggests a widespread function of CTCF in the genome.
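The windowed comparison can be sketched as follows. This is an illustration under stated assumptions: the function name `window_counts` is hypothetical, windows are half-open intervals, and the step size is left as a parameter because the text does not specify how far the 2 Mbp window slides between positions.

```python
from bisect import bisect_left

def window_counts(positions, chrom_length, window=2_000_000, step=2_000_000):
    """Count how many positions fall in each [start, start + window) interval
    along a chromosome of the given length."""
    ordered = sorted(positions)
    counts = []
    for start in range(0, chrom_length, step):
        end = start + window
        # Binary search gives the number of positions below each boundary.
        counts.append(bisect_left(ordered, end) - bisect_left(ordered, start))
    return counts

# Toy coordinates on a 6 Mbp chromosome, for illustration only.
toy_sites = [100, 1_999_999, 2_000_000, 5_500_000]
tiled = window_counts(toy_sites, 6_000_000)
sliding = window_counts(toy_sites, 6_000_000, step=1_000_000)
```

Running the same function on CTCF site positions and on gene positions yields two count vectors per chromosome, which can then be passed to any Pearson correlation routine.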
While the distribution of CTCF binding sites resembles that of a general transcription factor such as TAF1, there are important differences between the two. The majority of TAF1 binding sites (89%) are within close proximity to the known 5′ ends of transcripts; in contrast, CTCF binding sites are generally very far from promoters, with an average distance of 48,000 bp (Figure 2b). Nearly half (46%) of the CTCF binding sites are located in intergenic regions, consistent with their potential role as insulators. Only about 20% of CTCF sites are near transcription start sites. Unexpectedly, a significant number of CTCF binding sites fall within genes, with 22% in introns and 12% in exons (Figure 2c). There is no marked enrichment of CTCF binding sites near polyadenylation sites (Supplemental Figure 3b). The binding of CTCF near promoters is to a large extent negatively correlated with gene activity, as most of these promoters (72%) are not occupied by the general transcription factor TAF1. This observation is consistent with the possibility that CTCF might function as a repressor at these promoters. The significance of CTCF binding within introns and exons is not clear, but it might be related to its insulator function of blocking enhancers and silencers located near these sequences. Taken together, these results demonstrate that CTCF binding sites are ubiquitous throughout the genome and display a unique distribution that is distinct from that of enhancers and promoters.
While CTCF binding sites are generally correlated with genes along the entire length of chromosomes, there are isolated regions that deviate from this trend (Figure 2a). Two notable types of loci can be defined: one type of loci is characterized by a relative depletion of CTCF binding sites and the other by an enrichment of CTCF binding sites. We can define CTCF depleted loci as those 2 Mbp windows that exhibit a lower than average density of CTCF binding sites (less than 2 per 2 Mbp, P < 0.05 for most chromosomes, Supplemental Table 5). Likewise, we can define CTCF enriched loci as those 2 Mbp windows that exhibit higher than average CTCF sites density (P < 0.001, Supplemental Table 6). We observe that the CTCF depleted domains tend to include clusters of related gene families and genes that are transcriptionally co-regulated, while CTCF enriched domains often have multiple alternative promoters (81% contain 2 or more alternative promoters). Both cases are consistent with the assumption of CTCF binding sites acting as insulators.
We have characterized these two types of regions further by considering only genes with multiple CTCF binding sites or clusters of genes with no CTCF binding sites. We have defined 13,766 genomic regions that are flanked by a pair of consecutive CTCF binding sites along the genome and named them CTCF-pair-defined domains (CPDs). About 43% (5969) of CPDs contain at least one gene locus in its entirety, while the remaining CPDs do not contain a complete gene. About 74% of all genes in the genome are surrounded in their entirety by CTCF binding sites. The remaining genes are either telomeric to a CTCF binding site (2.6% of genes) or contain internal CTCF binding sites (23% of genes). On average, about 2.5 genes are found in a CPD. The average size of a CPD is 212,090 bp. A significant number of them (189 CPDs, P < 0.001) contain 9 or more genes, with the largest one containing as many as 56 genes (P = 3.42×10^−56). Table 1a lists all CPDs with 15 or more genes (P = 2.2×10^−8). These CPDs often correspond to large clusters of related genes (Sproul et al., 2005), such as the Olfactory Receptor (OR) gene clusters (Figure 2d), ZNF gene clusters, KRTAP gene clusters (Supplemental Figure 4a), and the type I interferon (IFN) gene cluster.
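The CPD definition lends itself to a short sketch: each pair of consecutive CTCF sites on a chromosome bounds one domain, and a gene is assigned to a domain only if it lies entirely between the two sites. The function name and the coordinates below are hypothetical.

```python
def genes_per_cpd(ctcf_sites, genes):
    """For each pair of consecutive CTCF site positions on one chromosome,
    count the genes (start, end) that lie entirely between the two sites."""
    sites = sorted(ctcf_sites)
    counts = []
    for left, right in zip(sites, sites[1:]):
        contained = sum(1 for start, end in genes if left < start and end < right)
        counts.append(contained)
    return counts

# Toy coordinates, for illustration only.
toy_sites = [100, 500, 900]
toy_genes = [(150, 200), (250, 450), (480, 600), (650, 800)]
per_domain = genes_per_cpd(toy_sites, toy_genes)
```

In the toy data the gene at (480, 600) spans the site at position 500, so it is counted in neither domain; it corresponds to the class of genes with internal CTCF binding sites. This quadratic scan is adequate for illustration, while a genome-scale version would sort the genes and sweep once.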
In contrast to depletion of CTCF binding sites within clusters of related genes, there is a significant concentration of CTCF binding sites at genes that display extensive alternative promoter usage. Forty nine genes contain significantly more CTCF binding sites (8 or more, P=0.0018, Table 1b) than expected by chance, including such genes as the protocadherin gamma (PCDHG), T cell receptor alpha/delta, beta, gamma loci (TCRα/δ TCRβ TCRγ), and immunoglobulin heavy chain locus (IgH), and light chain kappa and lambda loci (IgLκ IgLλ) (Supplemental Figure 4b). These genes all contain a large number of alternative promoters, most of which are separated from each other by CTCF binding sites (Figure 2e).
In conclusion, CTCF binding sites are distributed along the genome in a non-random fashion that is different from the general transcription factor and sequence specific activators previously characterized. In one aspect, the CTCF binding sites distribution is similar to a general transcription factor in that they both closely track the gene distribution on each chromosome. In comparison, the distribution of previously characterized sequence specific activators is less strongly correlated with the gene density but more significantly with chromosome length. However, unlike general transcription factors, which usually associate with the transcription start sites, the majority of CTCF sites are located remotely from the promoters. Such unique property of CTCF localization is consistent with its putative role as an insulator binding protein.
Previous studies have implicated divergent and variable modes of binding by CTCF and suggested that CTCF recognizes diverse sequences (Ohlsson et al., 2001). Identification of a large number of in vivo CTCF binding sites provides a unique opportunity to better define the in vivo recognition sequence for this DNA binding protein. Using the discriminating matrix enumerator (DME) algorithm (Smith et al., 2005b), we have identified a motif that best distinguishes the CTCF binding sites from their adjacent, control sequences (Figure 3a). This 20-basepair motif is similar to one particular form of CTCF binding consensus (Bell and Felsenfeld, 2000), but refines it significantly in six nucleotide positions (positions 7, 8, 9, 10, 13, and 17, Figure 3a). This motif is present in over 75% of the experimentally identified CTCF binding sites, but in less than 17% of the control, surrounding sequences. It is usually located in the middle of the experimentally identified CTCF binding fragments, as would be expected if they serve as the point of contact by the protein in vivo (Figure 3b).
To test whether this motif is indeed the CTCF recognition sequence, we performed electrophoretic mobility shift analysis (EMSA) with 12 randomly selected CTCF binding sites determined above. For each binding site, we designed an 80-mer EMSA probe with the recognizable 20-mer CTCF motif in the middle (Supplemental Table 7). We also designed a control probe by randomly shuffling the 20-mer CTCF motif within each test sequence. Eleven of the 12 probes were confirmed to interact specifically with a recombinant CTCF protein in this assay, while the shuffled probes did not (Figure 3c), indicating that CTCF indeed recognizes the newly identified motif. The one probe that failed to interact with CTCF protein may represent an inferior-scoring motif that is more centrally located but may not correspond to the true in vivo CTCF binding site.
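The shuffled control design can be illustrated as follows. This is a sketch under assumptions: the function name, the fixed random seed, and the placeholder sequence are hypothetical, and the 20-mer used here is an arbitrary string rather than the CTCF motif.

```python
import random

def shuffled_control(probe, motif_start, motif_len=20, seed=0):
    """Return a copy of `probe` in which the motif_len bases starting at
    motif_start are randomly permuted and the flanks are left untouched."""
    rng = random.Random(seed)  # seeded so the control is reproducible
    core = list(probe[motif_start:motif_start + motif_len])
    rng.shuffle(core)
    return probe[:motif_start] + "".join(core) + probe[motif_start + motif_len:]

# Placeholder 80-mer: 30-base flanks around an arbitrary 20-mer.
toy_probe = "A" * 30 + "ACGTTGCAGGCTAACCGTGA" + "T" * 30
control = shuffled_control(toy_probe, motif_start=30)
```

Shuffling preserves the base composition of the core, so a loss of binding to the control probe points to the order of the bases rather than to their overall content.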
From these results, we conclude that under our experimental conditions CTCF binding in vivo appears to be mediated by a class of similar sequences that is well described by a consensus motif. However, a sizeable population of in vivo CTCF binding sites lacks this motif. Additional analysis has failed to identify any significantly overrepresented motifs within these regions. To test whether these sequences bind directly to CTCF in vitro, we generated consecutive, overlapping DNA fragments representing two randomly selected CTCF binding sites without the motif (Supplemental Table 8), and performed EMSA. Our results confirm that CTCF can indeed bind to both sequences in vitro (Supplemental Figure 5a, b). Therefore, a fraction of the in vivo CTCF binding sites might have a distinct binding mode and interact with this protein at different sequences. Additional experiments are required to resolve the binding sequence of CTCF at these sites.
The CTCF protein displays unusually high conservation, with over 95% amino acid sequence identity within its DNA binding domains among all vertebrate homologs. Moreover, the few amino acid substitutions within the CTCF DNA binding domain do not map to any residues predicted to make direct contacts with the DNA (Pabo et al., 2001). This high degree of sequence conservation supports an evolutionarily conserved function for CTCF, and predicts that the CTCF binding sites should also be conserved in other vertebrate genomes. Consistent with this prediction, the 20-mer motif sequence within each in vivo CTCF binding site is highly conserved evolutionarily compared to randomly shuffled motifs (Supplemental Figure 6).
Furthermore, we have also searched the entire human genome for occurrences of the CTCF motif, extracted their aligned sequences in other vertebrate genomes where sequence information is available, and asked whether a high-scoring CTCF motif is also present in the corresponding homologous sequences. To increase the specificity of computational prediction of CTCF binding sites, we have restricted the bases at positions 6, 11, 14, and 16 to the nucleotide that is predominantly present within the experimentally defined CTCF binding sites (see Experimental Procedures for details). A total of 31,905 potential CTCF binding sites are identified in the human genome using this method. Of these sites, 19,271 can be aligned to the mouse genome, and 6,553 contained the CTCF consensus motif as defined above. In contrast, a similar search in the genome with a random matrix of the same length and base composition identifies an average of only 149 conserved occurrences, suggesting that the CTCF binding sequences are highly conserved (P = 1.27×10^−8, Figure 4a). In addition to the mouse genome, we have also examined the conservation of the predicted human CTCF binding sequences in other vertebrate genomes, finding 8,082 (P = 1.19×10^−5), 8,154 (P = 3.84×10^−6), 6,362 (P = 1.02×10^−8), 263 (P = 5.09×10^−5) and 204 (P = 5.48×10^−5) to be significantly conserved in dog, cow, rat, chicken and zebrafish genomes, respectively (Figure 4a). In total, 12,799 (out of 31,905) computationally predicted CTCF binding sites in the human genome are conserved in at least one other vertebrate genome (excluding the chimp genome, Figure 4b). We define these highly conserved CTCF recognition sequences as potential CTCF binding sites.
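A matrix scan with fixed bases at selected positions can be sketched as below. The scoring matrix, the threshold, and the required base are toy values chosen for illustration (a 4-position matrix rather than the 20-position CTCF matrix, which is not reproduced here), and the function name `scan` is hypothetical.

```python
def scan(sequence, pwm, required, threshold):
    """Slide a position weight matrix along `sequence` (A/C/G/T only).

    pwm:       list of {base: score} dicts, one per motif position
    required:  {0-based position: base} that a window must match exactly
    Returns (offset, score) for every window that meets both criteria.
    """
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        # Hard constraint: restricted positions must carry the dominant base.
        if any(window[pos] != base for pos, base in required.items()):
            continue
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((i, score))
    return hits

# Toy 4-position matrix with integer scores, for illustration only.
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": 0, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
]
hits = scan("TTACGTAAGGTT", toy_pwm, required={2: "G"}, threshold=5)
```

Running the same scan on the aligned sequence from another genome and keeping only sites that score above threshold in both is one way to express the conservation test described in the text.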
The conserved CTCF recognition sequences in the human genome imply that the corresponding motifs in other species may also function as CTCF binding sites. To test this prediction, we have performed EMSA with two predicted CTCF binding sites in the chicken genome (Supplemental Table 9). The results confirm the binding of CTCF to both CTCF sites in vitro (Figure 4c).
To evaluate the variability of CTCF binding in a different cell type, we have performed ChIP-chip analysis to identify CTCF binding sites in the hematopoietic progenitor cell line U937. We have focused our analysis on a set of 44 genomic regions representing a 1% sampling of the human genome, known as the ENCODE regions (Consortium, 2004; Kim et al., 2005a) (ENCODE arrays). These regions have been semi-randomly selected by the ENCODE consortium as a common platform for genomic research. We have used the previously described genome tiling arrays for this experiment (Kim et al., 2005a). These arrays contain PCR products as probes instead of oligonucleotides. We have detected 232 sites in U937 cells at the confidence level of P < 0.000001 (Figure 5a, b), which overlap 151 of 225 (67%) CTCF sites detected within the same regions in IMR90 cells (Figure 5b). Less restrictive criteria result in a larger degree of overlap (Supplemental Figure 7). This analysis shows that most of the CTCF binding sites detected in IMR90 cells are also occupied in another cell type, indicating that perhaps most CTCF binding sites in the genome are cell type invariant.
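The overlap statistic amounts to asking, for each site in one cell type, whether any site from the other cell type intersects it. The sketch below assumes half-open (start, end) intervals on a single chromosome; the function name and the toy intervals are hypothetical.

```python
def count_overlapping(query, reference):
    """Number of query intervals that overlap at least one reference interval.
    Intervals are half-open (start, end) pairs on the same chromosome."""
    return sum(
        any(q_start < r_end and r_start < q_end for r_start, r_end in reference)
        for q_start, q_end in query
    )

# Toy intervals, for illustration only.
imr90_sites = [(0, 10), (20, 30), (50, 60)]
u937_sites = [(5, 8), (30, 40), (55, 70)]
shared = count_overlapping(imr90_sites, u937_sites)

# Fraction reported in the text: 151 of 225 IMR90 sites.
reported_fraction = 151 / 225
```

Intervals that merely touch, such as (20, 30) and (30, 40), are not counted as overlapping under the half-open convention used here.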
On the other hand, while the overlap between CTCF binding sites in U937 and IMR90 cells does increase with loosened criteria, it does not become 100%. A subset of the CTCF binding sites appears to interact with this protein in a cell type dependent manner. To confirm this, we have performed conventional ChIP assays to test the binding of CTCF to two IMR90-specific sites and one U937-specific site (Supplemental Table 10). The results indicate that the two IMR90 specific CTCF binding sites are indeed associated with the protein in IMR90 cells but not in U937 cells, while a U937 specific CTCF binding site interacts with this protein in an opposite way (Figure 5c). We conclude that a fraction of the CTCF binding sites in the genome may be subject to cell type dependent regulation, although the full extent of this population of CTCF sites remains to be determined.
Since we were able to computationally map CTCF binding sites in other vertebrate genomes, we were interested in knowing how these sites have evolved in different vertebrate species and whether the changes might reflect CTCF function. We have identified 14,352 nucleotide changes within the 12,799 evolutionarily conserved CTCF recognition sequences. Interestingly, the predominant base substitution occurs at the cytosine at position 16, which happens to be the dominant CG di-nucleotide within the consensus sequence (Figure 6). The cytosine to thymidine transition at this position accounts for nearly 17% of all nucleotide changes. One explanation for the unusually high rate of C-to-T substitution at this position is potential DNA methylation at the base (Jones and Baylin, 2002; Rideout et al., 1990), which is consistent with the regulation of CTCF binding by DNA methylation. This observation suggests an intriguing evolutionary model of deriving differential regulation of genes by simply altering CTCF binding in the genome, a process that can be facilitated by environmental and epigenetic factors.
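A position-by-position tally of base changes between aligned motif instances can be sketched as follows. The function name and the short toy sequences are hypothetical; real input would be pairs of aligned 20-mers from the human genome and another vertebrate genome.

```python
from collections import Counter

def tally_substitutions(pairs):
    """Count (position, human_base, other_base) for every mismatch between
    aligned, equal-length sequence pairs. Positions are 1-based."""
    counts = Counter()
    for human, other in pairs:
        for pos, (a, b) in enumerate(zip(human, other), start=1):
            if a != b:
                counts[(pos, a, b)] += 1
    return counts

# Toy aligned pairs, for illustration only.
toy_pairs = [("ACGT", "ATGT"), ("ACGT", "ATGA"), ("ACGT", "ACGT")]
changes = tally_substitutions(toy_pairs)
most_common_change = changes.most_common(1)[0]
```

Dividing the count for a given change by the total number of changes gives the kind of fraction quoted in the text for the C-to-T transition at position 16.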
In summary, we have generated a high-resolution map of CTCF binding sites in the human genome with unique distribution and sequence features. This map not only confirms most known insulators and CTCF binding sites, but also identifies over 13,000 novel CTCF binding sequences and potential insulators. Nearly 80% of the CTCF binding sites share a consensus motif that is highly conserved during evolution. We have found that CTCF binding sites are largely invariant between cell types. Our results represent a critical step toward comprehensive identification of CTCF-dependent insulators in the human genome.
Unlike the binding sites of sequence-specific transcription activators such as ER and p53, CTCF binding sites are present ubiquitously throughout the genome, and their chromosomal distribution is strongly correlated with genes. In this respect, CTCF resembles general transcription factors. Yet the locations of CTCF binding sites are clearly different from those of general transcription factors. Except for a relatively small fraction (20%), the vast majority of CTCF binding occurs at sites remote from transcription start sites (Figure 2b). In contrast, nearly 90% of the TAF1 binding sites are located at promoters. This unique distribution of CTCF binding sites in the genome is consistent with the potential role of these sequences as insulators.
About half of the CTCF binding sites are far away from genes. These distal sites likely define insulators and in many cases coincide with boundaries of gene clusters, such as the olfactory receptor gene clusters. A number of genes in the mammalian genome are arranged into clusters, and the existence of these clusters has implicated coordinated regulation of expression by shared long-range elements such as locus control regions, as is observed for the Hox and beta-globin gene clusters (Sproul et al., 2005). Recently, a study showed that the OR gene clusters located on separate chromosomes share a single enhancer that selectively interacts with only one promoter, resulting in a highly exclusive activation of a single promoter out of about 1,500 others (Lomvardas et al., 2006).
Consistent with this gene segregation property of CTCF, the CTCF binding sites coincide with boundaries of genes that escaped X-inactivation (Filippova et al., 2005). X-inactivation has been shown to involve establishment of heterochromatin on one of the two X chromosomes of the female genome. A recent study shows that X-inactivation is not uniform along the inactive X chromosome (Carrel and Willard, 2005), and identified a number of gene clusters that can escape the chromosome-wide heterochromatin formation. If the CTCF binding sites indeed function as insulators, then one might expect them to segregate the gene clusters that escape inactivation on the X chromosome. Indeed, we have observed several domains on the X chromosome that are surrounded by CTCF binding sites (Supplemental Figure 8).
While nearly half of the CTCF binding sites are found in sequences between genes, an equivalent number of CTCF sites are located within genes. It is not immediately obvious whether these sequences function as insulators. We note that many of them appear to segregate alternative promoters within a single gene and perhaps contribute to alternative promoter usage. Examples of this are provided by the protocadherin gamma locus (PCDHG, Figure 2e), T cell receptor alpha/delta, beta, gamma loci (TCRα/δ TCRβ TCRγ), immunoglobulin heavy chain (IgH), and light chain kappa and lambda loci (IgLκ IgLλ, Supplemental Figure 4b). In each case, CTCF binding segregates transcriptional start sites that display differential activities across tissues. About 52% of human genes possess multiple promoters. While alternative promoter usage is very common (Carninci et al., 2005; Carninci et al., 2006; Kimura et al., 2006), the mechanisms are not clearly understood. It is generally assumed that different promoters employ distinct regulatory mechanisms to achieve tissue-specific and temporally specific activities. The observation that CTCF binding sites punctuate alternative promoters suggests that insulator elements may be involved in the selection of promoters in distinct cell types.
One of the surprising findings of our study is that the vast majority of the experimentally identified CTCF binding sites are characterized by a specific 20-mer motif. We demonstrate that this motif is highly conserved in vertebrates and can be used to predict other potential CTCF binding sites in the genome. Furthermore, we show that the newly characterized CTCF consensus sequence specifically interacts with CTCF protein in vitro. Given the overwhelming diversity of sequences that CTCF may recognize in vitro, our finding of a single dominant CTCF binding consensus sequence within the in vivo CTCF binding sites is unexpected.
On the other hand, our results do not rule out the existence of additional CTCF binding motifs that may be recognized by the insulator binding protein along the genome. As a matter of fact, it is important to note 18% of the in vivo binding sites do not contain the newly characterized CTCF binding consensus sequence. When analyzed in vitro, some of these CTCF binding sites can indeed directly interact with CTCF, supporting the existence of different CTCF recognition sequences. Furthermore, quite a number of previously characterized CTCF binding sequences and insulators lack the newly identified motif. It is entirely possible that CTCF may bind to different classes of DNA sequences, either directly or in association with a partner. So far, our search has failed to yield another significant motif among this subset of in vivo CTCF binding sites.
In conclusion, we report here the first high-resolution map of CTCF binding in the human genome, which reveals several new aspects of CTCF function. Our results provide a much-needed resource for further investigation of CTCF’s role in insulator function, imprinting, and long-range chromosomal interactions.
A detailed description of the experimental methods and materials can be found in the Cell online supplemental materials. All raw and processed data are available at http://licr-renlab.ucsd.edu/download, the UCSC genome browser http://genome.ucsc.edu/, and Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/ (accession #GSE5559). Monoclonal CTCF antibodies used in this study have been characterized and described by E. Pugacheva and co-workers (Pugacheva et al., 2005) and are available from them upon request.
IMR90 and U937 cells were grown and maintained according to the directions from the American Type Culture Collection. Cells were harvested and crosslinked with 1% formaldehyde when they reached ~80% confluency on plates. Chromatin immunoprecipitation was performed as described (Kim et al., 2005b), using 50 µl of an equimolar mixture of nine CTCF monoclonal antibodies and three distinct array platforms: whole human genome tiling arrays (Kim et al., 2005b), a condensed array containing a total of 742,156 oligonucleotides, and PCR product arrays covering the ENCODE regions (Kim et al., 2005a). Microarray data analysis was carried out as described previously (Kim et al., 2005a; Kim et al., 2005b) (see online supplemental materials).
Quantitative real-time PCR was performed in duplicate with 0.5 ng of CTCF ChIP DNA and unenriched total genomic DNA, using an iCycler™ and iQ™ SYBR Green Supermix reagent (Bio-Rad Laboratories). Normalized Ct (ΔCt) values for each sample were calculated by subtracting the Ct value obtained for the unenriched DNA from the Ct value for the CTCF ChIP DNA (ΔCt = Ct_CTCF − Ct_total). The fold enrichment of the tested promoter sequence in ChIP DNA over the unenriched DNA was then estimated as described previously (Bernstein et al., 2005; Cawley et al., 2004). Primers used for this analysis are listed in Supplemental Table 1.
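The ΔCt normalization above can be illustrated with a minimal Python sketch. The conversion from ΔCt to fold enrichment is not spelled out in the text, which defers to the cited references; the sketch assumes the commonly used form in which each PCR cycle doubles the product (efficiency of 2), so that enrichment is 2 raised to the power −ΔCt. The function names and the Ct readings are hypothetical and serve only as an illustration.

```python
def mean(values):
    """Average replicate Ct readings (the text reports duplicate reactions)."""
    return sum(values) / len(values)


def delta_ct(ct_chip, ct_total):
    """ΔCt = Ct of CTCF ChIP DNA minus Ct of unenriched total DNA, as in the text."""
    return ct_chip - ct_total


def fold_enrichment(ct_chip, ct_total, efficiency=2.0):
    """Assumed conversion (not stated in the text): efficiency ** (-ΔCt).

    efficiency=2.0 corresponds to perfect doubling of product per cycle.
    """
    return efficiency ** (-delta_ct(ct_chip, ct_total))


# Hypothetical duplicate Ct readings for one primer pair.
chip_ct = mean([24.0, 24.0])    # CTCF ChIP DNA
total_ct = mean([27.0, 27.0])   # unenriched total genomic DNA

print("ΔCt:", delta_ct(chip_ct, total_ct))
print("fold enrichment:", fold_enrichment(chip_ct, total_ct))
```

A negative ΔCt means the ChIP sample crossed the detection threshold earlier than the unenriched sample, which under the assumed conversion corresponds to an enrichment greater than one.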
Motif discovery was performed as described (Smith et al., 2005a; Smith et al., 2005b). All the CTCF binding sites were used as positive sequences and the flanking sequences as negative sequences. The sequence motif overrepresented in the positive sequences relative to the negative sequences was selected. Using this sequence motif, we generated an initial 20-bp position weight matrix (PWM). This 20-mer PWM was searched against the entire set of CTCF binding sites, and all the motifs found in the binding sites were used to generate the final PWM. The program Storm was then used to search the human genome (hg17) for the presence of this motif. The high-scoring motifs were then filtered for the presence of the key nucleotides C, G, G and C at positions 6, 11, 14 and 16. The resulting CTCF binding sites were then mapped to 14 vertebrate genomes using the liftOver and genome alignment information available from the UCSC genome browser. Each sequence was then scored using Storm and filtered for the critical nucleotides as per the human genome scan.
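The scan-and-filter step can be sketched in Python as follows. This is not the Storm program and does not reproduce its scoring or thresholds. It is a simplified illustration that assumes a log-odds PWM built with a pseudocount against a uniform background, a single-strand scan, and an arbitrary score threshold. The toy sites and the toy sequence are hypothetical; only the key nucleotides (C, G, G and C at positions 6, 11, 14 and 16) come from the text.

```python
import math

BASES = "ACGT"
# Key nucleotides named in the text: 1-based position -> required base.
KEY_NUCLEOTIDES = {6: "C", 11: "G", 14: "G", 16: "C"}


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one {base: score} dict per column) from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-column log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def has_key_nucleotides(window, key=KEY_NUCLEOTIDES):
    """True if the window carries every required base at its 1-based position."""
    return all(window[pos - 1] == base for pos, base in key.items())


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for windows passing both score and key-base filters."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguity codes
        s = score(pwm, window)
        if s >= threshold and has_key_nucleotides(window):
            hits.append((start, window, s))
    return hits


# Hypothetical 20-mer sites (not real CTCF sites) that satisfy the key positions.
toy_sites = [
    "ATTAACATATGATGACTATA",
    "ATTAACATATGATGACTATA",
    "TTTAACATATGATGACTATT",
]
toy_pwm = build_pwm(toy_sites)
toy_sequence = "GGGG" + toy_sites[0] + "CCCC"
for start, window, s in scan(toy_pwm, toy_sequence, threshold=15.0):
    print(start, window, round(s, 2))
```

A genome-scale scan would also need to search the reverse complement and use a calibrated threshold, both of which are left out here for brevity.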
EMSAs were carried out as described (Pugacheva et al., 2005). Briefly, the DNA-binding domain of CTCF (11ZF) and Luciferase (Luc) were synthesized in vitro from pET-11ZF and T7 control plasmids, respectively (Awad et al., 1999; Filippova et al., 1996), using the TnT T7 Quick Coupled Transcription/Translation System (Promega, Madison, WI, Cat.# L1170). DNA fragments (Supplemental Table 2) were end-labeled at their 5′ ends using 32P-γ-ATP and T4 polynucleotide kinase. The labeled DNA fragments were gel purified, combined with equal amounts of in vitro-synthesized protein, and incubated for 30 minutes at room temperature, followed by electrophoresis on 5% non-denaturing polyacrylamide gels.
Statistical significance of the computationally mapped CTCF sites was analyzed by comparing the number of mapped sites to the distribution of the number of sites mapped using random motifs over 1,000 iterations. Each random PWM was derived by randomizing the positions within the 20-mer CTCF motif. Statistical significance of observed gene clusters within CPDs and of multiple CTCF binding sites within a gene was analyzed by calculating the expected probability of each number of observed genes per CPD, or each number of CTCF binding sites per gene, using the Poisson distribution function. Statistical significance of the observed evolutionary conservation of CTCF binding sites compared to random sites was analyzed by the Mann-Whitney-Wilcoxon test.
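The randomization and Poisson calculations can be sketched in Python as follows. Several details are assumptions rather than statements from the text: "randomizing the positions" is read as shuffling the order of the PWM columns, the empirical p-value uses the usual add-one correction, and the upper-tail helper is included only as one common way to summarize a Poisson probability. The `count_sites` callable is a hypothetical stand-in for whatever genome scan produces the number of mapped sites.

```python
import math
import random


def shuffle_columns(pwm, rng):
    """Return a copy of the PWM with its column order randomized (assumed scheme)."""
    shuffled = list(pwm)
    rng.shuffle(shuffled)
    return shuffled


def null_distribution(pwm, count_sites, iterations=1000, seed=0):
    """Count mapped sites for each of `iterations` column-shuffled PWMs.

    `count_sites` is a hypothetical callable: PWM -> number of mapped sites.
    """
    rng = random.Random(seed)
    return [count_sites(shuffle_columns(pwm, rng)) for _ in range(iterations)]


def empirical_p_value(observed, null_counts):
    """Share of null counts at least as large as observed, with add-one correction."""
    exceed = sum(1 for c in null_counts if c >= observed)
    return (exceed + 1) / (len(null_counts) + 1)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_upper_tail(k, lam):
    """P(X >= k) for a Poisson variable with mean lam."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


# Hypothetical numbers, for illustration only.
print(empirical_p_value(10, [1, 2, 3, 10, 11]))
print(poisson_pmf(2, 2.0), poisson_upper_tail(1, 2.0))
```

With 1,000 iterations, the smallest p-value this empirical estimate can report is 1/1001, so it bounds the significance rather than measuring it exactly when no random motif matches the observed count.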
We gratefully acknowledge computer resources made available to us by the Super Computer Center (NBCR award number P41 RR 08605 from NCRR, NIH). This research was supported in part by Ruth L. Kirschstein National Research Service Award F32CA108313 (THK), Ludwig Institute for Cancer Research (BR), U01HG003151 (BR), R33CA105829 (BR), R21CA116365-01 (RDG) and HG001696 (MQZ) from NIH, EIA-0324292 (MQZ) from NSF, and by the Intramural Research Program of the NIH, National Institute of Allergy and Infectious Diseases (VVL).
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.