Identification of TFBS Creation Events
To reveal the role of CpG deamination in TFBS formation, we traced the mutational events that created TFBSs in human promoters after the human–chimp split. To this end, we reconstructed the human–chimp ancestral promoter sequences from multiple sequence alignments of human, chimp, and rhesus (see Materials and Methods); rhesus serves as the outgroup for inferring the direction of mutations. Using 217 TF matrices from the TRANSFAC database (a diverse repertoire of motifs recognized by transcription factors; see Materials and Methods), we annotated binding sites in the human promoters and in the reconstructed human–chimp ancestral promoters (see Materials and Methods). Of all TFBSs annotated in the human promoters, ~96.5% are also annotated at the orthologous positions in the ancestral sequence. The remaining ~3.5% (48,861 sites) are candidates for TFBSs created in the human promoters since the split of the human and chimp lineages. The vast majority (~96%) of these putative creation events were due to a single nucleotide mutation, so we focused on such events in our further analysis. We subdivided the single nucleotide creation events into two groups: likely “CpG deamination–driven creation events,” comprising CpG → CpA or CpG → TpG mutations and accounting for ~22% of all single nucleotide creation events, and “nondeamination-driven creation events,” comprising all other mutations.
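The classification described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the ancestral base is inferred with a simple parsimony rule (an assumption, since the actual reconstruction procedure is given in Materials and Methods), and sequences are assumed to be uppercase, gap-free, and aligned, with at least one flanking base so that the CpG context of a mutated position is visible.

```python
def infer_ancestral_base(human, chimp, rhesus):
    """Parsimony rule (an assumption): use rhesus as the outgroup to polarize changes."""
    if human == chimp:
        return human          # no change on either lineage
    if rhesus == chimp:
        return chimp          # human differs, so the change is on the human lineage
    if rhesus == human:
        return human          # chimp differs, so the change is on the chimp lineage
    return None               # all three differ, direction cannot be resolved


def classify_creation_event(ancestral, derived):
    """Classify the difference between an ancestral and a derived (human) sequence."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned and of equal length")
    diffs = [i for i, (a, d) in enumerate(zip(ancestral, derived)) if a != d]
    if len(diffs) != 1:
        return "not_single_nucleotide"
    i = diffs[0]
    a, d = ancestral[i], derived[i]
    # CpG -> TpG: the C of an ancestral CpG is replaced by T
    if a == "C" and d == "T" and ancestral[i + 1:i + 2] == "G":
        return "deamination"
    # CpG -> CpA: the G of an ancestral CpG is replaced by A
    # (a C -> T deamination on the opposite strand)
    if a == "G" and d == "A" and i > 0 and ancestral[i - 1] == "C":
        return "deamination"
    return "nondeamination"
```

For instance, `classify_creation_event("ACGT", "ATGT")` reports a deamination-driven event, whereas the same C → T change outside a CpG context is reported as nondeamination-driven.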
High Efficiency of CpG Deamination–Driven Creation Events in Human Promoters
At first glance, the CpG deamination–driven creation events might seem to constitute a rather modest fraction of all creation events specific to the human lineage. To interpret this result, however, the following must be taken into account. The number of TFBSs created by CpG deamination events after the split of the human and chimp lineages depends primarily on two factors:
1. The mutational rate of CpGs.
2. The number and composition of the CpG-containing motifs residing in the human–chimp ancestral sequence.
In connection with point 2, it has to be emphasized that CpG dinucleotides are underrepresented in mammalian genomes, including promoter regions (Saxonov et al. 2006).
We therefore set out to derive a new measure of the propensity of CpG deamination events to create binding sites, one that accounts for the above factors. First, building on point 2 above and on our observation that the vast majority (96%) of creation events in the human lineage were single nucleotide mutation events, we annotated all single nucleotide mutation events within the reconstructed human–chimp ancestral sequence that could potentially create a TFBS. We call any such mutation event a “potential TFBS creation event.” Depending on their origin, we distinguish two classes of potential TFBS creation events: potential CpG deamination–driven creation events (defined as mutations of Cs within CpGs) and potential nonCpG deamination–driven creation events (encompassing all remaining mutations). Clearly, only a fraction of all potential creation events annotated in the ancestral sequence actually occurred after the split of the human and chimp lineages. We quantify this fraction with a measure we term creation efficiency (CE). We define CEdeam for CpG deamination–driven creation events as the ratio of the number of CpG deamination–driven creation events that occurred after the human–chimp split to the number of all potential CpG deamination–driven creation events in the ancestral sequence.
Analogously, we define CEnondeam for nonCpG deamination–driven creation events. How efficient are CpG deamination–driven creation events in comparison with nondeamination-driven creation events? To answer this question, we computed two types of CE, CEdeam and CEnondeam, for each of the 217 TFs (represented by a set of 217 TF matrices). In addition, we determined the average CEdeam and CEnondeam values over all 217 TFs. We performed the calculation separately on two distinct classes of human promoters (Saxonov et al. 2006), characterized by low (LCPs) and high (HCPs) CpG dinucleotide content. Importantly, LCPs were shown to be mostly methylated, whereas the vast majority of HCPs are hypomethylated in somatic cells (Weber et al. 2007). The average CEdeam value in LCPs was ~0.04, meaning that on average 4 of 100 potential CpG deamination–driven creation events took place in LCPs. This was ~9-fold higher than in HCPs, where the average CEdeam value was ~0.005. In contrast, the average CEnondeam value was approximately equal in HCPs and LCPs. In LCPs, CpG deamination–driven creation events occurred ~28-fold more frequently than nondeamination-driven creation events, whereas in HCPs the enrichment was ~3.4-fold. These data are compatible with a ~20.4-fold higher rate of CpG mutation (into the deamination products CpA or TpG) in LCPs than in HCPs (not shown). In line with these observations, it has been established that in germ line cells (the cells contributing genetic material to the next generation), CpGs are constitutively methylated in LCPs, whereas CpGs are largely unmethylated in HCPs (Weber et al. 2007).
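The CE measure and the fold comparisons reported above amount to simple ratios, which the following sketch makes explicit. The function names are hypothetical, and the inputs are the rounded averages quoted in the text, so the results are only a rough consistency check: the rounded values give an 8-fold LCP/HCP difference in CEdeam (reported as ~9-fold, presumably computed from unrounded values), and the CEnondeam values implied by the reported enrichments are back-calculated here rather than taken from the study.

```python
def creation_efficiency(observed, potential):
    """CE = creation events that occurred / all potential creation events."""
    if potential <= 0:
        raise ValueError("number of potential creation events must be positive")
    if observed < 0 or observed > potential:
        raise ValueError("observed events must lie between 0 and potential")
    return observed / potential


def fold_enrichment(ce_a, ce_b):
    """Ratio of two creation efficiencies, e.g. CEdeam / CEnondeam."""
    return ce_a / ce_b


# "4 of 100 potential CpG deamination-driven creation events took place in LCPs"
ce_deam_lcp = creation_efficiency(4, 100)
ce_deam_hcp = 0.005  # rounded average quoted in the text

# LCP versus HCP comparison of CEdeam, using the rounded averages
lcp_vs_hcp = fold_enrichment(ce_deam_lcp, ce_deam_hcp)

# CEnondeam values implied by the reported ~28-fold (LCP) and ~3.4-fold (HCP)
# enrichments; back-calculated from rounded numbers, not reported in the text
implied_ce_nondeam_lcp = ce_deam_lcp / 28
implied_ce_nondeam_hcp = ce_deam_hcp / 3.4

print(lcp_vs_hcp, implied_ce_nondeam_lcp, implied_ce_nondeam_hcp)
```

The two implied CEnondeam values come out close to each other (both near 0.0014 to 0.0015), which agrees with the statement that the average CEnondeam is approximately equal in HCPs and LCPs.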
FIG. 1. CpG deamination creates TFBSs with high efficiency in human promoters. Each point represents a CE value calculated for each of 217 TFs. The horizontal axis represents CE values recorded in HCPs, and the vertical axis represents CE values obtained in LCPs.
These results strongly indicate that CpG deamination–driven depletion of CpG dinucleotides, which has been recognized to contribute to variation in regional methylation (Feinberg and Irizarry; Kerkel et al. 2008), serves as an efficient mechanism for the generation of new TFBSs.
CpG Deamination Creates In Vivo Binding Sites
Since the evolutionary analysis of binding sites annotated in the human promoter regions revealed a remarkable efficiency of CpG deamination–driven creation events, we asked whether the same could be observed for TFBSs detected in genome-wide experiments. Interestingly, we found that >85% of the 217 TF-binding matrices recognize on average at least one product of deamination, CpA or TpG (see Materials and Methods), and reasoned that a large number of TFs are capable of interacting with these dinucleotides in vivo.
We identified 13 TF-binding data sets generated in genome-wide ChIP-Seq experiments in human cells (in large part produced by the ENCODE consortium; Celniker et al. 2009) for which we could reliably estimate creation efficiencies (see Materials and Methods). The 13 TFs belong to the following major classes: helix-loop-helix/leucine zipper, helix-turn-helix/homeodomain/Paired box/Tryptophan clusters, Cys2His2 zinc finger domain, Cys4 zinc finger of nuclear type, and histone fold class. The values of CEdeam for the 13 TFs were ~4-fold to ~25-fold higher (on average ~11-fold higher) than the corresponding values of CEnondeam. The latter values fall in the range observed for the annotated binding sites in the promoter regions. On average, per TF, CpG deamination–driven creation events comprised ~25% of all binding site creation events. Together, these findings indicate that CpG deamination events contributed strongly to the creation of in vivo TFBSs.
Numbers of Created In Vivo Binding Sites and Corresponding Creation Efficiencies
In the following, we illustrate the significance of CpG deamination–driven creation events with an evolutionary analysis of in vivo TFBSs of key transcription factors such as c-Myc, Nanog, Oct4, and Ctcf. It is important to note that the binding motifs of these factors contain CpA and TpG dinucleotides (supplementary figs. 1, Supplementary Material online). c-Myc is known to bind to two different E-box motifs composed of CpA, TpG, and CpG dinucleotides: the so-called canonical Myc E-box 5′-CACGTG-3′ and the noncanonical Myc E-box 5′-CACATG-3′/5′-CATGTG-3′ (Zeller et al. 2006). Our analysis of in vivo c-Myc sites detected in a chromatin immunoprecipitation paired-end ditag (ChIP-PET) experiment (Zeller et al. 2006) revealed that CpG deamination–driven creation events make up as much as 56% of noncanonical and 28% of canonical c-Myc site creation events and, as expected, are localized at positions 2, 3, 4, and 5 within c-Myc motifs. The values of CEdeam for canonical and noncanonical c-Myc sites were 8- and 14-fold higher, respectively, than the values of CEnondeam. Likewise, a strong contribution of CpG deamination events was observed for c-Myc sites annotated in the two classes of human promoters (supplementary table 1, Supplementary Material online). In LCPs, CpG deamination–driven creation events constituted as much as 78% of all noncanonical c-Myc site creation events, and the ratio of CEdeam to CEnondeam was ~51.
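Why positions 2, 3, 4, and 5 are the expected locations can be checked directly from the three E-box sequences quoted above. The sketch below (with a hypothetical function name) lists, for each motif, the 1-based positions of the bases that a CpG deamination would have produced: the T of every TpG (from the C of a CpG) and the A of every CpA (from the G of a CpG). It only inspects the motif itself; whether a given site actually arose this way depends on the ancestral sequence having carried a CpG at that position.

```python
def deamination_product_positions(motif):
    """1-based positions of the T in each TpG and the A in each CpA of a motif."""
    positions = set()
    for i in range(len(motif) - 1):
        dinucleotide = motif[i:i + 2]
        if dinucleotide == "TG":
            positions.add(i + 1)  # the T, derived from the C of a CpG
        elif dinucleotide == "CA":
            positions.add(i + 2)  # the A, derived from the G of a CpG
    return positions


canonical = deamination_product_positions("CACGTG")
noncanonical = (deamination_product_positions("CACATG")
                | deamination_product_positions("CATGTG"))
print(sorted(canonical), sorted(noncanonical))
```

Taken together, the canonical and noncanonical motifs cover positions 2 to 5, in line with the positions reported in the text.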
FIG. 2. CpG deamination drives creation of in vivo TFBSs. (A) Upper panel: Histogram of c-Myc- and Oct4-binding sites created via single nucleotide mutation events. The presence of the CpA and TpG dinucleotides within individual binding sites is highlighted by ...
Moreover, we found further evidence suggesting that CpG deamination created Myc E-box sites. Interestingly, a recent study reported the identification of a canonical Myc E-box site in AluS sequences (Wang et al. 2009). Likewise, we identified noncanonical Myc E-box sites residing in sequences of AluS transposons. Our literature searches identified experimental studies in which two canonical Myc E-box sites, within AluJ and AluS elements inserted in the second intron of the CDC25A gene (Galaktionov et al. 1996) and in the distal promoter region of the KIR gene (Cichocki et al. 2009), respectively, were shown to function as Myc-responsive elements. Provocatively, we found that all these canonical and noncanonical c-Myc sites are located at one particular position in AluS/J, which overlaps with the p53-binding site previously described by us (Zemojtel et al. 2009). The reconstructed consensus sequences of the AluS and AluJ subfamilies contain the sequence CGCGCG at the location corresponding to the Myc-binding site. Thus, we conclude that, like the p53 sites, these sites were created via CpG deamination after the insertion of Alu transposons into the genome. Specifically, two and three CpG deaminations are required to create the canonical and the noncanonical c-Myc sites, respectively, from a CGCGCG template. Interestingly, it has been reported that methylation of the CpG dinucleotide present in the canonical E-box inhibits Myc binding both in vitro (Prendergast et al. 1991) and in vivo (Perini et al. 2005). In light of this, spontaneous deamination of the methylated CpG in a canonical Myc-binding site would result in the creation of a noncanonical binding site, which is desensitized to methylation.
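The count of two and three deaminations can be verified by comparing the CGCGCG template with each E-box sequence base by base. The sketch below uses a hypothetical helper that accepts a difference only if it is a C → T change at the C of a template CpG or a G → A change at the G of a template CpG (the same event seen from the opposite strand), and that allows each CpG to be hit once. It is an illustration of the arithmetic, not a model of the actual mutational history of Alu elements.

```python
def deaminations_needed(template, target):
    """Number of CpG deaminations converting template into target, or None
    if some difference cannot be explained by deamination of a template CpG."""
    if len(template) != len(target):
        raise ValueError("sequences must have equal length")
    used_cpgs = set()
    for i, (a, d) in enumerate(zip(template, target)):
        if a == d:
            continue
        c_to_t = a == "C" and d == "T" and template[i + 1:i + 2] == "G"
        g_to_a = a == "G" and d == "A" and i > 0 and template[i - 1] == "C"
        if not (c_to_t or g_to_a):
            return None
        cpg_start = i if c_to_t else i - 1
        if cpg_start in used_cpgs:
            return None  # both bases of one CpG changed: not a simple deamination
        used_cpgs.add(cpg_start)
    return len(used_cpgs)


for ebox in ("CACGTG", "CACATG", "CATGTG"):
    print(ebox, deaminations_needed("CGCGCG", ebox))
```

The canonical E-box CACGTG differs from the template at two CpGs, and each noncanonical E-box differs at all three.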
Likewise, evolutionary analysis of TFBSs of Oct4, Nanog, and Ctcf detected in human ES cells (Kunarso et al. 2010) also pointed to a strong contribution of CpG deamination–driven creation events. For Oct4, Nanog, and Ctcf, we obtained, respectively, a 24-, 22-, and 6-fold enrichment of CpG deamination–driven creation events compared with nondeamination-driven creation events. CpG deamination events constituted 25%, 22%, and 10% of all creation events for Oct4, Nanog, and Ctcf, respectively (supplementary fig. 3, Supplementary Material online). For example, CpG deamination–driven creation events were abundant at positions corresponding to nucleotides 4, 8, 9, and 12, where CA and TG dinucleotides are present within the Oct4-binding motif (supplementary fig. 1, Supplementary Material online). We provide here experimental evidence supporting the notion that single nucleotide mutations from CpG to TpG (resulting from CpG deamination), such as that seen at position 9 of the Oct4 motif, can create functional Oct4-binding motifs. Recently, an SNP at position 9 in a putative Oct4 motif was discovered in a patient with Beckwith–Wiedemann syndrome (Demars et al. 2010). The mutation from T to C occurred in the context of a TpG dinucleotide and resulted in the creation of a CpG dinucleotide in the patient sequence; WT: GTTTGAGATGCTAAT → P: GTTTGAGACGCTAAT. The study used an in vitro gel electrophoretic mobility shift assay (GEMSA) with nuclear extract from Oct4-overexpressing cells to provide evidence that Oct4 did not bind to the sequence variant containing CpG (P) but only to the one containing TpG.
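The position and dinucleotide context of the patient variant can be read off the two sequences quoted above. The following sketch, with a hypothetical helper name, locates the single mismatch between the wild-type (WT) and patient (P) sequences and extracts the dinucleotide that starts at that position in each.

```python
WT = "GTTTGAGATGCTAAT"       # wild-type sequence as quoted in the text
PATIENT = "GTTTGAGACGCTAAT"  # patient sequence as quoted in the text


def single_difference(a, b):
    """Return (1-based position, base in a, base in b) of the only mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    diffs = [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        raise ValueError("expected exactly one mismatch")
    return diffs[0]


position, wt_base, patient_base = single_difference(WT, PATIENT)
wt_dinucleotide = WT[position - 1:position + 1]
patient_dinucleotide = PATIENT[position - 1:position + 1]
print(position, wt_base, patient_base, wt_dinucleotide, patient_dinucleotide)
```

The mismatch falls at position 9 (T in WT, C in the patient), turning the TpG of the bound WT sequence into the CpG of the unbound variant. Read in the opposite direction, CpG → TpG is exactly the change that CpG deamination produces.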
Together, these results highlight the efficiency of CpG deamination events in the creation of TFBSs.
Originally, it was postulated that a key role of CpG methylation and deamination is the inactivation of transposons and thus the protection of mammalian genomes from their insertional activity (Yoder et al. 1997; Zemach and Zilberman 2010). One-third of all CpGs in the human genome are located within Alu transposons. Over time, CpG deamination permanently inactivates initially CpG-rich Alus. As a side effect of this process, decaying Alu sequences give birth to new regulatory elements. In particular, the CpG-rich, ~20 nt long template sequence residing in Alu elements can be converted by deamination into p53 (Zemojtel et al. 2009), PAX-6 (Zhou et al. 2002), and c-Myc TFBSs, as documented here.
This phenomenon is not limited to these three TFs and Alu transposons. By analyzing mutational events leading to the creation of TFBSs in human promoters (217 TFBSs) and in genome-wide regions bound in vivo (14 TFBSs), we document that CpG deamination events create TFBSs with much higher efficiency than other single nucleotide mutational events. A recent study reporting genome-wide locations of Ctcf sites in the human genome noted that orthologous Ctcf-binding sequences in vertebrate genomes accumulated C → T mutations at a very high rate at the position where the CpG dinucleotide is located in the Ctcf-binding motif (position 15 as referenced in supplementary fig. 2, Supplementary Material online) (Kim et al. 2007). This observation is compatible with CpG deamination as a driving process. In light of this, it is tempting to speculate that CpG deamination might in fact constitute a double-edged sword, involved not only in creating but also in inactivating binding sites. In this context, we propose that CpG deamination, which is known to induce variable regional methylation in human populations, constitutes an evolutionary benefit facilitating innovation in gene regulation.