The molecular functions of methyl-CpG binding proteins rely on their ability to recognize and bind methylated DNA. As this property is central to understanding their roles in vivo, we will review the current progress in this area in more detail.
Deletion analyses have identified the minimal region of MeCP2 responsible for the interaction with methylated CpGs [
32]. Further comparison with other MBD proteins defined the MBD domain as a protein motif of about 75 amino acids [
34,
47]. Since the “classical” MBD was described, proteins containing MBD-like domains, including ESET/SETDB1 and TIP5, have also been identified in different species. However, as in the case of the TIP5 protein, the MBD-like domains are predicted not to form specific interactions with methylated DNA and therefore may serve other functions. The unifying name TAM domain (TIP5, ARBP, MBD) is now used to cover both canonical MBD and MBD-like domains [
47]. Proteins containing MBD-like domains are not the subject of this review.
Sequence comparison of all human MBD family proteins shows the presence of 16 strictly conserved amino acids within the MBD domain. MBD3, which does not bind to methylated DNA, lacks four of these conserved residues [
34]. Pairwise comparison reveals the presence of two subclasses, with the MBD domains of MBD4 and MeCP2 being more closely related to each other, while those of MBD1, MBD2 and even MBD3 form a separate subgroup. Solution structures of the MBD domains of human MeCP2 and MBD1 have been determined by NMR, revealing a similar α/β sandwich fold composed of four β-strands and an α-helix [
48,
49]. Detailed information on how the MBD fold binds symmetrically methylated CpG was derived from the NMR structure of the MBD domain of MBD1 in complex with methylated DNA [
50].
MBD proteins interact with methylated DNA in the major groove, where the two methyl groups from the mCpG point towards the exterior of the double helix. Several residues from the L1 loop, which connects the β2 and β3 strands, and from the α-helix make contacts with the sugar/phosphate backbone on each strand of the DNA molecule. Four conserved residues (R22, Y34, R44, S45) in MBD1 are involved in recognizing the methyl-CpGs via a complex set of interactions. It appears that each side chain interacts with DNA in a somewhat bivalent way, where the polar moiety of each of these residues contacts a C or G base, while their hydrophobic regions stack around the methyl groups. Such bivalent contacts from each important amino acid side chain may explain why both the CpG dinucleotide and the two methyl groups are strictly required for efficient recognition by the MBD. Subtle variations in this network might abolish binding. MBD3, for example, has only three of the four conserved residues, with a tyrosine (Y) to phenylalanine (F) substitution at the position equivalent to Y34. The loss of a single hydroxyl group renders MBD3 incapable of binding to methylated DNA [
34,
51]. This particular arrangement of critical amino acids is likely to explain the high selectivity observed
in vitro towards methylated DNA versus either hemimethylated or unmethylated CpGs. The structural data also confirm that one MBD domain can only accommodate one symmetrically methylated CpG as the MBD domain binds DNA as a monomer [
32,
48]. However, this does not exclude the presence of potential homo- or heterodimerisation interfaces on MBD proteins, even if MeCP2 appears to be mostly monomeric in solution [
52]. The only case that complicates this picture is MBD4, whose MBD domain seems capable of interacting preferentially with mCpG:TpG mismatches arising from deamination of methyl-cytosine and even with hemimethylated DNA [
53,
54]. The structural information available so far does not explain why MBD4, whose MBD domain is related to MeCP2 more than any other, would display an altered DNA binding specificity. However, it seems that cytosine methylation and even the MBD itself can be dispensable for MBD4 G:T mismatch-specific thymine glycosylase activity [
55].
About 70 to 80% of CpGs are methylated in mammalian genomes, creating a relatively high number of potential binding sites for MBD proteins [
2]. What, then, determines their pattern of occupancy at these sites? One possible model is that each MBD protein randomly occupies any available methylated CpG. In this scenario, the relative abundance of each MBD protein within a cell, together with the methylation density, will dictate the occupancy of individual methylated sites. This random behaviour would imply high redundancy and is the principal argument used to explain the relatively mild phenotypes of MBD1, MBD2 and Kaiso null mice [
56-
58]. In another model, one can envisage that other factors may influence the distribution of MBD proteins within a cell nucleus, making it non-uniform and non-random, with each MBD protein occupying unique sites in the genome. This model would predict that a subset of genes would be affected by the loss of one MBD protein but not another. Examples of genes misexpressed in the absence of specific MBD proteins are becoming more abundant and the phenotypes of MBD-deficient mice, although subtle, are markedly different [
56,
58-
61].
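The contrast between the two models can be made concrete with a toy calculation. The sketch below is only an illustration under strong simplifying assumptions that are not taken from the studies cited: all proteins are treated as having equal affinity, every methylated site is assumed to be bound, and the abundance values and site names are arbitrary placeholders. The function names `random_model_shares` and `dedicated_model_unbound` are hypothetical.

```python
def random_model_shares(abundance):
    """Model 1 (random occupancy): the expected share of methylated sites
    bound by each protein is simply its relative abundance.
    Assumes equal affinity and saturated sites (illustrative assumptions)."""
    total = sum(abundance.values())
    return {protein: amount / total for protein, amount in abundance.items()}


def dedicated_model_unbound(site_owner, lost_protein):
    """Model 2 (non-random occupancy): each site is bound by one specific
    protein, so losing that protein leaves its sites unbound."""
    return sorted(site for site, owner in site_owner.items()
                  if owner == lost_protein)


# Arbitrary placeholder abundances, not measured values.
abundance = {"MBD1": 1, "MBD2": 1, "MeCP2": 2}
print(random_model_shares(abundance))

# Under model 1, removing MBD2 only redistributes the shares:
# every site can still be bound by a remaining protein.
without_mbd2 = {p: a for p, a in abundance.items() if p != "MBD2"}
print(random_model_shares(without_mbd2))

# Under model 2, the sites assigned to the lost protein are left empty.
site_owner = {"siteA": "MeCP2", "siteB": "MBD2", "siteC": "MeCP2"}
print(dedicated_model_unbound(site_owner, "MBD2"))
```

In the first model the loss of one protein changes only who binds, which is the redundancy argument; in the second, a defined subset of sites loses its reader, which is what locus-specific misexpression would require.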
In support of the second model, a recent study demonstrates that in primary human fibroblasts, MBD1, MBD2 and MeCP2 do not share binding sites
in vivo, at least at the genomic sequences examined [
62]. Morpholino-mediated depletion of MeCP2 and MBD2 suggested the existence of a mechanism dictating preference of MeCP2 but not MBD2 for a subset of methylated sites
in vivo [
62]. Whether this selective binding is retained in cancer cells, which tend to accumulate aberrant DNA methylation patterns, is unclear [
63]. Specific targeting of MBD proteins, observed in primary cells, may be achieved via interactions with binding partners (see below) including other DNA binding activities which may facilitate targeting of MBDs to chromatin or DNA at specific loci. Experimental evidence for recruitment via partner proteins is currently missing. Another possibility, which has been validated to some extent, is that the various members of the MBD family display different DNA binding specificity, meaning that they recognize and bind to more complex sequences than a single methylated CpG.
Recent
in vitro experiments showed that, unlike MBD2, MeCP2 requires a run of four or more A/T base pairs adjacent to a methylated CpG for high-affinity binding [
62]. Furthermore, [A/T]≥4 runs are present at MeCP2 target sequences identified
in vivo [
62]. These findings constitute the first example where the enhanced binding specificity towards a particular set of methylated sequences allows discriminative binding site occupancy of an MBD protein. Whether this is the case for other MBD proteins remains to be determined. However, as MeCP2 and MBD1 contain additional DNA binding domains, it is possible that a single methylated CpG is not sufficient to support high affinity binding of an MBD protein to DNA. Early studies on MeCP2 detected a potential second DNA binding activity independent of the MBD domain and sequence analyses identified the presence of two AT-hooks [
31,
62,
64]. The AT-hook motif is capable of interacting with the minor groove of AT-rich DNA and has been characterized in high mobility group proteins such as HMGA1 [
65]. However, the AT hooks are frequently present in conjunction with other functional DNA or chromatin binding domains [
66], and in the case of MeCP2 their functionality remains to be determined. Surprisingly, the AT hooks are not required for selective binding of MeCP2 to a CpG followed by an [A/T]≥4 run [
62]. However, these motifs may interact with other stretches of A/T-rich DNA in
cis or
trans. Additionally, a role for the C-terminus of MeCP2 in facilitating binding to DNA, matrix attachment regions and nucleosomes has also been reported [
67-
70]. Whether multiple DNA and chromatin binding interfaces play a role in MeCP2 function requires further studies.
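The sequence requirement described above, a CpG with a run of four or more A/T base pairs next to it, lends itself to a simple sequence scan. The following is a minimal sketch, not the method used in the cited study. The function name `cpgs_with_at_run` is hypothetical, and `max_gap` (the number of bases tolerated between the CpG and the A/T run) is an assumed parameter, since the text only says "adjacent". Both flanks are checked because the CpG dinucleotide is palindromic. A plain sequence carries no methylation information, so the caller is assumed to restrict the results to CpGs known to be methylated.

```python
import re


def cpgs_with_at_run(seq, min_run=4, max_gap=3):
    """Return 0-based positions of CpGs flanked by an A/T run.

    min_run follows the [A/T]>=4 rule from the text.
    max_gap is an illustrative assumption, not a value from the source.
    """
    seq = seq.upper()
    at_run = re.compile("[AT]{%d}" % min_run)
    flank = max_gap + min_run  # window that must contain the start of the run
    hits = []
    # Lookahead so that every CG is reported, including overlapping contexts.
    for match in re.finditer("(?=CG)", seq):
        i = match.start()
        downstream = seq[i + 2:i + 2 + flank]
        upstream = seq[max(0, i - flank):i]
        if at_run.search(downstream) or at_run.search(upstream):
            hits.append(i)
    return hits


print(cpgs_with_at_run("GGCGAATTGG"))   # CpG at index 2 is followed by AATT
print(cpgs_with_at_run("GGCGGCGCCC"))   # no A/T run near either CpG
```

Setting `max_gap=0` demands that the run start immediately next to the CpG; larger values relax this, and the appropriate value would have to come from binding data.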
On the other hand, MBD1 protein carries a second functional DNA binding motif separate from the MBD. Depending on the isoform, MBD1 can have two or three zinc finger motifs defined by 8 conserved cysteines, the CxxC zinc finger [
71,
72]. However, the copies are not strictly equivalent to one another, as they display primary sequence differences that would alter their biochemical properties. The most C-terminal zinc finger (usually referred to as CxxC3), which we will consider as the canonical CxxC motif, is also present in various other proteins, including DNMT1, the CpG binding protein CGBP, the H3K4 histone methylase MLL and the H3K36 histone demethylases of the Jumonji family, JHDM1A and JHDM1B [
71,
73,
74]. This canonical version of the CxxC zinc finger has been shown to bind non-methylated CpGs
in vitro in the case of MBD1, MLL, CGBP and JHDM1B [
71,
74-
76]. The two other CxxC motifs of MBD1 lack a conserved glutamine residue and a KFGG motif, characteristic of all DNA binding CxxC zinc fingers, and as a consequence are unable to bind DNA [
71]. The role of these divergent CxxC zinc fingers is unclear but they might be involved in protein-protein interactions [
61].
In reporter gene assays, MBD1 represses transcription from CpG-rich unmethylated promoters in a CxxC3 domain-dependent manner [
71,
72]. This suggests that this domain could be as efficient as the MBD for targeting MBD1 to DNA
in vivo, and therefore MBD1 may play a role in silencing certain unmethylated CpG island promoters. However, the CxxC3 domain by itself does not provide enough sequence specificity to discriminate between different CpG islands, which are defined by their high CpG content. The mechanism(s) that would account for the targeting of MBD1, and other CxxC-containing proteins, to specific DNA loci remain to be uncovered. An attractive hypothesis is that MBD1 requires both of its DNA binding domains for efficient binding at specific loci
in vivo. As each domain interacts with a very short sequence (2 nucleotides) compared to classical DNA binding transcription factors, one can speculate that the use of two separate DNA binding domains might enhance the specificity for particular sequences
in vivo. Another possibility is that, similar to MeCP2, which requires a methylated CpG followed by an [A/T]≥4 run, each DNA binding domain of MBD1 recognizes a more complex sequence than currently known. Whether MBD1 binds DNA through independent use of its two DNA binding domains, or whether they cooperate to target MBD1 efficiently to specific loci, will be an intriguing question to answer.
It might appear surprising that a methyl-CpG binding protein carries a domain that allows it to bind unmethylated CpGs. However, this ability to interact with methylated and unmethylated DNA is not a unique feature of MBD1. The identification of Kaiso showed that the MBD domain is not the only protein fold able to recognize DNA methylation, as Kaiso and the Kaiso-like proteins ZBTB4 and ZBTB38 use a set of C2H2 zinc fingers to bind methylated DNA [
39,
40]. Early studies suggested that Kaiso requires at least two mCpGs for efficient binding, while ZBTB4 and ZBTB38 seem to interact with a single mCpG [
39,
40].
In vitro studies also show that Kaiso interacts specifically with an unmethylated consensus sequence, the Kaiso Binding Site (KBS: TCCTGCNA), which is present at promoters of Wnt target genes [
77,
78]. Interestingly, only zinc fingers 2 and 3 of Kaiso are necessary and sufficient for binding to either type of sequence
in vitro [
41]. The ability to bind unmethylated DNA is shared by ZBTB4, but surprisingly not by ZBTB38. High-resolution structural information may help to explain how the zinc fingers of Kaiso and Kaiso-like proteins interact with methylated and unmethylated DNA. Such structural studies may facilitate the design of specific point mutations that would allow uncoupling of the mCpG and KBS binding activities, and clear-cut discrimination between the functions of Kaiso and ZBTB4 that rely on their interaction with either methylated or unmethylated DNA.
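Because the KBS is given as an explicit consensus (TCCTGCNA, with N standing for any base), candidate sites can be located with a short pattern search on both strands. The sketch below assumes an unambiguous A/C/G/T input sequence; the helper names `reverse_complement` and `find_kbs` are hypothetical, and a sequence match only flags a candidate site and says nothing about whether Kaiso binds there in vivo.

```python
import re

# KBS consensus from the text: TCCTGCNA, where N is any of A, C, G or T.
# A lookahead is used so that overlapping matches would also be reported.
KBS_PATTERN = re.compile(r"(?=TCCTGC[ACGT]A)")
KBS_LENGTH = 8
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_kbs(seq):
    """Return sorted (position, strand) pairs for KBS matches.

    Positions are 0-based and always refer to the leftmost base of the
    8-mer on the input (forward) strand.
    """
    seq = seq.upper()
    hits = [(m.start(), "+") for m in KBS_PATTERN.finditer(seq)]
    for m in KBS_PATTERN.finditer(reverse_complement(seq)):
        # Convert a reverse-strand coordinate back to forward coordinates.
        hits.append((len(seq) - m.start() - KBS_LENGTH, "-"))
    return sorted(hits)


print(find_kbs("AATCCTGCTAGG"))   # forward-strand match starting at index 2
print(find_kbs("GGTTGCAGGACC"))   # match on the reverse strand
```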
In summary, several lines of evidence suggest that methyl-CpG binding proteins recognize more complex sequences than a single methylated CpG, thus favouring a gene- or locus-specific role for each member of the MBD and Kaiso-like families. As MBD proteins are widely expressed in different tissues and constitute relatively abundant chromosomal proteins, it has been suggested that they may also exert functions unrelated to recognition of methylated DNA. Although binding of MBD proteins to other nucleic acids such as RNA and cruciform DNA structures
in vitro has been reported [
79,
80], evidence
in vivo for the most part firmly supports the function of MBD proteins in reading DNA methylation patterns at specific loci. One likely explanation for the existence of two (and maybe more) families of divergent methyl-CpG binding proteins, with members that display different sequence specificity towards methylated DNA, could be the evolutionary adaptation of pre-existing nucleic acid binding motifs for binding to methylated DNA.