Nucleosomes, the main components of chromatin, are built around histone proteins, whose positively charged amino-terminal tails are exposed on the outside of the nucleosome. These tails are subject to several post-translational covalent modifications, including acetylation, phosphorylation, ubiquitination, sumoylation and methylation (reviewed in [1]). Methylation has been found on a range of lysine residues in various histones: K4 (using the single-letter amino-acid code for lysine), K9, K27, K36 and K79 in histone H3, K20 in histone H4, K59 in the globular domain of histone H4 [2] and K26 of histone H1B [3]. Several proteins responsible for the methylation of specific residues have been characterized, and all but one of these contain a SET domain; they make up the SET-domain protein methyltransferase family (Table ). The exception to the rule is the DOT1 family, members of which methylate K79 in the globular region of histone H3 and are structurally unrelated to SET-domain proteins [4]. Recent work suggests that SET-domain-containing proteins methylate a few proteins in addition to histones (see later); they should therefore be named protein lysine methyltransferases rather than histone lysine methyltransferases. The function of SET-domain proteins is to transfer a methyl group from S-adenosyl-L-methionine (AdoMet) to the amino group of a lysine residue on the histone or other protein, leaving a methylated lysine residue and the cofactor byproduct S-adenosyl-L-homocysteine (AdoHcy). Methylation of specific histone lysine residues is a post-translational epigenetic modification that controls gene expression by acting as a 'marker' for the recruitment of particular complexes that direct the organization of chromatin.
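The methyl transfer described above can be summarized as a simple reaction scheme. This is only a schematic sketch for a single methylation step: it assumes the target is the side-chain (epsilon) amino group of lysine, written here in its deprotonated form, and the explicit proton on the right-hand side is included purely for charge balance.

```latex
\[
\text{protein-Lys-}\varepsilon\text{-NH}_2 \;+\; \text{AdoMet}
\;\longrightarrow\;
\text{protein-Lys-}\varepsilon\text{-NH-CH}_3 \;+\; \text{AdoHcy} \;+\; \text{H}^+
\]
```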
Sites and functions of histone lysine methylation
The SET domain (Figure ) was first recognized as a conserved sequence in three Drosophila melanogaster proteins: a modifier of position-effect variegation, Suppressor of variegation 3-9 (Su(var)3-9) [7], the Polycomb-group chromatin regulator Enhancer of zeste (E(z)) [8], and the trithorax-group chromatin regulator trithorax (Trx) [9]. The domain, which is approximately 130 amino acids long, was characterized in 1998 [10] and SET-domain proteins have now been found in all eukaryotic organisms studied. There are currently 157 entries for human SET-domain proteins in the SMART database [11] and 93 entries in the Pfam database [12], although both databases contain duplicate entries. Seven main families of SET-domain proteins are known - the SUV39, SET1, SET2, EZ, RIZ, SMYD, and SUV4-20 families - as well as a few orphan members such as SET7/9 and SET8 (also called PR-SET7; see Table for a list of the members of each family in humans and their properties). Proteins within each family have similar sequence motifs surrounding the SET domain, and they often also share a higher level of similarity in the SET domain.
Figure 1 A protein sequence alignment of the SET domains of several representative histone lysine methyltransferases (HKMT) grouped according to their histone-lysine specificity. All sequences are human with the exceptions of Saccharomyces cerevisiae SET1 and ...
Properties of some human SET-domain proteins
The SUV39 family has been characterized in most detail. Members of this family - human SUV39H1, murine Suv39h2, and Schizosaccharomyces pombe Cryptic loci regulator 4 (CLR4) - were the first SET-domain protein lysine methyltransferases to be characterized, following the discovery of sequence homology between their SET domains [13]. These proteins, with other members of the family such as D. melanogaster Su(var)3-9, specifically methylate lysine 9 of histone H3 (H3 K9) [13]. Human SUV39H1 and its closely related paralog, SUV39H2, are 55% identical at the amino-acid level. The structures of the genes encoding the two proteins are shown in Figure : both have six exons, and they have identical intron-exon junctions. It appears that SUV39H1 resulted from a recent gene-duplication event, as only mammals have two copies. The SUV39-family proteins of the frog Xenopus laevis (GenBank accession number AAH70805) and the zebrafish Danio rerio are more closely related to human SUV39H1 than to SUV39H2; they share 75% and 63% amino-acid identity with SUV39H1, respectively. The zebrafish gene also shares all intron-exon junctions with the human gene, and the frog gene shares at least four of the junctions. The D. melanogaster Su(var)3-9 protein is only 30% identical to human SUV39H1, and their genes share no intron-exon junctions. The S. pombe member of the SUV39 family, CLR4, is only 27% identical to the human protein and its gene contains no introns.
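Percent-identity figures such as those quoted above depend on how the alignment is scored. The Python sketch below shows one common convention: count identical residues over the alignment columns in which neither sequence has a gap. The text does not state which convention or alignment program was used for the values reported here, so this is an illustrative assumption; the function name `percent_identity` and the two toy sequences are hypothetical and are not real SUV39 sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the ungapped columns of a pairwise alignment.

    Both inputs must already be aligned (equal length, gaps marked with `gap`).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0   # columns where both residues are identical
    compared = 0  # columns where neither sequence has a gap
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy, made-up aligned fragments (hypothetical, for illustration only).
seq_a = "MKT-AYLLG"
seq_b = "MRTQAY-LG"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identical")  # prints "85.7% identical" (6 of 7 ungapped columns)
```

Other conventions, such as dividing by the full alignment length or by the length of the shorter sequence, give different numbers for the same alignment, which is worth keeping in mind when comparing identity values across studies.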
Figure 2 Schematic representations of the gene and primary protein structures of two pairs of related SET-domain histone methyltransferases in the SUV39 family. (a) Human SUV39H1 (gene, mRNA and protein); (b) human SUV39H2 (gene and mRNA for comparison with SUV39H1); ...
The members of the SUV39 family discussed above are involved in both euchromatin and heterochromatin, but another member of the same family, G9a, is the predominant histone H3 K9 methyltransferase in mammalian euchromatin [14]. There are two isoforms of G9a in the mouse: the short form (GenBank accession number NP_671493) corresponds to human G9a, and the long form (NP_665829), which lacks intron one, has additional Arg-Gly repeats at the amino terminus. No human expressed sequence tag (EST) corresponding to the long form of G9a has yet been isolated, although the sequence is present in the genome. Like SUV39H1, G9a has a closely related paralog in mammals, G9a-like protein 1 (GLP1). The human G9a gene has 28 exons and is about 17.3 kilobases (kb) long (Figure ). GLP1 is 45% identical to G9a and most of the divergence is in the amino-terminal third of the protein. The GLP1 gene has 25 exons - it lacks homologs of the first three introns of G9a - and the 20 exons from the 3' end have identical junctions to those found in G9a. The GLP1 gene is quite large, 120 kb in human and 92 kb in mouse, with introns as long as 16 kb (Figure ). No obvious orthologs of G9a or GLP can be found in the worm, frog or yeast genomes; in the D. melanogaster genome there is one gene (CAB65850) encoding a protein that is distantly related to human G9a (20% identity) or GLP (18% identity) in the carboxy-terminal half of the protein. The chicken genome also encodes one protein (CAH65313) that shares 75% identity with human GLP. Interestingly, a frog (Xenopus tropicalis) and three species of fish (D. rerio, Tetraodon nigroviridis, and Takifugu rubripes) all have both G9a and GLP in their genomes, although most have not yet been annotated as such. The zebrafish GLP ortholog (CAE49087) is 45% identical to human GLP, and the gene shares all but three of its 23 intron-exon junctions with human GLP1.
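Statements about shared intron-exon junctions rest on comparing intron positions after the encoded proteins have been aligned. A minimal Python sketch of that bookkeeping follows, assuming each junction is recorded as an alignment column plus an intron phase (0 for an intron lying between codons, 1 or 2 for an intron falling after the first or second nucleotide of a codon). The `Junction` type, the `shared_junctions` helper, and the coordinates are all hypothetical; they are not taken from the genes discussed here and do not reproduce the method actually used.

```python
from typing import Iterable, NamedTuple, Set


class Junction(NamedTuple):
    """An intron position mapped onto a protein alignment (hypothetical model)."""
    column: int  # alignment column of the codon at the intron position
    phase: int   # intron phase: 0, 1 or 2


def shared_junctions(gene_a: Iterable[Junction],
                     gene_b: Iterable[Junction]) -> Set[Junction]:
    """Junctions present at the same alignment column and phase in both genes."""
    return set(gene_a) & set(gene_b)


# Made-up junction lists for two hypothetical orthologous genes.
gene_x = [Junction(40, 0), Junction(112, 1), Junction(205, 2), Junction(310, 0)]
gene_y = [Junction(40, 0), Junction(112, 1), Junction(207, 2), Junction(310, 0)]

shared = shared_junctions(gene_x, gene_y)
print(f"{len(shared)} of {len(gene_x)} junctions shared")  # prints "3 of 4 junctions shared"
```

Requiring an exact match of both column and phase is a strict criterion; a real comparison might tolerate small shifts caused by alignment ambiguity.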
G9a and SUV39H1 both belong to the same family of SET-domain proteins and both have pre-SET and post-SET domains surrounding the SET domain, but they do not share any intron-exon junctions, even though a number of these junctions occur within the highly conserved SET domain. The other two SUV39 family proteins, ESET (also called SETDB1) and CLLD8 (SETDB2), also resemble each other significantly in genomic structure, but neither resembles G9a or SUV39H1 (data not shown). Several proteins in other SET families are also found in closely related pairs: EZH1 and EZH2 (members of the EZ family), MLL1 (also called HRX) and MLL2 (HRX2, both members of the SET1 family), SET1 and SET1L (SET1 family), NSD2 (WHSC1) and NSD3 (WHSC1L1; SET2 family), and SUV4-20H1 and SUV4-20H2 (SUV4-20 family).