Identification of 37 putative Arabidopsis genes encoding SET domain proteins
The SET domains from the proteins encoded by the Drosophila genes E(z), trx, Su(var)3-9 and ash1 were used for BLASTP and TBLASTX searches against the Arabidopsis non-redundant sequence databases. Proteins encoded by putative genes recognized by the hits in the BLAST searches were identified from the annotations in the databases. In total 37 putative Arabidopsis SET domain protein-coding genes (AtSET) (Table ) were found based on an E value inclusion threshold of <0.001 compared to one or more of the Drosophila SET domains.
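The inclusion rule described above (a hit is kept if its E value is below 0.001 against one or more of the query SET domains) can be sketched as a simple filter. This is only an illustration under stated assumptions: the tuple layout `(query, subject, evalue)`, the function name `select_candidates` and the gene names `gene_A` to `gene_C` are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict

E_VALUE_THRESHOLD = 1e-3  # inclusion threshold stated in the text (E < 0.001)


def select_candidates(hits, threshold=E_VALUE_THRESHOLD):
    """Return {subject: sorted query names} for every subject that is hit
    below the threshold by at least one query SET domain."""
    supported = defaultdict(set)
    for query, subject, evalue in hits:
        if evalue < threshold:  # strict inequality, as in "<0.001"
            supported[subject].add(query)
    return {subject: sorted(queries) for subject, queries in supported.items()}


# Hypothetical hits; subject names and E values are made up for illustration.
example_hits = [
    ("E(z)", "gene_A", 1e-40),
    ("trx", "gene_A", 5e-12),
    ("ash1", "gene_B", 2e-4),
    ("Su(var)3-9", "gene_C", 0.05),  # too weak, excluded
    ("trx", "gene_C", 0.001),        # equal to the threshold, excluded
]

candidates = select_candidates(example_hits)
print(candidates)
```

Recording which queries support each candidate keeps the "one or more of the Drosophila SET domains" criterion visible in the output rather than collapsing it to a yes/no flag.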
Putative genes encoding SET domain proteins in Arabidopsis
The AtSET genes can be divided into four classes based on their SET domains
SET domains of putative Arabidopsis proteins were aligned with selected proteins from Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, Caenorhabditis elegans and Homo sapiens using the ClustalX program and manual adjustment with the GeneDoc program. Protein predictions were corrected on the basis of: (i) comparison of gene predictions generated by different programs (GENSCAN, GeneMark and Gene finder); (ii) comparison of duplicated genes (see below); (iii) analysis of protein domains encoded by predicted neighboring genes. In some cases, alignments indicated that exons had been overlooked in the annotations (see Table and text below). These putative exons were added to the predicted proteins when confirmed by the alignments. Predicted exon–intron borders were checked against sequences of cDNAs, RT–PCR products or ESTs when available (see below and Table ).
The majority of the putative Arabidopsis SET domain proteins could easily be fitted into the alignment. However, seven putative proteins contained only parts of the 130–160 amino acid long domain (Fig. ). In addition, two domains (Fig. , ATXR5 and ATXR6) diverged substantially from all the others. A tree based on the alignment of 28 Arabidopsis SET domains and 12 such domains of proteins from other species was constructed by the neighbor joining method, using ClustalX (Fig. ). Bootstrap values >60% are shown.
Figure 1 Structure of Arabidopsis SET domain proteins. Protein sequences obtained from annotations in the EMBL and MIPS databases, adjusted by ESTs, sequences of RT–PCR products and cDNAs, were analyzed for conserved domains (see Materials and Methods).
Figure 2 Relationship between SET domain proteins of Arabidopsis and other organisms. The tree was constructed using the ClustalX program based on alignments of SET domains by ClustalX and manual adjustment. Figures indicate bootstrap values (1000 = 100%).
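Because the tree reports bootstrap support as counts out of 1000 replicates while the text quotes percentages and a display cut-off of 60%, the conversion can be sketched as follows. The function names and the third branch label are hypothetical; the counts 999 and 939 are chosen only because they correspond to the 99.9% and 93.9% values quoted in the text.

```python
def bootstrap_percent(support_count, replicates=1000):
    """Convert a bootstrap count (e.g. 999 of 1000 replicates) to a percentage."""
    return 100.0 * support_count / replicates


def supported_branches(counts, cutoff_percent=60.0, replicates=1000):
    """Keep only branches whose support is strictly above the cut-off."""
    result = {}
    for branch, count in counts.items():
        percent = bootstrap_percent(count, replicates)
        if percent > cutoff_percent:
            result[branch] = percent
    return result


# Branch labels are illustrative; the last entry is an invented weak branch.
example_counts = {
    "E(Z) group": 999,
    "TRX group": 939,
    "hypothetical weak branch": 412,
}

shown = supported_branches(example_counts)
print(shown)
```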
The Arabidopsis proteins MEDEA, CLF and EZA1 group together with Drosophila E(Z), its human counterpart EZH2 and C.elegans MES-2. The tree gives very good support (99.9%) for recognition of the E(Z)-like proteins of all species included as a distinct group. The E(Z) group of Arabidopsis encompasses two genes that are already known. The MEDEA gene is involved in inhibition of endosperm development in the absence of fertilization (12). Mutations in the CLF gene result in altered leaf morphology and also homeotic alterations in flower development (11). The last member in this group is EZA1, for which the mRNA has been cloned (AAD09108).
The tree also gives solid support (93.9%) for the grouping of five Arabidopsis proteins in a separate class together with TRX of Drosophila and its human (HRX) and yeast (SET1) homologs (Fig. ). Two of the genes encoding proteins of this class were recently described as ARABIDOPSIS TRITHORAX 1 and 2 (ATX1 and ATX2) (27). The additional genes we have identified were therefore named ATX3–ATX5 (Table and Fig. ). We could also align the SET domain of one ATX-RELATED protein (ATXR7), which showed less overall similarity with the ATX proteins (see below). The SET domain of this protein is most similar to that of SET1.
Putative Arabidopsis proteins with SET domains more similar to the Drosophila ASH1 and SU(VAR)3-9 proteins were also identified (Fig. ). Four proteins that group most closely together with ASH1 and its yeast homolog SET2 have the SET domain placed in the central region, as do other ASH1 class proteins. The genes encoding them were consequently named ASH1 HOMOLOG 1 to ASH1 HOMOLOG 4 (ASHH1–ASHH4; Table and Fig. ). A fifth protein has the SET domain at the C-terminus and was therefore named ASH1-RELATED (ASHR3; Fig. ) (For ASHR1 and ASHR2 see below).
The remaining Arabidopsis SET domains are most closely related to SU(VAR)3-9 and its human (SUV39H) and S.pombe (CLR4) homologs. We have called 10 of the encoding genes SU(VAR)3-9 HOMOLOGS (SUVH1–SUVH10; Table and Fig. ). The SUVH proteins are clustered in the tree and have a common additional domain (see below). There are high bootstrap values for branches within this group, SUVH1, SUVH3, SUVH7 and SUVH8 (99.6%); SUVH5 and SUVH6 (89.3%); and SUVH2 and SUVH9 (100%). SUVH10 seems to be a copy of the SUVH class that has suffered an internal deletion that removed a part of the region encoding the SET domain.
The proteins encoded by the remaining SU(VAR)3-9-RELATED (SUVR1–SUVR5) genes are more diverse, with a separate branch for SUVR1, SUVR2 and SUVR4. The SET domains of these three proteins are most similar to that of the human G9a protein (Figs and ). SUVR4 seems most closely related to SUVR1 and appears to have been generated after a deletion resulting in the removal of nearly 290 amino acids, but leaving the C-terminus, including the SET domain and surrounding cysteine-rich regions, intact.
Figure 3 Alignment of SET domains and flanking cysteine-rich regions of the four classes of SET domain proteins. The SET domains of all proteins are perfectly aligned from the GWG motif (positions 149–151), while the cysteine-rich domains N-terminal to (more ...)
The relationship between the putative proteins with truncated or deviating SET domains and the Arabidopsis SET domain classes was analyzed separately. Six were most similar to the domain of the ATX group and the putative encoding genes were therefore called ATXR1–ATXR6 (Table and Fig. ). Two proteins have a truncated domain most similar to that of the ASHH group and the putative genes were named ASHR1 and ASHR2 (Table and Fig. ).
Distinct cysteine-rich domains are present in the four classes of AtSET proteins
After assignment of the identified proteins to different classes based on their SET domains, characteristics of the other parts of the proteins were investigated by comparison to the homologs of other species and searches in the conserved domain databases Smart, Pfam-A and Profile prosite.
In addition to distinct differences in the amino acid sequence of the SET domain, the four classes of proteins have other significant characteristics (10; Fig. ). Proteins of the E(Z) class have a region with 16–18 cysteine residues spaced in a given pattern in front of the C-terminal SET domain. Proteins of the SU(VAR)3-9 class have a SET domain-associated cysteine-rich region (SAC) with seven to eight cysteines in certain positions in front of the SET domain (N-SAC) and three C-terminal cysteines in the pattern CXC(X)4C (C-SAC) after the SET domain. The C-SAC is also found in TRX class proteins, which lack a cysteine-rich region N-terminal to the SET domain. ASH1 class proteins have, in contrast to the other three classes, the SET domain centrally placed. Their SET domains are preceded by a cysteine-rich region and followed by the C-SAC pattern. The number and spacing of the cysteine residues in the N-terminal cysteine region differ from that of the E(Z) C-rich region and also from the N-SAC.
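The C-SAC pattern of three cysteines, CXC(X)4C, reads as "C, any residue, C, any four residues, C" and can be written as a regular expression. The following is a minimal sketch, assuming that X stands for any single residue; the function name `find_c_sac` and both test sequences are invented for illustration and are not real protein sequences.

```python
import re

# CXC(X)4C: Cys, any residue, Cys, any four residues, Cys.
C_SAC = re.compile(r"C.C.{4}C")


def find_c_sac(sequence):
    """Return the 0-based start positions of non-overlapping C-SAC matches."""
    return [match.start() for match in C_SAC.finditer(sequence)]


# Invented sequences: the first has the correct spacing, the second has only
# three residues between the second and third cysteine.
with_motif = "MKTACGCAAAACLL"
without_motif = "MKTACGCAAACLL"

print(find_c_sac(with_motif))
print(find_c_sac(without_motif))
```

A match of this short pattern is only a hint; whether a hit is a genuine C-SAC also depends on its position directly after the SET domain, which this sketch does not check.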
The similarities between the E(Z), MEA and CLF proteins have been recognized previously (11). In addition to the SET domain, these proteins have a C-rich stretch and the so-called domain II in common with the E(Z) proteins of other organisms (Figs and ). The C-rich region is also present in the protein encoded by EZA1 (Figs and ).
All the ATX proteins have a complete C-SAC motif, while the ATXR proteins lack at least one of the C-terminal cysteines (Figs and ). All four ASHH proteins have the C-SAC motif and the cysteine-rich domain conforming to ASH1 class proteins (Figs and ). The ASHR proteins lack the C-rich regions with the exception of the C-SAC motif of ASHR3 (Figs and ).
A complete SAC domain is present in 10 SUVH and SUVR proteins (Figs and ). Truncated N-SAC regions are present in SUVR3, SUVR5 and SUVH10, while SUVH2 and SUVH9 only have one C-terminal cysteine residue.
A novel domain called the YDG domain is present in SUVH proteins
Alignments of the N-terminal part of the SUVH proteins and the use of domain-finder programs revealed that these 10 proteins contain a conserved domain also found in non-SET proteins containing a RING finger motif (in mammals and Arabidopsis) or an HNH nuclease motif (in the bacterium Deinococcus radiodurans) (Fig. A). We have chosen to call the 150–170 amino acid long region the YDG domain because of a characteristic YDG motif. Further characteristics of the domain are the conservation of up to 13 evenly spaced glycine residues and a VRV(I/V)RG motif (Fig. A).
Figure 4 Alignment of domains found in SET domain proteins. (A) YDG domain. Note that the first six amino acids (GLVPGV) of SUVH10 are from another reading frame, followed by 11 amino acids (DVGDIFFFRGE) from the same frame as the annotated ORF (T6P5.10).
PWWP domains, PHD fingers and extended PHD fingers are found in ATX proteins
In all the ATX proteins the PWWP motif, first identified in the human protein WHSC1 (38), was found (Fig. C). WHSC1 is most closely related to the ASH1 class of SET domain proteins, but we did not identify this domain in any of the Arabidopsis ASHH or ASHR proteins. The PWWP domain is present in a diverse group of nuclear proteins (38) and typically has conserved PWWP residues. The first proline residue is present in three of the five ATX proteins, but none of them contain the first of the two tryptophans. The most conserved motifs are GDΦΦWXK (where Φ are hydrophobic residues), WPAΦΦΦD and VXFFG (Fig. C).
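The consensus notation used for the PWWP motifs (Φ for a hydrophobic residue, X for any residue) can be translated mechanically into regular expressions. The sketch below assumes a particular hydrophobic residue set (A, V, I, L, M, F, W, Y); the text does not define which residues count as Φ, so this set, the helper name `motif_to_regex` and the test peptides are all assumptions made for illustration.

```python
import re

# Assumed hydrophobic residue set; the source does not specify it.
HYDROPHOBIC = "AVILMFWY"


def motif_to_regex(motif, hydrophobic=HYDROPHOBIC):
    """Translate a consensus such as 'GDΦΦWXK' into a compiled regex.
    'Φ' becomes a hydrophobic character class, 'X' matches any residue,
    every other symbol is matched literally."""
    parts = []
    for symbol in motif:
        if symbol == "Φ":
            parts.append("[" + hydrophobic + "]")
        elif symbol == "X":
            parts.append(".")
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))


# The three consensus motifs quoted in the text.
PWWP_MOTIFS = {name: motif_to_regex(name) for name in ("GDΦΦWXK", "WPAΦΦΦD", "VXFFG")}

# Invented peptide containing GD-LV-W-T-K starting at position 2.
hit = PWWP_MOTIFS["GDΦΦWXK"].search("AAGDLVWTKAA")
print(hit.start() if hit else None)
```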
In four of the five ATX proteins, as well as in ATXR5 and ATXR6, we identified amino acid motifs similar to the PHD finger (Figs and B) found in the Drosophila and mammalian TRX/HRX proteins and a number of other nuclear proteins (26). The characteristic C4 pattern is present once or twice in the Arabidopsis proteins. In the ATX proteins the PHD fingers are situated about midway between the PWWP motif and the SET domain.
Finally, the ePHD motif (25) was found in all the ATX proteins, positioned just after the PHD finger (Fig. D). The second half of this motif resembles a PHD finger (compare conserved cysteine and histidine residues).
The DAST motif recently identified in ATX1, ATX2, TRX and HRX (27) was not found in the other ATX proteins.
Putative nuclear localization signals and AT-hooks are found in many AtSET proteins
The domain-finder programs recognized putative bipartite nuclear localization signals (NLSs; see Fig. ) in MEA and CLF, but not in EZA1. Among the ATX and ATXR proteins one or two such NLSs were identified in all proteins but four (ATX1, ATX4, ATXR2 and ATXR4). This signal was also found in three proteins of the ASH groups (ASHH2, ASHH4 and ASHR3) and three proteins in the SUV groups (SUVH6, SUVH4 and SUVR1). We cannot exclude the possibility that other types of NLSs are present in the other proteins.
Putative AT-hooks, which mediate protein binding to the minor groove of AT-rich tracts in DNA (39), were identified in three of the SUVH proteins, in ASHH1 and in ATXR7 (Figs and E). This motif has a characteristic GRP core.
Pairs of AtSET genes are found in large genomic duplications
The chromosomal positions of the 37 putative genes encoding SET domains are spread over all five Arabidopsis chromosomes (Table ). The MIPS Interactive Redundancy Viewer was used to investigate whether any of these genes were positioned in duplicated regions of the genome. Five likely gene pairs were found: MEA and EZA1 seem to be part of a large duplication between chromosomes I and IV; ATX1 and ATX2 belong to a duplication between chromosomes I and II; ATX4 and ATX5 are found in a duplication on chromosomes IV and V; ASHH3 and ASHH4 are found in a duplication on chromosomes II and III; and SUVH3 and SUVH7 are found in a duplication on different regions on chromosome I (Table ). In addition, SUVR1 is in an area on chromosome I that shares duplicated regions with the area on chromosome V where SUVR2 is positioned.
In all cases, members of gene pairs belonged to the same class of SET domain genes. The encoded proteins were compared pair-wise and could be aligned along their total lengths (data not shown). The positions of annotated exons and introns in gene pairs were also very similar. Twelve introns of 16 were in identical positions in MEA/EZA1, 20 of 23 in ATX1/ATX2, all 20 introns in ATX4/ATX5 and eight of the 11 and 10 introns in ASHH3 and ASHH4, respectively.
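Counts such as "eight of the 11 and 10 introns" come from comparing the intron positions of two paralogs along their aligned coding sequences. A minimal sketch of that comparison is given below, assuming each intron is represented by its position in a shared alignment coordinate; the function name and all position values are hypothetical and do not correspond to the real genes.

```python
def shared_intron_summary(positions_a, positions_b):
    """Return (shared, total_a, total_b) for two genes whose intron positions
    are given in a common alignment coordinate system."""
    shared = set(positions_a) & set(positions_b)
    return len(shared), len(positions_a), len(positions_b)


# Invented positions: three introns coincide, one is shifted (120 vs 121)
# and the second gene carries one extra intron.
gene_a_introns = [10, 55, 120, 200]
gene_b_introns = [10, 55, 121, 200, 260]

shared, total_a, total_b = shared_intron_summary(gene_a_introns, gene_b_introns)
print(f"{shared} of the {total_a} and {total_b} introns are in identical positions")
```

Exact set intersection treats an intron shifted by a single position as non-identical, which matches the strict "identical positions" wording used in the text.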
The majority of SUVH group ORFs are intronless
The number of annotated introns in all the Arabidopsis proteins and their positions in the SET domain were compared. In the majority, numerous introns are present and up to five were found within the SET domain (Table ). In the Arabidopsis E(Z) class genes the positions of these introns are conserved. However, intron positions differ from those in the E(Z) proteins of other species (data not shown). For the other classes, identical intron positions are only found between closely related pairs of genes (cf. above).
In contrast to the majority of genes, ATXR1 and SUVR3 contain one intron only, which for SUVR3 is found in the SET domain-encoded region. Among the 10 SUVH genes all but SUVH4 have intronless ORFs (Table ). In contrast, the SUVH4 gene contains 13 introns.
The AtSET genes are active
For each of the putative AtSET genes different databases were examined for the presence of matching ESTs and cDNA sequences (Table ). As mentioned above, the cDNAs from the genes in the E(z) class and recently also two ATX genes have been cloned by others (11). RT–PCR and cDNA cloning were used to verify expression of additional AtSET genes. cDNAs for SUVH1 (Fig. A) and SUVH7 (not shown) confirmed these genes as being intronless and showed that an annotated intron in the genomic region corresponding to SUVH7 (F2H15.1) is not spliced out. This results in an ORF encoding a protein that contains the C-SAC motif but is shorter (693 amino acids) than the annotated protein (954 amino acids).
Figure 5 RT–PCR expression analyses. (A) Agarose gels stained with ethidium bromide showing cDNA fragments of SUVH1, SUVH2, SUVH3, SUVH4, SUVH5 and AtCyclophilin (positive control; 63) amplified by RT–PCR using gene-specific primers. RT–PCR (more ...)
For SUVH2, 5′-RACE and RT–PCR showed no intron in the leader sequence and the putative ORF, but an intron of 83 bp in the trailer of the transcript (Fig. A). SUVH3 has matching ESTs (AA728521, AI998299 and T04123) which together with RT–PCR could be extended to an almost complete cDNA containing the expected intronless ORF. However, in the leader sequence of the SUVH3 gene there are two introns of 464 and of 111 bp (Fig. A).
The SUVH4 transcript, identified from two λZAP cDNA clones and by RT–PCR, is 2.1 kb long, and sequencing confirmed the presence of 13 introns (Fig. A). Expression of SUVH1, SUVH2, SUVH3, SUVH5 and SUVH6 is supported by the presence of matching ESTs in the databases generated from rosette leaves, roots and/or developing seeds (Table ). Expression of five SUVH genes was detected by RT–PCR in seeds, roots, leaves, stems, flowers and/or siliques (Fig. A and Table ). Only SUVH1 seems to be expressed in roots. We did not succeed in RT–PCR amplification of SUVH8 and SUVH10 in any of the tissues tested. Analysis of the DNA sequence upstream of the annotated SUVH10 gene indicates that this gene has been inactivated by mutations. The database protein sequence of SUVH10 (T6P5.10) starts just inside the YDG domain (see Fig. A). However, this domain would be completely contained in SUVH10 if 1 nt were inserted 33 nt before the start codon. This would lengthen the putative ORF by about 279 nt (including a full YDG domain).
Expression pattern of genes encoding SET domain proteins in Arabidopsis
Primers designed to investigate whether SUVR1, SUVR2, SUVR3 and SUVR4 were expressed successfully amplified RT–PCR products that were shorter than their genomic counterparts due to the presence of introns (Fig. B). Expression of SUVR3 is further confirmed by corresponding ESTs. An additional intron in the C-terminal part of the SUVR2 gene changes the amino acids between the C-SAC and the stop codon, as compared to the annotated protein sequence (MRH10.10). Sequence analysis of SUVR4 revealed the omission of an exon in the C-terminal region of the annotated protein (T27C4.2). This exon contains the C-SAC motif and renders the protein 477 amino acids long, not 424 as annotated. Two sets of SUVR5 primers were designed, but neither produced any RT–PCR product. This putative gene is annotated either as one large gene (see Fig. ) or three separate genes of which one contains only the SET domain (see Table ).
In the ATX group, ESTs matching four genes have been cloned from developing seeds and aerial organs (Table ). EST sequences are also found which correspond to the four ATXR genes with truncated SET domains (Table ). These are from inflorescences and aerial organs (Table ). Expression of the ATX1 genes was confirmed by amplification of their central parts by RT–PCR using mRNA from different organs (Fig. B and Table ). Our RT–PCR products and RACE verified that the annotations of ATX1 and ATX2 are not in agreement with the cDNA sequences, as noted recently (27). In the EMBL database, the region encoding the ATX1 transcript is annotated as three separate genes (T9H9.15, T9H9.16 and T9H9.17). The ATX2 gene (T20M3.10) is annotated with two additional exons at the C-terminus, resulting in an amino acid extension after the C-SAC which would not agree with the notion that TRX class proteins have the C-SAC at their C-terminus. For ATX3, sequence analyses revealed the presence of a GWG motif at the beginning of the SET domain, seven additional amino acids in the ePHD domain and an exon encoding 25 amino acids in the SET domain, in contrast to the annotated protein (F15G16.130). Two exons missing in ATX4 were revealed by comparison to the matching EST AV524242. The annotated ATX4 protein (T13J8.20) terminates just after the NHSC motif in the SET domain. The additional two exons (T13J8.30) extend the C-terminal end of the protein so as to give a complete SET domain and C-SAC.
ESTs matching ASHH1, ASHH2 and ASHH3 have been found in developing seeds (Table ). RT–PCR using mRNA from different tissues (Fig. B and Table ) confirmed that these genes are active.
To show that the gene products of the SET domain coding transcripts are nuclear proteins, as already shown for all functionally described SET domain proteins to date (see for example 4), transient expression assays were used (34) (see Materials and Methods). Constructs containing an in-frame fusion of the GUS gene, or a red-shifted GFP gene, and a cDNA encoding CLF in a transient expression vector were shot into the inner epidermis of onions using a particle gun. Whereas the GUS protein alone is not localized to the nucleus, all of the fusion protein variants became concentrated in the nucleus (data not shown). The transiently expressed GFP fusions confirmed that nuclear transport was not an artifact of the test system: in contrast to GFP alone, all proteins became concentrated in the nucleus (Fig. A–E). Drosophila SU(VAR)3-9 also showed nuclear localization in onion cells (Fig. F).
Figure 6 Nuclear localization of SET domain proteins in onion epidermis transient expression assay with the plant GFP reporter system. Histochemical localization of GFP activity following bombardment of onion epidermal cell layers with DNA constructs expressing (more ...)