Modified bases in nucleic acids present a layer of information that directs biological function over and beyond the coding capacity of the conventional bases. While a large number of modified bases have been identified, many of the enzymes generating them still remain to be discovered. Recently, members of the 2-oxoglutarate- and iron(II)-dependent dioxygenase superfamily, which modify diverse substrates from small molecules to biopolymers, were predicted and subsequently confirmed to catalyze oxidative modification of bases in nucleic acids. Of these, two distinct families, namely the AlkB and the kinetoplastid base J binding proteins (JBP) catalyze in situ hydroxylation of bases in nucleic acids. Using sensitive computational analysis of sequences, structures and contextual information from genomic structure and protein domain architectures, we report five distinct families of 2-oxoglutarate- and iron(II)-dependent dioxygenase that we predict to be involved in nucleic acid modifications. Among the DNA-modifying families, we show that the dioxygenase domains of the kinetoplastid base J-binding proteins belong to a larger family that includes the Tet proteins, prototyped by the human oncogene Tet1, and proteins from basidiomycete fungi, chlorophyte algae, heterolobosean amoeboflagellates and bacteriophages. We present evidence that some of these proteins are likely to be involved in oxidative modification of the 5-methyl group of cytosine leading to the formation of 5-hydroxymethyl-cytosine. The Tet/JBP homologs from basidiomycete fungi such as Laccaria and Coprinopsis show large lineage-specific expansions and a tight linkage with genes encoding a novel and distinct family of predicted transposases, and a member of the Maelstrom-like HMG family. We propose that these fungal members are part of a mobile transposon. 
To the best of our knowledge, this is the first report of a eukaryotic transposable element that encodes its own DNA-modification enzyme with a potential regulatory role. Through a wider analysis of other poorly characterized DNA-modifying enzymes we also show that the phage Mu Mom-like proteins, which catalyze the N6-carbamoylmethylation of adenines, are also linked to diverse families of bacterial transposases, suggesting that DNA modification by transposable elements might have a more general presence than previously appreciated. Among the other families of 2-oxoglutarate- and iron(II)-dependent dioxygenases identified in this study, one, which is found in algae, is predicted to mainly comprise RNA-modifying enzymes and shows a striking diversity in protein domain architectures, suggesting the presence of RNA modifications with possibly unique adaptive roles. The results presented here are likely to provide the means for future investigation of unexpected epigenetic modifications, such as hydroxymethyl cytosine, that could profoundly impact our understanding of gene regulation and processes such as DNA demethylation.
Catalytic modification of bases in nucleic acids is universally observed across the three primary superkingdoms of life and is the basis for a wide range of biological functions.1 Certain modifications of bases in rRNAs and tRNAs, such as methylation, thiouridylation and pseudouridylation, are traceable to the last universal common ancestor of all life and are essential for survival.2,3 Other RNA base modifications are more limited in their distribution. For example, wybutosine is found only in archaeal and eukaryotic tRNAs, whereas certain forms of methylation and thiouridylation show even more narrow phyletic distributions.2,3 In contrast, DNA base modifications are apparently less diverse and more sporadic in their distribution; enzymes catalyzing them are not essential in most lineages of life.4 This difference is potentially attributable to the constraint of needing to maintain double-helical pairing in DNA and protecting the genetic material from the potentially mutagenic effects of base modifications.5
The most common DNA modification in prokaryotes, methylation of cytosine or adenine, is primarily catalyzed by methylases encoded by mobile restriction-modification (RM) systems.4,6 These methylases have a predominantly defensive role in immunizing the host DNA against the activity of the restriction endonucleases, which cleave invading DNA, such as those of bacteriophages.7,8 In certain prokaryotes, DNA methylation additionally supplies an epigenetic mark for DNA repair.9 Eukaryotes too possess several distinct DNA cytosine methylases related to the bacterial RM methylases. These have been shown to have a role in chromatin organization, regulatory gene silencing, repression of selfish DNA elements, and possibly other epigenetic processes in several animals, fungi and plants.10–13 DNA modifications other than methylation are primarily known from caudate bacteriophages and include a spectacular array of modified bases such as 5-hydroxymethylpyrimidines and their mono- or di-glycosylated derivatives, α-putrescinylated or α-glutamylated thymines, sugar-substituted 5-hydroxypentyluracil, and N6-carbamoylmethyl adenines (called Momylation after the mom enzyme of phage Mu).8,14 These atypical modifications are used by phages in countering the host DNA restriction response. Other DNA base modifications have become apparent in eukaryotes. The simplest of these is deamination of cytosine that appears to be mainly involved in diversification of immunity molecules in vertebrates.15–17 Another well-studied eukaryotic modification is the formation of β-D-glucosyl-hydroxymethyluracil or base J from thymine in euglenozoans, including the parasites Trypanosoma and Leishmania.18
While enzymes catalyzing several of the major RNA modifications have been biochemically well-characterized and crystallized, fewer DNA-modifying enzymes have been studied in detail. Of the latter, the well-studied ones are DNA methylases, namely the 5-methylcytosine-generating methylases of both bacteria and eukaryotes and the bacterial N6-methyladenine-generating enzymes.19–22 Additionally, the classic T-even phage modification system, comprising the 5-hydroxymethylcytosine (hmC) synthase and the glucosyltransferases that further modify this base, has been characterized. These studies have revealed that phage 5-hydroxymethylcytosine and 5-hydroxymethyluracil synthases are derived from the classical thymidylate synthases, which are often encoded by several DNA viruses, including these T-even phages.23 Thus, phage 5-hydroxymethylpyrimidines are not derived by direct DNA modification but by incorporation of a pre-modified base during viral DNA synthesis. On the other hand, phage DNA base glycosyltransferases (which modify the 5-hydroxymethylpyrimidines) are members of the glycogen synthase/glycogen phosphorylase fold (e.g., alpha-glucosyltransferase and beta-glucosyltransferase) or the Fringe-like glucosyltransferase (e.g., beta-glucosyl-HMC-alpha-glucosyltransferase), which directly modify the hmC in DNA.24,25 Likewise, the phage Mu enzyme, Mom, directly modifies adenines in DNA by adding a carbamoylmethyl or a related adduct, and was recently shown to belong to the GCN5-like acetyltransferase fold.26 Other recent studies have shown that the first step in the synthesis of base J in trypanosomes, i.e., oxidation of the methyl group on thymine to generate 5-hydroxymethyluracil, occurs in situ in DNA. 
This reaction is catalyzed by JBP1 and JBP2, enzymes of the 2-oxoglutarate- and iron(II)-dependent dioxygenase (2OGFeDO) superfamily, and represents the first example of in situ oxidative modification of methylpyrimidines, in contrast to the T-even phage hmC generation pathways in which premodified bases are incorporated into DNA.27,28
Other enzymes of the 2OGFeDO superfamily catalyze a variety of oxidative reactions such as:
(1) Oxidation of carbons in an aromatic ring to generate phenolic groups; e.g., hydroxylases in flavonoid synthesis.29 (2) Oxidation of aliphatic and alicyclic carbons e.g., amino acid modifications in proteins, namely hydroxylysine and hydroxyproline catalyzed respectively by lysyl and prolyl hydroxylases.30 (3) Ring opening/closing via C-N and C-S bond formation; e.g., isopenicillin synthase.31 (4) Oxidation of C-C bond in side-chains linked to an aromatic ring; e.g., thymine-7-hydroxylase, which oxidizes thymine to carboxyuracil in the thymidine/uridine salvage in fungi and bacteria.32 Trypanosome JBP1 and JBP2 also catalyze this class of reaction, albeit on DNA rather than the free base.27,28 (5) Demethylation of N-CH3 side chains linked to heterocyclic aromatic rings. This is typified by the AlkB family that functions in DNA repair by reversing methyl adducts on bases (e.g., N6-methyladenine) produced by DNA alkylating agents via complete oxidation of the methyl group to formaldehyde.33,34 Although both AlkB and JBP1/2 operate on methyl groups on bases in DNA their catalytic domains are only distantly related. This suggested that there might be as yet undetected enzymes that catalyze the oxidative modification of DNA in this superfamily. Most enzymes of this superfamily that act on low-molecular weight substrates are standalone proteins with compact dioxygenase domains. However, those that act on biopolymers like nucleic acids and proteins are frequently fused to other nucleic-acid- or protein-interacting domains (e.g., Swi2/Snf2 ATPase module in JBP2,35 and the MYND finger in Egl-9 like prolyl hydroxylases34). 
Alternatively, they contain peculiar conserved inserts within the catalytic domain that help in binding their biopolymer targets (e.g., AlkB).36 We accordingly hoped to utilize these features as contextual information in a computational protocol to identify potentially novel members of the 2OGFeDO superfamily that catalyze in situ oxidative modifications of nucleic acids.
We first identified novel 2OGFeDO domains by means of iterative profile searches with the PSI-BLAST program using several seeds including versions of these domains from JBP1 and JBP2, AlkB, prolyl hydroxylases and several low molecular weight compound dioxygenases, such as the thymine-7-hydroxylase and isopenicillin synthase. In some cases these profile searches converged rather rapidly; hence, we improved the profiles via further searches of the protein sequence database of uncultured microbes from environmental samples. For example, a search of the NCBI non-redundant (NR) database using the 2OGFeDO domains of JBP1 and JBP2 converged within 3 iterations. However, upon searching the NCBI environmental sample database, we identified numerous homologous proteins potentially derived from uncultured marine organisms. Including these hits in the profile for a renewed search of the NR database resulted in the detection of homologous oxygenase domains in the gp2 proteins from the mycobacteriophages Cooper and Nigel and a related prophage integrated into the genome of Frankia alni (e-values < 10⁻⁴). Further iterations of these searches recovered homologous regions in the 3 paralogous human oncogenes Tet1 (CXXC6), Tet2 and Tet3,37,38 and their orthologs found throughout metazoa (e < 10⁻⁵). These searches also recovered, with significant e-values (e < 10⁻⁵), a vast expansion of homologous domains from the mushrooms Laccaria bicolor and Coprinopsis cinerea and smaller expansions in the chlorophyte algae Chlamydomonas reinhardtii and Volvox carteri. Searches against a panel of eukaryotic proteomes using the profile generated from the above search also recovered a few representatives from the heterolobosean amoeboflagellate Naegleria, the stramenopile algae Aureococcus, Emiliania, Phaeodactylum and Thalassiosira, and the chlorophyte algae Ostreococcus and Micromonas. 
In reciprocal PSI-BLAST searches these versions consistently recovered each other prior to recovering any other member of the 2OGFeDO superfamily, suggesting that they formed a distinctive family comprised of JBP1/2, the animal Tet proteins and their homologs.
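The reciprocal-recovery criterion used above can be illustrated with a minimal sketch. It assumes that the ranked hit lists from each search have already been parsed into Python lists; the function name, the data layout and the toy identifiers are all hypothetical and are not part of the original analysis, which relied on inspection of PSI-BLAST output.

```python
def recovers_family_first(ranked_hits, family):
    """Check that every family member retrieves all other members
    before it retrieves any sequence from outside the family.

    ranked_hits: dict mapping a query ID to its hit IDs, best hit first.
    family: iterable of IDs proposed to form one family.
    """
    members = set(family)
    for query in members:
        recovered = set()
        for hit in ranked_hits.get(query, []):
            if hit == query:
                continue  # ignore the self-hit
            if hit not in members:
                break  # first outsider reached; stop collecting
            recovered.add(hit)
        if recovered != members - {query}:
            return False
    return True


# Toy, made-up hit lists (placeholders, not real search output).
toy_hits = {
    "JBP1": ["JBP1", "JBP2", "Tet1", "AlkB"],
    "JBP2": ["JBP2", "JBP1", "Tet1", "AlkB"],
    "Tet1": ["Tet1", "JBP1", "JBP2", "P4H"],
}
print(recovers_family_first(toy_hits, ["JBP1", "JBP2", "Tet1"]))
```

This strict all-before-any-outsider rule is only one possible reading of "consistently recovered each other"; a more tolerant version could compare e-values or allow a small number of exceptions.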
Likewise, profile searches with the other queries also recovered a large number of previously undetected 2OGFeDO domains. To identify versions among these that potentially act on nucleic acids, we used a library of sequence profiles for domains involved in nucleic acid metabolism and chromatin function and scanned all the newly detected 2OGFeDO domain-containing proteins for fusions to any of these domains. As a result we identified conserved fusions to different DNA-associated domains such as SAD(SRA), R3H, DNA glycosylase, Swi2/Snf2 ATPase and TAM(MBD),11 and also several RNA-associated domains such as the RRM, pseudouridine synthase, pyrimidine carboxylase fold and RNA methylase domains.2 Additionally, some of the proteins with 2OGFeDO domains more closely related to JBP1/2 were linked in the same polypeptide to the DNA-binding CXXC domain and the chromatin-associated chromodomain. Additional evidence for a possible role in nucleic acid modification was also obtained through systematic analysis of gene neighborhoods and genomic organization (see below for details).
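The fusion-based screen described above can be sketched as a simple set lookup, assuming domain architectures have already been determined by profile searches and are available as lists of domain names. The grouping of domains into DNA- and RNA-associated sets follows the text; the function name, the data layout and the protein identifiers are hypothetical.

```python
# Domain names taken from the text; the split into two sets follows the
# way the domains are described there and is only illustrative.
DNA_DOMAINS = {
    "SAD(SRA)", "R3H", "DNA glycosylase", "Swi2/Snf2 ATPase",
    "TAM(MBD)", "CXXC", "chromodomain",
}
RNA_DOMAINS = {"RRM", "pseudouridine synthase", "RNA methylase"}


def classify_fusions(architectures, catalytic="2OGFeDO"):
    """Label each protein carrying the catalytic domain by the kind of
    nucleic-acid-associated domains it is fused to.

    architectures: dict mapping a protein ID to its list of domain names.
    Returns a dict of protein ID -> list of labels (possibly empty).
    """
    labels = {}
    for protein, domains in architectures.items():
        present = set(domains)
        if catalytic not in present:
            continue  # not a dioxygenase; outside the scope of the screen
        found = []
        if present & DNA_DOMAINS:
            found.append("DNA/chromatin")
        if present & RNA_DOMAINS:
            found.append("RNA")
        labels[protein] = found
    return labels


# Hypothetical architectures for illustration only.
toy = {
    "protA": ["CXXC", "2OGFeDO"],
    "protB": ["2OGFeDO", "RNA methylase"],
    "protC": ["2OGFeDO"],
    "protD": ["RRM"],
}
print(classify_fusions(toy))
```

Proteins with an empty label list correspond to standalone dioxygenase domains, for which other contextual evidence such as gene neighborhoods would be needed.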
We then clustered these proteins using the BLASTCLUST program and further refined these clusters based on conserved, shared sequence signatures and predicted structure features, and domain architectures to identify 5 distinct families. We aligned each of these families and an examination of their conservation patterns (Fig. 1) showed that they typically conserved: (1) The HxD signature (where ‘x’ is any amino acid), which chelates Fe(II) and is associated with the extended region after the first core strand. (2) A pair of small residues at the end of the strand immediately downstream of the HxD motif, which helps in positioning the active site arginine. (3) The HXs (where ‘s’ is a small residue) in the penultimate conserved strand, in which the H chelates the Fe(II) and the small residue contacts the 2-oxo acid cofactor. (4) The RX5a/R signature (where ‘a’ is usually an aromatic residue: F, Y, W) in the last conserved strand of the domain. The first R in this motif forms a salt bridge with the 2-oxo acid and the aromatic residue or second arginine helps in positioning the first metal-chelating histidine (Fig. 1).34 We also generated HMMs for each of the families using their multiple alignments and performed a profile-profile comparison of these HMMs against a library of HMMs generated for all structurally characterized domains (i.e., domains from PDB) using the HHpred program. These searches uniformly recovered known 2OGFeDO structures (e.g., prolyl hydroxylase, 2jij or AlkB, 2fd8) with significant p-values (p < 10⁻¹²) and an alignment spanning all the key catalytic residues (Fig. 1). Together, these observations indicated that we had indeed identified several novel 2OGFeDOs predicted to oxidatively modify nucleic acids.
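As a rough illustration of the conserved signatures listed above, the HxD, HXs and RX5a/R motifs can be written as regular expressions and scanned for in N- to C-terminal order. This is only a crude sketch: the choice of G, A and S as "small" residues is an assumption, the test sequence is artificial, and such pattern matching is far less sensitive than the profile and HMM comparisons actually used in the analysis.

```python
import re

# Illustrative encodings of the signatures described in the text.
# The set of "small" residues (G, A, S) is an assumption.
MOTIFS = {
    "HxD": r"H.D",
    "HXs": r"H.[GAS]",
    "RX5a/R": r"R.{5}[FYWR]",
}


def find_motifs(seq):
    """Return {motif name: [0-based start positions]} for a protein sequence."""
    seq = seq.upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        # A lookahead reports overlapping matches as well.
        hits[name] = [m.start() for m in re.finditer("(?=" + pattern + ")", seq)]
    return hits


def has_ordered_signature(seq):
    """True if HxD, HXs and RX5a/R occur in that order without overlapping."""
    hits = find_motifs(seq)
    if not hits["HxD"]:
        return False
    first = hits["HxD"][0]
    second = [p for p in hits["HXs"] if p >= first + 3]
    if not second:
        return False
    return any(p >= second[0] + 3 for p in hits["RX5a/R"])


# Artificial sequence constructed to contain the three motifs in order.
toy_seq = "MKHADGGGGHLSAAAARLLLLLFKK"
print(find_motifs(toy_seq))
print(has_ordered_signature(toy_seq))
```

In practice a filter of this kind would produce many false positives on short motifs such as HxD, which is one reason the conserved secondary-structure context of each motif matters.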
We outline below the 5 families along with their inferred evolutionary histories and predicted functional features based on domain architectures and other forms of contextual information such as genomic organization and gene neighbors (Table 1).
This family is defined by all 2OGFeDOs that are closer to the kinetoplastid JBP proteins and the metazoan Tets than any other family of dioxygenases. They are characterized by a shared derived character (synapomorphy) in the form of an extended α-helix just N-terminal to the first core strand (Fig. 1). This long α-helix appears to be kinked in most members of the Tet/JBP family by a conserved proline in the middle of the helix. This family can be divided into 5 distinct subfamilies (Table 1). At least a subset of members of each of the subfamilies shows either fusions to DNA-binding or chromatin-associated protein domains, or gene-neighborhood/genome-context associations with known DNA-binding domains (Table 1, Fig. 2). This strongly supports a role for most members of the family in modifying DNA. In the first experimentally studied subfamily of this group, the JBP subfamily, JBP2 is fused to a Swi2/Snf2 ATPase module, which is consistent with the role for ATP-dependent chromatin reorganization in synthesis of the J base in kinetoplastid DNA.35 The kinetoplastid JBP1 is fused to a previously uncharacterized C-terminal domain, which is also present as a solo version in other uncharacterized kinetoplastid proteins. Its predicted secondary structure indicates an α + β topology and it contains several strongly conserved polar residues including an absolutely conserved GGTRY motif (Suppl. material). This implies that it could possess an uncharacterized enzymatic activity or could constitute a specific base J-binding domain.
The DNA-binding CXXC domain found in the Tet subfamily proteins also occurs in several chromatin proteins, including the animal DNA methylase DNMT1 and the methylated DNA-binding protein MBD1 (refs. 11,39,40; Fig. 2). Given the domain architectural parallel to the DNMT1 methyltransferase and the precedent of the pyrimidine modification catalyzed by the related JBP subfamily we proposed that this subfamily might catalyze oxidative modification of 5-methylcytosines in animal DNA. Further experimental characterization of the human Tet1 protein showed that it indeed catalyzes this reaction to generate 5-hydroxymethylcytosine both in vitro and in cells. Studies based on overexpression of Tet1 in cultured human cells support its role in potential demethylation of 5-methylcytosines directly or indirectly via this oxidative intermediate.38 Based on the crystal structure of the AlkB protein (PDB: 2fd8) we observed that the unique cysteine-rich extension found at the N-terminus of the Tet subfamily 2OGFeDOs (Table 1) is likely, in part, to occupy a position similar to the N-terminal DNA-binding extensions of the AlkB protein.36 Hence, we speculate that this domain might similarly be involved in forming a DNA-recognition surface. The low complexity insert in the core double stranded β-helix of the Tet subfamily is exactly in the same position as an unstructured insert seen in the prolyl hydroxylases (PDB: 2JIJ, Fig. 1) and is inferred to be located on the exterior surface on one side of the 2OGFeDO catalytic domain. Its persistence across the entire Tet subfamily, despite lack of sequence conservation, suggests that it might form a generalized protein-protein interaction surface. In most members we also identified one or more high confidence sumoylation sites in this insert suggesting that the Tet family might be regulated through this protein modification.
In the gnathostome lineage, after the divergence of agnathan vertebrates, the Tet subfamily underwent a triplication to spawn the Tet1, Tet2 and Tet3 genes, which are conserved in all gnathostomes. Of these Tet1 and Tet3 retained the ancestral CXXC domain, whereas Tet2 appears to have lost the CXXC domain. However, analysis of the chromosome neighborhood shows that a standalone CXXC domain protein (CXXC4) is encoded in the same chromosome as a neighboring gene usually in the opposite orientation. This suggests that in Tet2 a local chromosomal inversion detached the ancestral CXXC domain-encoding exon to form a separate gene. In light of this we speculate that CXXC4 could possibly function as an independent protein interacting with Tet2 in a complex. Given the regulation of CXXC4 by the Wnt pathway,41 it would be interesting to investigate if this might constitute a specific regulatory mechanism feeding into Tet2 enzymatic action. The exon-intron structure of the Tet family is also largely retained across animals. Except Tet2, in all cases the first conserved coding exon encodes the CXXC domain, which probably represents the ancestral gene bearing the CXXC domain that was fused to the 2OGFeDO domain-coding segment. Thus, the Tet progenitor appears to have been acquired prior to the divergence of extant metazoans and underwent a gene-fusion event with the progenitor of the N-terminal exon encoding the CXXC domain. The sequence between the CXXC and the 2OGFeDO domains is a large low complexity region, which is extremely fast evolving and shows poor conservation of exon-intron boundaries across metazoa. This region is also subject to insertions of microsatellite DNA repeats as seen in Tet1 of the platypus, where it appears to have been incorporated into the coding sequence (Suppl. material). In zebrafish Tet3 this region shows a large insertion of an integrated retrovirus into the intron following the CXXC exon. 
These observations suggest that this intervening low-complexity region between the CXXC domain and the catalytic module is under little evolutionary constraint. The catalytic module, comprising the cysteine-rich extension and the 2OGFeDO domain, is encoded by 7 highly conserved exons. However, within the low complexity insert in the 2OGFeDO domain there is considerable variability in exon-intron boundaries and exon numbers.
The expansions of the Tet/JBP homologs in mushrooms and algae define a distinct subfamily where, with a few exceptions, most representatives are standalone proteins (Table 1, Figs. 2 and 3). However, their genomic context suggests that they are genes within a novel DNA transposon, which appears to have proliferated in some of these organisms (see below). The bacteriophage gp2 subfamily (Table 1) is found close to the viral origin of replication associated with a gene encoding the ParB protein, which belongs to a superfamily of DNA-binding proteins implicated in bacterial and phage chromosome segregation.42 Typically, the bacteriophage chromosome origins are enriched in genes related to chromosome-segregation, partitioning and packaging, suggesting that gp2 might interact with the ParB protein in these functions. In other phages the ParB protein shows fusions with DNA methylases, and these enzymes have been implicated in regulation of replication or chromosome partitioning in enterobacteriophages such as P1.43 Hence, the actinophage gp2 might modify methylated bases or reverse their methylation to regulate chromosome partitioning in these viruses. Given the presence of 5-hmC in DNA of other phages, it is possible that, like Tet1, these phage proteins oxidize 5-methylcytosine to 5-hmC directly on DNA. On the whole the Tet/JBP family shows a sporadic distribution: in prokaryotes it is restricted to certain bacteriophages of the caudoviral group or their prophage derivatives integrated in various genomes. The bacteriophage gp2s are the smallest versions of these proteins and represent more-or-less the minimal 2OGFeDO catalytic domain. Hence, they could potentially be the ancestral versions, which spawned the different eukaryotic versions through lateral transfer. 
The four remaining eukaryotic subfamilies of the Tet/JBP family are not particularly closely related to each other in terms of sequence similarity or domain architectures and have very patchy phyletic patterns (see Table 1). Thus, the eukaryotic versions could have emerged via multiple transfers from the bacteriophage/bacterial source and might also have disseminated via cross-species transposition within eukaryotes (see below).
This family is defined by the presence of a synapomorphic tryptophan just N-terminal to the strand prior to the first helix of the core catalytic domain (Suppl. material). In sequence searches, these proteins also tend to recover the Tet/JBP family prior to any other version of the 2OGFeDO superfamily, suggesting that there might be a distant relationship between the two families. This family is currently found only in two phylogenetically distinct groups of algae, namely chlorophytes and stramenopiles such as diatoms, pelagophytes and haptophytes. This phyletic pattern suggests that they probably emerged in the primary endosymbiotic photosynthetic chlorophytes and were transferred to stramenopiles during one or more endosymbiotic associations with the primary photosynthetic lineages. Almost all of these proteins show fusions of the 2OGFeDO domain to domains related to RNA-binding or enzymatic domains involved in RNA metabolism (Table 1 and Fig. 2). The two independent fusions of these 2OGFeDO domains to RNA methylases (Fig. 2) suggest that, like the JBP/Tet family, they might also catalyze further modification of methylated bases, possibly generating a hydroxymethyl derivative like hmC or hmU. Further, fusions to the cysteinyl tRNA synthetase C-terminal domain, which recognizes the anticodon of tRNACys,44 and the pseudouridine synthase, which modifies tRNA,45 suggest that these enzymes catalyze the formation of hydroxylated bases unique to the tRNAs of algae. 
2OGFeDO domains of this family are also fused to a TIM-barrel domain, which belongs to a superfamily that includes decarboxylases and amidohydrolases.46 This enzyme could hence potentially act in conjunction with the 2OGFeDO domain in catalyzing removal of a methyl group on a base through an oxidation-decarboxylation mechanism that was proposed earlier for the thymine-7 hydroxylase.32 A member of this family from the pelagophyte Aureococcus contains an interesting C-terminal fusion to a second 2OGFeDO domain, which is however of the AlkB family (see below), suggesting that it might catalyze two distinct oxidative modifications. Members of this family in haptophytes, pelagophytes and diatoms are also fused to a distinctive leucine-rich repeat domain (Fig. 2). The domain architectural diversity of this family (Fig. 2) suggests that it has radiated to catalyze a range of unique RNA modifications, which might have a distinctive adaptive role unique to these algae. In contrast to the majority of these fusions, a representative of this family from Aureococcus (Fig. 2, Suppl. material) is fused to the methylated-DNA binding TAM(MBD) domain. This fusion implies that like Tet1, this protein might catalyze the synthesis of hmC from methylated cytosine in DNA.
The AlkB family has been previously described and has been extensively characterized both in biochemical and structural terms.34,47 In addition to the versions acting on DNA, we had also described versions from RNA viruses and eukaryotes that are likely to act on RNA. Representatives of the latter group from animals are fused to RNA methylase domains.2 This family is defined by the synapomorphy in the form of a conserved arginine in place of the usual aromatic residue at the end of the C-terminal strand of the core domain34 (Fig. 1). Members of this family also have a unique β-hairpin insert, just N-terminal to the first helix of the core catalytic domain, which interacts with nucleic acids36 (Fig. 1). In this study we discovered a novel subfamily of AlkB proteins in fungi. While the classical AlkB proteins, which act on DNA are not combined with any additional specific DNA-binding domains, this fungal subfamily is typified by a remarkable fusion to multiple N-terminal domains (Table 1 and Fig. 2), including a SAD(SRA) domain.48 Some exemplars of the SAD(SRA) domains have been shown to specifically recognize DNA with methylated cytosines;49 however, representatives of this AlkB subfamily are found in fungi, such as Cryptococcus neoformans, which apparently lack any known DNA cytosine methylation system (Suppl. material). Further, all characterized representatives of the AlkB family appear to function on N-methylated bases rather than C-5 methylated pyrimidines.47 Hence, this version of the SAD(SRA) domain might not recognize 5-methylcytosine, but perhaps some other methylated base. The presence of additional N-terminal DNA-binding domains in this subfamily suggests that, unlike classical AlkBs, it might bind either specific DNA sequences or distinctive DNA structures. 
Hence, unlike the regular AlkB proteins which repair methylated-DNA in a non-specific manner, members of this subfamily might be involved in a localized DNA repair via recognition of specific sequences or structures of DNA. In evolutionary terms, this fungal AlkB subfamily appears to have been derived through duplication and divergence of the ancestral fungal AlkB, prior to the divergence of ascomycetes and basidiomycetes.
This novel family identified in the current study is defined by a RxxW signature at the N-terminus of the first helix of the core catalytic domain and an additional conserved domain just C-terminal to the 2OGFeDO domain (Fig. 2, Suppl. material). This conserved C-terminal domain contains a strongly-conserved signature in the form of HxY and GxD motifs at the N-terminus and a GNxG motif followed by a conserved tyrosine at the C-terminus. This strongly conserved pattern suggests that the domain is likely to be enzymatic. These two N-terminal domains are usually further linked in the same polypeptide to a C-terminal cysteine cluster, predicted to chelate Zn, and a R3H domain (Table 1). R3H domains have been shown to bind both single stranded DNA and RNA,50 indicating that this family is likely to act on bases in single stranded nucleic acids, probably in conjunction with the unknown activity catalyzed by the second conserved domain. Interestingly, this family shows a very patchy phyletic pattern, being found in several phylogenetically distant eukaryotes and bacteria (Table 1). For example, it is only found in Daphnia amongst animals or only in Phytophthora among stramenopiles, but none of their close sister groups among completely sequenced genomes. However, it is fairly widespread in the fungi suggesting that it was at least present in the common ancestor of ascomycetes and basidiomycetes. On the whole the phyletic pattern suggests that the family has either undergone extensive lateral transfer between certain lineages and/or multiple instances of gene loss. This family is also notable for lineage-specific expansions in certain lineages, such as the crustacean Daphnia, the mushroom Coprinopsis, the heterolobosean amoeboflagellate Naegleria and the moss Selaginella (Table 1). Multiple copies have often emerged via local gene-duplications and show no evidence for any association with a conserved transposase or transposon encoded proteins. 
Such expansions are typical of families that provide an adaptive value by being present in multiple diversified copies, usually in the context of counter-pathogen strategies or detoxification of diverse environmental compounds.51 Hence, a possible role for these proteins could be in the defense against viral nucleic acids or genomic parasites via a novel oxidative modification of nucleic acids.
This family is prototyped by proteins from species of the chlorophyte alga Ostreococcus (e.g., OSTLU_17228), which combine a novel version of the 2OGFeDO domain with a C-terminal DNA glycosylase module. The DNA glycosylase module is orthologous to the animal MBD4,52 and like it combines an EndoIII-superfamily DNA glycosylase domain with a divergent TAM(MBD) domain (Fig. 2). However, the fusion with the 2OGFeDO domain appears to be a lineage-specific one that is not represented in multicellular plants. The domain architecture suggests that the 2OGFeDO domain might function in conjunction with the DNA glycosylase domain, with the former domain oxidatively modifying a base and the latter probably carrying out excision of the modified base. Given that members of this family are also found in photosynthetic stramenopile algae and cyanobacteria (Table 1, Fig. 3), it is possible that they were first acquired by the primary endosymbiotic plant lineages from the cyanobacteria and subsequently transmitted to stramenopiles. However, these versions are usually small proteins that do not show the fusions to the DNA glycosylase domain, making it unclear if they actually modify bases in nucleic acids.
We were struck by the unusual pattern of the lineage-specific expansions and chromosomal distributions of the subfamily of the Tet/JBP family from mushrooms and algae (Table 1). These lineage-specific expansions, particularly in mushrooms like Coprinopsis (~40 copies) and Laccaria (~60 copies), are characterized by closely related or even identical copies, with paralogs from the same organism usually being closer to each other than their cognates from other organisms. The different copies are distributed throughout the genome rather than as a few loci of multiple tandem repeats. In both Coprinopsis and Laccaria, we observed that the majority of copies of the gene encoding the Tet/JBP family 2OGFeDO protein co-occurred in a tightly-linked genomic neighborhood with either or both of two distinct ORFs; a smaller subset of these neighborhoods also included a further 3rd conserved co-occurring ORF (Fig. 3). In some cases the identical copies of the Tet/JBP family are also found linked to identical copies of one or more of these ORFs at different chromosomal locations in these fungi (Suppl. material). These ORFs also showed a strongly preserved relative orientation with respect to each other (Fig. 3, see below). In computational experiments, the probability of these genes being neighbors in a particular preferred orientation so frequently by chance alone was found to be less than 10⁻¹⁹. Such conserved repetitive gene neighborhoods, which are widely dispersed over the genome in eukaryotes, are only found in the case of transposable elements or integrated viruses. Interestingly, in multiple cases these conserved gene neighborhoods are embedded in what appear to be chromosomal hotspots for transposon integration, as evidenced by the En/Spm transposons or retroelements53 found in their vicinity (Fig. 3). 
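The kind of chance estimate mentioned above can be illustrated with a simple binomial tail probability. The text does not specify how the computational experiments were performed, so the following is not the procedure used here but a minimal sketch under strong assumptions: each gene copy is treated as an independent trial, and the per-copy probability of showing the preferred neighbor and orientation by chance is a hypothetical placeholder value.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of observing k or more successes in n independent
    trials, each with success probability p (upper binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers for illustration only: 50 gene copies, 45 of which
# show the preferred arrangement, with an assumed chance probability of
# 0.25 per copy.
print(binomial_tail(50, 45, 0.25))
```

Because identical copies produced by recent transposition are not independent observations, a model of this kind would tend to overstate significance; a permutation of gene order and orientation across the genome would be a more conservative alternative.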
Taken together, these observations strongly indicate that this subfamily of the Tet/JBP family is encoded by a novel active transposable element, which additionally encodes at least the two other ORFs that most frequently co-occur with it in these mushroom genera. Most of the full-length 2OGFeDO genes are predicted to encode active proteins, indicating that they probably function in cis for each copy of the predicted transposable element. However, there are multiple instances in each organism, where one or more of the genes in an element are truncated or disrupted by deletions or mutations, suggesting that they represent non-functional or satellite versions of the parent transposon (Fig. 3, Suppl. information). In the genome of the alga Chlamydomonas, we only found two sufficiently long contigs to investigate the neighborhoods of its representatives of this subfamily of the Tet/JBP family. In both those cases we found linkages with the larger of the co-occurring ORFs found in the above fungal gene neighborhoods (Fig. 3). This suggests that they are indeed likely to be part of a similar transposon even in the alga. The pattern of distribution, which is currently limited to certain fungi and chlorophytes, also implies that these elements are likely to have spread laterally across phylogenetically distant groups.
To obtain a better understanding of this element we performed sequence profile analysis of the ORFs that co-occur in these elements. The smaller of the two most frequently occurring ORFs (Fig. 3) in the predicted transposon was found to contain a specialized version of the DNA-binding HMG domain that is most closely related to HMG domains of animal maelstrom proteins.54 The larger of the two ORFs encodes a protein of 850–1,100 amino acids, which often contains multiple cysteine cluster domains, potentially defining one or more Zn-chelating units (Fig. 4, Suppl. material). However, the core of this ORF contains a highly conserved domain with 6 characteristic sequence motifs (CX1–2H, GE, DXXC, HXXXHXXXC and GEXXE, where h is a hydrophobic residue). We propose that this distinctive conservation pattern defines the catalytic domain of the novel transposase used by these mobile elements. While it appears unrelated to any previously characterized transposase domain, the predicted secondary structure of this conserved domain is not incompatible with the RNase H fold found in several transposases.53,55 The third ORF, which co-occurs with these only in a subset of fungal elements (Fig. 3), is a small predicted α-helical protein with multiple conserved tryptophans and no detectable relationship to characterized protein domains (Suppl. material). Genes encoding the Tet/JBP family protein and the HMG domain protein are always oriented in the same direction with respect to each other, whereas those encoding the predicted transposase protein are oriented in the opposite direction (Fig. 3). Thus, the predicted transposase gene is most often head-to-head with respect to the Tet/JBP family gene and tail-to-tail with the HMG protein-encoding gene. The third uncharacterized ORF, if present, is almost always oriented in the same direction as the predicted transposase gene.
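As an illustration of how the motifs listed above could be screened for in a protein sequence, the following minimal sketch translates each motif, taken literally as written (X meaning any residue), into a regular expression. The dictionary `MOTIFS`, the function `find_motifs` and the toy sequence are hypothetical and are not taken from the study. The sketch is no substitute for the profile-based analysis used here, since such short patterns also match by chance.

```python
import re

# Motifs as listed in the text, with X read as "any residue".
# The names and regex translations are illustrative assumptions.
MOTIFS = {
    "CX1-2H": r"C.{1,2}H",
    "GE": r"GE",
    "DXXC": r"D..C",
    "HXXXHXXXC": r"H...H...C",
    "GEXXE": r"GE..E",
}


def find_motifs(seq: str) -> dict:
    """Return 0-based start positions of every (possibly overlapping) hit."""
    hits = {}
    for name, pattern in MOTIFS.items():
        # A lookahead lets overlapping occurrences be reported as well.
        hits[name] = [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]
    return hits


# Toy sequence invented for this example; it is not a real protein.
toy_seq = "MACSHAAGEKKDLLCAAHKKKHAAACPPGELLE"
for motif, positions in find_motifs(toy_seq).items():
    print(motif, positions)
```

Note that the two-residue GE pattern is also contained in GEXXE, so every GEXXE hit is reported under both names.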
It is conceivable that the strictly maintained pattern of opposite orientation of these two genes with respect to the predicted transposase might provide a means of differentially regulating their expression, possibly in a mutually exclusive fashion. Studies in the model mushroom Coprinopsis have suggested that RNA-targeted DNA cytosine methylation might play a role in gene silencing.56 In animals, the maelstrom protein, which contains an HMG domain related to the version encoded by these novel transposons, has been implicated in the repression of transposons via cytosine methylation and is part of the RNA-binding nuage complex.57 In this light, we speculate that the expression of the transposon-encoded Tet/JBP-related 2OGFeDO protein might result in oxidation of methylated cytosines on the transposon to hmC, which, in conjunction with the HMG domain protein, could help in regulating gene expression of the transposon and/or activity of the transposase.
While the highly mobile restriction-modification systems encode DNA modification (methylation) enzymes, such DNA-modification systems have not yet been observed in conventional multi-copy number transposons. To our knowledge the above-identified Tet/JBP-family-encoding transposons represent the first case of an apparently active eukaryotic transposable element that encodes its own DNA-modification enzyme with a potential regulatory role. Hence, we were curious to investigate if further examples of such transposons encoding DNA-modification enzymes existed. Accordingly, we systematically searched transposons identified on the basis of recognized transposase domains53 with a library of profiles of catalytic domains known to modify DNA, such as DNA methylases, 2OGFeDOs and phage Mu Mom-like enzymes. As a result we uncovered two distinct groups of transposons in bacteria such as Kuenenia, Nitrococcus, Acidithiobacillus and Leptospirillum, which showed associations with Mom-like enzymes (Fig. 3, Suppl. material). In these cases a catalytically active Mom domain is fused to a transposase domain of either the TnpA or the Tn5 family. Given that the Mom family catalyzes addition of carbamoylmethyl or a related adduct to DNA,8,14,26 it is likely that these transposon proteins are “two-headed” enzymes that catalyze both the modification of DNA via the Mom domain and transposition via the transposase domain. Even in these cases we suggest that Momylation by the transposon-encoded protein might regulate the transcription or transposition of the mobile elements that encode them. Such elements are found to be particularly expanded in the bacterium Kuenenia stuttgartiensis. In addition to these linkages, we also found a conserved fusion of the Mom domain with a restriction endonuclease-like domain of the very short patch repair nuclease superfamily55 in several bacteria, and to a nuclease of the Colicin E9-like family in Streptomyces (Fig. 3, Suppl. material).
These might represent uncharacterized restriction-modification systems, wherein Momylation might take the place of methylation. Thus, the above observations imply that transposons from both eukaryotes and bacteria might encode their own DNA modifying enzymes to regulate their gene expression or transposition.
The non-redundant (NR) database of protein sequences (National Center for Biotechnology Information, NIH, Bethesda) was searched using the PSI-BLAST program.59 Profile searches using the PSI-BLAST program were conducted either with a single sequence or a sequence with a PSSM used as the query, with a profile inclusion expectation (E) value threshold of 0.01, and were iterated until convergence.59 For all compositionally biased queries the correction using composition-based statistics was used in the PSI-BLAST searches.60 Multiple alignments were constructed using the Kalign program,61 followed by manual correction based on the PSI-BLAST results. The multiple alignment was used to create an HMM using the Hmmbuild program of the HMMER package.62 It was then optimized with Hmmcalibrate and the resulting profile was used to search a database of completely sequenced genomes using the Hmmsearch program of the HMMER package.62 Profile-profile searches were performed using the HHpred program.63 The JPRED program64 and the COILS program65 were used to predict secondary structure. Globular domains were predicted using the SEG program with the following parameters: window size = 40; trigger complexity = 3.4; extension complexity = 3.75.66
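The role of the window size and trigger complexity can be illustrated with a simplified sketch of the first stage of a SEG-like scan, in which the Shannon entropy (in bits) of the residue composition is computed for each sliding window and windows at or below the trigger value are flagged as low-complexity. This is not the SEG program itself: the extension stage and the subsequent refinement are omitted, and the function names are hypothetical. Regions left unflagged would be the candidates for globular domains.

```python
from collections import Counter
from math import log2


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())


def low_complexity_windows(seq: str, window: int = 40, trigger: float = 3.4) -> list:
    """Start indices of windows whose entropy is at or below the trigger.

    Simplified first pass only; the real SEG algorithm additionally extends
    and refines the flagged segments.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if window_entropy(seq[i : i + window]) <= trigger
    ]


# Toy sequences invented for this example.
diverse = "ACDEFGHIKLMNPQRSTVWY" * 2   # all 20 residues, evenly represented
biased = "S" * 30 + "G" * 10           # strongly biased composition
print(low_complexity_windows(diverse))  # expected to flag nothing
print(low_complexity_windows(biased))   # expected to flag the single window
```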
The Swiss-PDB viewer67 and Pymol programs68 were used to carry out manipulations of PDB files. Reconstruction of exon-intron boundaries was done using the NCBI Splign program69 with the tblastn searches against chromosomes as a guide. Gene neighborhoods were determined using a custom script that uses completely sequenced genomes or whole genome shotgun sequences to derive a table of gene neighbors centered on a query gene. Then the BLASTCLUST program70 is used to cluster the products in the neighborhood and establish conserved co-occurring genes. These conserved gene neighborhoods are then sorted as per a ranking scheme based on occurrence in at least one other phylogenetically distinct lineage (“phylum” in NCBI Taxonomy database), complete conservation in a particular lineage (“phylum”) and physical closeness on the chromosome indicating sharing of regulatory −10 and −35 elements.
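The neighborhood procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the custom script used in this study: the `Gene` record, the `flank` parameter, the cluster labels (standing in for BLASTCLUST cluster assignments) and the toy coordinates are all hypothetical, and the ranking by phyletic distribution is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    contig: str
    start: int
    strand: str
    cluster: str  # hypothetical label standing in for a sequence cluster


def neighborhood(genes, query_id, flank=2):
    """Genes within `flank` positions of the query on the same contig."""
    query = next(g for g in genes if g.gene_id == query_id)
    ordered = sorted((g for g in genes if g.contig == query.contig),
                     key=lambda g: g.start)
    i = ordered.index(query)
    window = ordered[max(0, i - flank): i + flank + 1]
    return [g for g in window if g.gene_id != query_id]


def conserved_clusters(genes, query_ids, flank=2, min_count=2):
    """Clusters seen in at least `min_count` of the query neighborhoods."""
    counts = Counter()
    for qid in query_ids:
        # Count each cluster once per neighborhood.
        counts.update({g.cluster for g in neighborhood(genes, qid, flank)})
    return {c: n for c, n in counts.items() if n >= min_count}


# Toy annotation invented for this example.
genes = [
    Gene("g1", "chr1", 100, "-", "transposase"),
    Gene("g2", "chr1", 4000, "+", "tet_jbp"),
    Gene("g3", "chr1", 6000, "+", "hmg"),
    Gene("g4", "chr1", 9000, "+", "other_a"),
    Gene("g5", "chr2", 500, "-", "transposase"),
    Gene("g6", "chr2", 3500, "+", "tet_jbp"),
    Gene("g7", "chr2", 5200, "+", "hmg"),
    Gene("g8", "chr2", 8000, "-", "other_b"),
]
result = conserved_clusters(genes, ["g2", "g6"], flank=1)
print(result)
```

Keeping the strand in each record allows the relative orientation of neighbors (for example head-to-head or tail-to-tail) to be tabulated from the same table.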
Our prediction of novel enzymes catalyzing the oxidative modification of nucleic acids has notable implications for both the evolution of nucleic acid metabolism and the future study of gene regulation. While oxidative modifications of proteins have been known for a long time, the existence of direct oxidative modifications of nucleic acids was not widely suspected. Our prediction of AlkB as an oxidative DNA-repair enzyme of the 2OGFeDO superfamily, and its subsequent experimental confirmation, provided the first computational support for such enzymes and the modifications catalyzed by them being more widely prevalent. This was further extended by studies on the biochemistry of the unusual DNA modification, base J of kinetoplastids.18 In this study we show that there are several such potential enzymes, both in previously well-studied model organisms such as mammals and in poorly characterized clades of fungi, algae and various early branching eukaryotes. Strikingly, there is no support for these nucleic-acid-modifying enzymes forming one related sub-group within the 2OGFeDO superfamily; instead, they appear to belong to several families, most of which are only distantly related to each other. This would imply that the 2OGFeDO superfamily has been recruited for oxidative modification of nucleic acids on multiple occasions. Within the Tet/JBP family there appears to be a correlation between their spread and the evolution of DNA methylation.
Of these, the Tet subfamily is correlated in animals with the presence of DNA-modifying cytosine methylases DNMT1 and DNMT3 and appears to act primarily in the oxidation of 5-methylcytosine.38 Unlike animals, multicellular plants have a novel DNA glycosylase, which appears to be their primary demethylating enzyme, and accordingly entirely lack enzymes of the Tet/JBP family.58 In contrast, chlorophyte algae, mushrooms and Naegleria, which also encode multiple DNA methyltransferase genes, have members of the Tet/JBP family that might modify the methylated cytosine in these organisms. Further, in addition to acting on RNA, some members of the algal RNA-modification-associated family (Table 1) might operate on methylated cytosine in DNA, as suggested by the fusion to the TAM(MBD) domain (Fig. 2). It is also interesting to note that, like the eukaryotic DNA methyltransferases, even the Tet/JBP family of enzymes might have descended from selfish elements like viruses or transposons found in the bacterial world. In this context it is interesting to note that the phage Mu Mom-like enzyme has been acquired by stramenopile algae, such as the diatom Phaeodactylum and Emiliania (Suppl. material), suggesting that on multiple, independent occasions DNA-modifying enzymes of prokaryotic selfish elements might have been reused in regulatory contexts by eukaryotes.
Our computational prediction of novel oxidative modifications of nucleic acids opens up new vistas for exploring previously unforeseen aspects of gene regulation in eukaryotes. Experimental analysis of the Tet1 protein shows that it might be a critical regulator of gene expression by oxidatively modifying 5-methylcytosine and possibly facilitating its demethylation.38 The presence of such enzymes across several eukaryotic lineages suggests that oxidized pyrimidine derivatives could provide novel epigenetic marks, or even a means of erasing prior marks in the form of DNA methylation. This study also points to the possibility of novel RNA-modifying enzymes in particular eukaryotic lineages. Experimental investigation of modifications catalyzed by these enzymes might indeed help in elucidating the lineage-specific adaptive value of such modifications.
L.A. and L.M.I. acknowledge the Intramural Research Program of the National Library of Medicine, National Institutes of Health, USA for funding their research. M.T. and A.R. are supported by a pilot grant from the Harvard Stem Cell Institute.
After this paper was submitted for review, two new publications on the TET2 gene came to our attention (A. Tefferi et al., Frequent TET2 mutations in systemic mastocytosis: clinical, KITD816V and FIP1L1-PDGFRA correlates, Leukemia 2009 and A. Tefferi et al., TET2 mutations and their clinical correlates in polycythemia vera, essential thrombocythemia and myelofibrosis, Leukemia 2009). Both of these studies show that the TET2 gene is mutated in multiple myeloproliferative neoplasms. The mutations recorded in these studies include those resulting in loss of the catalytic domain as well as point mutations disrupting the second metal-chelating histidine and other conserved residues in the catalytic domain and its cysteine-rich N-terminal extension. These mutations suggest that the Tet2 catalytic activity is likely to be required for its tumor suppressor function and that loss of Tet2 activity might be correlated with the hypermethylation reported in these myeloproliferative neoplasms.