The mobility of restriction–modification (RM) gene complexes and their association with genome rearrangements is a subject of active investigation. Here we conducted systematic genome comparisons and genome context analysis on fully sequenced prokaryotic genomes to detect RM-linked genome rearrangements. RM genes were frequently found to be linked to mobility-related genes such as integrase and transposase homologs. They were flanked by direct and inverted repeats at a significantly high frequency. Insertion by long target duplication was observed for I, II, III and IV restriction types. We found several RM genes flanked by long inverted repeats, some of which had apparently inserted into a genome with a short target duplication. In some cases, only a portion of an apparently complete RM system was flanked by inverted repeats. We also found a unit composed of RM genes and an integrase homolog that integrated into a tRNA gene. An allelic substitution of a Type III system with a linked Type I and IV system pair, and allelic diversity in the putative target recognition domain of Type IIG systems were observed. This study revealed the possible mobility of all types of RM systems, and the diversity in their mobility-related organization.
Restriction enzymes recognize and cut at specific DNA sequences, while their cognate modification enzymes methylate the same sequence to inhibit restriction enzyme cleavage. Restriction (R) and modification (M) enzyme genes are often tightly linked, forming a restriction–modification (RM) gene complex (1). When cells harboring an RM gene complex are invaded by foreign DNA, the R enzyme protects the cells by digesting the unmodified invading DNA, while the cellular DNA, which is protected by methylation from the M enzyme, is left intact. This benefit is the major reason RM systems are thought to be maintained in bacterial and archaeal genomes (2,3).
Four types of restriction systems (I–IV) are currently recognized (4). Type II R enzymes cleave DNA at definite positions within or near the recognition sequence (4,5). Fusion of R and M enzymes yields Type IIG (4,6). Type I systems consist of R and M genes, and sequence recognition (S) subunit genes, the products of which form multi-subunit enzymes for modification (SM) or restriction (SMR) (7). Type III systems consist of res and mod genes. The mod gene product has M activity on its own, while the complex of the two gene products has R enzyme activity (8). Type IV R enzymes, such as McrBC from Escherichia coli, cleave DNA near a methylated recognition sequence (9,10).
Some restriction systems are known to occasionally attack the host genome. If the RM gene complex is lost from a bacterial cell, the R and M enzymes gradually decrease in intracellular concentration as the cells grow and divide. Eventually, the M enzyme cannot methylate the chromosomal recognition sites sufficiently to protect against lethal attack by the remaining R enzyme molecules. This selfish post-segregational killing behavior forces host cells to maintain at least some Type II RM systems (11). Host cell killing also occurs with the Type IV enzyme McrBC when a particular DNA methylation system is introduced (10). Under some conditions of genome instability, Type I R enzymes attack the host chromosome at an arrested replication fork (12,13).
The mobility of RM genes has also been investigated. Phylogenetic trees of RM genes suggest horizontal transfer between distantly related prokaryotes (14–16). The average GC content and codon usage of RM genes often deviate from those of the rest of the genome (14,17–20). Genome context analysis has shown that some RM genes are on mobile elements such as plasmids and prophages (21–28), and some are linked to recombination-related genes such as integrases, invertases and transposases (29,30). RM systems and apparently solitary M genes flanked by insertion sequence (IS) elements have been observed (31–34). Genome comparisons have also shown that RM systems are involved in genome rearrangements such as insertion, deletion and transposition (14,17,35–37). Intragenomic comparisons of Helicobacter pylori demonstrated large inversion events next to RM genes (17). Allelic RM systems have also long been recognized. In E. coli, the hsd locus is occupied by either an EcoKI Type I system, an EcoB Type I system or other non-RM genes (38).
RM gene complexes are occasionally flanked by direct repeats (39,40). Genome context and genome comparison analysis led to the classification of the repeats into three groups: site-specific recombinations (Figure 1b), insertions with long target duplications (Figure 1c), and chance insertions between repeated sequences. The first class was observed for RM systems on prophages (21,23–28), or in the vicinity of integrase genes (30,41). We demonstrated the second class by genome comparison analysis revealing insertion of RM systems with long and variable target duplications, with no other mobile elements (37).
This study is the first report of a systematic, intraspecific genome comparison to explore the repertoire of genome rearrangements linked to RM genes within a given species. We also systematically analyzed RM gene linkage to flanking repeats. Our data strongly indicated putative mobility for all types of RM systems, and revealed organizational diversity related to mobility. Among the examples are novel, compact types of mobility units that are similar to DNA transposons, in which RM genes are flanked by long inverted repeats.
Sets of multiple complete genome sequences that were available for a single species were retrieved from NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov) on 1 April 2006, resulting in 760 pairs of syntenic regions that included RM genes in both or in one of the regions (Supplementary Table 2). The type, position and orientation of RM systems were obtained from REBASE (http://rebase.neb.com) (2). Sequence similarity between pairs of syntenic regions was visualized using the Artemis Comparison Tool (ACT, http://www.sanger.ac.uk/Software/ACT) (42) with default parameters. Conserved domains were searched using NCBI Conserved Domain Search (CD-Search, http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The 5-kb flanking sequences of RM systems were used for the classification.
Relatedness between two intraspecific genome sequences was represented by two variables: identity and coverage (Supplementary Table 1). Identity was calculated by the equation:
$$\mathrm{Identity} = \frac{\sum_i l_i \, p_i}{\sum_i l_i}$$
where $l_i$ is the average nucleotide length of the ith orthologous region between two genomes as detected by the Comparative Genome Analysis Tool (CGAT) (43), and $p_i$ is the fraction of identity (percent identity/100) of the ith orthologous region. Coverage was calculated as the ratio of the summed length of the orthologous regions to the average whole-genome length.
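The two relatedness measures described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that identity is the length-weighted mean of the per-region fractional identity and that the "average whole-genome length" is the mean of the two genome lengths; the function name, argument layout and the toy numbers are hypothetical and are not taken from CGAT.

```python
from typing import Sequence, Tuple


def identity_and_coverage(
    regions: Sequence[Tuple[float, float]],
    genome_len_a: int,
    genome_len_b: int,
) -> Tuple[float, float]:
    """Return (identity, coverage) for one pair of genomes.

    regions: one (average length in nt, fractional identity) tuple per
             orthologous region; this input format is an assumption.
    """
    total_len = sum(length for length, _ in regions)
    if total_len == 0:
        return 0.0, 0.0
    # Length-weighted mean of the per-region fractional identity.
    identity = sum(length * frac for length, frac in regions) / total_len
    # Summed orthologous length relative to the mean genome length.
    mean_genome_len = (genome_len_a + genome_len_b) / 2
    coverage = total_len / mean_genome_len
    return identity, coverage


# Toy example (hypothetical values): two orthologous regions.
regions = [(1000.0, 0.98), (3000.0, 0.90)]
identity, coverage = identity_and_coverage(regions, 10_000, 10_000)
```

Weighting by region length keeps short, highly diverged regions from dominating the identity estimate.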
A phylogenetic tree was drawn using the neighbor-joining method of the MEGA4 program (44). Bootstrap values were from 1000 trials, and other parameters were default. The GC content of third-codon nucleotides (GC3) and codon usage bias were calculated using CGAT (43). Codon usage bias of gene G against reference gene set R, B(G|R), was calculated using the equation (45):
$$B(G|R) = \sum_a P_G(a) \left[ \sum_{(x,y,z)=a} \left| \frac{f_G(x,y,z)}{f_G(a)} - \frac{f_R(x,y,z)}{f_R(a)} \right| \right]$$
where $P_G(a)$ is the ratio of amino acid a in a protein sequence of G, and $f_G(x,y,z)$ and $f_R(x,y,z)$ are the frequencies of the codon (x,y,z) in G and R, respectively. $f_G(a)$ and $f_R(a)$ are the frequencies of amino acid a in G and R, respectively. All genes in a genome were used as the reference gene set R, represented as B(G|all).
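As a rough illustration of the codon usage bias described above, the following sketch computes B(G|R) under the assumption that, for each amino acid, codon frequencies are normalized by the frequency of that amino acid (so synonymous codons sum to 1) and the absolute differences between gene and reference are weighted by the amino acid composition of the gene. The standard genetic code is used; the function names are hypothetical, and CGAT's actual implementation may differ.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds_list):
    """Count sense codons in a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def codon_usage_bias(gene_cds, reference_cds_list):
    """B(G|R): amino-acid-weighted sum of |f_G - f_R| over synonymous codons."""
    g = codon_counts([gene_cds])
    r = codon_counts(reference_cds_list)
    total_g = sum(g.values())
    if total_g == 0:
        return 0.0
    aa_g, aa_r = Counter(), Counter()
    for codon, n in g.items():
        aa_g[CODON_TABLE[codon]] += n
    for codon, n in r.items():
        aa_r[CODON_TABLE[codon]] += n
    bias = 0.0
    for aa, n_aa in aa_g.items():
        if aa_r[aa] == 0:
            continue  # assumption: skip amino acids absent from the reference
        diff = sum(
            abs(g[c] / n_aa - r[c] / aa_r[aa])
            for c, a in CODON_TABLE.items()
            if a == aa
        )
        bias += (n_aa / total_g) * diff
    return bias
```

A gene that uses codons exactly as the reference set does scores 0; the score grows as its synonymous codon choices diverge from the genome-wide pattern.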
RM gene definitions were taken from the REBASE Genomes database (http://tools.neb.com/~vincze/genomes/), accessed on 29 November 2008. Each line in the list of RM genes for a prokaryotic genome was assumed to represent one RM system. Systems that included only an M gene and lacked a gene labeled R were treated as solitary M systems and included in the analysis. RM systems that contained the same genes were removed manually, resulting in 4132 RM systems. One kb of flanking sequence was examined for each gene in an RM system. Within each RM system, the left flanking sequence of each gene was compared with the right flanking sequence of the same gene and of every gene to its right. For example, if an RM system had three genes, six pairs of flanking sequences were analyzed (Figure 2). In total, 11 554 pairs were analyzed for the RM systems. The longest bidirectional match detected by Blastn (46) in each pair was assumed to represent repeat sequences if the match was longer than 20 bp; this threshold was chosen in place of the 30-bp threshold used for the intraspecific genome comparison analysis, to increase sensitivity.
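The pairing scheme for flanking sequences can be made concrete with a short enumeration. This is a minimal sketch, assuming genes are indexed from left to right and that a pair is the left flank of gene i combined with the right flank of gene j for every j at or to the right of i; the function names are hypothetical.

```python
from collections import Counter


def flank_pairs(n_genes):
    """All (i, j) pairs: left flank of gene i, right flank of gene j, i <= j."""
    return [(i, j) for i in range(n_genes) for j in range(i, n_genes)]


def genes_flanked(pair):
    """Number of genes enclosed between the two flanks of a pair."""
    i, j = pair
    return j - i + 1


# A three-gene RM system yields six pairs, as in the example in the text.
pairs = flank_pairs(3)
span_counts = Counter(genes_flanked(p) for p in pairs)
```

For a system of k genes this gives k(k + 1)/2 pairs, and pairs that flank a single gene are always the most numerous, which is why the RM data set contains more one-gene pairs than two-gene pairs.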
The same analysis was carried out for other genes using all genes in 50 randomly selected bacterial genomes as controls, for 119 865 total genes. One-kb sequences flanking n genes (n = 1, 2, 3, 4, 5, 6, 7, 8, 9) were analyzed, resulting in 119 865 × 9 = 1 078 785 total pairs.
In the RM system analysis, the number of pairs differed according to how many genes they flanked. For example, 7049 pairs flanked one gene, whereas 2917 pairs flanked two genes. In the control analysis, the number of pairs was the same for every number of flanked genes. To make the results comparable, this difference was corrected by calculating a weighted average for the control pairs that adopts the proportions of pairs in the RM system analysis, using the following equation:
$$R_w = \frac{\sum_n \left( r_n^{con} / C_n^{con} \right) C_n^{RM}}{\sum_n C_n^{RM}}$$
where $R_w$ is the weighted average, $r_n^{con}$ is the number of pairs for which repeat sequences flanking n genes were detected in the control genome analysis, $C_n^{con}$ is the total number of pairs flanking n genes in the control genome analysis, and $C_n^{RM}$ is the total number of pairs flanking n genes in the RM system analysis. Nucleotide sequence alignment was carried out by ClustalW (47).
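The correction can be sketched as follows, assuming that the weighted average is the control repeat frequency for each number of flanked genes, weighted by the share of RM pairs that flank the same number of genes. The function name and the toy counts are hypothetical and are not values from the study.

```python
def weighted_control_frequency(r_con, c_con, c_rm):
    """Weighted average of control repeat frequencies.

    r_con[n]: control pairs flanking n genes in which a repeat was detected
    c_con[n]: all control pairs flanking n genes
    c_rm[n]:  all RM-system pairs flanking n genes (used as weights)
    """
    total_rm = sum(c_rm.values())
    if total_rm == 0:
        raise ValueError("no RM pairs to weight by")
    return sum((r_con[n] / c_con[n]) * c_rm[n] for n in c_rm) / total_rm


# Toy example (hypothetical counts): repeats are rarer around single genes,
# and most RM pairs flank a single gene.
r_con = {1: 10, 2: 30}
c_con = {1: 100, 2: 100}
c_rm = {1: 3, 2: 1}
rw = weighted_control_frequency(r_con, c_con, c_rm)
```

Without this weighting, the control frequency would be an unweighted mean over n = 1 to 9 and would not reflect the predominance of one-gene pairs in the RM data.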
Genome sequences that corresponded to an ‘empty site’ allele of an RM system were searched using Blastn (46) against the nucleotide sequence database (nr, prokaryote, NCBI), with 1-kb flanking regions and the RM system region as queries. A genome sequence with >500 bp similarity in both flanking sequences, but with no similarity in the RM system, was selected as a subject sequence for genome comparison. Genome comparison results were visualized by ACT, and classified manually.
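The selection rule for empty-site candidates can be written as a simple predicate. This is a sketch only: it assumes the Blastn results have already been summarized into aligned lengths for the left flank, the right flank and the RM region, and the `Hit` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Summary of one subject genome's similarity to a query locus."""
    subject: str
    left_flank_bp: int   # aligned length within the left 1-kb flank
    right_flank_bp: int  # aligned length within the right 1-kb flank
    rm_region_bp: int    # aligned length within the RM system itself


def is_empty_site(hit, min_flank_bp=500):
    """True if both flanks align over >500 bp and the RM region does not align."""
    return (
        hit.left_flank_bp > min_flank_bp
        and hit.right_flank_bp > min_flank_bp
        and hit.rm_region_bp == 0
    )


# Toy example (hypothetical subjects and lengths).
hits = [
    Hit("strain_A", 950, 900, 0),     # flanks conserved, RM absent
    Hit("strain_B", 950, 900, 2400),  # RM system also present
    Hit("strain_C", 950, 120, 0),     # only one flank conserved
]
empty_sites = [h.subject for h in hits if is_empty_site(h)]
```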
We performed a comprehensive search for RM gene-linked genome rearrangements by comparing each RM locus to the syntenic regions of all other available complete genomes within the same species. Results of the pair-wise comparisons were classified according to nucleotide sequence similarity in the RM regions and 5 kb of flanking region (Table 1). Half of the pairs did not even have partial sequence similarity in the RM regions (Table 1), suggesting frequent insertions or deletions of RM genes in the history of that species. Pairs with no similarity in the RM region, but with similarity in the flanking sequences were classified into substitutions, or insertion/deletions (indel), based on the length of the unaligned region. Investigation of these cases allowed us to identify three types of potential rearrangements: (i) insertion with a long target duplication; (ii) substitution by other RM genes; and (iii) substitution of the target recognition domain in a Type IIG RM gene.
We searched for cases where RM systems were inserted into the genome with long and variable target duplications, but with no other mobile elements (37). Cases classified as indels in Table 1 were assumed to be insertions with long target duplications if they satisfied the following criteria: (i) inclusion of both R gene and M gene homologs in the inserted sequence; (ii) inserted sequence length of less than 20 kb to exclude large mobile elements such as prophages (48); and (iii) target duplication length longer than 30 bp to exclude typical repeats formed by site-specific recombination events (49). Nine cases were found from H. pylori, Burkholderia pseudomallei, Haemophilus influenzae, Thermus thermophilus, Vibrio vulnificus and Xylella fastidiosa. Three cases in H. pylori were previously reported (37), and a case in B. pseudomallei was reported as a Type I RM on a prophage annotated as a genomic island 5 (50). The other cases are analyzed in detail here, with the length and identity of the repeat sequences summarized in Table 2.
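The three criteria above translate directly into a filter. The sketch below assumes each indel case has already been annotated with the presence of R and M homologs, the insert length and the target duplication length; the function and parameter names are hypothetical.

```python
def is_long_td_insertion(has_r, has_m, insert_len_bp, td_len_bp,
                         max_insert_bp=20_000, min_td_bp=30):
    """Apply the three criteria for insertion with a long target duplication."""
    return (
        has_r and has_m                    # (i) both R and M homologs in the insert
        and insert_len_bp < max_insert_bp  # (ii) excludes prophage-sized inserts
        and td_len_bp > min_td_bp          # (iii) excludes short site-specific repeats
    )


# Toy examples (insert lengths are hypothetical).
candidate = is_long_td_insertion(True, True, 6_000, 46)
too_large = is_long_td_insertion(True, True, 45_000, 46)
short_td = is_long_td_insertion(True, True, 6_000, 25)
```

Cases that pass this filter still need manual inspection, because an insert carrying an integrase homolog next to a tRNA-derived repeat may reflect site-specific recombination rather than insertion with a long target duplication.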
If an inserted region contained only RM genes, it was classified as an RM insertion with a long and variable target duplication, as described previously (37). However, if the inserted region included an integrase gene homolog, and the flanking repeats had sequence similarity to a tRNA sequence, it might be the product of site-specific recombination, since tRNA sequences are known to be an integration target for bacteriophages (51–64) and integrative plasmids (65–68). Integration often involves a long sequence, such as the region from the anticodon loop through 3′ end (51,60,61,63–68), or the entire tRNA gene sequence (55,57), both of which are longer than our 30-bp threshold. Therefore, we examined each polymorphism in detail.
Two cases of RM insertion with a long target duplication were detected in a comparison of two H. influenzae genomes (Figure 3a and b). One case (Figure 3a1) showed an insertion of Type I M, S and R genes interrupted by a transcriptional regulator homolog and an ATP binding protein, with a duplication of a 46-bp long sequence that did not occur elsewhere in either genome. The repeated sequences showed high identity (Table 2). The GC content of the inserted region was 41%, slightly higher than the genome average of 38%. The other case (Figure 3b1) was a Type II RM system insertion with duplication of a 46-bp sequence. This sequence differed from the above repeat, and was also unique in the genome. The repeated sequences showed high identity (Table 2), and the GC content of the insert (32%) was much lower than the genome average (38%).
Cases of site-specific recombination were observed in T. thermophilus, V. vulnificus and X. fastidiosa (Figure 3c–e, Table 2). The repeat sequences showed similarity to the 3′ terminus of a tRNA sequence (Figure 3c2, d2 and e2), and all had a gene in the insert with strong or weak sequence similarity to an integrase, which likely mediated the site-specific recombination.
Figure 3c shows a tyrosine-type phage integrase homolog next to the repeat sequence. A Type IIG RM gene and a Type II M gene are present in the insert, as well as a transposase homolog and genes for hypothetical proteins. Figure 3d shows another tyrosine-type integrase homolog adjacent to the repeat sequence. The insert carries Type I RMS genes, as well as multiple genes for DNA-binding proteins, a virulence-related gene (HipA-like protein), and a multidrug efflux pump gene involved in drug resistance. These two inserts may be considered genomic islands.
In Figure 3e, the insert in the X. fastidiosa Temecula 1 genome contains only Type II R and M gene homologs, and an integrase gene homolog. Perfect sequence identity between the repeats suggests that the integration is a relatively recent event. The last gene product shows very weak sequence similarity to an integrase family protein (e-value of 7e-4 in blastp) (69), so we could not determine if this gene in this putative mobile unit has decayed or has specialized.
Substitution-type allelic RM systems were found in Campylobacter jejuni, E. coli, X. campestris, Rhodopseudomonas palustris and T. thermophilus. The C. jejuni case was reported as a substitution in a region including an S subunit gene from a Type I RM system (70), and the E. coli example was also reported (38). The remaining cases are presented here.
Type III RM alleles at a locus in X. campestris pv. campestris str. ATCC33913 were substituted with Type I RM genes in two genomes of this species (Figure 4). When the same locus was analyzed in other Xanthomonas species genomes, a deletion was found in X. axonopodis pv. citri str. 306 that left only 125 bp of a short open reading frame (ORF), and a substitution of non-homologous Type I RM genes was found in X. oryzae KACC10331 and X. oryzae MAFF311018. These two genomes showed evidence of insertion of ISXo1 into the S subunit gene in the Type I RM genes.
Homologs for all RM genes at this locus were found in distantly related bacteria, suggesting the possibility of horizontal transfer (Supplementary Figure 4). The Type III system homolog genes in X. campestris ATCC 33913 were found in Bordetella pertussis Tohama I (Supplementary Figure 1). The two genera are distantly related according to the phylogenetic tree (Supplementary Figure 4), and the GC3 and codon usage of the homologs were different from the majority of genes in both species, suggesting different origins for these homologs. The Type I and Type IV system homolog genes in X. campestris 8004 were also found in X. campestris pv. vesicatoria 85–10, Methylobacillus flagellatus KT, and Alkalimnicola ehrlichei HLNE-1 (Supplementary Figure 2), and Type IV genes are frequently observed in the vicinity of Type I RM genes (10). These genera are phylogenetically very distant from each other (Supplementary Figure 4). The GC3 and codon usage of these homologs are different from the majority of X. campestris and A. ehrlichei genes, but most of the homologs in M. flagellatus did not show much bias, suggesting that M. flagellatus is the origin of these Type I and Type IV systems. Homologs of the Type I and Type IV systems in X. oryzae KACC 10331 were found in Nitrosomonas eutropha C71 and Methylococcus capsulatus Bath (Supplementary Figure 3), which are phylogenetically distant (Supplementary Figure 4). The bias in GC3 and the codon usage of these homologs suggests the possibility that N. eutropha C71 recently acquired the Type I RM genes.
Type II RM genes in T. thermophilus HB8 were substituted with a different Type II M gene in T. thermophilus HB27 (Figure 5a). These M genes were 68% identical in amino acid sequence, although the neighboring R gene was completely deleted in the latter. The R and M genes of the former genome showed biased codon use and GC content (Figure 5a2), suggesting horizontal transfer. The M gene of the HB27 genome did not show bias, which may indicate amelioration of a horizontally transferred gene after decay (71).
A Type II M gene in R. palustris HaA2 was substituted with a Type IIG gene in R. palustris CGA009 (Figure 5b), and no sequence similarity was observed between the two genes. Codon usage and GC3 of these genes were biased from the majority of genes in the genome (Figure 5b2 and b3), suggesting separate horizontal transfer of these genes to form alleles of a locus.
Sequence similarity and diversity in the C-terminal region of a Type IIG enzyme compared to a Type IS subunit have been reported (72), and relationships of recognition sequences and the region were confirmed previously by in vitro analysis (73).
Allelic diversity in the putative sequence recognition domain of a Type IIG RM protein was found in six cases, by investigating RM loci that were classified as partially matched (Table 1). Omitting cases of frameshift mutation and cases in which the same extent of diversity covered the entire RM gene region and both flanking regions, left two cases to analyze in detail.
In Campylobacter, both sides of the Type IIG homolog showed sequence similarity in all strains, but divergence at the nucleotide sequence level was observed in the C-terminus of the Type IIG homolog (Figure 6a). Amino acid sequences of the genes aligned completely except for the two variable regions at the C-terminus (Supplementary Figure 5). An NCBI Conserved Domain Search (CD-Search) (74) showed that CJE1195, the Type IIG enzyme of C. jejuni RM 1221, had the modification subunit motif of Type I RM systems (COG0286, e-value 4e-28) and the sequence recognition motif of the S subunit of Type I RM systems (COG0732, e-value 2e-3). The regions of the sequence recognition domain matched the diverged region between the homologs. In the two unaligned regions, repeats of 20–40 amino acids were observed, which also supported the similarity to the Type I RM sequence recognition subunit. To our knowledge, this is the first reported example of allelic diversity in the target recognition domain of a Type IIG enzyme in closely related genomes.
Type IIG genes (Sth1066ORF1376P) in S. thermophilus CNRZ1066, and (Sth18311ORF1376P) in S. thermophilus LMG18311 were compared and found to have unmatched regions at the C-terminus (Figure 6b).
In the second part of this study, we systematically searched for repeat sequences flanking RM genes in completely sequenced genomes. Specifically, we wanted to determine the generality of RM system insertions with long and variable target duplications (37). We also wished to examine RM systems in a novel context that might suggest a mechanism for their insertion.
One kilobase pair of flanking sequence was analyzed for 4132 RM systems. The frequency of RM systems with repeat sequences longer than 20 bp was compared to the frequency for other genes (see ‘Materials and methods’ section). Both direct and inverted repeat sequences were observed at significantly higher frequencies in the flanking sequences of RM systems (Figure 7). The longest repeat sequence was chosen from each RM system and used for further analysis.
Some cases appear to have been caused by insertion of repeat sequences such as ISs, independently of the action of RM genes or integration by site-specific recombination of an RM-carrying phage or other mobile element. Because we were primarily interested in RM-mediated rearrangements that did not involve other mobile elements, we excluded these cases based on mobility-related gene annotation, and genomic copy number of the repeats (Figure 8a).
In 57 out of 179 cases, five protein-coding sequences that flanked the RM system, or that were within the RM system, included genes annotated as mobile elements such as transposons, integrases, resolvases, invertases, topoisomerases and phage-related sequences. Of these, 30 out of 57 included a gene annotated as a transposon, and 25 out of 57 included a gene annotated as an integrase. After omitting these, 122 cases remained for further analysis.
Each repeat sequence was searched by Blastn against the entire genome, and sequences that matched over more than 90% of the length of the query repeat were assumed to be copies. In total, 24 of the 122 cases had more than 10 copies in addition to the two flanking the RM genes, and were excluded.
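The copy-number filter can be sketched as below. It assumes that the Blastn hits of a repeat against its own genome are available as a list of aligned lengths, and that this list includes the two copies flanking the RM genes; the function names and the toy hit lists are hypothetical.

```python
def count_extra_copies(repeat_len, hit_lengths, flanking_copies=2, min_fraction=0.9):
    """Count genomic copies of a repeat beyond the two that flank the RM genes.

    A hit counts as a copy if it is longer than 90% of the query repeat length.
    """
    copies = sum(1 for h in hit_lengths if h > min_fraction * repeat_len)
    return max(copies - flanking_copies, 0)


def is_high_copy(repeat_len, hit_lengths, max_extra=10):
    """True if the repeat has more than 10 additional copies and is excluded."""
    return count_extra_copies(repeat_len, hit_lengths) > max_extra


# Toy examples (hypothetical hit lengths for a 40-bp repeat).
unique_repeat = is_high_copy(40, [40, 40, 39, 20])  # one extra copy
dispersed_repeat = is_high_copy(40, [40] * 13)      # eleven extra copies
```

Applied to the 122 cases, a filter of this kind removes repeats that are likely to be dispersed elements such as ISs, leaving the 98 cases carried forward to the empty-site search.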
An empty site was searched in the other genome sequences for 98 out of 122 cases. More specifically, RM systems and 1 kb of flanking sequence on both sides were used as queries for Blastn homology searches against all sequenced genomes. An empty-site genome sequence, lacking the RM genes but with sequence similarity on both sides, was found for 29 out of 98 cases (Table 3), which were used for genome comparison. Although the pool before selection included archaea, no archaeal RM systems survived the selection.
In the 29 cases described above, the RM region was compared to the subject genome to detect RM-related rearrangements. Cases were classified into (Figure 8b1–b3): (i) insertion with long target duplication; (ii) substitution; and (iii) transposon-like structure where RM genes are flanked by inverted repeats.
Evidence of an RM system insertion with a long target duplication has been reported only for H. pylori (ε-proteobacteria), from a pairwise genome comparison (37). We found two other cases in H. influenzae (see above). We found more prominent examples of insertion with long target duplications in Campylobacter (ε-proteobacteria), Microcystis (cyanobacteria), Neisseria (β-proteobacteria), Acidovorax (β-proteobacteria), Haemophilus (γ-proteobacteria), Burkholderia (β-proteobacteria) and Clavibacter (actinobacteria, Gram-positive) (Table 3). With the exception of Burkholderia, Microcystis, Acidovorax and Clavibacter, most occur in naturally competent species (75).
Examples in all types of RM systems, Types I, II, III and IV, were found (Figure 9, Table 3 and Supplementary Figure 7). An mcrA gene (Type IV) appeared to have inserted without other genes through this mechanism (Figure 9b). The discovered mcrA gene is relatively short, and has sequence similarity to only the C-terminal half of mcrA gene homologs in other bacteria. However, CD search (74) detected an HNH nuclease domain (cd00085) that is common in other mcrA homologs, suggesting that this gene may be active. Orphan M was also found to be inserted by this mechanism (Figure 9c and Supplementary Figure 7h).
In many cases, the repeat sequences spanned the translation start or stop site, which, unlike insertion into the coding region, may leave the target gene intact and confer a selective advantage. Insertion of a Type II RM into an operon-like structure was also observed (Supplementary Figure 7m), consistent with a previous example (76).
Several RM systems flanked by repeats have been detected in H. pylori. Two cases were previously reported (Supplementary Figure 7a and k) (48), and in four novel cases, the targets were a Type IIG RM gene (Supplementary Figure 7d), a Type II M gene (Figure 9d and Supplementary Figure 7g), a Type III M gene (Supplementary Figure 7i) and a hypothetical protein gene (Supplementary Figure 7c). In the third case, the M gene in the query genome was frameshifted, while its homolog in the subject genome was not. This suggested that the insertion led to decay of the target gene by frameshift mutation and gene fusion.
The second example (Figure 9d and Supplementary Figure 7g) may be a case of generation of a novel M gene by gene fusion, because the repeated sequence was within the Type II M gene. Insertion with duplication of this sequence was likely the initial event. The M gene in the subject (target) genome was short (480 bp), and carried only motifs I through VIII, in order, and lacked a target recognition domain and motifs IX and X. The insertion apparently fused this partial M gene to a target recognition domain and motifs IX and X, creating a typical m5C methyltransferase (77). In strain J99, this fusion is active (78). The short N-terminal M gene may have been generated by a rearrangement event.
Several cases in which the subject genome did not align, or only partially aligned with the repeated sequence were observed (Figure 8b). These cases were observed in Desulfovibrio (δ-proteobacteria), Neisseria (β-proteobacteria) and Helicobacter (ε-proteobacteria) (Supplementary Figure 6).
In Desulfovibrio, the Type I RM system flanked by repeats was substituted in the subject genome by a prophage with unrelated Type II M and RM genes (Supplementary Figure 6a). The prophage was flanked by 45-bp attL/R sequences, which align with the tRNA-Gly sequence.
In Neisseria, the Type IIG RM gene flanked by repeats was substituted by a transposase homolog (Supplementary Figure 6b), whose homologs, annotated as IS1016, were frequent in both genomes, with eight in the query genome and 16 in the subject genome. Substitution of this RM might have occurred by the action of this transposase.
In Helicobacter, a Type II RM system flanked by direct repeats was substituted by a gene for a hypothetical protein with a transmembrane domain (Supplementary Figure 6c). Only one of the two repeat sequences of the query RM system aligned with the subject genome. Because we could not find a genome with a clean empty site, we could not determine if the Type II RM system inserted into a genome with duplication of the 22-bp target sequence.
Terminal inverted repeats are a feature of many DNA transposons (79). Of the 98 RM systems remaining after removal of linked mobile genes and high-copy number repeats, 24 cases contained inverted repeats (Supplementary Table 3b and Figure 10). These were found in all RM types and orphan M (Figure 10a–f). In some cases, one or more component genes, but not the entire RM system, were flanked by inverted repeats. For example, in one Type II system of M-M-R, one of the two M genes was flanked by inverted repeats (Figure 10i), possibly representing an intermediate status in replacement of M partner by an R gene. In a Type I RM system, the inverted repeats are embedded in two S subunit genes flanking R and M subunit genes (Figure 10g). In Ureaplasma species, inverted repeats within the S genes flank M gene and another S gene of an apparent Type I system, composed of one R, one M and three S genes (Figure 10h).
A genome with an empty site was found for comparison among these examples, which clearly revealed their insertion points. In X. oryzae (Figure 11a), the inverted repeat sequences had 60/65 sequence identity, and an incomplete match is a feature of the terminal inverted repeats of many DNA transposons (80). The repeats did not align with the subject genome sequence, and were not found elsewhere in the query genome, and therefore appear to be unique to the inserted unit (Figure 11a). The entire RM system unit with the terminal inverted repeats was flanked by 8-bp direct repeats, which perfectly matched the 8-bp sequence in the subject genome empty site. This strongly suggested that insertion took place with 8-bp target site duplication. The same relationship was detected for an RM system homolog in another X. oryzae strain (Table 3, panel c). In addition, the 8-bp target sequence is flanked by 5′CTGC and 5′CAG, which are contained in the recognition sequence, 5′CTGCAG (81). The significance, if any, of such target site organization in the life cycle of this element remains unclear.
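The two structural checks used in this comparison, imperfect terminal inverted repeats and a short target site duplication matching the empty site, can be illustrated with a small sketch. It assumes ungapped comparison of equal-length termini, which is a simplification of a Blastn alignment; the function names and all sequences are hypothetical toy data, not the X. oryzae sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def inverted_repeat_identity(left_end, right_end):
    """Return (matches, length) between the left terminus and the reverse
    complement of the right terminus, compared position by position."""
    rc = reverse_complement(right_end)
    if len(left_end) != len(rc):
        raise ValueError("termini must have equal length for ungapped comparison")
    matches = sum(a == b for a, b in zip(left_end.upper(), rc))
    return matches, len(rc)


def has_target_site_duplication(left_dr, right_dr, empty_site, length=8):
    """True if identical direct repeats of the given length flank the unit
    and the same sequence occurs once in the empty-site allele."""
    left, right, site = left_dr.upper(), right_dr.upper(), empty_site.upper()
    return len(left) == length and left == right and site.count(left) == 1


# Toy example: a 7-bp terminus with one mismatch in its inverted partner.
left_tir = "GGAATTC"
right_tir = reverse_complement("GGTATTC")
matches, length = inverted_repeat_identity(left_tir, right_tir)
tsd_found = has_target_site_duplication("ACGTACGT", "ACGTACGT", "TTTACGTACGTAAA")
```

An identity such as 60/65 would be reported by this comparison as 60 matches over 65 positions; a perfect palindrome such as 5′CTGCAG is its own reverse complement.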
In N. gonorrhoeae (Figure 11b), inverted repeats showed a 26/26 sequence identity and were not found in the subject genome. A unit with these terminal inverted repeats appeared to have inserted with direct duplication of an 8-bp target sequence in this case. The inserted unit contained three ORFs, a Type IIG RM gene homolog, a Type I system S subunit homolog, and a hypothetical protein gene. RM systems with a Type IIG RM and a Type I S subunit homolog, such as BcgI (82) or Sau42I (28), are already known. Compared to these examples, the Type IIG RM homolog in this unit appears truncated, lacking its N terminal half. The third gene in this case had a transposase motif, and a likely inactivated derivative (COG3677, e-value 4e-12) by CD search (74). Whether this unit inserted through the activity of the transposase-like gene or through the activity of the RM genes is unknown.
By genome comparison and genome context analysis, we detected genome rearrangements linked to RM systems and their putative mobility forms. Intraspecific genome comparison revealed new examples of RM-linked genome rearrangements, such as insertion with a long target duplication, allelic substitution by different RM types, and allelic diversity of the target recognition domain in a Type IIG gene.
Our group previously discovered the insertion of genes with long target duplication in H. pylori, using whole genome comparison analysis (37). All examples in that study involved Type IIS RM systems, a subtype that cuts DNA outside of the recognition sequence. Examples of this type of insertion have been found in various species of bacteria, and for Type I, II, III and IV RM systems. The lengths of the repeated sequences vary, but are much longer than the range reported for the transposon of Mycoplasma (83). The ubiquitous occurrence of this type of insertion suggests that it involves a property common to all RM types, such as DNA double-strand breakage. RM systems flanked by long direct repeats are known to amplify themselves (84). This led us to hypothesize a virus-like life cycle for RM systems (84), which invade a genome by insertion with a long direct duplication, amplify themselves using the repeats, then release to invade other host cells in subsequent cycles (85). Although there is direct evidence for the amplification step (84), there is yet no experimental evidence for release or subsequent infection.
Genome context analysis revealed flanking repeats at a significantly higher frequency for RM systems than for average genes. The insertion of RM systems with long target duplication was found in several bacterial types, and for all RM system types. We discovered a novel mobile form of RM system, similar to classical DNA transposons, in that RM genes are flanked by imperfect inverted repeats (Figure 10 and Supplementary Table 2). Although mobility-related genes such as transposase and integrase genes are often linked to RM genes (3,10), to our knowledge, this is the first report of these structures (Figures 10 and 11), which were found for all RM types. Some inserted into a genome with a short target duplication (Figure 11) similar to classical DNA transposons. We do not know if these RM gene products act as a transposase in this unit. We cannot exclude the possibility that this unit was inserted by a transposase acting in trans, as a non-autonomous transposon. In some cases, only part of an RM gene cluster was flanked by inverted repeats, which might contribute to the diversification of RM systems by component replacement. The presence of the inverted repeats within an S gene of Type I systems might be related to the phase variation of S genes known in a Mycoplasma species (86).
Another interesting form of RM mobility is composed of RM genes and an integrase homolog inserted into a tRNA gene, resulting in flanking long direct repeats (Figure 3d). Restriction modification genes with a similar form are active (87).
In addition, we detected gene fusion during insertion with a long target duplication to generate a novel modification methyltransferase gene (Figure 9d and Supplementary Figure 7g). This might be an intermediate form in evolution of modification methyltransferases (88), explaining the circular permutation of their sequences (89,90). The formation of new specificity in M genes through gene fusion and duplication is an interesting prospect.
Our systematic genome comparison analysis revealed both the generality and variety of RM system mobility, including putative mobile forms of RM systems. This approach will reveal more RM system diversity, as prokaryote sequence data accumulates with metagenomics and innovations in sequencing technology.
Supplementary Data are available at NAR Online.
The 21st century COE project of ‘Elucidation of Language Structure and Semantic behind Genome and Life System’; the global COE project of ‘Genome Information Big Bang’ from Ministry of Education, Culture, Sports, Science, and Technology (MEXT) (to I.K.); ‘Grants-in-Aid for Scientific Research’ from Japan Society for the Promotion of Science (JSPS) (21370001, 19657002) (to I.K.); Medical Genome Science Program in Support Program for Improving Graduate School Education of JSPS (to I.K.). Funding for open access charge: The global COE project of ‘Genome Information Big Bang’ from Ministry of Education, Culture, Sports, Science, and Technology (MEXT).
Conflict of interest statement. None declared.
The authors thank Mikihiko Kawai, Ikuo Uchiyama, and Iwona Mruk for helpful discussions and suggestions. The authors thank an anonymous reviewer for explanation of Figure 6S (b).