A surprising result of comparative bacterial genomics has been the large amount of DNA found to be present in one strain but not in another of the same species. We examine in detail one location where gene content varies extensively, the restriction cluster in Escherichia coli. This region is designated the Immigration Control Region (ICR) for the density and variability of restriction functions found there. To better define the boundaries of this variable locus, we determined the sequence of the region from a restrictionless strain, E.coli C. Here we compare the 13.7 kb E.coli C sequence spanning the site of the ICR with corresponding sequences from five E.coli strains and Salmonella typhimurium LT2. To discuss this variation, we adopt the term ‘framework’ to refer to genes that are stable components of genomes within related lineages, while ‘migratory’ genes are transient inhabitants of the genome. Strikingly, seven different migratory DNA segments, encoding different sets of genes and gene fragments, alternatively occupy a single well-defined location in the seven strains examined. The flanking framework genes, yjiS and yjiA, display approximately normal patterns of conservation. The patterns observed are consistent with the action of a site-specific recombinase. Since no nearby gene codes for a likely recombinase of known families, such a recombinase must be of a new family or unlinked.
The Immigration Control Region (ICR) was defined in Escherichia coli K-12 (1) as a locus at 98.6 min specifying three different restriction systems within 14 kb of DNA. This cluster includes the hsdR, hsdM and hsdS genes encoding the type I system EcoKI, as well as the methylation-dependent restriction system genes mcrB, mcrC and mrr. Type I restriction genes resident here (linked to serB and thr) are known to be highly variable in specificity, both within E.coli (2–4) and among enteric bacteria (5,6). Some strains apparently lack restriction-modification (RM) systems at this locus. E.coli C has been a reference ‘restrictionless’ strain for many years (7,8).
Existing clues to the sequence evolution of the ICR of E.coli suggested a model of replaceable cassettes at a single location. These might be similar to the cassettes of integrons, which can be exchanged by means of site-specific recombinases (9); or to alternative prophages at an attachment site, as with phage 21 and the e14 excisable element (10); or they could resemble the DpnI/DpnII exchangeable cassettes found in Streptococcus pneumoniae (11,12), which are exchanged by means of homologous recombination in flanking DNA.
Three lines of evidence suggested that replaceable cassettes might occupy this location. First, many enteric strains have at least one restriction gene in this genetic location, but the restriction specificities are quite variable. At least 16 different specificities can be identified in 37 isolates of the ECOR collection (4,13), a strain set designed to represent the diversity of the E.coli species worldwide (14,15). This variability is found not only in E.coli but also among different genera of enteric bacteria (6).
Second, despite the residence of similar function (restriction) at the same genetic location in different strains, the DNA sequences determining that function are often highly divergent. The hsd genes have been grouped into four families (IA, IB, IC, ID) based on sequence similarity, functional complementation, and antibody cross-reactivity within but not between families (reviewed in 16–18). To appreciate the level of divergence between families, Sharp et al. (19) used divergence within type IA modification subunits (HsdM) of E.coli and Salmonella as a clock and suggested that IA and IB families were so divergent that the difference might distinguish bacterial phyla [roughly, comparing purple sulfur bacteria and spirochaetes (20)]. Despite this high level of divergence, members of three of the four families are found at the same location in E.coli, genetically linked to the thr locus (4). Analysis of four members of the IA family suggested that lateral transfer of hsd sequence has occurred (19).
The third line of evidence suggesting a cassette-swap model is the presence of conserved (or at least hybridizable) sequence flanking the locus in strains with divergent type I systems, by Southern blot (6). This suggests that these diverged genes occupy the same sequence environment, and that transposition or other site-independent mechanisms did not mediate acquisition.
The modification-dependent restriction genes flanking the type I systems have been less well-studied, but limited evidence suggests lateral transfer has affected these genes as well. The mrr gene of Salmonella typhimurium LT2 is relatively divergent from that of E.coli K-12 (71% DNA identity). Typical homologous genes serving the same function (orthologs) display DNA identity of ~84% in E.coli–Salmonella comparisons (21). The mcrB and mcrC genes are entirely absent from many E.coli strains (4,6).
In addition to a high frequency of segmental replacement, the region as a whole appears to be subject to higher levels of homologous recombination than is typical of most of the chromosome. Milkman and colleagues (22,23) designated the region containing the ICR as one of two ‘hypervariable regions’ in the E.coli genome. These regions show signs that homologous recombination of horizontally transferred DNA contributed significantly to the evolutionary history of the genes resident there. The other such region contains the O-antigen gene complex near 45 min (24,25). Taking another approach, Lawrence and Ochman (26) classify all three restriction systems, hsd, mcrBC and mrr, as laterally transferred genes using criteria of GC content and codon usage.
To clarify the boundaries of this putative cassette locus and to gain insight into the mechanisms of variation, we sought to obtain the sequence of the region from a strain thought to lack any ‘cassette’ at this location. E.coli C is such a strain by three criteria. The first is the absence of a classical RM system; it has long been used as a permissive host in the study of type I enzymes (7), and no restriction locus genetically linked to thr is identifiable (27). The second is the absence of modification-dependent restriction systems (8), two of which flank the type I system in K-12. The third is the absence of hybridizing sequence over much of the region by Southern blot analysis (6).
It is convenient at this point to enunciate the idea of ‘framework genes’ and ‘migratory genes’. ‘Framework genes’ are defined here as genes that remain constant in genomes of close relatives: they remain in the same order with respect to each other and in approximately the same chromosomal location, but may be locally separated by segments that come and go among strains. This idea has had several antecedents: Welch et al. (28) used ‘framework’, ‘backbone’ and ‘conserved synteny’ to refer to such segments. Perna et al. (29) used ‘backbone’, while others have referred to ‘collinear regions’ (30,31). We imagine that within a cell lineage, such genes diverge in sequence in concert, or are exchanged by homologous recombination. Such recombination depends on preservation of a high degree of DNA sequence identity [>96%; (32)]. The default assumption is that each copy of such a gene in the lineage is descended from a common ancestor of that gene in the same lineage. Such genes are termed ‘orthologous’ (33).
In contrast, we propose the term ‘migratory genes’ to refer to genes in sequence segments that come and go en bloc within a species or other taxon. Elsewhere, such segments have been called ‘loops’ or ‘islands’ (30), or ‘lineage-specific segments’ (29). In principle, these could be transposable elements; genes acquired by ‘illegitimate recombination’ (34) in an undirected fashion; or genes associated with prophages or other elements that mediate site-specific recombination at particular locations. Site-specific recombination requires particular short sequences (attachment sites) to be present on both recombining partners, and mediates rearrangement events with precise borders. Rearrangements can include insertion of one molecule into another, deletion between two such sites, or inversion between them. Such attachment sites would presumably occur at loci permissive for DNA acquisition, while insertions that disrupt other genes would be purged from the population by selection. Migratory genes found at the same locus in two cell lineages may be totally unrelated. We show below that these ideas accurately reflect the pattern of sequence variation observed in the vicinity of the ICR.
We find that the ICR is occupied by a variety of migratory genes in different strains, all flanked precisely by the same framework genes, consistent with acquisition by site-specific recombination. We also find evidence for reshuffling of these migratory genes by homologous recombination.
The wild-type E.coli strain C (AC3121 = CGSC 3121) was obtained from the E.coli Genetic Stock Center. The wild-type E.coli strain W (ATCC 11105) was obtained from the American Type Culture Collection. The K-12 strain ER2683 [fhuA2 glnV44 e14- rfbD1? relA1? endA1 spoT1? thi-1 Δ (mcrC-mrr) 114::IS10 Δ (lacI-lacA) 200/F′proAB lacIq ΔlacZM15 zzf::miniTn10 (KanR)], used for transformations, was constructed in this laboratory from an MM294 background strain. All strains were grown in Luria-Bertani medium at 37°C.
Qiagen kits were used for purification of chromosomal and plasmid DNAs according to the manufacturer’s instructions. All enzymes were from New England Biolabs. DNA amplification by PCR, restriction endonuclease digestions, DNA ligations and agarose gel electrophoresis were performed according to standard protocols (35). Transformation of E.coli with plasmid DNA was done using electroporation (36).
A size-fractionated E.coli C genomic DNA library was prepared by digesting purified chromosomal DNA with EcoRI, followed by electrophoresis on an agarose gel, and elution of DNA running at about 5 kb using a Qiagen gel extraction kit. A library of these fragments was ligated to EcoRI-digested, dephosphorylated pBR322. Following electroporation into ER2683, 10 pools of about 300 colonies each were prepared, and plasmid minipreps were made from each pool. These pools comprised the size-selected library. Glycerol stocks were also prepared from 1 ml of each wash.
We identified a pair of primers that amplify a region in the yjiY gene (downstream of mrr in E.coli K-12) from both K-12 and C genomic DNA. These yjiY primers were used to screen each of the 10 plasmid pools of the size-selected E.coli C DNA genomic library, and one of the 10 (pool 9) showed the expected 1022 bp PCR product. The strain ER2683 was transformed with the pool 9 plasmid DNA and 136 resulting colonies were patched on to an LB plate with ampicillin. Cells scraped from the patches were combined to make secondary pools of 13 patches each. Plasmid DNA was prepared from each secondary pool, and the yjiY PCR repeated. The pool containing patches 27–39 was positive. Cells from each of these individual patches were scraped from the plate, individual plasmid preps made, and the yjiY PCR again repeated. Colony 38 gave a positive PCR; cells remaining on the plate from patch 38 were scraped off and grown overnight to isolate a stock of this plasmid DNA. The plasmid of colony 38 was designated pMS3. To sequence the ICR flanking region upstream of mrr in E.coli C, PCR primers in yjiS and yjiA were used to screen the size-selected library. The plasmid pool from plate 8 of the size-selected E.coli C library had the correctly sized product from the yjiS and yjiA primers. The ER2683 strain was transformed with this prep and 208 resulting colonies were patched on to LB ampicillin. Successive yjiS-yjiA PCRs were performed as described above and the plasmid DNA from one final positive colony was designated pMS4.
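The two-stage pooling logic described above can be illustrated with a short Python sketch. It is not part of the original protocol: the function names are hypothetical, and `pcr_positive` is an assumed stand-in for one diagnostic PCR on pooled (or individual) plasmid DNA. The sketch only shows how pooling reduces the number of reactions needed to find a positive patch.

```python
from typing import Callable, List, Sequence, Tuple


def make_pools(n_patches: int, pool_size: int) -> List[List[int]]:
    """Split patches numbered 1..n_patches into consecutive pools."""
    return [
        list(range(start, min(start + pool_size, n_patches + 1)))
        for start in range(1, n_patches + 1, pool_size)
    ]


def two_stage_screen(
    n_patches: int,
    pool_size: int,
    pcr_positive: Callable[[Sequence[int]], bool],
) -> Tuple[List[int], int]:
    """Screen every pool once, then every patch of each positive pool.

    pcr_positive(patches) is a hypothetical stand-in for one PCR on DNA
    prepared from the given patches. Returns (positive patches, number of
    PCR reactions performed).
    """
    reactions = 0
    hits: List[int] = []
    for pool in make_pools(n_patches, pool_size):
        reactions += 1  # one PCR on the pooled plasmid prep
        if pcr_positive(pool):
            for patch in pool:
                reactions += 1  # one PCR per individual patch
                if pcr_positive([patch]):
                    hits.append(patch)
    return hits, reactions


if __name__ == "__main__":
    # Toy scenario modelled on the text: 136 patches, pools of 13,
    # with patch 38 assumed to be the only positive.
    positives = {38}
    found, n_pcr = two_stage_screen(
        136, 13, lambda patches: any(p in positives for p in patches)
    )
    print(found, n_pcr)
```

With 136 patches in pools of 13 there are 11 secondary pools, so a single positive patch is located with 11 + 13 = 24 reactions rather than 136 individual ones.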
Because of two closely spaced EcoRI sites in the K-12 sequence (193 bp apart), we expected a corresponding gap between the ends of the pMS3 and pMS4 sequences. Primers were used to amplify a PCR product across this gap, one primer in the Z5950 coding sequence (pMS4) and one primer in yjiX (pMS3) from E.coli C genomic DNA.
Genes hpaC and hpaB are not known from E.coli C; hpaI is known from C (37). To join sequence information revealed in this study with information from known E.coli C sequence, a PCR product was generated using primers in hpaC and hpaI with E.coli C chromosomal DNA as template; E.coli K-12 and E.coli W chromosomal DNA were used as negative and positive controls, respectively. A PCR product of about 5.5 kb was generated from these primers from E.coli C DNA. As expected, this product was also generated from W DNA and no product was generated from K-12 DNA. The PCR product from C DNA was cloned into the pPCR-Script Amp SK (+) vector (Stratagene). The sequencing of the hpaC-hpaI PCR product yielded a 5.5 kb sequence containing sequence similar to the E.coli W sequence spanning hpaC through hpaI, including the entirety of hpaB, hpaA and hpaX (GenBank Z37980).
The inserts of both pMS3 and pMS4 and the cloned hpaC–hpaI PCR product were sequenced using as templates random insertions generated with the GPS-1 Genome Priming System (New England Biolabs). Sequencing primers were Primers S and N of the GPS kit. The PCR product joining pMS3 to pMS4 was directly sequenced using the primers employed for the original amplification. Sequence was assembled using AutoAssembler (Applied Biosystems).
Sequence files were created containing 70 bp of sequence from each strain, joining 40 bp to the left of the yjiS stop with 30 bp to the right of the yjiA stop. The GCG program LINEUP created a consensus sequence. STEMLOOP was then used to predict dyad symmetry elements, with parameters of: minimum stem, 6; minimum bonds/stem, 12; maximum loop 20. The dyad shown in Figure 5 was the best of seven stems found.
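As an illustration only, the following Python sketch mimics the two steps above: building a 70 bp junction sequence from the two flanks, and scanning it for dyad symmetry with the stated thresholds. It does not reproduce the GCG LINEUP or STEMLOOP algorithms. The coordinate convention (0-based indices, stop codons excluded), the restriction to perfectly paired stems, and the bond weights (G-C = 3, A-T = 2) are all assumptions made for this example, and the function names are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}  # assumed hydrogen-bond weights


def junction(seq: str, left_end: int, right_start: int,
             left: int = 40, right: int = 30) -> str:
    """Join `left` bases ending at index left_end (exclusive) with
    `right` bases starting at index right_start (0-based; assumed convention)."""
    if left_end - left < 0 or right_start + right > len(seq):
        raise ValueError("flank extends beyond the sequence")
    return seq[left_end - left:left_end] + seq[right_start:right_start + right]


def reverse_complement(s: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(s))


def find_dyads(seq: str, min_stem: int = 6, min_bonds: int = 12,
               max_loop: int = 20) -> List[Tuple[int, int, int, int]]:
    """Return (left_start, stem_length, loop_length, bonds) for every
    perfectly paired inverted repeat that meets the thresholds."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for i in range(n):
        for stem in range(min_stem, (n - i) // 2 + 1):
            left_arm = seq[i:i + stem]
            if any(b not in COMPLEMENT for b in left_arm):
                break  # ambiguous base: longer stems from i would include it too
            target = reverse_complement(left_arm)
            bonds = sum(BONDS[b] for b in left_arm)
            for loop in range(max_loop + 1):
                j = i + stem + loop  # start of the right arm
                if j + stem > n:
                    break
                if seq[j:j + stem] == target and bonds >= min_bonds:
                    hits.append((i, stem, loop, bonds))
    return hits


if __name__ == "__main__":
    toy = "AA" + "GACTGC" + "TTTT" + "GCAGTC" + "AA"  # invented test sequence
    print(find_dyads(toy))
```

Ranking the hits (for example by bond count) to pick a best stem is left out, since the text does not state how STEMLOOP ordered the seven stems it reported.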
Sequence files used were: E.coli K-12, complete genome NC_000913.1, nucleotides 4567731–4590882 (38); E.coli O157:H7, complete genome NC_002655.2, nucleotides 5463295–5481664 (29); E.coli CFT073 complete genome NC_004431, nucleotides 5159888–5178596 (28); E.coli W, M15950.1, AF109125.1, M17609.1 and Z37980.2 (39–41); and S.typhimurium LT2, complete genome NC_003197, nucleotides 4774451–4791743 (30). The E.coli C sequence files Z47799, AF036583, X81446, X55200, X53666, X81322, X75028 and S56952 (37) were used to join our sequence to the rest of the hpc (hpa) operon. Sequences for CFT073 and K1 were kindly provided by Guy Plunkett in advance of publication. Sequence of the pathogenic strain K1 will be published elsewhere. Protein family annotation in Table 1 employed the Linkout facility of the Entrez web site at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/Database/index.html), and the Conserved Domain Database there (42). For correlation purposes, the genetic and restriction maps of E.coli may be consulted (43,44).
GenBank files M15950.1 Penicillin acylase [ECOPAC; (45)], AF109125.1 Penicillin amidase precursor regulatory region (46), M17609.1 Penicillin G acylase, complete [ECOPGA; (39)] and Z37980.2 [EC4HPADNA; hpa genes (40)] combined to form the sequence across the E.coli W version of the ICR as far as tsr. The last 21 base pairs of ECOPGA were similar to yjiS of K-12. To verify connection of this contig to yjiR, primers were designed in pac and yjiR based on known sequence in W and K-12. End sequencing of this product confirmed linkage of pac and yjiR. The sequence used in Figure 4 was obtained from a composite of three sequences such that all the sequence was obtained from two sources. The entire region is present in AF109125 (bp 85–373). Two corrections to the AF109125 sequence were made based on other available sequences, since this allowed improved alignment in Figure 4. Discrepancies were: AF109125 contained a three base insertion (‘TCG’ bp 198–91 of that file) relative to our yjiS end sequence; second, AF109125 contained an insertion of G at position 277 of that sequence relative to ECOPAC (no G before position 21). Sequence in Figure 3 is contained in ECOPGA only and was accepted as presented. Linkage to EC4HPADNA rests on a 35 bp overlap between the end of ECOPGA and the end of EC4HPADNA and the consistency of this linkage with the stable framework of the E.coli genomes.
The region of the ICR in the E.coli genome, including the genes yjiS through yjiA (Fig. 1), is spanned by an 18 kb EcoRI fragment in K-12 but only by a 5.1 kb EcoRI fragment in C (6). We cloned and sequenced the 5 kb EcoRI fragment identified by Daniel et al. (6) by Southern blot, using a PCR-directed search strategy to identify suitable clones in a size-selected clone library. The adjacent 6.7 kb was obtained by bootstrap PCR and cloning of segments with similarity to the adjoining region in the K-12 genome, also identified by PCR. Sequencing employed a mixture of vector-primed end sequencing, transposon insertion sequencing and gap closure by sequencing of PCR products. This sequence was deposited in GenBank (accession number AY392450).
Table 1 lists genes identified in our 13.7 kb segment, inferred from DNA sequence similarity found in GenBank (release 138) using nBLAST (47). Genes are listed in clockwise order with respect to the K-12 sequence, and have been given the name of the best match to another E.coli strain. All genes were uninterrupted in frame relative to the database version adopted as a name, except that the start codon and first six amino acids of Z5950 differ between C and the source CFT073.
As required by the sequencing strategy, one border of this region comprises two framework genes, yjiR and yjiS. The next three genes are candidate migratory genes: they are not found at all in K-12, but are found in other strains (see below). First is a fragment of an hsdR gene similar to a type IB family restriction enzyme, EcoA. EcoA is found genetically in this region in E.coli 15T–. We note that the 565 nucleotides encode an N-terminal fragment of hsdR that would not form a functional endonuclease. This sequence also would not have been present on the EcoA DNA probe used by Daniel et al. (6). Our result is thus consistent with the observed lack of restriction activity or hybridization. An ORF of unknown function is next. This is also found in E.coli CFT073, designated Z5950 (28); in O157:H7 a frameshift breaks this ORF into two parts, Z5949 and Z5950 (29). Similarity to K-12 then resumes, with more framework genes of unknown function found to the right of the ICR in K-12: yjiA, yjiX and yjiY.
At the right end of our E.coli C sequence are five more candidate migratory genes (Table 1). Again, these are not found in K-12 but are found at this sequence location in another strain, this time in E.coli W (40). These genes, hpaC, hpaB, hpaA, hpaX and hpaI (hpcH), form the left end of a 10-gene cluster found between yjiY and tsr in E.coli W. The gene cluster enables E.coli W to degrade 4-hydroxyphenylacetic acid. E.coli C also carries the other genes of this cluster: they were designated hpc (homoprotocatechuate degradation) by Roper et al. (37). Our sequence overlaps that of Roper et al., and we were able to construct a contiguous sequence as far as tsr by connecting those sequences with ours. This second locus for migratory genes will not be discussed further.
We compared our E.coli C ICR sequence with sequences around this region of the chromosome from six other strains of enteric bacteria. In addition to the well-studied E.coli laboratory strain K-12, these strains include another E.coli laboratory strain (W), the E.coli pathogens CFT073, O157:H7 and K1, and S.typhimurium LT2 (see Materials and Methods for references). For all seven strains, the same framework genes are present in the same order, interrupted by a variable region in the same location as the ICR of E.coli K-12 (Fig. 1). In all strains, the framework genes yjiS and yjiA constitute the left and right borders, respectively, of the ICR collection of migratory genes.
Figure 2 shows the migratory genes contained within the variable region in the seven strains (not to scale). ORFs similar at the DNA level [>65% identity (48)] are assumed to be orthologous and given names annotated in the earliest genome sequence. The number of genes present ranges from one (E.coli W) to 10 (E.coli K-12). In all, 43 ORFs representing 20 different genes are found here. Ten of these can be assigned a function. Collectively, RM genes comprise 21 of the 43 ORFs, and nine of the 10 assignable functions, consistent with the designation ‘ICR’. Six of the seven strains have at least one RM gene or gene fragment in this region. The RM systems are highly variable: two of the four type I families are represented, most likely representing five different specificities (see below); modification-dependent enzymes of two unrelated sorts are also present. The remaining gene with assignable function is penicillin acylase, in E.coli W. The remaining 10 genes (21 ORFs) are of unknown function.
Some groups of genes migrate together (Fig. 2). The first group comprises the type IA restriction genes hsdS, hsdM, hsdR and mrr (orange boxes) found in three strains: K-12, S.typhimurium and K1. Second are the type IB restriction genes, also hsdS, hsdM and hsdR, and two ORFs of unknown function, Z5949 and Z5950 (dark blue boxes) which are found in O157:H7, CFT073 and C. Third, yjiW (yellow box), a gene of unknown function that may be SOS-regulated (49), appears in all five of the strains with complete type I systems. Fourth, Z5943 and Z5944 (green boxes) are found together in the three E.coli pathogens. The strain-specific genes not found anywhere else in this dataset (white boxes) include: the K-12 genes yjiT, yjiU, mcrD, mcrC and mcrB; the Salmonella ORFs STM4522 (on the left) and STM4528 and STM4529 (between mrr and yjiA); and penicillin acylase [pac (50)], the sole gene found in place of the ICR region in E.coli W.
The pattern of gene relationships described above is even more confusing than we had anticipated. To get a sharper sense of gene relationships, we derived PIP indices [percentage of aligned bases identical; as in Florea et al. (51)] for DNA segments spanning the ICR and its flanks, from yjiS to yjiY. E.coli C was compared with six of the other strains, K-12 with five, K1 with four, and O157 with CFT073. The results are shown in Tables 2 (for framework genes) and 3 (for migratory genes).
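For readers unfamiliar with the index, a minimal Python sketch of a PIP calculation on two pre-aligned sequences follows. It uses only the gloss given above (percentage of aligned bases identical). The handling of gap columns is an assumption made here (columns with a gap in either sequence are not counted as aligned), and the function name and toy sequences are hypothetical; Florea et al. (51) should be consulted for the definition actually used.

```python
def pip_index(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of aligned (ungapped) columns in which the bases are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gap columns do not count as aligned bases
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * identical / aligned


if __name__ == "__main__":
    # Invented toy alignment: 5 ungapped columns, 4 identical.
    print(pip_index("ACGT-A", "ACGA-A"))
```

Under this convention, percent divergence for a pair is simply 100 minus the PIP value.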
The levels of DNA identity in the framework genes are generally consistent with expectation (Table 2). In general, E.coli genes diverge at up to 5% of base pairs (52,53) (≥95% identical), while E.coli–Salmonella divergence is 10–20% (53). For yjiX and yjiY, we observe divergence of up to 2% within E.coli and 9–12% between E.coli and Salmonella. Divergence is higher in the framework genes yjiS and yjiA, immediately adjacent to the variable locus: up to 15% for E.coli comparisons, 13–33% for E.coli–Salmonella. In all pairwise comparisons, yjiS is least conserved and yjiX is most conserved among the four framework genes. This may reflect higher levels of homologous exchange immediately around the variable locus, mediated by selection acting on the genes found there (23,54,55). Below (next section) we discuss evidence for homologous exchange in the generation of the set of strains under consideration.
The pattern of sequence variation between similar migratory genes (Table 3) is consistent with previous studies. The type IB hsdRM genes of O157 and CFT073 are highly similar to each other (98–99% nucleotide identity) and so highly divergent from the type IA hsdRM genes of K-12 that meaningful alignments cannot be made. This divergence is in line with earlier characterization of these families (56). The hsdS genes of the two pathogens show the patchy similarity expected for hsdS genes encoding different sequence specificity: conserved regions of the protein can be recognized as segments with 85–97% DNA identity, while the inferred DNA-binding regions are <50% identical (56). The K-12/Salmonella comparison has been made before (19,57).
Figures 3 and 4 display nucleotide sequence alignments of the seven strains at the right (yjiA) and left (yjiS) borders of the ICR. These alignments display sequence differences in different colors: O157:H7 was taken as the reference sequence (black letters) to minimize the amount of color. Differences from this are coded in color hierarchically. Those differences found in K1 or CFT073 are in blue wherever they occur; additional differences found in C are in green; then further differences found in Salmonella are in red; then additional differences found in W are in yellow; and remaining differences found in K-12 are in pink.
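One reading of this hierarchical coloring scheme can be sketched in Python. The sketch assumes that, at each alignment column, a base differing from the reference takes the color of the first tier (in the order listed above) in which that base appears, and keeps that color in every strain that carries it. The strain keys, the function name, and the toy alignment are hypothetical and are not taken from the figures.

```python
from typing import Dict, List, Sequence, Tuple

# Tier order follows the text; the strain keys are labels chosen for this example.
TIERS: List[Tuple[str, List[str]]] = [
    ("blue", ["K1", "CFT073"]),
    ("green", ["C"]),
    ("red", ["LT2"]),
    ("yellow", ["W"]),
    ("pink", ["K-12"]),
]


def color_alignment(
    aligned: Dict[str, str],
    reference: str = "O157:H7",
    tiers: Sequence[Tuple[str, List[str]]] = TIERS,
) -> Dict[str, List[str]]:
    """Assign a color to every base of every aligned sequence."""
    ref = aligned[reference]
    if any(len(seq) != len(ref) for seq in aligned.values()):
        raise ValueError("all aligned sequences must have the same length")
    colors: Dict[str, List[str]] = {name: [] for name in aligned}
    for col, ref_base in enumerate(ref):
        first_color: Dict[str, str] = {}  # variant base -> color of first tier showing it
        for color, strains in tiers:
            for strain in strains:
                base = aligned[strain][col]
                if base != ref_base and base not in first_color:
                    first_color[base] = color
        for name, seq in aligned.items():
            base = seq[col]
            if base == ref_base:
                colors[name].append("black")
            else:
                colors[name].append(first_color.get(base, "unassigned"))
    return colors


if __name__ == "__main__":
    toy = {  # invented three-column alignment
        "O157:H7": "AAA", "K1": "ATA", "CFT073": "AAA", "C": "ATC",
        "LT2": "GTC", "W": "AAT", "K-12": "GAA",
    }
    for strain, cols in color_alignment(toy).items():
        print(strain, cols)
```

In the toy alignment, the G that K-12 shares with Salmonella in the first column is colored red rather than pink, because the Salmonella tier is consulted before the K-12 tier.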
Three properties of these border sequences are evident. First, extreme sequence divergence begins abruptly after the stop codon of the flanking framework gene (to the left of the yjiA stop codon in Fig. 3; to the right of the yjiS stop codon in Fig. 4), visually recognized as a high level of color variation.
Second, families of sequences can nevertheless be identified, and at the right border these are congruent with families identified by patterns of shared genes seen in Figure 2. At the right border, O157:H7, CFT073 and E.coli C form one family (reference sequence, black), K1 and K-12 form another (blue highlighted changes); while S.typhimurium (red) and E.coli W (yellow) are each unique (Fig. 3). This is consistent with gene groupings: O157:H7, CFT073 and C all share Z5950 immediately to the left of yjiA, K1 and K-12 share mrr, W carries pac, and Salmonella carries STM4528 and STM4529 interposed between mrr and yjiA. Sequence families are so dissimilar that they appear to be unrelated. This suggests that four independent DNA acquisition events may have occurred at this location, immediately adjacent to yjiA.
Third, the family structure observed at the right border is almost totally erased at the left (yjiS) border (Fig. 4). As at the right border, sequences associated with four different categories of genes are found adjacent to the border (Z5943, yjiT, STM4522 and penicillin acylase), but deletion and homologous recombination have obscured relationships, and two novel sequences generate further variation.
K1 appears to be a recombinant between strains similar to K-12 and CFT073. Markers flanking the recombination interval are a short deletion and the different hsd family genes. The deletion joins the yjiR and yjiS coding sequences in CFT073 and K1 (Fig. 4). (Note that these genes are coded on opposite strands, so no fusion protein will be expressed.) CFT073 and K1 also share similar sequence to the right of the yjiS stop codon, the only two strains to do so. These two are thus grouped together at the left border, although they were not grouped at the right border. Inspection of Figure 2 suggests that a recombination event between strains similar to K-12 and CFT073, occurring within yjiW, could generate the gene configuration found in K1. The CFT073-like ancestor must have already carried the deletion, since K1 carries this as well.
This proposed sequence of events in CFT073 and K1 does not account for all the variation at the yjiS border, however. Two anomalies are evident in comparing Figures 2 and 4. First, the pattern of shared genes seen next to yjiS (Fig. 2) would suggest that O157:H7 should belong with CFT073 and K1, yet O157:H7 carries 57 bp of unique sequence between the yjiS stop codon and common sequence to the left (upstream) of Z5943. Identity among the three strains begins abruptly after this 57 bp, 36 bp from the Z5943 start codon. In effect, O157:H7 has suffered a short replacement event, compared with K1 and CFT073.
The second anomaly concerns E.coli C. From Figure 2, we might have proposed that a simple deletion, occurring in a strain like one of the pathogens, had joined yjiS with the middle of hsdR. However, in Figure 4 we see that in E.coli C, 47 bp of unique sequence lies between the yjiS stop codon and the hsdR fragment. This sequence is unlike either that of O157:H7 or that shared by CFT073 and K1. Possibly some still-unknown cassette was acquired at some time to occupy the position of Z5943/4, followed by a deletion event. This brings the number of cassette acquisitions to five; in any case, some event occurred at this border to generate novel sequence.
In summary, there must have been six different events at the left border, and four different events at the right border. This suggests an active mechanism for generating sequence variation precisely at this locus. This mechanism is such that six of seven strains carry 50–60 base pairs of unique intergenic sequence at the yjiS border, even when genes immediately to the right of that unique sequence are shared. The border sequence (adjacent to the yjiS stop) also appears to form a barrier to sequence rearrangement, since deletion on the left (joining yjiR and yjiS in K1 and CFT073) and both deletion (in C) and replacement (in O157:H7) on the right do not cross this edge. Features of the right and left border sequences are considered in the Discussion.
We consider below (Discussion) the possibility that site-specific recombination mediates acquisition of elements of the ICR. Recombinases that mediate such events are usually found immediately adjacent to their sites of action. Therefore, the distribution and environment of the contiguous framework genes yjiA, yjiX and yjiY (Fig. 1) were considered in light of this possibility.
These framework genes are present in all complete E.coli, Shigella and Salmonella genomes but are absent from more distant enteric species: for example from the two complete Yersinia pestis genomes, the two complete Buchnera aphidicola genomes, or the partially completed Klebsiella pneumoniae sequences. Nevertheless, strongly conserved homologs of these genes are found in the distant non-enteric relative Pseudomonas aeruginosa (PA4604, PA4605 and PA4606). They are present in conserved order and orientation, and are >50% identical at the amino acid level. The conservation of organization may indicate a functional relationship among these genes (58).
Interestingly, all five of the sequenced Pseudomonas genomes carry yjiA, but the sequences immediately downstream differ (Table 4); the five ORFs at this location are unrelated. This situation would be expected if the gene cluster acts as a recombinase at an attachment site downstream of yjiA.
Based on our sequence comparisons, the borders of the ICR are the yjiS and yjiA genes. A variety of migratory genes and gene fragments lie between these genes in the different strains. Between these border genes, the variability of the ICR begins immediately adjacent to their convergent stop codons. The identifiable genes found at the ICR have the common feature that they mediate interaction with external challenges of a sporadic nature: restriction systems (or remnants) to cope with invading DNA, or an enzyme for antibiotic modification (penicillin acylase). It is appropriate that genes with a sporadic distribution should respond to sporadic challenges. Below we consider some mechanistic models for generating this variation.
Any mechanism proposed for segmental replacements at the ICR must account for the abrupt transition from similarity to dissimilarity at the borders (Figs 3 and 4). In addition, horizontal transfer must play a role, since divergence among the hsd genes of the E.coli strains is greater than that between those of K-12 and S.typhimurium.
Site-specific recombinase (cassette) model. Our preferred model invokes an attachment site for site-specific recombination. It is known that unrelated phages may carry related integrases and can use the same attachment site [e.g. phage 21 and e14; (59)]. Thus, the inserted sequence can be totally dissimilar except for the integration machinery. Such a model would allow the inserted segments to diverge over long periods, sufficient even to allow divergence of the families of type I restriction enzymes, while conserving the sequences required for recombinase action in the host genome by selection for function of the flanking genes. The precision of the borders is immediately accounted for. Cassettes of genes would be propagated to new hosts only if they retained sequences compatible with recombinase specificity. A pool of genes would then evolve that circulates specifically at this locus.
Recombinase candidate. The problem with this model is that for all site-specific mechanisms known to mediate DNA acquisition, the gene for the recombinase is closely associated with the recombining segment [this is not true for XerCD, which mediates resolution of circular dimers at unlinked sites of action; XerCD action is highly regulated and has not been reported to mediate sequence acquisition (60)]. For phages (or transposon Tn7 in site-specific mode), the gene(s) is (are) contained within the mobile segment, while in the case of integron integrases, the recombinase gene lies immediately outside the mobile segment. In the present example, no integrase/transposase candidate is immediately identifiable by sequence similarity. However, the neighboring gene cluster yjiAXY might code for a novel recombinase of episodic distribution. Consistent with this proposal, this cluster is found immediately adjacent to a variable locus in the distantly related genus Pseudomonas (Table 4). Interestingly, it is missing from several organisms more closely related to E.coli, suggesting that the cluster itself, as well as the cassettes adjacent, may be mobile, or at any rate dispensable.
Possible attachment site. Dyad symmetry elements (inverted repeats) are often prominent components of the DNA sites at which integrases act, although they can sometimes be difficult to detect (59). The alignment of sequences obtained by joining the left and right borders of the ICR in the seven strains reveals such an element (Fig. 5), identified using the program STEMLOOP (see Materials and Methods). This joined sequence would be the presumed empty attachment site if these segments were prophage-like or cassette acquisitions as in integrons. A 13 bp dyad symmetry element (with two mismatches) surrounds the joined stop codons, interrupted by 5 bp including the yjiS stop. Even though no lambda-integrase-related gene is present, the dyad symmetry could still be important for a novel site-specific recombinase. We note further that the structure of the occupied sites is not that expected from a lambda-like integrase enzyme. Occupied sites should conserve the dyad structure at each border; that is, Figures 3 and 4 should also exhibit the presence of dyads or sequences related to them, especially in any consensus sequence.
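The kind of search involved can be illustrated with a short sketch that scans a sequence for two arms that are reverse complements of each other, separated by a fixed spacer and allowing a limited number of mismatches. This is not the STEMLOOP algorithm and makes no claim about its parameters; the function names, the arm and spacer lengths used below and the toy sequence are hypothetical, and the toy sequence is not the ICR junction.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_dyads(seq, arm, spacer, max_mismatches):
    """List (start, left_arm, right_arm, mismatches) for each position where
    the right arm matches the reverse complement of the left arm, with at
    most max_mismatches differences and exactly `spacer` bases in between."""
    seq = seq.upper()
    span = 2 * arm + spacer
    hits = []
    for start in range(len(seq) - span + 1):
        left = seq[start:start + arm]
        right = seq[start + arm + spacer:start + span]
        expected = reverse_complement(left)
        mismatches = sum(x != y for x, y in zip(right, expected))
        if mismatches <= max_mismatches:
            hits.append((start, left, right, mismatches))
    return hits


# Toy example (invented): 6 bp arms around a 5 bp spacer, with one
# mismatch introduced into the right arm.
toy = "TTGACCTGATAAACAGATCAA"
for hit in find_dyads(toy, arm=6, spacer=5, max_mismatches=1):
    print(hit)
```

Allowing mismatches matters here because the element described is imperfect; a search requiring exact symmetry would miss it.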
Illegitimate recombination/transposition model. An alternative model would invoke random insertion events introducing new genes, for example by non-homologous end-joining [possibly stimulated by the action of a restriction enzyme (61)], or by transposition. Purifying selection against those insertions in unfavorable locations would eliminate most events. In the present case, the problem is to account for the apparently precise nature of the insertion process.
Illegitimate recombination has been invoked to explain the tight packing of related and unrelated genes in phage genomes. In a comparison of the genomic sequences of 14 mycobacteriophages, Pedulla et al. (62) also find a mosaic structure with segments separated by sharp transitions in sequence similarity. As in our bacterial ICR sequences, they find that most mosaic boundaries in these phage sequences are located at or very near gene boundaries; of the >500 boundaries found, only three disrupted a coding sequence. Pedulla et al. invoked an earlier suggestion (63) that such diversity must be generated by a combination of illegitimate and homologous recombination and mutational drift. In their model, recombination can occur virtually anywhere along the phage sequence, but the only phages that survive to be analyzed have undergone recombination events that do not diminish the function of any phage protein on which natural selection acts.
In the case under study, this seems an unsatisfactory model. In E.coli K-12, the entire region can be deleted [Raleigh et al. (64) and unpublished observations], so selective constraints are presumably less than is the case for the streamlined phage genomes. Thus, the insertion locations would be expected to vary somewhat in different events. Moreover, the model does not account for the high frequency of highly divergent restriction genes at this location. At least six different events must have occurred among the seven strains, at least at the left (yjiS) border (see Results: A fine scale view).
Differential deletion model. Differential deletion from a large original set of ‘migratory genes’ could in principle account for what we see. Selection for function of the framework genes would be required. This model would require an initial set of 20 different genes. Many of the deletion events would have had to occur in a highly precise fashion to generate the sharp borders observed. Trials on paper (not shown) suggest that 11 different deletion events and (contrary to the model) two small insertion events (of 47 and 57 bp, in E.coli C and O157:H7) would be required to generate the observed pattern. Even more genes and events would be required to generate the novel joints of C and O157:H7 by deletion only. Of those 13 events, eight (including both insertions) would involve one of the two borders in an event with a precise endpoint (five on the left and three on the right; not shown). Deletions would have had to bring the type I genes adjacent to yjiW in two different ways, with the type IA and type IB systems separately eliminated. This mechanism seems excessively baroque, and does not readily account for any of the properties of the locus.
Variation of RM systems within populations has been a topic of interest for other taxa also. In Mycoplasma genitalium, site-specific recombinase action enables reassortment of endogenous Type I specificity domains, such that any cell in the population may express one or two of at least eight possible specificities (65). Horizontal transfer is not needed for this.
In Helicobacter pylori, very large numbers of RM sequences are found in each isolate, with around two dozen candidate systems in each strain (66,67), scattered around the chromosome. Many of the type II system candidates turn out to be inactive or only partially active (68), with non-overlapping sets of four systems active in the two strains examined (67). Reactivation of inactive alleles by horizontal transfer and homologous recombination has been invoked as a regulatory mechanism (69). No locus has been described that contains multiple alternative specificities.
A closer analog of our situation is a variable restriction locus in Neisseria, found between two conserved genes, pheS and pheT (70). The variable region contains between zero and six genes, six of the nine strains contain RM-related genes, and divergence begins immediately after the start and stop codons of the flanking genes pheS and pheT. Saunders and Snyder invoke homologous recombination as a mechanism for reassortment of the contents of the variable sequences (70), much as we see with K1, an apparent recombinant between CFT073-like and K-12-like ancestors. There as here, however, homologous recombination could not account for the insertion of diverse sequence at precise locations. Neither dyad symmetry element nor candidate recombinase is identifiable in Neisseria.
The nature of selective pressures on restriction enzyme function has been considerably debated (see references in 13,17,71). All models for the evolution of these genes proceed from the diversity observed. Three general models have been proposed: that RM systems protect the host from invading DNA (phages or plasmids); that they enable the host to promote or regulate recombination (a sexual function) or that they have selfish features that promote their own spread. In all cases, there must be horizontal DNA transfer events to provide a selection substrate.
The site-specific recombinase model could account for the high frequency of restriction loci at this site, while not excluding that other functions, also advantageous when rare (72), might acquire the ability to circulate here (e.g. the pac gene). This could occur when such genes acquire sequence determinants that allow interaction with the putative recombinase.
Insertion/deletion variation has been extensively discussed in reports on enteric genome sequences, but most discussion has focused on the gene content of the variable islands, not the mechanism of acquisition. Phage-like islands or transposase homologues near island junctions have been noted in some cases, but these are present in only a fraction of examples. Clearly, some mechanisms remain to be discovered.
We thank Noreen Murray, Rich Roberts, Roger Milkman and Romualdas Vaisvila for extensive discussions and intellectual support during conception of this paper. Guy Plunkett shared sequence of the E.coli pathogens in advance of publication. Rob Edwards was sufficiently critical that we re-examined the sequence environment surrounding the proposed attachment site. We thank Don Comb and New England Biolabs for support.
DDBJ/EMBL/GenBank accession no. AY392450