Phage isolation and genome sequencing
The isolation of TM4, Angelica, and CrimD has been described previously, as have their genomic sequences. Phages Anaya and Adephagia were isolated at Calvin College and the University of North Texas, respectively, as part of a freshman research-based course supported by the Howard Hughes Medical Institute (HHMI) Science Education Alliance (SEA). Pixie was isolated at the University of Pittsburgh as part of its Phage Hunters Integrating Research and Education (PHIRE) program. All were isolated by plating environmental samples on lawns of Mycobacterium smegmatis mc2155; Pixie, Adephagia, and Anaya were recovered after enrichment by growth in the presence of M. smegmatis.
Following plaque purification, DNA was isolated, and each genome was shotgun sequenced using 454 technology to at least 25-fold redundancy (~4000 to ~8000 reads per genome). In the cases of Adephagia and Anaya, shotgun Illumina reads (100-fold redundancy) were also generated to strengthen any weak points in the 454 data. Remaining ambiguities and the nature of the genome termini were resolved by targeted Sanger sequencing with oligonucleotide primers using phage genomic DNA as a template. The general genomic features of these phages are shown in .
Genometrics of Cluster K mycobacteriophages.
Cluster and subcluster assignments
Comparison of the genome sequences of phages Anaya, Adephagia, Angelica, CrimD, and Pixie by dotplot analysis shows that they share extensive nucleotide sequence similarity (). The degree of similarity varies among the phages, but none shows substantial DNA similarity to any other sequenced mycobacteriophage (data not shown). Anaya, Adephagia, Angelica and CrimD show strong similarity (>92% pairwise average nucleotide similarity; ANI, ) and constitute Subcluster K1. Both TM4 and Pixie share less than 73% ANI with the other phages, such that TM4 constitutes Subcluster K2, and Pixie is the sole member of the new Subcluster K3 ().
Dotplot comparison of Cluster K genomes.
Electron microscopy of the Cluster K phages shows that they have similar particle morphologies with long flexible non-contractile tails and isometric heads (). The heads of all six phages are approximately 55 nm in diameter and the tails are 185–200 nm long. The Cluster K phages are thus classified morphologically as members of the Siphoviridae. Short side tail fibers at the tip of the tail can be seen on many of the particles.
Virion morphologies of Cluster K phages.
Host-range of Cluster K phages
The host-range of TM4 has been described previously; it is reported to infect fast-growing mycobacteria such as M. smegmatis as well as the slow-growing M. tuberculosis H37Rv and M. ulcerans. However, these reports differ with regard to the infection of M. avium by TM4, with substrains M. avium 701; 6, M. avium 702; 7, M. avium 3746/02 being resistant to infection, whereas substrains M. avium Bridge, serovar 2, M. avium 158, serovar 2, M. avium TMC 1419, serovar 2, and M. avium TMC 1461, serovar 2 are sensitive. Timme et al. (1984) report that TM4 infects nine M. avium strains, all of different serovars. Rybniker et al. (2006) postulate that because TM4 was derived from a putative lysogenic strain of M. avium 6/8 serovar 4, the failure to infect some substrains of M. avium may be due to superinfection immunity conferred by resident prophages.
We tested phages Adephagia, Anaya, Angelica and CrimD as examples of Subcluster K1 as well as TM4 and Pixie for plaque formation on M. tuberculosis mc27000, M. bovis BCG strain Pasteur, M. avium 104, and M. marinum strains M and 927. All six phages infected M. tuberculosis mc27000 efficiently, albeit with different plaque morphologies (); TM4 yields larger clear plaques while Angelica, CrimD, and Pixie produce smaller, turbid plaques. Adephagia and Anaya produce large turbid plaques, although Anaya only produces plaques when incubated at or below 33°C. Only TM4 showed infectivity on M. bovis BCG, although we observed a reduction of efficiency of plating relative to M. smegmatis by between five and six orders of magnitude. No infectivity of M. avium 104 was observed with any of the Cluster K phages tested here. Adephagia, Anaya, Angelica and CrimD showed no infection of either M. marinum strain, although both TM4 and Pixie did, albeit at a greatly reduced efficiency of plating (data not shown). Plaques picked from these plates and re-spotted on lawns of M. marinum did not show an increased ability to infect either M. marinum strain over the parent phages. These plaques were only observed using 0.35% top agar and incubating at room temperature. We do not yet know the basis for these observed reductions in plating efficiencies, although it could be the result of restriction, CRISPRs, abortive infection, or the need for mutations that would expand the host range.
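The reductions in efficiency of plating (EOP) described above are ratios of titers on two hosts, expressed in orders of magnitude. The following is a minimal Python sketch of that arithmetic, assuming EOP is defined as the titer on the test host divided by the titer on the reference host (here M. smegmatis). The function names and the titers are hypothetical and are for illustration only; they are not measured values from this study.

```python
import math


def efficiency_of_plating(titer_test: float, titer_reference: float) -> float:
    """EOP: titer on the test host divided by titer on the reference host."""
    if titer_test < 0 or titer_reference <= 0:
        raise ValueError("titers must be non-negative, reference titer positive")
    return titer_test / titer_reference


def orders_of_magnitude_reduction(titer_test: float, titer_reference: float) -> float:
    """Reduction in plating efficiency, in powers of ten (larger = less efficient)."""
    eop = efficiency_of_plating(titer_test, titer_reference)
    if eop == 0:
        return math.inf  # no plaques on the test host
    return -math.log10(eop)


# Hypothetical titers in pfu/ml (illustrative, not data from the text).
reference_titer = 1e10  # assumed titer on the reference host
test_titer = 5e4        # assumed titer on the test host

reduction = orders_of_magnitude_reduction(test_titer, reference_titer)
print(f"EOP = {efficiency_of_plating(test_titer, reference_titer):.1e}, "
      f"reduction = {reduction:.1f} orders of magnitude")
```

With these assumed titers the reduction falls between five and six orders of magnitude, the range reported for TM4 on M. bovis BCG.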
All Cluster K phages infect M. tuberculosis.
Temperature-dependence of Anaya growth
Despite its genome sequence similarity to the other members of Cluster K1, Anaya does not share the same temperature growth range. No wild-type Anaya plaques were observed on M. smegmatis or M. tuberculosis lawns when plates were incubated at temperatures higher than 33°C; however, it was possible to isolate stable high-temperature resistant Anaya mutants from high titer M. smegmatis infections at 37°C. A wild-type Anaya lysate incubated at 37°C for one hour retained infectivity, indicating that the particles themselves do not dissociate at elevated temperatures. The nature of this temperature sensitivity during infection remains unclear.
Anaya, Adephagia, Angelica, CrimD, and Pixie are temperate phages
It has previously been reported that TM4 behaves as a lytic phage in infection of M. smegmatis or M. tuberculosis, and lysogens have not been reported. Furthermore, the genome of TM4 contains no readily identifiable features to suggest that it is competent to form lysogens. However, during the host range analyses described above, it was evident that all of the other Cluster K phages form turbid plaques on all the susceptible strains tested. The Cluster K1 phages consistently show uniform, medium-sized plaques (~2 mm dia.), although Pixie plaques are smaller, with more variation in size and less turbidity. Using these phages, we successfully recovered lysogenic derivatives of M. smegmatis that both confer immunity to self-superinfection and release phage spontaneously into culture supernatants. Integration of the genomes was confirmed by PCR across one of the putative attachment junctions (see below). Despite further attempts, we were unable to recover any TM4 lysogens.
We have determined the immune specificities of each of the Cluster K phages (). Interestingly, we observed patterns of reciprocal immunity of the K1 and K3 phages, and presumably this homoimmune group of phages has related repressor-operator systems. In contrast, TM4 efficiently infects all of the lysogenic strains tested (). We note that the Adephagia lysogen behaves somewhat differently from the other K1 lysogens and appears to confer at least partial immunity to all of the phages tested, including TM4 and Pixie (). These observations are especially revealing about TM4 and its previously characterized properties. A simple explanation is that TM4 is a relatively recent derivative of a temperate phage that was heteroimmune with other Cluster K phages, but which has lost its immunity functions. When this event happened is unclear; it could have occurred during passage of the phage between its isolation in 1984 and genome sequencing in 1998, during the process of isolation, or at some prior time as a naturally occurring event. This is discussed in greater detail below. We note the obvious parallels to the relationship between D29 and Cluster A phages such as L5. In D29, a 3.6 kbp deletion removes a segment that in L5 contains the repressor, and although D29 is lytic in nature, it is homoimmune with L5.
Immune specificities of Cluster K phages.
Revisions to the TM4 genome annotation
The development of improved bioinformatic tools and the advantages of comparative genome analysis facilitate a revision of the TM4 genome annotation, an important consideration given its widespread use in mycobacterial genetics (Table S1
; , , , , ). We propose that three formerly identified orfs – designated as genes 32
, and 71
, are removed. The first two are very small (90 bp and 150 bp respectively) and show no compelling evidence of coding potential. The third (71
) is somewhat larger (294 bp) but also shows little evidence of coding potential. In the central part of the genome, there was formerly a single small rightwards-transcribed orf (336 bp), designated gene 41
. We propose that this is replaced by three small orfs on the opposite strand, designated as genes 93
(to maintain the prior gene naming scheme). Although all three are small, they all show good coding potential as predicted by GeneMark 
. In addition, relatives of 93
(Pham 1847) are present in Pixie (as gene 74
, ) – although not in the Subcluster K1 phages – and distributed broadly among a diverse collection of different mycobacteriophages (). The closest relative of TM4 gp93 is Pixie gp74 although the proteins share only 53.7% identity, and the route by which TM4 gene 93
arrived at its current genomic location in TM4 is unclear.
Global comparison of Cluster K genomes.
Genome map of Mycobacteriophage TM4.
Genome map of Mycobacteriophage Anaya.
Genome map of Mycobacteriophage Adephagia.
Genome map of Mycobacteriophage Pixie.
In addition, there are five TM4 genes for which an alternative start codon is predicted, genes 12
, and 76
. Two of these were previously annotated to use either AUG (gene 76
) or GUG (gene 26
), and all have been re-annotated to use a UUG start codon, with better predictions for ribosome binding sites and better alignment with the predicted coding potential. Translation start sites for genes 12
, and 66
were changed to more closely reflect the coding potential (Table S1
Cluster K genome organizations
Five of the six Cluster K genomes (Anaya, Adephagia, Angelica, CrimD and Pixie) are of similar lengths (59.1–61.1 kbp) with TM4 (52.7 kbp) being approximately 7 kbp shorter (). All of the viral genomes are linear with defined ends having 3′ single-stranded DNA complementary extensions; all have 11-base extensions with the exception of TM4, which is reported to have a 10-base extension 
. The genomes contain between 90 and 100 predicted protein-coding genes and the four Cluster K1 genomes – Anaya, Adephagia, Angelica and CrimD – all encode a single tRNAtrp
near their left end (see Tables S1
To facilitate comparative genomic analysis, the three newly sequenced genomes were added to the 80 previously described mycobacteriophage genomes to create a database (Mycobacteriophage_83) for the genome comparison program, Phamerator. A total of 9,308 predicted protein-coding genes were compared with each other using ClustalW and BlastP, and assembled into 2,367 phamilies using previously published parameters (manuscript submitted). Of these, 1,120 (47.3%) are orphams (phams containing only a single gene member). The mean pham size is 3.932.
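As a consistency check on these summary statistics, the number of phams implied by the gene total and the mean pham size can be recomputed, along with the orpham percentage. This is a minimal Python sketch; it assumes only that mean pham size equals total genes divided by number of phams, and the variable and function names are illustrative.

```python
# Counts quoted in the text for the Mycobacteriophage_83 database.
total_genes = 9308
orphams = 1120
mean_pham_size = 3.932


def implied_pham_count(genes: int, mean_size: float) -> int:
    """Number of phams implied by mean pham size = genes / phams."""
    return round(genes / mean_size)


phams = implied_pham_count(total_genes, mean_pham_size)
orpham_pct = 100 * orphams / phams

print(phams)                 # 2367
print(f"{orpham_pct:.1f}%")  # 47.3%
```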
An overview of the relationships between the six Cluster K phages is shown in , and several patterns emerge. First, the extent of nucleotide sequence similarity between the genomes is clearly illustrated, emphasizing the close similarity among the Cluster K1 phages and the more distant relationships between these and the Subcluster K2 and K3 phages. The left parts of the Cluster K1 genomes are especially closely related, with greater deviations in the right parts (). Second, the overall genome architecture is shared by all six phages with a substantial number of shared genes, as seen from the commonality of the color-coded pham assignments (). Third, the basis for the smaller size of the TM4 genome compared to both its Subcluster K1 and K3 relatives is apparent, with reductions in size near the left end, in the middle, and at the extreme right end (; see below).
Genome maps of Anaya, Adephagia, TM4 and Pixie are shown in , , , [Angelica and CrimD were reported recently 
and maps are provided as Figs. S1
; the TM4 map () is a revision of that reported previously 
. In all of the Cluster K phages the virion structure and assembly genes occupy the leftmost 22–24 kbp and are transcribed rightwards. There is considerable departure among the genomes at their extreme left ends, with a variable number of small genes of no known function between the terminase large subunit gene and the left physical end. All of the K1 phages, but neither TM4 nor Pixie, contain a tRNAtrp
gene in this region. Within the virion structure and assembly genes there are a few notable differences between the genomes. First, the putative capsid assembly proteases of the K1 and K3 phages are larger than that of TM4 (, , , ) due to a central insertion of about 1.1 kbp. This central portion does not appear to be related to inteins, homing endonucleases, or other mobile elements, but does have weak sequence similarity to parts of methyl-accepting chemotaxis proteins of several bacteria including Planctomyces limnophilus
and Chromobacterium violaceum
; however, it is compositionally biased (rich in alanine) which could account for the weak sequence similarity. The tapemeasure proteins are similar in length with the exception of Pixie gp20, which is 114 amino acid residues longer than the others; Pixie has a correspondingly longer tail than the other Cluster K phages (). To the right of the tail genes are the lysis cassettes, each of which contains a Lysin A gene, a Lysin B gene, and a putative holin gene. However, there is substantial diversity among the Cluster K phages in these genes. For example, the Lysin A of Pixie (gp31) is unrelated to the other Cluster K Lysin A proteins, and is more closely related to the Lysin A proteins of Cluster E phages (sharing, for example, 65% amino acid identity with Cjw1 gp32). The putative holin genes are downstream of the Lysin B genes, each containing 4–5 putative membrane-spanning domains and are only weakly related to each other and not across their entire spans. The 7–8 rightwards transcribed genes to the right of the lysis cassettes (e.g. Anaya genes 34
, ) are of unknown function, although we note that Anaya gene 36
and its relatives in the other five Cluster K phages have relatives in distantly related phages including Propionibacterium acnes
phage PA6. This region is one of the most diverse among the Subcluster K1 phages ().
With the exception of TM4 (see below), integration cassettes containing putative integrase genes and attP
sites are located close to the center of the genomes; the integrases are of the tyrosine recombinase family and the attP
sites are located to the 5′ side of the integrase genes (, , ). The integration cassettes are flanked by a small number of genes transcribed in the leftwards direction, whose function is unknown. Putative Xis genes encoding proteins with MerR-like DNA binding domains are located to the right within an apparently long rightwards-transcribed operon that extends to the right end of the genomes. This region contains WhiB-related proteins, e.g. TM4 gp49, a protein that has been shown to be non-essential for TM4 growth 
although it is well-conserved among the Cluster K phages. Other genes whose functions can be predicted from database similarity searches are those related to SprT (e.g. Pixie gp78), RusA (e.g. Adephagia gp75), HNH homing proteins (e.g. TM4 gp92), glutaredoxin-like NrdH proteins (e.g. TM4 gp67) and a large Primase/Helicase protein (e.g. TM4 gp70). The Subcluster K1 genomes also encode relatives of RtcB (e.g. Anaya gp88), a putative RNA ligase component 
. Because only the Subcluster K1 genomes encode both tRNA and the RtcB proteins, we speculate that these phage-encoded RtcB proteins play a role in protection against a host-mediated tRNA cleavage defense against viral infection 
. The remainder of the proteins encoded in these regions are of unknown function, and we note that about 30% of the Pixie genes in this region are orphams, reflecting its high genomic diversity from all other mycobacteriophages.
TM4 is a derivative of a temperate parent
TM4 was originally isolated by recovery from a strain of M. avium, although understanding its origin is complicated by the observations that it is able to infect the original M. avium strain and does not appear to be temperate in any mycobacterial host (). Because the related Cluster K phages are all temperate, we have investigated potential genes that are deleted in TM4 and that could contribute to a temperate lifestyle.
Because none of the other phages are closely related to TM4 at the nucleotide sequence level (), the most informative comparisons emerge from comparing shared genes with amino acid sequence similarity (). We have focused on two regions of the genomes. The first is at the center of the genomes where the integration cassettes are found in the Subcluster K1 and K3 phages (). TM4 genes 40 and 42 correspond to CrimD genes 38 and 44, such that the three leftwards-transcribed TM4 genes, 93, 94, and 95, occupy the location corresponding to CrimD genes 39 to 43 (). Thus a simple explanation is that TM4 has lost a DNA segment approximately 3.5 kbp in length from a temperate parent that included the integrase gene and attP site. Interestingly, TM4 retains the predicted Xis function encoded by gene 43, consistent with this interpretation ().
Putative deletions giving rise to phage TM4.
The second region of interest is at the right end. The comparison between CrimD and TM4 is perhaps the most informative. CrimD contains homologues of TM4 gp84 and gp85 (CrimD gp83 and gp90), but they are separated by a 3.3 kbp DNA segment containing six predicted open reading frames (). This suggests that TM4 has undergone a deletion of approximately 3.3 kbp between genes 84 and 85 from its putative temperate parent. It is plausible that one of the lost genes corresponds to a phage repressor, consistent with TM4's clear plaque phenotype. We note that the L5 repressor (gp71) is encoded near the right end of its genome, so this is not an unusual genomic position for a repressor gene. Although none of the genes in these regions of the Cluster K1 or K3 genomes have sequence similarity to known repressors, all the K1 and K3 phages are homoimmune and are thus expected to share similar repressors. Pixie is quite different from the K1 genomes in this region, and there is only a single gene that they share in this interval, corresponding to Pixie gp85, Anaya gp90, Adephagia gp86, CrimD gp87, and Angelica gp84. However, preliminary analysis suggests that expression of CrimD gp87 from a plasmid in the host cell does not confer immunity to any of the Cluster K phages, and it is therefore an unlikely repressor candidate.
All of the Cluster K genomes contain a member of Pham2518 with putative DNA binding motifs. We therefore tested whether a member of this phamily, Pixie 81 () confers immunity to superinfection. Expression of Pixie gp81 strongly interferes with Pixie infection (), but has only modest effects on infection with TM4, CrimD, and Adephagia, supporting infection but yielding plaques with increased turbidity. If Pixie gp81 and its relatives encode phage repressors, we would expect to observe immunity to other Subcluster K1 and K3 phages; thus we propose that these proteins are involved in gene regulation but not as phage repressors. Unsuccessful attempts to delete Pixie 81 suggest it is likely to be an essential gene, consistent with this interpretation. TM4 gene 72 has a small internal deletion compared to its relatives but its functionality is unknown.
Expression of Pixie gp81 interferes with infection by Cluster K phages.
Characterization of integration functions
The Cluster K1 mycobacteriophages are unusual in that they are predicted to integrate into a chromosomal attB site that overlaps the host tmRNA gene. Each contains a 24 bp common core segment corresponding to the extreme 3′ end of the tmRNA, suggesting that strand exchange occurs within or close to the segment corresponding to the tmRNA TψC stem. The M. tuberculosis tmRNA gene differs from both M. smegmatis and the phages by a single base within the TψC loop, although this does not appear to interfere with integration since these phages form stable lysogens in M. tuberculosis. These attP common cores are located to the 5′ sides of the integrase genes in each of the K1 genomes (). A search for potential integrase arm-type DNA binding sites in CrimD reveals two pairs of 11 bp repeats, each flanking the common core (), which we have labeled P1, P2, P3 and P4. Sites P3 and P4 are inverted in orientation relative to P1 and P2 (); Anaya, Adephagia, and Angelica have similar organizations. We note that both mycobacteriophages Giles and L5 also contain pairs of putative arm-type sites, although in these examples they are in direct orientation and L5 has several additional arm-type sites.
Organization of the attP sites of Cluster K genomes.
The four CrimD arm-type sites are not identical and vary in two positions (). Angelica contains identical sites to CrimD, but interestingly both Anaya and Adephagia have potential arm-type binding sites with different consensus sequences (). For example, whereas the consensus position 7 in CrimD is a T residue, in Anaya it is a G, and in neither case is there any departure from the consensus (); in contrast, two of the Adephagia sites have T residues, and two have G residues. Because the arm-type sites are recognized and bound by the N-terminal domains of the tyrosine integrases we have compared these regions of the Subcluster K1 integrases (). They are very closely related but contain amino acid substitutions at positions 17 and 19, which are thus candidates for involvement in recognition of the site features that differ between these genomes. This is consistent with the model for arm-type site recognition by the lambda integrase. An intriguing possibility is that Adephagia represents a transitional state between the evolution of a CrimD-type specificity and an Anaya-type specificity (). It is also interesting to note that these Cluster K1 phage integrases are close relatives of some of the Cluster F1 integrases, including Fruitloop gp40 (44% amino acid identity), although these integrate into a different attB site that overlaps a tRNAala gene.
Because the putative Subcluster K1 attB site is distinct from those reported for other mycobacteriophages, this presents an opportunity to construct integration-proficient vectors that are compatible with those derived from L5, Tweety, Giles, Bxb1, and Ms6. To construct a new integration-proficient vector we PCR amplified a segment of the Adephagia genome containing the int gene and attP site and inserted it into a mycobacterial non-replicating plasmid to generate pWHP02. Introduction of pWHP02 into electrocompetent M. smegmatis yielded transformants at a frequency of 5×10^5 transformants/µg DNA. PCR analysis of four independent transformants showed that plasmid integration had occurred at the predicted attB site within the M. smegmatis tmRNA gene (data not shown).
The Subcluster K3 phage Pixie codes for an integrase more distantly related to the K1 integrases, although it shares substantial similarity to other mycobacteriophage integrases including the Tweety (Subcluster F1) integrase (44% amino acid identity) that was characterized previously. The Pixie 47 bp attP common core () is similar to that of Tweety and they are predicted to integrate into the same attB site overlapping a tRNAlys gene (Msmeg_4746). The Pixie attP site has an unusual array of potential arm-type binding sites with a pair to the left of the core (P2 and P3), and a set of three to the right (P4, P5, P6), all in direct orientation (). A sixth site (P1) is located to the left of the core but in the opposite orientation. These correspond closely to a consensus sequence with few departures (). Phages with related integrases (e.g. Tweety) that use the same attB site do not share these arm-type sequences and presumably have different recognition specificities, although this remains ill-defined.
Identification of Start Associated Sequences (SASs)
BlastN comparison of each of the Cluster K genomes against a database of all sequenced mycobacteriophage genomes reveals the presence of short repeated sequences located throughout the Cluster K genomes. The arrangement of these repeats is complex, and although their function is not known, their locations and orientations suggest a possible role in translation initiation. There are fundamentally two types of repeats. The first is a 13 bp asymmetric sequence present in between 11 and 19 copies in each Cluster K genome. The second is a pair of imperfect 17 bp inverted repeats located just upstream of a subset (about 50%) of the 13 bp repeats.
The locations of the 13 bp sequence 5′-GGGATAGGAGCCC
repeats are shown on the genome maps represented in , , , , S1
, and alignments of the sequences are shown in and S3
. There are several striking features. First, it is apparent from the genome maps (, , , ) that these sites are restricted to the right halves of the genomes containing non-structural protein genes. Second, virtually all of the repeats are located within a few nucleotides of the predicted translation start codons of downstream genes, typically 3–7 bp (, S3
), and the start codon most commonly associated with SASs is ATG (80 of 93 sites identified) though ATG in general is only used by about 55% of mycobacteriophage genes. Third, the sequence is non-palindromic notwithstanding the symmetry of the outer parts of the sequence (i.e. 5′-GGGNNNNNNNCCC
), and is typically present in one orientation only (Anaya, Adephagia, Angelica, and CrimD all have a single site in the opposite orientation that is not obviously associated with a gene start; and S3
). Fourth, these sequences are predominantly associated with genes that are separated from their upstream gene neighbors by more than 50 bp, relatively large intergenic regions within the context of typical phage genome organization (Tables S1
). Finally, this sequence is not common among mycobacteriophages, and outside of Cluster K genomes, only Corndog has a single copy with two deviations from the consensus. There is not a single copy of the consensus 13 bp sequence in M. smegmatis
and only four when permitting a single deviation. Likewise there are no exact copies in M. tuberculosis
H37Rv and only two with a single deviation.
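The kind of search described above (exact copies of the 13 bp consensus, or copies with a single deviation, in either orientation) can be sketched as a simple Hamming-distance scan over both strands. This is an illustrative Python sketch, not the procedure used in the study, which relied on BlastN; the function names and the toy sequence are hypothetical.

```python
SAS_CONSENSUS = "GGGATAGGAGCCC"  # 13 bp consensus given in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome: str, motif: str = SAS_CONSENSUS, max_mismatch: int = 1):
    """Return sorted (start, strand, mismatches) for every window within
    max_mismatch substitutions of the motif on either strand.
    start is 0-based on the given (top) strand."""
    genome = genome.upper()
    k = len(motif)
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        for i in range(len(genome) - k + 1):
            mismatches = sum(a != b for a, b in zip(genome[i:i + k], pattern))
            if mismatches <= max_mismatch:
                hits.append((i, strand, mismatches))
    return sorted(hits)


# Toy sequence (hypothetical): an exact forward copy followed by ATG, and a
# reverse-strand copy carrying one substitution.
toy = "AAAA" + "GGGATAGGAGCCC" + "ATG" + "TTTT" + "GGGCACCTATCCC" + "AA"
for start, strand, mismatches in find_sites(toy):
    print(start, strand, mismatches)
```

Setting `max_mismatch=0` restricts the scan to exact copies, which corresponds to the stricter of the two counts reported for the host genomes.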
Location of Start Associated Sequences (SASs).
This conserved sequence is in the position typically occupied by the Ribosome Binding Site (RBS). Indeed, the repeat contains the 5′-AGGAG sequence that is a core component of the Shine-Dalgarno sequence that pairs with the 3′ end of the 16S rRNA during translation initiation, and positions 2–4 of the conserved sequence have the capacity to extend the pairing with 16S rRNA (). However, it seems unlikely that this repeat simply corresponds to just a favorable translation initiation site. First, the starting base of the sequence is extremely well conserved () but has no corresponding base to pair with in 16S rRNA. Second, positions 10–13 are also highly conserved, but do not have pairing potential with rRNA (). Nonetheless, the positioning of these repeats suggests a role in translation initiation – in contrast to the 13 bp stoperator sequences in L5 and other Cluster A phages that play a role in transcription regulation – and we therefore propose that they be called Start Associated Sequences (SASs). Whether these act independently or represent binding sites for either a host- or phage-encoded gene product (either RNA or protein) remains to be determined. The conservation of these sites across the three subclusters – often associated with genes of different phamilies () – strongly suggests that they play important roles for these phages.
Conservation of mycobacteriophage gene phamilies containing SAS sequences in Cluster K phages.
Approximately one half of the genes with an SAS also contain a second sequence feature composed of imperfect 17 bp inverted repeats (IRs) separated by a variable spacer (, S4
). Because these are tightly associated with SASs, we refer to these as extended SASs (ESAS); in one notable exception the inverted repeat upstream of TM4 gene 79
does not appear to be associated with an SAS (). For each genome a consensus sequence can be derived () from the left and right IRs, although the left IRs typically have a closer correspondence to the consensus than the right IRs (, S4
); the spacer region between the IRs is variable, but is 4–13 bp in the vast majority of sites (, S4
). Interestingly, the consensus sequence of the IRs is different for phages of the three subclusters. The four Subcluster K1 phages have very similar IR consensus sequences (, S4
), but differ from those of the Subcluster K2 (TM4) and K3 (Pixie) phages at positions 11, 12 and 13. For example, at position 11, there is predominantly a C in Anaya (in 15 of 16 IRs), but a T in both Pixie and TM4 (16 of 18 and 10 of 12 IRs respectively). At position 12, the C residue is strongly conserved in both Pixie and TM4, with no departures in any of the 30 constituent IRs, but this site is predominantly an A residue in Anaya (two of the 16 IRs have a C). At position 13 Pixie and TM4 have a consensus A residue, with no departures in any of the 30 IRs, whereas in Anaya this site is predominantly a T (two IRs have a G, and one has an A) ().
The ESAS sites are well conserved among the Cluster K genomes, in that if a gene of a particular phamily is associated with an ESAS in one genome, then other Cluster K genomes containing a gene member of that phamily also have an associated ESAS (). A notable exception is TM4 gene 80
(Pham 1364), which lacks an ESAS (it has an SAS), whereas all other phamily members have an ESAS (). Inspection of the TM4 sequence shows that the site is completely lacking, rather than having more highly diverged but related IRs. The conservation of these sites strongly suggests that they serve important functions for the phages, although it is not clear what they are. Because these are closely linked with the SASs that in turn are associated with translation initiation sites, it is tempting to assume that they also play a role in translation initiation. However, there is little support for the possibility that the two IRs form hairpin structures in mRNA, in that departures in the left and right IRs do not generally support RNA base-pairing. Therefore, it seems more likely that these represent binding sites for DNA-binding proteins and that the differences in consensus sequences represent different specificities in the three subclusters. One possible role might be in transcription initiation (i.e. promoters), but alternatively they could be operator sites for phage repressors. This latter explanation is attractive except that the K1 and K3 phages are homoimmune (), which is not consistent with the consensus differences. Furthermore, it is unclear why in virtually every occurrence the IRs are closely associated with translation initiation signals if they are operator sites. Finally, we note that in Pixie and TM4 each 17 bp IR itself has a symmetrical character, and can be considered as a 6 bp half site (5′-TGTTGA
) separated by a 4 bp spacer from the inverse complement (). However, this is not true for the Subcluster K1 phages because of the consensus differences at positions 11–13 (, S4
), as discussed above.
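The half-site symmetry noted for Pixie and TM4 can be made concrete with a short Python sketch that builds a site from a 6 bp half site, a 4 bp spacer, and the inverse complement of the half site, and then tests a sequence for that dyad symmetry. The function names are hypothetical, the spacer is written as N because its sequence is not specified here, and the sketch assumes the half site begins at position 1 of the IR.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T/N sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetric_site(half_site: str, spacer: str) -> str:
    """Half site + spacer + inverse complement of the half site."""
    return half_site + spacer + reverse_complement(half_site)


def has_dyad_symmetry(site: str, half_len: int = 6, spacer_len: int = 4) -> bool:
    """True if the half_len bases after the spacer are the inverse
    complement of the first half_len bases."""
    left = site[:half_len]
    right = site[half_len + spacer_len:2 * half_len + spacer_len]
    return len(right) == half_len and right == reverse_complement(left)


site = symmetric_site("TGTTGA", "NNNN")
print(site, len(site))          # TGTTGANNNNTCAACA 16
print(site[10:13])              # positions 11-13 (1-based): TCA
print(has_dyad_symmetry(site))  # True
```

Under the stated assumption, the symmetric core accounts for 16 of the 17 bp, and positions 11 to 13 read T, C, A, which agrees with the consensus residues reported above for Pixie and TM4 at those positions.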
Characterization of a conditionally-replicating mutant of TM4
Bardarov et al. (1997) described a conditionally replicating mutant of TM4 that fails to form plaques and fails to kill infected cells at temperatures of 37°C or above. This mutant – ph101 – is the basis for the construction of conditionally replicating shuttle phasmids used for delivery of reporter genes, transposons, and allelic exchange substrates to mycobacterial hosts. The mutant was isolated using two rounds of hydroxylamine mutagenesis with the goal of isolating mutants that revert only at very low frequencies. Because the functions of so few TM4 genes are known, we characterized the mutations in ph101.
Sequencing of the complete ph101 genome reveals a total of 23 differences (). One of these is a single-base insertion in a non-coding region at the extreme right end of the genome; the others are all base substitution transitions, consistent with the mutagenic spectrum of hydroxylamine (). The large number of mutations reflects the heavy mutagenesis employed to recover the non-reverting mutants. Twelve of the base substitutions do not alter the predicted coding sequences, whereas the other ten do and are therefore candidates for contributing to the temperature-sensitive phenotype. Because the reversion frequency of ph101 is low (<10⁻⁸), it is likely that more than one mutation contributes to this phenotype. Three of the affected genes are predicted virion structural genes (8, 20, 23) and are unlikely to be involved in DNA replication ().
Mutations in the ph101 genome relative to TM4.
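The distinction between transitions and transversions that underlies the comparison with the hydroxylamine spectrum can be sketched as follows. A transition exchanges a purine for a purine or a pyrimidine for a pyrimidine; any other substitution is a transversion. The function names and the toy list of changes are hypothetical and do not reproduce the ph101 mutation table.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and altered base are identical")
    # Transition: both bases belong to the same chemical class.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def summarize(substitutions):
    """Tally transitions and transversions in a list of (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in substitutions:
        counts[classify_substitution(ref, alt)] += 1
    return counts


# Hypothetical example changes, NOT the actual ph101 mutations.
toy_changes = [("C", "T"), ("G", "A"), ("C", "T")]
print(summarize(toy_changes))

# Bookkeeping from the text: 1 insertion + 12 silent + 10 coding changes.
print(1 + 12 + 10)
```

A mutation set made up entirely of transitions, as reported for the ph101 substitutions, would give a transversion count of zero.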
To gain insight into which of the mutations contribute to the temperature-sensitive phenotype, we isolated five independent revertant mutants (C, D, F, G, and J) that are able to grow at 37°C, and then PCR-amplified and sequenced the regions containing the ten non-synonymous mutations (). Revertants D, F, G, and J each contain nucleotide changes back to the wild-type sequence at mutations #10 and #14 (, ), suggesting strongly that TM4 genes 48 and 66 contribute to the temperature-sensitive conditionally replicating phenotype. The involvement of gene 48 was somewhat surprising because the deletion in phasmid phAE159 removes the C-terminal 12 codons of gene 48 (see below). However, this region is poorly conserved among the TM4 gp48 relatives and is presumably not required for their function.
Mutations contributing to the conditionally replicating phenotype of TM4 mutant ph101.
The fifth mutant (C) also has the reversion to wild-type sequence in gene 66, but retains mutation #10 in gene 48. However, it contains an additional mutation in codon 186 of gene 48 that presumably provides intragenic suppression of the first gene 48 mutation. This mutant also contains an apparent single-base insertion at coordinate 31,784 within gene 42, and although this is unlikely to contribute to the temperature-sensitive phenotype, it suggests that gene 42 is not essential. The specific roles of gp48 and gp66 are not known and neither has known non-mycobacteriophage homologues, but these data strongly suggest that both are required for normal replication of TM4.
All five mutants isolated at 37°C form plaques at 42°C with an efficiency of plating of approximately 10⁻⁴, suggesting that reversion of a third mutation is required to restore the full wild-type TM4 phenotype. Two independent mutants recovered at 42°C were analyzed as described above and both were found to contain a single additional base change that restores the wild-type sequence at mutation #4 in gene 20 (, ). Because gp20 is a putative virion structural protein (), the mutation in gene 20 is likely to contribute to the temperature-sensitive phenotype, but not to the conditionally replicating property of ph101.
Sequencing of the cosmid-phage junctions in the shuttle phasmid phAE159 shows that a 5.8 kbp region between coordinates 33,877 and 39,722 is deleted and is therefore non-essential for TM4 growth (). This deletion removes all of genes 49 to 63 and the first sixteen codons of gene 64, presumably rendering it non-functional. The deletion also removes the last 12 codons of gene 48, but the extreme C-terminus of TM4 gp48 is not well conserved, and this modestly truncated product may retain its functionality. However, the proline residue altered in ph101 is absolutely conserved among all six related protein sequences.
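The size of the phAE159 deletion follows directly from the reported coordinates, as the short arithmetic sketch below shows. Whether the endpoints are counted inclusively is an assumption made here for illustration (it changes the result by a single base pair), and the helper names are hypothetical.

```python
CODON_LENGTH_BP = 3  # nucleotides per codon


def span_length(start: int, end: int, inclusive: bool = False) -> int:
    """Length in bp between two genome coordinates.

    Set inclusive=True if both endpoint coordinates are themselves deleted;
    which convention applies to the reported coordinates is an assumption.
    """
    if end < start:
        raise ValueError("end coordinate must not precede start coordinate")
    return end - start + (1 if inclusive else 0)


def codons_to_bp(n_codons: int) -> int:
    """Convert a number of codons to base pairs."""
    return n_codons * CODON_LENGTH_BP


deleted_bp = span_length(33_877, 39_722)
print(deleted_bp, round(deleted_bp / 1000, 1))  # length in bp and in kbp
print(codons_to_bp(12), codons_to_bp(16))       # gene 48 and gene 64 truncations
```

Under either convention the span rounds to 5.8 kbp, in agreement with the figure quoted in the text; the truncations of genes 48 and 64 correspond to 36 bp and 48 bp of coding sequence, respectively.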
We have described here the genomes of the Cluster K group of mycobacteriophages, providing insights into the origins of the widely-used mycobacteriophage TM4, the genetic basis of a conditionally-replicating mutant of TM4, and a variety of enticing genomic features indicative of interesting but not yet understood biological behavior. The presence of short repeated sequences suggests regulatory features that have yet to be fully understood, but these could also be targets for homologous recombination and thus contribute to the mosaic nature of these genomes. The Cluster K phages clearly have a combination of features that make them particularly attractive for advances in tuberculosis genetics. First, all of the Cluster K phages infect both M. tuberculosis and M. smegmatis and appear to have relatively broad host ranges. Second, apart from TM4, all of them are temperate and form stable lysogens. Third, the genomes are relatively small – all are shorter than the average mycobacteriophage genome size of ~70 kbp – and are amenable to manipulation using shuttle phasmid and recombineering strategies