We report the genome sequence of Bacillus subtilis phage SPO1. The unique genome sequence is 132,562 bp, and DNA packaged in the virion (the chromosome) has a 13,185 bp terminal redundancy, giving a total of 145,747 bp. We predict 204 protein coding genes and five tRNA genes, and we correlate these findings with the extensive body of investigations of SPO1, including studies of the functions of the 61 previously defined genes and studies of the virion structure. 69% of the encoded proteins show no similarity to any previously known protein.
We identify 107 probable transcription promoters; most are members of the promoter classes identified in earlier studies, but we also see a new class that has the same sequence as the host sigma K promoters. We find three genes encoding potential new transcription factors, one of which is a distant homologue of the host sigma factor K. We also identify 75 probable transcription terminator structures. Promoters and terminators are generally located between genes and together with earlier data give what appears to be a rather complete picture of how phage transcription is regulated.
There are complete genome sequences available for five additional phages of Gram-positive hosts that are similar to SPO1 in genome size and in composition and organization of genes. Comparative analysis of SPO1 in the context of these other phages yields insights about both SPO1 and the other phages that would not be apparent from the analysis of any one phage alone. These include assigning identities and probable functions for several specific genes, and inferring evolutionary events in the phages’ histories. The comparative analysis also allows us to put SPO1 into a phylogenetic context. We see a pattern similar to what has been noted in phage T4 and its relatives, in which there is minimal successful horizontal exchange of genes among a “core” set of genes that includes most of the virion structural genes and some genes of DNA metabolism, but there is extensive horizontal transfer of genes over the remainder of the genome. There is a correlation between genes in rapid evolutionary flux through these genomes and genes that are small.
Bacteriophage SPO1 is one of a group of large, lytic, tailed bacteriophages of Bacillus subtilis, which contain hydroxymethyluracil (hmUra) in place of thymine in their DNA1;2. It was isolated in 1964 from soil near the Genetics Department of the University of Osaka, Japan3. This group includes B. subtilis phages SPO1, SP82, Φe, 2C, SP8, H1, Φ25, and SP5C, all of which are strikingly similar, but not identical, with respect to all parameters examined to date1; 4;5. In addition, five phage genomes from a wider range of Gram-positive hosts have been sequenced and have a significant level of similarity to SPO1 (Staphylococcus phages K, G1 and Twort, Listeria phage P100 and Lactobacillus phage LP65). SPO1 is by far the most intensively studied of all these phages, and it was the subject of pioneering studies in (1) the first demonstration of a cascade of sigma factors for regulation of a program of sequential transcriptional activation of gene expression6–9, (2) the first split gene discovered in Gram-positive bacteria, whose intron encodes a homing endonuclease that catalyzes the transfer of the intron to related genes lacking the intron10; 11, (3) formation of concatemers from overlapping 13.2 kb terminally redundant sequences, which encode many of the proteins necessary for host-takeover12–14, and (4) the first example of a prokaryotic type II DNA binding protein encoded by a bacteriophage genome, which binds preferentially to DNA of that phage, causing DNA bending and facilitating diverse elements of the phage infection program, including the transition from middle to late gene expression2;15. In addition, SPO1’s virion structure and morphogenesis have been studied extensively16–18.
As a tool for further understanding the biology of this wonderfully complex bacteriophage, we determined the complete nucleotide sequence of its genome. Among the new insights from the sequence are the following: (1) We identify three new putative transcription factors encoded by the SPO1 genome, each of a different type, and each offering promise of explanations for hitherto unexplained complexities in SPO1’s program of gene activity. (2) Despite the similarity between the events of SPO1 infection and those of other large lytic phages, many SPO1 gene products show no detectable similarity to any other previously known protein, which emphasizes the extreme diversity among the extant tailed phages. (3) We clarify the phylogenetic position of SPO1 in relation to other large lytic tailed phages and to the more closely related large Gram positive phages with isometric heads and contractile tails. (4) We identify 58 predicted transcription units that cover the SPO1 genome, each terminated by a rho-independent transcription terminator, and initiated by one or more promoters of the three previously identified or the one newly-identified classes of promoters, accounting for transcription of all but three of the identified genes. (5) We identify genes specifying some, but not all, of the enzymes previously identified biochemically in studies of infection by SPO1 or hmUra-containing close relatives. (6) We show that early genes are unusually small, like those previously identified in the host-takeover module. Small genes are also over-represented among genes that are in relatively rapid flux through the genome. (7) By N-terminal sequencing, we identify the genes that encode four of the major structural proteins of the virion, including the major capsid protein, which is processed by removal of the N-terminal 23 amino acids.
The complete nucleotide sequence of the 132,562 bp SPO1 genome was determined; the DNA in the virion (the chromosome) has a 13,185 bp terminal redundancy, making the size of the linear virion DNA 145,747 bp. The sequences of several genome segments were determined previously9; 10; 13; 19–29, some of which differ slightly from the present sequence. The G+C content of the SPO1 unique sequence is 40.0%, similar to the 43.5% of its B. subtilis host. The sequence has been deposited in GenBank, with Accession number FJ230960.
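The length bookkeeping in this paragraph can be checked with a few lines of Python. This is only a sketch: the two lengths are the values reported above, while the helper names `virion_dna_length` and `gc_percent` are illustrative and not part of any published pipeline.

```python
# Lengths reported in the text (base pairs).
UNIQUE_BP = 132_562
TERMINAL_REDUNDANCY_BP = 13_185


def virion_dna_length(unique_bp: int, redundancy_bp: int) -> int:
    """Packaged DNA = unique sequence plus one extra copy of the terminal redundancy."""
    return unique_bp + redundancy_bp


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide string (hypothetical helper)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


total_bp = virion_dna_length(UNIQUE_BP, TERMINAL_REDUNDANCY_BP)
print(total_bp)
```

Applying `gc_percent` to the deposited sequence (GenBank FJ230960) would be the way to recompute the G+C value; the sequence itself is not embedded here.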
Two hundred and four putative protein-coding genes and five tRNA genes were identified in the SPO1 genome, counting only once the thirty genes present in each copy of the terminal redundancy of the chromosome. These predicted genes are listed in Supplementary Material Table S1, and the genome map is shown in Figure 1. These 209 SPO1 genes are tightly packed in the genome; 45 genes overlap with the preceding gene by amounts ranging from 1 to 68 bp. Eighty-nine percent of the genome sequence is occupied by protein-coding genes, which is similar to, but a few percent lower than, what has been seen in other phages. Most intergenic spaces are occupied by putative transcriptional regulatory sequences (promoters and terminators), and almost all of the genome appears to be transcribed (see below). Promoters and terminators are more numerous in SPO1 than in most characterized phages, and this accounts for the slightly lower fraction of protein-coding sequence in SPO1. Gene names were assigned so as to retain, where possible, the numbering of previously identified genes2; 13; 30, which were numbered 1 through 60, ordered from left to right on the genome, beginning at the left end of the non-redundant portion of the sequence. Other genes identified in the sequence were assigned decimal numbers, so as to maintain numerical order across the genome. This scheme preserves as much as possible the connections between the gene names and more than 40 years of experimental work on SPO1.
Okubo et al.30 isolated nonsense mutations in SPO1, sorted them into 36 complementation groups, and mapped them by recombination frequency, thereby defining genes 1 to 36. For many of these genes, several mutants were isolated, suggesting that a significant fraction of the essential genes in the SPO1 genome had been identified. Glassberg et al.31 searched extensively for temperature-sensitive and replication-deficient mutants, and all of these mutations were found to be in the genes already defined by Okubo et al.30, supporting the notion that a large majority of essential genes have been identified. In addition, nonsense mutations have been introduced, by site-specific mutagenesis, into a number of SPO1 genes that had not been found in the searches for conditional lethal mutations. In every case except one, the nonsense mutation did not prevent plaque formation on non-suppressing strains, indicating that the encoded protein is not essential to SPO1 growth in the laboratory27; 32 (Stewart, C., Myles, B, Laughlin, L., Gupta, N., unpublished observations). The one exception we are aware of is the gene encoding TF1, in which nonsense mutations were so detrimental that they prevented growth even on suppressing strains33, which accounts for the absence of TF1 mutants in the Okubo et al. collection. It thus appears that possibly as few as 20% of the SPO1 genes are essential. This may indicate either functional redundancy of genes (e.g., gp44 and gp51, below), and/or that many SPO1 gene products are needed only under conditions different from those used in the laboratory. Table S2 lists these 37 essential genes. Eight of these genes had been sequenced previously, and thus could be located unequivocally in the genome sequence. With one exception, the other essential genes could not be located unequivocally in the genome sequence, so they were assigned decimal numbers as discussed in Materials and Methods. 
Table S2 shows, for each of the essential genes, which of the genes in the sequence could be its possible location.
One of the earliest examples of a prokaryotic intron was discovered in gene 31 of SPO110. While this group I intron in its DNA polymerase gene is the only apparent intron in SPO1, the set of six completely sequenced SPO1-related phages contains at least eighteen group I introns, of which fifteen are distinct sequences (these are listed in Table S3). Nine have been previously noted4; 10; 34; 35. Including the SPO1 DNA polymerase intron, about half of the introns (eight of fifteen distinct cases) encode recognizable ORFs encoding putative homing endonucleases, mostly of the HNH endonuclease family36–38. Six introns reside within DNA polymerase genes of four phages (SPO1, P100, with two identical intron pairs in G1 and K). The P100 DNA polymerase intron is clearly related to the first intron of G1 and K, although it lacks the internally encoded protein. Four related introns occur in the large terminase subunit genes, with two in Twort and one each in LP65 and P100. Three introns reside in ribonucleotide reductase genes of P100 and Twort39, with two in the Twort gene. Three related small introns occur in the putative tail tube gene of Twort, Orf106 (also called Orf142)40. Finally the lysin genes of K and G1 also contain an intron34. Twort has the most with seven introns; P100, G1, and K have three each; while LP65 and SPO1 have only one each. Inteins are relatively rare in these phage genomes, with one example in SPO1 gene 34.33 and another appearing in Twort gene 6. Curiously, both genes encode a probable DNA helicase, but of different types.
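The intron counts in this paragraph can be cross-checked by tallying them per host gene and per phage. The sketch below simply transcribes the numbers stated above into a dictionary; the variable and function names are illustrative.

```python
from collections import Counter

# Intron counts as stated in the text: host gene -> {phage: number of introns}.
INTRONS = {
    "DNA polymerase": {"SPO1": 1, "P100": 1, "G1": 2, "K": 2},
    "large terminase subunit": {"Twort": 2, "LP65": 1, "P100": 1},
    "ribonucleotide reductase": {"P100": 1, "Twort": 2},
    "putative tail tube": {"Twort": 3},
    "lysin": {"K": 1, "G1": 1},
}


def introns_per_phage(table):
    """Sum intron counts over all host genes for each phage."""
    totals = Counter()
    for per_phage in table.values():
        totals.update(per_phage)
    return totals


totals = introns_per_phage(INTRONS)
for phage, count in totals.most_common():
    print(f"{phage}: {count}")
print("all phages:", sum(totals.values()))
```

The per-phage totals reproduce the figures quoted in the text (seven for Twort, three each for P100, G1 and K, one each for LP65 and SPO1), and the grand total is eighteen.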
All of these completely sequenced SPO1-like phages contain multiple putative tRNA genes (Table S8). P100 has eighteen41, LP65 has seventeen42, SPO1, G134 and K35 have five each, and Twort has two34. The two tRNAs identified in Twort are quite similar to a pair of adjacent tRNAs in K and G1, which are however in a different position in the gene order in these phages. The tRNAs in SPO1 encoded by genes 2.32, 2.41 and 2.61 are closely related to bacterial tRNA genes, while the others make no significant database matches.
We have employed a variety of strategies to determine the putative functions of the 209 SPO1 gene products, and the predicted assignments are listed in Table S1 and indicated in Figure 1. These include database comparison information, phenotypes of SPO1 mutants, and biochemical analysis of encoded proteins. In total, 64 genes are similar to genes in the extant sequence databases, and an overlapping set of 63 genes have been assigned experimentally determined or predicted functions; the functions of nineteen of these gene products (gp6.1, gp16.1, gp16.2, gp27, gp28, gp29, gp29.2, gp30, TF1, gp31, gp33, gp34, gp38, gp39, gp40, gp44, gp50, gp51, and gp56) have been studied experimentally, and the others have homology to proteins with known functions. (This list does not include those of the essential genes, discussed above, that were defined only by genetically mapped mutations that could not be located in a single gene in the genome sequence. The phenotypes caused by those mutations are listed in Table S2.) Thus 141 of the 204 SPO1 protein gene products (69%) show no similarity to any known protein. This high proportion of novel genes reflects the findings in other recently sequenced bacteriophage genomes43; 44 and supports the idea that bacteriophage gene discovery is still in its early stages 45. There is considerable clustering of genes with related functions in the SPO1 genome, but the clustering is far from absolute.
SPO1 virions are large and complex. They have an isometric T=16 icosahedral head 108 nm in diameter17; 46, a contractile tail about 140 nm long, a hexagonal baseplate, and bushy tail fibers or spikes18. The complex virions may contain as many as 53 different types of protein molecules16. Yet mutations in only 21 of the genetically identified complementation groups affect assembly, and only eleven genes in the SPO1 genome encode proteins with detectable similarity to putative morphogenetic proteins of other phages (Table S1). Furthermore, we find that the virion assembly gene cluster described below contains only 32 genes; thus, even though some assembly genes may be outside this cluster, the earlier estimates of the number of virion structural genes seem likely to be somewhat high (we note the possibility that some of the virion proteins could be present in multiple post-translationally processed forms).
To identify some SPO1 virion protein genes unambiguously, we separated the proteins of purified virions by SDS polyacrylamide gel electrophoresis and attempted to determine the N-terminal amino acid sequence of five of these proteins. Figure 2 shows such a gel with the amino acid sequences that were obtained; three sequences clearly identify genes 6.1, 16.1, and 29.2 as encoding major virion components. A fourth sequence from a gel band of ~95 kDa gave a weak signal of NH2-XXXNK. This matches four N-terminal sequences in the SPO1 proteome, but only the product of gene 16.2 has a large enough predicted mass (110 kDa) to account for the mobility of the gel band, with the other three possibilities ranging in size from 73 to 171 amino acids. We therefore tentatively identify gp16.2 as a virion component. A fifth protein of about 18 kDa did not give a signal, despite its high abundance, and it may have a blocked N-terminus.
The product of gene 6.1 is by far the most abundant protein in virions and so is certainly the major head shell (coat) protein; its T=16 geometric arrangement46 predicts that the virion contains 955 coat protein molecules if five of the predicted subunits are replaced at the tail vertex, as is the case with other tailed phages. It is 26% identical to a protein (named Cps) similarly identified as the coat protein of Listeria monocytogenes phage A51147, and these two have similar levels of identity to the annotated (putative) coat proteins of Listeria phage P10041, Staphylococcus phages Twort34, K35 and G1 (Kwan et al., 2005), and Lactobacillus phage LP6542. Leptospira phage LE1 coat protein is recognizably similar but is somewhat less closely related48. The SPO1, K and A511 putative coat proteins have N-termini that do not correspond to the beginning of the gene that encodes them, and the mature protein’s N-terminal amino acid is encoded by codon 24 of the coat protein gene in all three cases. Parker and Eiserling16 previously concluded that the major coat protein of SPO1 is proteolytically cleaved after its synthesis, that about 2 kDa are removed from the primary translation product, and that, like other cleaved phage coat proteins, the cleavage is dependent upon head assembly16. Our results agree in that 2326 Da are removed from the N-terminus. The cleavage is between residues K23 and A24 in SPO1 and residues K23 and S24 in K and A511. Comparison of their coat proteins suggests that, like those of SPO1, K and A511, the other five SPO1-like coat proteins likely also undergo a similar cleavage: all five have a potential cleavage site K-S sequence at amino acids 24 (P100, Twort, G1), 42 (LP65) or 51 (LE1); however, their coat proteins have not been studied in this regard. Proteolytic removal of the N-terminus of phage coat proteins is common, and many phage coat proteins undergo assembly-dependent cleavage during virion assembly (reviewed in ref. 49). 
Such coat protein cleavages are typically performed by phage-encoded proteases, and interestingly, the SPO1 gene 3.2 product (and its homologues in P100, Twort, K, G1 and LP65) has sequence similarity to the N-terminal region of tailed phage procapsid proteases of the HK97 family; this corresponds to the catalytic domain of these proteases.50 The SPO1-family protease gene is located two genes transcriptionally upstream from the coat gene, a typical location for such genes.
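The coat protein copy number quoted above follows from the standard icosahedral relation of 60 subunits per unit of triangulation number, less the five positions occupied by the portal at the tail vertex. A minimal sketch, with an illustrative function name:

```python
def coat_subunits(t_number: int, portal_vertex_subunits: int = 5) -> int:
    """Coat protein copies in an isometric icosahedral capsid.

    A full T-number lattice has 60 * T subunit positions; in tailed phages
    one pentameric vertex (5 positions) is taken by the portal instead.
    """
    return 60 * t_number - portal_vertex_subunits


print(coat_subunits(16))
```

For T=16 this gives 960 lattice positions and 955 coat protein molecules, matching the number in the text.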
The similarity between the SPO1 coat protein and that of phage LE1 is of interest because the 73,623 bp LE1 genome48; 49 is about one-half the size of the SPO1 genome. The heads of the tailed phages are typically packed with DNA to similar densities49, and the reported 85 nm diameter of the LE1 head51 would give it about half the internal volume of the SPO1 head. The observation that lattice constants for tailed phage capsid lattices are very similar in different phages strongly suggests that SPO1 and LE1 have different triangulation (T) numbers52. Current evidence suggests that all tailed phage coat proteins may be homologous, including those of phages T4, HK97, P22, ε15 and φ29 (T=13 [prolate], 7, 7, 7, and 3 [prolate] respectively), since they have similar folds53; 54; 55; 56, but it is not always possible to infer homology between these coat proteins on the basis of sequence alone. SPO1 and LE1 appear to provide an example of capsid protein sequences that have diverged recently enough that they retain detectable sequence similarity, and yet have evolved to assemble capsids with different T numbers. Other pairs of phage capsids with such a relationship are only recently coming to light; perhaps phages T4 and Syn9 provide the clearest example. The capsid protein sequences of these phages are 38% identical but the triangulation numbers are prolate T=13 and isometric T=16, respectively57. The existence of such pairs of phages informs how we can think about the evolution of capsid size, and therefore genome size, in phages.
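The "about half" comparison of head volume and DNA content can be made explicit with a rough calculation. The sketch below assumes spherical heads, uses the reported outer diameters (108 nm and 85 nm) as a proxy for internal dimensions, and compares LE1 with the packaged SPO1 DNA of 145,747 bp; these simplifications are assumptions of the sketch, not statements from the text.

```python
import math


def sphere_volume(diameter_nm: float) -> float:
    """Volume of a sphere of the given diameter, in cubic nanometres."""
    radius = diameter_nm / 2.0
    return (4.0 / 3.0) * math.pi * radius ** 3


SPO1_HEAD_NM = 108.0        # reported SPO1 head diameter
LE1_HEAD_NM = 85.0          # reported LE1 head diameter
SPO1_PACKAGED_BP = 145_747  # SPO1 virion DNA, including terminal redundancy
LE1_GENOME_BP = 73_623      # LE1 genome

volume_ratio = sphere_volume(LE1_HEAD_NM) / sphere_volume(SPO1_HEAD_NM)
dna_ratio = LE1_GENOME_BP / SPO1_PACKAGED_BP

print(f"LE1/SPO1 head volume ratio: {volume_ratio:.2f}")
print(f"LE1/SPO1 DNA length ratio:  {dna_ratio:.2f}")
```

Both ratios come out close to one-half, which is consistent with the two phages packing DNA to similar densities in heads of different size.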
The head genes, and to a slightly lesser extent the tail genes, are strongly conserved among the six SPO1-like phages (Figure 3), and this means that features of the genes or of the gene organization may stand out because of their conservation even though they might escape notice in the examination of a single genome sequence. A surprising feature of the head gene region is that the genes for the head maturation protease and the major head subunit have sequence homology (are classified in the same sequence Phamily, see below), a relationship that has not been seen in other phages. Examination of these sequences shows that there is a short region of rather weak sequence similarity between the two proteins, as shown in Figure 4a. This sequence similarity is slight enough that it would likely have been discounted if it had been detected in a single genome, but the relationship is conserved across all six of these phages, lending support to its significance. The part of the SPO1 protease sequence that matches most other phage maturation proteases does not overlap with the part of the sequence that is homologous to the head protein (Figure 4a). The part of the sequence that matches other proteases corresponds to the N-terminal portion of those enzymes, the locus of the catalytic activity50. This is reminiscent of the case of the proteases of phages λ and Mu, which contain not the head protein sequence but the scaffolding protein sequence C-terminal to the protease sequence58. It has been suggested in these cases that the shared sequence between the protease and scaffolding protein may facilitate assembly of the protease into the procapsid through homotypic interactions between protease and scaffolding protein49. In an analogous way, we can imagine that the SPO1 protease might assure its assembly into the procapsid through homotypic interactions between the protease and the major head protein. 
A somewhat different similarity to the SPO1 case is seen in a sequence relationship found in phage T4 and some of its relatives (R. Hendrix and C. Georgopoulos, unpublished). Here a short sequence (in gene 61.1) matching the corresponding part of the head protein is found not as part of the protease protein but encoded as a separate short protein (Figure 4b). What the functional relationships are, if any, among these different examples of sequence relationships remains to be determined, but we note that comparative sequence analysis as described here is a productive method for bringing such relationships to light for further examination.
Portal proteins form the hole through which DNA enters and leaves the capsid, and the genes encoding portals in other phages are typically located immediately upstream of the procapsid protease gene49. SPO1 gene 3.1, which is in this position, has convincing portal protein homology. Other than a large terminase subunit gene (see below), there are no other matches in the SPO1 genome to known head assembly genes; however, an approximately 28 kDa protein was shown to undergo assembly-dependent proteolytic degradation, and it is not present in virions16. We find that gene 3.3 lies in the position that is typical of scaffolding protein genes (between the portal and coat protein genes), and like other known scaffolding proteins gp3.3 (MW=26.5 kDa) is hydrophilic and predicted to have a high α-helix content; it is common for scaffolding proteins to be degraded by a phage-encoded protease49. Thus, gene 3.3 seems likely to encode the head assembly scaffolding protein.
Proteins from several substructures of the contractile tail of SPO1 have been analyzed16. It was concluded that there might be as many as 34 proteins in the tail. Fourteen of the genetically defined SPO1 genes (“7” through “20”) have an effect on tail assembly; these original genes all lie in the interval from gene 3.2 through 19.4 (Table S2). Twelve SPO1 gene products have similarity to tail proteins of other phages (Table S1). Most convincing is that gene 9.1 encodes the tail sheath protein. Although phage tail proteins are notoriously variable in sequence, sheath protein is one of the most conserved (presumably because it carries out a critical structurally-conserved rearrangement during DNA injection), and SPO1 gp9.1 has similarity to proteins in many known members of the Myoviridae, including weak but convincing similarity to the well-known sheath proteins from phages T4, P2 and PBSX (14–16% identity over most of the protein in these cases). In addition, Parker and Eiserling16; 18 showed that the SPO1 sheath protein is about 60 kDa, and the sequence predicts gp9.1 to be 61.3 kDa. Gene 11.2 has similarity to known tapemeasure proteins (proteins that determine tail length) including 17% identity over 418 amino acids of the experimentally studied tapemeasure of phage TP901-1 (ref. 59). We note that there is a reasonable correlation between the length of this protein and tail length in the known SPO1-like phages, ranging from tail lengths of 140 nm and 890 amino acids in SPO1 to 210 nm and 1351 amino acids in phage K (ref. 60). The ratios of these numbers imply ~1.5 Å of tail length per amino acid of tapemeasure protein, in agreement with what has been seen for other phages61; 62; 63, and so is consistent with the inference that the tapemeasure protein acts as an α-helix.
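The tail length per tapemeasure residue can be computed directly from the two pairs of numbers given above. This is a sketch with an illustrative function name; the comparison value of about 1.5 Å is the axial rise per residue of a canonical α-helix.

```python
def angstrom_per_residue(tail_length_nm: float, tapemeasure_residues: int) -> float:
    """Tail length (converted from nm to angstroms) per tapemeasure amino acid."""
    return tail_length_nm * 10.0 / tapemeasure_residues


# (tail length in nm, tapemeasure length in amino acids), as quoted in the text.
PHAGES = {
    "SPO1": (140.0, 890),
    "K": (210.0, 1351),
}

rise = {name: angstrom_per_residue(nm, aa) for name, (nm, aa) in PHAGES.items()}
for name, value in rise.items():
    print(f"{name}: {value:.2f} angstrom per residue")
```

Both phages give a value a little above 1.5 Å per residue, in line with the α-helical rise.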
In phages with contractile tails, the tail tube subunit protein is typically encoded by a small gene following the tail sheath gene. SPO1 gene 10.1 fits this description, despite its lack of a sequence match to any experimentally verified tail tube subunit, and there are homologues at the corresponding position in all six SPO1-like phages (Figure 5). The observations that SPO1 gene 10.1 is in the expected position and has the expected size for a tail tube gene, that there are homologues in the corresponding positions of the other five phages, and that the other two SPO1 genes in this region that might be candidates for the tail tube gene can be assigned different, position-appropriate functions (see below), lead us to propose that SPO1 gp10.1 and its homologues in the other phages are the tail tube subunits. We suspect that this protein is the abundant ~18 kDa virion protein from which we were unable to obtain an N-terminal sequence (above).
In many phages, the tail tube subunit gene is followed by a pair of overlapping open reading frames whose expression includes a programmed translational frameshift64; 65. This frameshift transfers a fraction of the ribosomes translating the upstream reading frame into the second open reading frame (ORF). The proteins produced have an essential role in facilitating assembly of the tail tube around the tail length tapemeasure protein. The frameshifting genes are typically transcribed immediately upstream of the tail length tapemeasure gene. In SPO1 there are just two ORFs (genes 10.2 and 11.1) between the putative tail tube gene (10.1) and the tapemeasure protein gene (11.2) (Figure 5), and these overlap in such a way that a +1 frameshift would shift between them, so they are attractive candidates for the SPO1 frameshifting genes. There is in fact a potential +1 “slippery site” 7 codons upstream from the termination codon of the upstream gene where a tRNALys reading AAA could slip forward to AAG (Figure 6). Because a +1 frameshift has a smaller signature in the sequence than the more common -1 frameshift66, this identification is weak in the absence of corroborating information. However, the other five phages in the group also have two overlapping ORFs immediately upstream from their tapemeasure protein genes, in the same frame relationship as the SPO1 genes, and all have a similar potential +1 slippery site near the termination codon of the upstream gene (Figure 6). Another feature shared by all six phages is that the downstream ORF of the putative frameshifting pair lacks a convincing Shine-Dalgarno translational start sequence. This is striking in these phages which otherwise have Shine-Dalgarno sequences with a very strong match to consensus; this buttresses the contention that ribosomes enter these ORFs by a mechanism other than standard initiation. 
We conclude that all of these phages very likely use the frameshift mechanism with a +1 frameshift between the two genes identified in this analysis. The comparative analysis is especially helpful for phages K, G1, Twort, and LP65, since these phages have insertions of one to several extra genes between their putative tail tube genes and frameshift genes.
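The kind of scan that would flag a candidate +1 slippery site of the type described above can be sketched as follows. The function looks, near the 3' end of an upstream ORF, for an in-frame AAA codon immediately followed by G, so that a tRNA reading AAA could re-pair with AAG after slipping forward one nucleotide. The sequence used here is a made-up toy example, not the SPO1 gene 10.2 sequence, and the function name and window size are assumptions of the sketch.

```python
def plus_one_slip_sites(orf: str, window_codons: int = 10):
    """Find in-frame AAA codons followed by G near the end of an ORF.

    `orf` must be an in-frame nucleotide sequence ending with its stop codon.
    Returns (codon_index, codons_upstream_of_stop) pairs for each hit within
    the last `window_codons` sense codons.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    stop_index = len(orf) // 3 - 1
    hits = []
    for i in range(max(0, stop_index - window_codons), stop_index):
        codon = orf[3 * i:3 * i + 3]
        next_base = orf[3 * i + 3:3 * i + 4]
        if codon == "AAA" and next_base == "G":
            hits.append((i, stop_index - i))
    return hits


# Hypothetical toy ORF: an AAA codon followed by G, seven codons before the stop.
toy_codons = ["ATG", "GCT", "GCT", "AAA", "GCT", "GGT",
              "GCT", "GCT", "GCT", "GCT", "TAA"]
toy_orf = "".join(toy_codons)
print(plus_one_slip_sites(toy_orf))
```

A sequence motif alone is weak evidence, as noted in the text; a scan like this is only useful as a first filter, with conservation across related genomes and the absence of a Shine-Dalgarno sequence on the downstream ORF supplying the corroboration.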
SPO1 gene 12.1 protein and its LP65 homologue have a domain (from about amino acid 400 to 600) that has similarity to domains that are thought to be involved with cell wall binding; searches with the parallel domains of Twort or K yield proteins that contain lysozyme/peptidase/amidase domains; such proteins have been reported to be part of the tail’s injection machinery in several phages67. Genes 14.1 through 16.1 encode proteins that have weak but convincing similarity to phage baseplate and tail fiber proteins, including gpW and gpJ of phage P2 and gp8 of T4. The product of gene 16.2 has similarity to proteins of unknown function in the tail regions of some other phages, and we gave evidence above that it is likely a virion component.
It is not clear which SPO1 genes encode the brushy tail fibers or spikes, but there are several possible candidates. Gene 17.2 and 18.1 proteins have short patches of similarity to other phage tail fiber proteins, in particular to phage φ29 tail appendage (gp12) protein and the tail fibers of Mu-like phages, and the products of genes 18.3 and 19.1 have longer regions of similarity to φ29 tail appendage protein. Some phage tailspikes have the ability to cleave bacterial surface polysaccharides, and these proteins have an unusual β-helix protein fold68. We find that both gp18.3 and gp19.1 have very high β-helix predictions over most of their lengths (estimated at the http://betawrap.lcs.mit.edu/ website69). Gp19.1 has strong similarity to phage SPP1 gp33 and to a phage SPβ protein. In addition, the products of genes 2.43, 2.8, 12.1, and 34.16 show sequence similarity to proteins known to bind to cell walls and/or to cleave macromolecules that are components of cell walls (Table S1; gene 12.1 appears to have a particularly variable domain structure among the SPO1-like phages, see Figure S1), but it remains unknown at this time whether these are virion components involved in cell attachment and/or DNA entry, important for cell lysis, or have other functions.
The 145,747 bp, non-permuted DNA molecule of the SPO1 virion has a very long 13,185 bp direct terminal redundancy (TR). Since the concatemers that result from SPO1 DNA replication have only one copy of the TR between flanking copies of the non-TR part of the genome sequence, the SPO1 terminal redundancy must be generated in concert with DNA packaging, but the mechanism by which it is generated remains unclear12; 13; 70; 71.
Phage terminases are the enzymes that recognize and cleave the DNA concatemer to virion chromosome length and are a critical part of the motor that inserts the chromosome into the preformed procapsid during packaging72. DNAs within the particles of tailed phages have a variety of different end structures (reviewed in ref. 73), and the large subunits of the terminases that create similar DNA end structures typically form related sequence groups that correlate with the types of DNA ends they create74. The putative large terminase subunits of SPO1 (encoded by gene 2.11), K, LP65, G1, Twort, A511 and P100, when compared to other phage terminases in a neighbor-joining tree, form a single well-separated branch which has very high bootstrap support (the large terminase subunits have 30–40% identical amino acid sequence within this group; Figure S2). The seven phages listed above have other similarities (below) including genome sizes, and this terminase similarity suggests that all these phages’ virion DNAs may have similar long, terminally redundant end sequences. This has in fact recently been demonstrated experimentally for phages A511, P100 and K75. We note that phage LE1 has a terminase gene that does not fall within the SPO1 group (see below), even though some of its other head and tail assembly genes are similar. The putative large terminase subunits of E. coli phage T5 (encoded by its gene 144)76 and Bacillus thuringiensis phage 0305φ8-36 (encoded by its gene 117)77, which have 10,139 bp and 6,479 bp long terminal redundancies, respectively, are not close relatives of the terminases of the SPO1 group. Thus, there appear to be multiple types of terminases that generate long terminal repeats, just as there are multiple types that create some other end structures74.
Most of the genes with identifiable roles in structure and assembly of the virion map in a contiguous block of 32 genes bounded on the left by gene 3.1 (portal) and on the right by gene 19.1 (possible tail fiber). As with many other phages49; 78, these genes are not only clustered separately from genes with non-assembly functions, but they are grouped within that cluster into head genes (on the left) and tail and tail fiber genes (on the right). Within those groupings we are able to recognize several features of gene order that are widely conserved among tailed phages, most notably the portal–protease–scaffolding–coat genes in the head gene region and the sheath–tube–chaperones–tapemeasure genes in the tail gene region. This section of the genome is known to be transcribed late in the infection cycle, as expected by analogy with other tailed phages14;79. However, this region does not encompass all of the virion protein genes. A clear example of an assembly gene outside the main cluster is gene 2.11, encoding the large terminase subunit, which appears to have been separated from the main cluster by a ~15 kb insertion (discussed below). Mutants in genetically defined genes 1 through 20 are deficient in virion assembly2; 80, and this implies an assembly gene, corresponding to original gene 1, to the left of the large terminase subunit gene, and another corresponding to original gene 20, beyond the right end of the main assembly/structure cluster. Further, we found that gene 29.2 encodes a 19.8 kDa protein that is a fairly abundant component of the virion (above). The role of gp29.2 is not known, but in T4, the only large phage where such proteins have been characterized in detail, decoration (external surface accessory) protein and internal protein genes are not clustered with the rest of the morphogenetic genes. SPO1 does appear to have a decoration-like protein on its surface that is present in fairly large numbers46, and it could be gp29.2. 
We note that it is common for the presence of decoration protein genes to be variable in closely related phages81, so it is not surprising that the other SPO1-like phages do not have a recognizable gene 29.2 homologue. Related phage LP65 also encodes a major virion protein outside the head and tail gene clusters, which does not have a homologue in SPO142. Finally, a mutation in original gene 35 causes a defect in virion assembly2; it may correspond to gene 35.1, 36.1, 36.2 or 36.3. These genes are located in a putative late transcription unit, suggesting that one or more of them may also specify a virion structural protein.
Genes 19.2 and 19.3 specify proteins with sequence similarity to holins and lysins, respectively, of other phages that infect Gram-positive bacteria. The putative lysin, gp19.3, also makes sequence matches to a number of cell wall hydrolyzing enzymes of Gram-positive bacteria. The patterns of sequence matches with all of these enzymes argue that the N-terminal portion of gp19.3, roughly the first 160 amino acids, has the cell wall hydrolytic activity and the remainder, amino acids ~160 to 336, forms a peptidoglycan binding domain. A mutant in original gene 19 is deficient in both tail formation and production of lytic enzyme2; 30. Possible locations of gene 19 mutations are in 19.1–19.5, but 19.5 is unlikely since it specifies a putative helicase. If gp19.3 were also a tail component, that might account for the deficiency of the gene 19 mutant in both lytic enzyme and tail formation. Possible alternatives involve polar effects in the 19.1–19.4 region.
The SPO1 genome is replicated from at least two origins, one in the 105–108 kbp region, the other to the left of about 45 kbp82. The specific origin sequences have not been identified, and there is no sequence in the SPO1 genome that is very similar to the B. subtilis replication origin or termination sites83; 84.
Conditional lethal mutations in nine original genes, 21, 22, 23, 27, 28, 29, 30, 31, and 32, completely prevent SPO1 DNA synthesis and plaque-formation under restrictive conditions30. The gene 28 mutation has its effect by preventing expression of middle genes, which include genes specifying the replication enzymes. The gene 27 mutation also prevents expression of certain late genes, and it is not established whether either of its effects is indirect.
The products of SPO1 original gene 21 (located in any of genes 19.5 – 21.95) and original gene 32 (located in 32.1, 32.2 or 32.4) are necessary for initiation of replication (and gene 32 also for resolution of concatemers)85. Original gene 21 most likely corresponds to gene 21.1, since gp21.1 has moderate similarity to phage T4 helicase gp41, which is an essential component of the T4 replication machinery. There is no functional information about any of the three genes in which the original gene 32 mutation might be located.
The products of genes 22, 30, and 31 are required for the elongation stage of SPO1 DNA synthesis85. A gene 22 mutation also prevented expression of a subset of late genes86, possibly as an indirect effect of the replication block. Possible locations of the original gene 22 mutation are in genes 21.9, 21.95 or 22.1. Gene 21.95 is most likely, since 22.1 has been tentatively assigned to original gene 23 (below), and gp21.9 shows sequence similarity to a protein needed for recombination and not for replication; gp21.95 shows sequence similarity to primases known to participate in replication. Nothing further is known about the activity of gp30, which has no recognizable sequence similarity to known proteins. Finally, gene 31 encodes a protein shown to have DNA polymerase activity in vitro87, with similarity to the DNA polymerase and 3′-5′ exonuclease domains of E. coli DNA polymerase I, plus a domain similar to a bacterial uracil-DNA glycosylase. We suppose that the latter domain may be involved in discriminating between the hmUra-containing SPO1 DNA and thymine-containing host DNA. Also, gp32.85 is homologous to the 5′-3′ exonuclease domain of E. coli DNA polymerase I.
Enzymes involved in synthesis of hydroxymethyluracil nucleotides include dCMP deaminase, dUMP hydroxymethylase and HMdUMP kinase. Gp21.8 is about 30% identical to Gram-positive bacterial dCMP deaminases, and no other SPO1 gene is related to dCMP deaminases; but we note that gp21.8 is about 25% smaller than those bacterial enzymes. The SPO1 dCMP deaminase is known to be expressed at middle times88, whereas only an early and a late promoter (see below) are in a position to transcribe gene 21.8. Gene 29 encodes dUMP hydroxymethylase (deoxyuridylate hydroxymethyl-transferase), as shown experimentally by Okubo et al.30 and Wilhelm and Ruger28. Gene 23 likely encodes HMdUMP kinase, since the original gene 23 mutant failed to complement an SP82 mutant that is deficient in that enzyme89 (Stewart, C. R. and Franck, M., unpublished results). Possible locations for this mutation are gene 22.1 or 23.1. Gp23.1 shows no recognizable similarity to any known protein; however, gp22.1 has moderate (up to 30%) amino acid sequence identity to bacterial dUTPases (deoxyUTP pyrophosphatases) and about 30% identity to T5 dUTPase76; 90. It seems possible that gp22.1 is actually an HMdUMP kinase, and its similarity to dUTPases arises because the reactions that the latter catalyze are similar to the reverse of the HMdUMP kinase reaction. Thus we believe that gene 22.1 most likely encodes the HMdUMP kinase.
Proteins involved in diminution of the thymidine nucleotide pool, which have been observed in infection by SPO1 or other closely-related hmUra-containing phages, include dTTPase/dUTPase, dTMPase, and an inhibitor of thymidylate synthetase2; 91.
Since intensive searches have failed to find additional genes with conditional lethal, replication-deficient mutations, it is likely that at least most other SPO1 gene products are not essential for DNA replication. Several functions that might be expected to be required for DNA replication have not been identified in our analysis of the SPO1 genome. None of the proteins specified by SPO1 shows significant similarity to any single-strand DNA binding protein, sliding clamp, or sliding clamp loader, all of which are specified by the genome of bacteriophage T4, and participate directly in T4 DNA replication (Ref. 92 and references therein), nor does any SPO1 gene specify a DNA topoisomerase. These functions are likely either performed by novel phage proteins or by host proteins.
The products of genes 21.3 and 21.9 are similar to phage T4 gp47 and gp46, respectively, which are subunits of its recombination nuclease, and to various other nucleases involved in DNA repair and/or recombination. Mutations inactivating the corresponding T4 genes are recombination-deficient92, but no similar studies have been done with SPO1.
Other genes whose putative products might plausibly be involved in replication or other aspects of DNA metabolism, but which have not been shown to be essential for DNA replication, include genes 2.9 and 4 (ribonucleoside diphosphate reductase alpha and beta subunits, respectively), 19.5 (helicase), 32.85 (exonuclease), 32.95 (DNA ligase), and 34.33 (UvrD/Rep family helicase). The two ribonucleoside diphosphate reductase genes have sequence affinities that suggest an evolutionary history different from those of the homologous genes of other phages in the SPO1-like group. Curiously, a nonsense mutation unequivocally located in gene 4, based on marker rescue activity of restriction fragments93, was deficient in head formation2. Since that is an unlikely effect of ribonucleoside diphosphate reductase deficiency, especially since the host genome should also provide this activity, this surprising phenotype may be due to polar effects on downstream gene(s) 5.1 and/or 5.2.
SPO1 infection directs a major remodeling of the host cell, subverting its biosynthetic machinery to the purposes of the phage. Many of the 24 genes of the “host-takeover module,” which occupies most of the terminal redundancy, have been identified as playing major roles in redirecting metabolism to phage production13; 27; 32; 94. Specific gene products have been shown to: inhibit host cell division; shut off host DNA and RNA synthesis; regulate the timing of those shutoffs; and regulate the expression of the genes causing host-takeover. The genes of the host-takeover module (almost all of which are early genes) are unusual in that they are smaller than average and have promoters and ribosome-binding sites characteristic of very highly expressed genes. In SUPPLEMENTARY MATERIAL, we discuss these unusual features in comparison with other groups of genes.
Sequences characteristic of SPO1 early, middle, and late promoters were identified as described in Materials and Methods. Since some genes had either no recognizable promoter, or only a promoter that was highly divergent from the consensus, and since SPO1 gp2.21 showed sequence similarity to B. subtilis sigma K, we searched for sigma K promoter-like sequences upstream of each such gene and found several. The positions of all of these 107 putative promoters are shown in Figure 1 and Tables S4 and S6, and their sequences are listed in Table S4. Among these are 16 early, 21 middle, and 5 late promoters whose activity had been demonstrated experimentally in vivo and/or in vitro 9; 23; 26; 29; 32; 95; 96.
Figure 1 and Tables S5 and S6 indicate the positions of 75 putative rho-independent transcription terminators that have been identified in the SPO1 genome, including 15 that had been identified previously 13; 97, and their sequences are shown in Table S5. Most of these terminators are positioned to cause termination shortly after the end of a gene, and, in most cases, there is at least one promoter overlapping, or just downstream of, the terminator, placed so it would begin transcription before the next gene. As noted in SUPPLEMENTARY MATERIAL, several of these putative terminators have unusual features, such as exceptionally long loops in their predicted stem-loop structure.
Figure 1 and Table S6 show the assignment of the above promoters and terminators to 58 transcription units, using the criteria discussed in SUPPLEMENTARY MATERIAL. Since many have multiple promoters, these “units” may include several lengths of mRNA. Only four genes, 1.4, 2.43, 5.9, 28.2, are not included in the transcription units identified in this way. Transcription of these genes must depend on either leakage through the preceding terminator or initiation at unrecognized promoters.
The SPO1 genome sequence suggests that three new transcriptional regulatory factors, beyond those previously studied, may be encoded by the SPO1 genome. This is not surprising, since the previously known transcription factors were insufficient to account for all of the complexities of SPO1’s transcription program.
SPO1 has been unusual among characterized bacteriophages in that it is known to encode two sigma factors: the product of gene 28, which directs the transcription of middle genes, and the products of genes 33 and 34, which together direct the transcription of late genes8. The nucleotide sequence reveals a third apparent sigma factor, gp2.21, which shows homology to sigma factor K of B. subtilis and other bacteria. The genome also contains sequences identical to those of sigma K promoters, located at positions such that the initiation of transcription at those positions, possibly by the action of gp2.21, could account for several experimentally observed transcripts that cannot presently be explained by the known sigma factors and their cognate promoters. These considerations are discussed in detail in SUPPLEMENTARY MATERIAL.
SPO1 has also been unusual among characterized bacteriophages in that it is known to encode a chromatin-like DNA-binding protein. This protein, called TF1, binds preferentially to double-stranded hmUra-DNA, such as SPO1’s, causes DNA bending15, and is required for shutoff of expression of certain middle genes, and for activation of expression of certain late genes33. The genome sequence identifies a second chromatin-like DNA-binding protein, which, to the best of our knowledge, is unprecedented among bacteriophages. SPO1 gp34.25 shows very high identity to α/β-type small acid-soluble spore proteins, including the B. subtilis sspB gene product. These proteins bind to spore DNA, modifying its conformation, and their effects are modulated by HBsu, the major host chromatin protein, which is homologous to TF1 (Ross and Setlow, 2000). Thus, it is plausible that gp34.25 might bind to SPO1 DNA, modifying its structure and, thereby, its gene expression. In SUPPLEMENTARY MATERIAL, we discuss how gp34.25 might account for a previously unexplained gene regulatory event.
The products of SPO1 genes 44, 50 and 51 are required for the normal transition from early to middle gene expression, and for other gene regulatory events32. A double mutation inactivating genes 50 and 51 has the same effect on gene expression as a single mutation inactivating gene 44, and gp44 and gp51 are 46% identical over a 117 amino acid segment that constitutes most of gp51. The genome sequence reveals a third related gene. Gp25.1 has 34–35% identity to gp44 and gp51 over most of the length of each of those proteins, suggesting that it might also play a role in regulating transcription. In SUPPLEMENTARY MATERIAL, we discuss how gp25.1 might account for the currently unexplained delay in the onset of delayed-early gene expression.
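Percent identity figures such as those quoted above depend on how aligned positions are counted. The following minimal Python sketch shows one common convention (identical residues divided by ungapped aligned columns). The function name and the two aligned fragments are made up for illustration and are not SPO1 sequences, and other denominators (for example, the length of the shorter protein) give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical aligned fragments ('-' marks an alignment gap).
a = "MKTAY-IAKQR"
b = "MKSAYQIGK-R"
print(round(percent_identity(a, b), 1))  # 77.8

# For scale: 46% identity over a 117-residue segment is about 54 identical residues.
print(round(0.46 * 117))  # 54
```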
The relationships among the large, lytic tailed phages are complex, and many aspects of their life cycles are not well understood. Among such phages, the genomes of a number of phages similar to the well known E. coli phage T4 (166 kbp) have been sequenced, and these phages form the best characterized group of related, large phages, termed the “T4-like superfamily”98–105. To date, sequenced phages in this group infect hosts in the Proteobacteria and Cyanobacteria. About 45 “core genes” are found in all members of the superfamily, each of which has about 300 genes103. (The exact number of core genes, so defined, can vary somewhat depending on the phylogenetic breadth of the group of phages being considered). The core genes encode mostly proteins that are required for virion assembly and DNA replication. Genome comparisons and phylogenetic analyses have shown that the core genes experience little or no horizontal exchange among phages in this group; it is likely that exchange of genes encoding parts of complex protein machines is limited by the intimate interactions among the parts98; 106. In contrast, the remainders of their genomes are occupied by genes that appear to exchange in and out of the genomes frequently, as shown by the fact that in most cases they occur in only one member or a subset of the most closely related members of the group. These exchangeable genes most likely confer a selective benefit on the specific phages in which they occur, as for example the photocenter genes of some of the cyanophages57; 107; 108, but in most cases the functions of these accessory genes are not known, and the possibility that some of them confer no selective advantage cannot be ruled out.
There are no doubt many other such superfamilies of large tailed phages which remain to be elucidated with further analysis of phage diversity. For example, such a group may be represented by the two Pseudomonas aeruginosa phages, KZ109 and EL110, whose genomes are over 210 kbp in length and whose related coat protein sequences (for example) are not recognizably similar to those of T4 or SPO1. The coat proteins of Yersinia enterocolitica phage R1–37111 (genome 270 kbp), B. thuringiensis phage 03058–3677 (218 kbp) and Thermus thermophilus phage YS40112 (152 kbp) are each not obviously similar to any other phage coat protein and so could be prototypical of other such groups. We argue below that SPO1 and its relatives constitute a newly identified superfamily analogous to the T4 superfamily.
The current database contains complete genome sequences for five phages that have substantial similarity to SPO1 – Listeria phage P10041, Staphylococcus phages Twort34, K35 and G134, and Lactobacillus phage LP6542. As this report was being written, the genome sequences of two SPO1-like phages were reported, the 142 kbp genome sequence of Enterococcus phage EF24C113 (Accession No. NC_009904) and the 134 kbp genome of Listeria phage A51175 (Accession No. DQ003638). These genomes were not included in our comparative analysis with the exception of information previously available for A51147. Other phages that likely fit into this group, as judged by genome sequence similarities, are Staphylococcus phages 812, SK311, U16 and 131114. Comparisons among these genomes place SPO1 in a phylogenetic context and give examples of evolutionary events in the past history of these phages. Such comparisons also reveal which genes are most strongly conserved among members of the group of phages and therefore are most likely to have functions critical to their lifestyle.
To facilitate these comparisons we used the computer program Phamerator, which will be described in detail elsewhere (S. Cresawn, in preparation); in brief, it carries out all pairwise comparisons among the amino acid sequences of the annotated protein-coding genes of (in this case) the six completely sequenced SPO1-like phages and classifies them into related groups called “Phamilies” that share significant sequence similarity as determined by the BLASTp and CLUSTAL W algorithms. This analysis calculates that these six phage genomes encode 611 different protein types (Phamilies) among the 1,114 genes; the members of each Phamily are listed in Table S7. Of these Phamilies, 444 have only one member (i.e., the gene is unique among the six phages), 167 have more than one member, and 12 have seven or more genes. In the last category, this analysis identifies a few Phamilies of proteins that contain members which appear on other grounds not to be orthologous — for example, they differ substantially in size or domain structure, or they have numerous members at different locations. The major examples of such complex Phamilies are discussed in more detail in SUPPLEMENTARY MATERIAL (see also Figure S1). Manual inspection of possible orthologies within these complex Phamilies was performed in all cases, and we believe this does not confound any conclusions drawn below.
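The grouping step can be pictured as clustering genes that are connected by pairwise similarity hits. The sketch below uses single-linkage clustering with a union-find structure. This is an assumed stand-in for illustration, not Phamerator's actual implementation, and the gene labels and similarity pairs are hypothetical.

```python
from collections import defaultdict


def build_families(genes, similar_pairs):
    """Group genes into families by single linkage over pairwise similarity hits."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the representative, halving the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(list)
    for g in genes:
        families[find(g)].append(g)
    return sorted(sorted(members) for members in families.values())


# Hypothetical gene labels and similarity hits (not taken from Table S7).
genes = ["SPO1_6.1", "K_mcp", "LP65_mcp", "SPO1_2.21", "G1_x", "Twort_x"]
pairs = [("SPO1_6.1", "K_mcp"), ("K_mcp", "LP65_mcp"), ("G1_x", "Twort_x")]

families = build_families(genes, pairs)
print(families)
# [['G1_x', 'Twort_x'], ['K_mcp', 'LP65_mcp', 'SPO1_6.1'], ['SPO1_2.21']]

# Consistency check on the reported counts: singletons plus multi-member families.
print(444 + 167)  # 611
```

With single linkage, two genes land in the same family whenever a chain of hits connects them, even if they do not match each other directly. Chaining of this kind is one plausible way for a family to collect members that are not orthologous, which is why manual inspection of the large families is useful.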
The physical maps of the six putative SPO1-like phages are aligned on the major capsid genes (gene 6.1 for SPO1) in Figure S3. It is clear from this alignment that their gene orders have very significant similarities. For SPO1, 26% of its protein-coding genes (53 out of 204) have homologues in one or more of the other phages in the group. Pairwise comparisons show that the different phages have between 27 and 141 gene Phamilies in common (Table 1). All six phages share 26 genes (manually parsed from 21 Phamilies, see above and SUPPLEMENTARY MATERIAL) in common, and nearly all of these homologous genes have similar organization within the SPO1-like phage group. The genes that are shared by all or most of the phages are concentrated over somewhat more than half the length of the genome and consist primarily of morphogenetic (head, tail, tail fiber) genes and DNA metabolism genes. Additional genes are present in each genome that match a smaller fraction of the six genomes, and each genome contains genes that match none of the other five phages (indicated in Figure S3 by white gene boxes). Table 1 also shows that the six SPO1-like phages are not equidistantly related to one another; G1, Twort and K have more genes in common with each other than they do with the others. This closer relationship is perhaps not surprising, since these three phages infect the same Staphylococcus host.
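Counts of this kind follow directly from set operations on the family assignments. A minimal Python sketch with hypothetical assignments (the family names and memberships below are invented and do not reproduce Table 1):

```python
from itertools import combinations

# Hypothetical family content of four phages (illustration only).
phage_families = {
    "SPO1": {"pham1", "pham2", "pham3", "pham7"},
    "K": {"pham1", "pham2", "pham4", "pham5"},
    "G1": {"pham1", "pham2", "pham4", "pham5", "pham6"},
    "LP65": {"pham1", "pham3", "pham8"},
}

# Number of families shared by each pair of phages.
shared = {
    (a, b): len(phage_families[a] & phage_families[b])
    for a, b in combinations(sorted(phage_families), 2)
}
for pair, count in shared.items():
    print(pair, count)

# Families present in every phage (the "core" in this toy data set).
core = set.intersection(*phage_families.values())
print(core)  # {'pham1'}

# The reported SPO1 fraction: 53 of 204 protein-coding genes have homologues.
print(round(100 * 53 / 204))  # 26
```

In the toy data, K and G1 share more families with each other than with SPO1 or LP65, which is the kind of non-equidistant pattern that the pairwise table makes visible.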
The SPO1-like phages clearly do not fall into the only previously defined superfamily of large lytic phages, the T4 superfamily. Their gene order and gene relatedness characteristics are well outside the boundaries of the T4 superfamily of phages. For example, there is little to no statistically significant sequence similarity between the SPO1-like phage coat proteins and the T4-like coat proteins; the most divergent coat proteins within the T4 superfamily are 20–25% identical, as are the most divergent members of the SPO1-like group (SPO1 and LP65 coat proteins are 24% identical). Among the SPO1 genes, only gene 21.1 has its closest relatives in the T4 superfamily (and they are weak matches at <27% identity). Thus, these two groups appear, at our present state of understanding, to form largely self-contained and non-overlapping tailed phage types, which nonetheless each occupy similarly wide spans of diversity. In addition, we have found no close homologies or overall organizational similarities between SPO1-like and any other large phage genomes. Thus, SPO1-like phages clearly form a definable superfamily; they have similar genome sizes and virion morphologies, and they share many more homologous genes (in largely the same organization) with other members of the group than they do with any other phage whose genome sequence is known.
The six SPO1-like phages discussed above occupy a substantial span of diversity in spite of the fact that we group them into a single superfamily. This diversity is manifest at three levels – sequence per se, gene content and gene arrangement. First, individual genes have different nucleotide (and encoded protein amino acid) sequences due to simple sequence divergence. We do not discuss this divergence in detail, but note that (for example) for the two most divergent members, SPO1 and LP65, the coat proteins are only 24% identical. This extent of divergence is typical between SPO1 and LP65 orthologues, suggesting that the SPO1 phage superfamily has existed and been undergoing divergence for a long time. Second, as discussed above, there are very significant differences in gene content among the superfamily members; indeed, a majority of genes are different between the more distantly related members.
On average, the SPO1 genes that have matches in one or more of the other five superfamily members are larger than those with no matches. Figure 7 shows the size distributions of the SPO1 genes with and without such matches. The genes without matches are strongly skewed toward smaller gene sizes. Of the 141 SPO1 genes with 200 or fewer codons, 23 (16%) match at least one gene in the other five phages, while 37 of the 61 genes larger than 200 codons (61%) make such a match. The genes without matches in the other phages are, we suggest, genes that exchange in and out of the phage genomes on an evolutionarily rapid time scale, since a gene uniquely present in SPO1, for example, was either absent from the last common ancestor of the group of phages and acquired more recently by SPO1 or present in the common ancestor and lost more recently in the other phages. Thus we conclude that the genes that exchange on a rapid evolutionary time scale are smaller on average than those with more permanent residence. (A caveat for this sort of analysis is that the method for detecting matches must not itself have a length bias, or it can artifactually produce a result such as that shown in Figure 7. The BLAST algorithm does have such a length bias, but Phamerator also uses CLUSTAL W in such a way, scoring percent amino acid identity in pairwise alignments, that we believe it should identify most pairs without substantial length bias.) As noted in SUPPLEMENTARY MATERIAL, there is also a rough correlation between small genes and genes that are transcribed early.
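The strength of the association between gene size and having a match can be gauged from the four counts given above. The following sketch recomputes the percentages and adds a one-sided hypergeometric (Fisher-type) test. The test is an illustration added here, not an analysis reported for these data, and it uses the counts exactly as stated (which total 202 genes).

```python
from math import comb

# Counts as stated in the text.
small_total, small_match = 141, 23   # genes with 200 or fewer codons
large_total, large_match = 61, 37    # genes with more than 200 codons

pct_small = 100 * small_match / small_total
pct_large = 100 * large_match / large_total
print(round(pct_small), round(pct_large))  # 16 61

# Hypergeometric model: N genes in total, K of them with a match,
# n of them small. Under independence, the number of small genes with
# a match follows a hypergeometric distribution.
N = small_total + large_total
K = small_match + large_match
n = small_total


def hypergeom_pmf(k: int) -> float:
    """Probability that exactly k of the n small genes have a match."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# One-sided p-value: probability of 23 or fewer matches among the small genes.
k_min = max(0, n - (N - K))
p_value = sum(hypergeom_pmf(k) for k in range(k_min, small_match + 1))
print(p_value)

# Expected number of small genes with a match if size did not matter.
print(round(n * K / N, 1))  # 41.9
```

The observed 23 matches among small genes is well below the roughly 42 expected under independence. As the caveat above notes, a test like this cannot distinguish a biological effect from a length bias in the match-detection method.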
Third, a few genome rearrangements can be recognized when the genomes are compared. The great majority of genes that match between any two genomes in the SPO1 superfamily have similar locations, but there are examples of genes that have moved in one genome relative to another. An example of such a difference is seen in a cluster of genes corresponding to SPO1 genes 12.1, 13.1, 13.2 and 14.1. They are present in the same positions in SPO1, K, G1, Twort and P100, but in LP65 they are in a position about 36 kbp leftward from their homologues in the other phages and have suffered an inversion as well as a permutation of order.
There are also several examples among these phages of apparent, large, relatively recent (non-intron) insertion/deletion in one phage relative to another. These features are especially well-defined in a comparison between phages K and G1, since they are more than 99% identical in nucleotide sequence over 93% of the phage K genome. Against this background, 233 bp from the middle of phage K gene 108 are replaced in phage G1 by 7,381 bp; these 7,381 bp contain 22 predicted small genes, only four of which match genes in the other phages. These 22 genes fit the pattern described above for genes that exchange in and out of the genomes rapidly, but it is not clear whether they represent a recent addition to the G1 genome or a recent subtraction from the K genome.
The most clearly defined apparent insertion in the SPO1 genome is between the terminase and portal genes (genes 2.11 and 3.1); we suggest that this feature illustrates how SPO1 has acquired novel functions in its recent evolutionary past. In phage LP65 the homologous terminase and portal genes are adjacent, as in many phages, but in SPO1 these genes are separated by 14,929 bp of sequence which contains fifteen predicted protein-coding genes and the five SPO1 tRNA genes. (Similarly, phages Twort, K, G1 and P100 have related 3–5 kbp sequences in this interval that are different from the SPO1 sequence.) The SPO1 protein-coding genes in this interval include four that have inferred functions based on sequence matches to the database but that do not make sequence matches to any of the other phages in this comparison group. These genes include the putative σK-like transcription factor gp2.21, the ParB-like (gp2.12) and PhoH-like (gp2.62) proteins, and a protein (gp2.52) with sequence similarities to the “Hep-Hag family” of bacterial outer membrane carbohydrate-binding proteins. This stretch of putatively inserted DNA in the SPO1 genome might therefore plausibly be inferred to be responsible for some phenotypic properties of SPO1 that differentiate it from other members of the superfamily. A striking feature of the eight protein-coding genes in this region that make significant matches to sequences in the databases is that seven of them make their strongest (and numerous) matches to bacterial sequences; the same is true of three of the five tRNA sequences. Matches to phage sequences, when present, are less strong than the matches to bacterial genes, with the one exception of gp2.43, which also has very strong bacterial matches. These observations argue that much of this ~15 kbp segment of SPO1 DNA was most likely acquired in SPO1’s ancestry primarily from bacterial rather than phage sources.
However, the fact that there are eleven phage-specific promoters annotated in this region argues that this sequence has adapted to become a functional part of the phage genome in the time since its entry into the phage universe.
The overall pattern of relationships among the genomes of SPO1 and its relatives discussed here is reminiscent of what has been described for the T4 superfamily of phages (above). Our analysis of the SPO1 superfamily, as well as preliminary phylogenetic analysis, shows that the presence of a core set of virion assembly genes and a few DNA metabolism genes appears to be constant, and they have remained in approximately constant, but not always contiguous, locations (Figure 8), while the presence of other (often smaller) genes varies widely among the group members. It is of interest to note that, as with the T4 group, there is no evidence of exchange of core genes among these SPO1-like phages, since the trees of the core genes we have examined are strikingly similar (Figure 9). Thus, the two characterized superfamilies of large, virulent tailed phages for which there are enough genomes to compare profitably, typified by T4 and SPO1, are only extremely distantly related and yet have similar relationships between gene sequences and gene organization within each group.
There is growing evidence that some virion structural proteins have common ancestries extending across all tailed phages43. Current sampling of phage genome sequences is extremely sparse, particularly for large phage genomes, and it covers a very limited range of host species, but the evidence cited above for the T4 and SPO1 groups strongly suggests that they are now essentially genetically isolated from each other. However, there are already indications that these groups may be neither as monolithic nor as isolated as they seem from our current vantage point. In the large cyanophages, which have been included in the T4-type superfamily as the “Exo T-evens” sub-type, less than the full set of the T4 core genes can be recognized. Whether the unseen core gene functions are missing from these phages, are present but were acquired horizontally from a very distant gene pool, or are simply diverged past recognition, remains to be determined. For the SPO1 group there is the curious case of Leptospira phage LE1 mentioned above, in which the major head protein and seven other putative virion structural proteins are clear homologues of proteins encoded by the SPO1 group (Figure 3), in the context of a phage with a genome half the size of SPO1’s that infects a spirochete and has no other clear homologies to the SPO1 group. As we learn more phage genome sequences, including those of phages infecting more diverse hosts, it will be interesting to see how many more such superfamilies emerge and especially, as sampling of the population becomes denser, whether the boundaries separating them become sharper or more blurred.
Equally interesting, we suggest, will be the non-core genes. These genes are clearly exchanging horizontally at a relatively rapid rate over evolutionary time, and their diversity is such that there is not yet any reliable estimate of how many different kinds of these genes are potentially available to any one phage. We imagine that the flux of these genes through the large phage genomes on an evolutionary time scale allows the phages to adapt to new ecological situations. As we have proposed for phages with smaller genomes115; 116, such as the lambdoid phages, we suggest that the differential rate of flux of the core and non-core genes through the large phage genomes is not the result of different rates of recombination in the two groups of genes. Rather, we believe that there is rampant and indiscriminate recombination across the entire genomes of these phages (on an evolutionary time scale), but that the resulting recombinant phages survive only when the associations of the core genes mandated by natural selection remain intact.
Phage SPO1 from C. Stewart’s strain collection was plaque-purified and propagated in B. subtilis CB1085. Approximately 10 μg of purified phage SPO1 DNA were sheared hydrodynamically using a Hydroshear instrument (Gene Machines) and repaired using E. coli Klenow DNA Polymerase and T4 phage DNA Polymerase according to the manufacturers’ recommendations. These 1–3 kbp fragments were then ligated into plasmid pBluescript that had been cleaved with EcoRV, and the ligation products were transformed into E. coli strain XL1BlueF’. Individual plasmid clone inserts were sequenced using outside primers with ABI377 and ABI3100 instruments, and the resulting sequence was assembled with CONSED117. When approximately 8-fold redundancy in bulk sequencing was attained, oligonucleotide primers were used with genomic template to resolve sequence ambiguities and join contigs to generate a single accurate genome sequence. The boundaries of the terminal redundancy were determined by sequencing toward the ends of the virion DNA, using appropriate primers. Such sequencing reactions are expected to give a mixture of two sequences, one from one copy of the terminal redundancy that terminates at the physical end of the DNA, and another from the other terminal redundancy that continues into the abutting non-redundant sequence. We took advantage of the fact that Taq polymerase makes a large untemplated A peak at the position following the last base of the sequence and searched for positions near the predicted ends where the signal strength dropped by about half following a large untemplated A peak.
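The end-mapping criterion described above (a large A peak followed by a roughly two-fold drop in signal) can be illustrated with a short sketch. This is not the procedure actually used, which the text describes only in outline. The function name, the window size and both thresholds are hypothetical, and the sketch assumes that base calls and per-base peak heights have already been extracted from the sequencing trace.

```python
from statistics import mean

def find_end_candidates(bases, heights, window=20, a_factor=2.0,
                        drop_range=(0.35, 0.65)):
    """Return 0-based trace positions that look like a physical DNA end.

    A position qualifies when (1) the called base is A, (2) its peak is at
    least `a_factor` times the mean peak height of the preceding `window`
    bases, and (3) the mean peak height of the following `window` bases is
    roughly half of the preceding mean (ratio inside `drop_range`).
    All three thresholds are illustrative assumptions.
    """
    hits = []
    for i in range(window, len(bases) - window):
        before = mean(heights[i - window:i])
        after = mean(heights[i + 1:i + 1 + window])
        if before <= 0:
            continue  # no usable signal upstream of this position
        ratio = after / before
        if (bases[i] == "A"
                and heights[i] >= a_factor * before
                and drop_range[0] <= ratio <= drop_range[1]):
            hits.append(i)
    return hits

# Synthetic trace: full signal, one large A peak, then half-strength signal.
demo_bases = "C" * 30 + "A" + "G" * 30
demo_heights = [100] * 30 + [300] + [50] * 30
candidates = find_end_candidates(demo_bases, demo_heights)
```

On real traces the candidates would still need to be inspected by eye, since peak heights vary with sequence context.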
The SPO1 genome sequence was scanned for open reading frames, using computer programs Glimmer118; 119, GeneMark120, and DNA Strider121, and for tRNA genes using tRNAscan-SE with a low threshold score setting (http://lowelab.ucsc.edu/tRNAscan-SE/), as well as ARAGORN (http://bioinfo.thep.lu.se). The predicted gene locations are listed in Table S1. Because of the stringent requirement for substantial ribosome-binding sites (RBS) able to pair with the 16S rRNA in B. subtilis122, the appropriate positioning of a sufficiently strong RBS was considered to be a positive identification of a translation start (RBS sequences are listed in Table S6). A few “genes” with questionable RBSs are discussed in the SUPPLEMENTARY MATERIAL.
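The idea of using a well-positioned RBS to confirm a translation start can be sketched as below. The text does not give the scoring criteria that were applied, so everything quantitative here is an assumption: the canonical Shine-Dalgarno core AAGGAGG stands in for the sequence complementary to the 16S rRNA 3′ end, and the minimum match length and the upstream window are hypothetical parameters.

```python
SD_CORE = "AAGGAGG"          # assumed Shine-Dalgarno core, not taken from the text
START_CODONS = {"ATG", "GTG", "TTG"}

def longest_sd_match(window):
    """Length of the longest substring of SD_CORE found in `window`."""
    for k in range(len(SD_CORE), 0, -1):
        for s in range(len(SD_CORE) - k + 1):
            if SD_CORE[s:s + k] in window:
                return k
    return 0

def candidate_starts(seq, min_match=5, min_gap=5, max_gap=12):
    """List (position, codon, match_length) for start codons with an RBS.

    The RBS is searched in the upstream window that ends `min_gap` bases
    before the start codon and begins `max_gap + len(SD_CORE)` bases
    before it. Positions are 0-based on the given strand.
    """
    seq = seq.upper()
    found = []
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon not in START_CODONS:
            continue
        lo = max(0, i - max_gap - len(SD_CORE))
        hi = max(0, i - min_gap)
        k = longest_sd_match(seq[lo:hi])
        if k >= min_match:
            found.append((i, codon, k))
    return found

demo = "CCC" + "AAGGAGG" + "ACATCCC" + "ATG" + "AAACCC"
starts = candidate_starts(demo)
```

A fuller treatment would score pairing free energy against the rRNA rather than exact substring matches, and would scan both strands.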
Nearly half of the previously identified genes (genes “1–3”, “5–26”, and “32”) had been located only by their position on the genetic map as determined by recombination frequency2. In some cases, further positional information was obtained from the size of truncated proteins produced by nonsense mutants123; 124 or from the relative effectiveness of complementation or marker rescue by specific cloned fragments93; 94. Many of these genetically identified genes could not be correlated unequivocally with a particular ORF in the genome sequence. For instance, gene 11 must be one of the two ORFs between genome coordinates 50,414 and 51,365. In such a case, we have called those two ORFs 11.1 and 11.2, with no gene being called simply gene 11. Similarly, we only know that gene 10 lies between genes 9 and 11; thus, it could be ORF 9.1, 10.1, 10.2, or 11.1. ORF 11.1 could therefore be either of original genes 10 or 11. There was a good, but not definitive, argument for assigning gene 2 to a single ORF, so that ORF was called gene 2.
Possible rho-independent transcription terminators were identified by the STEMLOOP program (Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison, WI) as stem-loop structures with stems at least five contiguous base pairs long. These were screened manually for the presence of a downstream run of T’s. Some additional putative terminators were found by visual examination of intergenic regions for strings of T’s preceded by a stem-loop. 75 putative terminators were thus found in the SPO1 genome, and are listed in Tables S5 and S6. Those not previously named13 were numbered T1 to T61 from left to right across the genome. Evaluation of the effectiveness of specific terminators and unusual features of certain terminators are reported in the SUPPLEMENTARY MATERIAL. Since the rho gene of B. subtilis is dispensable (unlike that of E. coli), and since only the rho gene itself has been shown to have a Rho-dependent terminator125, it is likely that few, if any, SPO1 terminators are Rho-dependent, and we did not search for such terminators.
Three types of promoter had been identified previously in the SPO1 genome: early, middle and late (reviewed in ref. 2). Identification of promoters in the newly sequenced portions of the genome was complicated by the fact that many promoters diverge substantially from their consensus sequences, but still have demonstrable promoter activity. SPO1 early promoters (consensus TTGAC(A/T) – 17–19 bp spacer – (T/A/C)ATAAT) are utilized by host RNA polymerase with host sigma factor A. Many of the B. subtilis sigma A promoters whose activity has been demonstrated have −35 and −10 sequences that diverge significantly from this consensus126–128. The same is true for the previously known SPO1 middle promoters (consensus AGGAGA, 17–19 bp spacer, TTTNTTT; refs. 9, 23) and late promoters (consensus CGTTAGA, 17–19 bp spacer, GATATT; ref. 29). For example, the newly sequenced promoter PE14, which has three divergent elements, was shown to have promoter activity95, and a demonstrably active middle promoter, PM8, also has three divergent elements23; 95. (Each base within the −35 or −10 regions that was different from the consensus base for that position was counted as one divergent element. If the spacing between −35 and −10 was <16 or >19, that was counted as one divergent element, and spacings of <14 or >21 were counted as two.) The five previously described late promoters29 are all substantially divergent; one has four divergent elements. (Such divergence among late promoters is not unusual; T4 late promoters do not have a recognizable −35 consensus92.) Because of this extensive divergence, it was not possible to comprehensively identify all promoters of each type simply by computational means (see also SUPPLEMENTARY MATERIAL). Therefore, we focused our analysis on intergenic regions, scanning each such region visually for each type of promoter.
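The counting rule for divergent elements given above can be written out as a short sketch. The consensus sequences and the spacer thresholds are taken from the text as stated; the function names and the use of IUPAC codes (W for A/T, H for T/A/C) are conveniences introduced here, and the sketch assumes that the −35 and −10 elements have already been located and aligned to the consensus.

```python
# IUPAC codes needed for the three consensus sequences.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "H": "ACT", "N": "ACGT"}

# (-35 consensus, -10 consensus) for each promoter class, from the text.
CONSENSUS = {
    "early": ("TTGACW", "HATAAT"),
    "middle": ("AGGAGA", "TTTNTTT"),
    "late": ("CGTTAGA", "GATATT"),
}

def mismatches(site, consensus):
    """Count positions where `site` differs from the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base not in IUPAC[code]
               for base, code in zip(site.upper(), consensus))

def spacer_penalty(spacer):
    """Spacing <14 or >21 counts as two elements; <16 or >19 as one."""
    if spacer < 14 or spacer > 21:
        return 2
    if spacer < 16 or spacer > 19:
        return 1
    return 0

def divergent_elements(kind, minus35, spacer, minus10):
    """Total divergent elements for one candidate promoter."""
    c35, c10 = CONSENSUS[kind]
    return (mismatches(minus35, c35)
            + mismatches(minus10, c10)
            + spacer_penalty(spacer))

perfect = divergent_elements("early", "TTGACA", 17, "TATAAT")    # 0
divergent = divergent_elements("early", "TTGCCA", 20, "TATACT")  # 3
```

The second example has one mismatch in each hexamer plus a 20 bp spacer, giving three divergent elements, the same count reported for PE14 and PM8 (the sequences shown are invented, not those promoters).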
New putative early, middle and late promoters are named PE—, PM—, and PL—, respectively. We have included early and middle promoters having up to three divergent elements, and late promoters with as many as five. Since there is no evidentiary basis for limiting the acceptable divergence, and since highly divergent promoters can be difficult to recognize, it is quite possible that we have not found all functional promoters or that some of the proposed promoters actually have no promoter activity. As discussed in SUPPLEMENTARY MATERIAL, we also identified eight putative sigma K promoters, which we hypothesize are utilized by the newly identified putative sigma factor, gp2.21. A total of 107 putative promoters were thus identified and are listed in Table S4.
RWH, GFH and SRC were supported by NIH grants GM51975, AI28927 and AI074825, respectively. MLP was supported by Grant Number P20-RR-16455-05, -06, -07, and -08 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH). CRS was supported by grants 003604-0020-1999 and 003604-0035-2001 from the Texas Advanced Technology Program.