General Features of the SpaA1 Genome
The genome of phage SpaA1 consists of 42,784 bp flanked by complementary 9-bp single-stranded cohesive (cos) ends (5′-…TGGAGGAGG-3′ and 3′-CCTCCTCCA…-5′). Using GeneMark.hmm, 63 open reading frames (ORFs) were identified as probable protein-coding genes. The predicted proteins encoded by these 63 ORFs were compared with the non-redundant protein sequence database (National Center for Biotechnology Information, NIH, Bethesda) using PSI-BLAST and with the Conserved Domain Database using RPS-BLAST. Analysis of the most similar proteins (best hits) for all predicted gene products of SpaA1 reveals three major regions of apparently different origins, suggesting a modular architecture of the genome (; ).
Architectures of SpaA1, BceA1 and MZTP02 genomes: comparison with BLAST protein matches to phage proteins in four Bacillus genomes.
Open Reading Frames in the genomes of SpaA1 and BceA1.
The nucleotide sequence of the first module (left and coloured red in ) of the SpaA1 genome is almost identical to the sequence of the entire 15,717-bp genome of another bacteriophage, MZTP02 (apart from its 5′- and 3′-terminal regions of 41 bp and ~370 bp, respectively), which was isolated from Bacillus thuringiensis strain MZ1 in China (). Unlike SpaA1 DNA, which contains terminal cos ends, MZTP02 DNA contains 40-bp terminal inverted repeats, and its 5′-terminus is covalently bound to a terminal protein presumably encoded by ORF9 (according to our annotation; ). Interestingly, an almost identical sequence is present as a prophage in the genomes of B. thuringiensis BGSC 4AJ1 (locus IDs: bthur0007_34460 to bthur0007_34660, accession no. NZ_CM000752.1) and B. cereus Rock4-2 (locus IDs: bcere0023_35280 to bcere0023_35430, accession no. NZ_ACMM01000283.1). The 19 potential ORFs located in this region encode predicted structural proteins and proteins involved in assembly of SpaA1 and thus form the “structural” module of the genome. The architecture of this module in SpaA1 shows features that are typical of other bacteriophages of the family Siphoviridae. In particular, there is clear synteny among genes encoding virion subunits and proteins involved in virion assembly. The genes for head and tail assembly are encoded in the same transcriptional orientation, with the head genes located upstream of the tail genes ( and ). The predicted head genes include the small and large terminase subunits (ORF3 and ORF4, respectively), the portal protein (ORF5), the minor capsid subunit (ORF6), the scaffold protein (ORF8), gp-like tail connector (ORF1) and head-tail adapter (ORF11); the tail genes include the major tail subunit (ORF12) and the tape measure protein (ORF17), followed by the tail fiber protein (ORF18) and the minor tail protein (ORF19) (). The length of the tape measure protein gene corresponds to the length of the phage tail, and this gene is thus commonly the largest in the genome. In SpaA1, however, the tape measure protein (979 aa) is only the second largest protein, the largest being the minor tail structural protein (1569 aa). Bacillus phage TP21-L also has a minor structural protein that is larger than the tape measure protein. For most of the known phages, the size of the tape measure protein corresponds to a fairly constant 0.15 nm of tail length per amino acid residue. However, the tail length-to-amino acid ratio for SpaA1 is ~0.20 nm per amino acid residue, suggesting that this protein might be somewhat more extended than those in other known phages.
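The arithmetic behind this comparison can be sketched as follows. The snippet is illustrative only: it uses the tape measure protein length (979 aa) and the two ratios quoted above, and the function name is hypothetical. The tail length implied by the ~0.20 nm ratio is back-calculated from that ratio here; it is not a measured value taken from this section.

```python
TMP_LENGTH_AA = 979        # tape measure protein length in SpaA1 (from the text)
TYPICAL_NM_PER_AA = 0.15   # ratio reported for most known phages (from the text)
SPAA1_NM_PER_AA = 0.20     # approximate ratio reported for SpaA1 (from the text)


def tail_length_nm(protein_length_aa, nm_per_aa):
    """Tail length (nm) implied by a tape measure protein of the given length."""
    return protein_length_aa * nm_per_aa


# Tail length expected if SpaA1 followed the typical ratio.
expected = tail_length_nm(TMP_LENGTH_AA, TYPICAL_NM_PER_AA)
# Tail length implied by the ~0.20 nm/aa ratio (back-calculated, an assumption).
implied = tail_length_nm(TMP_LENGTH_AA, SPAA1_NM_PER_AA)

print(f"typical ratio: {expected:.0f} nm; SpaA1 ratio: {implied:.0f} nm")
```

Under these assumptions, a 979-aa tape measure protein would correspond to a tail of roughly 147 nm at the typical ratio and roughly 196 nm at the SpaA1 ratio, a difference of about one third.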
The gene arrangement in the second SpaA1 genome module (coloured green in ), which consists of genes with functions in DNA integration, replication, transcription, cell entry and exit (ORF20–ORF46) and may be denoted the ‘replication module’, is very similar to the organization of the corresponding regions in several prophages of B. thuringiensis Kurstaki strain (, ). The longest conserved gene array (locus_ID: bthur0006_5910 to bthur0006_6000; accession no. NZ_CM000751.1) contains the first 10 ORFs in this region. The replication module encompasses five predicted transcriptional regulators (ORFs 25, 33–35 and 45) and four putative DNA-binding proteins (ORFs 24, 28, 31 and 46). Other replication-related ORFs in this module encode an FtsK/SpoIIIE-like protein (ORF27), a protein containing HTH and DnaB domains (ORF29), a protein containing a DnaD domain (ORF41) and a predicted ATPase related to DnaC (ORF42). The module also encodes an antirepressor (ORF37), two proteins involved in cell lysis (ORFs 22 and 23) and two integrases: ORF20, which shows 95% amino acid sequence identity with the integrase of prophage lambdaBa02 (accession number EEM54966.1), and ORF30, which shows 80% amino acid sequence identity with an integrase from B. thuringiensis (accession number EAO53934.1).
The third genomic module (coloured yellow in ) of SpaA1 is similar to a portion of the B. cereus AH676 prophage and contains additional regulatory and recombination-related genes, including a potential recombination protein U (ORF53) and a potential DNA-binding protein (ORF54). ORFs 55 and 56 are similar to the N-terminal and C-terminal parts of an RNA polymerase sigma 70 factor, respectively. The last nucleotide of the TAA termination codon of ORF55 is also the first nucleotide of the ATG initiation codon of ORF56 within a TAATG sequence. However, the reading frame of ORF56 extends 5′ without an initiation codon to nucleotide 39374 in SpaA1, and a -1 frameshift in the region of nucleotides 39385–39390 during translation of ORF55 could result in a single protein of 206 amino acids that is similar to an intact RNA polymerase sigma factor from B. cereus (accession number ACM16007.1). Interestingly, approximately 70% of dsDNA long-tailed phages, including siphoviruses, exploit the programmed frameshift mechanism for gene expression, and the majority of frameshift candidates appear to use a -1 frameshift. However, no canonical -1 frameshift signal was detected by KnotInFrame, a tool for the prediction of ribosomal frameshift events. Alternatively, ORF55 and ORF56 might encode two distinct proteins, possibly forming a two-subunit complex. ORF40 of SpaA1 encodes a second RNA polymerase sigma 70 factor that is not closely related to the ORF55/56 sigma factor and is most similar to a homolog from B. thuringiensis (accession number EEM99580.1). The longest region of synteny conservation between SpaA1 and AH676 contains 6 ORFs (locus_ID: bcere0027_53380 to bcere0027_53450; accession no. NZ_CM000738.1).
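The frame relationship between ORF55 and ORF56 described above can be illustrated with a toy sequence. This is a sketch under stated assumptions: the flanking nucleotides are invented, only the TAATG pentamer comes from the text, and the helper name is hypothetical. It shows that a start codon sharing the last base of the preceding stop codon lies in the +2 frame, which is the same frame a ribosome would reach after a -1 shift.

```python
def codon_frame(position, reference_start=0):
    """Reading frame (0, 1 or 2) of a codon starting at `position`,
    relative to a frame whose codons begin at `reference_start`."""
    return (position - reference_start) % 3


# Toy sequence: invented flanks around the TAATG junction named in the text.
toy = "GCAGAA" + "TAATG" + "AAAGGT"

stop_start = toy.find("TAATG")       # first base of the TAA stop codon
start_codon_start = stop_start + 2   # ATG begins on the last A of TAA

# Confirm the two codons overlap by exactly one nucleotide.
assert toy[stop_start:stop_start + 3] == "TAA"
assert toy[start_codon_start:start_codon_start + 3] == "ATG"

upstream_frame = codon_frame(stop_start)            # frame of the upstream ORF
downstream_frame = codon_frame(start_codon_start)   # frame of the downstream ORF
offset = (downstream_frame - upstream_frame) % 3

# A -1 shift is equivalent to +2 modulo 3, so both values are 2.
print(offset, (-1) % 3)
```

The sketch only demonstrates the frame arithmetic; it says nothing about whether a frameshift actually occurs, which the text leaves open.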
Phage terminase genes can be used to construct phylogenetic trees that correlate with the structure of the phage DNA termini. However, we have detected evidence of recombination in the MZTP02 region that encompasses at least the gene for the large terminase subunit of SpaA1. The majority of the ORFs within the ORF1–ORF18 region (the MZTP02 sequence) have best hits in several Bacilli genomes (), and the tree for phage portal protein SPP1, taken as a typical example, clearly demonstrates clustering with sequences from these organisms (). In contrast, the tree for ORF4, the large subunit of phage terminase, shows a very different topology (), suggesting that, notwithstanding the synteny in this region (), ORF4 was acquired from a different, unknown source. The topology of the tree for ORF3, the small subunit of phage terminase, was compatible with the typical, SPP1-like topology (). Thus, the large subunit gene apparently was displaced via ‘in situ’ recombination, an observation that further emphasizes the mosaicism in the phage genomes.
Phylogenetic analysis of selected SpaA1 genes.
Neither the second nor the third genomic module of SpaA1 completely matches any known prophage or phage. Even with the most closely related phages, such as Cherry, EJ, phBC6A51 and the deep-sea thermophilic phage D6E, there are only a few significantly similar predicted proteins ( and ), indicating that SpaA1 represents a novel group of tailed phages.
The overall G + C content of the phage is 35.63%, strongly resembling that of its host S. pasteuri as well as that of the host for MZTP02 (B. thuringiensis, 35.3%). No significant differences in the GC content were detected among the three genomic modules of SpaA1.
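A per-module GC-content comparison of this kind can be sketched as below. This is a minimal illustration and not the procedure used here: the function name, the toy sequence and the module boundaries are all hypothetical, since this section does not give module coordinates.

```python
def gc_content(seq):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Hypothetical toy genome and module boundaries (0-based, end-exclusive).
toy_genome = "ATGCATTTAAGCGCATATTAAGGCTTAATTCGGATATTAACC"
modules = {
    "structural": (0, 14),
    "replication": (14, 28),
    "third": (28, 42),
}

# GC content of each module and of the whole toy sequence.
per_module = {
    name: gc_content(toy_genome[start:end])
    for name, (start, end) in modules.items()
}
overall = gc_content(toy_genome)

for name, value in per_module.items():
    print(f"{name}: {value:.2f}% (overall {overall:.2f}%)")
```

With a real genome, the sequence would be read from the GenBank or FASTA record and the boundaries taken from the annotated module coordinates.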