We report the first accurate genome sequence for bacteriophage P22, correcting a 0.14% error rate in previously determined sequences. DNA sequencing technology is now good enough that genomes of important model systems like P22 can be sequenced with essentially 100% accuracy with minimal investment of time and resources.
Since its discovery 50 years ago, bacteriophage P22, a double-stranded DNA tailed phage of Salmonella enterica serovar Typhimurium (32), has been a prominent model system used in investigations of numerous facets of molecular biology (5, 16, 28). Because of its importance as an experimental system, its genome was originally sequenced in 27 different fragments by many different laboratories starting 20 years ago, but mostly with early technology that was less accurate than current methods (1-4, 6, 8-13, 15, 17-27, 29-31; M. Kroeger and G. Hobom, unpublished data [GenBank accession no. X78401]; M. Sranko and M. Susskind, personal communication). Because of the continuing relevance of both past results and ongoing investigations, such as comparative genomic studies, evolutionary studies, structural studies, etc., we believe that it is important to know the P22 genome sequence accurately. Vander Byl and Kropinski (30) resequenced a few parts of the P22 genome and reported an updated P22 sequence and the resolution of 17 ambiguities among the previously reported sequences. Our independent comparison of the original 27 sequence fragments revealed 28 sequence discrepancies. All 28 were resolved by reanalyzing the original data from our laboratories and by designing oligonucleotide primers to program sequencing reactions across each of the discrepancies by using wild-type P22 virion DNA as the template. Our wild-type P22 came directly from David Botstein's strain collection.
These 28 sequence discrepancies came from disagreements in areas where the individual published sequences overlapped. Since these overlap regions in aggregate covered only 21% of the genome, it seemed likely that there would be additional errors in the remainder of the published sequence. Thus, when we accidentally obtained sequence information from an extremely close relative of P22 (P22-pbi [see below]) which contaminated a preparation of another Salmonella phage, we decided to continue collecting sequence information for this phage until its genome sequence was complete. Shotgun sequencing, performed as previously described (14) with an ABI 3100 capillary sequencer, was continued to 10.2-fold coverage of the whole genome, with complete coverage on both strands, yielding a circular sequence 41,724 bp long. This sequence had 48 differences from the “resolved” P22 genome sequence (Table 1). Our previous experience with contemporary sequencing methodology led us to expect that this newly determined sequence of P22-pbi should be essentially error free as a result of the inherent advantages of shotgun sequencing and the ease of collecting data to high redundancy. The fortuitous availability of two independent determinations of the same genome sequence provided an efficient way to test that assumption, since it allowed attention to be focused specifically on the quality of data at the sites of disagreement. Careful reexamination of the P22-pbi data at each of the 48 sites of disagreement showed that the new sequence was unambiguous in each case.
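As an illustration of how sites of disagreement between two determinations of the same genome can be tabulated, the following minimal Python sketch lists mismatched positions. It assumes the two sequences have already been aligned to the same length and share the same origin on the circular map; the function name `difference_positions` and the toy sequences are hypothetical and are not P22 data. A real comparison would need an alignment program to handle insertions and deletions.

```python
def difference_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch.

    Assumes both sequences are pre-aligned, equal in length, and start at
    the same origin; no gaps or circular rotation are handled here.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a != b
    ]


# Made-up toy sequences for illustration only (not P22 data).
resolved = "ATGGCGTTAACC"
new_seq = "ATGGCATTAACG"

diffs = difference_positions(resolved, new_seq)
print(diffs)  # [(6, 'G', 'A'), (12, 'C', 'G')]
```

Reporting positions with their bases in both sequences makes it straightforward to go back to the raw data at exactly those sites.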
Phages are so extremely diverse that individuals this similar are very rarely, if ever, independently isolated, so we suspect that P22-pbi is almost certainly a feral version of the original P22 that escaped into the laboratory. If this is true, then some of the differences between P22-pbi and wild-type P22 might be the result of errors in the previously reported sequence. We therefore designed oligonucleotide primers to program sequencing reactions across each of the above-mentioned 48 differences and several other regions of particular interest to us and determined the sequences in these regions by using authentic wild-type P22 virion DNA as the template. In this way, we obtained unambiguous sequence for about 59% of the wild-type P22 genome (mostly outside of the previous overlaps) and found that 44 of the 48 apparent differences between the sequences of our resolved wild-type P22 and P22-pbi were in fact not actual differences at all. These 44 differences are therefore errors (which could be due to sequencing errors, strain differences, or cloning artifacts) in the previously reported P22 sequence. The four authentic differences between P22-pbi and wild-type P22, at bp 13508, 15038, 21030, and 31519 (Table 1), appear to have arisen since P22-pbi's escape.
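The reasoning used to sort the apparent differences can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `triage_differences`, the dictionary layout, and all positions and bases below are hypothetical toy values, not the sites listed in Table 1. Each site at which the previously reported sequence and P22-pbi disagree is classified according to the base found when authentic wild-type DNA is resequenced.

```python
def triage_differences(published, feral, wild_type):
    """Classify sites where the published and feral sequences disagree.

    Each argument maps a position to the base observed at that position.
    - resequenced wild type agrees with feral     -> error in published sequence
    - resequenced wild type agrees with published -> authentic difference
    - resequenced wild type agrees with neither   -> unresolved
    """
    result = {}
    for pos in sorted(published):
        pub, fer, wt = published[pos], feral[pos], wild_type[pos]
        if wt == fer and wt != pub:
            result[pos] = "error in published sequence"
        elif wt == pub and wt != fer:
            result[pos] = "authentic difference"
        else:
            result[pos] = "unresolved"
    return result


# Hypothetical toy sites (positions and bases are invented for illustration).
published = {101: "A", 250: "G", 377: "T"}
feral = {101: "C", 250: "A", 377: "G"}
wild_type = {101: "C", 250: "G", 377: "C"}

calls = triage_differences(published, feral, wild_type)
for pos, call in calls.items():
    print(pos, call)
```

In this scheme, a site is counted as an error in the earlier sequence only when the independent wild-type determination agrees with the new sequence.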
Are there other errors in the 41% of the wild-type P22 sequence that we did not resequence? We do not think so, since there is no sequence difference between the previously reported wild-type P22 sequence and P22-pbi in this 41% of the genome and we found no additional discrepancies in the 59% of the wild-type genome that we resequenced. For any further error in the previously reported P22 sequence to remain hidden, P22-pbi would have had to mutate by chance to exactly the same incorrect nucleotide, which is prohibitively unlikely. It is therefore extremely likely that the identity between P22 and P22-pbi represents the true wild-type P22 sequence in these regions and that the sequence we report here is the correct wild-type P22 sequence. This sequence accurately predicts all 258 experimentally mapped cleavage sites for 46 different restriction enzymes (reference 7 and references therein; S. Casjens, unpublished data); the one previous discrepancy, the absence of an experimentally observed XmnI site at bp 35025, is resolved by the creation of this site in the corrected sequence. Thirteen of the 44 corrections to the wild-type P22 sequence are between genes, and 31 are within genes; among the latter, 23 are changes to nonsynonymous codons and therefore alter the predicted amino acid sequences of the encoded proteins. This amounts to corrections to the amino acid sequences of 15% of the proteins encoded by the P22 genome. It seems likely that most of the differences between the wild-type P22 sequence reported here and the original P22 sequence are the result of errors in the original sequence rather than strain differences, since a majority of the corrections result in changes to nonsynonymous codons. If coding sequence differences among P22 strains are like those among other closely related organisms, then most would be synonymous.
For example, the DNA sequences of gene 23 of P22 (where a number of the differences between P22 and P22-pbi lie) and gene Q of phage λ differ by 30 nucleotides, but the encoded protein sequences differ by only six amino acids. We have submitted this corrected bacteriophage P22 sequence to GenBank as an updated and fully annotated complete genome sequence.
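The distinction between synonymous and nonsynonymous codon changes that underlies this argument can be illustrated with a short Python sketch. It assumes the standard amino acid assignments of the genetic code (which are shared by the bacterial translation table); the names `CODON_TABLE` and `classify_substitution` and the example codons are hypothetical and are not taken from the P22 corrections.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(old_codon: str, new_codon: str) -> str:
    """Return 'synonymous' if both codons encode the same residue,
    otherwise 'nonsynonymous' (changes to or from a stop codon included)."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    return "synonymous" if old_aa == new_aa else "nonsynonymous"


# Illustrative codon pairs only (not actual P22 corrections).
print(classify_substitution("CTG", "CTA"))  # synonymous (Leu -> Leu)
print(classify_substitution("GAA", "GAT"))  # nonsynonymous (Glu -> Asp)
```

Applied to each correction that falls within a gene, a classification of this kind yields the tally of synonymous versus nonsynonymous changes on which the argument rests.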
The corrected bacteriophage P22 sequence has been assigned GenBank accession no. TPA:BK000583, the 41,724-bp circular sequence of phage P22-pbi has been assigned GenBank accession no. AF527608, and the unambiguous sequences covering approximately 59% of the wild-type P22 genome have been assigned GenBank accession nos. AY121859, AY121860, AY121861, AY121862, AY121863, and AY121864.
We thank David Botstein for wild-type P22 and Miriam Susskind for access to unpublished information.
This work was supported by NSF grant 990526 to S.R.C. and NIH grants GM51975 to R.W.H. and G.F.H. and GM51609 to A.R.P.