Isolation and Characterization of Mycobacteriophages
We previously reported the genome sequences of mycobacteriophages L5 [28], D29 [30], TM4 [31], and Bxb1 [29], along with a comparative genomic analysis of these and ten additional mycobacteriophages: Barnyard, Bxz1, Bxz2, Che8, Che9c, Che9d, Cjw1, Corndog, Omega, and Rosebush [6]. We have isolated an additional 16 mycobacteriophages using the methods described previously [6] from a variety of sources: Che12 was isolated in Chennai, India; Bethlehem and U2 are from Bethlehem, Pennsylvania, United States; 244 is from Connecticut, United States; and Catera, Cooper, Halo, Llij, Orion, PBI1, PG1, Pipefish, P-Lot, PMC, Qyrzula, and Wildcat are from Pittsburgh, Pennsylvania, United States, and the surrounding areas. Each of these 16 new phages was plaque-purified, grown in quantity, banded through an equilibrium CsCl density gradient, and used to isolate virion DNA. DNA was hydrodynamically sheared, cloned into plasmid vectors, sequenced, and assembled as described previously [6]. Many of these phages were isolated, named, and characterized by undergraduate and high school students, taking advantage of phage discovery and genomics as a platform for integrating scientific research and education as described in detail below.
Our view of the general features of mycobacteriophage genomes changes only modestly with a doubling of the number of genome sequences available (). The total sequence information is increased from 979,434 bp to 2,071,001 bp but the average genome length has not changed significantly. The largest of the newer genomes is Catera (153.7 kbp), slightly smaller than Bxz1 (156.1 kbp), the largest previously sequenced mycobacteriophage. One of the newly sequenced genomes, Halo, is only 42.3 kbp long, substantially smaller than any other mycobacteriophage genome (). Neither the base composition (average %GC content, 63.7%) nor the average ORF length (600 bp) has changed significantly (). These genomes also encode 101 tRNAs and three tmRNAs. The distribution of these small translation-system RNAs is quite uneven, with three phages (Bxz1, Wildcat, and Catera) contributing 80% of the total tRNAs and all three tmRNAs. Since the primary focus of this paper is to report on the diversity of mycobacteriophage gene phamilies and the use of phage genomics as a discovery-based educational platform, further details of the individual phages and their genomes will be reported elsewhere.
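The statement that the average genome length is essentially unchanged can be checked directly from the two totals quoted above. The short sketch below assumes that the earlier total of 979,434 bp corresponds to the 14 previously compared genomes and the new total to all 30; the function and variable names are illustrative only.

```python
# Totals are taken from the text; the genome counts (14 and 30) are the
# assumed denominators for the two averages.
PREVIOUS_TOTAL_BP = 979_434
CURRENT_TOTAL_BP = 2_071_001
PREVIOUS_GENOMES = 14
CURRENT_GENOMES = 30


def mean_genome_length(total_bp: int, genome_count: int) -> float:
    """Return the mean genome length in base pairs."""
    return total_bp / genome_count


previous_mean = mean_genome_length(PREVIOUS_TOTAL_BP, PREVIOUS_GENOMES)
current_mean = mean_genome_length(CURRENT_TOTAL_BP, CURRENT_GENOMES)
relative_change = (current_mean - previous_mean) / previous_mean

print(f"previous mean: {previous_mean / 1000:.1f} kbp")
print(f"current mean:  {current_mean / 1000:.1f} kbp")
print(f"relative change: {relative_change:.1%}")
```

Under these assumptions the two means differ by less than 2%, in line with the description of the change as not significant.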
Features of Completely Sequenced Mycobacteriophage Genomes
Nucleotide Sequence Diversity of Mycobacteriophages
Comparison of the 30 mycobacteriophages with each other at the nucleotide level reveals considerable overall diversity, with small groups having recognizable sequence similarity (). The most numerous phage genome cluster contains seven that are more closely related to each other than to other phages; these include L5, the first sequenced mycobacteriophage genome [28], D29 [30], Bxb1 [29], Bxz2 [6], Che12, Bethlehem, and U2. The next most numerous group contains six members, including the previously described Rosebush [6], Orion, PG1, Cooper, Qyrzula, and Pipefish. Phages PMC, Che8, and Llij form another cluster, with parts of Che9d and Omega having similarity to these; two-member groups are formed by phages P-Lot and PBI1, Cjw1 and 244, and Bxz1 and Catera. Phages showing little or no nucleotide sequence similarity to any of the others are TM4, Halo, Che9c, Barnyard, Corndog, and Wildcat.
Figure 1 Nucleotide Sequence Comparison of 30 Mycobacteriophage Genomes as Illustrated in a Dotter Plot Using a Sliding Window of 25 bp 
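To illustrate what a dot plot with a sliding window records, the sketch below marks every pair of positions at which two sequences share an identical window. This is a deliberate simplification: as far as we understand, Dotter scores each window rather than requiring exact identity, so this is not a reimplementation of that program. The function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmer_dot_plot(seq_a: str, seq_b: str, window: int = 25):
    """Return (i, j) pairs where seq_a[i:i+window] equals seq_b[j:j+window]."""
    # Index every window of seq_b by its sequence.
    index = defaultdict(list)
    for j in range(len(seq_b) - window + 1):
        index[seq_b[j:j + window]].append(j)

    # Look up every window of seq_a in that index.
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], ()):
            dots.append((i, j))
    return dots


# Toy sequences with a short window so the result is easy to inspect.
toy_dots = kmer_dot_plot("ACGTACGGTT", "TTACGTAC", window=4)
print(toy_dots)
```

Runs of consecutive dots along a diagonal, such as (0, 2), (1, 3), (2, 4) in the toy example, correspond to the diagonal lines that indicate shared sequence in a dot plot.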
The profile of nucleotide sequence similarities differs in several notable ways from the previously reported comparison of only 14 genomes [6]. While the overall diversity remains very high, 11 of the newly sequenced genomes have recognizable nucleotide sequence similarity to at least one of the previously reported genomes. A particular surprise is the finding that five of the newly sequenced genomes have detectable nucleotide sequence similarity to the previously sequenced genome of phage Rosebush (of which no closely related genomes had been described), although, with the exception of Qyrzula, the degree of similarity is low. In contrast, the Bxz1/Catera and Cjw1/244 groups are more closely related than any other pairs of mycobacteriophage genomes. Both pairs exhibit greater than 90% nucleotide sequence identity with a number of small insertions or deletions accounting for much of the difference. Compared with other groups of phages that have been analyzed, the diversity at the nucleotide sequence level of the mycobacteriophages appears to be greater than that of either dairy phages [14] or staphylococcal phages [17], although it is not yet clear whether this reflects underlying biological diversity differences or just the relatively small numbers that have been analyzed and the isolation approaches utilized.
There is no clear correlation between membership in these clusters of similar phages and the phages' geographic origins. Although all the members of the Rosebush group came from the environs of Pittsburgh, the seven members of the L5 cluster came from Japan, New York City, California, Chennai (India), and Bethlehem, PA. The smaller groups are also mostly geographically diverse. Our view at the current level of data collection is that the clusters of similar phages are widely distributed geographically.
Mycobacteriophage Genes and Gene Phamilies
The 30 mycobacteriophage genomes encode a total of 3,357 open reading frames (). As expected, the newly sequenced genomes possess a mosaic architecture similar to those described previously, with modules (frequently containing just a single gene) shared by otherwise distantly related phages. In order to better understand this genetic diversity, we have assembled all 3,357 open reading frames into gene phamilies, i.e., groups of related sequences, using the criteria that an encoded protein must share amino acid sequence similarity at an E-value of 0.001 or better or 25% amino acid identity across its length with at least one other member of the phamily (Table S1). This generates a total of 1,536 phamilies, with a mean phamily size of 2.19 genes (). However, more than half of the phamilies (774, 50.3%) contain just a single constituent gene, and 88% of the genes are contained within phamilies containing three or fewer members ().
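Because a gene joins a phamily when it matches at least one existing member, the grouping rule behaves like single-linkage clustering. The sketch below shows one way such a rule could be applied to a table of pairwise comparisons using a union-find structure. It is not the procedure actually used to build the phamilies; the input format, the gene names, and the function name are hypothetical, and only the two thresholds come from the text.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-3    # "E-value of 0.001 or better" (from the text)
IDENTITY_CUTOFF = 25.0   # percent amino acid identity (from the text)


def build_phamilies(genes, hits):
    """Group genes into phamilies by single linkage.

    genes: iterable of gene names.
    hits: iterable of (gene_a, gene_b, e_value, percent_identity) tuples,
          a hypothetical pre-computed pairwise comparison table.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the representative, shortening the path.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b, e_value, identity in hits:
        # Either criterion is sufficient to link two genes.
        if e_value <= E_VALUE_CUTOFF or identity >= IDENTITY_CUTOFF:
            parent[find(gene_a)] = find(gene_b)

    phamilies = defaultdict(set)
    for gene in parent:
        phamilies[find(gene)].add(gene)
    return sorted(phamilies.values(), key=lambda members: sorted(members))


# Invented gene names and scores, for illustration only.
toy_genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
toy_hits = [
    ("geneA", "geneB", 1e-50, 80.0),  # linked by E-value
    ("geneB", "geneC", 0.5, 27.0),    # linked by identity only
    ("geneD", "geneE", 0.2, 12.0),    # fails both criteria
]
toy_phamilies = build_phamilies(toy_genes, toy_hits)
print(toy_phamilies)
```

In the toy example, geneA and geneC end up in the same phamily even though they were never compared directly, which illustrates how a phamily can contain members with no detectable similarity to one another.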
Size and Distribution of Mycobacteriophage Phamilies
Surprisingly, there are only three phamilies that contain members in all 30 of the mycobacteriophage genomes. Since these phamilies have the potential to contain genes or domains that are present in all mycobacteriophage genomes (and may thus correspond to mycobacteriophage signatures), we have examined these more closely. The first of these, Pham7, contains Lysin A genes, one of two putative lysins encoded by the mycobacteriophages [47]. The second lysin, Lysin B (Pham9), is also present in 27 of the 30 genomes (); however, both of these phams are highly diverse, and they appear to be composed of subgenic modules with reasonably defined boundaries (A). While each of the Pham7 members has sequence similarity to at least one other member of the phamily, no single sequence element within Pham7 is present in all 30 genomes (A).
Complex Relationships within Highly Abundant Mycobacteriophage Phamilies
Both of the remaining 30-genome phamilies correspond to tail proteins. One of these encodes the tape-measure protein (Tmp) (Pham23) that plays a role in tail assembly and determines the lengths of noncontractile tails [6]. This phamily is also highly diverse, but the relationships among the members are complicated, and no well-defined boundaries between shared modules could be recognized; this complexity is illustrated by the long branch lengths within a phylogenetic analysis (B). The second tail protein phamily (Pham28) is equally complicated, with ill-defined module boundaries (C), and a total of 81 genes fall into this phamily (A). Nevertheless, as seen in Pham7, no single sequence element in either of these tail protein phamilies is present in all 30 genomes.
It is noteworthy that the six most abundant phamilies (Phams 28, 23, 7, 9, 25, and 109; A) appear to be highly represented not just because of the conservation of essential functions but also because of their highly divergent and modular natures. These phamilies correspond to phage functions (predominantly tail proteins and lysis proteins) that are expected to be intimately involved in interacting with their bacterial hosts, and high diversity among phage tail proteins has been described previously [48]. Although the Pham109 members are related to carboxypeptidases (A), Pham109 genes are typically located among tail genes [6]. Thus, we postulate that they are likely to be structural components of tails.
Phamily-Based Clustering of Mycobacteriophage Genomes
The comparison of mycobacteriophage genomes at the nucleotide level () reveals not only considerable genetic diversity but also small groups of phages that appear to be more closely related to each other than they are to other mycobacteriophages. To explore this further, we examined the relationships among these phages by asking whether or not each genome contains a member of each gene phamily, and we used the program Splitstree [50], which accommodates alternative phylogenetic relationships, to express these data (A). This analysis reveals six clearly defined groups of genomes that we have termed clusters A through F. Cluster A contains the previously characterized phages L5, D29, Bxb1, and Bxz2, plus the newly sequenced genomes of Bethlehem, Che12, and U2. The second most numerous cluster (cluster B) includes six genomes, of which only one (Rosebush) was previously characterized [6]; the remaining four clusters are closely related pairs of genomes (A).
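The input to this kind of analysis is a presence/absence record of phamilies in each genome. The sketch below shows one way such a record, and a pairwise distance derived from it, might be computed. The exact input format and distance measure used with Splitstree are not stated here, so the Jaccard distance is an assumption chosen for illustration; the genome names, phamily numbers, and function names are hypothetical.

```python
def presence_profile(genome_phams, all_phams):
    """Return a 0/1 tuple marking which phamilies a genome contains."""
    return tuple(1 if pham in genome_phams else 0 for pham in all_phams)


def jaccard_distance(phams_a, phams_b):
    """Return 1 - (shared phamilies / phamilies present in either genome)."""
    union = phams_a | phams_b
    if not union:
        return 0.0
    return 1.0 - len(phams_a & phams_b) / len(union)


# Invented genome contents, for illustration only.
toy_genomes = {
    "PhageA": {1, 2, 3, 4},
    "PhageB": {1, 2, 3, 5},
    "PhageC": {6, 7},
}
all_phams = sorted(set().union(*toy_genomes.values()))

profiles = {name: presence_profile(phams, all_phams)
            for name, phams in toy_genomes.items()}
distance_ab = jaccard_distance(toy_genomes["PhageA"], toy_genomes["PhageB"])
distance_ac = jaccard_distance(toy_genomes["PhageA"], toy_genomes["PhageC"])

print(profiles)
print(distance_ab, distance_ac)
```

Genomes that share most of their phamilies (PhageA and PhageB) have a small distance and would be expected to group together, whereas genomes with no phamilies in common (PhageA and PhageC) are maximally distant.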
Representation of Mycobacteriophage Clusters Using Splitstree
The relationships presented in A help to organize the discussion of the general features of these genomes, and we note that this approach bears some resemblance to the phage proteomic tree described previously [51], although the Splitstree presentation enables at least partial inclusion of phylogenetic ambiguities that arise from the comparison of genomes that are pervasively mosaic in their architecture. But while these representations reflect the global relationships of the phages, it is important to note that they are aggregate representations of the evolutionary histories of these viruses and thus ignore the numerous constituent phamilies that have phylogenies that are distinct from that shown in A. The six phamilies (Phams 61, 137, 58, 1,072, 216, and 933) shown in B illustrate this problem, underscoring the important role of horizontal genetic exchange in phage evolution.
An alternative representation of phage genome relationships utilizes phamily circles to identify the participants and indicate the strength of relationships within each phamily (). For example, Pham58 is present in eight genomes, with the strongest relationships between the phamily members in four of the seven components of cluster A and both members of cluster E. In contrast, Pham61 members are also in three of the same four members of cluster A as Pham58, but these are all closely related to a Pham61 member in Omega. This representation also illustrates the phylogeny of the intein present in the Pham216 member in Omega, which differs substantially from the remainder of the phamily members. While the phamily circles shown in obviously represent only a subset of all of the possible genetic relationships in the bacteriophages, a hypothetical extension to all of the 762 phamilies that contain two or more members would constitute a complete and accurate representation of the phylogenies of the protein-encoding genes of these genomes. The future development of automated circle drawing software should greatly facilitate this.
Phamily Circle Representations of Phamily Relationships
How Many Mycobacteriophage Phams Exist?
To estimate the total number of phams shared among mycobacteriophages, we calculated the number of phams found among randomly chosen subsets of the 30 phages described above (). A hyperbolic curve was fit to the data, where PhamMax is the maximum number of phams in the population being sampled, and K_Phage is the number of phages that must be sampled to uncover one-half of the phams. We found that a curve with PhamMax = 3,064 and K_Phage = 29 fit the data well (). That is, this sample of 30 phages contains representatives of only half of the predicted total number of 3,064 phams to be found among mycobacteriophages, and sequencing a further 100 mycobacteriophage genomes would probably uncover members of an additional approximately 1,000 phams, assuming that phages that are closely related to known phages are not excluded from the sample. We note that on average the addition of a thirtieth phage to a randomly selected group of 29 phages leads to identifying approximately 25 new phams.
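The fitted equation itself does not appear in the text above. The sketch below therefore assumes the standard saturating hyperbola, phams = PhamMax × n / (K_Phage + n), which matches the stated meanings of the two parameters, and uses it to reproduce the quoted predictions. The function name is illustrative.

```python
PHAM_MAX = 3064   # fitted maximum number of phams (from the text)
K_PHAGE = 29      # phages needed to uncover half of PHAM_MAX (from the text)


def expected_phams(n_phages: float) -> float:
    """Assumed hyperbolic form: PHAM_MAX * n / (K_PHAGE + n)."""
    return PHAM_MAX * n_phages / (K_PHAGE + n_phages)


at_29 = expected_phams(29)
at_30 = expected_phams(30)
at_130 = expected_phams(130)

print(f"29 phages:  {at_29:.0f} phams (half of PhamMax)")
print(f"30 phages:  {at_30:.0f} phams")
print(f"gain from the 30th phage: {at_30 - at_29:.0f}")
print(f"gain from 100 further phages: {at_130 - at_30:.0f}")
```

Under this assumed form, 30 phages are expected to yield about 1,558 phams (close to the 1,536 observed), the thirtieth phage adds about 26, and 100 further genomes add a little under 1,000, all consistent with the figures quoted above.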
Estimating the Number of Mycobacteriophage Phams
Genetic Novelty of Phage Phamilies
The organization of mycobacteriophage genes into phamilies simplifies the process of determining how these are related to genomes of other phages and their bacterial hosts. For example, of the 1,536 phamilies, only 230 (15%) have recognizable sequence similarity (Blast E value of 0.001 or better) to existing (nonmycobacteriophage) database entries (). Remarkably, over 85% of these mycobacteriophage phamilies, a total of 1,306, represent previously unidentified genetic sequences, consistent with the idea that phages may represent the largest reservoir of unexplored genes in the biosphere [6]. Interestingly, only 126 (54.8%) of the 230 phamilies with matches correspond to genes found in other phages, whereas 104 (45.2%) are related to nonphage (predominantly bacterial) genomes; 57 (24.8%) of the 230 phamilies are found in both phage and bacterial genomes (). Some of the phams listed as exclusively matching bacterial genomes may be identifying unrecognized or unannotated prophages.
Relationships between Mycobacteriophage Phams and Previously Sequenced Proteins
It is clear from this analysis as well as that of other phage genomes that there is little overlap between the metaproteome of phages and that of prokaryotes. Moreover, typically, greater than 80% of genes within bacterial genomes can be assigned to clusters of orthologous groups, and in some cases (e.g., Buchnera) virtually every gene can [52]; this is in contrast to the high proportion of mycobacteriophage genes (approximately 50%) that are unrelated to other genes (i.e., ORFans). Grouping of the phage genes into phamilies is therefore justified similarly to the separate ordering of eukaryotic genes into eukaryotic clusters of orthologous genes [53]. We also note that only a small proportion of mycobacteriophage phamilies (approximately 20%) would qualify as adding to or expanding the current cluster of orthologous groups database.
Examination of the phams that match other phage or nonphage genes, or both, reveals qualitative differences among these groups, particularly in regard to the numbers of mycobacteriophage genomes within each pham. For example, the average pham size (defined as the number of mycobacteriophage genomes containing a phamily member) of all phams with matches to nonmycobacteriophage genes is 3.88, compared to 1.83 for those with no matches (). Thus, if a mycobacteriophage gene has sequence similarity to any nonmycobacteriophage gene, it will have more relatives within the group of mycobacteriophages than one that does not.
The average pham size is different for the groups of phams that match phage and nonphage genes (). However, the average pham size for the 126 phams matching all other phage genes is 4.67, substantially larger than the average size for all phams (2.19) or those with matches (3.88) (); the subset of these that specifically matches both phage and nonphage sequences is even larger, with an average pham size of 5.12 (). On the other hand, the average pham size for the 104 phams that exclusively match nonphage genes is only 2.91.
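The subgroup averages can be checked against the overall figure for phams with matches. The sketch below assumes that the 126 phams matching other phage genes and the 104 phams matching only nonphage genes together make up the 230 phams with matches, so that their size-weighted mean should recover the reported value of 3.88. The variable and function names are illustrative.

```python
# (number of phams, average pham size) for each subgroup, from the text.
PHAGE_MATCHING = (126, 4.67)
NONPHAGE_ONLY = (104, 2.91)
REPORTED_MEAN_WITH_MATCHES = 3.88


def weighted_mean(groups):
    """Return the mean pham size across groups, weighted by pham count."""
    total_count = sum(count for count, _ in groups)
    total_size = sum(count * mean for count, mean in groups)
    return total_size / total_count


combined_mean = weighted_mean([PHAGE_MATCHING, NONPHAGE_ONLY])
print(f"combined mean pham size: {combined_mean:.2f}")
print(f"reported value:          {REPORTED_MEAN_WITH_MATCHES:.2f}")
```

The combined value agrees with the reported average to within rounding of the two subgroup means.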
What is the basis for different distributions of pham sizes? While the dataset remains relatively small and the 30 genomes are not all equally different from each other ( and ), we postulate that the differences in average pham sizes reflect the extent to which these phams provide functions of general utility to the mycobacteriophages (i.e., high pham size) or provide specialist functions to smaller numbers of mycobacteriophages that are required to infect specific hosts or to survive within particular environmental or biological niches. Examples of the pham group that match other phages include both structure and assembly genes (e.g., capsids, portals, terminases) as well as nonstructural genes (e.g., ruvC, integrases, excises), but these phams presumably provide important functions to both mycobacteriophages and other phages. In contrast, the phams matching nonphage gene products with smaller pham sizes include gene products such as WhiB, Glutaredoxins, FtsK, Lsr2, DinD, DinG, Ro, and PurA; these presumably provide more specialist functions. It seems likely that these specialist functions have been acquired directly from host genomes, and they are relatively rare because of their restricted utility to a subset of phages, rather than their lack of opportunity to spread more broadly among this group of phages that infect at least one common host (i.e., Mycobacterium smegmatis).
Constraints on the Modular Construction of Phage Genomes
If there are approximately 3,000 different sequence phamilies among phages that infect M. smegmatis, how many different ways can these be combined to make a functional phage genome? If we assume that an average genome contains 100 genes, then there are approximately 10^350 possible ordered phamily combinations. Since there are estimated to be 10^31 phage particles in the biosphere and mycobacteriophages represent only a fraction of these, it is clear that only a minuscule fraction of the conceivable gene combinations and organizations have been used. More important, we expect that there are strong functional constraints imposed on the generation of competent phage genomes. These constraints can be grouped into four main categories. First, each genome must encode all necessary functions not provided by the host, mandating inclusion of virion structural and assembly genes, lysis genes, and possibly some DNA replication functions. Thus, a specific subset of phamilies with required functions must be present in each genome. Second, specific gene combinations are required when a gene can provide functional benefits only in combination with another gene. One example might be the joint action of holin and lysin components of the lysis machinery. Third, some phamilies may encode alternative functions, where the inclusion of a member of one phamily (e.g., a virion capsid) precludes the inclusion of any members from another phamily (e.g., a nonhomologous capsid protein). Fourth, gene organization is likely to be an important constraint, for example, allowing cotranscription of genes that need to be coregulated.
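The order of magnitude of the number of ordered combinations can be reproduced by counting ordered selections of 100 phamilies from about 3,000, working in logarithms to avoid very large integers. How the original estimate treated repeated phamilies is not stated, so the sketch below computes both cases; the function name is illustrative.

```python
import math

PHAMILY_COUNT = 3000        # approximate number of phamilies (from the text)
GENES_PER_GENOME = 100      # assumed average genes per genome (from the text)
LOG10_PHAGE_PARTICLES = 31  # estimated phage particles, as a power of ten


def log10_ordered_combinations(n_phamilies: int, n_genes: int,
                               repeats: bool = True) -> float:
    """Return log10 of the number of ordered selections of n_genes phamilies."""
    if repeats:
        # n ** k ordered selections when a phamily may be used more than once.
        return n_genes * math.log10(n_phamilies)
    # n! / (n - k)! ordered selections without repeats, via log-gamma.
    log_count = math.lgamma(n_phamilies + 1) - math.lgamma(n_phamilies - n_genes + 1)
    return log_count / math.log(10)


with_repeats = log10_ordered_combinations(PHAMILY_COUNT, GENES_PER_GENOME)
without_repeats = log10_ordered_combinations(PHAMILY_COUNT, GENES_PER_GENOME,
                                             repeats=False)

print(f"with repeats:    about 10^{with_repeats:.0f}")
print(f"without repeats: about 10^{without_repeats:.0f}")
print(f"excess over particle count: about 10^{with_repeats - LOG10_PHAGE_PARTICLES:.0f}")
```

Either way the count is within a few orders of magnitude of the rounded figure given in the text and exceeds the estimated number of phage particles by more than 300 orders of magnitude.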
Clearly, the 30 mycobacteriophage genomes are not simply the result of random assortment of phamilies. At both the nucleotide () and protein () levels, we can identify clusters of phages that exhibit greater levels of similarity to each other than to the larger population; for example, 21 of the 30 phages fall into six separate clusters at the nucleotide sequence and the phamily inclusion levels used here (). While it is likely that this clustering reflects in part the particular host and the isolation procedures used, it also suggests that there are pools of related phages that enjoy success within the current environment. This is reminiscent of what has been seen for phages of other hosts, including those of Escherichia coli, where separate clusters are represented by phages lambda, T4, T7, and P2. The types of phages defined by the clusters of mycobacteriophages are not obviously correlated with the types seen in the coliphages, suggesting the possibility that each host type has a distinct set of phage types associated with it. We suspect instead that each of the clusters defined by the current collection of genome sequences is part of a continuum of types extending over a range of hosts, with no sharply definable boundaries. However, more data will be required to clarify this question.
It seems likely that an additional constraint on the assortment of phamilies is that their distribution within genomes is not independent of each other. For example, examination of the data set reveals that Pham56 is found in diverse bacteriophages but only in those that carry the Pham25 minor tail protein, suggesting both a role for Pham56 in virion assembly and a lack of function in genomes lacking Pham25. The nine members of Pham297 are not only found in Pham25-bearing genomes, but they are adjacent to the Pham25-encoding genes, suggesting an even more intimate association between these proteins. The coassociation of genes among bacteriophage genomes can only be assessed using a large collection of diverse genomes, and this demonstrates how datasets such as this one can be used to uncover plausible gene functions and interactions.
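A dependency such as that of Pham56 on Pham25 amounts to a subset test across genomes: every genome carrying the first phamily must also carry the second. The sketch below shows how such a test could be run on presence/absence data. The genome contents are invented for illustration and do not reproduce the real distribution; the function name is hypothetical.

```python
def always_co_occurs(genome_phams, dependent, required):
    """Return True if every genome carrying `dependent` also carries `required`.

    Returns False when no genome carries `dependent`, since there is then
    no evidence for an association.
    """
    carriers = [name for name, phams in genome_phams.items() if dependent in phams]
    return bool(carriers) and all(required in genome_phams[name] for name in carriers)


# Invented presence/absence data, for illustration only.
toy_genome_phams = {
    "GenomeA": {25, 56, 297},
    "GenomeB": {25, 297},
    "GenomeC": {25},
    "GenomeD": {7, 23},
}

print(always_co_occurs(toy_genome_phams, dependent=56, required=25))
print(always_co_occurs(toy_genome_phams, dependent=25, required=56))
```

The test is directional: in the toy data Pham56 is found only alongside Pham25, but Pham25 also occurs without Pham56. Checking whether the associated genes are adjacent, as reported for Pham297, would additionally require gene coordinates.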
Phage Discovery as an Educational Tool
The newly discovered phages described here were isolated and characterized by junior members of the laboratory, including high school and undergraduate students (). The educational benefits of performing scientific research have been reported previously [54], although identifying suitable research projects and laboratory environments represents a significant challenge to the research community. The Phagehunters Program developed at the University of Pittsburgh, in which students discover and genomically characterize their own bacteriophages, provides a particularly strong combination of attributes that maximize the educational benefits within a research environment; we note that a successful program with some common features also was developed by R. Young and colleagues [56]. While there may be many others, we have identified seven key attributes of this project that facilitate successful outcomes, and these are described in . Identifying the key features of projects that are well suited for student research should facilitate the identification and development of other effective programs.
Isolation, Sequencing, and Annotation of Mycobacteriophage Genomes
Seven Helpful Attributes for High School and Undergraduate Research Projects
Two particularly important features of this educational platform are the strong emphasis on scientific discovery and project ownership. As evident from the comparative genomic analysis described above, there are three main discovery elements. First, there is an excellent prospect of each student isolating a phage from the environment that is different from all previously described viruses. Second, within each genome there is an opportunity to discover new genes that are distinct from all previously identified genes. Third, many of these phage genomes surprisingly contain homologs of known genes that have not been previously found in phage genomes. The high diversity of the mycobacteriophage population (), the preponderance of novel genes (), and the mosaic architecture of these genomes provide a high promise of discovery for each participating student.
The opportunity for students to discover novel genes and viruses is important since it is stimulating and highly motivating, providing a strong impetus for students to become engaged in scientific research and to maintain their involvement even through the more challenging aspects of their projects. The modest genome size of the phages is such that a single student can reasonably manage an individual phage, and project ownership adds motivation and commitment. The prospect of naming their new phage generates considerable interest among these novice scientists while also offering opportunities to learn about the justification of a nonsystematic nomenclature for viruses that contain genomes with highly individualized mosaic architectures. This educational platform is clearly not restricted to mycobacteriophages and can be readily extended to the isolation and characterization of environmental phages for other nonpathogenic hosts; comparative genomic analyses of dairy, staphylococcal, and pseudomonal phages show that these also are genetically diverse and typically harbor a high proportion of novel genes [14
The development of multiple projects with parallel structures as described here brings both advantages and disadvantages. The main disadvantage is that since the research path is well established, students generally have only limited opportunities for experimental planning and design. However, there are also considerable advantages to parallel project structures. First, it provides opportunities for a greater number of students to participate than if each were independently structured. Second, it facilitates the training of students through peer and near-peer mentoring systems, and we find that mentorship of high school students by undergraduates is a particularly effective combination. Moreover, undergraduate student mentors can be readily trained using the Entering Mentoring program developed by Pfund and colleagues [57]. The parallel project structure, coupled with this mentoring system and the technical simplicity of the initial project stages, represents essential ingredients for enabling students in the early stages of their educational development to engage in scientific research.
Finally, this phage discovery educational platform requires only modest prior comprehension of biological facts and concepts. This simplifies access of young students to scientific research and provides opportunities to students who do not necessarily excel in more traditional classroom settings. The platform offers numerous opportunities for students to learn concepts in microbiology, ecology, genetics, computational biology, and evolution within an inquiry-driven environment and is fully inclusive of a diverse variety of learning styles. Additionally, the significant bioinformatics component of the program appeals to students with computer science and engineering backgrounds, and in doing so, it creates a diverse research group that offers advantages both to the participants and the research agenda itself. Detailed protocols are available at http://www.pitt.edu/~gfh
The comparative genomic analysis of 30 mycobacteriophage genomes provides important new insights into the diversity and architecture of phage genomes and offers insights about gene exchange between phage genomes and between phages and their hosts. It is likely that these general features will be shared by most other phages, and the recent comparative analysis of 27 Staphylococcus and 18 Pseudomonas phages also shows relatively high genetic diversity [16]. Phage isolation and genomics is a powerful educational platform that provides research opportunities to students from diverse educational backgrounds. The high diversity of the phage population offers the excitement that each student can isolate a unique virus and use genomic approaches to understand the relationship of the newly discovered phage to the broader biological world. The ability of students to contribute successfully to achieving the key scientific goals of understanding viral diversity, and the underlying evolutionary mechanisms that give rise to it, suggests that phage isolation and characterization can be used broadly for educational purposes.