The data set: viruses and genomes.
The query used for retrieval from the NCBI Nucleotide database included the following: Viruses[Organism] NOT cellular organisms [ORGN] NOT srcdb_refseq[PROP] AND vhost bacteria[filter] AND “complete genome”[All Fields]. It was followed by curation to remove entries that matched the query but were not actually complete genomes, such as entries containing only complete coding sequences, partial genomes, single genes, and mutants. The nucleotide database also contains several genomes labeled as a “complete sequence” rather than a “complete genome,” most notably over 30 mycophages; these were added separately, again followed by manual curation.
As described previously, single-linkage clustering was used to join groups of very closely related (essentially the same) phages. The resulting group contains the union of all proteins from each member phage genome. For the larger genomes containing ≥20 genes, viruses sharing ≥90% of their genes were joined, whereas for the smaller genomes of <20 genes, viruses must share all of their genes to be joined. Shared genes were defined as symmetric best matches between a pair of phage genomes. However, no procedure using gene content could be found to work for the genus Microvirus, because the phage species ϕX174, G4, and alpha3 share the same set of 11 proteins (44), so the viruses of the genus Microvirus were specifically exempted from this procedure in order to allow these species to remain distinct despite their identical gene content.
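The joining rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the function names, the input format (precomputed counts of symmetric best matches per genome pair), and the choice to require that each genome of a pair satisfy its own size-dependent threshold are all hypothetical, and the Microvirus exemption is not modeled.

```python
from itertools import combinations


def meets_threshold(n_genes: int, n_shared: int) -> bool:
    """Size-dependent rule for one genome of a pair (assumed interpretation)."""
    if n_genes >= 20:
        # Large genome: at least 90% of its genes shared (integer arithmetic).
        return 10 * n_shared >= 9 * n_genes
    # Small genome: every gene must be shared.
    return n_shared == n_genes


def single_linkage(gene_counts: dict, shared: dict) -> list:
    """Group phages by single linkage using a small union-find.

    gene_counts: phage name -> number of genes
    shared: frozenset({a, b}) -> number of symmetric best matches
    """
    parent = {name: name for name in gene_counts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(gene_counts, 2):
        n = shared.get(frozenset((a, b)), 0)
        if meets_threshold(gene_counts[a], n) and meets_threshold(gene_counts[b], n):
            parent[find(a)] = find(b)  # one qualifying link merges the groups

    groups = {}
    for name in gene_counts:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: A-B and B-C qualify, A-C does not, D-E share 9 of 10.
counts = {"A": 50, "B": 50, "C": 52, "D": 10, "E": 10}
pairs = {
    frozenset(("A", "B")): 46,
    frozenset(("B", "C")): 47,
    frozenset(("A", "C")): 40,
    frozenset(("D", "E")): 9,
}
clusters = single_linkage(counts, pairs)
```

In this toy input, A and C end up in the same group through B even though they do not qualify directly, which is the defining behavior of single linkage; D and E stay apart because small genomes must share all of their genes.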
Identification of signature genes for viral taxa.
The complete genomes of viruses belong to 1,158 distinct taxon groups at various levels of their hierarchy (all the way from all viruses through orders, families, genera, and individual species). Individual virus species represented only once make up the majority of these groups, 991 (85%), and 1,072 (93%) are represented by fewer than 3 distinct viruses. To aim for higher-level clades, these groups were discarded, leaving 86 taxa represented by at least 3 distinct viruses. However, 25 of these represent temporary collections of unclassified or unassigned viruses or environmental samples, and since these do not represent bona fide clades, they were removed to be analyzed separately (although descendant groups below them in the hierarchy were still retained if they met the other criteria). Four more groups were redundant (containing only a single descendant taxonomic node; these are ssRNA viruses/Leviviridae, dsRNA viruses/Cystoviridae, Rudiviridae/Rudivirus, and Tectiviridae/Tectivirus) and thus were collapsed, leaving a data set of 57 taxa for which we attempted to find signatures. These are listed in File S5 in the supplemental material and include 6 clades above the family level (the 4 genomic types dsDNA, ssDNA, dsRNA, and ssRNA, the order “Caudovirales; tailed phages,” and “all viruses”), 9 clades at the family level, 6 at the subfamily level, 28 at the genus level, and 8 groups of individual viruses. As an example of the latter, the Enterobacteria phage ϕX174 species clade consists of the 19 genomes of the Enterobacteria phages S13, ID1, ID22, ID34, ID45, NC1, NC5, NC7, NC11, NC16, NC37, NC41, NC51, NC56, WA4, WA10, WA11, S13, and ϕX174 sensu lato.
The POGs to be used as signatures for particular groups of viruses were chosen using a 3-tiered procedure. In each case, only individual POGs were considered, and the ability of compound signatures consisting of multiple genes to represent a taxon (e.g., gene A or gene B or genes A, B, and C) was not evaluated.
First, candidate signatures were chosen from the information contained only within the POGs themselves. Among the POGs appearing only within a given group of viruses and never outside that group, candidates were chosen that had the highest recall (found in the most genomes) and/or the highest VQ. Because taxonomy is hierarchical by nature, whenever a POG could serve as a signature for multiple taxonomic groups at different levels of the hierarchy, the group maximizing precision and recall, in that order, was chosen, with ties broken by assignment to the highest taxonomic level available. For instance, the RNA-dependent RNA polymerase (46) is found in all dsRNA viruses, and because in the present data set all dsRNA viruses are assigned to the family Cystoviridae and also to the genus Cystovirus, the precision and recall were both tied at 100% for all 3 of these clades; thus, this POG was assigned to dsRNA viruses at the highest taxonomic level. In another example, the major coat protein present only within the genomes of members of the genus Inovirus could also serve as a signature for the higher-level clades of the family Inoviridae or for all ssDNA viruses. Doing so would yield 100% precision, because this protein is not observed outside Inoviridae or ssDNA viruses, but would reduce recall because it is not present in other viruses within Inoviridae (plectroviruses) or in other ssDNA viruses (Microviridae).
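The level-selection rule in this first tier can be illustrated with a minimal Python sketch. It assumes the usual definitions (precision is the fraction of genomes carrying the POG that belong to the taxon; recall is the fraction of the taxon's genomes that carry the POG) and represents each taxon by a hypothetical depth in the hierarchy (smaller means higher level) plus its set of member genomes. The genome identifiers and function names are invented for illustration.

```python
def precision_recall(pog_genomes: set, taxon_genomes: set) -> tuple:
    """Precision and recall of a POG as a marker for one taxon.

    Both sets are assumed to be non-empty.
    """
    hits = len(pog_genomes & taxon_genomes)
    return hits / len(pog_genomes), hits / len(taxon_genomes)


def assign_taxon(pog_genomes: set, taxa: dict) -> str:
    """Pick the taxon maximizing precision, then recall, then level.

    taxa: taxon name -> (depth, member genomes); depth 1 is the highest level.
    """
    def key(name):
        depth, members = taxa[name]
        precision, recall = precision_recall(pog_genomes, members)
        return (precision, recall, -depth)  # ties go to the shallowest node

    return max(taxa, key=key)


# Toy version of the Inovirus coat protein example (genome IDs are made up).
ssdna = {
    "ssDNA viruses": (1, {"ino1", "ino2", "ino3", "plec1", "mic1", "mic2"}),
    "Inoviridae": (2, {"ino1", "ino2", "ino3", "plec1"}),
    "Inovirus": (3, {"ino1", "ino2", "ino3"}),
}
coat_protein = {"ino1", "ino2", "ino3"}
coat_assignment = assign_taxon(coat_protein, ssdna)

# Toy version of the dsRNA polymerase example: all three clades coincide.
dsrna = {
    "dsRNA viruses": (1, {"cys1", "cys2"}),
    "Cystoviridae": (2, {"cys1", "cys2"}),
    "Cystovirus": (3, {"cys1", "cys2"}),
}
polymerase_assignment = assign_taxon({"cys1", "cys2"}, dsrna)
```

In the first toy case precision is 100% at every level, so recall decides in favor of the genus; in the second, precision and recall are tied everywhere and the tie-break selects the highest level.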
Second, precision and recall were evaluated against the protein sequences of viral genomes. Once candidate signatures were chosen for each taxon, a sequence profile was constructed (multiple sequence alignment constructed by MUSCLE [47] and PSI-BLAST [45]) and used to search for matches among the protein sequences of the 1,027 viruses with completely sequenced genomes. An E value threshold of 1e−5 was used, as it was found to provide the highest recall and precision in small-scale tests with several signatures (data not shown). Matches were also required to have a bit score of at least 40 and a region of sequence similarity extending over at least 40 amino acids. In some cases, the use of a profile significantly enhanced the recall of the candidate signature compared to simply testing the POG membership. For instance, the maturation protein and RNA replicase beta protein are found in 10 of the 12 (recall of 83%) ssRNA viruses by the POG-making procedure, but matches to the respective profiles were identified in all 12 (recall of 100%). In other cases, the recall and/or precision were reduced by the use of profiles. For instance, although the RNA polymerase subunit of N4-like viruses never appears outside that clade in the POGs, the profile apparently raised the sensitivity enough to find matches in phages in other genera and even another family, thus lowering its precision to only 50%. However, this does not translate to a loss of a signature for the N4-like viruses, because 7 other signature candidates exist for it that each have 100% recall, 100% precision, and a VQ of 1.0.
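The three match criteria can be expressed as a simple filter over search hits. The sketch below is illustrative only: the `Hit` record and its field names are hypothetical rather than the output format of any particular tool, and treating the E value threshold as inclusive (≤1e−5) is an assumption, since the text does not say whether the bound is strict.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One profile-versus-protein match (hypothetical record layout)."""
    genome: str
    evalue: float
    bit_score: float
    aligned_length: int  # length of the similar region, in amino acids


def passes(hit: Hit, max_evalue: float = 1e-5,
           min_bits: float = 40.0, min_length: int = 40) -> bool:
    """Apply all three criteria; a hit must satisfy every one of them."""
    return (hit.evalue <= max_evalue
            and hit.bit_score >= min_bits
            and hit.aligned_length >= min_length)


def matched_genomes(hits: list) -> set:
    """Genomes with at least one hit passing the filter."""
    return {h.genome for h in hits if passes(h)}


# Made-up hits: g2 fails on E value, g3 on bit score, g4 on aligned length.
example_hits = [
    Hit("g1", 1e-20, 120.0, 200),
    Hit("g2", 1e-3, 55.0, 80),
    Hit("g3", 1e-8, 35.0, 90),
    Hit("g4", 1e-6, 45.0, 30),
    Hit("g5", 1e-5, 40.0, 40),
]
accepted = matched_genomes(example_hits)
```

Recall for a taxon then follows as the number of its genomes in the accepted set divided by the total number of genomes in the taxon.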
Third, because the diversity of viruses is undersampled, in order to test for bias introduced by the set of virus genomes, these profiles were also tested against all known virus proteins present in the NCBI nr nucleotide database. Occasionally this reduced the precision of a signature candidate; for instance, 2 of the 3 signatures in CBA120-like viruses (in the family Myoviridae) had precision lowered from 100 to 80% due to matches among virus proteins for which complete genomes were not yet available. In order to use a conservative estimate of precision, the lower value of the two profile searches was reported.
Unclassified viruses represent a complication to the search for signatures, because only partial taxonomic information is available for these. Nearly 80% of the 1,027 genomes in the POG data set are listed as unclassified or unassigned at some level of the taxonomic hierarchy, with 78 (8%) even unclassified at the root (although all but one of these were found to be dsDNA by genome size or manual literature search; see File S1 in the supplemental material). For all of these, matches were counted as far as the available partial information would allow. For instance, a virus unclassified at the root could match any signature gene for any clade without penalty to precision, because it is possible that the unclassified virus legitimately belongs to any clade. However, an unclassified ssDNA virus could only match (without penalty) signatures for clades below ssDNA viruses in the hierarchy, such as the family Inoviridae or the genera Inovirus or Plectrovirus, but could not legitimately match a clade within dsDNA, ssRNA, or dsRNA viruses, because the partial information would conflict in the latter cases. A further complication arises from taxonomic groups that appear as descendants of an internal taxonomic node that contains an unclassified label; for example, the 936 group of lactococcal phages, whose full taxonomic classification is the following: dsDNA viruses, no RNA stage→Caudovirales; tailed phages→Siphoviridae; phages with long noncontractile tails→unclassified Siphoviridae→936 group of lactococcal phages. By the procedure listed above, these viruses would be allowed to be matched without penalty by a potential signature for any other clade within the Siphoviridae family, such as lambda-like viruses, despite the fact that the 936 group forms a distinct clade on its own.
To prevent this from occurring, all internal nodes containing an unclassified label were collapsed, effectively promoting groups such as this to become a distinct group within the Siphoviridae family, thereby penalizing any matches to it from another Siphoviridae virus, such as a lambda-like virus. Finally, in addition to the 78 taxa that include at least 3 virus genomes present in the POG data set, an additional 26 unclassified or unassigned taxon nodes were found in the taxonomy information supplied in the GenBank entries of these genomes. Since these are not true taxa, they were not included in ; however, occasionally signature genes could be found for them, and both the list of these taxa and those signatures are included in File S5 in the supplemental material.
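The handling of unclassified viruses described above can be summarized in a short Python sketch. It is one possible reading of the rules, not the implementation actually used: lineages are represented as root-to-leaf lists of labels (shortened here for readability), a lineage counts as partial when it is empty or ends in an unclassified or unassigned label, and the function names and the three outcome labels are hypothetical.

```python
TAGS = ("unclassified", "unassigned")


def is_unclassified(label: str) -> bool:
    """True if the node label marks a temporary, non-taxon collection."""
    return any(tag in label.lower() for tag in TAGS)


def normalize(lineage: list) -> tuple:
    """Collapse unclassified nodes and report whether the lineage is partial."""
    partial = not lineage or is_unclassified(lineage[-1])
    known = [node for node in lineage if not is_unclassified(node)]
    return known, partial


def classify_match(genome_lineage: list, clade_path: list) -> str:
    """Decide how a match of a clade's signature to a genome is counted."""
    known, partial = normalize(genome_lineage)
    if known[:len(clade_path)] == clade_path:
        return "member"      # genome lies inside the clade
    if partial and clade_path[:len(known)] == known:
        return "allowed"     # clade is consistent with the partial lineage
    return "penalized"       # conflicts with known taxonomy


# Clade paths and genome lineages with simplified, illustrative labels.
inovirus = ["ssDNA viruses", "Inoviridae", "Inovirus"]
lambda_like = ["dsDNA viruses", "Caudovirales", "Siphoviridae", "lambda-like viruses"]
siphoviridae = lambda_like[:3]

root_unknown = ["unclassified phages"]
ssdna_unknown = ["ssDNA viruses", "unclassified ssDNA viruses"]
phage_936 = ["dsDNA viruses", "Caudovirales", "Siphoviridae",
             "unclassified Siphoviridae", "936 group"]

results = {
    "root vs Inovirus": classify_match(root_unknown, inovirus),
    "ssDNA vs Inovirus": classify_match(ssdna_unknown, inovirus),
    "ssDNA vs lambda-like": classify_match(ssdna_unknown, lambda_like),
    "936 vs lambda-like": classify_match(phage_936, lambda_like),
    "936 vs Siphoviridae": classify_match(phage_936, siphoviridae),
}
```

Because the internal "unclassified Siphoviridae" node is dropped, the 936 group becomes a direct child of Siphoviridae in this representation, so a lambda-like signature matching a 936 phage is penalized while a Siphoviridae-level signature still counts it as a member.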
Number and percentage of taxa that can be represented by at least one signature gene, with precision fixed at 100% and recall (x axis) allowed to vary. (a) The dependence of signatures on VQ value. (b) Breakdown of signatures into taxonomic levels.