With the advent of metagenomics approaches, a large diversity of known and unknown viruses has been identified in various types of environmental, plant, and animal samples. One such widespread virus group is the recently established family Genomoviridae which includes viruses with small (~2–2.4kb), circular ssDNA genomes encoding rolling-circle replication initiation proteins (Rep) and unique capsid proteins. Here, we propose a sequence-based taxonomic framework for classification of 121 new virus genomes within this family. Genomoviruses display ~47% sequence diversity, which is very similar to that within the well-established and extensively studied family Geminiviridae (46% diversity). Based on our analysis, we establish a 78% genome-wide pairwise identity as a species demarcation threshold. Furthermore, using a Rep sequence phylogeny-based analysis coupled with the current knowledge on the classification of geminiviruses, we establish nine genera within the Genomoviridae family. These are Gemycircularvirus (n=73), Gemyduguivirus (n=1), Gemygorvirus (n=9), Gemykibivirus (n=29), Gemykolovirus (n=3), Gemykrogvirus (n=3), Gemykroznavirus (n=1), Gemytondvirus (n=1), Gemyvongvirus (n=1). The presented taxonomic framework offers rational classification of genomoviruses based on the sequence information alone and sets an example for future classification of other groups of uncultured viruses discovered using metagenomics approaches.
Viral metagenomics, fostered by powerful high-throughput sequencing methods, has recently revolutionized our perception of virus diversity in the environment. Many novel groups of uncultivated viruses have been discovered during the past decade, including viruses with small, moderately-sized, and even large genomes (Yau et al. 2011; Roux et al. 2012; Labonte and Suttle, 2013; Dutilh et al. 2014; Yutin et al. 2015; Zhou et al. 2015; Dayaram et al. 2016; Steel et al. 2016). Many of these virus groups remain unclassified. To embrace the constantly growing output from viral metagenomics studies, virus taxonomy is increasingly switching from the traditional classification guided by biological features, such as serology, virion morphology or host range, to predominantly sequence-guided practices (Simmonds et al. 2017). Sequence-guided virus classification is relatively straightforward when the new viruses fall into existing taxa with well-defined demarcation criteria. However, in the absence of isolated representatives and an established taxonomic framework, rational definition of appropriate taxonomic ranks, such as families, genera, and species, for novel groups of uncultured viruses might be considerably more complex. Solutions to this problem are perhaps most urgently needed in the case of single-stranded (ss) DNA viruses, which are extremely widespread in nature. Due to their small genome sizes, high mutation and recombination rates (Duffy and Holmes 2008; Duffy and Holmes 2009; Firth et al. 2009; Harkins et al. 2009, 2014; Grigoras et al. 2010; Martin et al. 2011; Streck et al. 2011; Nguyen et al. 2012; Cadar et al. 2013; Roux et al. 2013), and relative ease of genome amplification, an incredible diversity of these viruses has been discovered through metagenomics studies in all conceivable habitats.
ssDNA viruses infect cells from all three domains of life and are currently classified by the International Committee on Taxonomy of Viruses (ICTV) into eleven families and one unassigned genus. Members of the families Microviridae and Inoviridae infect bacteria, viruses of the families Spiraviridae and Pleolipoviridae prey on archaea, whereas eukaryotes host viruses classified into the families Anelloviridae, Bidnaviridae, Circoviridae, Geminiviridae, Genomoviridae, Nanoviridae, and Parvoviridae, and the unassigned genus Bacilladnavirus. In addition, several widespread groups of uncultured viruses discovered by viral metagenomics remain unclassified, predominantly those that are circular replication-associated protein encoding single-stranded (CRESS) DNA viruses (Simmonds et al. 2017).
The Genomoviridae family is one of the most recently established families of ssDNA viruses (Adams et al. 2016; Krupovic et al. 2016). The family currently includes a single genus Gemycircularvirus, which contains a single species, Sclerotinia gemycircularvirus 1, encompassing a single isolate, Sclerotinia sclerotiorum hypovirulence-associated DNA virus 1 (SsHADV-1). SsHADV-1 was isolated from the plant-pathogenic fungus Sclerotinia sclerotiorum and is the only ssDNA virus known to infect fungi (Yu et al. 2010, 2013). Recently, Liu et al. (2016) have shown that SsHADV-1 is able to infect a mycophagous insect (Lycoriella ingenua) which acts as a transmission vector. SsHADV-1 virions are non-enveloped, isometric, 20–22 nm in diameter, and assembled from a single capsid protein (CP) (Yu et al. 2010). The genome is a circular ssDNA molecule of 2,166 nucleotides and contains two genes, one for the CP and one for the rolling-circle replication initiation protein (Rep). As in many other ssDNA viruses with circular genomes, the large intergenic region of SsHADV-1 contains a potential stem-loop structure with a nonanucleotide (TAATATTAT) motif at its apex, which is likely to be important for rolling-circle replication initiation. The CP of SsHADV-1 is not recognizably similar to the corresponding proteins from viruses in other taxa. Although SsHADV-1 remains the only isolated and classified member of the Genomoviridae, 121 viral genomes with varying degrees of similarity to that of SsHADV-1 have been recovered and sequenced from various environmental, plant- and animal-associated samples, indicating that these viruses are widespread and abundant in the environment (Table 1). However, a proper taxonomic framework and demarcation criteria necessary to accommodate these viruses within the family Genomoviridae are lacking. Here, we explore the diversity and evolution of uncultured SsHADV-1-like viruses and attempt to establish a framework for their classification based on sequence data alone.
At the time of the analysis (August, 2016), there were 121 SsHADV-1-like genome sequences in the GenBank database. Each of these genomes encodes two putative proteins homologous to the CP and Rep of SsHADV-1, highlighting strong coherence of this virus assemblage. Nevertheless, there is a considerable sequence divergence within the group (Supplementary Fig. S1). To investigate the extent of genomoviral sequence diversity, we analyzed the distribution of genome-wide pairwise identities (one minus Hamming distances of pairwise aligned sequences with pairwise deletion of gaps) across all 121 available genomes (Fig. 1A) using SDT v1.2 (Muhire, Varsani, and Martin 2014). Most of the virus genomes in our dataset share 56–66% genome-wide pairwise identities and only a handful contained nearly identical relatives (≥98% identity), indicating that sequence diversity among SsHADV-1-like viruses remains largely unexplored.
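The identity measure described above (one minus the Hamming distance, with pairwise deletion of gaps) can be illustrated with a minimal Python sketch. It assumes the two sequences have already been pairwise aligned and uses '-' as the gap character; the function name is hypothetical, and the sketch is not the SDT implementation, which also performs the alignment step itself.

```python
def pairwise_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return 1 - Hamming distance over columns where neither sequence has a gap.

    Assumes both strings come from the same pairwise alignment
    (equal length, gap characters already inserted).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0  # columns that contain no gap in either sequence
    matches = 0   # compared columns in which both residues are identical
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # pairwise deletion: ignore any column with a gap
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared


# Toy alignment: five ungapped columns, four of them identical.
print(pairwise_identity("ACGT-A", "ACCTTA"))
```

Because gapped columns are dropped for each pair separately, the denominator can differ from one pair of genomes to the next.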
Pairwise comparison of the Rep and CP protein sequences revealed a broader distribution of identity values (Fig. 1B and C). Notably, the CPs were considerably more divergent than the Reps, with the highest proportion of pairwise identities being ~33% (versus ~48% for the Rep). This observation is in line with functional differences of the two proteins and the fact that viral CPs often encompass host recognition determinants which are under constant pressure to co-evolve with the cellular receptors (Kolawole et al. 2014; Shangjin, Cortey, and Segales, 2009). Based on the analysis of distribution of the pairwise identities across genomes, CPs and Reps, we consider a threshold of 78% to be a conservative value for species demarcation. Thus, all viral genomes showing identities higher than this value should be considered as variants of the existing species. Nonetheless, there may be situations where it is difficult to assign species because a particular new sequence is (1) >78% identical to some variants of a species but <78% identical to other variants of that species, or (2) >78% identical to variants of two or more different species.
To resolve the above conflicts, we suggest adopting an approach similar to that proposed for geminiviruses (Muhire et al. 2013; Varsani et al. 2014a, b; Brown et al. 2015). To resolve conflict 1, we suggest that the new sequence be classified within any species in which it shares >78% identity with any one variant formerly classified as belonging to that species, even if it is <78% identical to other viruses within that species. To resolve conflict 2, we suggest that the new sequence be considered as belonging to the species with sequences with which it shares the highest degree of similarity.
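The two resolution rules can be expressed as a short decision procedure. The Python sketch below is illustrative only: the function name, the input dictionaries and the variant and species labels are hypothetical, and identities are assumed to be genome-wide pairwise identities expressed as fractions between 0 and 1.

```python
SPECIES_THRESHOLD = 0.78  # species demarcation threshold proposed in the text


def assign_species(identities, classified):
    """Suggest a species for a new genome, or None if it represents a new species.

    identities: maps a classified variant name to its genome-wide pairwise
                identity (0..1) with the new sequence.
    classified: maps the same variant names to their species names.
    """
    # Conflict 1: one variant above the threshold is enough, so only the
    # best-matching variant of each species needs to be tracked.
    best_per_species = {}
    for variant, identity in identities.items():
        species = classified[variant]
        if identity > best_per_species.get(species, 0.0):
            best_per_species[species] = identity

    candidates = {s: v for s, v in best_per_species.items() if v > SPECIES_THRESHOLD}
    if not candidates:
        return None  # below the threshold for every known species

    # Conflict 2: several species qualify, so take the one sharing the
    # highest identity with the new sequence.
    return max(candidates, key=candidates.get)


classified = {"v1": "species A", "v2": "species A", "v3": "species B"}
print(assign_species({"v1": 0.80, "v2": 0.70, "v3": 0.60}, classified))
print(assign_species({"v1": 0.80, "v2": 0.70, "v3": 0.85}, classified))
```

The first call illustrates conflict 1 (one variant of species A is above the threshold, another below); the second illustrates conflict 2 (two species qualify and the closer one is chosen).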
Maximum likelihood phylogenetic analyses based on the Rep of 121 genomoviruses revealed several well-supported clades that could be considered as genera within the family (Fig. 2). We note that the clades obtained in the Rep-based phylogeny are not fully consistent with those obtained in the phylogenetic analysis of the full genome or the more diverse CP sequences (Figs 3 and 4). This is most evident in the case of the newly proposed genus Gemykolovirus (see below). In the Rep-based tree, the corresponding sequences form a sister clade to the single representative of the genus Gemyduguivirus (Fig. 2). In contrast, in the whole-genome-based phylogeny, gemykoloviruses form a sister group to members of the genus Gemycircularvirus (Fig. 3). The reason for this incongruence is likely to be intra-familial recombination between different genomovirus genomes resulting in chimeric entities encoding Rep and CP with different evolutionary histories (Kraberger et al. 2015a). Indeed, in the CP-based tree, gemykoloviruses are firmly nested within the large clade including the majority of gemycircularviruses (Fig. 4). Given that CP sequences of genomoviruses are considerably more divergent than the Rep sequences (Fig. 1), it appears reasonable to establish a higher (i.e., above the species level) taxonomic framework using the Rep (Fig. 2). The latter protein is also conserved in other eukaryotic ssDNA viruses (which is not the case for the CP) and can thus be used to assess the place of genomoviruses within the larger community of ssDNA viruses.
To evaluate the taxonomic structure of the Genomoviridae, we took advantage of the fact that in Rep-based phylogenetic analyses, genomoviruses consistently form a sister group to members of the Geminiviridae (Krupovic et al. 2016), a comprehensively characterized family of plant viruses with circular ssDNA genomes (Varsani et al. 2014b). Thus, using the established taxonomic framework of the Geminiviridae overlaid on the Rep-based phylogeny as a guide, we could define five clades and four additional singletons within the Genomoviridae branch (Fig. 2). The defined groups displayed divergence equivalent to that of the established genera within the family Geminiviridae (Varsani et al. 2014b). The nine groups were supported in phylogenies inferred from both nucleotide and protein sequences (Supplementary Fig. S2). Consequently, in addition to the existing genus Gemycircularvirus, we propose establishing eight new genera within the family Genomoviridae. The details of the nine genera are summarized in Fig. 5 and briefly outlined below.
The genus Gemycircularvirus has the largest number of new species (n=43; seventy-three genomes; Table 1) and includes SsHADV-1, the founding member of the family. Members of the genus display 44% diversity. Viruses within the forty-three species cluster with 99 and 96% branch support values in phylogenetic trees constructed from either Rep or full genome sequences, respectively (Figs 2 and 3).
The genus Gemykibivirus is the second most populated genus in the family (n=16; twenty-nine genomes; Table 1), with 43% diversity among its members. The name of the genus is an acronym of the words geminivirus-like and myco-like kibi virus (kibi means circular in Amharic). Sequences within the sixteen species cluster with 93% branch support within phylogenetic trees constructed from Rep (Fig. 2) and form two well-supported clades (100 and 96%) within trees constructed from full genome sequences (Fig. 3), suggesting that recombination has played an important role in the evolution of this group.
Members of the genus Gemygorvirus (n=5; nine genomes; Table 1) display 49% diversity. The name of the genus is an acronym of the words geminivirus-like and myco-like gor virus (gor means round in Hindi). Sequences within the five species cluster with 100 and 99% branch support within phylogenetic trees constructed from either Rep or full genome sequences, respectively (Figs 2 and 3).
Members of the genus Gemykolovirus (n=2; three genomes; Table 1) display 37% diversity. The name of the genus is an acronym of the words geminivirus-like and myco-like kolo virus (kolo means round in Czech). Sequences within the two species cluster with 100 and 89% branch support within phylogenetic trees constructed from either Rep or full genome sequences, respectively (Figs 2 and 3).
Members of the genus Gemykrogvirus (n=3; three genomes; Table 1) display 33% diversity. The name of the genus is an acronym of the words geminivirus-like and myco-like krog virus (krog means round in Slovenian). Sequences within the three species cluster with 99 and 100% branch support within phylogenetic trees constructed from either Rep or full genome sequences, respectively (Figs 2 and 3).
The name of the genus Gemyvongvirus is an acronym of the words geminivirus-like and myco-like vong virus (vong means circular in Lao). The single species within the genus, Human associated gemyvongvirus 1 (Table 1), shares between 56 and 62% genome-wide sequence similarity with viruses in other genera and is a divergent taxon in the phylogenetic trees constructed from either Rep or full genome sequences (Figs 2 and 3).
The name of the genus Gemytondvirus is an acronym of the words geminivirus-like and myco-like tond virus (tond means round in Maltese). The single species within the genus, Ostrich associated gemytondvirus 1 (Table 1), shares between 53 and 61% genome-wide sequence similarity with viruses in other genera and is a divergent taxon in the phylogenetic trees constructed from either Rep or full genome sequences (Figs 2 and 3).
The name of the genus Gemykroznavirus is an acronym of the words geminivirus-like and myco-like krozna virus (krozna means circular in Slovenian). The single species within the genus, Rabbit associated gemykroznavirus 1 (Table 1), shares between 56 and 61% genome-wide sequence similarity with viruses in other genera and is a divergent taxon in the phylogenetic trees constructed from either Rep or full genome sequences (Figs 2 and 3).
The name of the genus Gemyduguivirus is an acronym of the words geminivirus-like and myco-like dugui virus (dugui means circular in Mongolian). The single species within the genus, Dragonfly associated gemyduguivirus 1 (Table 1), shares between 57 and 62% genome-wide sequence similarity with viruses in other genera and is a divergent taxon in the phylogenetic trees constructed from either Rep or full genome sequences (Figs 2 and 3).
CRESS DNA viruses replicate through the rolling-circle replication (RCR) mechanism, which is similar to that used by bacterial plasmids (Khan 1997; Chandler et al. 2013; Ruiz-Maso et al. 2015). RCR is initiated by the Rep, encoded by CRESS DNA viruses, cleaving the dsDNA between positions 7 and 8 of a nonanucleotide sequence located at a putative stem-loop structure at the origin of replication (Heyraud-Nitschke et al. 1995; Laufs et al. 1995b; Timchenko et al. 1999; Rosario, Duffy, and Breitbart, 2012). In the case of genomoviruses, this nonanucleotide is variable (‘TAWWDWRN’) with ‘TAATWYAT’ being the consensus nonanucleotide for gemycircularviruses, whereas gemykibiviruses display the greatest variation in this motif, ‘WATAWWHAN’ (Fig. 6; Supplementary Data S1). In contrast, we note that within the Geminiviridae family, including all recently described geminiviruses (Varsani et al. 2009; Briddon et al. 2010; Krenz et al. 2012; Loconsole et al. 2012; Bernardo et al. 2013; Heydarnejad et al. 2013; Ma et al. 2015; Bernardo et al. 2016), the consensus nonanucleotide motif is ‘TRAKATTRC’.
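The degenerate motifs quoted above use the standard IUPAC nucleotide ambiguity codes (for example, W = A or T, R = A or G, N = any base). As a rough illustration of how a candidate nonanucleotide can be checked against such a consensus, the following Python sketch expands a motif into a regular expression. The function names and the test sequences other than the SsHADV-1 nonanucleotide are hypothetical, and only the nine-letter motifs given in the text are used.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif: str) -> "re.Pattern[str]":
    """Expand a degenerate motif into a regex with one character class per position."""
    return re.compile("".join(f"[{IUPAC[code]}]" for code in motif.upper()))


def matches_motif(sequence: str, motif: str) -> bool:
    """True if the whole sequence is compatible with the degenerate motif."""
    return iupac_to_regex(motif).fullmatch(sequence.upper()) is not None


geminivirus_consensus = "TRAKATTRC"
gemykibivirus_consensus = "WATAWWHAN"

print(matches_motif("TAATATTAC", geminivirus_consensus))    # hypothetical test sequence
print(matches_motif("TAATATTAT", geminivirus_consensus))    # SsHADV-1 nonanucleotide
print(matches_motif("TATATTAAG", gemykibivirus_consensus))  # hypothetical test sequence
```

Using fullmatch rather than search requires the candidate to have exactly the same length as the motif, which is the appropriate behaviour for a fixed-length nonanucleotide.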
The N terminus of the Rep contains motifs that are important for initiating RCR and it is not surprising that some of these motifs are well conserved across many ssDNA viruses, phages, and plasmids that replicate using the RCR mechanism (Ilyina and Koonin, 1992; Vega-Rocha et al. 2007a; Rosario, Duffy, and Breitbart, 2012; Krupovic, 2013). The presence of a single catalytic tyrosine residue in the RCR motif III classifies genomovirus, geminivirus, bacilladnavirus, circovirus and nanovirus Reps as members of superfamily II (Ilyina and Koonin, 1992; Krupovic, 2013).
In genomoviruses, the conserved sequence of the RCR motif I, which is thought to be involved in the recognition of iterative sequences associated with the origin of replication, is predominantly ‘uuTYxQ’ (u denotes hydrophobic residues and x any residue) (Fig. 6; Supplementary Data S1), with the exception of the Reps of currently known gemykoloviruses and gemykrogviruses. The genomovirus RCR motif II, ‘xHxHx’ (Fig. 6; Supplementary Data S1), resembles that found in geminiviruses, and early work has shown that histidines in this motif coordinate divalent metal ions, Mg2+ or Mn2+, which are important cofactors for endonuclease activity at the origin of replication (Koonin and Ilyina 1992; Laufs et al. 1995b). Genomoviruses have an RCR motif III of ‘YxxK’ and based on other Rep studies, this motif is involved in the dsDNA cleavage and subsequent covalent attachment of Rep through the catalytic tyrosine residue to the 5′ end of the cleaved product (Laufs et al. 1995a, b; Orozco and Hanley-Bowdoin, 1998; Timchenko et al. 1999; Steinfeldt, Finsterbusch, and Mankertz, 2006; Rosario, Duffy, and Breitbart, 2012). The conserved lysine residue in the RCR motif III (Fig. 6; Supplementary Data S1) is proposed to mediate binding and positioning during catalysis (Vega-Rocha et al. 2007a, b). A fourth conserved motif, the geminivirus Rep sequence (GRS), is only found in geminiviruses and genomoviruses (Fig. 6). In geminiviruses, it enables appropriate spatial arrangements of RCR motifs II and III (Nash et al. 2011). Site-directed mutagenesis of the GRS domain in tomato golden mosaic virus yielded non-infectious clones, demonstrating that the GRS is essential for geminivirus replication (Nash et al. 2011) and it is likely this is also the case for genomoviruses.
Rep is a multifunctional protein, with both endonuclease and helicase activities. Rep helicase activity is mediated by conserved motifs known as Walker A, Walker B and motif C located in a C-terminal NTP-binding domain (Fig. 6; Supplementary Data S1) (Gorbalenya, Koonin, and Wolf 1990; Koonin, 1993; Choudhury et al. 2006; Clerot and Bernardi 2006). The helicase domain found in Rep proteins of eukaryotic ssDNA viruses belongs to the helicase superfamily 3 (Gorbalenya, Koonin, and Wolf 1990; Koonin 1993). The conserved Walker A motif of genomoviruses is ‘GxxxxGKT’, with the exception of gemytondvirus, which contains a highly derived variant of this motif (GPHRRRRT; Fig. 6). Previous studies have shown that during synthesis of progeny strands, Rep helicase activity unwinds the dsDNA intermediate in the 3′–5′ direction using nucleotide triphosphates as an energy source (Choudhury et al. 2006; Clerot and Bernardi 2006). The Walker A motif forms part of the ‘P-loop’ structure in the NTP-binding domain that facilitates ATP recognition and binding with a conserved lysine residue (Desbiez et al. 1995; Timchenko et al. 1999; Choudhury et al. 2006; Clerot and Bernardi 2006; Rosario, Duffy, and Breitbart 2012; George et al. 2014). The Walker B motif of genomoviruses is predominantly ‘uuDDu’ (Fig. 6; Supplementary Data S1), whereas motif C is ‘uxxN’ (u denotes hydrophobic residues and x any residue; Fig. 6, Supplementary Data S1). The hydrophobic residues in the Walker B motif contribute to ATP binding and are essential for ATP hydrolysis, whereas motif C (Fig. 6; Supplementary Data S1) interacts with the gamma phosphate of ATP and the nucleophilic water molecule via a conserved asparagine residue (Choudhury et al. 2006; George et al. 2014).
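The motif notation used for the Rep (u for a hydrophobic residue, x for any residue, upper-case letters for fixed residues) maps directly onto regular expressions. The Python sketch below shows one way to scan a protein sequence for such motifs. It is an illustration under stated assumptions: the function names are hypothetical, the set of residues treated as hydrophobic is an assumption (the text does not define it), and the peptide fragments other than GPHRRRRT are invented for demonstration.

```python
import re

# Assumption: residues counted as hydrophobic ("u"); the source does not list them.
HYDROPHOBIC = "AVLIMFWYC"


def motif_to_regex(motif: str) -> "re.Pattern[str]":
    """Translate a motif such as 'GxxxxGKT' or 'uuDDu' into a compiled regex."""
    parts = []
    for symbol in motif:
        if symbol == "u":
            parts.append(f"[{HYDROPHOBIC}]")  # any hydrophobic residue
        elif symbol == "x":
            parts.append("[A-Z]")             # any residue
        else:
            parts.append(symbol)              # fixed residue
    # The lookahead reports overlapping occurrences as well.
    return re.compile("(?=" + "".join(parts) + ")")


def find_motif(protein: str, motif: str) -> list:
    """Return the 0-based start positions of every motif occurrence."""
    return [m.start() for m in motif_to_regex(motif).finditer(protein.upper())]


print(find_motif("MAGSGKSGKTL", "GxxxxGKT"))  # invented fragment with a canonical Walker A
print(find_motif("GPHRRRRT", "GxxxxGKT"))     # derived gemytondvirus variant from the text
print(find_motif("KLVIDDIS", "uuDDu"))        # invented fragment with a Walker B-like run
```

A scan of this kind returns no hit for the derived gemytondvirus Walker A variant, which is why such divergent motifs are better located from a multiple sequence alignment than from pattern matching alone.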
Genomoviruses from different genera display distinct signatures within the nonanucleotide as well as conserved nuclease and helicase motifs, which are generally consistent with the proposed taxa (Fig. 6; Supplementary Data S1).
The Reps of genomoviruses are most closely related to those of geminiviruses and hence here we used a geminivirus taxonomy-informed approach to classify 121 genomoviruses into Rep sequence-based genera. Within the Genomoviridae family we establish eight new genera in addition to the one created previously (Krupovic et al. 2016). Detailed analysis of sequence motifs conserved within the genomoviral genomes further supports the validity of the proposed genera. We also define a species demarcation criterion of 78% genome-wide identity; that is, sequences that share >78% pairwise identity with other genomovirus sequences belong to the same species and those that share <78% can be considered as new species. It is worth noting that despite the fact that geminiviruses have been studied for over two decades, the sequence diversity of all known geminiviruses is similar to that of the recently discovered genomoviruses (46 vs 47%, respectively). This observation strongly suggests that the extent of sequence diversity within this expansive virus group remains largely unexplored.
Although the guidelines presented here are tailored for the classification of viral genomes in the family Genomoviridae, a similar sequence-based framework can be easily adapted for other virus clusters identified through metagenomics studies and lacking a pre-existing taxonomic framework, in particular for novel CRESS DNA viruses. We do acknowledge that this approach deviates from a previous norm that used a set of criteria including biological properties such as host range, pathology, vectors, etc. coupled with sequence data. However, given the rate at which genome sequences of uncultivated viruses are being identified from various sources, we need to establish more robust classification approaches that can easily be implemented on the basis of sequence data alone. Indeed, this necessity is acknowledged by the ICTV, which encourages submissions of taxonomic proposals for classification of viruses that are known exclusively from their genome sequences (Simmonds et al. 2017). This new tide in virus taxonomy is expected to catalyze the comprehension of the diversity, ecology and evolution of the global virome.
Supplementary data are available at Virus Evolution online.
This article is based on the taxonomic proposal 2016.001a-agF.U.v5.Genomoviridae which has been considered and approved by the Executive Committee (EC) of the ICTV. AV and MK are elected members of the ICTV EC.
Conflict of interest: None declared.