We report the identification and nearly complete genomes of three novel RNA viruses, together with a nucleotide composition analysis (NCA) used to infer the kingdom or phylum of their cellular hosts. Based on phylogenetic analyses and gene organization, we propose these three new viruses as prototypes of novel families or unassigned genera in the picorna-like virus superfamily (23).
In the past few years, genomes of several highly divergent viruses have been characterized by unbiased metagenomic approaches (18, 24, 36). Most of these viruses are genetically closely related to previously characterized viruses, allowing the phylum of their likely hosts to be inferred (16-18). However, inferring the hosts of genetically more distinct viruses is more problematic, especially if they are found in stool (27). Stools are known to contain viruses that infect the host's own cells, bacteriophages, and viruses of dietary origin from consumed plants, insects, and animals (3, 4, 16, 17, 27, 36, 38).
Systematic differences in the dinucleotide composition of viral genomes, such as the underrepresentation of CpG and UpA dinucleotides and the overrepresentation of CpA in mammalian RNA viruses, along with other dinucleotide biases in other eukaryotic viral genomes, have been documented extensively (2, 6, 9, 13-15, 32, 33). Remarkably, the adaptive basis or mutational biases underlying these observations remain undetermined, although it has been hypothesized that the observed biases reflect evolutionary selection on RNA viruses to mimic the compositional patterns of their hosts rather than a shared mutational bias (13, 30). One suggested mechanism is selection pressure to avoid recognition by a hypothetical interferon-induced Toll-like receptor (TLR) molecule that recognizes and targets CpG dinucleotides in RNA rather than in DNA (as TLR9 does) (30).
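To illustrate how dinucleotide under- or overrepresentation of this kind is commonly quantified, the following minimal Python sketch computes an observed/expected ratio for each of the 16 dinucleotides, taking the expected frequency as the product of the two mononucleotide frequencies. This normalization is an assumption chosen for illustration and is not necessarily the measure used in the cited studies; the function name `dinucleotide_odds_ratios` is hypothetical.

```python
from collections import Counter

BASES = "ACGU"


def dinucleotide_odds_ratios(seq):
    """Return observed/expected ratios for all 16 dinucleotides.

    Expected frequency of XpY is f(X) * f(Y), an illustrative assumption.
    DNA input is accepted by mapping T to U.
    """
    seq = seq.upper().replace("T", "U")

    # Count single bases, ignoring ambiguous symbols such as N.
    mono = Counter(b for b in seq if b in BASES)

    # Count overlapping dinucleotides only where both positions are unambiguous,
    # so that skipping an N never creates an artificial neighbor pair.
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )

    n_mono = sum(mono.values())
    n_di = sum(di.values())

    ratios = {}
    for x in BASES:
        for y in BASES:
            expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
            observed = di[x + y] / n_di if n_di else 0.0
            # The ratio is undefined when a base is absent from the sequence.
            ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios
```

Under this convention, a ratio below 1 for `CG` or `UA` would correspond to the underrepresentation of CpG or UpA described above, and a ratio above 1 for `CA` to the overrepresentation of CpA.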
Plants and animals diverged more than a billion years ago, while vertebrate and arthropod lineages diverged between 573 and 656 million years ago (25, 31). It is reasonable to expect that viruses which specifically infect these groups would be subjected to distinct, host-specific evolutionary pressures (30). Moreover, genomes of RNA viruses and host mRNA molecules coexist in the same cytoplasmic environment and are expected to share some common features due to constraints imposed by host factors. These predictions were exploited here to infer the possible host origins of viruses, since vertebrates, plants, and invertebrates (principally insects) are known to differ substantially in their dinucleotide frequencies (21, 34). Discriminant analysis of mono- and dinucleotide frequencies (Fig. ) differentiated the three possible sources of viruses in the current analysis much better than simple computation of CpG underrepresentation (Fig. ), as it incorporated additional information, such as the occurrence of other dinucleotide biases and the dependence of these biases on G+C content. Using discriminant analysis, NCA correctly identified the phylum or kingdom of the cellular hosts of 96% of these viruses, suggesting that it is useful for identifying the hosts of novel RNA viruses. Using NCA, we predicted that all three novel viruses described here most likely replicated in an insect host.
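The kind of input such a classification operates on can be sketched as follows. This minimal Python example builds a 20-element composition vector (4 mononucleotide plus 16 dinucleotide frequencies) and assigns a query sequence to the host group with the nearest mean vector. The nearest-centroid rule is a deliberately simpler stand-in and is not the discriminant analysis used in this study; the function names and the `training` structure are hypothetical.

```python
import math
from itertools import product

BASES = "ACGU"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]


def composition_features(seq):
    """Return 4 mononucleotide followed by 16 dinucleotide frequencies."""
    seq = seq.upper().replace("T", "U")
    mono = {b: 0 for b in BASES}
    di = {d: 0 for d in DINUCS}
    for i, base in enumerate(seq):
        if base not in BASES:
            continue  # skip ambiguous symbols such as N
        mono[base] += 1
        # Count a dinucleotide only if the next position is also unambiguous.
        if i + 1 < len(seq) and seq[i + 1] in BASES:
            di[base + seq[i + 1]] += 1
    n_mono = sum(mono.values()) or 1  # avoid division by zero on empty input
    n_di = sum(di.values()) or 1
    return [mono[b] / n_mono for b in BASES] + [di[d] / n_di for d in DINUCS]


def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_host(seq, training):
    """Assign seq to the label whose centroid is closest (Euclidean distance).

    training: dict mapping a host label to a list of reference sequences.
    This is a simplified stand-in for discriminant analysis.
    """
    query = composition_features(seq)
    centroids = {
        label: centroid([composition_features(s) for s in seqs])
        for label, seqs in training.items()
    }
    return min(centroids, key=lambda label: math.dist(query, centroids[label]))
```

In practice the reference sets would be genomes of viruses with known vertebrate, plant, or insect hosts. Unlike this sketch, a discriminant analysis also weights features by how well they separate the groups, which is part of what allows it to exploit correlated biases such as those tied to G+C content.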
The already large diversity of picorna-like viruses can be expected to grow as metagenomic studies of different environments, such as seawater (7, 8) and animal samples (18, 19, 28), provide more viral genome sequence data. It was recently proposed to create a viral taxonomic order named Picornavirales (26), consisting of the members of clades 1 and 6 of the picorna-like virus supergroup, as defined by RdRp phylogeny (Fig. ) (23). Since the calhevirus RdRp groups phylogenetically with the members of the proposed Picornavirales order, this virus may belong to this new order, although we have not tested for the other required characteristics, namely, the presence of a 5′ covalently linked VPg, autoproteolytic cleavage of the polyprotein, and an icosahedral viral particle with pseudo-T=3 symmetry (26). The presence of an apparent serine rather than cysteine protease appears to be rare in the Picornavirales, having been reported only for the algal marnavirus, one of eight proposed named or unassigned families in this new order (26). The RdRp proteins of TNV-1 and TNV-2 appear to be more closely related to those of the nodaviruses, whose hosts include both fish and arthropods, including insects. NCA indicated that contamination of this child's food with one or more insects was the likely source of these divergent picorna-like viral genomes in his stool. This conclusion was supported by the detection of dicistrovirus genomes (known to infect only insects) in stool samples from other children (35) (data not shown). Multiple insect viruses were also found in the guano of insectivorous bats (28). If insect viruses remain infectious after passage through the mammalian digestive tract, as do some plant viruses (37), ingestion and excretion by mammals may be another means by which insect viruses are dispersed. Determining whether NCA can be extended to identify the possible origin of picorna-like viral genomes from simpler eukaryotic organisms will require further studies.