Nearly complete genome sequences of three novel RNA viruses were acquired from the stool of an Afghan child. Phylogenetic analysis indicated that these viruses belong to the picorna-like virus superfamily. Because of their unique genomic organization and deep phylogenetic roots, we propose these viruses, provisionally named calhevirus, tetnovirus-1, and tetnovirus-2, as prototypes of new viral families. A newly developed nucleotide composition analysis (NCA) method was used to compare mononucleotide and dinucleotide frequencies for RNA viruses infecting mammals, plants, or insects. Using a large training data set of 284 representative picornavirus-like genomic sequences with defined host origins, NCA correctly identified the kingdom or phylum of the viral host for >95% of picorna-like viruses. NCA predicted an insect host origin for the 3 novel picorna-like viruses. Their presence in human stool therefore likely reflects ingestion of insect-contaminated food. As metagenomic analyses of different environments and organisms continue to yield highly divergent viral genomes, NCA provides a rapid and robust method to identify their likely cellular hosts.
Recent advances in sequencing technologies have led to the identification of highly divergent viruses from bacteria, archaea, and eukaryotes as well as from environmental samples (1, 3, 5, 18, 19). The initial step to identify new viruses typically consists of sequence similarity searches followed by phylogenetic analyses (18). Subsequent studies often aim to confirm the host species and determine the pathogenic potential of new viruses. Human stools contain a diverse array of viruses, including those infecting bacteria in the gut, viruses infecting human gut cells, and viruses in recently eaten animal tissues and plants (16, 27, 38). The source of novel viruses found in stool can therefore be difficult to determine if their genomes are highly divergent from those of viruses with known hosts (16, 27). A novel approach based on viral nucleotide composition was developed to help determine these viruses' likely cellular hosts.
The picorna-like virus superfamily contains viruses infecting all the major branches of eukaryotic life and is characterized by a partially conserved set of genes that consists of genes encoding the RNA-dependent RNA polymerase (RdRp), a chymotrypsin-like protease (3C), a superfamily 3 helicase (S3H), and a genome-linked protein (VPg) (23). The RdRp open reading frame (ORF) contains conserved protein motifs that can be aligned readily: phylogenetic analysis of representative sequences of currently described picorna-like viruses in the region revealed the existence of six evolutionarily supported clades (23).
We describe here the genomes of three highly divergent RNA viruses found in a human stool sample. These viruses belong to different clades of the picorna-like virus superfamily, and a novel nucleotide composition analysis indicated that they most likely originated from insects contaminating human food.
Stool samples were collected from a poliovirus-negative child suffering from acute flaccid paralysis (AFP) in Afghanistan. Stool suspensions made in Hanks' buffered salt solution (HBSS) (1:10) were passed through a 0.2-μm filter and centrifuged at 35,000 × g for 3 h at 10°C. Pellets were mixed with a mixture of nucleases to enrich for particle-protected nucleic acids (18, 19). Sequence-independent amplification and 454 pyrosequencing were then performed as previously described (18, 36). Sequence data were analyzed as described previously (36). This stool sample was also previously shown to contain Cosavirus, a new genus of the family Picornaviridae (18, 36).
Sequences showing significant tBLASTx hits to picornaviruses (E values of <0.001) were linked to other sequences with similar characteristics detected in the same human stool sample by reverse transcription-PCR (RT-PCR). 3′ Rapid amplification of cDNA ends (3′ RACE) was used to acquire the 3′ end of the calhevirus (CHV-1) genome. Ten microliters of extracted RNA was mixed with 10 pmol of primer DT-01 (ATTCTAGAGGCCGAGGCGGCCGACATGT30VN), denatured at 75°C for 5 min, and chilled on ice. A reaction mix of 9 μl containing 4 μl of 5× first-strand buffer (250 mM Tris-HCl [pH 8.3], 375 mM KCl, 15 mM MgCl2) (Invitrogen), 2 μl of 100 mM dithiothreitol (DTT), a 1-μl solution containing each deoxynucleoside triphosphate (dNTP) at 10 mM, 8 units (0.2 μl) of recombinant RNase inhibitor (Promega), and 200 units of SuperScript III reverse transcriptase (Invitrogen) was then added and incubated at 52°C for 30 min, followed by 75°C for 10 min. Two units of RNase H (NEB) was added, and the reaction mixture was further incubated for 10 min at 37°C. PCR was performed using a calhevirus-specific primer, CHV-3end-F1 (CCTGCACAGGCCCTTTCA), and DT-02 (ATTCTAGAGGCCGAGGCGGCC). PCR consisted of an activation step of 5 min at 95°C followed by 35 cycles of amplification at 95°C for 1 min, 60°C for 30 s, and 72°C for 2 min. To acquire the 5′ end of the calhevirus genome, 10 μl of extracted RNA was mixed with 10 pmol of virus-specific primer CHV-5end-R-1 (AGGCTCACACCGTTCAGCAC), denatured at 75°C for 5 min, and chilled on ice. An RT reaction mix similar to that used for 3′ RACE was added, and the reaction mixture was incubated at 52°C for 30 min, followed by 75°C for 10 min. Two units of RNase H was then added, and the reaction mixture was further incubated for 10 min at 37°C. cDNA was purified using a Qiagen PCR purification kit, and a poly(C) tail was added using terminal deoxynucleotide transferase (NEB) and dCTP. 
PCR was performed using the virus-specific primers CHV-5end-R-2 (AGTCTCAATCGCTGCGGTCA) and PPC01 (GGCCACGCGTCGACTAGTACGGGIIGGGIGGGGIGG, where I is deoxyinosine). PCR cycles consisted of an enzyme activation step for 5 min at 95°C followed by 35 cycles of amplification at 95°C for 1 min, 60°C for 30 s, and 72°C for 1 min. PCR products were directly sequenced or were subcloned into pGEM-T Easy vector (Promega) and then sequenced. For tetnovirus-1 (TNV-1) and tetnovirus-2 (TNV-2), sequences derived by metagenomics were also linked by PCR. 5′ and 3′ RACE failed to generate sequences for their extremities.
The alignment generated by Koonin et al. (23) in analyzing the phylogeny of the picorna-like virus supergroup (see Table S1 of reference 23) was used to identify sequence relationships of the novel viral sequences. The translated sequence matched conserved motifs within RdRp and was added to the existing alignment by use of CLUSTALW, followed by some minimal manual sequence editing to optimize alignment of likely homologous amino acid residues. The alignment used is provided in Table S1 in the supplemental material. Phylogenetic trees were constructed by minimum evolution, using amino acid P distances with Poisson correction for multiple substitutions. Bootstrap resampling and the interior branch test of phylogeny were used to infer the robustness of phylogenetic groupings.
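The distance measure named above can be illustrated with a minimal Python sketch. It computes the proportion of differing residues (the p distance) between two aligned amino acid sequences and applies the Poisson correction, d = -ln(1 - p). The function names and the choice to skip gapped columns are assumptions made for illustration; the settings of the tree-building software used in the study are not given in the text.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption: columns with a gap in either sequence are ignored.
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differing / compared


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance, d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        return math.inf  # every compared site differs; distance is undefined
    return -math.log(1.0 - p)
```

The correction matters most for deep branches such as those in the picorna-like superfamily, where repeated substitutions at the same site make the raw p distance underestimate divergence.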
The set of 284 complete viral RNA genome sequences or segment sequences longer than 3,000 bases was selected to be representative of different species, genera, and families of positive-stranded RNA viruses classified in the picorna-like virus supergroup 1 and were downloaded from the GenBank taxonomy browser on 6 September 2009. Each was annotated by order, family, and genus, along with host range. A further set of 35 complete genome sequences with an exclusively insect host range, differing by >10% from reference sequences, was incorporated into the data set. A list of the accession numbers for this control data set is provided in the supplemental material. Viruses capable of replicating in both insects and mammalian hosts (i.e., arboviruses) were excluded from the analysis. Mononucleotide and dinucleotide frequencies for each sequence were determined using the program Composition Scan in the Simmonic sequence editor, version 1.7. Dinucleotide bias was determined as the ratio between the observed frequency of each of the 16 dinucleotides and the expected frequency determined by multiplying the frequencies of each of the two constituent mononucleotides, as previously described (21). Discriminant analysis was performed using the statistical package SYSTAT with default parameters. Sequences in the order Picornavirales were assigned to three host categories, namely, mammal, insect, and plant, and frequencies of each mononucleotide and dinucleotide were used as predictive factors to infer host ranges of unknown virus sequences from the current study.
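The composition variables described above can be sketched in a few lines of Python. This is not the Composition Scan program itself; the function name, the treatment of T as U, and the decision to ignore ambiguous bases are assumptions made for illustration. The dinucleotide bias follows the definition in the text: observed dinucleotide frequency divided by the product of the two constituent mononucleotide frequencies.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"


def composition(seq):
    """Return (mononucleotide frequencies, dinucleotide observed/expected ratios)."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input (assumption)
    mono_counts = Counter(b for b in seq if b in BASES)
    total = sum(mono_counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    mono = {b: mono_counts[b] / total for b in BASES}

    # Count overlapping dinucleotides made of unambiguous bases only.
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di_counts = Counter(p for p in pairs if p[0] in BASES and p[1] in BASES)
    di_total = sum(di_counts.values())
    if di_total == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")

    ratios = {}
    for x, y in product(BASES, repeat=2):
        observed = di_counts[x + y] / di_total
        expected = mono[x] * mono[y]
        ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return mono, ratios
```

A ratio below 1 for `CG` corresponds to CpG underrepresentation; the 4 mononucleotide frequencies and 16 ratios together form the 20 predictive variables passed to the discriminant analysis.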
The genome sequences of CHV-1, TNV-1, and TNV-2 have been submitted to GenBank with accession numbers HM480374, HM480375, and HM480376, respectively.
Viral particles were purified by filtration from a stool sample from a 14-month-old Afghan child suffering from nonpolio acute flaccid paralysis. Following nuclease treatment of the filtrate to remove non-particle-protected nucleic acids, the viral nucleic acids protected within their viral capsids were extracted. cDNA synthesis was then performed, and PCR was performed using primers with randomized 3′ ends. The resulting DNA was subcloned into a plasmid vector. One of 48 plasmid inserts showed protein similarity (BLASTx E score of 2e−5) to the RdRp core region of picornaviruses. To obtain the rest of this viral genome, the same virally enriched extracted nucleic acid from the child's feces was subjected to 454 pyrosequencing, resulting in ~23,000 sequence reads. Sequence reads were aligned using a criterion of >30-bp overlap with >90% nucleotide sequence identity, resulting in a total of 1,922 contig and singlet sequences. One contig showed significant protein similarity (psiBLAST score of 2e−11) to picornavirus RdRp, while other sequence contigs showed significant protein similarity to a picornavirus RNA helicase. These fragments were then joined by RT-PCR, using specific primers, to acquire a 5.3-kb viral sequence. 5′ and 3′ RACE procedures were used to sequence the extremities of this virus (see Materials and Methods). The genome sequence was confirmed by sequencing 4 overlapping RT-PCR amplicons generated directly from stool nucleic acids. For the same patient, pyrosequencing also yielded two other large contigs, of 5,063 and 4,164 nucleotides, and these are described below.
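The read-joining criterion quoted above (>30-bp overlap with >90% nucleotide identity) can be expressed as a simple check. The sketch below is a deliberately simplified, ungapped suffix-to-prefix comparison; the assembly software used in the study is not named in the text, and real assemblers use gapped alignment, so the function names and the approach are illustrative assumptions only.

```python
def overlap_identity(left, right, length):
    """Identity between the last `length` bases of left and the first `length` of right."""
    a = left[-length:]
    b = right[:length]
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / length


def can_join(left, right, min_overlap=31, min_identity=0.90):
    """True if a suffix/prefix overlap longer than 30 bp exceeds 90% identity.

    min_overlap=31 encodes ">30 bp"; the identity comparison is strict to
    encode ">90%". Only ungapped overlaps are considered (simplification).
    """
    longest = min(len(left), len(right))
    for length in range(longest, min_overlap - 1, -1):
        if overlap_identity(left, right, length) > min_identity:
            return True
    return False
```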
The genomic organization, presence of overlapping ORFs, and stop codons of CHV-1 were confirmed by direct RT-PCR reamplification and sequencing (Fig. 1). The size of the new virus genome was 8,241 nucleotides (nt), excluding the poly(A) tail. Similar to other picorna-like viruses, the viral genome sequence was A/U rich (A = 25.6%, U = 27.4%, G = 24.2%, and C = 22.8%). The genome contained three large ORFs, encoding nonstructural (NS) proteins, structural proteins, and a highly basic protein of unknown function.
Two methionine codons which could initiate ORF1 were found at CHV nucleotide positions 181 and 226, with the latter being in optimal Kozak context (RNNAUGG; ACGATGG in the CHV genome). This suggested the presence of a 225-nt 5′-untranslated region (5′UTR) with a long, thermodynamically stable stem-loop. The 5′UTR was shorter than those found in picornaviruses but longer than those of human or animal caliciviruses (65 to 182 nt). 3′ RACE predicted a 234-nt untranslated region. A typical polyadenylation signal (AAUAAA) was not identified in the 3′UTR.
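The Kozak criterion used above (RNNAUGG, i.e., a purine at position -3 and a G at position +4 relative to the A of the start codon) can be checked with a short pattern match. The sketch below tests only the 7-nt window quoted in the text for the codon at position 226; the flanking sequence of the codon at position 181 is not given here, so it is not evaluated. The function name is hypothetical.

```python
import re

# R = purine (A or G), N = any base, followed by the start codon and G at +4.
KOZAK = re.compile(r"[AG][ACGU]{2}AUGG")


def in_optimal_kozak_context(window):
    """True if a 7-nt window (positions -3 to +4) matches RNNAUGG."""
    rna = window.upper().replace("T", "U")  # accept DNA-style input
    return KOZAK.fullmatch(rna) is not None
```

For example, `in_optimal_kozak_context("ACGATGG")` returns `True` for the window reported in the CHV genome, whereas a pyrimidine at position -3 or a base other than G at position +4 returns `False`.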
ORF1, at nucleotide positions 226 to 5376, is predicted to encode a 1,717-amino-acid (aa) polyprotein of 195.5 kDa (Fig. 1). A conserved domain database search predicted the presence of helicase, trypsin-like serine protease, and RdRp domains in the ORF1 protein. Protein domains were designated in accordance with a protein similarity search against the Pfam database, with a minimum E score of 0.001 (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains). These domains are organized in a similar order to those of viruses of lineage 1 of the picorna-like virus superfamily described by Koonin et al. (23). The three motifs typical of picorna-like virus helicases and all eight conserved RdRp motifs could be identified (22). For example, the helicase domains of picornaviruses fall into helicase superfamily III and contain a motif A [Gx4GK(S/T)] followed approximately 35 aa downstream by a motif B (WWWxxDD, where W is any hydrophobic residue) and approximately 30 aa downstream by a motif C [KgxxWxSxWWWx(S/T)(S/T)N] (22). In CHV, these motifs are present as GGPRMGKT-53 aa-CLIFYDD-47 aa-KGTYINPAFVVATSN (see Fig. S1 in the supplemental material).
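The helicase motif consensus patterns quoted above translate directly into regular expressions, as in the following sketch. The set of residues treated as hydrophobic is an assumption (it includes cysteine so that CLIFYDD satisfies motif B), and the function name is hypothetical. Motif C is left out because the CHV sequence quoted in the text carries a proline where the consensus has a serine, so a strict pattern would not match it.

```python
import re

# Assumption: residues counted as hydrophobic for the "W" positions.
HYDROPHOBIC = "AVLIMFWYC"

MOTIF_A = re.compile(r"G.{4}GK[ST]")                     # Gx4GK(S/T)
MOTIF_B = re.compile(rf"[{HYDROPHOBIC}]{{3}}.{{2}}DD")   # WWWxxDD


def find_motif(pattern, protein):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]
```

Applied to a protein sequence, the difference between the start of the motif B match and the end of the motif A match gives the spacer length, which is 53 aa for the CHV arrangement described in the text.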
A small region between the helicase and RdRp regions of the CHV genome appeared to encode a serine protease, as predicted based on the presence of a conserved motif with a serine amino acid in the catalytic site [MEPGD(S)GSLVI] (11, 12). Notably, CHV is the only virus of picorna-like virus clade 1 (23) that encodes a serine protease. The only other groups of RNA viruses known to encode serine proteases are astroviruses (clade 3) and sobemoviruses (clade 2) (23) (Fig. 2).
ORF3 starts at nt 5379 and ends at nt 6164, and it encodes a 261-aa protein of unknown function. The ORF encodes a highly basic protein containing 47 positively charged (arginine and lysine) and 23 negatively charged (aspartic acid and glutamic acid) residues, with an estimated isoelectric point of 10.2. Two other virus families, Caliciviridae and Hepeviridae (containing mammalian and avian hepatitis E viruses), also carry a basic protein recently shown to play a role in virion morphogenesis and pathogenesis (35, 37). In the case of caliciviruses, the ORF for the small basic protein is generally carried at the 3′-most region of the genome. The corresponding gene in Hepeviridae is located between ORF1 and ORF2 as in CHV, but this virus family is taxonomically distinct from other members of the picorna-like virus superfamily.
ORF2 overlaps ORF3, starting at nt 5993 and ending at nt 7951, and encodes a 653-aa protein. A protein similarity search against the NCBI conserved domain database identified a picornavirus capsid protein domain in the N terminus of the ORF2 protein (E value, 8e−9). The remaining portion of the capsid protein showed no significant identity (psi-BLAST E score of <0.001) to any viral protein.
The RdRp region of calhevirus (Fig. 2) (amino acid positions 278 to 717) was aligned with those of representative members of the highly diverse picorna-like virus order (23). The inclusion of additional sequences and the use of a different tree construction method created almost identical phylogenies to those described previously, with only minor and unsupported changes in topology (by bootstrapping and interior branch testing) in two of the deeper branches in the tree (Fig. 2). Of the seven monophyletic groups resolved, six corresponded to the six designated clades in the original analysis (Fig. 2) (23). The seventh clade contained sequences of caliciviruses that originally formed a deep branch with clade 4 (23).
The CHV-1 sequence grouped within the previously designated clade 1 viruses, a group containing viruses with host ranges restricted to plants, chromoalveolates, and arthropods (specifically insects in this group). However, CHV-1 showed no close relationship to any existing virus family within clade 1, an observation consistent with the lack of close resemblance of its genome architecture to that of other known RNA viruses. There was therefore no evidence of a close relationship with viruses infecting humans or other mammals among the members of the picorna-like virus supergroup, namely, picornaviruses, caliciviruses, and astroviruses (red symbols in Fig. 2).
Partial genomes of TNV-1 (5,063 nt) and TNV-2 (4,154 nt) were initially acquired using 454 pyrosequencing, followed by confirmation of the genomic organization by sequencing of directly acquired overlapping RT-PCR products. Multiple attempts to acquire the 5′- and 3′-terminal sequences failed. It may be relevant that the loosely related insect and fish nodaviruses are known to have very short 5′UTRs and to lack the 3′ poly(A) tail and free hydroxyl group required for 3′ RACE (10). The TNV-1 and TNV-2 genome fragments both contained two large ORFs encoding nonstructural and structural proteins (Fig. 1). The nonstructural protein of TNV-1 contains RdRp and cysteine-like protease domains, while TNV-2 NS proteins appear to include only an RdRp domain (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains). The RdRp domains of both TNV-1 and -2 are phylogenetically distantly related to the RdRp domains of nodaviruses, whose genomes include two RNA fragments separately encoding nonstructural and capsid proteins (Fig. 2). In contrast, TNV-1 and TNV-2 nonstructural and capsid proteins are encoded by the same RNA segment. The capsid ORF of TNV-1 overlaps its NS protein ORF by 526 nt, while the capsid ORF of TNV-2 overlaps its NS protein ORF by 1,171 nt. The nonstructural protein of TNV-1 displays <25% identity to that of TNV-2. The postulated capsid proteins of TNV-1 and TNV-2 both contain a rhinovirus capsid domain, suggesting a jelly-roll conformation for the capsid proteins. These two highly divergent RNA viruses were named tetnoviruses (tetranodaviruses), as they showed the closest phylogenetic relatedness to nodaviruses within the RdRp protein (Fig. 2).
TNV-1 and TNV-2 grouped together within clade 2, a group containing the fish nodaviruses greasy grouper nervous necrosis virus (GGNNV) and striped jack nervous necrosis virus (SJNNV), the insect nodaviruses nodamura virus (NoV) and flock house virus (FHV), and SmVA, from the protist Sclerophthora macrospora. However, nodaviruses have bipartite RNA genomes, and their nonstructural and structural proteins are encoded by different segments of RNA (Fig. 1). The genomic organization of TNV-1 and TNV-2 therefore more closely resembles that of viruses classified in the family Tetraviridae (Fig. 1).
The existence of systematic differences in the abundances of certain dinucleotides in viral genomes has been documented extensively (2, 6, 9, 13-15, 30, 32, 33). Although with an uncertain mechanistic basis, underrepresentation of CpG and UpA and overrepresentation of CpA have also been recorded for single-stranded RNA viruses infecting mammals, as well as for some groups of small DNA viruses (13, 14, 20, 32, 33). Viruses infecting different hosts may therefore show different dinucleotide patterns, providing information from which their likely host origins may be inferred.
Because CHV-1, TNV-1, and TNV-2 were identified as RNA viruses with homology in the RdRp region to members of the picorna-like virus supergroup (Fig. 2), composition comparisons were made with representative RNA virus sequences corresponding to virus families represented within this group (23). This comprised 62 arthropod, 174 plant, and 83 vertebrate viral sequences (Table 1). Despite being interspersed phylogenetically, viruses in these three host ranges showed consistently different patterns of dinucleotide bias (Fig. 3A and B). Mammalian virus genomes showed the greatest degree of CpG underrepresentation, in proportion to their G+C contents. Insect viruses showed the least underrepresentation, with observed-to-expected ratios predominantly in the range of 60% to 110%.
Because of the large number of measured outcomes (4 mononucleotide and 16 dinucleotide frequencies) that may potentially vary between virus groups, discriminant analysis was used to evaluate the contribution of each to enable classification into the three main host groups of currently classified viruses (Table 1; Fig. 3B). Dinucleotide bias was determined as the ratio between the observed frequency of each of the 16 dinucleotides and the expected frequency determined by multiplying the frequencies of each of the two constituent mononucleotides, as previously described (21). Discriminant analysis comprises two steps. First, for training purposes, linear or quadratic functions of composition variables that maximally differentiate categories are established, using mononucleotide and dinucleotide composition data from the control set. These functions are then applied to sequences of unknown category (in this instance, novel viral genomes). Results are shown as canonical score plots, wherein the values for the two most significant contributory factors determined for classification are plotted for both the control and test sequences (separately labeled) (Fig. 3A), with 95% confidence ellipses centered on the centroid of each group. A formal categorization of the three host origin categories from the complete analysis is reported in Table 1; these data include further parameters that contribute to differentiation of categories not represented graphically. Ninety-six percent of the control sequences were identified correctly (Table 1), and CpG frequency was the most influential factor. Using this model, CHV-1, TNV-1, and TNV-2 were assigned to the insect host group (Fig. 3B).
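The two-step procedure described above (fit discriminant functions on a labeled control set, then apply them to unknown sequences) can be sketched with a small linear discriminant classifier. This is not the SYSTAT implementation used in the study. It is a minimal pure-Python version that assumes equal priors, a pooled covariance matrix, and only two composition variables rather than the 20 used in the analysis. All function names and all training values below are invented for illustration and are not data from the study.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def fit_lda(training):
    """Step 1: estimate class means and the inverse pooled covariance (2 features)."""
    means = {label: mean_vector(rows) for label, rows in training.items()}
    dof = sum(len(rows) for rows in training.values()) - len(training)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for label, rows in training.items():
        m = means[label]
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j] / dof
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    return means, inv


def classify(x, means, inv):
    """Step 2: assign x to the class with the highest linear discriminant score."""
    def score(m):
        w = [inv[0][0] * m[0] + inv[0][1] * m[1],
             inv[1][0] * m[0] + inv[1][1] * m[1]]
        return x[0] * w[0] + x[1] * w[1] - 0.5 * (m[0] * w[0] + m[1] * w[1])
    return max(means, key=lambda label: score(means[label]))


# Invented toy values: [CpG observed/expected, UpA observed/expected].
TOY_TRAINING = {
    "mammal": [[0.45, 0.70], [0.50, 0.75], [0.40, 0.72], [0.55, 0.68]],
    "insect": [[0.90, 0.85], [0.85, 0.90], [0.95, 0.88], [1.00, 0.80]],
    "plant": [[0.70, 1.00], [0.75, 1.05], [0.65, 0.98], [0.72, 1.10]],
}
```

With equal priors, the highest score corresponds to the smallest Mahalanobis distance to a class centroid, which is the same quantity that underlies the confidence ellipses drawn around each group in a canonical score plot.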
We report the identification and nearly complete genomes of three novel RNA viruses and a nucleotide composition analysis to infer the kingdom or phylum of their cellular hosts. Based on phylogenetic analyses and gene organization, we propose these three new viruses as prototypes of novel families or unassigned genera in the picorna-like virus superfamily (23).
In the past few years, genomes of several highly divergent viruses have been characterized by unbiased metagenomic approaches (18, 24, 36). Most of these viruses are genetically very closely related to previously characterized viruses, allowing the phylum of their likely hosts to be inferred (16-18). However, inferring the hosts of genetically more distinct viruses is more problematic, especially if they are found in stool (27). Stools are known to contain viruses that infect host cells and/or bacteriophages, as well as viruses of dietary origins from consumed plants, insects, and animals (3, 4, 16, 17, 27, 36, 38).
Systematic differences in dinucleotide composition of viral genomes, such as the underrepresentation of CpG and UpA dinucleotides and overrepresentation of CpA in mammalian RNA viruses and other dinucleotide biases in other eukaryotic viral genomes, have been documented extensively (2, 6, 9, 13-15, 32, 33). Remarkably, the adaptive basis or mutational biases underlying this observation currently remain undetermined, although it has been hypothesized that the observed biases reflect evolutionary selection on RNA viruses to mimic compositional patterns of their hosts rather than a shared mutational bias (13, 30). One suggested mechanism is selection pressure to avoid recognition by an interferon-induced hypothetical Toll-like receptor (TLR) molecule capable of recognizing and targeting CpG dinucleotides in RNA rather than DNA (as in TLR9) (30).
Plants and animals diversified more than a billion years ago, while vertebrate and arthropod lineages diverged between 573 and 656 million years ago (25, 31). It is reasonable to expect that viruses which specifically infect these groups would be subjected to distinct, host-specific evolutionary pressures (30). Moreover, genomes of RNA viruses and host mRNA molecules coexist in the same cytoplasmic cellular environment and are expected to share some common features due to constraints induced by host factors. These predictions were exploited here to infer possible origins of viruses in hosts with different biases in dinucleotide frequencies, since vertebrates, plants, and invertebrates (principally insects) are known to differ substantially in their dinucleotide frequencies (21, 34). Discriminant analysis of mono- and dinucleotide frequencies (Fig. 3B) provided a much better differentiation of the three possible sources of viruses in the current analysis than simple computation of CpG underrepresentation (Fig. 3A), as it incorporated additional information, such as the occurrence of other dinucleotide biases and the G+C content dependences of these biases. Using discriminant analysis, NCA correctly identified the phylum or kingdom of the cellular hosts of 96% of these viruses, suggesting it to be useful for identifying the hosts of novel RNA viruses. We predicted using NCA that all three novel viruses described here most likely replicated in an insect host.
The already large degree of diversity in picorna-like viruses can be expected to grow as metagenomic studies of different environments, such as seawater (7, 8) and animal samples (18, 19, 28), provide more viral genome sequence data. A recent proposal was made to create a viral taxonomy order named Picornavirales (26), consisting of the members of clades 1 and 6 of the picorna-like virus supergroup, as defined by RdRp phylogeny (Fig. 2) (23). Since calhevirus RdRp phylogenetically groups with the members of the proposed Picornavirales order, this virus may belong to this new order, although we have not tested for other required characteristics, namely, the presence of a 5′ covalently linked VPg, autoproteolytic cleavage of the polyprotein, or an icosahedral viral particle with pseudo-T3 symmetry (26). The presence of an apparent serine rather than cysteine protease appears rare in the Picornavirales, having been reported only for the algal marnavirus, one of eight proposed named or unassigned families in this new order (26). The RdRp proteins of TNV-1 and TNV-2 appear to be more closely related to those of the nodaviruses, whose hosts include both fish and arthropods, including insects. NCA indicated that contamination of this child's food with an insect(s) was the likely source of these divergent picorna-like viral genomes in his stool. This conclusion was supported by the detection of dicistrovirus genomes (only known to infect insects) in stool samples from other children (35) (data not shown). Multiple insect viruses were also found in the guano of insectivorous bats (28). If insect viruses remain infectious after passage through the mammalian digestive tract, as do some plant viruses (37), ingestion and excretion by mammals may be another means by which insect viruses are dispersed. A determination of whether NCA can be expanded to identify the possible origin of picorna-like viral genomes from simpler eukaryotic organisms will require further studies.
The work was supported by NIH awards HL083254 (E.D.), AI090196 (A.K.), and AI57158 (Northeast Biodefense Center to A.K. and W.I.L.) and by a Department of Defense award (LSI-03-514 to A.K. and W.I.L.).
Published ahead of print on 28 July 2010.
†Supplemental material for this article may be found at http://jvi.asm.org/.