An initial survey of 28 YNP hot springs was performed to determine which hot springs contained a detectable RNA signal in the isolated RNA viral fraction (see Table S1 in the supplemental material). The surveyed sites were selected based on being high-temperature (>80°C) and low-pH (<4.0) springs likely to be dominated by Archaea. This initial screen of viral fractions was based on an RNA-dependent reverse transcriptase (RT) assay where [32P]dCTP incorporation into first-strand cDNA was determined in RNase-treated controls and untreated enriched viral samples, both of which had previously been DNase treated. While 24 sites showed little difference between the amounts of [32P]dCTP incorporated by RNase-treated and untreated samples, 3 hot springs showed reproducible and statistically significant differences (). Resampling of these hot springs 12 months later resulted in nearly identical detection of the RNA signal in the RNA viral fraction, suggesting that RNA viral communities were being maintained in these hot springs. The three sites (NL10, NL17, and NL18) with the highest RT-dependent signals were selected for generating metagenomic libraries.
A stringent approach was taken to assemble the RNA viral and DNA viral metagenomes. Removal of highly duplicated sequence reads at 98% similarity with CD-HIT-454 resulted in a 41.6% reduction in reads across the viral metagenomes (). These duplicated sequences are probably indicators of both the amplification bias and the high sequencing depth of the relatively low-complexity viral communities found in these hot springs. Removal of these highly represented sequencing reads significantly improved the overall assembly, resulting in much higher confidence contigs.
| Table 1Cellular, DNA viral, and RNA viral assembly statistics |
The initial analysis of the viral RNA metagenomes indicated that they contained a high proportion of sequences found in the paired viral DNA metagenomes. This finding suggests that the initial treatment of the isolated viral nucleic acid with DNase was insufficient to remove 100% of the viral DNA sequences, probably due to its vast excess compared to the amount of RNA in the viral fraction. To address this bias toward DNA viral sequences, RNA contigs were compared against the DNA viral metagenomes using BLASTN. Matches of RNA contigs (E-value of 1 × 10−30) against reads from the DNA metagenomes were excluded from further analysis. More than 72% of the contigs in the initial RNA metagenomic data sets were removed as a result of matching sequences in the viral DNA data sets (7,060 of the initial 9,696 contigs). This screening procedure, while removing the majority of contigs, provided confidence that the remaining contigs assembled from the RNA viral metagenomes were derived from RNA viruses present in that hot spring and not derived from viral DNA sequences.
The assembly of sequences from the Nymph Lake hot spring sites resulted in a large number of contigs under stringent assembly conditions (98% identity within at least 50 bp of overlap) (). The 14 viral samples (7 enriched RNA viral samples and 7 enriched DNA viral samples) generated ~2.8 million reads and 807 Mb of sequence. After deduplication, there were 877,000 reads and 590 Mb of unique sequence. The assembled viral DNA metagenomes generated 27,746 total contigs, with an average large-contig size (>1,000 bp) of 1,898 bp. The assembled viral RNA metagenomes generated 9,696 total contigs, with an average large-contig length of 1,361 bp. Nearly 73% of the sequence reads in the DNA viral metagenomes were either fully or partially assembled into contigs, whereas 36.8% of sequence reads in the RNA viral metagenomes assembled into contigs. The assembled viral DNA metagenomic contigs had an average length of 574 bp with an average read depth of 38-fold. The final assembled viral RNA metagenomes had an average contig length of 410 bp with an average read depth of 34-fold base coverage.
Cellular DNA metagenomes, comprising 632,000 sequencing reads which represented 229 Mb of sequence, were assembled under the same high-stringency parameters as the viral data sets (). The NL0808 and NL0908 cellular assemblies generated 22,649 contigs with an average contig length of 716 bp and an average read depth of 36-fold base coverage. Sixty-five percent of the reads assembled into contigs.
The analysis of the assembled cellular metagenomes revealed that 81.0% of the contigs had a significant match against the server's protein databases (A). The majority of the sequences (57%) matched to Archaea, primarily to the Crenarchaeota (86.9%) and to a lesser degree to the Euryarchaeota (11.6%). Matches to Nanoarchaeota, Korarchaeota, and Thaumarchaeota were also detected. A smaller portion of the cellular contigs matched to Bacteria (20.6%), mostly to Delta- and Gammaproteobacteria and Firmicutes (clostridia and bacilli). However, overall these matches to the Bacteria were weak compared to the matches to the Archaea. This could reflect shared genes between Archaea and Bacteria and/or a statistical bias toward bacterial sequences due to their overrepresentation compared to the representation of archaeal sequences in the public databases. In support of this possibility, further analysis of the cellular metagenomes using the ribosomal databases available on the MG-RAST server (Ribosomal Database Project, SILVA rRNA Database Project, and Greengenes) revealed 4 matches against the Crenarchaeota (Thermoprotei), one match to the Nanoarchaea, and a single distal match to Sulfurihydrogenibium yellowstonense, a bacterium of the phylum Aquificae originally isolated from a YNP hot spring. Overall, these results confirm that the YNP hot springs analyzed here are dominated by Archaea.
Examination of the putative viral RNA contigs showed that the majority of sequences (56%) did not align to known sequences (C). A small fraction of the sequences matched known viruses (3%). The remaining sequences matched sequences from Archaea (20%), Bacteria (10%), and Eukaryota (11%).
Analysis of the RNA viral contigs using NCBI's protein and nucleotide nonredundant databases and the CDD, as well as Uniprot/Swissprot (
31), detected several sequences with similarity to RNA-dependent RNA polymerases (RdRps), a positive-strand single-stranded RNA virus “hallmark gene” (
39). The largest contig in the RNA viral data set (5.6 kb) and smaller contigs with similarity to RdRps were confirmed to be present and persistent in the hot springs over time. Resampling of the NL10 and NL18 RNA viral fractions and extraction of the total RNA from the cellular fractions 18 months after the last sampling for the metagenomic analysis showed the continued presence of the RNA genome segments. An RT-PCR assay with primers designed from these RNA contigs was performed on samples taken from the same hot springs over an 18-month period (). These assays demonstrated that sequences were single stranded because amplifiable product was generated only when cDNA synthesis used a specific, directional primer. The longest RT-dependent contig, contig00002, is 5,662 nt in length, composed of 258 reads, and contains a single large open reading frame (ORF) that encodes a putative viral polyprotein encompassing an RdRp and a putative capsid protein as detailed below. Omission of the RT or template or pretreatment of samples with RNase prior to the RT-PCR assay eliminated the signal, indicating that the source of the signal was an RNA template in the viral fraction (). This analysis supported the original contig assemblies and demonstrated that these putative RNA viruses are stable members of the viral community within the hot springs explored.
Further search for RdRps, as well as viral structural (capsid) proteins, was performed using a combination of BLAST, MG-RAST, and HHpred. Two contigs were identified as containing significant similarity to known RdRps. A similar search examining the DNA-enriched viral metagenome data set revealed no similarity to RdRps but did show many hits to capsid proteins, mainly those of archaeal DNA viruses, as expected.
Contig00002 contains a single long ORF potentially coding for a 198-kDa (1,809 amino acids) polyprotein that spans almost the entire length of the contig, suggesting that a complete or nearly complete RNA viral genome was assembled. In the middle part of this polyprotein (approximately between amino acids 800 and 1020), a putative RdRp domain was identified using several search methods. A search of the Conserved Domain Database at the NCBI using the RPS-BLAST program detected a highly significant similarity (E-value of approximately 6 × 10
−11) to the RdRp domain profile (no significant similarity to any other domain was detected for this or other parts of the polyprotein). A BLASTP search of the nonredundant protein database at the NCBI yielded statistically significant similarity (E-value of 2 × 10
−4, 23% identity in an alignment of 341 amino acid residues) to the RdRp of Pariacato virus, a nodavirus, and limited, not-significant (E-value of approximately 0.1) similarity to the RdRps of several other RNA viruses, including tombusviruses and pestiviruses. The second iteration of the PSI-BLAST search resulted in highly significant similarity to the RdRps of a variety of positive-strand RNA viruses of eukaryotes. This putative RdRp sequence contains all three highly conserved motifs, A (D-X[4,5]-D) and B ([S/T]G-X[3]-T-X[4]-N[S/T)], which are involved in nucleotide selection as well as metal binding (
24), and C, which contains the highly conserved GDD motif and is an essential part of the Mg
2+-binding site (A) (
32,
37). Modeling of the inferred protein RdRp onto the X-ray structure of the positive-strand single-stranded Norwalk virus RdRp structure showed that the putative archaeal virus RdRp contained the principal structural elements of the Palm domain of the RdRps of eukaryotic positive-strand RNA viruses (B). Matches to RdRps were also detected for contig00228 (comprising 10 reads), which encompassed three overlapping short ORFs, each of which showed approximately 70% amino acid sequence identity to the predicted RdRp of contig00002 (A). Thus, this contig appears to encode an RdRp of a putative archaeal virus that is related to but clearly distinct from the putative virus encoded by contig00002.
A comparison of the contig00002 polyprotein sequence to databases of protein family profiles using the HHPred program supported the finding of highly significant similarity to viral RdRps and, additionally, led to the detection of sequences that were similar to capsid proteins of positive-strand eukaryotic RNA viruses within the nodavirus family. The sequence with the highest similarity to capsid proteins is approximately 90 amino acids in length and is located C terminal to the RdRp, spanning amino acid residues 1562 to 1652 of the polyprotein. The alignment between this portion of the polyprotein and the capsid proteins of eukaryotic viruses demonstrated a level of similarity that was not statistically significant, although the alignments included approximately 30% identical amino acid residues. Nevertheless, a more detailed, direct comparison of the contig00002 polyprotein sequence to the nodavirus capsid protein sequences, as well as the related capsid protein sequences of tetraviruses and nodaviruses, allowed us to extend the alignment and identify the main structural elements of the capsid proteins (). The nodavirus capsid protein adopts an 8-strand jelly roll fold that is distantly related to the structures of the capsid proteins of other small icosahedral positive-strand RNA viruses. In addition, unlike other well-characterized viral capsid proteins, nodaviruses possess an autoproteolytic activity that is involved in the processing of the capsid protein precursor (the α protein) into the mature large (β) and small (γ) capsid proteins (
61,
63). The nodavirus capsid protein is the founding member of the A6 peptidase superfamily that employs an unusual catalytic mechanism involving an interaction between the active aspartate of the capsid protein precursor and a conserved asparagine that is proximal to the scissile peptide bond in the C-terminal part of the protein, at the boundary between β and γ capsid proteins (
75). These residues, which are essential for autocatalytic cleavage of the nodavirus capsid protein precursor, are also conserved in the sequences of capsid proteins of tetraviruses. Counterparts to both the catalytic aspartate and the cleavage site asparagine were detected in the contig00002 polyprotein, and regions surrounding these two conserved amino acids were found to have the highest levels of sequence similarity (). Exhaustive analysis of the parts of the contig00002 polyprotein outside the RdRp and capsid protein regions failed to detect similarity to any other known proteins or domains. Collectively, the observation of a large polyprotein (a common feature of eukaryotic positive-strand RNA viruses) containing putative RdRp and capsid proteins in addition to a possible autoproteolytic activity suggest that this contig corresponds to a near full-length genome of a previously unknown positive-strand RNA virus.
Phylogenetic analysis of the identified RdRp sequences showed that they formed a lineage distinct from known RdRps of eukaryotic and bacterial RNA viruses (). The putative RdRps encoded by contig00002 and contig00228 did not show specific affinity with any of the three superfamilies of eukaryote positive-strand RNA viruses (picornavirus-like, alphavirus-like, and flavivirus-like) or the only known bacterial lineage (leviviruses), and statistical tests showed that association with any of these major lineages or additional distinct lineages, such as nodaviruses, could not be convincingly ruled out (see the supplemental material). These results are compatible with a central position of the new RdRp in the tree and, hence, with its possible ancestral status.
To investigate the possibility that the RNA virus genomes detected were contaminants of the hot springs introduced from external sources, we tested the ability of an RNA plant virus known for its high stability to be maintained within the hot spring environment. Samples of CCMV were mixed with hot spring water in 50-ml Falcon tubes and placed back into the hot springs. Aliquots were taken at regular intervals for up to 30 min and quantified using a qRT-PCR assay which can detect as few as 10 viral genome copies. Within 1 min of sampling, there was no detectable amount of the CCMV RNA (data not shown). This experiment indicates that it is unlikely that an externally introduced RNA virus can survive long enough to be detected, especially in the form of long RNA segments, under the high-temperature acidic conditions of the hot springs from which the putative archaeal virus sequences were obtained.
To further probe the possibility that the detected RNA viral genomes replicate in archaeal hosts, these sequences were compared to the sequences produced by environmental transcriptomic analysis of two high-temperature YNP hot springs at neutral or alkaline pHs (Mushroom and Octopus Springs) that are known to be dominated by thermophilic bacteria. These hot springs harbor a thermophilic bacterial community of relatively high complexity that consists of
Chloroflexi,
Cyanobacteria,
Acidobacteria, and
Chlorobi (
36). A BLASTN analysis found that the great majority of the contigs (97% of the total RNA viral contigs) did not have statistically significant matches to the metatranscriptome of these neutral or alkaline YNP hot springs. Three percent of the RNA viral contigs did show a significant match. However, the average length of the matching contigs was 342 bp, much shorter than the overall average contig length. This analysis indicates that there is little overlap between the RNA viral community present in the archaea-dominated acidic hot springs and the bacteria-dominated neutral or alkaline hot springs and provides further evidence that the putative viral RNA genomes detected are associated with archaeal hosts.
Finally, we analyzed the CRISPR direct repeat (DR) and spacer content present in our cellular metagenomic data sets in an attempt to link the putative RNA viral genomes directly to a specific host type and to investigate potential host immunity to RNA viruses. CRISPR DRs and spacers were extracted from the cellular metagenomic data sets. The spacer sequences (10,349) were compared with the assembled viral RNA and DNA metagenomes. Forty-six spacers (0.44%), associated with 4 types of DRs, were identical to RNA sequences within the viral metagenomic data set (). A similar comparison of the spacers to the DNA viral metagenome revealed 628 (6.07%) identical spacers (data not shown). The majority of matching spacer sequences of the RNA metagenome (44/46) were related to DRs of the archaeal species Sulfolobus islandicus and Sulfolobus acidocaldarius. These were the only significant matches found in the CRISPRfinder direct repeat database. These findings suggest that the RNA viral genomes replicate in an archaeal host belonging to the Sulfolobales, a cell type commonly found in NL10 and acidic hot springs worldwide, and elicit a CRISPR-mediated immune response. However, we cannot categorically rule out the unlikely possibility that the CRISPR spacer matches to the RNA viral metagenome are in fact to DNA viral sequences that are still present in our viral RNA metagenomic data set.