In recent years, due to the development of NGS technologies, vsiRNA profiling has been extensively used as a surrogate to study the processing of viral genomes by the RNAi machinery. Although vsiRNAs align along full-length viral genomes, some regions are less covered than others (cold spots). Such cold spots were observed in the profile of the Semliki Forest virus (SFV), and they correlated with predicted secondary structures that appear to be insensitive to Dicer processing (17). A similar interpretation was given for cold spots observed in the profile of FHV (5). However, these profiles were incomplete due to an imprecise reference sequence. Indeed, we obtained similar incomplete profiles when we used ANVNCBI (B and C) as the reference sequence, while vsiRNAs mapped along the full-length FHVS2R+ genome after reconstruction (A). These results show that artifactual cold spots can arise from disparities between the sequence of the virus studied and the one used as the reference. It is worth mentioning that artifactual cold spots may also be linked to technical factors, such as the small RNA library preparation protocol or the choice of the sequencing platform (7).
Interestingly, in the case of SFV, synthetic siRNAs corresponding to cold spots were shown to inhibit viral replication more than synthetic siRNAs corresponding to hot spots. These results suggest that cold spots constitute promising targets for the design of effective siRNA-based antiviral therapeutics for agronomic or human health applications. To optimize such an approach, it is important to limit the identification of artifactual cold spots, and therefore, a priori knowledge of the actual virus reference sequence becomes critical.
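To make the notion of a cold spot concrete, the following Python sketch computes a per-position coverage profile from aligned reads and reports stretches of low coverage. It is an illustration only and not part of Paparazzi: the function names, the `(start, length)` read representation with 0-based coordinates, and the `max_depth` and `min_length` thresholds are all assumptions made for this example.

```python
from typing import List, Tuple


def coverage_profile(genome_length: int, reads: List[Tuple[int, int]]) -> List[int]:
    """Per-position read depth from (start, length) alignments (0-based starts)."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, length in reads:
        if start < 0 or start >= genome_length:
            continue  # ignore reads that start outside the reference
        end = min(start + length, genome_length)
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        coverage.append(running)
    return coverage


def cold_spots(coverage: List[int], max_depth: int = 0,
               min_length: int = 1) -> List[Tuple[int, int]]:
    """Half-open intervals (start, end) whose depth stays <= max_depth
    for at least min_length consecutive positions."""
    spots = []
    start = None
    for i, depth in enumerate(coverage):
        if depth <= max_depth:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                spots.append((start, i))
            start = None
    if start is not None and len(coverage) - start >= min_length:
        spots.append((start, len(coverage)))
    return spots


if __name__ == "__main__":
    # Toy example: two 21-nt and 5-nt reads on a 30-nt reference.
    profile = coverage_profile(30, [(0, 21), (25, 5)])
    print(cold_spots(profile, max_depth=0, min_length=3))
```

Running the same analysis against an imprecise reference and against the reconstructed genome would, under these assumptions, show whether a reported cold spot persists once the reference matches the replicating virus.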
Determining the actual reference sequence is particularly important when working with rapidly evolving RNA viruses (15). Although in some cases the accumulation of changes in the consensus sequence may be a slow process, viral genomes may vary within even a limited number of replication cycles, particularly if genetic bottlenecks and positive selection are occurring. With more researchers turning to NGS technology to study viruses, we developed Paparazzi as a one-step tool that fully exploits the data generated by NGS to reconstruct viral genomes and therefore profile vsiRNAs accurately. Paparazzi successfully and rapidly reconstructed the sequences of the three viruses present in our infected sample, even when the initial reference sequence used as the scaffold differed by up to 10% from the sequence of the actual replicating virus. Altogether, our results show that Paparazzi can be used as a proxy for the Sanger method to resequence and potentially genotype viral strains from RNAi-competent organisms.
A key aspect of Paparazzi is its ability to detect variations in viral genomes without prior knowledge of their existence. First, single-nucleotide polymorphisms can be quantified using the nucleotide frequency matrix that is calculated during genome reconstruction. Second, both major and minor DI breakpoints can be identified by the DI discovery snippet, which improves the accuracy of the viral genome reconstruction when DI-derived reads are overrepresented. The application of this last feature allowed us to show that DI-derived reads are highly targeted by the RNAi machinery for the FHVS2R+ RNA2 genome. Two nonmutually exclusive hypotheses may account for these observations. (i) DIs replicate faster because of their smaller size, and the DI/genome-derived siRNA ratio follows the stoichiometry of these two species. (ii) DIs are less efficiently encapsidated than the full-length RNA2 sequence and are thus more exposed to the RNAi machinery. It was previously proposed that the prevalence of DI-derived siRNAs correlates with FHV persistence in S2 cells (5). However, we found a similar overrepresentation of DI-derived siRNAs under our experimental conditions, which are lethal for naïve S2 cells. Therefore, the abundance of DIs is not sufficient to allow persistence of FHV infection, although it may contribute to this phenomenon.
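The role of a nucleotide frequency matrix in both consensus calling and polymorphism quantification can be sketched as follows. This is a minimal Python illustration, not Paparazzi's implementation: the function names, the gapless `(start, sequence)` read representation with 0-based coordinates, and the `min_fraction` threshold are assumptions introduced for this example.

```python
from collections import Counter
from typing import List, Tuple


def frequency_matrix(genome_length: int,
                     aligned_reads: List[Tuple[int, str]]) -> List[Counter]:
    """One Counter of base counts per reference position.
    aligned_reads holds (0-based start, read sequence), gapless alignments."""
    matrix = [Counter() for _ in range(genome_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < genome_length:
                matrix[pos][base] += 1
    return matrix


def consensus(matrix: List[Counter], scaffold: str) -> str:
    """Most frequent base at each position; uncovered positions keep the scaffold base."""
    bases = []
    for pos, counts in enumerate(matrix):
        if counts:
            bases.append(counts.most_common(1)[0][0])
        else:
            bases.append(scaffold[pos])
    return "".join(bases)


def minor_variants(matrix: List[Counter],
                   min_fraction: float = 0.1) -> List[Tuple[int, str, float]]:
    """(position, base, fraction) for non-consensus bases at or above min_fraction."""
    variants = []
    for pos, counts in enumerate(matrix):
        total = sum(counts.values())
        if total == 0:
            continue
        major = counts.most_common(1)[0][0]
        for base, n in counts.items():
            if base != major and n / total >= min_fraction:
                variants.append((pos, base, n / total))
    return variants


if __name__ == "__main__":
    scaffold = "ACGTACGT"  # hypothetical starting reference
    reads = [(0, "ACGA"), (0, "ACGA"), (0, "ACGT"), (4, "ACGT")]
    m = frequency_matrix(len(scaffold), reads)
    print(consensus(m, scaffold))
    print(minor_variants(m, min_fraction=0.2))
```

In this toy case the majority base at position 3 replaces the scaffold base in the consensus, while the minority base at the same position is reported with its frequency, which is the kind of information needed to quantify single-nucleotide polymorphisms.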
While analyzing our data, we noticed that the overrepresentation of DI-derived reads affects the accuracy of virus genome reconstruction. Indeed, the genome sequence of ANVNCBI that was determined using both de novo assembly of virus-derived small RNAs and Sanger sequencing-based gap fill-in (22) displays two internal duplications (248–261/517–530 [100% identity] and 112–153/1349–1390 [89% identity]) in its RNA2 segment. Neither of these duplications was observed in any other FHV genome, including that of FHVS2R+. Interestingly, we obtained the same duplications when Paparazzi was instructed not to filter DIs (248–261/517–530) or when Paparazzi was applied allowing 2 mismatches for genome reconstruction (116–152/1353–1389). As these duplications are absent from the Sanger sequencing-determined sequence of FHVS2R+, we infer that the duplications observed in ANVNCBI RNA2 result from artifacts in de novo assembly, and therefore, the actual sequence of ANV RNA2 may differ from the one previously published (22). Although Paparazzi is not a viral discovery pipeline, its ability to reconstruct full-length viral genome sequences and spot artifacts of reconstruction makes it a powerful companion tool to polish the results obtained from virus discovery pipelines. Of note, these features of Paparazzi can also be exploited when directly resequencing viral RNA by NGS technologies.
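Internal duplications of the kind discussed above can be screened for in any assembled sequence with a simple k-mer index. The Python sketch below is one possible way to do this and is not taken from Paparazzi: the function name, the seed length `k`, and the toy sequence are assumptions, and the sketch reports exact duplications only, so a partially identical repeat (such as one with 89% identity) would surface as shorter exact fragments or be missed.

```python
from collections import defaultdict
from typing import List, Tuple


def exact_duplications(seq: str, k: int = 12) -> List[Tuple[int, int, int]]:
    """Return (first_start, second_start, length) for maximal exact repeats
    of at least k nucleotides. Coordinates are 0-based."""
    # Index every k-mer by the positions at which it occurs.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # A seed (i, j) with i < j means seq[i:i+k] == seq[j:j+k].
    seeds = set()
    for locs in positions.values():
        for a in range(len(locs)):
            for b in range(a + 1, len(locs)):
                seeds.add((locs[a], locs[b]))

    results = []
    for i, j in sorted(seeds):
        if (i - 1, j - 1) in seeds:
            continue  # this seed extends a run that started further left
        length = k
        # Each consecutive seed on the same diagonal adds one matching base.
        while (i + length - k + 1, j + length - k + 1) in seeds:
            length += 1
        results.append((i, j, length))
    return results


if __name__ == "__main__":
    unit = "GATTACAGGCTTAA"  # hypothetical 14-nt duplicated block
    toy = "ATGCCGT" + unit + "TCAGTCCATGCA" + unit + "CGGA"
    for first, second, length in exact_duplications(toy, k=12):
        # Convert to 1-based inclusive coordinates, as used in the text.
        print(f"{first + 1}-{first + length}/{second + 1}-{second + length}")
```

A hit from such a screen is only a flag for closer inspection: whether a duplication is genuine or an assembly artifact still has to be settled against independent evidence, such as a Sanger-determined sequence.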
In conclusion, Paparazzi provides an effective tool for viral genome reconstruction, accurate vsiRNA profiling, and studying RNAi processing. The development of such a tool fits into a constant effort to limit the bottleneck between NGS data generation and analysis in a context where NGS technologies become more affordable and efficient by the day.