Viral metagenomics has recently yielded numerous previously uncharacterized viral genomes from human and animal samples. We review some of the metagenomics tools and strategies to determine which orphan viruses are likely pathogens. Disease association studies compare viral prevalence in patients with unexplained symptoms versus healthy individuals but require these case and control groups to be closely matched epidemiologically. The development of an antibody response in convalescent serum can temporally link symptoms with a recent infection. Neutralizing antibody detection requires cell culture amplification of the virus, which is often difficult. Antibody binding assays require proper antigen synthesis and positive control sera to set assay thresholds. High levels of viral genetic diversity within orphan viral groups, frequent co-infections, low or rare pathogenicity, and chronic virus shedding can all complicate disease association studies. The limited availability of matched case and control sample sets from different age groups and geographic origins is a major obstacle to estimating the pathogenic potential of recently characterized orphan viruses. Current limitations on the practical use of deep sequencing for viral diagnostics are listed.
Viral metagenomics directly characterizes the genetic material of viral communities, bypassing the need for prior virus-specific in vitro or in vivo amplification [1–3]. Viral metagenomics was initially used to analyze environmental samples using Sanger shotgun sequencing, and has rapidly expanded to include samples such as human and animal feces [5–11], blood [12,13], tissues [14–17] and respiratory secretions [18–20], often using next generation sequencing. Viral metagenomics “deep sequencing” has largely focused on viral discovery to identify new pathogens, sequencing of viral variants of known species to better understand their evolution, and unbiased surveys of viral communities [22–25] without the diversity-reducing effect of prior amplification in cell culture. Viral metagenomics can also be used to detect low titer viruses and characterize full viral genomes, circumventing the need for molecular methods that require stringent nucleotide hybridization, such as PCR and microarrays.
Due to its earlier introduction and longer read length (facilitating the recognition of highly divergent viral sequences), the Roche 454 system has been the most popular high-throughput sequencing tool for viral discovery. The less expensive (per base pair) shorter sequence reads from the Illumina genome analyzer platform are gradually increasing in popularity [28,29]. Sequencing technologies delivering data in hours instead of days, including those from Ion Torrent and Pacific Biosciences, have been recently introduced but yield fewer reads. A high rate of sequencing errors is inherent to many of these high-throughput technologies. Frame-shifting insertion/deletion (indel) mutations disrupt viral open reading frames and thus interfere with protein similarity searches, which can impair the identification of highly divergent viruses. Indels have less impact on the identification of already known viral pathogens, which can be recognized using nucleotide similarity searches. When sequencing high titer viruses, sequencing errors are corrected during assembly of the consensus genome sequence from the numerous overlapping reads covering each nucleotide position. Deep sequencing may also be used to detect rare variants such as HIV drug resistance mutants present as a minority within largely wild-type viral quasispecies.
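The error-correcting effect of consensus assembly can be illustrated with a minimal Python sketch. It assumes the reads have already been aligned to common coordinates and padded with '-' where a read does not cover a position, which is a strong simplification of what real assemblers do (indel errors in particular are not handled). The function name `consensus` and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over reads pre-aligned to equal length.

    '-' marks positions a read does not cover; those entries are ignored
    when counting. Columns without any coverage are reported as 'N'.
    """
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    result = []
    for i in range(length):
        column = Counter(read[i] for read in aligned_reads if read[i] != '-')
        result.append(column.most_common(1)[0][0] if column else 'N')
    return ''.join(result)

# Hypothetical toy data: four overlapping reads, two of which carry
# a single substitution error each.
reads = [
    "ACGTTGCA--",
    "ACGATGCATG",  # error at index 3 (T read as A)
    "--GTTGCATG",
    "ACGTTGCTTG",  # error at index 7 (A read as T)
]
print(consensus(reads))
```

With sufficient coverage, each isolated error is outvoted by the correct base at that position. The same column counts, read the other way around, are what minority-variant detection relies on, which is why rare true variants and sequencing errors can be hard to tell apart at low frequencies.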
Directly sequencing nucleic acids in biological fluid without prior viral particle enrichment will result in a high background of host and bacterial genetic material, including chromosomal, episomal and ribosomal sequences. To reduce such background, viruses can be purified using simple filtration methods to exclude larger particles. Nuclease treatments can also be used to digest naked cellular nucleic acids abundant in biological fluids while viral nucleic acids remain protected within the viral capsid. When a large sample volume is available, such as in environmental studies, ultra-centrifugation methods can also be used to concentrate and purify viral particles from the expected density bands [32,33].
Regardless of the sequencing technology used, the viral nucleic acids have to be amplified to generate the large quantities of DNA required for most sequencing platforms. Since many viruses have RNA genomes, their detection first requires the genomic RNA to be reverse transcribed into cDNA. Numerous sequence-independent amplification methods have been successfully used but two basic approaches predominate [3,21,34–41]. Random PCR relies on PCR primers with a degenerate 3’ end to randomly prime DNA synthesis [20,42]. The 3’ degeneracy allows such primers to anneal throughout the length of viral RNA or DNA genomes. Following two rounds of extension, which place such a primer at both extremities of a viral sequence, multiple rounds of PCR amplification are performed with the same primer lacking the degenerate 3’ end (NNNNNN). This method is analogous to one commonly used for generating fluorescently labeled cDNA for transcriptome analysis on microarrays. A related technique called sequence-independent single primer amplification (SISPA) involves partial cleavage of double stranded cDNA (generated with a short random primer) using a restriction enzyme recognizing a 4 bp site, followed by ligation of a sticky-ended adaptor and PCR using a primer complementary to the adaptor [31,39,44].
Another popular random amplification method involves the use of the highly processive phi29 DNA polymerase with short random primers in a multiple displacement amplification [45,46]. A bias towards the amplification of circular DNA viral genomes may occur when using phi29. Viral RNA can also be amplified after reverse transcription and ligation of the cDNA molecules into long chimeric DNA molecules more appropriate for phi29 DNA polymerase amplification.
To facilitate the recognition of novel viruses, overlapping sequence reads are first computationally assembled into longer contigs. The less complex the nucleic acid population, the greater the number of reads that will be collapsed into contigs. Both contigs and unassembled singletons are then subjected to searches against public sequence databases using the BLAST suite of tools. Nucleotide similarity searches can quickly identify sequences that are closely related to those of known viral species. Previously uncharacterized viral species that are highly divergent from those already in public databases may be unrecognizable using nucleotide similarity searches and require the computationally more demanding comparison of their virtual translation products to all known viral proteins in order to detect weaker matches. A limitation of such similarity-based approaches is that still uncharacterized viral families with no detectable protein sequence similarity to known viral families in GenBank would go unidentified. Possible approaches to identify such highly divergent viruses include further analysis of the taxonomically unclassifiable contigs often found in viral metagenomic studies. Extension of such contigs by targeted PCR or further metagenomics sequencing may allow the detection of distant homology to a known virus. The presence of the same unclassifiable sequence in different patients might also support claims for the detection of a new viral family but would ultimately require direct evidence of viral replication such as amplification in vitro or in vivo, and seroconversion in exposed hosts. Once a single genome of a new viral family is characterized, its constituent genera and species will be readily detectable using sequence similarity searches. Characterization of novel viral families from any particular host group such as vertebrates, arthropods, protists, plants, and prokaryotes will therefore facilitate detection of their viral homologues in other hosts.
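The "virtual translation products" used in protein similarity searches can be sketched in Python as a six-frame translation of a contig. This is only an illustration of the translation step, assuming the standard genetic code and unambiguous A/C/G/T bases. It does not perform any database search, and the function names and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translations(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

# Hypothetical toy ORF and the same ORF with a single inserted base,
# mimicking a frame-shifting sequencing error.
orf = "ATGGCCAAAGGGTAA"
shifted = orf[:6] + "C" + orf[6:]
print(six_frame_translations(orf)["+1"])
print(six_frame_translations(shifted)["+1"])
```

Comparing the two printed translations shows why indel errors matter for protein-level searches: every residue downstream of the inserted base changes, so a single error can erase the similarity that a translated search depends on.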
A significant fraction of cases of common human diseases, including acute gastroenteritis, acute respiratory tract infections, hepatitis, and encephalitis, remain without an identified infectious etiology. Autoimmune diseases and some cancers may also be induced by still unidentified viral triggers. Viral metagenomics therefore provides a simple tool to identify candidate pathogens for such diseases. Clinical diagnostic laboratories frequently use cell culture and on occasion observe viral cytopathic effects induced by unidentifiable viruses. When viral concentration is high, random amplification and even superficial DNA sequencing (i.e. a few plasmid subclones) can rapidly characterize the viral genomes in cell culture supernatants [48,49].
Since its recent deployment, viral metagenomics has quickly led to the genetic characterization of numerous human and animal viruses. While many of these “new” viruses were discovered in clinical samples from patients with unexplained symptoms, their identification in such a context may be coincidental and may simply reflect harmless and very common infections. Evidence of the pathogenicity of such orphan viruses is therefore required before these viruses may be included in diagnostic tests. Diagnosing infection with a particular viral pathogen, even one against which no treatment exists, may prevent unnecessary interventions such as antibiotic treatment and improve palliative measures and transmission prevention.
What data are required before an infectious agent can be generally accepted as a pathogen has been previously reviewed in depth [50,51]. Human inoculation with orphan viruses being precluded, indirect evidence of their pathogenicity relies largely on observation of natural infections. The initial question is often whether a newly characterized virus is actually a human virus or is simply ingested or inhaled and passing without consequence through the lumen of the gut or bronchi. The common detection of ingested plant, animal, and insect viral nucleic acids in human and animal stools attests to their capacity to survive passage through the digestive tract [6,7,52–54]. Detection of virus-specific human antibodies can be used to show replication in human hosts [55,56]. Virus detection in blood, tissues, or CSF (as opposed to the gut, respiratory secretions, or skin) may also be considered evidence of human replication since it is difficult to conceive how a large number of viral particles could be passively transferred to internal anatomical compartments. Viral replication in human cell culture may also reflect a human tropism, but because host restrictions can be bypassed in cell cultures and many viruses can grow in non-host species cell lines in vitro, such replication cannot be considered definitive evidence.
Determining whether a human virus is pathogenic often starts by comparing its prevalence in matched disease cases and healthy controls using PCR. Single or nested PCR or quantitative PCR assays can be rapidly designed for sensitive detection and viral load measurement. A major block to performing such case-control studies is acquiring large numbers of samples from both unexplained disease cases and from healthy controls, samples not typically collected in medical care facilities. As an alternative, control samples may be collected from patients with unrelated symptoms. Because viral exposure and susceptibility can vary greatly in different human populations the cases and controls need to be carefully matched epidemiologically especially for age, geographic provenance, and if possible general exposure to viruses (socio-demographics or occupation). Failure to properly match cases with controls may lead to misleading results when comparing groups that have different levels of harmless commensal virus infections. Disease association studies would optimally also be performed with different age groups from different continents. The results of such association studies can be strongly suggestive of a pathogenic role.
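A case-control comparison of this kind reduces to a 2×2 table of PCR-positive and PCR-negative samples in each group. The following Python sketch, using only the standard library, computes an odds ratio and a two-sided Fisher's exact test for such a table. The counts are invented for illustration and the function names are hypothetical. The choice of Fisher's exact test and of the 0.5 continuity correction are assumptions of this sketch, not methods prescribed in the text.

```python
from math import comb

def fisher_exact_two_sided(case_pos, case_neg, ctrl_pos, ctrl_neg):
    """Two-sided Fisher's exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    cases = case_pos + case_neg
    controls = ctrl_pos + ctrl_neg
    positives = case_pos + ctrl_pos
    total = cases + controls

    def prob(x):
        return comb(cases, x) * comb(controls, positives - x) / comb(total, positives)

    lowest = max(0, positives - controls)
    highest = min(cases, positives)
    p_observed = prob(case_pos)
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

def odds_ratio(case_pos, case_neg, ctrl_pos, ctrl_neg):
    """Odds ratio with 0.5 added to each cell to avoid division by zero."""
    return ((case_pos + 0.5) * (ctrl_neg + 0.5)) / ((case_neg + 0.5) * (ctrl_pos + 0.5))

# Invented counts: 18 of 100 cases and 5 of 100 matched controls PCR-positive.
table = (18, 82, 5, 95)
print("odds ratio:", round(odds_ratio(*table), 2))
print("p-value:", fisher_exact_two_sided(*table))
```

A small p-value here only indicates that prevalence differs between the two groups. If cases and controls are poorly matched for age, geography, or exposure, the same result can arise from differences in harmless background infections.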
Samples from clustered disease outbreaks may also be used to test disease association. If a virus is present in all or most patients within an outbreak but not, or only rarely, in unaffected local controls it may be considered associated with the symptoms, especially if other potential infectious culprits are shown to be absent. Single disease cases may also be highly informative if longitudinal samples are collected showing a rise and decline in the concentration of new virus coinciding with onset and resolution of symptoms and known pathogens are not detected using either a battery of specific tests or viral metagenomics.
Seroconversion in convalescent sera is also often used to time an infection to the onset of symptoms. For some infections, such as HCV and HIV, viral replication and particle release may linger for long periods and even establish life-long chronic infections. Some parvoviruses may even be retained in vivo in likely inert but PCR-detectable forms for years [58,59]. Demonstrating seroconversion in convalescent sera through IgM or IgG detection, with a recent prior antibody-negative bleed, can temporally associate an infection with the onset of symptoms. Both neutralizing and viral antigen binding antibodies may be used. Neutralizing antibody tests require prior viral amplification in culture, which may be problematic for some viruses. Antibody-antigen binding assays require generating properly folded and modified viral epitopes, and assay calibration typically uses known antibody-positive sera, reagents that may not be readily available. The absence of such positive and negative control sera may be circumvented by setting a threshold level based on the bimodal distribution of antibody reactivity within a population or on the inflection point of ranked binding assay signals [60,61].
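The idea of deriving a threshold from the data themselves can be sketched in Python. The example below uses a deliberately crude heuristic, placing the cutoff in the middle of the largest gap between consecutive ranked signals, as a stand-in for the bimodal-distribution and inflection-point approaches. It is not the procedure used in the cited studies, and the function name and signal values are hypothetical.

```python
def gap_threshold(signals):
    """Return a cutoff at the midpoint of the largest gap in ranked signals.

    Assumes the population contains both reactive and non-reactive sera
    whose signals separate into two groups; with unimodal data the
    returned value is not meaningful.
    """
    ranked = sorted(signals)
    if len(ranked) < 2:
        raise ValueError("need at least two signals")
    # Pair each gap with the index of its lower neighbour and keep the widest.
    width, index = max((ranked[i + 1] - ranked[i], i)
                       for i in range(len(ranked) - 1))
    return (ranked[index] + ranked[index + 1]) / 2

# Hypothetical optical density readings from a binding assay.
readings = [0.05, 0.12, 1.4, 0.08, 1.9, 0.10, 1.6]
cutoff = gap_threshold(readings)
print(cutoff, [value > cutoff for value in readings])
```

In practice a threshold set this way would still need to be checked against the shape of the full distribution, since a single outlier can create the widest gap.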
The detection of viral replication or expression in affected tissue(s) such as the liver for hepatitis, and brain or CSF for encephalitis can also provide supporting evidence for a pathogenic role. The strong association of the recently described Merkel cell polyomavirus (MCV) with Merkel cell carcinoma provides an example of convincing disease association for an otherwise highly prevalent infection [61,64]. The virus is found clonally integrated into the cells of the tumor with viral mutations that truncate a helicase gene, presumably to promote transformed cell survival by preventing viral replication.
Complicating disease association studies is the extent of viral genetic diversity shown by some recently characterized viral groups [66–69]. As an example, a large number of serotypes/genotypes can be found within each human enterovirus species, each associated with no symptoms or with a variety of symptoms (http://www.picornaviridae.com/enterovirus/prototypes/hev-b_prototypes.htm). Other groups of similarly genetically divergent viruses may therefore be equally diverse phenotypically. When determining the pathogenicity of such new viral groups, each serotype/genotype should therefore be considered separately, increasing the number of samples required to detect sufficient numbers of infections to measure possible disease association.
Many viruses are common but only cause symptoms in a very small subset of infections. To detect a disease association for a virus frequently detected in healthy controls, it must be found in a still higher fraction of the unexplained matched disease cases. If a virus is responsible for only a small fraction of the unexplained cases, it may be difficult for that increase in prevalence to rise above the high background of asymptomatic infections. Two groups of very common human viruses found in plasma, the anelloviruses and GBV-C in the Flaviviridae family, were initially detected in hepatitis patients and at first thought to be associated with that condition [71–73]. Further studies showed anelloviruses to be extremely diverse and nearly universal in human plasma and likely transmitted through close family contact very early after birth [74,75]. GBV-C was also shown to be a very common infection, especially in individuals exposed to blood products. Association of either group of viruses with hepatitis was not confirmed [75,77,78]. Given that anelloviruses exhibit an extreme degree of genetic variability, it remains possible that a subset is associated with some disease in a situation analogous to that of papillomaviruses, but their ubiquitous detection makes association studies challenging.
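The effect of a high asymptomatic background can be made concrete with a small Python calculation. It rests on two simplifying assumptions that are not from the text: the virus is detected in every case it causes, and all other cases carry it at the same background prevalence as controls. Under these assumptions the expected prevalence in cases is f + (1 - f) × p0, where f is the fraction of cases caused by the virus and p0 the control prevalence. The sample size uses the usual normal approximation for comparing two proportions. All names and numbers are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def case_prevalence(background, causal_fraction):
    """Expected prevalence in cases under the stated assumptions."""
    return causal_fraction + (1 - causal_fraction) * background

def samples_per_group(p_controls, p_cases, alpha=0.05, power=0.80):
    """Approximate group size needed to detect p_cases vs p_controls
    with a two-sided test (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_controls + p_cases) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_controls * (1 - p_controls)
                                 + p_cases * (1 - p_cases))) ** 2
    return ceil(numerator / (p_cases - p_controls) ** 2)

# A virus causing 10% of unexplained cases, at low and high background.
for background in (0.05, 0.50):
    p_cases = case_prevalence(background, 0.10)
    print(background, round(p_cases, 3), samples_per_group(background, p_cases))
```

For the same causal fraction, the higher background leaves a smaller absolute difference in prevalence between cases and controls, so considerably more samples per group are needed to detect it.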
Co-infections may also aggravate symptoms. If disease induction is greatly increased in the context of other infections, then only the total number of infections or particular combinations of viruses may be associated with symptoms. Deciphering such complex interactions will require metagenomics or microarray technologies that query all virus families, rather than single-virus assays, in order to identify all participating viruses.
Ultimately, convincingly demonstrating disease association requires confirmation by different research groups using different sets of patient and control samples. The reduction of symptoms and disease prevalence seen after using specific antivirals and vaccinations is a direct way to demonstrate human virus pathogenicity. Since such measures are only developed for highly pathogenic and prevalent viral infections, the ultimate confirmation of their pathogenicity first depends on disease association tests such as those described above. For animal viruses, direct inoculation into their host species can greatly facilitate the determination of their pathogenicity.
The use of viral metagenomics for diagnostic purposes appears promising once significant roadblocks are removed [1,26,80]. Meanwhile, deep sequencing can be used to detect viral contamination during the development and manufacture of biological products [81,82]. The current barriers to the routine use of viral metagenomics as a diagnostic tool include its complexity and slow speed, difficult data interpretation, and the availability of more streamlined alternative methods such as qPCR or microarrays [83,84]. The random amplification and DNA library construction protocols are currently complex but likely to become simpler and faster in the near future. The sequencing time itself is also likely to be reduced. Shrinking the time from clinical sample collection through sequence data generation and bioinformatics analysis to a medically useful few hours to a day will enhance the appeal of a sequencing approach for diagnostic purposes. The sensitivity of the approach, i.e. its ability to detect very few copies of viral genomes, will also have to be compared, using spiked samples, to that of highly sensitive PCR-based and microarray methods. Cost will likely remain a major barrier, especially when compared to a single qPCR assay in cases where only testing for a single virus is indicated. Further limitations include access to the computational power needed to analyze very large sequence data sets. What level of sequence divergence from a reference viral genome (e.g. as measured by BLAST E-value) will be tolerable, and how many matches to different regions of the genome will be required to confirm the presence of a virus, remain to be determined. The persistent problem of DNA contamination and the medical significance of detecting very low levels of viral nucleic acid will also need to be resolved. If the issues described above can be successfully addressed, deep sequencing does have the potential to replace many virus-specific tests with a single procedure.
In the immediate future, microbial sequence microarrays are likely to be simpler, cheaper, and faster tools to screen clinical samples in an unbiased way for all of the rapidly growing number of already genetically characterized viruses. In the interim, deep sequencing is likely to be more popular for research uses such as viral discovery, re-sequencing, and testing biological products during early development.
The project described was supported by Award Numbers R01HL105770 and R01HL083254 from the NHLBI and support from BSRI. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NHLBI.