Here we present the discovery of a novel, highly divergent DNA virus, PHV, that appears to be a hybrid intermediate between the parvoviruses and circoviruses and is nearly identical by sequence to another virus, NIH-CQV, recently reported to have been found in patients with seronegative hepatitis. The arrangement of the linear ~3.6-kb genome of PHV, consisting of two forward-direction ORFs encoding the replication and capsid proteins and a pair of inverted terminal repeats, is characteristic of viruses in Parvoviridae, and PHV retains the P-loop NTP-binding domain in the replicase protein and PLA2-like and parvovirus coat motifs in the capsid protein that are conserved among members of that family. Although the replicase gene of PHV is more similar to that of circoviruses by direct BLASTx searches, we were unable to recover a closed circular form of PHV by inverse PCR, unlike the previous study by Xu et al. describing the discovery of the related NIH-CQV (33). If present in PHV, an episomal form, which has been detected in parvoviruses such as adeno-associated virus and bocavirus (55), may represent a rare variant in the viral population. By phylogenetic analysis, PHV does appear to bridge the interface between circoviruses and parvoviruses and, as such, may be related to the last common ancestor of these two ssDNA viral family lineages.
The genome of PHV was independently detected and de novo assembled in two laboratories from pooled samples from patients with non-A-E hepatitis or diarrhea. However, subsequent identification of PHV sequences in multiple, widely different sample sets and failure to confirm the presence of PHV after re-extraction of NA by a method other than the use of Qiagen spin columns raised a strong suspicion that PHV may be a laboratory contaminant. This was directly confirmed by PCR and deep sequencing of extracted and eluted water controls, revealing that PHV-specific sequences could be directly recovered from Qiagen spin columns. The degree of potential contamination by PHV is significant given that a contiguous stretch comprising ~2/3 of the genome could be de novo assembled from deep sequencing data corresponding to a negative water control. Even more strikingly, direct mapping of NGS reads corresponding to the water control to the PHV-1 reference sequence enabled assembly of >97% of the genome. The high efficiency of silica-based spin columns in concentrating DNA/RNA during extraction (57) may have played a role in amplifying even trace contamination from PHV occurring during the time of manufacture. Contamination was observed only in spin columns from a single manufacturer (Qiagen) and was not seen with columns from other manufacturers (Invitrogen), or with alternative methods of extraction, such as TRIzol (Invitrogen) or magnetic beads (Ambion or Qiagen EZ1). Furthermore, the extent of contamination appeared to be time and/or batch dependent, as spin column-based kits manufactured prior to 2011 were largely devoid of PHV sequences but those made in 2012 and 2013 were likely to be heavily contaminated. Such sporadic contamination events may inadvertently mislead researchers into erroneously making disease associations if they are unaware that a newly discovered virus is a contaminant and not a bona fide infectious agent. The presence of contamination in spin columns, such as the previously reported detection of sequences corresponding to murine DNA, circoviruses/densoviruses, and Legionella bacteria in Qiagen NA extraction columns (16), can also negatively impact the performance of both clinical and research-based assays for pathogen detection, underscoring the need for DNA-free reagents.
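The ">97% of the genome" figure above is a breadth-of-coverage statement: the fraction of reference positions covered by at least one mapped read. The following is a minimal sketch of that calculation, not the pipeline used in this study. The function name `breadth_of_coverage` and the toy read coordinates are hypothetical, and reads are assumed to be given as 0-based, half-open `(start, end)` intervals on a linear reference.

```python
def breadth_of_coverage(genome_length, intervals):
    """Return the fraction of reference positions covered by >= 1 read.

    intervals: iterable of (start, end) pairs, 0-based and half-open,
    giving the mapped coordinates of each read on the reference.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = [False] * genome_length
    for start, end in intervals:
        # Clip each interval to the reference so stray coordinates are ignored.
        for pos in range(max(start, 0), min(end, genome_length)):
            covered[pos] = True
    return sum(covered) / genome_length


# Hypothetical reads mapped from a water control to a ~3.6-kb reference,
# leaving a single 100-nt gap (positions 3000-3099) uncovered.
toy_reads = [(0, 1500), (1400, 3000), (3100, 3600)]
toy_breadth = breadth_of_coverage(3600, toy_reads)
print(f"{toy_breadth:.1%} of the reference covered")
```

In practice the intervals would come from an alignment file rather than a hand-written list; the per-position boolean array is chosen here only because it is easy to verify for a genome of this size.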
In the present study, the consensus genomes assembled from the various PHV strains were remarkably similar, exhibiting 96 to 100% nucleotide identity with each other. The very slight observed differences may reflect natural variation or errors in the deep sequencing, either native to the technology or due to sequencing artifacts from random priming or PCR duplication. Strikingly, on an amino acid level, the translated sequences for the major proteins were 99 to 100% identical across all 12 assembled PHV genomes and NIH-CQV. Our finding of very low intrastrain variation in the PHV genome contrasts markedly with that described by Xu et al., in which significant genetic heterogeneity in NIH-CQV corresponding to putative sequence variants between patient samples was observed (33). Although fold coverage maps from that prior study were not presented, it is possible that insufficient sequence coverage and/or errors in the NGS data may have accounted for the observed high substitution rates. Notably, in our study, greater sequence variation in the assembled genomes was observed in PHV-3 (negative water control) and PHV-6C (encephalitis samples, pool C), which had comparatively lower depths of coverage than the other PHV genomes. An alternative possibility is that there is indeed genetic heterogeneity in PHV/NIH-CQV that reflects natural variation and/or artifactual variation arising from lot-to-lot variability in the degree of spin column contamination.
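For readers who want to reproduce this kind of comparison, pairwise nucleotide identity can be computed from two rows of an alignment as sketched below. This is an illustrative assumption about the calculation, not the method used in this study: the function name `percent_identity` is hypothetical, the inputs are assumed to be pre-aligned sequences of equal length, and columns containing a gap in either sequence are excluded from the denominator (other conventions exist and give slightly different values).

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where either sequence has a gap character are skipped, so the
    result is matches / ungapped columns * 100.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not PHV sequence): 1 mismatch in 10 columns.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```

Because low-coverage consensus positions are more likely to carry miscalled bases, identity values computed this way are best read alongside the depth of coverage of each assembly.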
The finding of laboratory contamination as the origin of PHV suggests that NIH-CQV, which shares 100% amino acid identity with PHV, is most likely also a laboratory contaminant. In the study by Xu et al., there was 70% PCR positivity in seronegative hepatitis patient samples with an average virus titer of 1.05 × 10⁴ copies/μl (corresponding to 1.05 × 10⁷ copies/ml) yet 0% positivity in healthy blood donors (33). The dichotomy between these results and serological detection showing comparable rates of positivity for IgG specific to the C-terminal portion of the NIH-CQV capsid protein in hepatitis patients and blood donors is striking. The PCR results may be explained by lot-to-lot variability or the use of a Qiagen extraction kit prior to 2011, as those kits appeared to be less contaminated with PHV/NIH-CQV, while the serological results may potentially be due to detection of cross-reactive antibodies by the immunoblot assay. Previously, a serological assay designed to detect antibodies to p15E of XMRV showed elevated seroreactivity in human T-cell lymphotropic virus type 1 (HTLV-1)-infected individuals (60), although none of these individuals had detectable antibodies to a second XMRV protein, gp70. Subsequent analysis revealed that a highly conserved sequence within the immunodominant region of HTLV gp21 that is shared with p15E was likely the source of the cross-reactive antibodies elicited by HTLV-1 infection (60). In the study by Xu et al. (33), confirmatory data based on serologic reactivity to multiple nonoverlapping epitopes within a single protein or more than one viral protein would have provided stronger evidence of infection by NIH-CQV.
By data mining of publicly available environmental metagenomic databases, sequences with 100% identity to PHV/NIH-CQV were detected in coastal waters off North America. The relatively low number of reads detected is likely due to several factors: (i) high-efficiency concentration of viral DNA in the spin columns, (ii) differential rates of PHV abundance in ocean water, and (iii) lower-throughput Roche 454 pyrosequencing rather than Illumina NGS for data generation. Viral abundance in aquatic ecosystems is exceedingly high, with concentrations estimated at ~10⁸ per 1 ml (61). In total, approximately 10³⁰ viruses are thought to reside in the world's oceans, constituting a vast, largely unsequenced reservoir of genomes. In addition, highly diverse ssDNA viruses, such as circoviruses and parvoviruses, have been detected in seawater (62) and in ocean dwellers such as peneid shrimp (63), and viruses are known to infect diatoms (algae) that are ubiquitous in seawater (64). Taken together, these observations suggest a plausible pathway for how PHV contamination of the NA spin columns could have occurred. Column-based NA purification is a solid-phase extraction method that binds NA by adsorption to silica, and the silica used in many commercial spin columns is derived from the cell walls of diatoms (57). If Qiagen's NA extraction kits and “silica gel membrane technology” involve the use of diatoms (66), it is plausible that PHV is a virus of diatoms and had inadvertently contaminated the spin columns during manufacture. The sporadic contamination observed in the silica-based spin columns may thus be due to seasonal variation in diatom abundance, diatom type, and rates of viral infection (67). The contamination of spin columns is not confined to PHV but can also be seen by the presence of sequences corresponding to phages, circoviruses, and parvoviruses other than PHV (16). Further studies will be needed to establish that PHV is a virus of diatoms. Notably, we did not detect PHV in environmental metagenomic data sets corresponding to other oceanic or environmental communities, which may reflect a limited geographic and temporal distribution for the virus or a bias and/or incompleteness in the publicly available metagenomic databases surveyed. The impact, if any, of these oceanic viruses on human health or public safety is unknown.
As the use of molecular methods such as deep sequencing for pathogen discovery becomes more frequent, it is critical that robust strategies be developed to rapidly determine the biological and clinical relevance of any new candidate agent. This is especially true with the discovery of novel, potentially transfusion-transmissible viruses in blood that may have an immediate impact on infectious diseases and public health (68), as exemplified by the high-profile putative association between XMRV and chronic fatigue syndrome that was eventually refuted by rigorous follow-up investigation (19). In the present study, the confirmation of PHV as a laboratory reagent contaminant and not a candidate blood-borne infectious agent was made possible by (i) independent assessment at two research sites, (ii) free and open sharing of sequence data corresponding to multiple sample cohorts between laboratories, (iii) use of control samples subjected to the same extraction and deep sequencing steps as experimental samples, (iv) direct PCR confirmation of viral contamination, and (v) data mining of publicly available metagenomic sequence databases derived from a vast array of clinical and environmental samples. Our results thus strongly call into question any association of the PHV and NIH-CQV viruses with seronegative hepatitis or, indeed, any bona fide infections of humans. Timely reporting of “dediscoveries” as well as discoveries, by focusing effort and resource investment, is needed to maximize the translational impact of pathogen discovery to clinical medicine and infectious diseases.