To test the accuracy of VirusSeq, we analysed RNA-Seq data of 17 HCC cancers available in the TCGA database and detected HBV transcripts in four cases, two of which are from patients with serologic evidence of HBV infection and one from a patient who is seronegative for HBV (and hepatitis C virus). Serology data were not available for the remaining case. Viral integration loci identified in our analysis included the known genes MLL4 (two cancers, both from the HBV-seropositive patients), TERT, ITGAD, TEAD1, TECRL, C19orf55 and MIR548D2. Interestingly, the cancer with TERT-associated HBV sequences came from the patient who was reportedly seronegative for HBV. Our findings corroborate earlier reports of HBV insertion in TERT and MLL4 (Ferber et al., 2003; Saigo et al., 2008).
Characterization of genes with HBV integration breakpoints in HCC
We also analysed RNA-Seq data of 239 cases of head and neck squamous cell carcinoma (HNSCC) available in the TCGA database. We detected viral transcripts in 37 cancers: human papillomavirus (HPV) type 16 (HPV16) in 30 cancers, HPV33 in five, HPV35 in one and Epstein–Barr virus in one. In 24 cancers, HPV transcripts encoding key viral proteins/oncoproteins (E7 in 22 cases; E6 in 20 cases; E1 in 17 cases and E4 in eight cases) were integrated in the cancer genome, the majority in association with known genes. We used the HPV16 status from colorimetric in situ hybridization and the p16 immunohistochemistry data (clinical data) to estimate the sensitivity and specificity of HPV16 detection. Eight samples were HPV16-positive by colorimetric in situ hybridization (six HPV16-positive) and/or p16 immunohistochemistry (seven HPV16-positive), and 36 samples were HPV16-negative. HPV16 status was not available for the remaining samples. Confidence intervals (CIs) were estimated using the Wilson score method, which takes the sample size into consideration. For this HNSCC dataset, the sensitivity was 100% (8/8) with a 95% CI of 67.6–100%, and the specificity was 100% (36/36) with a 95% CI of 90.4–100%.
Estimation of sensitivity and specificity for HPV16 detection in HNSCC samples
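The interval estimates quoted above can be reproduced with a short calculation. The following is a minimal sketch, assuming the standard Wilson score interval without continuity correction and a two-sided 95% level; the function name `wilson_interval` and its parameters are illustrative and are not part of VirusSeq.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction).

    Illustrative helper, not part of VirusSeq. z defaults to the normal
    quantile for a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts taken from the text: 8/8 clinical HPV16-positive samples detected,
# 36/36 clinical HPV16-negative samples with no HPV16 detected.
for label, hits, total in [("sensitivity", 8, 8), ("specificity", 36, 36)]:
    low, high = wilson_interval(hits, total)
    print(f"{label}: {hits / total:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With an observed proportion of 100%, the upper bound is 100% and the lower bound reduces to n / (n + z²), which is why the interval for eight samples is much wider than the one for 36 samples.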
We have developed a new algorithmic method called VirusSeq for detecting known viruses and their integration sites in the human genome using NGS data. We evaluated VirusSeq on RNA-Seq data of 17 HCC and 239 HNSCC samples and showed that it accurately detects known viruses and their integration sites. VirusSeq can also perform this function using whole-genome sequencing data obtained from human tissue. The main limitation of VirusSeq is that it relies on a database of known viruses to nominate candidate viruses in human cancer tissue; it will therefore miss novel viruses that are not in that database. We expect VirusSeq to be an effective solution for detecting viruses and their integration sites in cancer studies. We invite users to test our software.