Summary: We developed a new algorithmic method, VirusSeq, for detecting known viruses and their integration sites in the human genome using next-generation sequencing data. We evaluated VirusSeq on whole-transcriptome sequencing (RNA-Seq) data of 256 human cancer samples from The Cancer Genome Atlas. Using these data, we showed that VirusSeq accurately detects the known viruses and their integration sites with high sensitivity and specificity. VirusSeq can also perform this function using whole-genome sequencing data of human tissue.
Availability: VirusSeq has been implemented in PERL and is available at http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
About 12% of all human cancers are known to be caused by viruses (Hausen, 2009); thus, the detection of viruses in human cancer tissue has significant clinical implications in oncology. The advent of next-generation sequencing (NGS) technologies using paired-end (PE) reads allows for the detection of viruses in human cancer tissue at unprecedented levels of efficiency and precision. Several groups have developed computational tools for pathogen/virus discovery by exploiting the great amount of NGS data obtained from human tissue (Isakov et al., 2011; Kostic et al., 2011). These groups have implemented a computational subtraction analysis, which has also been used to discover a new polyomavirus associated with most cases of Merkel cell carcinoma (Feng et al., 2008).
Although detecting viruses in human tissue is important in clinical oncology, investigating virus integration sites in host cell chromosomes is equally valuable, as insertional mutagenesis is one of the most critical steps in the pathogenesis of hepatitis B virus (HBV)-mediated hepatocellular carcinoma (HCC; Paterlini-Brechot et al., 2003). NGS data have been used to map the HBV integration sites in HCC samples (Jiang et al., 2012; Sung et al., 2012). However, no software tool is currently available for detecting viral integration sites from NGS data. We present VirusSeq, which starts with computational subtraction on NGS data and subsequently identifies viruses and their potential integration sites in the human genome with high specificity and sensitivity.
The PE reads in FASTQ format are used as input. VirusSeq works with both whole-genome and whole-transcriptome sequencing data. The raw PE reads are aligned to the reference genome using MOSAIK (Hiller et al., 2008) alignment software, which implements both a hashing scheme and the Smith–Waterman algorithm to produce gapped optimal alignments.
VirusSeq starts with computational subtraction of human sequences by aligning raw PE reads from whole-genome/transcriptome sequencing to the human genome reference. Subtracting the reads that map to the human reference leaves a set of non-human sequences. In the second step, VirusSeq aligns the non-human sequences against a comprehensive database that includes all known viral sequences from the Genome Information Broker for Viruses (http://gib-v.genes.nig.ac.jp/). It then quantifies the representation of each virus by the overall count of reads mapped within that virus genome and compares this count with an empirical cut-off to determine whether the virus is present in the sample. Any virus with an overall count of mapped reads below the cut-off is treated as absent. We used 1000 as the cut-off for the overall count of mapped reads within a virus genome; this cut-off should be applicable for both RNA-Seq data and whole-genome sequencing data with 30× coverage. The cut-off should be reduced by half or more for low-pass whole-genome sequencing data.
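The cut-off rule described above can be illustrated with a minimal Python sketch. This is not VirusSeq's PERL code: the function name `detect_viruses` and the input format (one virus name per mapped non-human read) are assumptions made for illustration only. The default of 1000 is the cut-off stated in the text.

```python
from collections import Counter

# Cut-off stated in the text for RNA-Seq and 30x whole-genome data.
DEFAULT_CUTOFF = 1000


def detect_viruses(mapped_virus_names, cutoff=DEFAULT_CUTOFF):
    """Return {virus: read_count} for viruses that reach the cut-off.

    mapped_virus_names is a hypothetical input: an iterable with one
    virus name per non-human read that aligned to that virus genome.
    Viruses with fewer mapped reads than the cut-off are treated as absent.
    """
    counts = Counter(mapped_virus_names)
    return {virus: n for virus, n in counts.items() if n >= cutoff}


if __name__ == "__main__":
    reads = ["HBV"] * 1500 + ["HPV16"] * 40
    print(detect_viruses(reads))              # default cut-off
    print(detect_viruses(reads, cutoff=500))  # e.g. low-pass WGS, cut-off halved
```

For low-pass whole-genome data, the text suggests reducing the cut-off by half or more, which corresponds here to passing a smaller `cutoff` value.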
The genome sequences of viruses that have well-established cancer associations and were detected in the previous step in our dataset from The Cancer Genome Atlas (TCGA) were concatenated into a single chromosome named chrVirus, with related annotation of each viral gene in refFlat format. A new hybrid reference genome named hg19Virus is built by combining hg19 and chrVirus (designated as chr25 in hg19Virus). All PE reads, without computational subtraction, are mapped to this reference (hg19Virus). If the two ends of a read pair are uniquely mapped, one to a human chromosome and the other to chr25, the read pair is reported as a discordant read pair. All discordant read pairs are then annotated with human and viral genes defined in the curated refFlat file. VirusSeq then clusters the discordant read pairs that support the same integration (fusion) event (e.g., HBV-MLL4). VirusSeq implements a dynamic clustering procedure (details in Supplementary Notes) to accurately determine the boundary of the cluster, whose size is constrained by the insert size (fragment length) distribution. To remove outliers within a cluster, VirusSeq implements the robust ‘extreme studentized deviate’ multiple-outlier detection procedure (Rosner, 1983). Once outliers are detected within a cluster, the cluster boundary is reset by excluding the outlier reads. VirusSeq reports fusion candidates using both the number of supporting pairs (at least four) and the number of junction-spanning reads (at least one) as cut-offs. In addition, an in silico sequence is generated for each fusion candidate from the consensus of reads within its discordant read cluster to aid PCR primer design, which facilitates quick PCR validation.
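The discordant-pair definition and the reporting cut-offs can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not VirusSeq's implementation: the `ReadPair` record, the `junction_reads` dictionary and the function names are hypothetical, and pairs are grouped simply by their annotated (human gene, viral gene) labels. The dynamic clustering and the extreme studentized deviate outlier removal described above are not reproduced.

```python
from collections import defaultdict
from typing import NamedTuple

VIRUS_CHROM = "chr25"  # chrVirus as designated in hg19Virus


class ReadPair(NamedTuple):
    """Hypothetical record for one aligned, annotated paired-end read."""
    chrom1: str       # chromosome of the first end
    chrom2: str       # chromosome of the second end
    unique1: bool     # first end maps uniquely
    unique2: bool     # second end maps uniquely
    human_gene: str   # annotation of the human end
    viral_gene: str   # annotation of the viral end


def is_discordant(pair):
    """True if both ends map uniquely, exactly one of them to chr25."""
    if not (pair.unique1 and pair.unique2):
        return False
    return (pair.chrom1 == VIRUS_CHROM) != (pair.chrom2 == VIRUS_CHROM)


def fusion_candidates(pairs, junction_reads, min_pairs=4, min_junction=1):
    """Count discordant pairs per (human_gene, viral_gene) and apply cut-offs.

    junction_reads is a hypothetical dict mapping the same key to the
    number of junction-spanning reads found for that event.
    """
    support = defaultdict(int)
    for pair in pairs:
        if is_discordant(pair):
            support[(pair.human_gene, pair.viral_gene)] += 1
    return {
        key: n
        for key, n in support.items()
        if n >= min_pairs and junction_reads.get(key, 0) >= min_junction
    }
```

Grouping by gene labels is a deliberate simplification; a position-based clustering bounded by the insert size distribution, as the text describes, would separate distinct integration events within the same gene.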
To test the accuracy of VirusSeq, we analysed RNA-Seq data of 17 HCC samples available in the TCGA database and detected HBV transcripts in four cases (Table 1), two of which are from patients with serologic evidence of HBV infection and one from a patient who is seronegative for HBV (and hepatitis C virus). Serology data were not available for the remaining case. Viral integration loci identified in our analysis included known genes MLL4 (two cancers; both from the two HBV-seropositive patients) and TERT, ITGAD, TEAD1, TECRL, C19orf55 and MIR548D2. Interestingly, the cancer with TERT-associated HBV sequences came from the patient who was reportedly seronegative for HBV. Our findings are consistent with other reports that have demonstrated HBV insertion in TERT and MLL4 (Ferber et al., 2003; Saigo et al., 2008).
We also analysed RNA-Seq data of 239 cases of head and neck squamous cell carcinoma (HNSCC) available in the TCGA database. We detected viral transcripts in 37 cancers: human papillomavirus (HPV) in 36 (30 cancers with HPV16, five with HPV33 and one with HPV35) and Epstein–Barr virus in one. In 24 cancers, HPV transcripts encoding key viral proteins/oncoproteins (E7 in 22 cases; E6 in 20 cases; E1 in 17 cases and E4 in eight cases) were integrated in the cancer genome, the majority in association with known genes. We used the HPV16 status from colorimetric in situ hybridization and the p16 immunohistochemistry data (clinical data) to estimate the sensitivity and specificity for HPV16 detection. A total of eight samples were HPV16-positive by colorimetric in situ hybridization (six HPV16-positive) and/or p16 immunohistochemistry (seven HPV16-positive), and 36 samples were HPV16-negative. The HPV16 status was not available for the remaining samples. The confidence intervals (CIs) were estimated using the Wilson score method, which takes the sample size into account. For this HNSCC dataset, the sensitivity was 100% (8/8) with a 95% CI of 67.6–100%, and specificity was 100% (36/36) with a 95% CI of 90.4–100% (Table 2).
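As a check on the reported intervals, the Wilson score interval for a binomial proportion can be computed with a few lines of Python. This is a minimal sketch using the standard Wilson formula without continuity correction; the function name `wilson_interval` is chosen here for illustration and is not part of VirusSeq.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a 95% interval.
    Returns (lower, upper) as proportions clamped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for label, k, n in [("sensitivity", 8, 8), ("specificity", 36, 36)]:
        lo, hi = wilson_interval(k, n)
        print(f"{label}: {k}/{n}, 95% CI {lo * 100:.1f}-{hi * 100:.1f}%")
```

With 8/8 and 36/36, the lower bounds come out at 67.6% and 90.4%, matching the intervals reported above. The difference in width between the two intervals shows how the method reflects sample size even when the point estimate is 100% in both cases.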
We have developed a new algorithmic method called VirusSeq for detecting known viruses and their integration sites in the human genome using NGS data. We evaluated VirusSeq on RNA-Seq data of 17 HCC and 239 HNSCC samples and showed that VirusSeq accurately detects the known viruses and their integration sites. VirusSeq can also perform this function using whole-genome sequencing data obtained from human tissue. The main limitation of VirusSeq is its reliance on a database of known viruses to nominate candidate viruses in human cancer tissue; novel viruses that are not in the database will be missed. We expect VirusSeq to be an effective solution for detecting viruses and their integration sites in cancer studies. We invite users to test our software.
This work was supported in part by the National Cancer Institute, U.S. National Institutes of Health, through grant U24CA143883 for TCGA, and grant P30 CA016672, the Cancer Center Support Grant to The University of Texas MD Anderson Cancer Center; also by the National Center for Research Resources through grant UL1TR000371; and the H. A. and Mary K. Chapman Foundation and the Michael and Susan Dell Foundation.
Conflict of Interest: none declared.