In recent years, the advent of NGS technology has greatly advanced genomic research. NGS can generate millions of reads ranging from 30 to 350 base pairs (bp) in length, depending on the sequencing platform used. Continuous improvement in NGS technology has substantially increased throughput while lowering cost [1]. With abundant aligned reads, many novel inferences can be made, including regulatory element identification, mutation detection, gene expression estimation, and detection of RNA splicing and fusion transcripts. NGS is expected to be a powerful tool for revealing genetic variations that contribute to various complex diseases by providing the sequence of a set of candidate genes, the whole exome, or the whole genome. For example, whole genome sequencing can help determine the frequency of tumor-specific point mutations in diseases such as multiple myeloma [2], while whole exome sequencing can be used to discover protein-coding mutations as well as small non-coding RNAs and aberrant transcriptional regulation that may contribute to diseases such as MDS [3].
SNV calling algorithms can be divided into two categories. The first category includes threshold-based commercial software packages such as Roche GSMapper and Lasergene, and the second comprises posterior-probability-based methods, including Maq [4], SOAPsnp [5], Varscan [6], and Atlas-SNP2 [7]. For threshold-based prediction methods, a good threshold setting is difficult to obtain and relies heavily on user experience [8].
In transcriptome-based data, the number of reads representing a given transcript varies greatly across genes, making it difficult to determine a minimum depth. Moreover, threshold-based methods provide no confidence measure for the prediction at each location. Compared to threshold-based methods, posterior probability (Bayesian) methods gain flexibility by considering the confidence of the observation at each position on the genome. For cancer genome sequencing data, sequencing errors, altered ploidy, and tumor cellularity are important factors affecting the accuracy of SNV calling. Although tools exist for SNV discovery from NGS data, few are specifically suited to data from tumors. Recently, SNVMix [9] addressed this problem by incorporating the dependency of nearby genotypes and the posterior probability to improve the accuracy of SNV prediction. However, SNVMix performs less well on data with low sequencing depth than on data with high sequencing depth. It has been observed that NGS provides lower sequence coverage in certain areas of the genome, including regulatory regions [10]. It is therefore necessary to improve SNV detection for tumor data with low sequencing depth. Moreover, although SNVMix achieves relatively high sensitivity within the Bayesian framework, its specificity is somewhat low and needs further improvement.
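The contrast between threshold-based and posterior-probability-based calling can be illustrated with a minimal sketch. It assumes a simple three-genotype binomial model (homozygous reference `aa`, heterozygous `ab`, homozygous variant `bb`); the per-genotype reference-read probabilities `MU`, the priors `PRIOR`, and the threshold values are hypothetical numbers chosen for illustration and are not taken from any of the cited tools.

```python
from math import comb

# Hypothetical probability that a read shows the reference base under each genotype.
MU = {"aa": 0.99, "ab": 0.5, "bb": 0.01}
# Hypothetical genotype priors (most positions are assumed to be non-SNVs).
PRIOR = {"aa": 0.98, "ab": 0.01, "bb": 0.01}


def genotype_posterior(n_ref, depth):
    """Posterior over genotypes given n_ref reference reads out of depth reads."""
    joint = {
        g: PRIOR[g] * comb(depth, n_ref) * MU[g] ** n_ref * (1 - MU[g]) ** (depth - n_ref)
        for g in MU
    }
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


def snv_probability(n_ref, depth):
    """Posterior probability that the position is an SNV (genotype ab or bb)."""
    post = genotype_posterior(n_ref, depth)
    return post["ab"] + post["bb"]


def threshold_call(n_ref, depth, min_depth=10, min_alt_fraction=0.2):
    """Threshold-style rule: require a minimum depth and a minimum variant-read fraction."""
    if depth < min_depth:
        return False
    return (depth - n_ref) / depth >= min_alt_fraction


if __name__ == "__main__":
    for n_ref, depth in [(2, 4), (10, 20), (20, 20)]:
        print(n_ref, depth, threshold_call(n_ref, depth), round(snv_probability(n_ref, depth), 4))
```

Under these assumed parameters, a position covered by only four reads, two of which carry the variant, is rejected outright by the depth threshold, whereas the posterior still assigns it a graded confidence. This is the flexibility referred to above, although the quality of the posterior at low depth depends heavily on the chosen priors.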
Hidden Markov models (HMMs) are widely used in many fields, such as speech and handwriting recognition, text classification, and DNA and protein classification [11]. Recently, an HMM-based program, VARiD [12], was developed for SNV prediction on data from multiple sequencing platforms. VARiD focuses mainly on color-space sequence and does not fully consider, in its model, the mapping quality of the aligned reads or the base quality of the corresponding bases on those reads. Moreover, this method is time-consuming for whole genome analysis and has not yet been applied to RNA-Seq or whole exome sequence analysis of tumor data.
In this paper, we develop SNVHMM, an algorithm based on a discrete HMM for SNV prediction in tumor data obtained from NGS. Because non-SNVs are prevalent and contiguous in the genome [13], and point mutations in cancer data are associated with certain genes and concentrated in the corresponding regions [14], contextual information, especially for non-SNVs, can be fully exploited in addition to the overall distribution information used by the traditional Bayesian framework. SNVHMM is therefore expected to gain additional power from the contextual information on the genome compared to the traditional Bayesian framework, and to achieve better SNV prediction performance. Moreover, by adding contextual information to the overall distribution information, SNVHMM is also expected to improve the statistical performance of the Bayesian method on tumor data with low sequencing depth.
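To make the role of contextual information concrete, the following is a minimal sketch of Viterbi decoding in a generic two-state discrete HMM. It is not the SNVHMM model itself: the state names, the three discretized observation symbols, and all start, transition, and emission probabilities are hypothetical values chosen only to show how neighboring positions influence the label assigned to each position.

```python
from math import log

STATES = ("non_snv", "snv")

# All probabilities below are hypothetical, for illustration only.
START = {"non_snv": 0.99, "snv": 0.01}
TRANS = {
    "non_snv": {"non_snv": 0.99, "snv": 0.01},  # non-SNVs tend to occur in long runs
    "snv": {"non_snv": 0.90, "snv": 0.10},
}
EMIT = {
    "non_snv": {"match": 0.90, "few_mismatch": 0.099, "many_mismatch": 0.001},
    "snv": {"match": 0.05, "few_mismatch": 0.25, "many_mismatch": 0.70},
}


def viterbi(observations):
    """Return the most probable state path for a sequence of observation symbols."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state at the first position.
    score = {s: log(START[s]) + log(EMIT[s][observations[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            new_score[s] = score[best_prev] + log(TRANS[best_prev][s]) + log(EMIT[s][obs])
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["match", "match", "many_mismatch", "match"]))
    print(viterbi(["match", "few_mismatch", "match"]))
```

With these assumed parameters, a position with many mismatching reads is labeled `snv` even though its neighbors are `non_snv`, while a position with only a few mismatches surrounded by matching positions is absorbed into the `non_snv` run. The transition probabilities are what encode the prevalence and contiguity of non-SNVs described above.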