The detection of somatic SNVs in tumors is an important part of tumor resequencing because these mutations can be directly relevant to the disease and are the most numerous. One method of discovering somatic SNVs is to compare the sequencing results from a matched tumor and normal pair. To this end, we developed SomaticSniper to directly compare the tumor and normal reads and calculate the probability that the two samples have identical genotypes.
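The idea of scoring a site by the probability that tumor and normal share a genotype can be illustrated with a minimal sketch. It is not the SomaticSniper implementation: it assumes three diploid genotypes at a biallelic site, per-sample genotype likelihoods supplied by some upstream genotyper, a flat prior, and independence between the two samples. The function names and the Phred-style scaling are illustrative choices.

```python
import math

# Hypothetical genotype labels for a biallelic site (A = reference, B = variant).
GENOTYPES = ("AA", "AB", "BB")


def normalize(likelihoods):
    """Turn raw genotype likelihoods into posteriors under a flat prior."""
    total = sum(likelihoods[g] for g in GENOTYPES)
    if total <= 0:
        raise ValueError("likelihoods must contain at least one positive value")
    return {g: likelihoods[g] / total for g in GENOTYPES}


def somatic_score(tumor_lik, normal_lik):
    """Phred-scaled probability that tumor and normal genotypes are identical.

    Assumes the two samples are independent given their genotypes, which is
    a simplification made for this sketch only.
    """
    tumor_post = normalize(tumor_lik)
    normal_post = normalize(normal_lik)
    p_identical = sum(tumor_post[g] * normal_post[g] for g in GENOTYPES)
    p_identical = max(p_identical, 1e-300)  # guard against log10(0)
    return -10.0 * math.log10(p_identical)


if __name__ == "__main__":
    # Tumor reads favor a heterozygous variant; normal reads favor reference.
    tumor = {"AA": 1e-12, "AB": 1.0, "BB": 1e-9}
    normal = {"AA": 1.0, "AB": 1e-10, "BB": 1e-20}
    print(round(somatic_score(tumor, normal), 1))
```

A high score means the data make identical genotypes unlikely, so the site is a candidate somatic SNV. When both samples support the same genotype, the score is close to zero.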
Our simulations of the algorithm show that it should be able to detect most mutations if the mutation is present in the majority of cells and the normal sample is relatively pure. We have evaluated SomaticSniper on external data and found it to be more sensitive than other methods and, based on the total number of calls, of comparable specificity. Additionally, we have explored the precision of our algorithm by validating predicted somatic mutations on internally generated data. In contrast to our simulations, which suggested an FDR <15% in the absence of mapping error, this validation experiment demonstrated a higher FDR. Our subsequent investigations revealed a number of reliable indicators that a predicted variant was, in fact, not real. Most interestingly, we identified an association of false positive bases with the Illumina Q2 base quality designation. This new feature may also prove useful in other false positive reduction techniques, such as base quality recalibration. By implementing some statistical and empirical filters, we were able to greatly increase the validation rate on both our training set and four independent datasets, with only a small number of validated somatic mutations failing the filters. Although our precision is low on the AML sample, this is expected because there are fewer detectable events owing to both tumor cells in the normal sample and a lower mutation rate for this cancer type. In solid tumors, where neither problem is likely to be as severe, we expect that the precision should be similar to that observed on the tested breast cancer tumors.
Despite the success of SomaticSniper on the COLO-829 data, this dataset represents an ideal case for somatic SNV calling and there remains room for improvement in future work. Since COLO-829 is a cell line, it represents the simple case of a perfectly pure, homogeneous tumor with a perfectly pure, homogeneous matched normal. Cancer projects will rarely work with such an ideal sample, and tumors can be expected to contain multiple subclones with varying expected allele frequencies depending on their site-specific copy number and abundance within the tissue sample. Indeed, the internal data with which we evaluated our precision were obtained from a patient with a high white blood cell count (105 000 cells per microliter) and our data indicate that ~30% of the cells from the normal sample were, in fact, tumor (Ley et al., 2008). In addition, the tumor sample will likely be impure in many cases. While the matched normal can be expected to be free of tumor cells for most samples, this may not always be the case (especially for liquid tumors that circulate into all tissues, and for solid tumors where the matched normal tissue is obtained from adjacent tissue).
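The dependence of expected allele frequency on purity and copy number can be made concrete with a short worked calculation. The sketch below assumes a simple two-population mixture (tumor cells carrying the mutation and unmutated normal cells) and ignores subclonality and sequencing error. The function name and parameters are hypothetical and are not part of SomaticSniper.

```python
def expected_vaf(tumor_fraction, mutated_copies=1, tumor_copy_number=2,
                 normal_copy_number=2):
    """Expected variant allele fraction in a sample mixing tumor and normal cells.

    tumor_fraction     -- fraction of cells in the sample that are tumor (0 to 1)
    mutated_copies     -- copies of the locus carrying the mutation per tumor cell
    tumor_copy_number  -- total copies of the locus per tumor cell
    normal_copy_number -- total copies of the locus per normal cell
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    if mutated_copies > tumor_copy_number:
        raise ValueError("mutated_copies cannot exceed tumor_copy_number")
    mutant_alleles = tumor_fraction * mutated_copies
    total_alleles = (tumor_fraction * tumor_copy_number
                     + (1.0 - tumor_fraction) * normal_copy_number)
    return mutant_alleles / total_alleles if total_alleles > 0 else 0.0


if __name__ == "__main__":
    # A "normal" sample in which 30% of cells are tumor, heterozygous diploid site.
    print(expected_vaf(0.30))
    # A pure tumor with the same heterozygous mutation.
    print(expected_vaf(1.0))
```

Under these assumptions, a heterozygous mutation in a diploid region would appear at roughly 15% of reads in a normal sample containing 30% tumor cells, compared with 50% in a pure tumor. This is why a contaminated normal can make a true somatic variant look like a germline one.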
Our simulation studies demonstrate the rapid decline of detection power that occurs when the normal sample contains tumor cells. This is due, in part, to the assumptions of the MAQ genotyping model underlying SomaticSniper, which currently operates by ignoring the copy number state and sample purity. This is true for SNVMix2 as well, since it derives its expected genotype frequencies from training on germline variants. Optionally, our caller can take into account the prior knowledge that somatic SNVs are expected to be rare, although our testing suggests that incorporating such a penalty significantly reduces the sensitivity of the algorithm at current WGS coverage levels (Supplementary Fig. S1). As outlined above, these assumptions are inappropriate for optimal somatic SNV calling, and future improvements must take these issues into account.
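To illustrate why a rarity prior can cost sensitivity, the following sketch applies a joint prior over tumor/normal genotype pairs before computing a Phred-scaled probability of identical genotypes. The prior's form (mass `somatic_prior` spread evenly over discordant pairs, the remainder spread evenly over concordant pairs) is an assumption made for illustration. It is not the prior implemented in SomaticSniper, and the likelihood values are invented.

```python
import math
from itertools import product

# Hypothetical genotype labels for a biallelic site (A = reference, B = variant).
GENOTYPES = ("AA", "AB", "BB")


def somatic_score_with_prior(tumor_lik, normal_lik, somatic_prior):
    """Phred-scaled posterior probability that tumor and normal genotypes match.

    somatic_prior is the assumed prior probability that the genotypes differ.
    """
    n_same = len(GENOTYPES)
    n_diff = len(GENOTYPES) ** 2 - n_same
    same_weight = (1.0 - somatic_prior) / n_same   # prior per concordant pair
    diff_weight = somatic_prior / n_diff           # prior per discordant pair

    p_same = 0.0
    p_total = 0.0
    for g_tumor, g_normal in product(GENOTYPES, repeat=2):
        weight = same_weight if g_tumor == g_normal else diff_weight
        joint = tumor_lik[g_tumor] * normal_lik[g_normal] * weight
        p_total += joint
        if g_tumor == g_normal:
            p_same += joint

    p_identical = max(p_same / p_total, 1e-300)  # guard against log10(0)
    return -10.0 * math.log10(p_identical)


if __name__ == "__main__":
    # Modest evidence for a heterozygous variant in tumor, reference in normal.
    tumor = {"AA": 1e-3, "AB": 1.0, "BB": 1e-6}
    normal = {"AA": 1.0, "AB": 1e-3, "BB": 1e-9}
    flat = somatic_score_with_prior(tumor, normal, somatic_prior=6 / 9)
    rare = somatic_score_with_prior(tumor, normal, somatic_prior=1e-6)
    print(round(flat, 1), round(rare, 3))
```

With the flat prior (every genotype pair equally likely), the example site scores about 27. With a one-in-a-million somatic prior, the same evidence yields a score close to zero. At modest coverage the likelihood ratio is rarely strong enough to overcome such a penalty, which is consistent with the loss of sensitivity described above.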
Sites predicted to be somatic by our method will include both true mutations and some false positives. The number of false positives is likely to be a function of the quality of the reference sequence, the alignments and the data, and of the ability to provide accurate error estimates to the software. While our filters increase precision, fully specified error models or adjustments to the error estimates provided in the mapping qualities and base qualities of the data ought to improve the specificity and sensitivity of these filters.