2.1 Problem formulation
Given tumour–normal paired allelic counts obtained from NGS sequence data aligned to the human reference genome, we focus on the problem of identifying the joint-genotype
(see below) of the samples at every location in the data with coverage. For simplicity, and following standard convention, we imagine that each position has only two possible alleles, A
. The allele A
indicates that the nucleotide at a position matches the reference genome and B
indicates that the nucleotide is a mismatch. In NGS data, we can measure the presence of these alleles using binary count data that examines all reads at a given site i
and counts the number of matches, ai
, and mismatches, bi
(Goya et al., 2010
). In , we see how this formalism can be extended to tumour–normal paired samples.
Fig. 1. Hypothetical example of the JointSNVMix analysis process. Reads are first aligned to the reference genome (green). Next the allelic counts, which are the number of matches and depth of reads at each position are tabulated. Allelic count information can (more ...)
For a diploid genome, we consider all pairs of alleles that gives rise to the set,
}, the set of diploid genotypes. Now given two diploid samples, the set of possible joint-genotypes
consists of all combinations of diploid genotypes, which is equivalent to the Cartesian product of
with itself, i.e.
We assume the joint genotype of a given position can be mapped onto the more biologically interpretable set of marginal genotypes according to . This can be done by assigning the joint genotype to the most probable state, or marginalizing together the joint genotype probabilities for a given state. As an example of marginalization, we compute
)), i.e. the sum of probabilities of a wild-type genotype in the normal data and a variant genotype in the tumour data.
The nine possible joint genotypes and their associated mappings onto biologically interpretable marginal genotypes
2.2 JointSNVMix models
JointSNVMix1 and JointSNVMix2 are generative probabilistic models that describe the joint emission of the allelic count data observed at position i in the normal and tumour samples. shows the graphical models representing JointSNVMix1 and JointSNVMix2. A complete description of the notation and model parameters is given in .
Fig. 2. Probabilistic graphical model representing the (a) JointSNVMix1 and (b) JointSNVMix2 model. Shaded nodes represent observed values or fixed values, while the values of unshaded nodes are learned using EM. Only the distributions for the normal are shown (more ...)
Parameters in the model are learned using the EM algorithm as discussed below, while hyper-parameters are fixed to the value given
We introduce a random variable Gi as a Multinomial indicator vector representing the joint genotype of the samples. More explicitly Gi=(G(AA,AA)i, G(AA,AB)i,…, G(BB,BB)i) where G(gN,gT)i=1 if the joint genotype of position i is (gN, gT), and G(gN,gT)i=0 otherwise. We assume the count data from the two samples are jointly emitted from Gi thus capturing correlations between the variables, and allowing statistical strength to be borrowed across the samples. This is the key insight that differentiates this model from running an independent analysis of each sample and joining the inferred genotypes post hoc.
Given the joint genotype of the sample, we model the normal and tumour sample as being conditionally independent. For JointSNVMix1, the conditional distribution for each sample is modelled as a three component mixture of Binomial densities, where the densities correspond to the genotypes AA
. These conditional densities are the same as used by SNVMix1 model (Goya et al., 2010
). For JointSNVMix2, the conditional densities are the same as SNVMix2 (Goya et al., 2010
), which allows for the incorporation of base and mapping quality information. A complete description of the model is available in the Supplementary Material
2.3 Inference and parameter estimation
We use the expectation maximisation (EM) algorithm to perform maximum a posteriori
(MAP) estimation of the values of the model parameters and latent variables. One could hand-set parameters of the model to intuitive values; however, we expect that fitting the model will allow for sample-specific adjustments to inter-experimental technical variability and inter-sample variation from tumour–normal admixture (so called tumour cellularity) in the tumour samples. A full derivation of the update equations for JointSNVMix1 and JointSNVMix2 is given in the Supplementary Material
We initialize training with user supplied parameter values. Local minima are a potential pitfall when training with EM. To check the models sensitivity to initial parameter values, we generated 100 sets of random parameters and started training from these parameters. We observed that the trained parameters consistently converge to the same values, suggesting local optima are not a major problem for this model (Supplementary Figs S2 and S3
Storing the posterior marginals generated in the E-step requires
) memory, thus for a large dataset training may exhaust available memory. To circumvent this problem, we subsample every n
-th position with coverage exceeding a specified threshold in both the tumour and normal. For the results presented below, we subsampled every 100-th position with a coverage of at least 10 reads. Lower values of n
and hence larger subsample sizes will lead to improved parameter estimates at the cost of using more memory and CPU time.
2.4 Implementation and performance
The JointSNVMix software package is implemented using the Python programming language with computationally intensive portions written in Cython. The input is aligned sequence data in base space stored in the SAM/BAM format (Li et al., 2009
) from a tumour–normal pair. The program implements two main commands, train and classify. A typical analysis consists of training to learn the model parameters. These can then be used with the classify command. Doing both steps of the analysis using coding space positions from a 30× genome takes ~6 h on a four core Intel i7 processor running at 1.73 GHz with 8 GB memory.
: we generated the synthetic data by sampling 106
sites from the JointSNVMix1 model in a. We used the following parameters:
is normalized so that the sum over all entries equals 1. The depths dNi
were sampled for a Poisson distribution with parameter λ=10. The synthetic data along with class labels is available in Supplementary Material
We set parameters for the AB class in the vectors μN, μT to 0.6, which is slightly skewed towards the reference allele. We did this to ensure the parameters would be different than the hand-set defaults used in the untrained version of JointSNVMix1 and SNVMix1.
For the SNVMix1 and JointSNVMix1 models, we used a threshold
)≥0.5 to call mutations somatic in this experiment. We did not benchmark SNVMix2 and JointSNVMix2 because our simulation technique does not generate quality scores.
: we analysed matched tumour/normal data from Morin et al. (2011
) (dbGAP study accession phs000235.v2.p1). Exome data for patients A and B was captured using Agilent SureSelect, and subsequently sequenced on the Illumina GA II platform and aligned with Burrows-Wheeler Aligner (BWA). For patients C-M, the data were generated by whole genome shotgun sequencing (WGSS) and was run on the Illumina Hiseq2000 platform and aligned with BWA as described in Morin et al. (2011
Ground truth data was predicted from the primary tumour genome and RNA-Seq data with the SNVMix software package followed by manual curation to remove artefacts. Support on both strands was required and variants near gapped alignments disregarded. Two or more high-quality bases matching SNV were required. Finally, putative variants were visually inspected in IGV. Validation of the non-synonymous curated predictions by targeted Sanger sequencing of the normal and tumour samples was performed to establish the true somatic mutations. There were 312 unique positions validated as somatic mutations across all patients in this study. Complete details are available in Morin et al. (2011
In the analysis presented below, only coding space positions were analysed. For SNVMix2 and JointSNVMix2, no pre-processing was performed. For the other models, we removed bases with base or mapping qualities <10. Summary statistics for the aligned data are included in the Supplementary Material
2.6 Alternative methods
In order to evaluate the effect of modelling the joint distribution of the tumour and normal data, we compared the JointSNVMix1 and JointSNVMix2 models to their independent analogues, SNVMix1 and SNVMix2. We re-implemented the SNVMix models in order to compare classifier performance without introducing variation due to implementation. We also considered independent and joint methods for classification which use Fisher's exact test. We include these two methods to verify the performance difference we observe in the SNVMix models are due to joint analysis.
Independent Fisher: this method uses a right-tailed Fisher's exact test in order to test the null hypothesis that the number of variant bases observed is due to random error. If the null hypothesis is not rejected at P-value of 0.05, the site is assigned the genotype AA. Otherwise, a site is assigned a genotype AB if the frequency of the B-allele is between 0.2 and 0.8, and a genotype of BB if this frequency is >0.8.
Joint Fisher: this method first applies the Fisher method to call the genotypes in normal and tumour separately. Putative somatic sites are identified and a two-tailed Fisher's exact test is run to test the null hypothesis that the normal and tumour count data do not differ significantly. If the null hypothesis is not rejected at P-value of 0.05, the putative somatic site is reassigned the reference genotype (AA,AA).
: we re-implemented the SNVMix1 model used in the published SNVMix software (Goya et al., 2010
). To assign joint genotype probabilities, we take the genotype probabilities from the normal and tumour samples and multiply them to obtain the joint genotype probabilities.
: we re-implemented the SNVMix2 model used in the published SNVMix software (Goya et al., 2010
). Joint genotype probabilities are derived as for SNVMix1.
2.7 Performance metrics
Since exhaustive validation of both somatic and non-somatic events was not available for the datasets used, we measured the concordance of the somatic predictions with databases known to be enriched for somatic or germline mutations. In theory, classifiers that predict more true somatic mutations should show higher concordance with the database of somatic mutations and lower concordance with the germline database.
To generate a database of somatic mutations, we took the complete set of 312 unique somatic mutations validated across the cases and joined them with COSMIC v54 database (Forbes et al., 2011
). We used the SNP predictions from the 2010/11/23 release of the 1000 genomes projects (Durbin et al., 2010
), with the positions found in COSMIC v54 removed, as the database of likely germline variants.
For the SNVMix and JointSNVMix methods, which assign probabilities to their predictions, we plot curves based on the rank ordering of somatic predictions. We do this by computing concordance as the probability threshold for classification is lowered. For the Fisher methods, which do not assign scores to their predictions, we plot a point in space for the complete set of predictions.
To estimate the recall rate of the JointSNVMix models, we computed how many of the validated mutations for a case were found by the models using a threshold of
)≥0.5. There were 307 validated positions across the 12 cases used for this analysis. This number differs from the 312 unique positions because some mutations are found across multiples cases, and some cases from the original study were not included in our analysis.
We note that copy number variation (CNV) due to segmental aneuploides and abnormal karyotypes could affect predictions. To assess the effect of CNV on prediction accuracy, we repeated the recall experiment using positions found in regions of predicted CNVs. CNVs were predicted (Supplementary Fig. S4
) using an in-house tool, HMMCopy, available from http://compbio.bccrc.ca
. This tool requires whole genome data, so patients A and B had to be excluded from this analysis. When we excluded these cases there were 187 validated somatic positions, 44 (23.5%) of which were found in regions of CNV.