We have developed a novel CNV calling methodology using read depth information from exome sequence data and implemented it within an R package called ExomeDepth. This allowed us to maximize the statistical power to detect even small CNVs in the presence of technical variability inherent to high-throughput DNA sequencing technologies. As a consequence, compared with other tools, we found the greatest improvement for datasets that show more technical variability across samples. Although ExomeDepth is designed to be used as a standalone software, the contruction of a reference set is a problem shared across all read depth CNV calling algorithms for exome data. Therefore, additional refinements proposed by other CNV calling algorithms to improve calling accuracy could potentially be applied after ExomeDepth has identified the optimum aggregate reference set.
Because ExomeDepth assumes that the CNV call is not present in the reference sample, it is best suited to call rare CNVs. Nevertheless, our analysis of exome samples from the 1000 Genome Project indicates that it can call common CNVs as well, even though some power is lost when the allele frequency is high. Our computations indicate that the power to detect rare CNVs is maximized for a reference:test ratio of ~10:1. Hence, while we find no obvious benefit in using a very large dataset (> 100 exomes), it is essential to generate exome data in batches of six or more samples.
Our analysis shows that a Gaussian model for the read count data is not appropriate for exome sequence data, which is likely to explain the discrepancy with the observed larger number of CNV calls generated by ExomeCNV. The discrepancy with exomeCopy is more surprising, because exomeCopy fits a robust negative binomial model related to our model. Additionally, in the exomeCopy analysis, we used the optimized reference set identified by the ExomeDepth analysis. The default parameters of the exomeCopy hidden Markov chain were also similar. Comparing both methods, we find that the main difference is that exomeCopy essentially takes a profile likelihood approach: it uses the median normalized read depth at each exon to account for the variability in exon capture efficiency, which is a nuisance parameter. However, the median read depth can be unreliable, if the sample size is small and/or the data are noisier (e.g. in Dataset 2). Instead, in ExomeDepth we used a logistic model, which deals with that nuisance parameter by conditioning on the total read count for each exon. We hypothesize that exomeCopy would show results more similar to ours, if the exon-specific parameters were estimated within the negative binomial model rather than prior to model fitting. However, the number of exons, and therefore, the number of parameters to estimate, would be large which makes it difficult in practice.
In our study the statistical power to detect CNVs varied extensively between two datasets and was largely determined by sample-to-sample variability within datasets that emerged either at the exome capture or at the sequencing step. This feature of the data cannot be detected by commonly used single-sample quality metrics: it is the correlations across samples rather than the single-sample summary statistic that are relevant. Owing to its essential role for CNV calling, we argue that a measure of sample-to-sample consistency (e.g. correlation between FPKM values) should be provided by sequencing facilities when exomes are analysed in sufficiently large batches.
Finding of the GATA2 and DOCK8 deletions illustrates the power of ExomeDepth to identify even heterozygous and small CNVs comprising just one to two exons. We conclude that reducing technical variability between the samples and using bioinformatics tools that maximize statistical power of CNV detection, such as ExomeDepth, will allow efficient CNV identification and will increase the value of the future exome sequencing experiments.