The resolution of CNV detection with ExomeCNV is limited largely by the probe design. The CNV segments identified by our method range from 120 bp (single exons with higher-than-average coverage) to 240 Mb in size (whole chromosomes); however, the true breakpoint can lie anywhere between the terminal exon called within a CNV region and the adjacent exon in a non-CNV region. Hence, although a given CNV event can be detected at a single exon in some instances, the absolute resolution of our method is in fact limited to the inter-exon distance around an exon, which can be as small as 125 bp or as large as 22.8 Mb, with a median of 5 kb (statistics based on SureSelect Human All Exon Kit G3362).
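The breakpoint uncertainty described above can be illustrated with a minimal sketch. It assumes exons on one chromosome are given as sorted, non-overlapping (start, end) pairs; the function names and the toy coordinates are hypothetical and are not part of ExomeCNV.

```python
from statistics import median
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end) on one chromosome, sorted by start
Interval = Tuple[Optional[int], Optional[int]]


def inter_exon_gaps(exons: List[Exon]) -> List[int]:
    """Distance from the end of each exon to the start of the next one."""
    return [nxt[0] - cur[1] for cur, nxt in zip(exons, exons[1:])]


def breakpoint_intervals(exons: List[Exon], first: int, last: int) -> Tuple[Interval, Interval]:
    """Intervals that must contain the true breakpoints of a called CNV.

    `first` and `last` are the indices of the terminal exons called inside
    the CNV. A bound is None when the CNV reaches the first or last exon,
    because no flanking exon exists to constrain the breakpoint.
    """
    if not (0 <= first <= last < len(exons)):
        raise ValueError("exon indices out of range")
    left = (exons[first - 1][1] if first > 0 else None, exons[first][0])
    right = (exons[last][1], exons[last + 1][0] if last + 1 < len(exons) else None)
    return left, right


# Toy coordinates (illustrative only): a CNV called on exons 1 and 2.
exons = [(1000, 1200), (6200, 6320), (6900, 7100), (30000, 30150)]
left, right = breakpoint_intervals(exons, 1, 2)
print("left breakpoint within:", left)
print("right breakpoint within:", right)
print("median inter-exon gap:", median(inter_exon_gaps(exons)))
```

In this toy case the called segment spans 900 bp, yet the right breakpoint could lie anywhere within a 22.9 kb gap, which is why the effective resolution is set by the flanking inter-exon distances rather than by the size of the called exons.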
Although ExomeCNV relies on the availability of matched control samples, a control can also be derived from a pool of other samples, which then serves as an effective control. This is useful for the identification of germline inherited or de novo CNVs in an individual. Because the expected copy number in the reference population is constant (usually two), by the law of large numbers, averaging depth-of-coverage from sufficiently many samples yields a good control set, assuming that they are all captured using the same probe set and capture method and sequenced in the same manner. This may limit the application of ExomeCNV to data generated at a given site with a given protocol. Calling CNVs using this pooled sample as background will generate CNV calls that are present in the case sample but not in the control population. Also, by the central limit theorem, pooling independent samples helps reduce variance in depth-of-coverage and increases the precision of our method. We have pooled as few as eight samples and have observed that this is indeed the case (Supplementary Materials). However, it is important to note that using a pooled sample as control imposes a strong assumption that the samples do not share common CNV regions and that the population has an average genomic copy number of two. Other potential challenges of using the pooled sample as control are discussed in the Supplementary Materials.
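As an illustration of the pooling idea, the following sketch averages per-exon depth-of-coverage across samples. It assumes each sample is first scaled by its total depth so that differences in library size do not dominate the average; that normalization step and the function name `pooled_control` are our assumptions for this example and may differ from what ExomeCNV does internally.

```python
from statistics import mean
from typing import List, Sequence


def pooled_control(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average per-exon coverage across samples after library-size scaling.

    Each inner sequence holds the depth-of-coverage of one sample over the
    same ordered set of exons. Returns the mean fraction of total coverage
    falling on each exon.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n_exons = len(samples[0])
    if any(len(s) != n_exons for s in samples):
        raise ValueError("all samples must cover the same exons")

    normalized = []
    for depths in samples:
        total = sum(depths)
        if total <= 0:
            raise ValueError("sample has no coverage")
        normalized.append([d / total for d in depths])

    # Transpose so that each column holds one exon across all samples.
    return [mean(column) for column in zip(*normalized)]


# Two toy samples with the same coverage profile at different depths.
control = pooled_control([[10, 20, 70], [20, 40, 140]])
print(control)
```

If one of the pooled samples carried a CNV over an exon, its contribution would shift that exon's average, which is the reason for the assumption that the pooled samples do not share common CNV regions.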
Because ExomeCNV depends on an estimate of the admixture rate c, misspecification of c would affect its performance. We performed sensitivity analysis and found that misestimating c would have a strong effect on the sensitivity and specificity of CNV detection. Fortunately, LOH detection provides some data to directly estimate c, as LOH detection does not depend on prior knowledge of c. For the melanoma sample, our estimate of a 30% admixture rate matches that from genotyping arrays, confirming the validity of this approach. However, there are advantages to slightly overestimating c, as it makes the method more conservative and reduces false positives.
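To make the role of c concrete, the sketch below computes the expected case-to-control depth-of-coverage ratio under a simple two-population mixing model. It assumes c is the fraction of normal (two-copy) cells in the case sample and that coverage scales linearly with copy number; this is an illustrative model with a hypothetical function name, not necessarily the exact formulation used in ExomeCNV.

```python
def expected_ratio(copy_number: float, c: float, normal_copies: float = 2.0) -> float:
    """Expected case/control coverage ratio for a region with `copy_number`
    copies in the aberrant cells, when a fraction `c` of the case sample
    consists of normal cells carrying `normal_copies` copies.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("admixture rate c must lie in [0, 1]")
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    # Normal cells contribute a ratio of 1; aberrant cells contribute
    # copy_number / normal_copies.
    return c + (1.0 - c) * copy_number / normal_copies


# Compare a pure sample with increasingly admixed ones.
for c in (0.0, 0.3, 0.5):
    deletion = expected_ratio(1, c)
    duplication = expected_ratio(3, c)
    print(f"c={c:.1f}  deletion ratio={deletion:.2f}  duplication ratio={duplication:.2f}")
```

Under this model, a single-copy deletion moves the expected ratio from 0.5 in a pure sample to 0.65 at c = 0.3, so the signal to be detected shrinks as admixture grows and the assumed value of c directly sets the shift that the test looks for.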
As we have shown, CNV and LOH detection is readily possible from exome sequencing data, extending the utility of this powerful approach. The fundamental basis that makes this approach possible is the consistency of depth-of-coverage (and, by extension, BAF) of each individual exon across multiple samples, as demonstrated in five samples sequenced in our laboratory (Supplementary Materials). This consistency permits reliable parametric modeling of the shift in depth-of-coverage and BAF distributions, and hence accurate identification of CNV and LOH. However, we do not observe the same level of consistency when comparing depth-of-coverage across different library types. For instance, a sixth sample was sequenced using a paired-end approach, which resulted in very different coverage of each exon. As a result, ExomeCNV does not perform well when the control sample library is of one type and the case is of another, or when the case and control have significantly different coverage levels. Resolving these issues is a work in progress.
From the analytical power calculations, assuming 35× coverage (which is the lower end of a reasonable amount of sequence for variant calling and is easy to generate with a variety of technologies), CNV detection has a limit of about 500 bp (in transcript coordinates), which is typically equivalent to 2–3 exons and spans about 10 kb of genomic space on average. Increased depth-of-coverage, which is likely to become the norm as sequencing costs decrease, reduces the interval size that is reliably detectable and should push the method to single-exon deletion resolution. Currently, CNV and LOH information should be detectable in whole-exome sequencing data at a resolution that is almost equivalent to what one can obtain from a dense SNP genotyping array.
ExomeCNV is available as a CRAN package ‘ExomeCNV’.