Although we have focused here on the insights into mutational mechanisms that can be gained when CNV breakpoints are mapped to base-pair resolution, there are two other important applications of this knowledge. Mapping CNVs to base-pair resolution allows precise annotation of function, including whether each CNV overlaps functional sequences and the likely impact on those sequences. In addition, base-pair resolution enables the development of breakpoint-specific genotyping assays, which, by virtue of their qualitative nature, are likely to be more robust than quantitative assays for the same variants and thus more useful in locus-specific population surveys, such as association studies.
Genome-wide resequencing has recently become possible, but the cost still prohibits the ascertainment of CNV breakpoints from a large number of samples. Many fundamental research questions require approaches to sampling that differ from those of large international genome-resequencing projects (such as the 1,000 Genomes Project), including sampling a variety of tissues, individuals or organisms. As the technology matures, targeted resequencing could be the gold standard for validation in CNV studies. Moreover, we have shown that not predicating breakpoint sequencing on prior assumptions of the underlying allelic structure allows complex events to be discerned that may have been missed by PCR-based approaches.
Although we were able to increase the number of sequenced breakpoints by using two mapping pipelines, we did not exhaustively explore all possible mapping strategies. There are likely to be additional breakpoint sequences to be mined from these data, perhaps corresponding to complex rearrangements. The vast majority of events we have identified here are deletions, despite our expectation that at least 20% of targeted events are duplications14. A modified strategy for capturing duplications (targeting additional sequence within the breakpoints and using de novo assembly of all targeted reads) seems particularly appropriate, considering the enrichment of repetitive contexts at duplication breakpoints14.
Our experimental approach may not have ascertained all classes of CNV. We discovered the target CNVs by array CGH, a platform that is not well suited for identifying polymorphism of extremely high–copy number repeats or heterochromatin. Moreover, breakpoints embedded in repeats much larger than 300 bp cannot be sequenced with the approach used here. In the short term, the most complete picture of mutation processes will come from integrating information from multiple experiments.
Through power simulations, we showed that breakpoints for only a minority of targeted CNVs were likely to be found by this experiment. Nonetheless, substantially fewer breakpoints were recovered than we predicted through simulations. Several properties of real data may account for this. First, we did not simulate our reads with sequencing error, and the assumption of error-free sequencing allows a higher proportion of simulated reads to be mapped with confidence. Second, breakpoint-spanning reads have shorter contiguous matches to the reference genome than unsplit reads, and we did not attempt to model the effect that this lower sequence homology may have on capture efficiency. Third, the several mutation models we considered were only simple models of deletion and duplication; more complex models will presumably lower both the sampling and mapping power. Finally, it is possible that the locations of CNV breakpoints within target regions are biased toward sequences within the target region that have lower probe densities, and thus sampling power is not uniform across the target region.
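The sampling component of such a power calculation can be illustrated with a small Monte Carlo sketch. This is not the simulation used in this study: the read length, minimum anchor length, minimum read count and window size are hypothetical parameters, read starts are assumed to be uniform, and, like the simulations described above, the sketch assumes error-free reads and ignores capture bias.

```python
import random

def spanning_read_power(coverage, read_len=250, min_anchor=20, min_reads=2,
                        window=5000, n_trials=1000, seed=0):
    """Estimate the probability that at least `min_reads` reads span a
    breakpoint with at least `min_anchor` bases on each side of it.

    All defaults are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    breakpoint_pos = window // 2                    # breakpoint sits mid-window
    n_reads = round(coverage * window / read_len)   # reads giving roughly this mean depth
    max_start = window - read_len                   # last start that keeps a read in the window
    successes = 0
    for _ in range(n_trials):
        spanning = 0
        for _ in range(n_reads):
            start = rng.randint(0, max_start)       # uniform read start (inclusive bounds)
            left = breakpoint_pos - start           # read bases left of the breakpoint
            right = start + read_len - breakpoint_pos  # read bases right of the breakpoint
            if left >= min_anchor and right >= min_anchor:
                spanning += 1
        if spanning >= min_reads:
            successes += 1
    return successes / n_trials

if __name__ == "__main__":
    for depth in (0.5, 1, 2, 5):
        print(depth, spanning_read_power(depth))
```

Raising `min_anchor` in this sketch mimics, in a crude way, the shorter contiguous matches of breakpoint-spanning reads: fewer read placements count as informative, so the estimated power falls at a given depth.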
To our knowledge, in a single experiment we sequenced more CNV breakpoints than have been reported in any previous study, excepting genome-wide sequencing projects. Until now, the prohibitive cost and effort required to sequence CNV breakpoints have limited the number of events described at base-pair resolution. An analysis of 270 deletion breakpoints found that 40% of the breaks show microhomology and 14% contain small amounts of inserted bases3. A study looking at 227 CNVs larger than 7 kb concluded that 38% of their events were formed by NAHR, 39% by NHEJ and 17.5% by retrotransposition, and 4.5% were VNTRs26. In a screen of structural variants from individuals with lung cancer, 306 germline structural variants were sequenced27. We reanalyzed this dataset, removing 226 inversions and likely transposable element polymorphisms. We found insertion of nontemplated sequence in 22.5% of events and microhomology in 40% of events, but only 7.5% of events showed both signatures; the remainder were blunt ends. In total, these figures accord reasonably closely with what we observed in the present study: microhomology at 70% of deletion breakpoints, inserted sequence in 33%, but just 10% of breaks showing both microhomology and inserted sequence. Thus, in contrast to previous studies that have disagreed over the relative proportions of different breakpoint signatures3,26, once CNVs are detected at high (<3 kb) resolution and obvious differences in ascertainment are accounted for, distinct studies agree relatively closely on the proportions of different breakpoint signatures, and thus on the relative contributions of different mutational mechanisms.
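To make the breakpoint signatures concrete, the following minimal sketch scores a simple deletion for microhomology, measured as the number of bases by which the breakpoint could be shifted left or right without changing the resulting sequence. The function names and the three-way labeling are hypothetical and are not the pipeline used in this study. In particular, junctions that carry both inserted sequence and microhomology need a more careful definition, which is not attempted here.

```python
def microhomology_length(ref, start, end):
    """Microhomology at a deletion that removes ref[start:end] (0-based, half-open).

    Counts how far the deletion can slide right plus how far it can slide
    left while still producing the same deleted allele.
    """
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right

def breakpoint_signature(ref, start, end, inserted=""):
    """Assign one illustrative label to a deletion junction.

    Simplifying assumption: any nontemplated insertion is labeled
    'inserted sequence' and microhomology is then not assessed.
    """
    if inserted:
        return "inserted sequence"
    if microhomology_length(ref, start, end) > 0:
        return "microhomology"
    return "blunt"

if __name__ == "__main__":
    # Toy reference: the deleted segment TCAGGGT is followed by TCA,
    # so the breakpoint position is ambiguous over 3 bases.
    ref = "GGACTCAGGGTTCAAC"
    print(microhomology_length(ref, 4, 11), breakpoint_signature(ref, 4, 11))
```

Tallying these labels over a set of sequenced junctions gives the kind of signature proportions compared across studies above, provided the same definition is applied to every dataset.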
There are still hurdles between the generation of copious CNV breakpoint information and the use of that information to make rigorous inferences about germline CNV mutation processes. The first is that it cannot be taken for granted that insights derived from experiments on somatic cells (which often have mutations affecting other components of DNA repair) are comprehensive with respect to germline mutation processes. There may be additional mechanisms operating in the germline, and the relative contributions of mutational mechanisms may be different. One example of the former is the developmentally programmed homologous recombination that takes place preferentially at recombination hotspots in the germline, which drives mutation at some VNTR loci28 and can cause NAHR29,30.
The second challenge is to develop a rigorous, statistically driven framework for mapping the breakpoint signatures we observed to the mutation processes that formed them. There are multiple mutational mechanisms that can generate similar breakpoint signatures: for example, MMEJ, MMBIR and NHEJ are all capable of generating deletions with microhomology at the breakpoints. There are thought to be subtle distinctions, however, in the properties of breakpoints produced by NHEJ and MMEJ; for example, MMEJ is thought to require longer stretches of microhomology (>5 bp) than NHEJ (1–4 bp). If these preferences can be precisely characterized, we envisage being able to use statistical analysis of large collections of CNV breakpoints to estimate the relative contributions of different pathways or sub-pathways to in vivo CNV formation. This could be done, for example, by modeling the empirical distribution of microhomology lengths as a mixture of contributions from different pathways.
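A mixture analysis of this kind might be sketched as follows. The example fits a two-component Poisson mixture to microhomology lengths by expectation-maximization. The choice of Poisson components, the starting values and the toy data are all assumptions made for illustration. Real pathway-specific length distributions have not been characterized here, and associating the short and long components with NHEJ-like and MMEJ-like repair is only a hypothetical reading.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k events, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def fit_two_poisson_mixture(lengths, n_iter=200, init=(1.0, 8.0)):
    """EM fit of a two-component Poisson mixture to microhomology lengths.

    Returns (weight_short, mean_short, mean_long). The two components are
    an illustrative assumption, not an established model of repair pathways.
    """
    lam_a, lam_b = init
    weight_a = 0.5
    n = len(lengths)
    for _ in range(n_iter):
        # E-step: probability that each junction came from the short component.
        resp = []
        for k in lengths:
            pa = weight_a * poisson_pmf(k, lam_a)
            pb = (1.0 - weight_a) * poisson_pmf(k, lam_b)
            resp.append(pa / (pa + pb))
        # M-step: update the mixing weight and the component means.
        total_a = max(sum(resp), 1e-12)
        total_b = max(n - total_a, 1e-12)
        weight_a = total_a / n
        lam_a = max(sum(r * k for r, k in zip(resp, lengths)) / total_a, 1e-6)
        lam_b = max(sum((1.0 - r) * k for r, k in zip(resp, lengths)) / total_b, 1e-6)
    return weight_a, lam_a, lam_b

if __name__ == "__main__":
    # Toy data: 100 junctions with 1-4 bp and 32 junctions with 7-10 bp.
    toy = [1] * 30 + [2] * 40 + [3] * 20 + [4] * 10 + [7] * 8 + [8] * 12 + [9] * 8 + [10] * 4
    print(fit_two_poisson_mixture(toy))
```

The fitted mixing weight is the quantity of interest: under the stated assumptions it estimates the fraction of junctions attributable to the short-microhomology component.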
There is not universal agreement as to whether certain mutational mechanisms are biologically distinct. For example, some view NHEJ and MMEJ as distinct pathways9, whereas others see them as two strands of a more general and flexible NHEJ mechanism7. The phenomenology is static, but researchers’ understanding of mutational mechanisms is dynamic, so the mapping of signatures to mechanisms is subject to change over time. Large amounts of data from targeted experiments, coupled with statistical analyses, should help crystallize these issues and establish population-based studies of CNV mutation as less of a descriptive exercise and more of an inference-based one.