Genome structural variation (SV) is a major source of genetic diversity in mammals and a hallmark of cancer. While SV is typically defined by its canonical forms – duplication, deletion, insertion, inversion and translocation – recent breakpoint mapping studies have revealed a surprising number of “complex” variants that evade simple classification. Complex SVs are defined by clustered breakpoints that arose through a single mutation but cannot be explained by one simple end-joining or recombination event. Some complex variants exhibit profoundly complicated rearrangements between distinct loci from multiple chromosomes, while others involve more subtle alterations at a single locus. These diverse and unpredictable features present a challenge for SV mapping experiments. Here, we review current knowledge of complex SV in mammals, and outline techniques for identifying and characterizing complex variants using next-generation DNA sequencing.
Structural variation (SV) is defined as differences in the copy number, orientation or location of relatively large genomic segments (typically >100 bp). The canonical forms include deletions, tandem duplications, insertions, inversions and translocations. Large-scale microscopically-visible genomic rearrangements have long been recognized for their role in evolution and disease, but the remarkable prevalence of submicroscopic SVs only became apparent this past decade with the development of high-resolution methods such as array-comparative genomic hybridization (array-CGH) and next-generation DNA sequencing. Current data [1, 2] suggest that two humans differ by 5,000–10,000 inherited SVs and that both inherited and de novo SVs contribute to a number of normal and disease phenotypes . Similar levels are apparent in other mammalian species including chimpanzee , mouse [5, 6], rat , dog [8, 9] and cattle .
Virtually all tumor genomes harbor somatically-acquired SV, but the landscape is extremely diverse. Some tumors have tens or hundreds [11–16], while others have very few [12, 15, 17, 18], and the abundance of different SV classes varies considerably within and among tumor types [12, 15, 19]. A subset of cancer-associated SV appears to be functional and under strong selection, such as amplification of oncogenes, deletion of tumor suppressors and translocations that produce fusion genes, but many appear benign. Tumor genome instability may be caused by mutations in DNA maintenance machinery, widespread telomere erosion and/or unstable chromosome architectures acquired during tumorigenesis (e.g., dicentrics) .
In this context the recent discovery of many complex variants that defy simple classification into the typical SV classes has led to a re-examination of the mechanisms and impact of structural variation. Here, we review recent findings regarding complex variation and discuss techniques for their identification and characterization. We limit our discussion to mammals, mainly human, but we note that these same issues are relevant to other species.
Structural variants are defined by their breakpoints, which are the novel sequence junctions generated by structural mutation (Figure 1a–c). Structural variants arise through four general mechanisms (reviewed in ): 1) ligation of double strand DNA breaks (DSBs) through non-homologous end-joining (NHEJ) or microhomology-mediated end-joining (MMEJ); 2) exchange between sequences sharing significant stretches of homology, as can occur either by non-allelic homologous recombination (NAHR) during DSB repair or meiosis, or by single-strand annealing (SSA) at DSBs; 3) DNA replication errors such as strand-slippage or template switching; and 4) transposition of mobile elements.
Breakpoints are usually identified by comparing the structure of an experimental genome to that of the reference genome, and breakpoint positions are reported based on the coordinate system of the reference (Figure 2). This can cause some confusion since the number of sites may be different depending on which genome one is referring to. For example, a deletion produces a single junction in the experimental genome, but this junction is defined by two coordinates in the reference; the term “breakpoint” has in various studies been used to describe either, or both points of view. The Variant Call Format (VCF) definition resolves this ambiguity by using the terms “novel adjacency” and “breakend” to refer to sites in the experimental and reference genomes, respectively . For simplicity, we define breakpoints based upon their number and position in the experimental genome. This facilitates technical discussion since it is the genome for which experimental data is generated and interpreted, and usually the genome that harbors the derived SV allele produced by recent mutation. However, we note that the reference genome harbors a finite number of derived alleles that can only be reliably discerned by comparison to related species [2, 23].
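As a minimal sketch of this bookkeeping, the Python fragment below represents a deletion as one novel adjacency in the experimental genome, joined from two breakends in the reference. The class and function names are hypothetical and are not taken from the VCF specification; coordinates are assumed to be 1-based.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakend:
    """A site in the reference genome (hypothetical helper type)."""
    chrom: str
    pos: int  # assumed 1-based reference coordinate


@dataclass(frozen=True)
class NovelAdjacency:
    """A junction in the experimental genome joining two reference breakends."""
    left: Breakend
    right: Breakend


def deletion_adjacency(chrom, last_retained, first_retained):
    """Describe a deletion of the bases strictly between the two retained positions."""
    if first_retained <= last_retained + 1:
        raise ValueError("no bases are deleted between the two positions")
    return NovelAdjacency(Breakend(chrom, last_retained),
                          Breakend(chrom, first_retained))


def count_sites(adjacencies):
    """Return (sites in the experimental genome, distinct sites in the reference)."""
    n_experimental = len(adjacencies)
    n_reference = len({end for adj in adjacencies for end in (adj.left, adj.right)})
    return n_experimental, n_reference


deletion = deletion_adjacency("chr1", 10_000, 12_501)
print(count_sites([deletion]))
```

Counting the same deletion from either point of view gives one site in the experimental genome and two in the reference, which is the ambiguity described above.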
By definition, complex structural variants are composed of multiple breakpoints whose origin cannot be explained by a single end-joining or DNA exchange event. Complex SVs vary considerably in their architecture. The most extreme forms exhibit multiple rearrangements between distinct loci and/or different chromosomes, sometimes involving complex patterns of copy number alteration at or near rearrangement breakpoints [16, 24, 25]. Many are composed of multiple deletions, duplications and/or rearrangements at a single locus [6, 26–29]. The most subtle forms contain one or more small-scale insertions, deletions or rearrangements at the breakpoint of a larger SV [6, 30–33]. As one might expect, the most extreme forms of complex SV are generally associated with cancer or sporadic disorders, and the majority of complex SVs identified in healthy individuals are ostensibly benign.
By definition, complex SVs arise through a single mutational event. A central caveat is that this fact can be difficult to establish. Any complex variant structure can, in theory, also be produced by independent temporally-distinct mutations (Figure 1c–e), and repeated mutation is known to occur at localized regions within dicentric chromosomes subjected to breakage-fusion-bridge cycles , or at unstable loci such as fragile sites , recombination hotspots , palindromes , and “core duplicons” . Artificially complex breakpoint patterns can also be produced by one simple mutation at an otherwise complex locus in the reference genome, such as those formed by repeated segmental duplication during evolution . Thus, it can be difficult to accurately distinguish between simple and complex forms of structural mutation.
While the methods for detecting complex breakpoint patterns formed by sequential versus complex mutation are essentially the same, their origins and consequences are different. We mainly focus on variants formed by complex mutation and attempt to distinguish between these classes when possible.
The observation of complex structural mutation is not new. Using standard cytogenetic methods such as G-banded karyotyping and fluorescent in situ hybridization (FISH), numerous complex chromosome rearrangements (CCRs) have been identified in patients suffering from sporadic disorders or infertility (reviewed in [24, 40]). CCRs involve at least 3 breakpoints from 2 or more chromosomes (Figure 1a), and are estimated to comprise ~3% of spontaneous rearrangements detected in prenatal diagnoses . Some events are remarkably complex: ~26% of 251 well-characterized CCRs have more than 5 breakpoints [24, 40], and two contain 15–17 breakpoints [42, 43]. Similarly complex intra-chromosomal rearrangements have also been reported . Interestingly, when rearrangements are fine-mapped, many apparently simple rearrangements are found to be complex, the number of detected breakpoints tends to increase and additional copy number mutations and local rearrangements are often found near breakpoints [43, 45, 46] (prior work reviewed in [24, 40]). The fact that most CCRs are identified as spontaneous events strongly argues that they arise through a single complex mutation rather than multiple independent simple mutations.
More recently, array-CGH has revealed a number of smaller-scale submicroscopic complex genomic rearrangements associated with sporadic disease. These mutations are generally “non-recurrent” in that they exhibit novel breakpoints, as opposed to recurrent mutations formed by NAHR. The most detailed studies characterized a series of non-recurrent pathogenic de novo SVs at the PLP1  and MECP2 genes , and at a 3Mb locus associated with Potocki-Lupski and Smith-Magenis syndromes . Remarkably, the authors found complex structures in 41% of 61 non-recurrent mutations. Taking into account previous (reviewed in ) and subsequent [47–50] reports of complex SV at disease-associated loci, these data indicate that complex mutations account for a significant fraction of de novo SVs. Reported patterns include adjacent copy number alterations separated by unaltered intervening sequence, deletions or duplications embedded within larger duplications, and triplications.
These observations, as well as previous data from bacterial studies , led to two related models for the generation of complex SVs: fork stalling and template switching (FoSTeS) , and microhomology-mediated break-induced replication (MMBIR) . In these models, a stalled or broken replication fork undergoes template switching events utilizing microhomology (e.g., 2–5 bp) between the 3’ end of the newly synthesized strand and non-allelic loci (Figure 1d). Complex SVs are produced when multiple switches occur at a single broken/stalled fork. Importantly, template switches may occur between distant loci spanning entire chromosome arms , presumably due to proximity in the nucleus, which implies that they may also be involved in many complex chromosomal rearrangements. Fine-scale mapping of CCRs using modern sequencing technologies  will help resolve this question.
One might predict that many inherited germline SVs, most of which are likely benign, might also exhibit these features. An early study found that 5 of 24 deletion breakpoints showed small-scale insertions or rearrangements, or multiple deletions separated by non-deleted sequence . Another clue came from sequencing breakpoints in synteny between the human and gibbon genomes . Of 24 rearrangement breakpoints, 11 contained insertions ranging from 9 bp to 20 kb, and some insertions were mosaic structures composed of common repeats and segmental duplications originating from nearby genomic regions.
Three recent genome-wide DNA sequencing-based studies have assessed the prevalence of complex SVs by characterizing inherited SV breakpoints at single base resolution. The first  examined 1171 breakpoints in the mouse genome and found ~16% of variants to be complex. Of these, 84% were composed of multiple breakpoints in close proximity (<1 kb), often with intertwined breakpoint patterns caused by one or more adjacent deletion/duplication events plus local rearrangement, and the remainder contained small breakpoint insertions or rearrangements. Common patterns included duplications separated by small non-duplicated segments, deletions adjacent to larger duplications, and deletions with an internal sequence transposed to the edge of the breakpoint, often in inverted orientation (Figure 3). Two subsequent studies in human focused mainly on breakpoint insertions. One  used DNA capture technology to sequence 324 breakpoints predicted by array-CGH , and found that 5.2% contained breakpoint insertions, most of which were derived from nearby loci and inserted in inverted orientation. Another  sequenced 1054 SV breakpoints identified by fosmid paired-end mapping , and found that 5.5% contained insertions of DNA larger than 20 bp, and 73% of the breakpoint insertions were derived from a locus less than 250 kb away. Thus, three studies, using distinct methods and definitions of variant complexity, have converged on a fairly similar estimate for inherited complex SV: 5–16%. Given the technical difficulties associated with high-throughput mapping, assembly and interpretation of breakpoint sequences, as well as the apparently higher incidence of de novo complex variants (discussed above), we suspect that the true number is somewhat higher.
The architecture of a somatic genome is less constrained than that of a germline genome, which must complete meiosis and development to survive, and tumors evolve under diverse selective pressures and mutational forces. As a result, the types and numbers of de novo SVs in different tumors vary widely, and diverse karyotypic configurations have been observed. Many tumors show complex patterns of gene amplification , presumably due to repeated mutation and strong selection. In some breast tumors, “firestorms” of amplification and deletion have been observed  on chromosome arms, likely resulting from breakage-fusion-bridge cycles. These complex patterns have historically been explained by a gradual accumulation of mutations during tumorigenesis .
The field has been upended by the discovery of extraordinarily complex intra- and inter-chromosomal rearrangements in certain tumor genomes. In the initial finding, sequencing of a single chronic lymphocytic leukemia (CLL) genome revealed 42 somatically-acquired SV breakpoints in several clusters on the long arm of chromosome 4 (4q) . These included deletions, intra-chromosomal rearrangements, and inter-chromosomal rearrangements to a single site each on chromosomes 1, 12 and 15. Remarkably, only one additional somatic SV was discovered in the rest of the genome. The 4q region exhibited numerous hemizygous deletions (1 copy) separated from each other by unaltered segments (2 copies), and the boundaries of deleted segments corresponded to intra- and inter-chromosomal rearrangement breakpoints. This pattern differs markedly from previously described tumors, but is not rare; the authors mined SNP array data and found similar patterns in 18 of 746 (2.4%) diverse cancers/cell lines, 4 of which were confirmed by whole-genome sequencing, and in 5 of 20 (25%) unselected bone cancers also analyzed by genome sequencing.
The authors presented three lines of evidence that these unprecedented rearrangements are generated through a single catastrophic event . First, simulations revealed that breakpoints are clustered in a highly nonrandom manner. Second, the copy number profiles associated with complex events only exhibit two states – either losses or gains but not both – interdigitated with unaltered segments, whereas sequential mutation should produce many states. Third, within breakpoint clusters harboring intertwined deletions and rearrangements, losses derive from the same parental chromosome and heterozygosity is preserved at unaltered segments, which greatly constrains the order of events under a model of sequential mutation. The authors refer to this mutational process as chromothripsis, and propose that a chromosome is shattered in a one-off event, perhaps by ionizing radiation or one dramatic cycle of breakage-fusion-bridge, and stitched back together again in imprecise fashion (Figure 1b). Interestingly, a recent study  reported an inherited complex rearrangement with a similar structure, which indicates that chromothripsis-like mechanisms also operate in the germline.
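The second line of evidence, the number of copy number states across a rearranged region, can be illustrated with a short Python sketch. It assumes that a segmented copy number profile is already available as an ordered list of integer copy numbers, one per segment; the function name and the input format are hypothetical, and the sketch does not reproduce the authors' simulations.

```python
def copy_number_summary(segment_copy_numbers):
    """Summarize an ordered list of per-segment copy numbers.

    Returns the distinct copy number states and the number of times the
    profile switches state between adjacent segments.
    """
    states = sorted(set(segment_copy_numbers))
    switches = sum(
        1
        for previous, current in zip(segment_copy_numbers, segment_copy_numbers[1:])
        if previous != current
    )
    return {"states": states, "n_states": len(states), "switches": switches}


# Oscillation between two states (hemizygous loss and unaltered), as described
# for the 4q region.
print(copy_number_summary([2, 1, 2, 1, 2, 1, 2]))

# A profile with many states, as might be expected from sequential mutation.
print(copy_number_summary([2, 3, 4, 3, 1, 2]))
```

A profile with many switches but only two states is consistent with the pattern described above, although on its own it does not establish a single catastrophic event.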
More recently, a single complex rearrangement was identified in 3 of 7 prostate cancer genomes analyzed by whole-genome sequencing . One involved 4 loci on a single chromosome, another involved 4 loci on two chromosomes, and the third involved 9 loci on 4 chromosomes. Strikingly, two involved a novel “closed chain” breakpoint pattern, such that each locus was connected to two other distinct loci. While the precise structure of “closed chain” rearrangements is unclear (Figure 4), there are two key differences between them and those attributed to chromothripsis: 1) there is no obvious clustering of breakpoints on a single chromosome; and 2) the breakpoint regions do not exhibit copy number mutations. It is an open question whether these rearrangements are caused by chromothripsis or a distinct mechanism such as FoSTeS/MMBIR. Perhaps indicating the latter, the data shown for one rearrangement are more consistent with 3 small insertions into a single locus rather than a series of translocations.
Advances in DNA sequencing technologies have enabled the exploration of genome structure with exquisite detail. Unlike conventional cytogenetic methods or array-CGH, sequencing permits genome-wide characterization of breakpoints from all classes of SV with high precision. The general algorithmic approaches and available tools for detecting SV breakpoints from DNA sequence data have been reviewed elsewhere [62, 63]. In essence, the identification and interpretation of complex SV involves 3 steps: 1) genome-wide breakpoint detection using one or more of the techniques discussed in Box 1; 2) screening for clusters or interconnected chains of breakpoints that comprise a single complex variant; and 3) reconstructing the architecture of the variant locus to infer the causal mechanism and potential functional impact.
When DNA sequences are aligned to the reference genome, copy number variations (CNVs) are evident as significant increases or decreases in the depth of coverage (DOC) of aligned sequence . Inferring CNV via DOC analysis is conceptually similar to array-CGH and typically yields similar or moderately better resolution (1–15 kb). DOC is inadequate for mapping fine-scale locus complexity, but permits initial identification of complex CNVs and helps determine whether complex breakpoint patterns involve copy number mutations. DOC can also detect NAHR-derived CNVs whose breakpoints lie within large repeats (which can confound the methods below).
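The idea behind depth-based CNV inference can be sketched in a few lines of Python. The example assumes that read depth has already been counted in fixed-size windows, uses the median window as a diploid baseline, and applies arbitrary log2-ratio thresholds; the function name, the thresholds and the pseudocount are illustrative assumptions, and real tools additionally correct for GC content and mappability.

```python
import math
from statistics import median


def call_copy_number_windows(depths, loss_log2=-0.5, gain_log2=0.4):
    """Label each window 'loss', 'gain' or 'neutral' from its read depth.

    depths: per-window read counts (assumed precomputed).
    loss_log2, gain_log2: hypothetical log2-ratio cutoffs relative to the median.
    """
    baseline = median(depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    labels = []
    for depth in depths:
        # A pseudocount of 0.5 avoids log2(0) in fully deleted windows.
        ratio = math.log2((depth + 0.5) / (baseline + 0.5))
        if ratio <= loss_log2:
            labels.append("loss")
        elif ratio >= gain_log2:
            labels.append("gain")
        else:
            labels.append("neutral")
    return labels


print(call_copy_number_windows([30, 31, 29, 15, 14, 30, 46, 30]))
```

Adjacent windows with the same label would then be merged into segments, whose boundaries approximate, but do not precisely map, the underlying breakpoints.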
Paired-end mapping (PEM) strategies identify SV breakpoints by examining the alignments of relatively short sequences from the ends of larger DNA molecules . Sequencing libraries are created with fragments of known length (generally 200–500 bp for paired-end libraries and 1–10 kb for mate-pair libraries). Paired-end sequences (readpairs) that are “concordant” with the reference genome align with the expected distance and orientation, whereas readpairs spanning an SV breakpoint will produce “discordant” alignments with an unexpected alignment distance and/or orientation. Each SV class produces a distinct mapping signature (Figure 2). However, current fragment sizes limit sensitivity, discordant mapping patterns can be difficult to interpret at complex SVs (Figure 4), and PEM cannot map breakpoints to single-base resolution.
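The distinction between concordant and discordant readpairs can be sketched as follows. The example assumes a standard paired-end library in which concordant pairs align with the leftmost read on the forward strand and the rightmost read on the reverse strand; the class name, the span cutoffs and the output labels are hypothetical, and the labels describe only the most common interpretation of each signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Aligned positions and strands of the two ends of one fragment."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_readpair(pair, min_span=200, max_span=500):
    """Return a label for the mapping signature of one readpair."""
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal"
    # Order the two ends by reference position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == right_strand:
        return "inversion-like"           # both ends on the same strand
    if left_strand == "-":
        return "tandem-duplication-like"  # everted orientation
    span = right_pos - left_pos
    if span > max_span:
        return "deletion-like"            # ends map farther apart than expected
    if span < min_span:
        return "insertion-like"           # ends map closer together than expected
    return "concordant"


print(classify_readpair(ReadPair("chr1", 1000, "+", "chr1", 6000, "-")))
```

In practice a breakpoint call would require several independent discordant readpairs with the same signature, and at complex SVs a single locus may produce a mixture of these signatures.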
Split-read mapping (SRM) identifies sequences that actually contain a breakpoint [2, 59] (Figure 2). The alignments for such sequences are “split” because DNA segments flanking the breakpoint align to disjoint locations in the reference genome. SRM inherently maps breakpoints to single base resolution and thus provides mechanistic insight. Owing to the repetitive structure of mammalian genomes, genome-wide SRM requires reads longer than ~200 bp. Long-read (> 500 bp) SRM is a particularly powerful approach for studying complex SV because multiple breakpoints can potentially be captured by a single read, greatly aiding in variant locus reconstruction (Figure 3e,f).
Local de novo sequence assembly  can be used to reconstruct a variant locus. Once an assembled sequence is aligned to the reference genome, breakpoint(s) are discerned following the same principles as SRM. Unlike SRM, assembly typically generates substantially larger “contigs” that are more amenable to characterizing complex SV. While currently infeasible for most laboratories, whole-genome assemblies promise the most comprehensive description of SV, as large portions of entire chromosomes can be aligned to precisely identify both canonical and complex rearrangements [1, 61].
Once raw breakpoints have been mapped the primary goal is to distinguish clusters of breakpoints delineating complex variants from nearby, yet potentially simple SV breakpoints caused by independent mutations. The development of robust tools for identifying complex events is a difficult and unsolved problem because at present there are no defined rules for constraining the expected breakpoint patterns. It is not clear whether such rules exist. Nevertheless, discerning complex mutations can be relatively straightforward when analyzing human families or minimally mutated cancer genomes, since spontaneous events can be readily distinguished from inherited variants by analyzing related samples. However, detecting complex variants in a “sea” of simple variants, as in studies of inherited SV or highly rearranged cancer genomes, is problematic because breakpoints may lie in close proximity due to chance alone. This may not be a concern for functional studies but is crucial for inferring mechanism. There is no simple solution to this conundrum, and thus most studies have focused on the most obvious examples of complex SV.
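How often breakpoints lie close together by chance alone can be gauged with a simple Poisson approximation. The sketch below assumes that breakpoints are uniformly and independently distributed across the genome, which is unrealistic given the non-uniform distribution of SV, so the result should be read as a rough lower bound on chance clustering rather than a calibrated p-value. The function name and parameters are hypothetical.

```python
import math


def chance_cluster_probability(n_breakpoints, genome_length, window, k):
    """P(at least k breakpoints fall in a given window) under a uniform null.

    The count in a window of `window` bp is approximated as Poisson with
    mean n_breakpoints * window / genome_length.
    """
    if genome_length <= 0 or window <= 0:
        raise ValueError("genome_length and window must be positive")
    lam = n_breakpoints * window / genome_length
    # P(X >= k) = 1 - sum_{i < k} exp(-lam) * lam**i / i!
    cdf_below_k = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k)
    )
    return max(0.0, 1.0 - cdf_below_k)


# 10,000 breakpoints in a 3 Gb genome, 1 kb window.
for k in (1, 2, 3):
    print(k, chance_cluster_probability(10_000, 3_000_000_000, 1_000, k))
```

Because many windows are examined in a genome-wide screen, even small per-window probabilities can yield chance clusters, which is one reason the problem becomes harder as the number of simple variants grows.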
Simple and flexible approaches are therefore preferable. Screens must begin by accounting for simple multi-breakpoint variants such as inversions, retrotranspositions and reciprocal translocations (Figure 2e,f). Merging these breakpoint calls is conceptually simple, but we are not aware of any available software that does so comprehensively. Breakpoint clusters can then be identified by simple sliding window schemes that compare local breakpoint density to a null model. Ideally, this screening method should take into account the non-uniform distribution of simple SV in normal and tumor genomes, as well as commonly observed complex variant architectures. It may be possible to use homology profiles to tease apart nearby or overlapping clusters that arose through distinct mechanisms, but since breakpoints formed by template switching and end-joining can display similar levels of microhomology, in practice this will be difficult. Complex SVs that do not involve obvious breakpoint clusters at a single locus can be identified by computationally searching for chains of interconnected breakpoints that share at least one locus in common. Tools in the BEDTools software suite  can be adapted for this purpose . By integrating results from clustering and chaining approaches, most classes of complex SV can be discerned. We stress, however, that these higher-order clustering steps can produce falsely complex SVs at repetitive or poorly-assembled loci in the reference genome that generate abundant breakpoint calls, as often occurs at or near centromeres, telomeres, simple tandem repeats, and regions laden with segmental duplications. Thus, subsequent annotation and characterization steps are crucial.
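The chaining idea can be sketched with a small union-find routine in Python. The example assumes that each breakpoint call is given as a pair of reference positions, and that two calls are linked whenever any of their ends lie on the same chromosome within a user-chosen distance; the function name, the input format and the `slop` parameter are hypothetical, and this is not the BEDTools-based procedure cited above.

```python
from collections import defaultdict


def chain_breakpoints(breakpoints, slop=1000):
    """Group breakpoint calls that share a locus.

    breakpoints: list of ((chrom, pos), (chrom, pos)) pairs.
    Returns a sorted list of groups; each group is a sorted list of indices
    into `breakpoints`. Calls that are linked to nothing are omitted.
    """
    parent = list(range(len(breakpoints)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Collect every breakend per chromosome, remembering its parent call.
    ends_by_chrom = defaultdict(list)
    for index, (end1, end2) in enumerate(breakpoints):
        for chrom, pos in (end1, end2):
            ends_by_chrom[chrom].append((pos, index))

    # Link calls whose breakends are adjacent within `slop` bp.
    for ends in ends_by_chrom.values():
        ends.sort()
        for (pos_a, index_a), (pos_b, index_b) in zip(ends, ends[1:]):
            if pos_b - pos_a <= slop:
                union(index_a, index_b)

    groups = defaultdict(list)
    for index in range(len(breakpoints)):
        groups[find(index)].append(index)
    return sorted(group for group in groups.values() if len(group) > 1)


calls = [
    (("chr1", 1000), ("chr4", 50000)),
    (("chr4", 50200), ("chr9", 7000)),
    (("chr9", 7300), ("chr1", 1400)),   # closes the chain back to chr1
    (("chr2", 100), ("chr2", 90000)),   # isolated simple variant
]
print(chain_breakpoints(calls))
```

Because linking is transitive, the same routine groups both local clusters and chains that span several chromosomes; it will also, as noted above, produce spurious groups at repetitive or poorly assembled loci.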
The above methods may fail to detect complex SVs that possess neither clustered nor chained breakpoints, but rather are composed of nested or overlapping variant calls that affect a common genomic interval. This pattern is trivial to detect, but is also commonly produced by sequential mutation and should be interpreted with caution. These methods may also miss cryptic complex variants that contain small-scale insertions or rearrangements at the breakpoint itself. For these it is necessary to carefully inspect breakpoints at single-base resolution and to align the breakpoint sequence to the reference genome. Sensitive alignment is crucial because small breakpoint alterations can masquerade as non-templated addition of nucleotides during NHEJ, merely due to the inability of aligners to find significant matches.
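Detecting nested or overlapping calls amounts to an interval comparison, as in the Python sketch below. It assumes that each variant call is a half-open interval on one chromosome with a name; the function name, tuple layout and relation labels are hypothetical.

```python
def overlapping_calls(calls):
    """Report pairs of variant calls that share a genomic interval.

    calls: list of (chrom, start, end, name) with half-open [start, end).
    Returns (name_a, name_b, relation) tuples, where relation is 'nested'
    if one call lies entirely within the other and 'overlapping' otherwise.
    """
    ordered = sorted(calls)
    pairs = []
    for i, (chrom_a, start_a, end_a, name_a) in enumerate(ordered):
        for chrom_b, start_b, end_b, name_b in ordered[i + 1:]:
            # Later calls start even farther right, so stop at the first miss.
            if chrom_b != chrom_a or start_b >= end_a:
                break
            nested = end_b <= end_a or start_b == start_a
            pairs.append((name_a, name_b, "nested" if nested else "overlapping"))
    return pairs


calls = [
    ("chr1", 100, 1000, "dup1"),
    ("chr1", 300, 500, "del1"),    # lies inside dup1
    ("chr1", 900, 1500, "del2"),   # partially overlaps dup1
    ("chr1", 2000, 2500, "del3"),
    ("chr2", 100, 1000, "dup2"),
]
print(overlapping_calls(calls))
```

Such pairs are only candidates: the same pattern arises when independent mutations affect one interval at different times.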
A key question for any complex variant is: what exactly does it look like? Integration of breakpoints identified by PEM, SRM and/or local assembly (Box 1), combined with DOC analysis to distinguish between balanced rearrangements and copy number mutations, is theoretically sufficient to infer the architecture of most variants (Figure 3 and Figure 4). However, this remains a major challenge for two reasons. First, neither reconstructing nor visualizing complex variant structures is a trivial problem, and there is a notable dearth of suitable computational tools. Thus, to our knowledge, all DNA sequencing-based studies to date have relied heavily on manual curation and human expertise to interpret complex breakpoint patterns. This laborious approach has proven effective and resulted in detailed architectural information for over 250 complex SVs [6, 16, 25, 32, 33, 43], but is unsustainable given the scale of current genome sequencing projects. Second, the accuracy of interpretation depends entirely on the accuracy of the underlying breakpoint calls, and current breakpoint mapping strategies suffer from either high false positive or high false negative rates, and sometimes both. It is therefore likely that complex SVs are more prevalent, and more architecturally diverse, than currently recognized owing to under-ascertainment and misinterpretation.
Manual variant reconstruction is greatly aided by data visualization software (Figure 5). The UCSC Genome Browser , Integrative Genomics Viewer (IGV)  and Savant  excel at displaying raw sequence data aligned to the reference genome and can also display annotation tracks, but are only practical for visualizing small genomic regions (< 100 kb). A current advantage of IGV is the ability to visualize two distinct loci in “split-screen” mode, but Savant offers superior visualization of readpair connectivity. At the other end of the spectrum, visualization tools like CIRCOS  or GREMLIN  provide aesthetically-pleasing rearrangement depictions, but are mainly useful for summarizing results, not interpreting data. A major limitation of the above tools is that they display data solely with respect to the reference genome, which does not allow one to easily infer variant architecture.
Rapid interpretation requires a direct comparison of the structure of assembled breakpoint sequences, or entire variant loci, to the structure of the reference genome. In some cases a simple dotplot may suffice. The PARASIGHT software (J. Bailey et al., unpublished: http://eichlerlab.gs.washington.edu/jeff/parasight) is ideally suited to this task because it shows pairwise alignments in an informative format that preserves the structure of both variant and reference sequences (Figure 5c), and can display annotation tracks. For example, an automated PARASIGHT pipeline enabled visualization and interpretation of several thousand assembled breakpoints in several days . Unfortunately, while PARASIGHT is extremely flexible, it is difficult to use and often requires substantial customization for informative viewing. Other tools such as MIROPEATS  and BARAVI (R. Ophoff et al., unpublished: http://www.genetics.ucla.edu/labs/ophoff/BARAVI/) support pairwise alignment and visualization but cannot display tracks. The paucity of user-friendly breakpoint visualization software presents a major bottleneck for interpreting complex variants and underscores the need for improved tools.
Manual curation is the most accurate approach for variant reconstruction, but as the study of complex SV expands to thousands of genomes it is neither practical nor reproducible. In theory it should be possible to develop software that infers variant architecture from breakpoint predictions and DOC profiles, but we are unaware of any that explicitly attempts to do so. Moreover, we suspect that automated reconstruction of complex SVs would require impeccable input data. For example, sophisticated algorithms have proven necessary merely to integrate breakpoint calls and DOC profiles for simple deletions [2, 71]. As sequencing methods continue to improve, automated approaches will eventually be feasible through increased read lengths, emerging technologies such as “strobe” sequencing  and, ultimately, routine generation of high-quality diploid genome assemblies.
If a complex SV can be assembled into a single contig, variant reconstruction becomes a tractable problem of describing the relative structure of two DNA sequences. The first step is to align the variant sequence to the reference genome. A complication is that portions of the variant “query” sequence containing repeats will align to multiple loci. This problem is trivial for variants that involve a single well-defined locus, but for rearrangements that involve repetitive regions or multiple loci resolving these ambiguities can be difficult. This is also a significant problem for the initial detection of complex SVs from long-reads or draft assemblies. Most suitable aligners report all significant alignments including irrelevant “sub-alignments” contained within larger aligned sections of the query [73–75], which necessitates subsequent selection of the “best” minimal set for locus reconstruction. The BWA-SW aligner uses a greedy heuristic strategy to discard sub-alignments that are subsumed by larger alignments ; we have found that this, or similar, heuristic strategies are adequate for moderately complex variants composed mainly of unique sequence. Otherwise, it is preferable to pursue a more optimal alignment selection strategy.
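A greedy filter of the general kind described here can be sketched in Python. The example assumes that each alignment is reduced to its half-open interval on the query sequence plus a label, and it keeps the longest alignments first while discarding any alignment whose query interval is contained within one already kept. The function name and tuple layout are hypothetical, and this is not a reimplementation of the BWA-SW heuristic.

```python
def drop_contained_alignments(alignments):
    """Discard alignments whose query interval lies within a longer one.

    alignments: list of (query_start, query_end, label), half-open intervals.
    Returns the retained alignments sorted by query start.
    """
    # Longest first, so a containing alignment is always seen before
    # anything it contains.
    by_length = sorted(alignments, key=lambda a: (-(a[1] - a[0]), a[0]))
    kept = []
    for alignment in by_length:
        start, end = alignment[0], alignment[1]
        contained = any(k[0] <= start and end <= k[1] for k in kept)
        if not contained:
            kept.append(alignment)
    return sorted(kept, key=lambda a: a[0])


hits = [
    (0, 300, "A"),
    (50, 120, "B"),    # sub-alignment within A
    (300, 520, "C"),
    (310, 400, "D"),   # sub-alignment within C
    (500, 700, "E"),   # overlaps C only partially, so it is kept
]
print(drop_contained_alignments(hits))
```

Note that two alignments covering an identical query interval, as occurs when a repeat matches several loci, are resolved arbitrarily by this filter, which is one reason a more principled selection strategy is preferable for repetitive or multi-locus variants.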
Once alignments are defined, reconstructing variant architecture is a semantic problem of describing the relationship between alignment blocks based upon their relative positions and orientations in the variant and reference sequences. The VCF 4.1 specification offers a sensible solution for this practical problem .
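To illustrate what describing the relationship between alignment blocks might involve, the Python sketch below labels the junction between two consecutive blocks of a variant contig. It assumes half-open coordinates on both query and reference; the block layout, function name and labels are hypothetical, and the scheme is far simpler than the VCF representation.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One alignment block of a variant contig against the reference."""
    q_start: int
    q_end: int
    chrom: str
    r_start: int
    r_end: int
    strand: str  # "+" or "-"


def describe_junction(a, b):
    """Describe the junction between consecutive blocks a and b (query order).

    Returns (kind, query_gap). A positive query_gap means unaligned bases
    inserted at the junction; a negative one means the blocks overlap on the
    query, as expected for microhomology.
    """
    query_gap = b.q_start - a.q_end
    if a.chrom != b.chrom:
        kind = "inter-chromosomal"
    elif a.strand != b.strand:
        kind = "inversion"
    else:
        # On the reverse strand, query order runs opposite to reference order.
        upstream, downstream = (a, b) if a.strand == "+" else (b, a)
        ref_gap = downstream.r_start - upstream.r_end
        if ref_gap > 0:
            kind = "deletion"
        elif ref_gap < 0:
            kind = "duplication"
        else:
            kind = "collinear"
    return kind, query_gap


first = Block(0, 100, "chr1", 1000, 1100, "+")
print(describe_junction(first, Block(100, 200, "chr1", 5000, 5100, "+")))
print(describe_junction(first, Block(120, 220, "chr1", 900, 1000, "+")))
```

Applying such a rule to every adjacent pair of blocks yields a junction-by-junction description of a contig, which would still need to be combined with copy number information to separate balanced from unbalanced events.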
Mechanistically minded studies might seek to reconstruct the mutational events that generated each complex variant. Similar problems have been studied in the context of ancestral genome reconstruction using breakpoint graphs [77–80], and for inferring the mutational history of segmental duplications using modified A-Bruijn graphs  or DAWGs . Genome-scale models are subjected to various simplifying assumptions to prevent intractable computational complexity, but for any given complex variant optimal solutions are possible. An unsolved problem is how to define optimal solutions that take into account current models of mutation.
Studies of complex SV have provided new insights into the processes that generate genome variation, and this has clear implications for conventional models of species and cancer evolution that generally assume progressive, step-wise mutations. In both contexts, complex mutations represent a form of punctuated genome evolution. Resulting variants may have more subtle, unpredictable and multi-faceted phenotypic impacts than simple variants. For example, complex mutations can rearrange exons to create novel proteins, shuffle promoters, enhancers and/or repressors into a novel regulatory configuration, or simultaneously disrupt multiple genes and pathways. In the context of a developing tumor, simultaneous formation of multiple fusion genes, amplified oncogenes or deleted tumor suppressors may lead to rapid expansion of a clone with very different characteristics than neighboring cells.
A major unresolved question in the field is how complex variants arise. The two general models for complex SV formation – template switching during DNA replication (FoSTeS/MMBIR) [26, 29] and chromosome shattering (chromothripsis)  – each have eminently sensible features, but it is worth remembering that neither has been directly implicated. This raises the question of whether these mechanisms indeed account, either alone or in combination, for the architecturally diverse rearrangements that have been observed. Or is another as-yet undescribed mechanism at work? At present, there are not sufficient data to answer these questions. However, we speculate that most complex variants arise through a common mechanism. The rearrangements thus far attributed to chromothripsis differ from those explained by FoSTeS/MMBIR mainly in their greater size and complexity; the patterns are ostensibly similar. We further note that a recent study of germline rearrangements  has proposed that FoSTeS/MMBIR may explain complex breakpoint clusters that resemble those attributed to chromothripsis [25, 43, 84]. These clusters contain 3 copy number states, including duplications and triplications, and small breakpoint insertions derived from nearby loci. These features are much easier to explain by replication than by chromosome shattering. On the other hand, shattering is a simpler explanation for the staggeringly complex variants that exhibit frequent oscillation between 2 copy number states (deleted and unaltered), as observed in tumor genomes. We expect future breakpoint sequencing studies to yield additional clues, but we are not confident that the true mechanism(s) can be resolved by sequencing alone, since neither variant architectures nor breakpoint homology profiles appear sufficient to distinguish the two models. Direct experimental studies may be necessary to yield clarity.
The likelihood that complex mutations primarily arise through processes that are active in somatic cells, and not concentrated in meiosis, also implies that many other simple mutations do as well, and thus each individual may be a mosaic composition of cells with different genome structures. Indeed, evidence of somatic variation is growing [85–91], and this may potentially account for certain phenotypes that emerge during development and aging. The potential link to replication also implies that environmental conditions or trans-acting mutations that affect replication fidelity can modulate mutation rates. It has been proposed that replication stress may lead to flurries of structural mutation [21, 29], and there is direct evidence for this in E. coli  and cultured human cells [92, 93]. Further work is necessary to prove this theory, but the potential existence of genetic and environmental modulators of complex mutation is intriguing.
In most cases the functional consequences of complex SVs are unclear, and their true contribution to natural variation remains an open question. Whether these variants turn out to be a curious sideshow of mutational complexity or a driving force of functional innovation can only be answered by ongoing and future whole-genome sequencing of well-phenotyped samples. Rapidly improving DNA sequencing technologies will aid this effort, but perhaps the greater challenge lies in bioinformatic interpretation. At present, there is a notable paucity of high-throughput methods for complex SV identification, visualization, reconstruction or interpretation. We expect this challenge to be met in coming years, and we look forward to a more complete understanding of the mechanisms and functional ramifications of complex structural variation.
Our work has been sponsored by the National Institutes of Health (DP2OD006493-01 to IMH; 1F32HG005197-01 to ARQ), the Burroughs Wellcome Fund (IMH) and the March of Dimes (IMH). We thank R.A. Clark for implementing our SV visualization pipeline.