Establishing Initial Analysis Parameters
We used paired-end (PE) sequencing simulations as a tool to establish the initial analysis parameters, to quantify the effect of sequencing depth on detection of known SVs, and to study alignment related false positives. We simulated a rearranged genome based on C57BL/6J mouse reference (mm9), introducing 10 interchromosomal translocations and 10 large deletions into areas of varying mappability (). Read length, mean insert size and standard deviation of the insert size were chosen to be representative of our experimental data (50, 315, 44, respectively). Using three independent simulated datasets with 10, 20, 40, 80 and 160 million read pairs, we assessed the number of detected real and false positives, as well as the detection probability as a function of local mappability.
List of simulated SVs with mappabilities.
PE sequencing proved to be an efficient method for SV detection at coverage levels corresponding to 80 or more million read pairs. 90% of events in our simulated rearranged genome were detected with 160 million read pairs, about the minimum currently obtainable from a single lane using the Illumina HiSeq platform (). As expected, detectability of a certain rearrangement strongly depended on the breakpoint microenvironment, with more coverage needed to detect events in regions of lower mappability (). When assessing false positives, we found that 97% of total SV calls were attributed to reads with more than one equally valid mapping position. These reads originate from various repetitive genomic regions (such as centromeric satellite sequences, retroelements, RNA genes, etc.) and had to be removed from the analysis. After examining BWA mapping quality scores of reads contributing to real and false positives, we chose a cutoff of 23 for our analysis (for further discussion, see “False positives arising from BWA alignment errors”). It should be noted that cutoff is chosen based on the desired ratio of real and false positives, with lower cutoff increasing sensitivity at the expense of specificity. After applying the BWA mapping quality cutoff to our simulated datasets, we observed no more false positives related to read mapping errors. However, we noticed size-related false positives that appeared with the increasing coverage. These false positives were small deletions originating from higher end and duplications originating from the lower end of the normal DNA library fragment size distribution. To correct for insert size related false positives, we used a size cutoff of 8 standard deviations and applied it to our analysis. This parameter should be determined for each library individually, depending on the desired sensitivity: increasing the standard deviation cutoff will lead to increasing the minimal detectable deletion and duplication size. Depending on the analysis needs, it may be beneficial using lower standard deviation cutoffs together with an assessment of the number of supporting read pairs, as SVs with a higher number of supporting read pairs can indicate a real event. However, this approach should be used with caution when analyzing tumor samples where loss or gain of copy number can lead to false conclusions.
Paired-end sequencing simulations.
Simulations of PE sequencing proved to be a useful tool in developing the data filtering strategy. After optimizing the initial parameters described above and removing all false positive calls from simulated datasets, SV calls in the experimental dataset could be attributed to the sample and the experimental procedure itself, rather than analysis artefacts. Simulations were also useful as a means to predict necessary coverage for detecting certain types of events. Importantly, when relating simulations to the experimental data analysis, it has to be taken into account that expected frequency of rearrangements, and hence the needed coverage, will normally be 50% due to the diploid nature of the genome. In case of heteroclonal or impure samples (the usual case when dealing with tumor samples), this frequency is expected to be even lower.
As our experimental dataset, we chose an uncharacterized thymic lymphoma obtained from a Rag2c/c
mouse. Thymic lymphomas arising spontaneously in this mouse model harbor a large number of structural rearrangements such as translocations, large deletions and amplifications 
. Illumina’s paired-end sequencing was chosen over the mate pair strategy, which we abandoned in the early course of this work due to difficulties in DNA library preparation. We sequenced two genomic libraries, one obtained from the solid tumor tissue and the other from the liver of the same animal (germline control). We found the control library to be essential due to a large number of germline SVs originating from remains of a 129 strain background (the mouse was initially created as a 129SvEv/C57BL6 hybrid). The tumor and control library were sequenced to 17x and 9x physical coverage, respectively (, ).
We used SVDetect () and BreakDancer () to call initial SVs, as these are the two most widely used large structural variant detection programs applicable to 50 bp read PE data. Generally, the analysis using the BreakDancer initially produced more intrachromosomal and less interchromosomal SV calls compared to SVDetect, perhaps due to differences in the clustering strategy. The same analysis parameters and filtering procedure was applied to both programs, yielding similar results at the end.
Tumor-specific SVs: data filtering.
In contrast to simulations, analysis of experimental data led to a large number of false positive calls after applying initially established analysis parameters described above. We define these false positives as events supported by reads mapping to repetitive genomic regions, as well as those that span regions with retroelement activity. The number of false positives was especially large among interchromosomal SVs, explained by the higher likelihood of a repetitive read being misaligned to a chromosome different from its mate. In order to find and validate real tumor-specific variants, it was necessary to analyze the source of these calls and reduce them to a manageable number. We identified 3 main types of false positive calls, depending on their source: 1) false positives related to variation between mouse strains, 2) false positives arising from alignment errors, and 3) false positives related to PCR duplicates originating from sample preparation combined with sequencing errors. We developed different pre- and post-detection filtering procedures in order to work around these challenges.
False Positives Related to Structural Variation between Laboratory Mouse Strains
Structural variation among commonly used laboratory mouse strains, similar to structural variation between individual humans, has already been documented in great detail 
. Most knock-in mice, including the one used in this study, can be classified as hybrid strains, even if the animals were backcrossed a number of times to the reference genome strain (C57BL/6J). Observed SVs can mostly be attributed to germline retroelement activity, and are manifested as insertions of SINE, LINE and LTR elements as well as reverse-transcribed intronless genes (retrogenes). When an experimental dataset is compared to the C57BL/6J reference genome, several types of structural variants are called. Most commonly, retroelement insertions present in the reference, but missing in the sample strain, will be called as deletions, while those present in the sample strain, but missing in the reference, will be called as balanced translocations. Insertions of retrogenes can be recognized as a number of deletions encompassing introns, accompanied by a translocation call from the chromosome of origin to the recipient chromosome ().
Retrotransposon and retrogene insertions leading to false positive calls.
In order to filter out germline SVs described above, we found it necessary to obtain a control dataset by sequencing normal tissue originating from the same animal. In this study, a control dataset was prepared using liver tissue and compared to the tumor dataset. Using this strategy, we were able to remove most germline SVs. However, certain SVs failed to be detected as germline, due to lack of overlap between supporting read pairs. Therefore, we found it necessary to examine each SV manually for potentially missed overlap with the control. Even after applying the comparison procedure, a number of events we identified as high quality candidates were validated as germline (30% of intrachromosomal and 50% of interchromosomal SVs). This result can be attributed to lower coverage in our control dataset, leading to lower sensitivity of germline SV detection. Aneuploidy of tumor tissue (additional copies of some chromosomes or loss of others) creates local differences in coverage between the tumor and control dataset, which adds to the complexity of the analysis ().
False Positives Arising from BWA Alignment Errors
To remove false positives related to alignment errors, we tested the effect of BWA mapping quality score-based filtering on the number of resulting SV calls. Although BWA authors designate reads with 0–10 mapping quality as “unreliably mapped” 
, we found the best cutoff range for mapping quality score in our experiment to be 0–22 (). To partially correct for undesired removal of real SV candidates in less unique genomic regions, calls with large numbers of supporting read pairs were examined manually. However, none of the examined removed SVs could be designated as high quality candidates, since they all involved genomic regions of low mappability. After applying this read mapping quality filter before any other filtering is applied, the number of called SVs was reduced to 85% for intrachromosomal and 36–39% for interchromosomal events ().
To further reduce the number of SV calls resulting from misalignment of reads originating from repetitive regions, we tested the strategy of removing SVs with overlap with the RepeatMasker 
and the simple repeats track of the UCSC Genome Browser. We found that RepeatMasker strategy reduces the number of false positive calls significantly, but filters out 12% of previously validated rearrangements, including some with potential biological importance (eg. Pten deletion). Importantly, reads coming from RepeatMasker annotated regions are not necessarily difficult to map uniquely, since this track contains many ancient repeated elements that have significantly diverged through evolution. RepeatMasker filtering strategy was finally used only to identify high confidence candidates among interchromosomal events with low numbers of supporting read pairs. In contrast to the RepeatMasker, overlap with simple repeats track was found to be successful in filtering out alignment error related false positives only.
As another strategy of dealing with repetitive element-related false positives, we tested the efficiency of filtering SVs against the low mappability regions, calculated based on the mappability data of the UCSC Genome Browser (see Materials and Methods). This strategy proved to very successful, removing significant numbers of false positive calls, especially efficient in the case of interchromosomal SVs ().
False Positives Related to Errors in Duplicate Calling
In the course of our analysis, we observed false positives called from small clusters of 2 or 3 read pairs, with both reads mapping at positions 0–2 bp away from one another (). As already discussed by others in the field 
, most of these “imperfect duplicates” probably originated from one DNA fragment and diverged either during PCR amplification, perhaps due to template strand slipping, or sequencing errors at the beginning or the end of the read during the sequencing procedure. These bona fide duplicates cannot be removed using existing tools such as Picard’s MarkDuplicates since they do not have identical mapping positions. Percentage of imperfect duplicates appears to be correlated with the percentage of perfect PCR duplicates: specific datasets with high perfect duplicate percentage will show higher percentage of imperfect duplicates (M. Mijušković, results not part of this study).
An example of imperfect duplicates.
We defined imperfect duplicates as pairs with the same mapping position of both reads with the possible offset up to 2 bp. Detection of these duplicates was done during clustering of discordant read pairs by SVDetect or BreakDancer, using different strategies (see Materials and Methods). After applying this filter, the number of intrachromosomal and interchromosomal SVs was reduced by 0.3–1.7% and 3.9–19.5%, respectively (). Importantly, these numbers might underestimate the total imperfect duplicate percentage since in this case they were detected after removing low mapping quality reads.
Validating Structural Variants
We created the final list of 61 high confidence SVs (see Materials and Methods) after manual examination of 381 intrachromosomal and 130 interchromosomal SVs detected by SVDetect and 328 intrachromosomal and 64 interchromosomal SVs detected by BreakDancer obtained after applying our filtering procedure. The majority of these calls, called by both programs, were found to either be a result of alignment errors related to repeats (59%), or previously unidentified germline SVs such as retroelement or retrogene insertions (23%). BreakDancer detected only a subset of high confidence SVs found by SVDetect (47 out of 61), even before any filtering was applied, perhaps due to differences in the clustering algorithm.
We used PCR to test 57 intrachromosomal and 4 interchromosomal high confidence SVs found by the BreakDancer and/or SVDetect (Table S1
). From this set, we validated 23 large (1–539 kb) deletions, 10 inversions, 5 duplications and 2 translocations as tumor-specific, and the specificity of the PCR products was confirmed by Sanger sequencing (). Thus, 40 of the 61 high confidence SVs identified by our method were validated as tumor specific SVs. The other 19 intrachromosomal and 2 interchromosomal events were PCR validated as germline SVs. 16 out of 21 of these SVs had at least one supporting read pair in the original control dataset and failed to be detected due to our 2 supporting read cutoff. These false positives can be avoided either by sequencing the control dataset to higher coverage, when possible, or examining the control dataset using the 1 read pair cutoff.
Tumor-specific SVs validated by PCR and Sanger sequencing.
Among validated tumor-specific SVs, we found several tumor-suppressor gene deletions, as well as some expected canonical antigen receptor gene rearrangements (). Notably, two tumor-specific translocations, two inversions and one validated tumor-specific duplication show signs of a complex rearrangement 
First, our work shows that simulating paired-end sequencing can be an effective way to develop the analysis strategy, predict coverage necessary to detect DNA breakpoints in different genomic environments and to separate sources of false positive calls into sample related and those that arise due to analysis artefacts.
Second, we have found that a control dataset obtained from the same animal is essential to reduce a large number of germline SVs that exist between commonly used laboratory mouse strains, even in cases when the animals are backcrossed a number of times to the reference genome strain.
Third, we have defined two types of duplicated reads leading to false SV prediction, both arising from PCR over-amplification during sample preparation: perfect duplicates, with matching genomic coordinates, and those with 1–2 bp coordinate offset that are not detected using existing tools. We present a method to remove SVs resulting from those reads using either SVDetect or BreakDancer.
Fourth, we find that removing reads with low BWA mapping quality, as well as SV calls that overlap with genomic regions of low mappability, is a very efficient way to filter our large numbers of false positives that arise due to alignment errors.
Finally, using this method, we validated a fairly large number of true tumor-specific SVs from a rather small dataset. Starting with a large number of candidate events, we were able to rapidly discard majority of false positives and focus on a tractable number of candidates for manual analysis (~5% of the initial number of calls from this dataset). We validated our filtering method with two widely used SV detection programs, SVDetect and BreakDancer, showing that it is universally applicable, rather than being restricted to a single program and its possible shortcomings. The final number of candidate events, as well as the number of false negatives, is a function of coverage and the stringency of filtering parameters. Depending on the needs of the experiment, these parameters can be set to a desired level in order to achieve an acceptable number of false positives vs. false negatives.
Our method should be applicable for future work in model organisms as well as in human tumors. In the clinical context, higher coverage would be needed to reduce the number of undetected germline SVs, as well as to improve the detection of low frequency somatic SVs.