Multiple somatic rearrangements are often found in cancer genomes. However, the underlying processes of rearrangement and their contribution to cancer development are poorly characterised. Here, we employed a paired-end sequencing strategy to identify somatic rearrangements in breast cancer genomes. There are more rearrangements in some breast cancers than previously appreciated. Rearrangements are more frequent over gene footprints and most are intrachromosomal. Multiple architectures of rearrangement are present, but tandem duplications are common in some cancers, perhaps reflecting a specific defect in DNA maintenance. Short overlapping sequences at most rearrangement junctions suggest that these have been mediated by non-homologous end-joining DNA repair, although varying sequence patterns indicate that multiple processes of this type are operative. Several expressed in-frame fusion genes were identified but none were recurrent. The study provides a new perspective on cancer genomes, highlighting the diversity of somatic rearrangements and their potential contribution to cancer development.
Cytogenetic studies over several decades have shown that somatic rearrangements, in particular chromosomal translocations, occur in many human cancer genomes1-3. The prevalence of rearrangements is, however, variable with some cancer genomes exhibiting few and others, including the genomes of many common adult epithelial cancers, showing many.
Somatic rearrangement is a common mechanism for the conversion of normal genes into cancer genes1-5. Indeed, of the ~400 genes that are currently known to be somatically mutated and implicated in cancer development, most are altered by genomic rearrangement (http://www.sanger.ac.uk/genetics/CGP/Census/). These rearrangements usually result in the formation of a fusion gene, derived from two disrupted normal genes, from which a fusion transcript and protein is generated. In some instances, however, rearrangements place an intact gene under the control of new regulatory elements or cause internal reorganisation of a gene. These events usually result in activation of the protein to contribute to oncogenesis, as in the paradigm of the BCR-ABL fusion gene in chronic myeloid leukaemia6.
Most of the currently known fusion genes are operative in leukaemias, lymphomas and sarcomas1,3 although similar cancer-causing rearrangements in RET, NTRK1, NTRK3, BRAF and TFE3 were reported in rare epithelial cancers many years ago (http://www.sanger.ac.uk/genetics/CGP/Census/). Activated fusion genes in common adult epithelial cancers such as those of the ETS transcription factor family in prostate cancer7 and the ALK gene in lung adenocarcinoma8 were discovered only recently and not through the conventional strategy of positional cloning of cytogenetically ascertained chromosomal breakpoints. Their late emergence primarily reflects the complexity of cytogenetically visible rearrangement patterns in the genomes of many adult epithelial cancers and the consequent difficulty in prior selection of driver rearrangements for further study among the many likely passenger changes9. Rearrangements also constitute a subset of mutational events that result in inactivation of recessive cancer genes (tumour suppressor genes) and underlie genomic amplifications that result in increased copy number of cancer genes10-12.
In recent years, the diversity of prevalence and pattern of point mutations and copy number changes in cancer genomes have been elucidated by systematic PCR-based resequencing studies and by application of high resolution copy number arrays respectively13-16. Understanding of the basic patterns of rearrangement in most cancers, however, remains rudimentary. We recently demonstrated that second generation sequencing of both ends of large numbers of DNA fragments generated from cancer genomes allows comprehensive characterisation of rearrangements17. Here we apply this approach to a series of breast cancers to explore patterns of rearrangement and their potential contribution to cancer development.
Twenty four breast cancers were investigated by sequencing both ends of ~65,000,000 randomly generated ~500bp DNA fragments from each cancer on Illumina GAII Genome Analysers (Supplementary Figure 1). The series included primary tumours and immortal cancer cell lines, examples of the commonest phenotypically defined subtypes and cases with high risk germline predisposition mutations in BRCA1, BRCA2 and TP53 (Table 1). Rearrangements were initially identified as discordant paired-end reads which did not map back to the reference human genome in the correct orientation with respect to each other and/or within ~500bp of each other. These were subsequently confirmed and evaluated for somatic or germline origin.
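The discordance test described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the `ReadPair` type, the `is_discordant` function and the `max_span` cutoff of 600bp are hypothetical, and the code assumes a standard paired-end library in which a correctly mapped pair has its leftmost read on the forward strand and its rightmost read on the reverse strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapped positions of the two ends of one sequenced fragment (hypothetical type)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def is_discordant(pair: ReadPair, max_span: int = 600) -> bool:
    """Return True if the pair does not look like a normal ~500bp fragment.

    max_span is an assumed cutoff, not a value taken from the study.
    """
    # Ends on different chromosomes can never form a normal fragment.
    if pair.chrom1 != pair.chrom2:
        return True
    # Order the two ends by genomic position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    # Assumed correct orientation: forward read on the left, reverse on the right.
    if (left_strand, right_strand) != ("+", "-"):
        return True
    # Correct orientation but too far apart.
    return right_pos - left_pos > max_span
```

Pairs flagged in this way are only candidates; as stated above, each one still has to be confirmed and assessed for somatic or germline origin.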
2,166 confirmed somatic rearrangements were identified among the 24 cancers (Supplementary Tables 1 and 2). The presence of multiple read pairs spanning each rearrangement (Supplementary Table 1), easily detectable copy number changes associated with many of them, and targeted FISH studies on a subset indicate that most rearrangements are in cells of the dominant cancer clone and not in minority cell populations. By investigating whether rearrangements were found for 200 changes in genomic copy number, we estimate that ~50% of rearrangements have been detected.
Some cancers carried many rearrangements. For example the breast cancer cell line HCC38 has at least 238 (Figure 1), many more than could have been predicted from cytogenetic studies (http://www.path.cam.ac.uk/~pawefish/BreastCellLineDescriptions/HCC38.html). However, there is substantial variation in prevalence, with some primary breast cancers carrying only a single rearrangement (Figure 1, Supplementary Tables 1 and 2, Supplementary Figure 2). Overall, breast cancer cell lines showed more rearrangements (median 101, range 58-245) than primary cancers (median 38, range 1-231). This difference may be due to the acquisition of additional rearrangements during in vitro culture. However, it may also reflect less sensitive detection of rearrangements in primary cancers due to contaminating normal tissue or the relative propensity of some subclasses of breast cancer, for example metastases generating pleural effusions, to become established in culture. In some cancers rearrangements are evenly distributed through the genome. By contrast, in others they cluster in and connect genomic regions showing amplification. The rearrangement architecture in such amplicons is often highly complex10,11.
The orientations and relative chromosomal locations of the two genomic segments forming each rearrangement can be used as the basis of a rearrangement classification system. This may be further elaborated using information from copy number and other analyses that allow reconstruction of the genomic architecture associated with each rearrangement. For the purposes of this report, we have derived a simplified version of this system, classifying each rearrangement according to a) whether it is in an amplicon or not, b) if not in an amplicon whether it is interchromosomal or intrachromosomal and c) if intrachromosomal whether it results in a deletion, tandem duplication, or rearrangement of inverted orientation (Figure 1b).
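The simplified classification (a to c) can be written as a short decision function. The sketch below is illustrative only: the function name and arguments are hypothetical, amplicon membership is assumed to be supplied by a separate copy number analysis, and the mapping from read orientation to rearrangement class follows the usual paired-end convention (forward then reverse for a deletion, reverse then forward for a tandem duplication, same strand for inverted orientation), which is an assumption and not a detail stated in the text.

```python
def classify_rearrangement(chrom1, pos1, strand1, chrom2, pos2, strand2,
                           in_amplicon=False):
    """Assign one rearrangement to a class of the simplified scheme (sketch)."""
    # (a) Rearrangements inside amplicons form their own class.
    if in_amplicon:
        return "amplicon"
    # (b) Outside amplicons, separate inter- from intrachromosomal events.
    if chrom1 != chrom2:
        return "interchromosomal"
    # (c) For intrachromosomal events, use the strands of the left and right ends.
    (_, left_strand), (_, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if left_strand == right_strand:
        return "inverted orientation"
    if (left_strand, right_strand) == ("+", "-"):
        return "deletion"
    return "tandem duplication"
```

Checking the amplicon status first mirrors the order of the three criteria in the text, so that amplicon rearrangements are never counted in the other classes.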
There were 1,311 intrachromosomal and 239 interchromosomal rearrangements (excluding rearrangements within amplicons of which 397 were intrachromosomal compared to 219 interchromosomal) (Table 2, Supplementary Tables 1 and 2). Thus intrachromosomal rearrangements substantially outnumber interchromosomal rearrangements in breast cancer genomes. The breakpoints of 1574 out of 1708 intrachromosomal rearrangements were within 2Mb of each other. These patterns are presumably attributable to the greater likelihood of physical interaction between two positions on the same chromosome compared to positions on different chromosomes, perhaps due to individual chromosomes occupying domains in the nucleus, and between two locations that are in close proximity on the same chromosome. The predominance of intrachromosomal rearrangements has not previously been appreciated because of the limited resolution of cytogenetic studies and because FISH based approaches, such as spectral karyotyping, generally employ a single fluorescence colour per chromosome.
The most commonly observed architecture of rearrangement was tandem duplication (there were 739 tandem duplications, 357 deletions and 215 rearrangements with inverted orientation, Table 2, Supplementary Tables 1 and 2). The evidence that these are truly tandem insertions derives from the structure of the genomic rearrangement itself supported by cDNA and FISH studies (Tables 3 and 4, Supplementary Figures 3 and 4). The duplicated segments ranged in size from 3kb to greater than 1Mb (Supplementary Table 1). Despite being the commonest class of rearrangement in breast cancer, tandem duplications have previously been overlooked because they are intrachromosomal and involve small chromosomal segments beyond the resolution of cytogenetics or previous generations of copy number arrays.
There were major differences between individual breast cancers in the number of tandem duplications (Figure 1b, Supplementary Figure 5). Some exhibited more than one hundred while others showed few or none. The mechanistic basis for these differences is unknown, but may be due to a defective DNA maintenance process which confers a “mutator phenotype”. This would be similar, in principle, to the defects in DNA mismatch repair that are responsible for microsatellite instability in some cancers. This putative mutator phenotype is unlikely to be directly related to deficiencies in BRCA1 or BRCA2 mediated DNA repair as some cancers arising in individuals with germline mutations in these genes have few tandem duplications. Large numbers of tandem duplications were generally observed, however, in cancers that do not express estrogen and progesterone receptors.
DNA sequence across the rearrangement junction was obtained from 1,821 (3,642 breakpoints) of the 2,166 confirmed rearrangements (Supplementary Table 3). The sequences 100bp either side of each breakpoint were examined for the presence of motifs and sequence content. No striking signatures were observed, although there was a slight deficit of C:G base pairs compared to the genome as a whole (p<0.001) and modest enrichment of some motifs (Supplementary Tables 4 and 5). However, no single motif was commonly found in any class of rearrangement.
The sequences either side of each rearrangement junction were then compared to each other. In most instances the two contributing DNA segments showed a short stretch of identical sequence, known as an overlapping microhomology, immediately adjacent to the rearrangement junction which was present only once in the rearranged DNA (Figure 1c, Table 2, Supplementary Table 3 and Supplementary Figure 6). Approximately 15% of rearrangements showed non-templated sequence at the rearrangement junction (Table 2, Supplementary Table 3 and Supplementary Figure 7). In many, this is only a few base pairs long, although the longest segment of this type was 154bp. A further 3.8% of rearrangements included one or more small fragments of DNA (<500bp) from elsewhere in the genome interposed between the rearrangement breakpoints identified by the paired end sequencing. We have previously termed these small DNA fragments “genomic shards”10,17.
Overlapping microhomologies and non-templated sequences at rearrangement junctions are often considered to be signatures of a non-homologous end-joining (NHEJ) DNA double strand break repair process18-21. The segments of overlapping microhomology are believed to mediate alignment of the two DNA fragments that are joined. It has, however, recently been proposed that complex germline rearrangements with genomic shards and overlapping microhomology might be due to replicative mechanisms21. The small proportion of complex rearrangements with genomic shards may indicate that this mechanism is relatively infrequently operative in breast cancer.
It has previously been suggested that there exist multiple NHEJ repair processes which may be characterised by different lengths of overlapping microhomology at rearrangement junctions21,22. To investigate this possibility we examined the distribution of microhomology lengths in each breast cancer (Figure 1c, Supplementary Figure 6). In some breast cancers, rearrangements with zero base pairs of microhomology were most frequent, whilst in others rearrangements with two or more base pairs were the commonest class. Rearrangements with zero base pairs of microhomology were most common in amplicons, in contrast to all other classes of rearrangement in which the modal class of microhomology was 2bp (Figure 2). These differences are unlikely to be due to chance (p<0.001) and suggest that there are at least two classes of NHEJ repair which are operative to different extents in different breast cancers21.
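As an illustration of how microhomology length and its modal class might be computed, the sketch below measures the stretch of sequence that could have come from either contributing segment. It is not the authors' code: the function names are hypothetical, both sequences are assumed to be given in the orientation of the rearranged DNA, and the junction is assumed to be reported at its leftmost possible position, so that the microhomology is the shared prefix of the reference sequence continuing past the first breakpoint and the retained sequence of the second segment.

```python
from collections import Counter


def microhomology_length(ref_after_first_break: str, second_segment: str) -> int:
    """Count leading bases shared by the two sequences at a left-aligned junction."""
    length = 0
    for base_a, base_b in zip(ref_after_first_break.upper(), second_segment.upper()):
        if base_a != base_b:
            break
        length += 1
    return length


def modal_microhomology(lengths):
    """Return the most frequent microhomology length in one sample."""
    return Counter(lengths).most_common(1)[0][0]


# Hypothetical junction: the reference continues "AGTTC" beyond the first
# breakpoint and the second segment starts "AGCCA", so "AG" is shared.
example_length = microhomology_length("AGTTC", "AGCCA")
```

Applying `modal_microhomology` to the lengths from each cancer, or from each rearrangement class, gives the modal classes that are compared in the text.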
Because the analysis of paired-end sequences requires alignment to the reference human genome and because sequences within repetitive elements are more likely to misalign it is conceivable that we have missed classes of rearrangement mediated by repeats. To investigate this possibility further we constructed libraries from the breast cancer cell line HCC1187 in which the sequenced ends were 3kb rather than 500bp apart. The 3kb paired ends will flank the majority of common repeats and thus allow detection of rearrangements mediated by them. Although additional rearrangements were detected, a distinct class of repeat mediated rearrangement was not found (data not shown).
Fifty per cent of rearrangements fell within the footprint of a protein coding gene compared to 40% expected by chance (p<10−7). The reasons for this striking enrichment of rearrangements in genic regions are not clear. Since rearrangements that confer selective advantage on a cancer clone are a priori more likely to be located in genes it is conceivable that some of this effect is due to selection and that a subset of rearrangements is implicated in cancer development. However, it may be more plausible that there are structural properties of genic regions that increase the likelihood of a DNA double strand break occurring, perhaps dependent on active transcription or chromatin configuration.
Twenty-nine rearrangements were predicted to generate in-frame gene fusions. Using exon-exon RT-PCR, rearranged transcripts from 19/22 in-frame fusion genes in non-amplified regions and from 2/6 (1 not determined) in amplified regions were found (Table 3). Thus most in-frame rearranged genes from non-amplified regions have the requisite 5′ and 3′ DNA sequences for transcript formation and stability. Conversely most from amplified regions do not and these rearrangements probably represent fragments of one or both genes reflecting the high density of rearrangements often present in these regions10. Sixty-six in-frame internally rearranged genes were also identified. 39/58 assessed showed rearranged transcripts (Table 4). In some cancers multiple in-frame rearranged and expressed genes are present (Tables 3 and 4, Supplementary Tables 6 and 7).
Several in-frame fusion genes are potentially of biological interest as candidates for new cancer genes. Notably, two were members of the ETS family of transcription factors. ETV6 is rearranged to form cancer genes with multiple different fusion partners in leukaemias23, congenital fibrosarcoma24 and myelodysplastic syndrome. It also forms a rearranged cancer gene with NTRK3 in the rare subclass of secretory breast cancer25. Here, ETV6 was fused to ITPR2 (Figure 3) through an inversion involving intron 2, a site previously reported in other cancers23, and was rearranged in a further breast cancer without clearly forming an in-frame fusion gene. ITPR2 encodes Inositol 1,4,5-trisphosphate receptor Type 2, which is involved in signal transduction and regulation of cellular calcium fluxes. The second rearrangement fused EHF, which has not been previously implicated in cancer development, to NFIA, a transcription factor involved in adenovirus replication (Supplementary Figure 3).
Fusion genes implicated in cancer development are likely to be recurrent. However, none of the novel fusion genes we identified was present in more than one out of the 24 cancers screened. Three expressed, in-frame fusion genes were examined by FISH (ETV6-ITPR2, NFIA-EHF and SLC26A6-PRKAR2A) and 20 by RT-PCR across the rearranged exon-exon junction in 288 additional breast cancer cases. No further examples were found indicating that they are either passenger events or that they contribute infrequently to breast cancer development.
Rearrangements were found in several known cancer genes including BRAF, PAX3, PAX5, NSD1, PBX1, MSI2 and ETV6 (see above). Each of these is a partner in a fusion gene in other classes of human cancer and was rearranged in two of the 24 samples analysed, although in many cases an in-frame fusion gene was not obviously generated (Supplementary Table 8). Rearrangements found in RB, APC, FBXW7 and other recessive cancer genes may have resulted in gene inactivation to contribute to cancer development.
Several other genes were rearranged in multiple cancers (Supplementary Table 9). Some are in amplified regions surrounding ERBB2 (for example ACCN1 which is rearranged in four out of the 24 breast cancers) or other known targets of genomic amplification in breast cancer. It is likely that these are recurrently rearranged because of the high density of rearrangements associated with these regions of recurrent genomic amplification. Others, however, are not in regions of genomic amplification. For example, SHANK2 was rearranged in five of the 24 breast cancers, while IGF1R, GRHL2, EFNA5, and MACROD2 were each rearranged in four. These recurrently rearranged genes generally have large genomic footprints and may simply represent bigger targets for randomly positioned rearrangements (Supplementary Table 9). For some, however, an elevated local rate of DNA double strand breakage (“fragility”) may also contribute to the clustering of rearrangements.
This study has generated the most comprehensive insight thus far into patterns of somatic rearrangement in cancer genomes. Most rearrangements in breast cancer are intrachromosomal. Tandem duplications appear to be the most common subclass and are known to form activated cancer genes in other cancer types26,27. The high prevalence of tandem duplications in a subset of cancers suggests the presence of a defect in DNA maintenance which generates this particular class of rearrangement. The underlying abnormality responsible for this phenotype is unknown. It may reside in the licensing mechanisms responsible for defining, priming and monitoring origins of DNA replication28.
Breast cancers are highly heterogeneous and are subclassified on the basis of estrogen receptor, progesterone receptor and ERBB2 expression and by mRNA expression profiles29,30. Subclasses defined in these ways show correlations with patterns of genomic alteration31,32. Breast cancers with many tandem duplications are usually estrogen and progesterone receptor negative and classified by expression profile as basal-like. In contrast, cancers with few rearrangements or with rearrangements within amplicons (other than those involving ERBB2) are usually estrogen receptor positive and classified as Luminal A and Luminal B types respectively.
Many novel in-frame fusion genes or internally rearranged genes were identified, most of which were expressed. None, however, were found to be recurrent. Approximately 2% of rearrangements would be expected to generate an in-frame fusion gene by chance, compared to 1.6% observed. It is therefore likely that most are passenger events. Nevertheless, as previously suggested for somatic point mutations13,14 it may be that multiple, infrequently rearranged cancer genes are operative in breast cancer as they are in leukaemia2. Furthermore, detailed analysis of rearrangement breakpoints will be necessary to investigate the possibility of fusions between promoters/regulatory elements and intact genes that result in deregulation of expression. Much larger series will be required to investigate comprehensively the possibility of recurrent cancer-causing rearrangements in breast cancer.
Exhaustive sequencing of substantial numbers of cancer genomes to yield complete catalogues of all classes of somatic mutations will gather pace over the next few years. The current study offers insight into the complexity of rearrangement patterns that will be encountered in solid tumour genomes, demonstrates the potential for generation of active rearranged genes that may be implicated in cancer development and illustrates the types of information that will emerge on mutational processes that have been operative during the development of individual cancers.
Genomic libraries from nine breast cancer cell lines and fifteen primary breast cancers were generated using 5μg of total genomic DNA17. Briefly, 5μg of genomic DNA was randomly fragmented to between 200 and 700bp by focused acoustic shearing (Covaris Inc.). These fragments were electrophoresed on a 2% agarose gel wherefrom the 450-550bp fraction was excised and extracted using the Qiagen gel extraction kit (with gel dissolution in chaotropic buffer at room temperature to ensure recovery of AT rich sequences). The size fractionated DNA was end repaired using T4 DNA polymerase, Klenow polymerase and T4 polynucleotide kinase. The resulting blunt ended fragments were A-tailed using a 3′-5′ exonuclease-deficient Klenow fragment and ligated to Illumina paired-end adaptor oligonucleotides in a ‘TA’ ligation at room temperature for 15 minutes. The ligation mixture was electrophoresed on a 2% agarose gel and size-selected by removing a 2 mm horizontal slice of gel at ~600bp using a sterile scalpel blade. DNA was extracted from the agarose as above. 10ng of the resulting DNA was PCR amplified for 18 cycles using 2 units of Phusion polymerase. PCR cleanup was performed using AMPure beads (Agencourt BioSciences Corporation) following the manufacturer’s protocol. We prepared Genome Analyzer paired-end flow cells on the supplied Illumina cluster station and generated 37bp paired-end sequence reads on the Illumina Genome Analyser platform following the manufacturer’s protocol. Images from the Genome Analyzer were processed using the manufacturer’s software to generate FASTQ sequence files. These were aligned to the human genome (NCBI build 36.2) using the MAQ algorithm v0.4.333.
Reads which failed to align in the expected orientation or distance apart were further evaluated using the SSAHA algorithm34 to remove mapping errors in repetitive regions of the genome. In addition, during the PCR enrichment step, multiple PCR products derived from the same genomic template can occasionally be sequenced. To remove these, read pairs where both ends mapped to identical genomic locations (plus or minus a single nucleotide) were considered PCR duplicates, and only the read pair with the highest mapping quality was retained. Further, erroneously mapped reads originating from DNA present in sequence gaps in NCBI build 36.2 were removed by excluding the highly repetitive regions within 1Mb of a centromeric or telomeric sequence gap. Additional read pairs where both ends mapped to within less than 500bp of one another, but in the incorrect orientation, were excluded from analysis unless support for a putative rearrangement was indicated by additional read pairs. The majority of these singleton read pairs are likely to be artifacts resulting from either intramolecular rearrangements generated during library amplification or mispriming of the sequencing oligonucleotide within the bridge amplified cluster. Finally, read pairs where both ends mapped to within 500bp of a previously identified germline structural variant were removed from further analysis, as these were likely to represent the same germline allele.
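The PCR duplicate filter can be sketched as below. This is a hedged illustration, not the code used in the study: `MappedPair` and `remove_pcr_duplicates` are hypothetical names, strand information is omitted for brevity, and duplicates are resolved greedily by visiting pairs in order of decreasing mapping quality so that the best pair at each location is the one retained.

```python
from typing import List, NamedTuple


class MappedPair(NamedTuple):
    """Mapped coordinates and mapping quality of one read pair (hypothetical type)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    mapq: int


def remove_pcr_duplicates(pairs: List[MappedPair], tolerance: int = 1) -> List[MappedPair]:
    """Keep the highest-quality pair among pairs whose ends both map within
    `tolerance` nucleotides of each other."""
    kept = {}
    offsets = range(-tolerance, tolerance + 1)
    # Visit the best pairs first so that later, lower-quality copies are dropped.
    for pair in sorted(pairs, key=lambda p: p.mapq, reverse=True):
        is_duplicate = any(
            (pair.chrom1, pair.pos1 + d1, pair.chrom2, pair.pos2 + d2) in kept
            for d1 in offsets
            for d2 in offsets
        )
        if not is_duplicate:
            kept[(pair.chrom1, pair.pos1, pair.chrom2, pair.pos2)] = pair
    return list(kept.values())
```

The dictionary lookup keeps the check for neighbouring positions at constant cost per pair, which matters when tens of millions of read pairs are processed.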
Full methods for generation of high resolution, genome wide copy number information can be found in reference17. Briefly, the human reference genome was divided into bins of ~15kb of mappable sequence and high quality, correctly mapping read pairs, with a MAQ alternative mapping quality ≥35, were assigned to their correct bin and plotted. A binary circular segmentation algorithm originally developed for genomic hybridisation microarray data35 was applied to these raw plots to identify change-points in copy number by iterative binary segmentation.
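The binning step that precedes segmentation can be illustrated with the sketch below. It makes two simplifying assumptions that differ from the method described: bins have a fixed width of 15kb in genomic coordinates instead of holding ~15kb of mappable sequence, and reads are given as plain `(chromosome, position, mapping quality)` tuples. The function name is hypothetical, and the segmentation algorithm itself is not reproduced here.

```python
from collections import Counter


def bin_read_counts(reads, bin_size=15_000, min_mapq=35):
    """Count high-quality reads per fixed-width genomic bin.

    reads: iterable of (chrom, pos, mapq) tuples.
    Returns a Counter keyed by (chrom, bin_index).
    """
    counts = Counter()
    for chrom, pos, mapq in reads:
        # Only reads passing the mapping quality threshold contribute.
        if mapq >= min_mapq:
            counts[(chrom, pos // bin_size)] += 1
    return counts
```

The per-bin counts produced this way correspond to the raw plots to which a segmentation algorithm would then be applied to locate change-points in copy number.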
The following criteria were used to determine which incorrectly mapping read pairs were evaluated by confirmatory PCR: (i) Reads mapping ≥10kb apart spanned by ≥2 independent read pairs (where at least one read pair had an alternative mapping quality ≥35); (ii) Reads mapping ≥10kb apart spanned by 1 read pair (with an alternative mapping quality ≥35), with both ends mapping to within 100kb of a change-point in copy number identified by the segmentation algorithm; (iii) Reads mapping ≥600bp apart spanned by ≥2 independent read pairs (where at least one read pair had an alternative mapping quality ≥35) with both ends mapping to within 100kb of a change-point in copy number identified by the segmentation algorithm; (iv) Selected read pairs mapping between 600bp and 10kb apart spanned by ≥2 independent read pairs (where at least one read pair had an alternative mapping quality ≥35).
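The four selection criteria can be expressed as a single predicate, sketched below. The function and argument names are hypothetical. `best_mapq` stands for the highest alternative mapping quality among the supporting read pairs, `near_changepoint` means that both ends lie within 100kb of a copy number change-point, and `selected` is a flag standing in for the unspecified choice of pairs under criterion (iv). Interchromosomal candidates are assumed to be passed with an infinite span.

```python
def pcr_selection_criterion(span_bp, n_pairs, best_mapq,
                            near_changepoint=False, selected=False):
    """Return the criterion ("i" to "iv") under which a candidate is sent
    for confirmatory PCR, or None if it meets none of them."""
    # Every criterion requires at least one pair with mapping quality >= 35.
    if best_mapq < 35:
        return None
    if span_bp >= 10_000 and n_pairs >= 2:
        return "i"
    if span_bp >= 10_000 and n_pairs == 1 and near_changepoint:
        return "ii"
    if span_bp >= 600 and n_pairs >= 2 and near_changepoint:
        return "iii"
    if 600 <= span_bp < 10_000 and n_pairs >= 2 and selected:
        return "iv"
    return None
```

Returning the criterion label instead of a simple yes or no makes it possible to tabulate how many confirmed rearrangements came from each rule.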
Primers were designed to span the possible breakpoint by locating them in the 1 kb region outside the paired-end reads, for a maximum product size of 1kb. PCR reactions were performed on tumor and normal genomic DNA for each set of primers at least twice, using the following thermocycling parameters: 95°C x 15min (95°C x 30s, 60°C x 30s, 72°C x 30s) for 30 cycles, 72°C x 10min. Products giving a band were sequenced by conventional Sanger capillary methods and compared to the reference sequence to identify breakpoints. Somatically acquired rearrangements were defined as those generating a reproducible band in the tumor DNA with no band in the normal DNA following PCR amplification, together with unambiguously mapping sequence data suggesting a rearrangement.
Total RNA (100ng) from the tumor and matched constitutional DNA/lymphoblastoid cell lines was reverse transcribed into single stranded cDNAs using Reverse Transcriptase II (Invitrogen) and Oligo (dT)12-18 (Invitrogen) in a 20μl reaction at 25°C for 10min, 42°C for 50min, 72°C for 15min. The cDNA was then diluted with 30μl of distilled water before subsequent PCR amplification. Resulting bands were sequenced to confirm the specificity of the reaction and the presence of the aberrant transcript. To detect fusion transcripts, we used forward primers in the putative 5′ partner gene and reverse primers from the 3′ partner. To detect rearranged transcripts, we used forward primers and reverse primers corresponding to the predicted exons fused. When multiple bands, possibly suggestive of splice variants, were detected, all bands were excised from the gel and capillary sequenced separately.
FISH based on the split apart or fusion probe strategy was used to validate NFIA-EHF gene aberrations. For the NFIA-EHF fusion probe, BAC clones (http://www.ensembl.org/, Ensembl release 54) RP11-32I17, chromosome 1: 61,191,261-61,339,873; RP11-364M11, chromosome 1: 61,064,196-61,228,554 (red) and RP11-64P01, chromosome 11: 34,699,699-34,860,527; RP11-277N08, chromosome 11: 34,722,104-34,965,946 (green) were used. For the EHF split-apart probe, BAC clones RP11-64P01, chromosome 11: 34,699,699-34,860,527; RP11-277N08, chromosome 11: 34,772,104-34,965,946 (green) and RP11-567H10, chromosome 11: 34,123,423-34,294,895; RP11-278N12, chromosome 11: 34,086,610-34,248,310 and RP11-686L07, chromosome 11: 33,936,895-34,109,642 (red) were used.
Dual colour FISH was used to detect the SLC26A6-PRKAR2A tandem duplication using BAC clones RP11-527M10, Chromosome 3: 45,948,424-46,115,480 (green); RP11-148G20, Chromosome 3: 48,575,991-48,781,362 (red).
BAC clones were purchased from BACPAC resource (Children’s Hospital Oakland; http://bacpac.chori.org/), and DNA from all BAC clones was purified, labelled and individually verified for specificity by FISH and direct sequencing as described previously36. BAC DNA was labelled with either biotin or digoxigenin-11-dUTP (Roche) using the Bioprime kit (Invitrogen) and FISH was performed as described previously36. Biotinylated probes were detected with Cy5–Streptavidin (Invitrogen, Zymed Laboratories) and digoxigenin-labeled BACs with anti-digoxigenin-fluorescein (Roche; green). Nuclei and chromosomes were counterstained with DAPI. Images were captured with a Zeiss Axioplan 2 microscope equipped with a CCD camera (Applied Imaging Diagnostic Instruments) and Cytovision software, version 2.81 (Applied Imaging). Only morphologically intact and non-overlapping nuclei were analyzed.
All 1832 breakpoints defined to the basepair level were used in the analysis of breakpoint sequence context, excluding shards and overlapping regions. Analysis was performed on all breakpoints together, and also on subsets divided into deletions, tandem duplications, amplicons, other intrachromosomal events, and all interchromosomal events. 10bp and 100bp on either side of the breakpoint sites were extracted for analysis. As a control, for each real breakpoint, 100 sequences of the same length were extracted from the regions extending from 10,000 to 20,000bp away from either side of the break. These matched control sequences were used as a comparison in the analysis to account for any regional differences such as large variations in GC or repeat content. The lengths of nucleotide tracts (polynucleotide, polypurine/polypyrimidine, and alternating polypyrimidine/polypurine) were compared in the breakpoint and control regions using a one-tailed Mann-Whitney U test, and the average GC content and presence of known motifs associated with DNA breaks37 were compared using a Fisher exact test.
To determine whether breakpoints were enriched in genic regions, we compared the number of breakpoints falling within genes to an empirically derived expected proportion. We classified breakpoints as genic or intergenic according to whether their coordinates fell within a gene as annotated by Ensembl (http://www.ensembl.org/, Ensembl release 54). To account for the fact that short reads are difficult to align to some areas of the genome, we derived the expected proportion of breakpoints that should fall within a gene from the actual proportion of read pairs that aligned to genic regions. Treating each breakpoint of a rearrangement independently, we then compared the number of breakpoints falling within a gene to this expected proportion using a Chi-squared test to obtain a p-value for the overrepresentation of breakpoints in genes.
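A Chi-squared comparison of this kind can be sketched with the standard library alone. The example below is a goodness-of-fit test with one degree of freedom on the genic and intergenic counts. The function name is hypothetical, and the counts in the usage example (500 genic breakpoints out of 1,000, against an expected genic fraction of 0.40) are invented for illustration and are not data from the study.

```python
import math


def genic_enrichment_chi2(n_genic, n_total, expected_fraction):
    """Chi-squared goodness-of-fit test (1 degree of freedom) of observed
    genic/intergenic breakpoint counts against an expected genic fraction.

    Returns (statistic, p_value).
    """
    expected_genic = n_total * expected_fraction
    expected_intergenic = n_total - expected_genic
    observed_intergenic = n_total - n_genic
    statistic = (
        (n_genic - expected_genic) ** 2 / expected_genic
        + (observed_intergenic - expected_intergenic) ** 2 / expected_intergenic
    )
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, for illustration only.
statistic, p_value = genic_enrichment_chi2(500, 1000, 0.40)
```

Because the expected fraction is derived from the proportion of read pairs aligning to genic regions, differences in how well short reads map to genic and intergenic sequence are already absorbed into the null expectation.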
Comparison of microhomology and non-templated sequence distributions across individual samples was performed using Scholz and Stephens’ k-sample generalisation of the Anderson-Darling goodness-of-fit test, with 10,000 data permutations to generate the statistic’s null distribution38.
The authors are grateful to Maryou Lambros, Felipe Geyer and Radost Vatcheva for their assistance in the FISH experiments. We would like to acknowledge the support of the Kay Kendall Leukaemia Fund under Grant KKL282, Human Frontiers Award reference LT000561/2009-L, the Dana-Farber/Harvard SPORE in breast cancer under NCI grant reference CA089393, Breakthrough Breast Cancer, the Research Council of Norway Grants no 155218 and 175240, and the Wellcome Trust under grant reference 077012/Z/05/Z.
Supplementary Information is included with this manuscript.
Author Contributions M.R.S., P.A.F., P.J.C. and P.J.S. designed the experiment. S.E., D.J.M., P.J.S., M-L.L., I.V., L.J.M., J.B., M.A.Q., H.S., C.C., R.N., A.M.S., A.L., J.W.M.M. and C.L. carried out laboratory analyses. J.A.F., J.S.R-F., L.vV., A.L.R., D.P.S. and A.L.B-D. provided clinical samples. P.J.S., D.J.M., I.V., M-L.L., E.D.P., J.T.S., L.A.S., C.L., C.D.G., M.J., J.W.T., K.W.L., P.J.C., P.A.F., J.S.R-F., J.W.M.M., A.M.S., J.A.F., M.R.S., H.E.G.R., A.L.R., A.L.B-D., L.vV., A.L., P.J.C., P.A.F. performed data analysis, informatics and statistics. M.R.S. wrote the manuscript with comments from P.J.S., P.A.F., P.J.C., A.L.B-D., J.S.R-F., J.A.F., A.L.R., D.P.S., L.vV.