In the present study, we report the identification of several breast cancer fusion transcripts by refining our recently described bioinformatic fusion gene discovery pipeline
[19]. Together with our previous publication, we have thus identified a total of 40 fusion genes, which is, to our knowledge, along with the recent work from the Chinnaiyan group
[10], the highest number of breast cancer gene fusions identified from the same experimental set-up. Importantly, all the fusion genes we have identified have also been experimentally validated and found to be specific to the sample in which the RNA-seq experiment suggested them to be present. Although in recent years there has been a growing number of reports on the identification of fusion genes in solid tumors
[24],
[25], including breast cancer
[10],
[13]–
[18], fusion gene detection has been complicated by the high false positive rate
[13],
[26]–
[29]. In our experience, the most indicative feature of a true fusion is the tiling pattern of the short reads running across the exon-exon boundary of the fusion gene. In these cases, as few as two paired-end reads and two junction-covering short reads are supportive of a true fusion
[19].
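The minimal read-support criterion described above can be sketched as a simple filter. This is only an illustration under stated assumptions, not the pipeline described in [19]: the `FusionCandidate` record, its field names, the placeholder gene names and the read counts are all hypothetical. Only the two thresholds (two paired-end reads, two junction-covering reads) are taken from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text: two paired-end reads and two junction reads.
MIN_PAIRED_END = 2
MIN_JUNCTION = 2


@dataclass(frozen=True)
class FusionCandidate:
    """Hypothetical record for one predicted fusion (field names are assumptions)."""
    five_prime: str
    three_prime: str
    paired_end_reads: int  # read pairs whose mates map to the two partner genes
    junction_reads: int    # short reads covering the exon-exon fusion boundary


def has_minimal_support(candidate, min_pe=MIN_PAIRED_END, min_junction=MIN_JUNCTION):
    """Return True if both kinds of read evidence reach their thresholds."""
    return (candidate.paired_end_reads >= min_pe
            and candidate.junction_reads >= min_junction)


# Placeholder candidates with made-up counts, for illustration only.
candidates = [
    FusionCandidate("GENE_A", "GENE_B", paired_end_reads=5, junction_reads=3),
    FusionCandidate("GENE_C", "GENE_D", paired_end_reads=2, junction_reads=2),
    FusionCandidate("GENE_E", "GENE_F", paired_end_reads=10, junction_reads=0),
]

supported = [c for c in candidates if has_minimal_support(c)]
for c in supported:
    print(f"{c.five_prime}-{c.three_prime}")
```

Read counts alone do not capture the tiling pattern itself. Checking that the junction reads start at distinct offsets across the boundary would be a natural extension of this sketch.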
In this study, we sought to address the effect of two bioinformatic steps on fusion gene detection: 1) updates in the annotation database Ensembl, and 2) allowing more than one fusion partner per gene per sample. As gene annotation is continuously evolving, updates in the annotation databases yield additional fusion genes, as we demonstrate here. By allowing more than one fusion partner, we sought to identify possible promiscuous fusion genes among our candidate gene fusions. Based on our data, fusion partners that can recombine with several distinct partner genes are often found in breast cancer samples. Some well-established fusion genes have also been shown to be promiscuous, examples including
MLL- in leukemias,
EWS- in sarcomas,
RET- in carcinomas and
TMPRSS2- and
ETV1-fusions in prostate cancer
[4],
[7],
[30],
[31]. We found promiscuous fusion gene partners within the same sample, possibly reflecting the more rearranged genomes of cancer cell lines, whereas the different
MLL-,
EWS- etc. fusions occur one per sample, with diversity in fusion partners between the samples. We found
BCAS3 fused with two different 5′ partners (
BCAS4-BCAS3,
MED13-BCAS3),
MED1 with two separate 3′ genes (
MED1-STXBP4,
MED1-ACSF2) and a 5′ gene (
USP32-MED1), and
TMEM49 was the 3′ partner fused to
AC099850.1 as well as
RPS6KB1 (
[19]). Some of these partner genes have recently also been documented by others as having several fusion partners
[16],
[17]. One of the most intriguing fusion partners is
RPS6KB1, which we found fused to
SNF8 and
TMEM49 (
RPS6KB1-SNF8,
RPS6KB1-TMEM49), and which has also been found as a fusion partner in a subset of clinical breast cancer samples harboring chromosome 17q23 amplification, albeit with structural heterogeneity
[16]. Hence,
RPS6KB1 may function as both a recurrent and a promiscuous fusion gene partner in a subset of breast cancers. Interestingly, both
MED1 and
TMEM49 are also amplified genes, and thus, it could be speculated that repeated chromosome breaks occurring in the same gene during amplicon formation (e.g. during breakage-fusion-bridge cycles) could pave the way for the same gene to form many fusions with different partner genes.
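The identification of promiscuous partners can be illustrated with a short grouping sketch. It uses only the fusion pairs named in this paragraph, written as (5′ gene, 3′ gene). The function names and the `min_partners` parameter are hypothetical, and the grouping ignores the sample of origin because that is not given here; a real analysis would group by sample and gene.

```python
from collections import defaultdict

# Fusion pairs named in the text, as (5' partner, 3' partner).
fusions = [
    ("BCAS4", "BCAS3"),
    ("MED13", "BCAS3"),
    ("MED1", "STXBP4"),
    ("MED1", "ACSF2"),
    ("USP32", "MED1"),
    ("AC099850.1", "TMEM49"),
    ("RPS6KB1", "TMEM49"),
    ("RPS6KB1", "SNF8"),
]


def partners_by_gene(pairs):
    """Map each gene to the set of genes it is fused with, on either side."""
    partners = defaultdict(set)
    for five_prime, three_prime in pairs:
        partners[five_prime].add(three_prime)
        partners[three_prime].add(five_prime)
    return partners


def promiscuous_genes(pairs, min_partners=2):
    """Return genes with at least `min_partners` distinct fusion partners."""
    return {
        gene: sorted(found)
        for gene, found in partners_by_gene(pairs).items()
        if len(found) >= min_partners
    }


result = promiscuous_genes(fusions)
for gene in sorted(result):
    print(gene, "->", ", ".join(result[gene]))
```

On these pairs the sketch flags BCAS3, MED1, TMEM49 and RPS6KB1, the four genes discussed above as having more than one partner.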
An emerging theme with fusion transcripts is their presence at genomic copy number transitions, and moreover, at high-level amplicons
[22],
[23]. We carried out aCGH in order to create a genomic map with roughly 2 kb resolution
[32] to study the chromosomal break points underlying structural fusion-generating rearrangements. Of the 40 fusion genes described by us here and previously
[19], the majority (60%) were associated with gene amplifications (
Figures S1 and
S2). Interestingly, whereas balanced genetic rearrangements (albeit often with microdeletions) prevail in hematopoietic diseases, this seems not to be the case for the gene fusions discovered in solid tumors
[31]. Another structural feature of fusion genes appears to be the prevalence of intrachromosomal fusions over interchromosomal ones. Furthermore, there is a predominance of promoter-donating fusions over the coding-coding and coding-3′UTR ones, which is in line with recent research
[17],
[22]. This raises the question of the possible genomic mechanisms and biological drivers of fusion gene formation. In a network analysis of known fusion genes in cancer, three separate hubs were found, involving mainly transcription factors and tyrosine kinases, pointing to a non-random nature of fusion gene formation
[4]. Others have suggested a more indiscriminate nature of the process involving, for example, spatial proximity of the partner genes in the interphase chromosomes within the nucleus. Also, movement of genes on different chromosome loops into the same transcription factories has been proposed
[22]. However, these mechanisms may apply better to leukemic fusion genes, which are less likely to involve copy number changes, such as high-level DNA amplifications. Chromothripsis, the shattering of chromosomes in a spatially confined region, can also lead to rearrangements and has been documented, for example, in colorectal cancer
[33]. At the sequence level, a small deficit of CG nucleotides and, in some cases, sequences of overlapping microhomology have been documented at the rearrangement points
[22]. These observations could be indicative of genomic instability and non-homologous end-joining being active in fusion gene formation. Indeed, here we observe that the sequenced DNA stretches spanning a few hundred base pairs around the genomic fusion break point are very AT-rich for two out of three examined fusions, the AT-content being 60% for
THRA-AC090627.1 and 75% for
TOB1-SYNRG (
Table S2). In the immediate vicinity of the fusion junction, short stretches of identical sequence, just a few nucleotides long, can be seen on both sides of the break. For
TOB1-SYNRG, a two-nucleotide non-templated sequence is also found at the fusion junction (
Table S2). These findings are in line with previous descriptions of nucleotide-level break point compositions
[22]. Most likely, the process of fusion generation is influenced by a variety of both genomic mechanisms and the potential clonal advantage or disadvantage for cell growth and survival, which both act in a context-dependent manner.
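The two sequence-level measures discussed above, AT-content around the break point and microhomology at the junction, can be sketched as follows. This is a minimal illustration on made-up toy sequences, not the sequences from Table S2. The function names are hypothetical, and microhomology is simplified here to the identical bases shared by the two partner reference sequences immediately upstream and downstream of their respective break points.

```python
def at_content(seq):
    """Fraction of A and T bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("A") + seq.count("T")) / len(seq)


def common_suffix_len(a, b):
    """Number of identical bases shared at the 3' ends of two sequences."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def common_prefix_len(a, b):
    """Number of identical bases shared at the 5' ends of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy flanking sequence (hypothetical): 8 of its 10 bases are A or T.
flank = "ATTTAGCATA"
print(f"AT-content: {at_content(flank):.0%}")

# Toy reference sequences around the break in partner A and partner B
# (hypothetical). The fusion would join a_upstream to b_downstream.
a_upstream, a_downstream = "GGCAAT", "CGTTAC"
b_upstream, b_downstream = "TTGAAT", "CGAGGT"

upstream_homology = common_suffix_len(a_upstream, b_upstream)
downstream_homology = common_prefix_len(a_downstream, b_downstream)
print("shared bases upstream of the break:", upstream_homology)
print("shared bases downstream of the break:", downstream_homology)
```

Non-templated insertions, such as the two-nucleotide sequence noted for TOB1-SYNRG, are not modelled here; detecting them would require aligning the sequenced junction against both references.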
Here we discovered that most of the fusions display transcript variants (ca. 60%, 8/13), which is more than previously anticipated (,
S3,
S4,
Table S3). Previously, only a handful of breast cancer fusion genes had been reported to be alternatively spliced, by us
[19] and in three additional studies
[16]–
[18]. Furthermore, here we observed transcript-level retention of intronic sequences in the gene fusions (e.g.
MED1-STXBP4,
STX16-RAE1), adding yet another level of complexity to the fusion gene structure. As most fusion break points occur in introns, the transcriptional machinery is forced to switch to another exon or alternatively to acquire a new acceptor splice site in the intron where the breakage and fusion happen, in order to produce an
in-frame transcript variant. Indeed, intron retention in some of the transcript variants would indicate that this does not always occur. Whether all the fusion transcript variants produce
in-frame protein products with potentially distinct functional domains remains to be elucidated.
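Whether a given transcript variant preserves the reading frame of the 3′ partner can be checked with simple modular arithmetic. The sketch below is a simplified illustration, not a method used in this study. The function and parameter names are hypothetical, the numbers are made up, and stop codons within any retained intronic sequence are not examined.

```python
def is_in_frame(five_prime_cds_len, three_prime_cds_offset, inserted_len=0):
    """Check whether the 3' partner is read in its native frame.

    five_prime_cds_len: coding bases contributed by the 5' partner, counted
        from its start codon up to the fusion junction.
    three_prime_cds_offset: 0-based position, within the native coding
        sequence of the 3' partner, of the first retained base.
    inserted_len: bases lying between the two partners in the transcript,
        for example retained intronic sequence.
    """
    fused_position = five_prime_cds_len + inserted_len
    return fused_position % 3 == three_prime_cds_offset % 3


# Hypothetical variants of one fusion, with made-up lengths.
print(is_in_frame(300, 150))                   # junction on codon boundaries
print(is_in_frame(300, 150, inserted_len=100)) # retained sequence shifts the frame
print(is_in_frame(300, 150, inserted_len=99))  # length divisible by three
```

A retained intronic stretch therefore keeps the frame only when its length is a multiple of three, and even then it may introduce a premature stop codon, which this sketch does not test for.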
In conclusion, in recent years the rapid progress in next generation sequencing technologies has led to the concomitant development of bioinformatic approaches for mining the raw sequencing data. We and others have exploited RNA-seq for the discovery of fusion genes
[13]–
[15],
[18],
[19],
[34],
[35]. Here, we demonstrated the need for review and development of bioinformatic fusion gene pipelines and, by doing so, discovered and experimentally validated several breast cancer fusion genes. This emphasizes the importance of continuous re-evaluation of the bioinformatic methods used to predict fusion genes. Furthermore, our data revealed that many of the fusion genes are expressed as several transcript isoforms, highlighting a previously unanticipated level of complexity in fusion gene structure. Even if the majority of fusion genes discovered in solid tumors are present at very low frequency or are private events, they may still contribute to the etiology and progression of the individual tumors. The roles of the individual fusion isoforms in these processes remain to be determined.