FusionFinder analyses FASTQ read data (of at least 50 nucleotides) to identify gene fusion candidates. This is achieved by performing the following integrated analysis and filtering steps, which are outlined in Box S1 and . By default, all filters described are enabled but some can be disabled with command line options.
Step 1. Alignment of Full Length Reads Against a Normal Coding Reference Transcriptome
The first step of the process (see Box S1) is to align the full length reads (e.g. 100 nucleotide reads in the case of Illumina HiSeq data) against a reference transcriptome containing only coding transcripts to identify those reads that fail to match to a normal dataset. We use Bowtie 
, to align the read data against this reference transcriptome and produce Sequence Alignment/Map (SAM) format 
output, comprising one row per read where each read has either a single transcript hit or fails to match anything in the reference. The latter could be due to one of three main reasons: (i) the normal reference transcriptome is not comprehensive enough and does not contain the transcript to which this read should match (i.e. a completely novel but genuine non-fusion transcript or an expressed non-coding transcript); (ii) the read is of poor quality and the alignment software could not align it to any transcripts in the reference; (iii) the read overlaps the exon-exon junction of a fusion transcript, the sequence for which is understandably not in the reference transcriptome. It is these latter sequences that are the target for FusionFinder.
A reference coding transcriptome that is compatible with our analysis pipeline, can be obtained from our website 
. The references comprise all coding transcripts of all annotated genes within recent versions of Ensembl 
Step 2. Creation of Pseudo Paired-end Reads
The next part of the process is to split each read with no matching hits from Step 1 into two smaller sections, so we can attempt to find sequences from different genes that match to each section (see Box S1 and ). From all full length reads having no hits in the reference transcriptome (i.e. reads that may potentially match a fusion gene, see ), a pair of FASTQ “pseudo PE reads” are derived with each pair comprising the first n bases and the last n bases of the sequence of the full length read, where n represents a proportion (0.4 by default but no greater than 0.5) of the length of the full length read (). It is important to note that these pseudo PE reads retain the read ID of the parent full length read and are simply appended with ‘/1’ or ‘/2’ to distinguish each member of the pair (similar to PE reads), allowing them to be later reunited. In addition, a line graph of the overall quality of all full length reads is produced in this step.
Step 3. Alignment of Pseudo PE Reads against the Coding Reference Transcriptome
The next step (see Box S1) is to align the pseudo PE reads against the reference transcriptome to establish which, if any, transcripts they align to. As in Step 1, we use Bowtie to align the pseudo PE reads independently against the coding reference transcriptome (). In Bowtie’s default mode, hits are determined based on 100% identity between a read and a transcript in the normal reference transcriptome over the first 28 bases of a read, while allowing for a 2 base mis-match. This acts as a quality control step for the data, filtering out poor quality reads.
Step 4. Analysis and False-positive Filtering of the Pseudo PE Read Results
The next steps of the process perform several stages of analysis and filtering to narrow down the most likely candidate fusion transcripts using the available evidence (see Box S1). Initially each of the pseudo read pairs are reunited (based on their common ID) and examined for which, if any, transcripts they hit. The first two filters (Steps 4A & 4B) then discard data where the read pairs both match the same gene or where either read of the pair does not match anything in the reference transcriptome. If the read pairs hit different genes, which we’ve termed a G1:G2 (i.e. Gene 1:Gene 2) pair, the read pairs are stored in order to build up a body of evidence for the existence of this G1:G2 pair (example provided in ). A G1:G2 pair can be regarded as a fusion candidate. The third and fourth filters (Steps 4C & 4D) remove false positive G1:G2 pairs where any read pairs hit either paralogous genes, or antisense transcripts that the alignment algorithm is unable to distinguish. These filters can be disabled using command line options if required. In the next filter (Step 4E) the read pairs are mapped back to the genome based on the coordinates where they aligned to G1 and G2 respectively. By default, Bowtie reports a single hit for a single read to a single transcript. Since transcript coordinates cannot be directly compared (due to differential exon usage between transcripts) all transcript/read alignment positions are transposed to genomic coordinates, which are then comparable. Using the canonical boundaries of each G1 or G2 exon, an assessment is made of whether the mapped positions are realistic given the size of the insert that should exist between them if the two exons were indeed fused in a transcript. For example two aligned 30mer pairs derived from a parent full length 76mer should have 16 bases between them when their respective mapped exon positions are assessed. Those read pairs mapping outside these constraints are filtered. The implementation of this last filter is responsible for eliminating many of the false positives including those arising from the existence of potential chimeric fusion artefacts in the RNA-Seq data. Finally (Step 4F) all pseudo PE reads evidencing all G1:G2 fusions are again realigned using Bowtie, firstly to the coding transcriptome reference and secondly to a reference containing only non-coding transcripts. In these alignment steps Bowtie is configured to allow all possible matches in each reference. G1:G2 pairs are filtered if their pseudo PE reads separately map to transcripts of the same gene in either reference. This step removes false positives arising as the result of genes sharing common exonic sequences. For example, the long intergenic non-coding (Linc) RNA SUZ12P is comprised of exons also found in both LPHN1 and SUZ12. Similarly, many unrelated coding transcripts (some novel or un-annotated) contain common exons (e.g. CCDC144A and USP32). Without a multi-mapping read filter many of these examples would be incorrectly reported as being transcript fusions. The multi-mapping step is performed at this stage within FusionFinder (i.e. post-filtering, as opposed to during the initial alignments in Step 1) since the required computation time is significantly reduced with the smaller number of candidate reads.
Step 5. Block Filtering and Identification of Fused Exons and Isoforms from Candidate Fusion Transcripts
This stage combines the remaining read evidence for each G1:G2 pair from Step 4
to identify the exons involved in the fusion (see Box S1). Firstly (Step 5A) the genomic coordinates of each mapped read are combined to construct “alignment blocks” comprising multiple overlapping reads that map to the same area on G1 or G2 (see also ). Each combination of blocks on G1 and G2 are then searched for overlapping repeat elements as indicated by RepeatMasker 
. If the block on G1 overlaps the same repeat element class as the block on G2, the block combination is filtered out, as such reads are likely to represent false positives. This filter (Step 5B) is optional and can be disabled if required.
The genomic location of the extremities of “alignment blocks” are then used to retrieve the Ensembl exon closest to this location (Step 5C). As depicted in , these discrete blocks represent the exons involved in the fusion. A single block on each of G1 and G2 indicates a single isoform for this fusion. Multiple blocks indicate multiple exon involvement and the existence of different fusion isoforms consisting of different combinations of blocks from G1 and G2.
Next, the number of pseudo PE reads providing evidence for each G1:G2 pair following filtering is determined (Step 5D). Here the user can opt to filter G1:G2 pairs that are not evidenced by at least a minimum number of pseudo PE reads. Restricting results to higher numbers of reads reduces the likelihood of false positives, while the acceptance of smaller read numbers will capture those fusion transcripts expressed at lower levels.
Finally (Step 5E), categories are assigned to each G1:G2 pair based on the given evidence as follows:
- Intrachromosomal - Genes originate from the same chromosome
- Interchromosomal - Genes originate from different chromosomes
- Potential Readthrough - Genes on the same chromosome and strand and <=20 kb apart
- Inversion - Genes on the same chromosome but different strands
Four main output files are produced. The first is a summary file, which contains a ranked list of fusion candidates based on their evidence strength (total number of sequence reads). The file provides the Ensembl and HUGO (Human Genome Organization) Gene Nomenclature Committee (HGNC) common name identifiers for G1 and G2, the number of blocks on each gene, an indication of how many isoforms exist for each G1:G2 pair and the category of fusion indicated by the pair. The second file gives the full details for each isoform of G1 and G2 and includes the genomic coordinates of the alignment blocks on G1 and G2, and their respective corresponding Ensembl exon IDs. For each isoform of a candidate fusion, the remaining two files provide (i) the sequences of the pseudo PE reads and the corresponding full length parent read and (ii) a forward three-frame translation of the fused nucleotide sequence of the two exons implicated in the candidate fusion. While FusionFinder is running, the software outputs all filtered read data to a separate file that contains a flag denoting the stage at which each read was filtered. In addition, statistics are produced detailing the raw numbers of reads that have been filtered at various stages of the algorithm and those remaining that provide the evidence for the fusion candidates contained in the summary file.
Further Investigation of Fusion Candidates
Using the common names of a G1:G2 pair of interest one can use FusionFinder to generate alignments to assist in the experimental confirmation of the candidate by RT-PCR. These alignments detail a) where the pseudo PE reads align to G1 and G2 and b) the exact location of the transcript breakpoint on the aligned parent reads (an example of this alignment is given in ).
Identification of the transcript breakpoint in each PRIM1:NACA isoform.
FusionFinder has been tested on various versions of Perl from 5.10 onwards and makes extensive use of the BioPerl 
and Ensembl API 
libraries, which are required to be installed locally with the software. Several other Perl libraries are also required (standard libraries available in the Comprehensive Perl Archive Network) that are detailed further in the software manual available from the project website 
. We have comprehensively tested FusionFinder on 64-bit Linux, but it can be run on both Windows and MacOS platforms provided Perl and the aforementioned dependencies are installed. FusionFinder requires access to an Ensembl database (ideally a locally installed) to perform some sections of the identification process.
The commands to run FusionFinder are described in the online user manual available from the project website 
. The source code for FusionFinder is made freely available from our website under the GNU General Public License (GPL).
Application of FusionFinder to a Published RNA-Seq Dataset
We applied FusionFinder to data arising from work by Levin and colleagues 
who performed a targeted RNA-Seq analysis of the K562 CML cell line. They generated sequencing data both pre- and post-enrichment for 467 cancer-related genes. We used the dataset representing the sequencing produced post-enrichment, which consisted of 14 million 76mer SE reads in FASTQ format. We refer to this dataset hereon as “the Levin dataset”. We generated 30mer pseudo PE reads and searched for candidates that were evidenced by ≥4 pseudo PE reads. We used a coding transcriptome reference containing only those Ensembl transcripts with a known translation and source data from a local installation of Ensembl (release 62 - April 2011). All read alignments were performed with Bowtie (version 0.12.7), allowing for a 2 base mismatch.
Our analysis produced a final list of 9 fusion candidates. gives the details of these candidates as presented in the FusionFinder summary file. shows the details found in the FusionFinder isoforms file for each of the fusion candidates from that were previously reported by Levin et al.
Summary file showing the 9 candidates from the FusionFinder analysis of the Levin dataset.
FusionFinder isoforms file for the six common fusion candidates reported by both FusionFinder and Levin et al.
Replication of the Levin Results
Using a similar (but not fully automated) technique to that presented here, Levin and colleagues confirmed the findings of Maher et al
, who also analysed the K562 cell line using a PE sequencing approach and observed the fusions BCR:ABL1
, fusions 1 and 2 in . In addition to these, when analysing the enriched dataset, Levin et al
. reported the previously unobserved fusion transcripts listed at 3, 4, 5 and 7 in . Overall they reported four isoforms of the NUP214:XKR3
fusion (involving specific combinations of exons 29 and 27 of NUP214
and exons 2, 3 and 4 of XKR3)
and four isoforms of the RCC1:PICALM
Within our candidates, we observed all six fusion events reported by Levin and colleagues (highlighted in ). These included the three isoforms of NUP214:XKR3
involving exon 29 of NUP214
and all four isoforms of RCC1:PICALM
. The fourth isoform of NUP214:XKR3
involving exon 27 of NUP214
was found in a separate analysis using a value of 0.2 when generating the 3' pseudo PE reads (data not shown). Importantly, FusionFinder also reported that due to their genomic proximity, candidates listed as 3, 5 and 7 in are potential read-through transcripts as described elsewhere 
In addition to those fusions found by Levin et al., FusionFinder generated evidence for three other gene fusions expressed in the K562 cell line (, non-shaded) as well as a second isoform of the PRIM1:NACA fusion (see ). These new candidates are ACCS:EXT2, C3orf10:VHL and CEP170:RAD51L1 the first two each having multiple associated isoforms ().
shows the alignment for both identified isoforms of PRIM1:NACA (i.e. the original isoform identified by Levin et al., and the novel isoform identified by FusionFinder) and an in-frame translation for the region 30 bases upstream and downstream of the transcript breakpoint. In wild type PRIM1, a stop codon lies within the exon downstream of the implicated transcript breakpoint. However, in both fusion isoforms an open reading frame is retained through the fused NACA exon, therefore generating a sequence coding for a novel fusion peptide.
Experimental Confirmation of Novel Fusions
Using RT-PCR and Sanger sequencing we experimentally confirmed the existence of two of the three novel fusions (8 and 9 in ) as well as the novel isoform of PRIM1:NACA using RNA extracted from the commercially available K562 cell line (). In addition we confirmed that the sequence at the transcript breakpoints of the previously identified fusions BCR:ABL1 and SLC29A1:HSP90AB1 were correctly predicted by FusionFinder. At least one predicted isoform of each confirmed novel fusion was detectable in our particular K562 cell line. Failure to detect every isoform of each predicted gene fusion is attributable to fact that we did not employ the targeted-enrichment protocol used by Levin et al prior to experimental confirmation, which would dramatically increase the representation of these transcripts within the RNA pool.
RT-PCR validation of the fusion candidates.
Comparison to Existing Software on Real and Simulated Data
Recently published software for the analysis of RNA-Seq data for gene fusions include FusionMap 
, FusionSeq 
, FusionHunter 
, deFuse 
and Tophat-Fusion 
. Of these, both FusionMap and Tophat-Fusion can process SE read data. FusionMap uses a similar strategy to FusionFinder by splitting reads into smaller sections and finding fusion candidates where sections align to different genes on an annotated genomic reference prior to filtering. Tophat-Fusion uses to a two stage process of firstly aligning reads with a modified version of the spliced alignment software Tophat 
to a genomic reference before secondly performing a post processing step to overlay annotation and perform filtering.
To compare the performance of FusionFinder with FusionMap and Tophat-Fusion we have run a full analysis with all three tools using the Levin dataset. For this dataset, comprising 14 million 76mer reads, a complete analysis with FusionFinder took approximately 3.1 hours on a single 2.4 GHz core of a multi-core AMD server with a peak memory usage of 1.8GB and using data from a local Ensembl (release 62) mirror. FusionMap (version 2012-03-03) with comparable parameters (α
2) and preloaded reference data running the same analysis using Mono (version 2.10.8) under 64-bit linux, again on a single 2.4 GHz core, took 2.1 hours to complete and at its peak consumed 7.2 GB memory. Tophat-Fusion with comparable parameters (alignment phase: –fusion-min-dist 10000 and post-processing: –num-fusion-reads 4–num-fusion-pairs 0–num-fusion-both 4) and reference data on the same platform took 15 hours to complete and consumed 9.6 GB memory at its peak during the post-processing step. Although the run time for FusionFinder is slightly slower than FusionMap on a single core both are considerably faster than Tophat-Fusion. In addition FusionFinder consumes far less memory under a Linux operating system than both FusionMap and Tophat-Fusion. It should be noted that both FusionMap and Tophat-Fusion can be run on multiple CPU cores and with the same dataset and parameters but with 5 CPU cores, the analysis took 0.8 hours and 4.5 hours respectively. Similarly Bowtie can be run on multiple CPU cores and using 5 cores for the alignment steps in the FusionFinder protocol improves the total analysis time to 2.4 hours. We are currently working on a fully multithreaded version of FusionFinder. These data are summarised in and a detailed breakdown of resources used by FusionFinder at each step of the protocol can be found in Table S4
Performance comparison of FusionFinder, FusionMap and Tophat-Fusion in an analysis of the Levin dataset.
In line with previous reports 
our analysis of the Levin dataset with FusionMap confirmed the findings of Levin et al
. and reported an additional 57 candidates in this dataset. In comparison, Tophat-Fusion reported 12 candidate fusions but did not successfully identify all those reported by Levin et al
or the additional candidates reported by FusionFinder, even when we allowed for the detection of the read-through transcripts we observed. Tophat-Fusion only reported two of the three isoforms of NUP214:XKR3
reported by FusionFinder and did not report CEP170:RAD51L1
but did report an additional isoform of BCR:ABL1
which was not reported by FusionFinder or FusionMap. The results of these analyses are presented in Table S1
a and b.
To further compare the performance of each software we generated a simulated dataset of approximately 13.5 million SE 75 nucleotide reads (see Methods
). The dataset contained normal background reads and “fusion reads” representing the transcript breakpoint of 55 fusion transcripts generated at random (see Table S2a
). The dataset simulated fusions designed to represent both high and low levels of expression with read numbers per fusion transcript ranging from 1 – 295. We ran FusionFinder, FusionMap and Tophat-Fusion against this dataset. FusionFinder was run with default parameters, generating 30mer pseudo PE reads. FusionMap was run with comparable parameters (α
1), though we note that in order for FusionMap to detect any simulated fusions it was necessary to alter the MinimalRescuedReadNumber parameter to 0. Tophat-Fusion was also run using comparable parameters for the post-processing step of the protocol (–num-fusion-reads 1–num-fusion-pairs 0–num-fusion-both 1). Sensitivity and Positive Predictive Values (PPV) were then calculated for the simulated dataset to assess the ability of each software to accurately detect simulated fusion genes (see Methods
). summarises the overall results of this analysis whilst a plot of these data is shown in . The raw results from each software can be found in Tables S2
b, c and d. The performance measures were calculated on subgroups of fusion genes where subgroups were selected based on the number of reads (i
) evidencing the fusion genes predicted by each software. For example, the point marked at 100 on the x axis of shows performance measures for all predicted fusion genes that were found to be evidenced by 100 or more reads.
Summary of the overall comparative performance of FusionFinder, FusionMap and Tophat-Fusion on a simulated dataset.
Comparison of sensitivity and PPV for FusionFinder, FusionMap and Tophat-Fusion.
Among the fusion gene predictions made by each software are what we have termed “synonymous fusions”. These are where at least one of the identified gene partners has been inaccurately predicted because it shares high sequence similarity with the expected partner gene, possibly because it is a member of the same gene family (for example, an S100A3:SULT1A4 fusion may be detected as an S100A3:SULT1A3 fusion). Although the informed researcher would frequently be able to distinguish these fusions by visual comparison with other candidates in the output files, in our assessment of sensitivity and PPV such synonymous fusions were considered to be false positives to provide the most stringent assessment of each software.
It can be seen from and that FusionFinder shows greater sensitivity and generally greater PPV than FusionMap in the detection of our simulated fusion genes. also shows that FusionFinder compares favourably against Tophat-Fusion with an overall greater sensitivity and a comparable PPV. With regard to the overall sensitivity in , FusionFinder detected 87% of the 55 simulated fusion genes, FusionMap reported 58% and Tophat-Fusion reported 64%. Importantly, FusionFinder and Tophat-Fusion only detected 15 and 5 false positives respectively (some of which were synonymous fusions - see Table S2
b, c and d) giving them a comparable PPV. In contrast FusionMap reported 582 false positives, which represents 95% of the returned results respectively (). This has a considerable effect on the PPV in at low read levels with FusionMap remaining significantly lower than both FusionFinder and Tophat-Fusion. Consequently, in the case of FusionMap the user is returned a large list of potential fusion genes consisting primarily of false positives. In contrast, the candidates reported by FusionFinder will be more robust and more likely to be confirmed experimentally.
For two of the fifty-five simulated fusions, the partner genes contained repeats of the same class at the transcript breakpoint. These gene fusions were detected by FusionFinder but due to the RepeatMask filter, were subsequently filtered. However, both of these appeared in the FusionMap results, suggesting that FusionMap does not take the presence of repeats in to account. This could explain the occurrence of so many false positives in FusionMap’s results. Both of these simulated fusions were also filtered by Tophat-Fusion.
It should be noted that when both FusionMap and Tophat-Fusion detected a simulated fusion gene they consistently detected all simulated fusion reads, however although FusionFinder detected more simulated fusion genes it did not consistently detect all fusion reads. This is because, FusionFinder analyses data from an alignment using Bowtie’s default parameters which does not provide results for multi-mapping reads. This means that given two genes from the same family, sharing high sequence identity, a read has an equal chance of hitting either of these genes. As a result, the expected fusion reads are distributed between all synonymous fusions. We are working on a method to analyse alignments of multi-mapping reads, which will significantly increase the numbers of reads detected.
Application of FusionFinder to a PE Dataset
While FusionFinder is more suited to the analysis of SE data, it can also be used to analyse PE data, by considering each PE reads file as a separate SE reads file. We applied FusionFinder to PE RNA-Seq data from the MCF-7 breast cancer cell line as previously described 
and also analysed by the authors of Tophat-Fusion 
. This dataset comprises approximately 17 million 50 bp PE reads. shows the results of an analysis of this data with FusionFinder using default parameters but creating 25 bp pseudo PE reads and searching for candidates with at least two reads evidencing them. We found seven fusions, six of which have been previously reported 
. Only three of these fusions were found by Tophat-Fusion 
Summary file showing the 7 candidates from the FusionFinder analysis of the MCF-7 Breast Cancer cell line paired-end dataset.