The first step in analysis of an RNA-seq data set is to align (map) the reads to the genome, which is complicated by the presence of introns. Because introns can be very long, particularly in mammalian genomes, the alignment program must be capable of aligning a read in two or more pieces that can be widely separated on a chromosome. The size of RNA-seq data sets, numbering in the tens of millions or even hundreds of millions of reads, demands that spliced alignment programs also be very efficient. The TopHat program achieves efficiency primarily through the use of the Bowtie aligner [13
], an extremely fast and memory-efficient program for aligning unspliced reads to the genome. TopHat uses Bowtie to find all reads that align entirely within exons, and creates a set of partial exons from these alignments. It then creates hypothetical intron boundaries between the partial exons, and uses Bowtie to re-align the initially unmapped (IUM) reads and find those that define introns.
TopHat-Fusion implements several major changes to the original TopHat algorithm, all designed to enable discovery of fusion transcripts (Figure ). After identifying the set of IUM reads, it splits each read into multiple 25-bp pieces, with the final segment being 25 bp or longer; for example, an 80-bp read will be split into three segments of length 25, 25, and 30 (Figure ).
TopHat-Fusion pipeline. TopHat-Fusion consists of two main modules: (1) finding candidate fusions and aligning reads across them; and (2) filtering out false fusions using a series of post-processing routines.
Figure 3 Aligning a read that spans a fusion point. (a) An initially unmapped read of 75 bp is split into three segments of 25 bp, each of which is mapped separately. As shown here, the left (red) and right (blue) segments are mapped to two different chromosomes, (more ...)
The algorithm then uses Bowtie to map the 25-bp segments to the genome. For normal transcripts, the TopHat algorithm requires that segments must align in a pattern consistent with introns; that is, the segments may be separated by a user-defined maximum intron length, and they must align in the same orientation along the same chromosome. For fusion transcripts, TopHat-Fusion relaxes both these constraints, allowing it to detect fusions across chromosomes as well as fusions caused by inversions.
Following the mapping step, we filter out candidate fusion events involving multi-copy genes or other repetitive sequences, on the assumption that these sequences cause mapping artifacts. However, some multi-mapped reads (reads that align to multiple locations) might correspond to genuine fusions: for example, in Kinsella et al.
], the known fusion genes HOMEZ
were supported by 2 and 11 multi-mapped read pairs, respectively. Therefore, instead of eliminating all multi-mapped reads, we impose an upper bound M
(default M = 2) on the number of mappings per read. If a read or a pair of reads has M or fewer multi-mappings, then all mappings for that read are considered. Reads with > M mappings are discarded.
To further reduce the likelihood of false positives, we require that each read mapping across a fusion point have at least 13 bases matching on both sides of the fusion, with no more than two mismatches. We consider alignments to be fusion candidates when the two 'sides' of the event either (a) reside on different chromosomes or (b) reside on the same chromosome and are separated by at least 100,000 bp. The latter are the results of intra-chromosomal rearrangements or possibly read-through transcription events. We chose the 100,000-bp minimum distance as a compromise that allows TopHat-Fusion to detect intra-chromosomal rearrangements while excluding most but not all read-through transcripts. Intra-chromosomal fusions may also include inversions.
As shown in Figure , after splitting an IUM read into three segments, the first and last segments might be mapped to two different chromosomes. Once this pattern of alignment is detected, the algorithm uses the three segments from the IUM read to find the fusion point. After finding the precise location, the segments are re-aligned, moving inward from the left and right boundaries of the original DNA fragment. The resulting mappings are combined together to give full read alignments. For this re-mapping step, TopHat-Fusion extracts 22 bp immediately flanking each fusion point and concatenates them to create 44-bp 'spliced fusion contigs' (Figure ). It then creates a Bowtie index (using the bowtie-build program [13
]) from the spliced contigs. Using this index, it runs Bowtie to align all the segments of all IUM reads against the spliced fusion contigs. For a 25-bp segment to be mapped to a 44-bp contig, it has to span the fusion point by at least 3 bp. (For more details, see Additional files 10
Figure 4 Mapping against fusion points and selecting best read alignments. (a) Bowtie is used to align all segments from the initially unmapped (IUM) reads against spliced fusion contigs, shown in gray on the right. For example, the brown read on the top left (more ...)
After stitching together the segment mappings to produce full alignments, we collect those reads that have at least one alignment spanning the entire read. We then choose the best alignment for each read using a heuristic scoring function, defined below. We assign penalties for alignments that span introns (-2), indels (-4), or fusions (-4). For each potential fusion, we require that spanning reads have at least 13 bp aligned on both sides of the fusion point. (This requirement alone eliminates many false positives.) After applying the penalties, if a read has more than one alignment with the same minimum penalty score, then the read with the fewest mismatches is selected. For example, in Figure , IUM read 1 (in blue) is aligned to three different locations: (1) chromosome i with no gap, (2) chromosome j where it spans an intron, and (3) a fusion contig formed between chromosome m and chromosome n. Our scoring function prefers (1), followed by (2), and by (3). For IUM read 2 (Figure , in green), we have two alignments: (1) a fusion formed between chromosome i and chromosome j, and (2) an alignment to chromosome k with a small deletion. These two alignments both incur the same penalty, but we select (1) because it has fewer mismatches.
We imposed further filters for each data set: (1) in the breast cancer cell lines (BT474, SKBR3, KPL4, MCF7), we required two supporting pairs and the sum of spanning reads and supporting pairs to be at least 5; (2) in the VCaP paired-end reads, we required the sum of spanning reads and supporting pairs to be at least 10; (3) in the UHR paired-end reads, we required (i) three spanning reads and two supporting pairs or (ii) the sum of spanning reads and supporting pairs to be at least 10; and (4) in the UHR single-end reads, we required two spanning reads. These numbers were determined empirically using known fusions as a quality control. All candidates that fail to satisfy these filters were eliminated.
In order to remove false positive fusions caused by repeats, we extract the two 23-base sequences spanning each fusion point and then map them against the entire human genome. We convert the resulting alignments into a list of pairs (chromosome name, genomic coordinate - for example, chr14:374384). For each 23-mer adjacent to a fusion point, we test to determine if the other 23-mer occurs within 100,000 bp on the same chromosome. If so, then it is likely a repeat and we eliminate the fusion candidate. We further require that at least one side of a fusion contains an annotated gene (based on known genes from RefSeq), otherwise the fusion is filtered out. These steps alone reduced the number of fusion candidates in our experiments from 105 to just a few hundred.
As reported in Edgren et al.
], true fusion transcripts have reads mapping uniformly in a wide window across the fusion point, whereas false positive fusions are narrowly covered. Using this idea, TopHat-Fusion examines a 600-bp window around each fusion (300-bp each side), and rejects fusion candidates for which the reads fail to cover this window (Figure ). The final process is to sort fusions based on how well-distributed the reads are (Figure ). The scoring scheme prefers alignments that have no gaps (or small gaps) and uniform depth.
Figure 5 Supporting and contradicting evidence for fusion transcripts. (a) Given a fusion point and the chromosomes (gray) spanning it, single-end and paired-end reads (blue) support the fusion. Other reads (red) contradict the fusion by mapping entirely to either (more ...)
TopHat-Fusion's scoring scheme of read distributions. A scoring scheme of how well distributed reads are around a fusion point; these result scores are used to sort the list of candidate fusions. Variables are defined in the main text.
Even with strict parameters for the initial alignment, many of the segments will map to multiple locations, which can make it appear that a read spans two chromosomes. Thus the algorithm may find large numbers of false positives, primarily due to the presence of millions of repetitive sequences in the human genome. Even after filtering to choose the best alignment per read, the experiments reported here yielded initial sets of about 400,000 and 135,000 fusion gene candidates from the breast cancer (BT474, SKBR3, KPL4, MCF7) and prostate cancer (VCaP) cell lines, respectively. The additional filtering steps eliminated the vast majority of these false positives, reducing the output to 76 and 19 fusion candidates, respectively, all of which have strong supporting evidence (Tables and ).
The scoring function used to rank fusion candidates uses the number of paired reads in which the reads map on either side of the fusion point in a consistent orientation (Figure ) as well as the number of reads in conflict with the fusion point. Conflicting reads align entirely to either of the two chromosomes and span the point at which the chromosome break should occur (Figure ).
The overall fusion score is computed as:
where lcount is the number of bases covered in a 300-bp window on the left (Figure ), lavg is the average read coverage on the left, max_avg is 300, lgap is the length of any gap on the left, rate is the ratio between the number of supporting mate pairs and the number of contradicting reads, |lavg - ravg| is a penalty for expression differences on either side of the fusion, and dist is the sum of distances between each end of a pair and a fusion. (For single-end reads, the rate uses spanning reads rather than mate pairs.) The variance in coverage lder is:
where lwindow is the size of the left window (300 bp).
TopHat-Fusion outputs alignments of singleton reads and paired-end reads mapped across fusion points in SAM format [28
], enabling further downstream analyses [29
], such as transcript assembly and differential gene expression. The parameters in the filtering steps can be changed as needed for a particular data set.