Prostate cancer selection and RNA extraction
All the prostate cancer samples were collected under an IRB (Institutional Review Board) approved protocol. Hematoxylin and eosin (H&E) slides were prepared from frozen tissue blocks and evaluated for cancer extent and tumor grade by the study pathologist (MAR). To ensure high purity of cancer cells and minimize benign tissue, tumor isolation was performed by first selecting for high-density cancer foci (< 10% stromal or other non-tumor tissue contamination) and then taking 1.5 mm biopsy cores from the frozen tissue block for RNA extraction using TRIzol Reagent (Invitrogen, Carlsbad, CA, USA). The RNA extract was then subjected to DNase treatment using a DNA-free™ Kit (Applied Biosystems/Ambion, Austin, TX, USA). The quality of RNA was assessed using the RNA 6000 Nano Kit on a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). Up to 10 μg of RNA with RIN (RNA integrity number) ≥7 was determined suitable for sample preparation.
Sample preparation
The samples were prepared in accordance with the Illumina RNA sample preparation protocol (Part # 1004898 Rev. A September 2008). Briefly, mRNAs were fragmented at elevated temperature using divalent cations and transcribed into cDNA, thereby generating a library of cDNA fragments. RNA-Seq adapters were then ligated to the blunt ends of the cDNA fragments. The library of cDNA fragments subsequently underwent a size-selection step in which cDNAs were first electrophoresed through a 2.5% agarose gel in TAE buffer. Then, the desired fragment size products (200 or 300 nt) were retrieved from the gel and subjected to PCR amplification using universal primer sites present at the end of the ligated adapters. The library was then subjected to quality control steps such as verification of fragment size and concentration measurements using the DNA 1000 Kit (Agilent Technologies) on an Agilent 2100 Bioanalyzer.
All samples were sequenced using one lane of an Illumina Genome Analyzer II (GAII) flowcell, except for GM12878 and 99_T, which were sequenced using two lanes. Since the experiments were performed over several months as Illumina introduced advances to the GAII platform, the total number of reads and the read length vary (Table ). However, all samples were prepared following the same protocol.
Validation of TMPRSS2-ERG fusion isoforms with PCR
Aliquots from the same RNA stock were used for both RNA-Seq and PCR validation by conventional reverse-transcription PCR. RNA was reverse transcribed using a High Capacity cDNA Reverse Transcription Kit (Applied Biosystems, Foster City, CA, USA). The
TMPRSS2-ERG PCR was performed using Platinum
Taq DNA Polymerase (Invitrogen) with 1 mM MgCl
2, 0.1 μM of each primer (forward,
TMPRSS2 exon 1 - TAGGCGCGAGCTAAGCAGGAG; reverse,
ERG exon 5 - GTAGGCACACTCAAACAACGACTGG; as published by Tomlins
et al. [
23]) and 50 ng cDNA at an annealing temperature (Ta) of 63°C for 35 cycles and the PCR products were separated on a 2.5% agarose gel. For
TMPRSS2-ERG isoform IV, the PCR was performed, using a reverse primer specifically designed for the detection of isoform IV (TGCATTCATCAGGAGAGTTCCTGC), under the same conditions but with Ta 55°C and 40 cycles. The obtained products were isolated from the gel using the MinElute™ Gel Extraction Kit (Qiagen, Valencia, CA, USA) and subsequently sent for Sanger sequencing at the Core facility of Cornell University (Ithaca, NY, USA).
Mapping
We employed ELAND to map the PE reads against the Human Reference Genome (March 2006 Assembly - hg18). We allowed for up to two mismatches of the alignment and selected reads that passed the quality filter from ELAND. In case of pairs mapped to the same chromosome, we selected reads aligned to opposite strands. We also employed bowtie to map the reads to the human genome sequence [
49]. Since bowtie does not allow PE reads to be mapped on different chromosomes, we adopted the following strategy: the two ends were mapped separately to the genome and the best alignment was selected among the top ten candidates in the case of mapping to multiple locations. Two mismatches were allowed for bowtie too. Then, the two ends were paired together, if both ends were aligned. Moreover, for comparison purposes, we mapped the reads to a splice junction and a ribosomal library in addition to the genome (see Additional file
1 for details).
Filtration cascade
Large scale sequence similarity filter
Two paralogous genes resulting as fusion candidates are discarded because their homology can potentially cause a misalignment. We use TreeFam to identify and remove these candidates [
36,
37]. TreeFam is a database of phylogenetic trees of animal genes with the aim of providing a curated list of orthologs and paralogs.
Small scale sequence similarity filter
The small scale sequence similarity filter seeks broad similarities between two transcripts. However, it may be possible that there is high similarity between small regions within the two genes where the reads actually map. Hence, to search for similar sequences within the two candidate genes, we employ a two-step strategy. We first perform a fast search of the reads aligned to one gene against the full transcriptome, represented by all composite models, using bowtie [
49]. If more than a user-defined threshold (default of 1%) of the reads map to one gene 'hit', the partner gene, the candidate is discarded. This approach removes candidates where the reads have high similarity, since bowtie allows up to three mismatches only. For those candidates not filtered out by this approach, a second, more refined comparison is performed. We align the reads mapped to one gene to its partner's sequence by using BLAT [
50]. If the fraction of reads that have similarities to the corresponding partner is higher than a user-defined threshold (default of 1%), then the pair is discarded. In order to call a hit - that is, similarity to the partner gene - we require that at least 75% of the read has similarity to the corresponding gene.
Repetitive regions filter
Some reads may be aligned to repetitive regions in the genome due to the low sequence complexity of those regions, which may result in artificial fusion candidates. We thus remove reads mapped to repetitive regions using RepeatMasker to identify these regions [
51,
52].
Abnormal insert size filter
The PE RNA-Seq experimental protocol requires sequencing of the ends of cDNA molecules of a determined length: the fragment size. If we mapped those sequenced reads to the transcriptome (which we do not know exactly), we would obtain an insert-size distribution - the distance between the two reads - similar to the fragment size. However, since the reads are aligned to the reference genome, the insert-size distribution can be rather skewed (Figure S1b in Additional file
1). Using a splice junction library does not help in this context. Besides having potential biases given by the incomplete knowledge of the junctions, it cannot determine to which isoform the two ends belong. The composite model allows use of the concept of the insert-size also for RNA-Seq data (Figure ). The composite model is the union of all the exons from all known transcripts of a gene. This ensures that all exonic nucleotides are considered. The insert-size distribution computed using the composite model as reference should thus be comparable to the nominal fragment size selected during sample preparation (if there is more than one isoform, the insert size distribution computed with the composite model will be slightly shifted towards higher values because of the inclusion of all possible exonic regions in the composite model).
We then extend this concept to distinguish potentially real chimeric transcripts from artifactual ones. We generate a minimal fusion transcript fragment by using all the PE reads bridging two different genes (Figure ). The rationale is that the insert size distribution computed on this minimal fusion transcript fragment of a real chimeric transcript is similar to the insert size distribution of normal transcripts. This is because we expect inter-transcript PE reads to connect the regions around the junction only. Conversely, a fusion transcript generated by random chimeric pairing would have a rather long minimal fusion transcript because paired reads would randomly join different regions of the two genes. This, in turn, would yield a much higher insert-size distribution compared to that of the real case; that is, it would be abnormal.
Specifically, for each of the candidate chimeras, the insert-size distribution is computed using all paired reads mapping to the composite model of that gene: that is, the intra-transcript insert-size distribution. For this purpose, only reads that are fully contained within exons are considered. If a candidate has only intronic reads, this filter is not applied. Similarly, the 'anomalous' reads - reads that bridge two different genes - are first used to create a minimal fusion transcript (Figure ). Note that from the PE data we cannot determine the full fusion transcript, but only the region nearby the actual junction of the two genes, that is, the minimal fusion transcript fragment. Then, the insert-size distribution of the minimal fusion transcript is computed (inter-transcript insert-size distribution) and compared with the intra-transcript insert-size distribution. If the median of the inter-transcript insert-size distribution is much higher than the median of the intra-transcript insert-size distribution, it is likely due to misalignments. A
P-value is computed by randomly sampling the intra-transcript insert-size distribution. Candidate fusion transcripts having a
P-value lower than a user-defined cutoff are discarded as artifacts. Note that the candidates that are 'outliers' with respect to the intra-transcript insert-size distribution are discarded as artifacts, whereas, in the DNA context, they are kept as potential insertions or deletions (Figure S1a in Additional file
1).
For this analysis, we used a P-value cutoff of 0.01 (corresponding to approximately 2.5 standard deviations from the transcriptome norm) for all samples, except for 2621_D, for which we used a cutoff of 0.0001 because of the much tighter intra-transcript insert-size distribution given by the smaller fragment size compared to the other samples.
Ribosomal filter
The vast majority of transcripts in the cell are ribosomal RNA. Although the experimental protocol typically requires either selecting for non-ribosomal mRNA with polyA+ selection or depleting of ribosomal RNAs, this process is imperfect. This translates into a high abundance of ribosomal transcripts with a higher chance of generating random chimeras. If misalignment occurs too, this would result in artifactual candidates that appear to not involve ribosomal genes. Hence, this filter compares the reads of the candidates to the ribosomal genes sequence database using a more sensitive alignment tool such as BLAT [
50]. If the reads align to ribosomal genes, the candidate is removed.
Specifically, in order to identify reads that bear similarity to ribosomal genes but were mapped to another region, we require a read to have more than 75% similarity to a ribosomal gene to count it as a hit. If more than 10% of reads map to the ribosomal library, the candidate is discarded (Additional file
1). Note that this issue, although related, is independent of the mapping strategy. Indeed, even if we employ a ribosomal library during the alignment phase, there still may be reads that, due to misalignment, will map best to other regions of the genome.
Expression consistency filter
Highly expressed genes give rise to the same issue that occurs with ribosomal genes. This filter compares the expression signal (that is, number of reads) generated by the chimeric reads to the signal of the individual genes. The rationale is that, in the case of a real fusion transcript, the two genes would be expressed at the same or higher levels than the 'chimeric' signal, whereas, in the case of an artifactual candidate, the signal would be generated only from the chimeric reads and the signal of the two individual genes would be much lower.
In more detail, the expression signal of the fusion candidate is computed by counting the number or inter-transcript, that is, chimeric, reads mapped and normalizing by the length of the region covered by those reads. The expression of the individual genes is computed as the number of reads normalized by the length of the transcripts. If the chimeric reads have a higher signal than that of the individual genes, the candidate is discarded.
PCR filter
To avoid artifacts resulting from the PCR amplification step, we require the reads supporting a candidate fusion transcript to independently cover at least p nucleotides (default p = 5) in addition to the read size on both ends, otherwise the candidate is discarded. This ensures that several instances of the transcripts were expressed in the cells. In the case of sufficient coverage, it is also possible to compute the entropy to identify these cases and remove them.
Junction-sequence identifier module
The PE reads can identify the genes involved in a fusion transcript, but cannot directly determine the junction sequence because, typically, short read alignment tools do not allow gapped alignment for the single read. Hence, we developed this module to take advantage of the fast short read alignment tools and identify the sequence of the junction in an efficient way.
Let us assume we have some PE reads joining gene A with gene B, thus suggesting a fusion event between them. Those reads would connect regions around the junction. For each gene, we thus select the region that can include the junction sequence by first considering all exons that can be potentially involved in the junction as well as the intronic regions that are supported by chimeric PE reads. Those regions are extended considering the flanking 150 nucleotides. We then cover them with a set of 'tiles' that are spaced 1 nt apart and construct a fusion junction library by creating all pairwise junctions between these tiles. Since we do not know a priori what the specific form of the fusion transcript is, we create two libraries, one assuming gene A is upstream of gene B and the other assuming gene B is upstream of gene A (Figure ). This fusion junction library plays the same role as a canonical splice junction library: it enables the alignment of short reads, thus overcoming the need for a computationally expensive gapped alignment for reads bridging two exons or, as in this case, regions of different genes.
All the reads, including the non-mapped ones, are then mapped against this library. In this case we consider the two ends independently. The rationale is that the actual junction sequence will be described by a certain pair of tiles, and reads not previously mapped anywhere in the genome now can be aligned to this fusion junction. Moreover, reads that previously mapped with one or two mismatches to the reference genome may now map perfectly to the fusion junction and thus increase the evidence supporting the junction. The size of these tiles depends on the read size as well as the amount of overlap across the two joined tiles required by the user. For example, for reads that are 36 nt long and a required overlap of at least 10 nucleotides, each fusion junction element is 52 nt long, that is, each tile is 26 nt long. This ensures that every 36-nt read, if mapped to this junction element, will have at least 10 nucleotides mapped to the tile of each gene.
To select the true junction sequence, we determine which fusion junction obtains the highest support, that is, the junction with the highest number of reads aligned to. In addition, we also require the set of single-end reads to be uniformly distributed across the junction to provide further evidence. Provided there is enough coverage overall, we employ a Kolmogorov-Smirnov statistical test, otherwise we apply a simple heuristic by requiring that at least
n reads align to the junction with at least
m different starting positions on the junction sequence. The latter parameter ensures that no PCR artifacts affect the junction identification. Also, we search for similarity of the identified junction elsewhere in the genome using BLAT [
50], in order to eliminate potential spurious junctions.
From a computational viewpoint, let us assume that we have about 1,000 virtual tiles for each gene. By creating all pair-wise combinations of these virtual tiles for the two genes and considering both directions - gene A upstream of gene B and
vice versa - will result in 1,000 × 1,000 × 2 = 2 × 10
6 putative junctions. If we have approximately 30 candidate fusion transcripts, the putative fusion junction library will thus contain approximately 6 × 10
7 = 60 million elements. Using fast alignment tools, this analysis is feasible, although it requires large-scale computational resources. Indeed, we use bowtie to first create an index of the fusion library and then map the reads against it [
49]. To fully exploit the parallelization of a multi-node computing cluster, each fusion candidate is analyzed independently on different nodes. Moreover, the fusion junction library itself is also split across multiple nodes in order to optimize the generation of the indexes.
Sequencing depth and detection of fusion candidates
To assess the impact of sequencing depth on the detection of the fusion candidates, we randomly selected a fraction of mapped reads from sample 2621_D. Specifically, we extracted 10%, 25%, 50%, 75%, and 90% of all PE mapped reads (1.1 million, 3 million, 6 million, 9 million, and 10.8 million PE reads, respectively). The number of fusion candidates with more than five PE reads clearly correlates with sample size: 0, 1, 3, 4, and 7, respectively. The SLC45A3-ERG fusion was detected as the top candidate, starting with 3 million mapped PE reads, with a SPER of 4.7. The relatively low SPER for this candidate is related to the smaller fragment size that has been adopted for this sample (200 nucleotides compared to 300 to 330 nucleotides for the others). The smaller fragment size limits the number of PE reads that could span the junction. From this analysis, it appears that 3 million reads are sufficient for detecting this fusion in this context. However, this result is difficult to generalize. It might be true only for fusion transcripts that are expressed at a similar level to SLC45A3-ERG. We cannot exclude the presence of less abundant fusion transcripts that would have been uncovered by deeper sequencing.
Scoring the candidates
We may take into account different types of information to score the candidates. Potentially we could use the number of PE reads bridging the two genes, the number of reads supporting the main junction, and the 'shape' of the coverage as indicators of the reliability of the candidate. Practically, since it may be possible that the true junction is not detected because of lack of coverage, the more general quantities are based on the number of PE reads supporting the fusion candidate. Hence, every fusion transcript candidate is first scored using
SPER, the normalized number of supportive PE reads, the most intuitive quantitative measure (see Results - Scoring the candidates). One may argue that a 'local' score - a score that takes into account the expression of the genes involved in the fusion - might be a reasonable choice. We defined
LSPER (local
SPER) as the number of inter-transcript PE reads supporting the fusion divided by the average gene expression value computed as RPKM [
3]. However, in many cases, only one allele contributes to the fusion transcript. Hence, the expression of the fusion transcript (estimated by the number of inter-transcript reads because the structure of the whole fusion is unknown) may not correlate with the expression of the genes generating it and thus this may impair the correct ranking of the candidates (text and Figure S6 in Additional file
1). After computing
SPER for each candidate, we need to assign a 'confidence' to this number. We compare it with two expectations. The first one,
DASPER (the difference between the observed and analytically calculated expected
SPER), is based on the observation that if two ends were randomly joined, the probability that this occurs for gene A and gene B is proportional to the product of the probability that the two single-ends of the pair are mapped to gene A and gene B:
where P(A) and P(B) are the probabilities that a single-end is mapped to gene A and B, respectively. Note that this is a very conservative estimate because it does not take into account that single ends should also be within a certain distance, based on the fragment size, to be joined in a pair. Nevertheless, as a first approximation, the expected SPER can be estimated as the ratio of the number of single-end reads mapped to gene A and gene B and the total number of mapped single-end reads. For the i-th candidate, involving gene A and B, we have:
where <mAB> is the expected number of inter-transcript PE reads under the null hypothesis, and mA and mB are the number of single end reads mapped to gene A and B, respectively. By subtracting this number from the observed SPER, we can rank the fusion candidates according to DASPER score:
We chose to compute the difference between these two quantities compared to a more traditional ratio or log-ratio because it is more robust in cases of low coverage (that is, low number of reads) than computing a ratio. More accurate estimations of the expected SPER can certainly be devised for cases with low coverage, although they would likely require the specific characteristics of the sequencing platform and the mapping approach adopted to be taken into account, thus reducing the broader applicability of this method. Although DASPER can reliably rank the candidates within a sample, it may be possible that when comparing candidates from multiple samples DASPER may not properly account for different fragment sizes. Indeed, smaller fragment sizes decrease the likelihood of sequencing PE reads bridging two genes, resulting in lower SPER, and consequently, lower DASPER, affecting the comparison among samples. To address this issue, for each fusion transcript candidate i, we compute the ratio of its SPERi to the average SPER of all candidates of a sample, that is, RESPER:
where M is the total number of fusion transcript candidates for a sample. Since this quantity is independent of the fragment size, it is more suitable for comparisons across samples. Also, as long as the sequencing depth increases, RESPER is expected to increase for a real fusion transcript compared to an artifactual one (Figure ).
In the case of sufficient coverage, we can also integrate the information related to the junction-sequence identifier analysis, such as the number of single-end reads supporting a junction as well as how evenly the single-end reads cover it. Ideally, the entire fusion junction should be uniformly covered by the reads. If this does not occur, the chimeric transcript might have been generated during sample preparation and the PCR amplification step resulted in an over-representation of that transcript. However, definitive determination of uniform coverage requires great sequencing depth.
Computational complexity
One of the main issues to address is the computational complexity of processing RNA-Seq data. Computationally, the three modules have different requirements. The fusion transcript detection module depends on the total number of mapped reads. Once the alignment is performed, it takes about 15 minutes to run this module on 20 million mapped PE reads using one core of a dual 2 Intel® Xeon® CPU E5410 at 2.33 GHz (four cores each, for a total of eight cores), with 6 MB cache, 32 GB RAM, and a 156 GB local disk. The filtration cascade module takes about 15 to 30 minutes to run on the same architecture. The difference depends on the number of candidates initially identified. A more intensive effort is required for the junction-sequence identifier analysis, the main bottleneck being the indexing of all the virtual tiles. The time complexity also depends on the size of the region being tiled. The alignment of the reads after the indexing is much less computationally demanding. In fact, the time to complete a junction-sequence identifier analysis for a single candidate in both directions, AB and BA, ranges from about 90 to 180 minutes if run on a single machine. However, by splitting the fusion junction library in different files, it is possible to run the indexing step in parallel, thus substantially decreasing the time complexity. Indeed, by splitting the fusion junction library in files with 2 million elements, it is possible to complete the indexing and the mapping in about 15 minutes for both orientations.
Report of the analysis results
We also developed a set of tools to report the analysis results through a web interface and the UCSC Genome Browser (text and Figure S7 in Additional file
1) [
53]. All programs of FusionSeq take as input one of the standard formats we defined (Additional file
1), and additional tools convert them into files that can be interpreted by the UCSC Genome Browser, such as WIGGLE, BED or GFF. This integration is facilitated by the use of a web interface to interrogate the samples. The user selects the sample and the list of potential candidates is shown with the candidates sorted according to
DASPER (Figure S7a in Additional file
1). Information regarding the genes involved, such as gene symbols (including aliases), gene description and genomic coordinates are also reported (Figure S7b in Additional file
1). By clicking on the genomic coordinates the corresponding UCSC Genome Browser page is displayed. Also, each candidate has a detailed page reporting detailed information, including the junction-sequence identifier analysis results (Figure S7c in Additional file
1).
Although we extensively rely on the data format of the UCSC Genome Browser, it is not possible to show the results for inter-chromosomal fusions (that is, those between genes on different chromosomes) since it can display only one chromosome at a time. In order to address this issue, we developed SeqViz (Figure S8 in Additional file
1), an application that is based on Circos, an open source software that is particularly suited to the display of genomic information by representing the full genomes as a circle [
54]. An example of a Circos image can be found in Figure . Among the main features of Circos is the high flexibility in adding and showing many types of information: connection between the two ends of a PE read, gene annotation sets, expression values, and so on.
Software and data availability
FusionSeq is available for download at [
34]. Data sets used in this study are available via dbGaP [
55] (study accession [phs000311.v1.p1]).