Targeted alignment method
Based on 27,157 well-annotated RefSeq transcript alignments to the human genome, we identified 2,470,383 exon-exon junctions between same-strand transcripts that spanned a potential intron of 200,000 bp or less. We also identified 1,856,519 possible intragenic exon-exon junctions by taking all pairs of exons within a given transcript, regardless of distance. Each of these junctions was extended by 80 nt on each side and used as targets for the alignment of reads. An alignment therefore supports a given exon-exon junction when it crosses the midpoint by a certain overhang. Longer overhang requirements provide greater specificity, while shorter ones are more sensitive.
Using an overhang of 11 nt, intragenic splicing was supported by 2-4% of 33-nt reads and by 6-11% of 50-nt reads (Table ), showing that longer reads provided significantly more raw material for identifying splicing events. For TIC splicing, a threshold of 8 nt yielded a unique alignment to a TIC target in about 1 in 30,000 reads. At 11 nt, the frequency dropped to 1 per 40,000-120,000 reads. Overall, we found evidence for TIC splicing to be rare in RNA-Seq data.
Results of alignment to intragenic and TIC targets
To achieve sensitivity with such few reads, we started with the set of TIC alignments with an overhang of 8 nt, and used a clustering method to to obtain specificity. Clustering allowed a given TIC exon-exon junction to be supported by the entire collection of alignments in demonstrating a sufficient overhang. Before clustering, though, we filtered our set of TIC alignments to remove those that corresponded to intragenic alternate splicing events. These spurious TIC alignments arose because they spanned a particular splice combination across two RefSeq transcripts for the same gene that was not represented by any single RefSeq transcript. This filtering process eliminated 3702 (29%) out of 12,400 putative TIC alignments.
In the clustering process, we took the remaining 8698 TIC-aligning reads over all samples, grouped them according to their exon-exon junction, and created a multiple sequence alignment for the cluster, resulting in 608 clusters. To reduce the incidence of false positives due to poor alignments, we implemented a filtering method based on the consistency of matches and mismatches on both sides of the exon-exon junction, essentially requiring that at least one read have a match to the genome at all 11 bp on both sides of the junction. Our filtering criteria were designed to eliminate false alignments, but still accommodate sequencing errors, which can occur at a rate of 1% or more in short read data.
The filtering step eliminated 137 (23%) of the clusters to leave 471 TIC candidates supported by 3373 TIC alignments. However, the number of supporting reads was not evenly distributed over the different candidates. In particular, 2195 (65%) TIC alignments supported 12 candidates corresponding to various pairs of exons from the human leukocyte antigen (HLA) genes, namely, from HLA-B to HLA-C (7 intergenic splicing candidates), from HLA-DRB1 to HLA-DRB3 (1 candidate), and from HLA-G to HLA-A (4 candidates). Because these genes are highly polymorphic and share sequence similarity with one another (e.g., 92% sequence identity between HLA-B and HLA-C), such TIC candidates most likely represent misalignment due to sequence errors or polymorphisms rather than true TIC events. Another example of likely false positives were 15 TIC candidates supported by 41 TIC alignments involving pairings of the metallothionein family members MT1A, MT1B, MT1E, MT1F, MT1 H, MT1 M, MT1X, MT2A, and MT3.
To eliminate such cases of false positives due to homologous genes, we implemented another filtering step based on sequence similarity among the TIC splice and its component 5' and 3' genes. Among the TIC candidates eliminated were 26 TIC candidates involving pairs of various zinc finger proteins; 7 TIC candidates involving pairs of keratins KRT5, KRT6A, KRT8, KRT14, KRT17, KRT31, KRT32, KRT76, KRT77, KRT78, KRT81
, and KRT83
; and 4 TIC candidates involving pairs of kallikreins KLK2, KLK3, KLK9
, and KLK11
. Overall, our homology filtering step eliminated 32% of the TIC candidates to leave a final set of 339 TIC events supported by 822 alignments (Additional files 1
To assess the false discovery rate (FDR) of our analysis pipeline, we modified our set of artificial exon-exon junctions by removing 5 nt from both sides of the junction, and performed the same alignment and filtering steps against these modified junctions. This method was previously used in determining an FDR rate for alternative splicing predictions [20
]. This control experiment gave a total of 5 putative TIC candidates, yielding an estimated FDR rate of 5/339 = 1.5%.
Of the 822 surviving TIC-aligning reads, 459 (56%) came from the MAQC samples sequenced at 50-nt read lengths, 229 (28%) from the N1 and N2 samples sequenced at 50- and 75-nt read lengths, and the remaining 134 (16%) from the prostate samples sequenced at 33-nt read lengths. This skewed distribution is likely to be due largely to the utility of longer read lengths in detecting TICs, rather than the underlying frequency of TIC events in the samples.
Among the 339 TIC events, two-thirds (212) were supported by a single alignment and one-third (127) were supported by multiple alignments. The TIC events with the largest numbers of supporting reads were PMF1-BGLAP e4/5-e2/4 (55 reads), AZGP1-GJC3 e2/4-e2/2 (41 reads), BPTF-KPNA2 e9/30-e2/11 (23 reads), RBM14-RBM4 e1/3-e2/4 (18 reads), and C15orf38-AP3S2 e5/6-e2/6 (16 reads). (In this notation, we indicate the number of exons in the gene, so "e4/5" denotes the fourth exon out of five in the gene.)
The 339 TIC events showed several cases of multiple TIC isoforms across 302 distinct gene pairs. About 11%, or 32, of the gene pairs had multiple isoforms, with the pair PLEKHO2-ANKDD1A having 4 isoforms (Figure ), and three pairs, KIAA1984-C9orf86, GCSH-C16orf46, and RBM14-RBM4, each having three isoforms. In almost all cases of multiple isoforms, the isoforms had one splice site in common, suggesting the existence of a preference for that splice site. The finding of multiple TIC isoforms has been reported experimentally in isolated cases, but not in previous EST-based surveys, which were not designed to identify them.
Figure 1 Complex isoforms observed in transcription-induced chimeras. TIC splicing events are shown by dashed arrows, labeled with splice distance and samples or ESTs with supporting alignments. Standard splicing is shown by solid lines. (A) Multiple isoforms (more ...)
Comparison with existing databases
The AceView database [33
] is intended to store all observed alternative splicing events in various genomes. Manual examination of our TIC events on the AceView Web site indicated that many are annotated as complex loci, those in which a protein product is generated from the fusion of the two genes, although the two genes may still have distinct expression. We performed a search of AceView for all of our TIC events and found that 88 (26%) were previously identified in that database. The 32 gene pairs with multiple isoforms were more often listed in AceView as complex loci, with 40% (13) having that annotation.
We compared our TIC events with the 212 EST-based events reported by Akiva and colleagues, and found that 37 fusion events (11%) were identified with the same splice sites, with another 39 gene pairs (12%) identified with a different pair of splice sites. In comparison with the 176 human TIC events reported by Parra and colleagues, these values were 14 (4%) and 16 (5%), respectively. A comparison among the EST-based surveys and our study at the gene pair level shows relatively little overlap among the three studies (Figure ). Therefore, our RNA-Seq study increases the catalog of gene pairs with observed TIC events by 66%.
Figure 2 Characteristics of TIC events. (A) Comparison of TIC gene pairs found in previous EST-based surveys and those found by RNA-Seq in this study. (B) Distribution of TIC events across tissues. Only TIC events with multiple supporting reads are included. HBR (more ...)
We also scanned genomic alignments of all GenBank ESTs to find support for our TIC events, and found supporting ESTs for 100 events (29%), which covered 82 of those found in AceView plus an additional 18 events. In addition, 12 of our TIC events (4%) were also found by Maher and colleagues in their analysis of the HBR and UHR samples. A comparison of our findings with the FusionSeq-based prostate cancer study [18
] showed that we had reads for 6 of their 11 candidates, including VMAC-CAPS
, which was not otherwise supported by external evidence. However, one of their candidates, ZNF649-ZNF577
, was removed by our conservative homology filtering step, because we found that the two genes share a region with 42 matches in a window of size 50. Altogether, 128 (38%) of our TIC events had support from another database or study. We noted a difference in support between our candidates that had a single read finding and those that had multiple reads, with external support for only 59 (28%) of the 212 single-read candidates, but for 69 (54%) of the 127 multiple-read candidates.
Characteristics of TIC events
Among the 127 events supported by multiple alignments, 95 (74%) had reads from different samples. If we consider the prostate tumor and normal samples as a single tissue type, then 79 events (62%) had support from two or more different tissue sources in prostate, brain, and universal reference. The distribution of tissue sources among these multiply-supported TIC events supports a ubiquitous expression pattern for some events, with others that may potentially be brain- or prostate-specific (Figure ).
Distances of TIC splices showed an exponential distribution (Figure ), consistent with an earlier study [5
]. Approximately one-fourth had distances of less than 12,000 bp, one-half less than 26,000 bp, and three-quarters less than 54,000 bp. These values appear larger than the median of 8500 bp reported in the previous study, but that study measured the distance between genes, rather than the distance between spliced exons. When we computed the intergenic distance for each TIC, we obtained a comparable median of 8866 bp. To rule out the possibility that the preference of TICs for shorter splice distances was due to our underlying set of artificial exon-exon junctions, we compiled a comparative distribution over the artificial splice distances (Figure ), which shows a much different distribution favoring longer distances uniformly above 40,000 bp.
TIC splices occurred predominantly between the last donor site of the 5' gene and the first acceptor site of the 3' gene. In our set of readthrough fusions, this splicing pattern between the (n -
1) and +2 exon represented 54% of the cases, somewhat more than the 44% seen in a previous study [5
]. Exons upstream of the (n
- 1) exon spliced with the +2 exon 25% of the time, while the (n
- 1) exon spliced with exons downstream of the +2 exon 10% of the time. Therefore, alternate choices were more likely to occur with the donor site than with the acceptor site. In 12% of the cases, a TIC event spliced both an upstream exon other than (n
- 1) and a downstream exon other than +2.
To determine whether our TICs were likely to generate a functional protein, we computed a probable coding sequence (CDS) for each TIC (Figure ). For TIC breakpoints that occur after the transcription start site (TSS) of the 5' gene, we expect that the TSS should be preserved, serving as the start of the reading frame for the rest of the TIC transcript. Starting from the original TSS, the reading frame was preserved for the 3' gene in 120 cases (35%), ending at the original stop codon, and frameshifted in the other 193 cases (57%) with a post-TSS breakpoint. In the remaining 26 cases (8%) with a pre-TSS breakpoint, the TIC transcript must utilize a new TSS, in either the 5' or 3' gene, and we predicted the TSS in such cases based on the longest open reading frame. In 22 of the 26 cases, the new TSS site was in the 3'
gene and preserved its reading frame. Altogether, 155 TIC events (46%) preserved the frame of the 3' gene, which is somewhat higher than the chance expectation of 33% and the finding of 36% reported in a previous study [6
In the 54% of cases where a frameshift occurred in the 3' gene, we determined whether the CDS ended within the last exon, because transcripts with a premature termination codon (PTC) should be degraded by the cellular nonsense-mediated decay (NMD) mechanism [34
]. We found a PTC in 167 (91%) of the 184 TIC events with a frameshift in the 3' gene. Overall, our analysis indicates that half of TIC events should be degraded by NMD, and half should generate a protein product.
We extended our analysis to predict the domains encoded by the TIC protein products, and to compare them with the original domains of the 5' and 3' genes. We found that conservation of domains depended on whether the reading frame was preserved (Figure ). For TIC events with a full CDS from the original TSS of the 5' gene to the original stop codon of the 3' gene, domains should be lost only when the breakpoint occurs before a 5' domain or after a 3' domain. Among the 5' genes that originally had an identifiable domain, all of the domains were preserved in 70% of TIC events and at least one was preserved in an additional 10%. For the 3' genes, these values were 92% and 4%, respectively, indicating that 3' domains are highly likely to be preserved in these TICs. Among the 86 TIC events with identifiable domains in both the 5' and 3' genes, all domains of both genes were preserved by the TIC protein in two-thirds (58) of cases.
For TIC proteins that had a 3' frameshift or a new TSS, but still had their terminating codon in the last exon, domains were more likely to be lost. All 5' domains were preserved in only 45% of cases, and all 3' domains in only 44% of cases. TIC proteins with a premature termination codon (PTC) have a frameshift in the 3' gene, and therefore preserve no 3' domains. Theoretically, the predicted effect on 5' domains shows that they should largely stay intact, with 78% of these cases preserving all 5' domains. However, since proteins with a PTC are subject to degradation by NMD, they should not produce a protein product.
Expression of TICs and their constituent genes
One advantage of RNA-Seq data is that it can provide expression levels in addition to splicing information. Although our targeted alignments are not useful for determining expression, alignment of reads to either known transcripts or the genome can provide expression levels of genes or even exons. Our spliced alignment method, described later, provided genomic alignments for comparing the expression levels of 5' and 3' genes. We used these alignments to compare the expression of genes involved in observed TIC events with those not involved. To remove possible confounding factors, we required that the genes have an exon with a potential TIC exon within 200,000 bp and that they be expressed in the given sample, as determined by having an observed intragenic splice in that sample. The results (Figure and ) indicate that 5' and 3' genes with higher expression levels are more likely to give rise to observable TIC events.
However, RNA-Seq is limited in its ability to measure expression levels of rare splicing events, such as TICs. To obtain expression levels of TIC splice events and to confirm our computational predictions, we performed experimental qRT-PCR assays of six TIC events from among the 37 that had multiple supporting reads including one from a prostate tumor sample. Regardless of which samples had an observed TIC splice, we evaluated these TICs across our set of six prostate tumor and normal samples, as well as in a commercial sample of pooled normal prostate RNA. In all cases, the TIC events were detected in all samples (Figure , left sets of barplots), showing a ubiquitous expression pattern across prostate tissues. However, we did see some variability in expression across our samples, especially for MSMB-NCOA4, which was highest in the T2 and N2 samples; SLC45A3-ELK4, which was highest in T3 and N3; and AZGP1-GJC3, which were low in T2 and T3.
Figure 3 Expression of TICs and their component genes. (A-F) Each panel contains expression data for a TIC and its component genes, and is labeled with the splice distance. The leftmost plot in each panel shows the expression of the TIC splice using qRT-PCR measurements (more ...)
We also saw differences in the overall expression levels across different TIC events, and the TIC events in Figure are arranged in increasing order of measured TIC splice expression. The two fusions with the highest overall expression levels relative to GAPDH were MSMB-NCOA4 and SLC45A3-ELK4, each showing expression levels of up to 0.05 times the level of GAPDH. The expression levels of the remaining fusions tested were much lower, with ADCK4-NUMBL having only 0.004, and HDAC8-CITED1 only 0.0009 the level of GAPDH.
We used our RNA-Seq data to determine the expression of the 5' and 3' genes in each sample (Figure , right sets of barplots). In two of the TIC events (Figure ), the 5' and 3' gene expression levels are comparable, and in one TIC event, DUS4L
(Figure ), the 3' gene expression is higher than that of the 5' gene expression. However, in the remaining three TIC events (Figure ), expression of the 5' gene is several orders of magnitude higher than that of the 3' gene. When we compare the expression pattern of the TIC splice with that of its component genes, in most cases, we found a general pattern of correlation between expression of the TIC event and that of the 5' gene, although there was also some correlation with 3' gene expression. These results are generally consistent with other cases studied in the literature, which have also shown correlations between TIC expression and that of the 5' and 3' genes [10
]. However, in our data, the TIC event MSMB
, which had the greatest variance in expression across samples showed consistency of TIC expression with that of the 5' but not the 3' gene (Figure ).
We can compute the relationship between the TIC splice expression with that of the 5' gene, either as a scatterplot (Figure ) or as a ratio (Figure ). This relationship reflects the efficiency of TIC splicing relative to the amount of 5' transcript available. Since the data are plotted on a logarithmic scale, they suggest that TIC splicing efficiency varies among different events by several orders of magnitude, with the DUS4L-BCAP29 event having an efficiency 3000 times as high as that of MSMB-NCOA4. TIC splicing efficiency does not appear to be related to splicing distance, since we see much different levels for the two TIC events with the shortest splicing distances of 4166 and 5955 nt.
Tissue specificity of TICs
Previous reports of the SLC45A3
e4-e2 TIC have found it to be specific to prostate cancer samples, particularly those that lack ERG
]. We performed a detailed study of this TIC by qRT-PCR in an additional panel of 20 matched prostate tumor and normal samples (Figure ). Our assay detected at least some level of TIC expression in all prostate samples but none in any of 9 lung cancer cell lines, revealing that expression is indeed tissue-specific. We also saw variability of TIC expression with extremely high levels in 2 of 7 ERG-
negative tumors, at 0.35 and 0.11 times the expression of GAPDH
, and to a lesser degree, in their matched normal samples. In the remaining 5 ERG-
negative tumors, and in all 16 ERG-
positive tumors, expression of the TIC was detectable, but at lower levels, and with only three minor exceptions, expression of the TIC was lower in cancer than in the matched normal sample. Our data support earlier findings that low expression of ERG
appears to be a prerequisite, but not a sufficient condition, for high cancer levels of the e4-e2 TIC event.
Figure 4 Expression patterns of the SLC45A3-ELK4 e4-e2 TIC and related genes. (A) qRT-PCR levels of SLC45A3-ELK4 e4-e2 TIC in the sequenced prostate tumor and normal sample pairs T1/N1, T2/N2, and T3/N3 pairs (labeled as 1-3, and marked with "-" for ERG-negative (more ...)
To explain the prostate-specific expression pattern of the SLC45A3-ELK4 TIC, we examined expression data from 2823 human normal and 1437 tumor samples measured on the Affymetrix HG-U133 GeneChip, taken from the GeneLogic (Gaithersburg, MD) database. We found that the 5' gene SLC45A3 is expressed specifically in prostate tumor and normal samples (Figure ), whereas the 3' gene ELK4 is expressed broadly across multiple tissues (Figure ). Therefore, the prostate-specific expression pattern of this TIC appears to be consistent with expression of the 5' gene.
We further compared the expression of SLC45A3 with that of ERG in prostate samples (Figure ). The highest levels of expression of the 5' gene are found in the ERG-negative prostate tumor samples, although not in all of them. Hence, we found further similarity of the expression pattern of the TIC event and that of the 5' gene relative to ERG expression, where low levels of ERG expression appear to be a prerequisite, but not a sufficient condition, for the highest levels of the 5' gene expression.
We also used the GeneLogic database to find other examples of TIC events with prostate-specific expression of the 5' gene. We found such patterns for MSMB-NCOA4, AZGP1-GJC3, ENTPD5-FAM161B, TMC5-CP110, TPD52-MRPS28, IVD-BAHD1, and KLK11-KLK7. Among these gene pairs, the read evidence was strongest for MSMB-NCOA4 (8 reads over two isoforms) and AZGP1-GJC3 (41 reads), with only 1 or reads for each of the other gene pairs.
MSMB (Figure ) shows high expression in normal stomach and lung cancer samples, in addition to prostate samples, and appears to show higher expression in prostate normal samples than in prostate cancer samples. The expression of AZGP1 (Figure ) is high in breast, head and neck, and liver samples in addition to prostate. For both of these 5' genes, their corresponding 3' genes (Figure and ) do not show evidence of prostate specificity.
Figure 5 Expression profiles for 5' and 3' genes in prostate-specific TICs. Expression profiles for (A) MSMB, (B) NCOA4, (C) AZGP1, and (D) GJC3. Panels A and C represent the 5' genes of TIC events, while B and D represent 3' genes. Affymetrix probe sets used (more ...)
The prostate specificity of our supporting reads was consistent with the microarray-based prostate specificity of the 5' genes. For example, the MSMB-NCOA4 event was supported by 3 prostate tumor reads and 2 normal reads for the e3-e2 isoform, and by 1 tumor and 2 normal reads for the e2-e2 isoform, and was not observed in the HBR or UHR samples. Likewise, the AZGP1-GJC3 profile was supported by 7 reads across all three prostate tumor samples, and by 33 reads across all three normal prostate samples; it was also supported by one read from the HBR sample. The TIC events with prostate-specific 5' genes each had TIC support from one or two prostate samples, without any reads from HBR or UHR.
To survey the extent of tissue specificity in all of our observed TIC events, we constructed a heatmap to show the mean expression of the 5' gene across a panel of normal tissues (Figure ). This heatmap shows that 5' genes have a varying degrees of tissue specificity. Although it is difficult to define tissue specificity precisely, approximately half of the genes, in the bottom part of the heatmap, are strongly specific to one organ, with many expressed specifically in either leukocytes or in the brain. An additional one-fourth show weaker levels of tissue specificity, and one-fourth have relatively uniform expression over all tissues. The implication of these findings for the tissue specificity of TIC expression depends on the dependence of TIC expression on that of the 5' gene.
Figure 6 Tissue specificity of 5' genes in observed TICs. Heatmap of gene expression across a panel of normal tissues for the 5' genes corresponding to all observed TICs. Data are taken from the GeneLogic (Gaithersburg, MD) database. Each bar in the heatmap represents (more ...)
Spliced alignment method
We also explored an alternative, spliced alignment strategy for finding TICs by identifying splicing events within individual reads using our GSNAP program [27
]. Our program provides options for finding splicing based on a probabilistic splice site model or a user-provided database of known splice sites. We used both sources of evidence in our alignments to the human genome, with the same set of splice sites based on RefSeq transcripts as we used for the targeted alignment method. Local splicing involving known splice sites within 200,000 bp should therefore yield a subset of the results found in the targeted alignment approach. This method should be less sensitive than targeted alignment, because the version of GSNAP we used requires 14 nt on each side of an exon-exon junction to identify a spliced alignment, rather than the 8 nt we used for targeted alignment. (More recent versions of GSNAP can detect short overhangs when a database of known splice sites is provided.)
For the resulting spliced alignments, we applied the same clustering and filtering steps as for the targeted alignment approach, and found over two-thirds (231) of the 339 TIC events that were found by targeted alignment. These events were based upon 427 TIC alignments, which is half of the number found with targeted alignment. However, the sensitivity rate depended heavily on read length. For samples sequenced at 33-nt, spliced alignment yielded only 24 TIC alignments, which is only 18% of those found with targeted alignment. For samples sequenced at 50-nt or more, spliced alignment yielded 59% of the TIC alignments found by targeted alignment.
Transcription-induced chimeras with intervening exons
Although spliced alignment is less sensitive than targeted alignment, it does have the ability to find novel splicing at locations not included in a database of known splice sites. We observed many novel splice sites occurring within genes, corresponding to cryptic splice sites or novel exons. However, we were especially interested in novel exons in the context of TIC events. A previous study [5
] found several cases where EST clusters revealed a novel exon between two genes in a TIC, accounting for 12% of the gene pairs in their study. Such transcription-induced chimeras with intervening exons, which we call TICIEs, are more challenging to find using RNA-Seq 33- to 50-nt short reads, because they are generally not long enough to span the two introns surrounding most exons. However, we can detect parts of a TICIE by finding novel splicing events on both ends of an apparent intervening exon.
To identify such events, we looked for spliced alignments of 200,000 bp or less involving one known site from the database and one novel site based on a probabilistic splice model as implemented in GSNAP. For each sample, we recorded pairs where the novel splice sites were within 300 bp of each other, which was our threshold for exon length. For the T1, T2, T3, and N3 samples with 33-nt reads, we found 22, 39, 70, and 40 novel exons, respectively. For the N1, N2, HBR, and UHR samples with longer reads, we found 1315, 753, 1427, and 1502 novel exons, respectively. Novel splicing is reported by GSNAP more frequently with longer read lengths, because the program uses a sliding scale of alignment length and probabilistic model score to help find true positive novel splicing events.
The vast majority of these novel exons represent intragenic cases where both known splice sites belonged to the same gene. To find TICIE events, we extracted novel exons where the known splice sites were on different genes and further applied our homology filtering step described previously to eliminate probable false positive events. We further considered only cases where the known genes were coding and well annotated, as indicated by a RefSeq accession prefix of "NM_".
We found 30 TIC events having an intervening exon, with 22 (73%) having their intervening exon occurring between the (n -
1) exon and +2 exon (Additional file 3
). Therefore, the (n -
1) to +2 exon pattern was stronger in TICIE events than in TICs, where only 51% had this pattern. In half (14) of the cases, the intervening exon was located in the intergenic region, distinct from exons of the original genes; in 7 cases, it overlapped the last exon of the 5' gene; and in 8 cases, it overlapped the first exon of the 3' gene. In the remaining case, EIF3K-ACTN4
e4/8-e2/21, the intervening exon was a novel exon between e4 and e5 of the 5' gene.
These TICIE events involved 26 distinct gene pairs, because four gene pairs had a second isoform. Three of these isoforms involved an alternate splice site in the intervening exon, (e.g., Figure ), while the remaining isoform involved an alternate splice site in the upstream gene (Figure ). For 6 of the gene pairs, our previous targeted alignment analysis had also found a direct TIC event without an intervening exon between the 5' and 3' gene pairs. One example, VAMP8-VAMP5 (Figure ), shows that the direct TIC and multiple TICIE isoforms can be found in the same sample N1.
We measured the coding potential of the TICIE events both with and without the intervening exon, and compared each with the original genes. Without the intervening exon, 10 events would give a full CDS, 14 started from the original TSS but had a frameshift of the 3' gene, and 6 had a new TSS. With the intervening exon, these counts changed to 5, 19, and 6, respectively. Therefore, intervening exons generally had a detrimental effect on the coding potential, relative to the original frames. The lack of preference for coding potential was also supported by the fact that intervening exons had lengths that were a multiple of 3 in 12 cases, approximately the same as the 10 that would be expected by random chance. The 18 intervening exons with lengths that were not multiples of 3 caused the 3' gene to go out of frame in 4 cases; go into frame in 2 cases; change one premature stop codon to a different premature stop codon in 8 cases; and had no effect in 4 cases because the new TSS occurred after the intervening exon. The intervening exons caused loss of domains in 6 cases relative to the TIC if it had lacked the intervening exon, and a gain of domains in one case by restoring the frame of the 3' gene. We checked to see if any intervening exons introduced a new domain into the protein, but did not find any such cases.
We found that EST-genomic alignments supported three-fourths of our TICIE splicing events: 3 of the 30 events had EST support for their upstream-to-intervening splices only; 11 had EST support for their intervening-to-downstream splices only; and 9 had EST support for both splices. Among the 9 TICIE events with EST support for both splices, 4 had at least one EST that spanned both the upstream-to-intervening and intervening-to-downstream splices (including the two events in Figure ).
We compared our TICIE gene pairs with those found by Akiva and colleagues in their study, and found five in common, although in three cases, the previous study found TIC events for the gene pair rather than our TICIE events. In one case, ZNF763-CHST7, our intergenic exon was identical to that found previously (although the upstream splice site was different), and in the other case, VAMP8-VAMP5, we found two intergenic exons that were different from that found previously, with all three exons sharing the same 3' end and having different 5' ends (Figure ).
Another advantage of the spliced alignment approach is its ability to detect long-range intrachromosomal fusions and interchromosomal events. Although the identification of translocations in RNA-Seq data is not novel, our single-end short reads are not as ideal for this purpose as paired-end or longer reads of 100-
200 nt [11
]. Nevertheless, our results based on GSNAP on short reads are noteworthy for comparison with previous studies, especially those analyzing the same HBR and UHR samples [17
]. In addition, our analysis of distant fusions provides a contrast with our analysis of TIC events, even though both are supported by the spliced alignment approach.
Our genomic alignments gave a total of 29 long-range intrachromosomal fusion candidates of greater than 200,000 bp; 36 scrambled candidates in which the acceptor splice site was upstream of the donor splice site; 59 inversion candidates with the splice sites on opposite strands of the same chromosome; and 223 interchromosomal candidates, or translocations (Additional files 4
Almost all of these fusions were identified through the 50-nt reads from N1, N2, HBR, and UHR. Filtering steps are also necessary to eliminate false positives from among these candidates, due to various causes, including library artifacts. Since most of our candidates were supported by a single read, one simple filtering criterion was to consider only fusions supported by multiple reads, of which we found 19 (Table ).
Distant gene fusions with multiple read support
In contrast with TIC events, which often had read support from multiple samples, distant gene fusions were each found in only a single sample, with the exception of TMPRSS2-ERG
, which was found in T2, T3, and N3 (although the N3 had only a single read which may be due to contamination with adjacent tumor tissue). Also, whereas we found many cases of multiple isoforms for TIC events, such isoforms were rare for distant fusions, with the exception of the RPS6KB1-TMEM49
scramble, where we found two isoforms, e2-e12 and e4-e12, in the UHR sample. We cannot determine from the transcriptional sequence data if these isoforms are present at the genomic DNA level, or are due to alternative splicing. However, numerous isoforms have been found for the TMPRSS2-ERG
fusion junction, including multiple isoforms in the same sample, suggesting that they are due to alternative splicing occurring after a single genomic event [36
We also found that spliced alignment using even 50-nt reads can be challenging, as demonstrated by our alignment of the BCAS4-BCAS3 fusion. We found this fusion only because GSNAP provides SNP-tolerant alignment, since the minor T allele rather than the reference G allele occurs at SNP rs2272962 in BCAS4 13 nt upstream of the exon-exon junction. Without our SNP-tolerance feature, this SNP would preclude a consecutive 14-nt stretch of upstream matches needed by GSNAP for genomic localization. Alternatively, we could also have avoided this difficulty with longer read lengths that would have given sufficient sequence specificity to tolerate a SNP near the fusion junction.
As shown in Table , many of our distant gene fusions have confirmatory evidence from the literature, especially in the MCF7 breast cancer cell line. The presence of cell line fusions in UHR can be explained by its derivation from 10 cancer cell lines, originating from breast adenocarcinoma, cervical adenocarcinoma, glioblastoma multiforme, melanoma, hepatoblastoma, embryonal carcinoma, liposarcoma, Hodgkin's lymphoma, plasmacytoma and T-cell lymphoblastic leukemia [37
]. One long-distance fusion, IQCJ-SCHIP1
, found in HBR and spanning 501,759 bp, has been found previously to be highly expressed in the brain [38
]. There are no other RefSeq transcripts between these two genes and the fusion pattern fits the characteristic (n
- 1) to +2 pattern, so this fusion may possibly represent an unusually long TIC. Our method identified five of the seven most prevalent fusions in UHR as reported by a method using paired-end reads [17
]. In addition, we also observed the NUP214-XKR3
fusion reported in that paper, but had only a single read supporting that fusion.
In contrast, the only distant gene fusions with evidence from 33-nt reads were two long-range interchromosomal fusions (C16orf58-NUPR1 and TMPRSS2-ERG e1-e4), three apparent inversions, and 14 apparent translocations. Among these putative fusions, the only ones with multiple supporting reads were TMPRSS2-ERG (with 5 reads from T2, 15 from T3, and 1 from N3) and the translocations MBPTS1-SERF2 and SEC31A-C6orf62 (each with 2 reads from T3).
fusion is known to be prevalent in prostate cancer, with the predominant isoform being e1-e4 [31
], which we found in our reads. The second most common isoform is e1-e5 [36
], which we did not find in our reads. However, we performed quantification of both isoforms in our samples by qRT-PCR, and found that both isoforms were present in T2 and T3 and at low levels in N2 and N3 (Figure ), with the e1-e5 isoform being present at one-tenth the level of the e1-e4 isoform. Since the TMPRSS2-ERG
fusion is known have a genomic origin, due either to a translocation or to a 3-million-bp genomic deletion between the genes [40
], the presence of both isoforms in the same sample can be explained either by tumor heterogeneity or by alternative splicing. We performed comparative genomic hybridization (CGH) microarray analyses of our samples, and found the deletion to be visibly evident in T3 but not in T2 (Figure ), indicating that T2 is of the translocation type.
Figure 7 Distant fusions. (A) Expression level of TMPRSS2-ERG e1-e4 and e1-e5 fusion splices in prostate tumor and normal samples measured by qRT-PCR, compared with ERG expression as measured by RNA-Seq. qRT-PCR measurements are shown for prostate tumor samples (more ...)
We further tested for the presence of the other candidate long-range interchromosomal fusion C16orf58-NUPR1 in our prostate samples but failed to find it by qRT-PCR, suggesting that spliced alignment of 33-nt reads can give false positives in distant fusions. On the other hand, we also tested for the translocation SEC31A-C6orf62 e1-e2, and confirmed its presence in the T3 sample, but not in other samples (Figure ).
Although the version of GSNAP we used requires at least 14-nt on each side of the exon-exon junction to to report a candidate gene fusion, we relaxed this criterion to see if we could identify other distant fusion events, and obtained the candidate long-range intrachromosomal gene fusions KRT24-NCOR1, LIN37-GPSN2, and IRS2-NUFIP1. We tested each of these candidates by qRT-PCR, but found confirmation only for IRS2-NUFIP1 e1-e8 in T2 (Figure ). This fusion spans 64 million bp on chromosome 13. However, our CGH microarray data showed no evidence of a genomic deletion (Figure ), suggesting translocation as the probable mechanism.
To evaluate the generalizability of the two novel distant gene fusions that we verified experimentally, we tested for the presence of both SEC31A-C6orf62 and IRS2-NUFIP1 by qRT-PCR in an additional 51 primary prostate tumor samples, but were unable to detect these fusions in any of these other samples, indicating that they are private fusion events. In addition, when we examined the expression of the downstream genes using our RNA-Seq data, we found no evidence that these fusion events increase the expression of the 3' gene (Figure and ). This is in contrast with TMPRSS2-ERG, where presence of the fusion greatly increases expression of the downstream gene (Figure ).
Functional analysis of IRS2-NUFIP1 shows that it has an in-frame coding region starting from the original TSS of IRS2 and maintaining the frame of NUFIP1. It retains the IRS and PH domains of IRS2, but because the breakpoint occurs after the NUFIP1 domain of NUFIP1, it loses that domain and presumably functionality of the 3' gene. For the SEC31A-C6orf62 fusion, the breakpoint occurs before the TSS of SEC31A. The longest open reading frame of the gene fusion gives a peptide that consists of the 171 amino acids on the C-terminal of the original 229-aa protein for C6orf62. No domains in the Pfam database were found for either SEC31A or C6orf62.