Tyrosine kinase (TK) fusions are attractive drug targets in cancers. However, rapid identification of these lesions has been hampered by experimental limitations. Our in silico analysis of known cancer-derived TK fusions revealed that most breakpoints occur within a defined region upstream of a conserved GXGXXG kinase motif. We therefore designed a novel DNA-based targeted sequencing approach to screen systematically for fusions within the 90 human TKs; it should detect 92% of known TK fusions. We deliberately paired ‘in-solution’ DNA capture with 454 sequencing to minimize starting material requirements, take advantage of long sequence reads, and facilitate mapping of fusions. To validate this platform, we analyzed genomic DNA from thyroid cancer cells (TPC-1) and leukemia cells (KG-1) with fusions known only at the mRNA level. We readily identified for the first time the genomic fusion sequences of CCDC6-RET in TPC-1 cells and FGFR1OP2-FGFR1 in KG-1 cells. These data demonstrate the feasibility of this approach to identify TK fusions across multiple human cancers in a high-throughput, unbiased manner. This method is distinct from other similar efforts, because it focuses specifically on targets with therapeutic potential, uses only 1.5µg of DNA, and circumvents the need for complex computational sequence analysis.
Tyrosine kinases (TKs) are tightly regulated signaling enzymes that control multiple cellular processes. When TK signaling becomes deregulated due to mutations or rearrangements involving the kinase domain, the resultant sustained activity can lead to cancer. Because constitutive kinase activity can also be required for tumor maintenance, aberrant TKs serve as attractive therapeutic targets (1). The best example of this concept involves the BCR-ABL (BCR - breakpoint cluster region gene; ABL - Abelson murine leukemia viral oncogene homolog 1 gene) TK fusion protein in patients with chronic myelogenous leukemia (CML) (2). As a consequence of the fusion, the ABL kinase is constitutively activated (3,4). CML cells are dependent upon signaling from BCR-ABL and die upon treatment with the kinase inhibitor, imatinib (Gleevec). Clinically, the drug has revolutionized treatment of the disease.
Thus far, only a limited number of TK fusions have been found in cancers. The majority of TK fusions have been identified in hematopoietic malignancies as opposed to solid tumors, because the latter are difficult to karyotype, harbor multiple genomic aberrations, and are often clonally heterogeneous (5). Nevertheless, fusion proteins do exist in epithelial cancers. TK fusions involving RET, NTRK1, -3, and the serine/threonine kinase, BRAF, have been found in radiation-induced thyroid cancers (6–8). Recently, a subset of non-small cell lung cancers (NSCLCs) was discovered to harbor gain-of-function ALK and ROS fusions (9–11). Importantly, only 2 years later, an ALK inhibitor has already demonstrated promising activity in patients with EML4-ALK-fusion-positive lung tumors (12). However, the identification of these fusions involved highly laborious techniques not amenable to rapid screening.
The recent application of high throughput next generation sequencing technologies to whole genomes and whole transcriptomes has facilitated the discovery of multiple translocation events in cancer cells (13–19). These efforts have been enhanced further by re-sequencing of selectively captured regions of interest (20). Capture technologies utilize complementary oligonucleotide ‘baits’ either immobilized on a chip (solid capture) or in-solution (liquid capture) (21–24). Traditionally, chip based arrays have been coupled with long read 454 sequencing (21,23,24), while ‘in-solution’ capture has been paired with short read sequencing (i.e. Illumina GA and AB SOLiD) (22). However, these methods require large amounts of starting material and generate vast numbers of sequences. Furthermore, these methods have not focused specifically on identifying TK fusions.
Here, we present the development of a ‘rationally designed’ DNA capture strategy that overcomes many of the limitations of current fusion discovery platforms. Our strategy focuses specifically on the discovery of novel TK fusions because of their clinical significance and inherent druggability. We hypothesized that tumors contain as yet unidentified TK fusions whose discovery has been hindered by experimental limitations. In contrast to other targeted approaches, our design was based upon unique conserved genomic properties of existing TK fusions. Importantly, our capture-sequence approach is feasible, rapid, and requires minimal amounts of starting tumor cell DNA, making it amenable for high-throughput screens to identify systematically TK fusions in any cancer.
The nucleotides encoding the GXGXXG motif for all 90 TKs (25) and four serine/threonine kinases (BRAF, AKT-1, -2, -3) were mapped using MapBack (hg18), a locally created database that maps protein residues to corresponding genomic coordinates (http://cbio.mskcc.org/Public/products/human_mapped/Mapback; A. Lash and C. Byrne, manuscript in preparation). From these regions, genomic coordinates for the three preceding introns and the two preceding exons were mapped using ENSEMBL (hg18; http://www.ensembl.org). All exons/introns were labeled according to ENSEMBL numbering. In cases where the GXGXXG motif was encoded by more than one exon, only the first exon was included in capture, as breaks are likely to occur upstream of this motif. For ABL1, capture included intron 1-2 to exon 4 (GXGXXG motif), and for ROS1, capture included intron 31-32 to exon 36 (GXGXXG motif). An extra exon and intron upstream of the target region were also included for ABL2. Coordinates were submitted to Agilent Technologies (Santa Clara, CA, USA) for custom bait design. Repetitive elements, as identified by the UCSC genome browser (26) (http://www.genome.ucsc.edu), were excluded from bait design.
The human cell lines KG-1 and TPC-1 have been characterized previously (27,28). KG-1 cells and TPC-1 cells were kindly provided by R. Levine and J. Fagin (MSKCC), respectively. KG-1 cells were cultured in RPMI media (American Type Culture Collection, ATCC) supplemented with 10% fetal bovine serum (Gemini Bio Products) and pen-strep solution (Gemini Bio Products; final concentration 100U/ml penicillin, 100µg/ml streptomycin). TPC-1 cells were cultured in Dulbecco's Modified Eagle Medium (DMEM) supplemented with glucose (4.5g/l), 5% fetal bovine serum, pen-strep solution (final concentration 100U/ml penicillin, 100µg/ml streptomycin) and 2mM glutamine. All cells were grown in a humidified incubator with 5% CO2 at 37°C.
Genomic DNA from all samples was extracted using standard phenol extraction protocols. For each sample, 1.5µg of DNA was sheared with a Roche Nebulizer to 300–500bp fragments. Fragment size was confirmed on a Bioanalyzer using the DNA 7500 assay (Agilent). 454 adaptors (Roche) were ligated according to the manufacturer’s instructions. Ligated products were size selected on an agarose gel, purified using the AMPure kit (Agencourt), and PCR amplified for 15 cycles. The PCR products were purified with a MinElute PCR Purification Kit (QIAGEN). Capture was performed at Agilent Technologies (Santa Clara, CA, USA) using their SureSelect Target Enrichment System. Subsequently, 2–4µl of eluted single stranded DNA was used for emulsion PCR with emPCR Kit I (Roche). Approximately 300000 beads/sample/run were used for sequencing on a 454 FLX sequencer (Roche).
Two independent BLAT-based methods were used for 454 sequence analysis. The first method aligned 454 reads to a custom library of known genes derived from BLAT (29). Only TK-containing sequences (and BRAF and AKT-1,-2,-3-containing sequences) were considered for further evaluation (Supplementary Figure S2). To minimize recovery of repetitive elements and low complexity sequences, candidate fusion sequences then had to meet the following criteria: (i) the entire length of the sequence had to map to ≤3 targets within the genome; (ii) the sequence overlap between the targets could not exceed 5 bps and (iii) any sequence gaps between the targets could not exceed 5 bps.
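The three criteria of the first method can be illustrated with a minimal Python sketch. It is not the code used in this study: the `Segment` type and the `passes_method1` function are hypothetical names, each alignment block is counted as one "target", and two assumptions are made that the text does not state explicitly, namely that a candidate needs at least two blocks and that unaligned bases at either end of the read count as gaps.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    q_start: int  # 0-based start on the read (inclusive)
    q_end: int    # end on the read (exclusive)
    target: str   # name of the genomic target that was hit

def passes_method1(read_len, segments, max_targets=3, max_overlap=5, max_gap=5):
    """Apply criteria (i)-(iii) to the alignment blocks of one read."""
    # (i) the read must map to at most `max_targets` blocks;
    # assumption: a fusion candidate needs at least two of them.
    if not 2 <= len(segments) <= max_targets:
        return False
    ordered = sorted(segments)  # sorts by position along the read
    # Assumption: unaligned bases at the read ends are treated as gaps.
    if ordered[0].q_start > max_gap or read_len - ordered[-1].q_end > max_gap:
        return False
    for left, right in zip(ordered, ordered[1:]):
        delta = right.q_start - left.q_end  # >0 is a gap, <0 is an overlap
        # (ii) overlap limit and (iii) gap limit between adjacent blocks
        if delta > max_gap or -delta > max_overlap:
            return False
    return True
```

For example, a 200-bp read whose first 120 bases align to one gene and whose bases 118–200 align to a second gene overlaps by 2 bp at the junction and passes, whereas a read with a 10-bp unaligned stretch between the two blocks is rejected.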
The second method mapped the 454 reads to the entire human genome using BLAT and considered only those reads that hit one kinase target and one other region in order along the query sequence. The candidate sequences in the high-scoring segment pairs (HSPs) of the BLAT output then had to meet the following criteria: (i) the distance between the targets could not exceed 5 bps; (ii) no more than one mismatch was allowed, and gaps were ≤2 bps; (iii) alignment to the two targets had to account for ≥95% of the query sequence; and (iv) no more than 5 bases could be removed from either end of the sequence.
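The second set of criteria can be sketched in the same way. Again this is an illustration under stated assumptions rather than the authors' implementation: `HSP` and `passes_method2` are hypothetical names, the mismatch and gap limits are assumed to apply to each HSP separately, and an overlap at the junction is assumed to be limited in the same way as a gap.

```python
from typing import NamedTuple

class HSP(NamedTuple):
    q_start: int      # 0-based start on the read (inclusive)
    q_end: int        # end on the read (exclusive)
    mismatches: int   # mismatched bases within this HSP
    longest_gap: int  # longest alignment gap within this HSP, in bp
    is_kinase: bool   # True if the hit falls in a kinase target region

def passes_method2(read_len, a, b):
    """Apply criteria (i)-(iv) to the two HSPs of one read."""
    first, second = sorted((a, b), key=lambda h: h.q_start)
    # Exactly one of the two hits must be a kinase target.
    if first.is_kinase == second.is_kinase:
        return False
    # (i) junction distance; assumption: an overlap is limited like a gap.
    if abs(second.q_start - first.q_end) > 5:
        return False
    # (ii) assumption: the limits apply to each HSP separately.
    if any(h.mismatches > 1 or h.longest_gap > 2 for h in (first, second)):
        return False
    # (iii) both HSPs together must cover at least 95% of the read.
    overlap = max(0, first.q_end - second.q_start)
    covered = (first.q_end - first.q_start) + (second.q_end - second.q_start) - overlap
    if covered < 0.95 * read_len:
        return False
    # (iv) at most 5 unaligned bases at either end of the read.
    return first.q_start <= 5 and read_len - second.q_end <= 5
```

Keeping each criterion as a separate early return makes it straightforward to tighten or relax one threshold without touching the others.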
Additionally, sequences from candidate fusions identified by either method had to be recovered from at least two independent 454 sequences. A combined list of TK fusion candidates was generated from sequences that met these criteria. These sequences were then aligned back to the entire genome using BLAT, and those with ≥98% identity to a single repetitive element were eliminated. The remaining candidate fusion sequences were validated by PCR using fusion point-spanning primers and appropriate genomic DNA. Upon PCR confirmation, the genomic fusion sequence was queried against all the 454 reads from the same sample to find additional fusion sequences that may have been missed by automated methods. In particular, this step allowed us to recover fragments where the length of sequence from one fusion partner was <25 bps, the minimum length required for BLAT searches.
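The requirement that a candidate be supported by at least two independent reads amounts to a simple grouping step. The sketch below assumes that each candidate has already been reduced to a hashable junction key (for instance a tuple of the two breakpoint positions); the function name and the key format are hypothetical.

```python
from collections import defaultdict

def recurrent_junctions(calls, min_reads=2):
    """Keep junctions supported by at least `min_reads` distinct reads.

    `calls` is an iterable of (read_id, junction_key) pairs.
    """
    support = defaultdict(set)
    for read_id, junction in calls:
        support[junction].add(read_id)  # a set ignores repeated read IDs
    return {j: len(ids) for j, ids in support.items() if len(ids) >= min_reads}
```

Collecting read identifiers in a set means that the same read reported by both BLAT-based methods is counted only once.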
454 sequences were mapped initially to the human genome (hg18) using BLAT (29) with parameters that would allow for partial mapping of reads, as would be the case if fusions were present. Average read length was determined from all reads so as not to exclude any fusion sequences mapping to multiple targets.
The capture efficiency was calculated from the fraction of mapped sequences that overlapped with bait target regions. To put this number in proper context, and to account for the small portion of the genome used for capture, we computed a ‘normalized’ enrichment factor defined as:
This equation corrects for cases where a small genomic region was targeted with baits.
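One common way to express such a normalized enrichment, offered here as an assumption rather than as the exact formula used in this study, is the on-target fraction of mapped reads divided by the fraction of the genome covered by the target regions. A minimal Python sketch, with hypothetical argument names:

```python
def enrichment_factor(on_target_reads, mapped_reads, target_bp, genome_bp):
    """Fold enrichment of target sequence over a uniform genomic background.

    Assumed definition: (on-target reads / mapped reads) divided by
    (target size / genome size).
    """
    if mapped_reads <= 0 or target_bp <= 0 or genome_bp <= 0:
        raise ValueError("read counts and sizes must be positive")
    capture_efficiency = on_target_reads / mapped_reads
    target_fraction = target_bp / genome_bp
    return capture_efficiency / target_fraction
```

Under this definition, a capture in which 22% of mapped reads fall on targets covering 0.1% of the genome would correspond to an enrichment of about 220-fold; the smaller the targeted fraction, the larger the enrichment implied by the same capture efficiency.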
The mapping outputs from each sample were then converted to and loaded as .bed files into the UCSC genome browser (see Supplementary Table S3 for sample barcodes) in addition to the genomic coordinates for each of the custom baits provided by Agilent Technologies. We also loaded the target sequences used for bait design as reference. Genomic coordinates for the individual baits are available upon request. Fusion sequences were mapped to the region with highest homology. The mapping of the sequences and coordinates across the entire genome is publicly available at: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18&hgt.customText=http://cbio.mskcc.org/~socci/JC/groupA.bed.gz.
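As an illustration of the conversion step, the sketch below formats mapped intervals as BED6 lines (chromosome, 0-based start, exclusive end, name, score, strand) under a custom track header. The function names and the toy coordinates are hypothetical and are not taken from the study's files.

```python
def to_bed_line(chrom, start, end, name, strand="+"):
    """Format one interval as a BED6 line (0-based start, exclusive end)."""
    if start < 0 or end <= start:
        raise ValueError("BED intervals need 0 <= start < end")
    return "\t".join([chrom, str(start), str(end), name, "0", strand])

def bed_track(track_name, intervals):
    """Return the text of a custom track: a header line plus one line per interval."""
    lines = [f'track name="{track_name}"']
    lines.extend(to_bed_line(*interval) for interval in intervals)
    return "\n".join(lines) + "\n"

# Toy example with made-up coordinates.
example = bed_track("sample_reads", [("chr1", 100, 300, "read_001", "-")])
```

Loading reads and baits as separate tracks in this format allows the two to be compared visually across each target region.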
PCR amplification of the candidate fusion genomic breakpoints and wild-type sequences was performed with M13-tagged primers (Supplementary Table S3) using HotStarTaq Master Mix (QIAGEN) and standard cycling conditions (95°C for 15min; 35 cycles of 94°C for 30s, 60°C for 30s and 72°C for 1min; and a final extension at 72°C for 10min). Normal male DNA (Promega) was used as a negative control. PCR products were separated by agarose gel electrophoresis. Excess primers and dNTPs were removed with ExoSAP-IT (USB Corporation), as per the manufacturer’s instructions, prior to direct dideoxynucleotide sequencing at the Vanderbilt DNA Sequencing Facility.
Total RNA was extracted from KG-1 cells with TRIzol reagent (Invitrogen). Extracted RNA was treated with DNase I (Sigma-Aldrich) and precipitated using standard protocols. Rapid amplification of cDNA ends (RACE) was done with a 5′-RACE system (Invitrogen), as per the manufacturer’s instructions. Five micrograms of RNA was used for the initial cDNA reaction with FGFR1.GSP1 (ACGGTTGGGTTTGTCCTTGT). Following dC-tailing, the cDNA was amplified with FGFR1.GSP2 (TCAGAGACCCCTGCTAGCAT) and the provided abridged anchor primer. PCR products were confirmed by agarose gel electrophoresis, and cloned into pCR.II-TOPO using a TOPO-TA cloning kit (Invitrogen). The presence of a PCR product insert was confirmed by EcoRI digestion (New England BioLabs), and inserts were sequenced using the T7 tag within the plasmid (Vanderbilt DNA Sequencing Facility).
Long-range PCR was performed with the LongRange PCR Kit (QIAGEN), as per the manufacturer’s instructions. An FGFR1OP2-F primer (AGATGATCCGGGTATAATAA) within exon 4 and an FGFR1-R primer (AGAAGAACCCCAGAGTTCAT) within exon 10 were used to amplify the genomic fusion sequence. The products were separated by agarose gel electrophoresis, and excess dNTPs and primers were removed with ExoSAP-IT (USB Corporation). The product was sequenced in steps using the primers FGFR1.fus-R1 (TCCAAAGACCATGGTAGGCC), FGFR1.fus-R2 (CACCTCTTCCAGCTTGACAT), FGFR1.fus-R3 (CGGTCATTCTTGCACACACC) and FGFR1.fus-R4 (ATGGGAGGGACCTGGTAGGA) at the Vanderbilt DNA Sequencing Facility.
Using an in silico approach, we analyzed the protein sequences from known cancer-derived TK rearrangements (n=59; Supplementary Table S1). Strikingly, all TK alterations identified to date contain an intact conserved GXGXXG motif (Figure 1A), which is essential for kinase activity (30). We also found that fusion points within the TK protein sequence usually occur within ~200 amino acids upstream of this motif, with few exceptions (e.g. JAK2 and ABL1).
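The motif search itself is simple to express. The following Python sketch, which is illustrative and not the analysis code used here, locates GXGXXG matches in a protein sequence and tests whether a given fusion point lies within a window upstream of one; the function names and the default 200-residue window are assumptions chosen to mirror the text.

```python
import re

# GXGXXG: glycine, any residue, glycine, any two residues, glycine.
# The lookahead allows overlapping matches to be reported as well.
GXGXXG = re.compile(r"(?=(G.G..G))")

def find_gxgxxg(protein):
    """Return the 0-based start position of every GXGXXG match."""
    return [m.start() for m in GXGXXG.finditer(protein.upper())]

def fusion_within_window(protein, fusion_pos, window=200):
    """True if `fusion_pos` lies at most `window` residues upstream of a motif."""
    return any(0 <= start - fusion_pos <= window for start in find_gxgxxg(protein))
```

Note that a bare pattern match can also hit glycine-rich stretches outside the kinase domain, so in practice the match would need to be restricted to the annotated kinase domain.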
This result prompted us to examine corresponding genomic fusion points at the DNA level. Although fewer DNA fusion sequences have been mapped, these also occur within a defined region (Figure 1B). In most instances, the GXGXXG motif is encoded by a single exon. The distance from the GXGXXG motif to the fusion point was variable due to the differences in intron and exon sizes. However, 80% of the TK fusions we analyzed had a fusion point within three introns upstream from the GXGXXG-encoding exon (Figure 1C). Breaks within ABL1, -2 and ROS occurred outside of this pattern. Assuming that novel TK fusions will follow a similar pattern, it should be possible to search systematically in a non-biased manner for fusions involving breaks within these regions in any of the 90 TKs in the human genome (25) using genomic DNA.
Our strategy for TK fusion discovery employs existing ‘DNA capture’ technology (21–23) followed by ‘next-generation’ sequencing (31) of recovered sequences. SureSelect technology (Agilent) captures specific DNA regions of interest (‘catch’) using 120-mer RNA ‘baits’ in-solution, allowing for enrichment of target sequences compared to unselected DNA (22). For this project, we paired SureSelect Target Enrichment technology with 454 sequencing (Figure 2) because the amount of starting template needed for SureSelect is the least (1.5µg) of any of the existing capture platforms, and the 454 platform delivers the longest read-lengths per sequence among next-generation sequencing platforms. To date, pairing ‘in-solution’ capture with long read sequencers (e.g. 454) has not been reported. We reasoned that longer reads would allow us to directly find fusion points without the need for intensive bioinformatic mapping algorithms.
To capture genomic regions of interest, we mapped the nucleotides encoding the GXGXXG motif for all 90 TKs in the human genome (Supplementary Table S2). We also included AKT-1, -2 and -3, and BRAF; these serine-threonine kinases have been implicated in cancer, and BRAF is rearranged in a subset of thyroid cancers (32). These regions were extended to include the entire GXGXXG-encoding exon, two preceding exons, and three preceding introns (Figure 1C). For TKs shown repeatedly to break outside this pattern (e.g. ABL1, -2 and ROS), the capture region was increased to include those areas where previous breaks had been observed. Based on our in silico analysis, capture of these mapped regions should detect 92% of known fusion points. The collective genomic coordinates were then submitted for custom bait design with 2x coverage for all capture regions. Coordinates for AATK were inadvertently left out of bait design. This focused capture strategy targets the regions where fusion points are most likely to occur and should reduce recovery of ‘diluting’ wild-type sequences. To decrease the amount of low complexity DNA, repeating regions (as identified by the UCSC genome browser, hg18) were excluded from bait design. Bait tiling averaged 73% across all targets (range: 34–100%, excluding AATK; Supplementary Table S2). Therefore, evenness of bait coverage across the target regions was dependent on the presence or absence of repetitive elements.
The assay was validated using DNA from two human cancer cell lines with known TK fusions: the thyroid papillary carcinoma cell line, TPC-1, known to harbor a CCDC6-RET fusion (27), and the acute myeloid leukemia cell line, KG-1, known to contain an FGFR1OP2-FGFR1 fusion (28). For both cell lines, fusions had been previously identified only at the mRNA level. Genomic DNA was sheared and ligated to 454 sequencing adaptors (Figure 2). The length of the sheared DNA (300–500nt) was significantly longer than the baits (120-mers), allowing for capture of fusion sequences with only a short TK-containing portion.
Following DNA capture, ~60000–100000 reads per sample were generated using the 454 FLX platform (Table 1). In total, the length of recovered reads averaged 193bp (Supplementary Figure S1), and sequences corresponding to TK ‘baited’ regions were enriched ~776-fold, indicating the efficiency of our capture. Twenty-two percent of the ‘catch’ mapped to bait regions, and the average enrichment across all kinases ranged from 80- to 3180-fold (excluding AATK; Supplementary Table S2). To visualize the ‘bait’ and ‘catch’ coverage across the genome, these files were loaded as custom tracks into the UCSC genome browser (publicly available, ‘Materials and Methods’ section). The average bait coverage was 0.90x for TPC-1 sequences (range 0–17.0x) and 1.51x for KG-1 sequences (range 0–29.4x). Of the 13972 baits used for capture, 2552 (18.3%) were covered by >2 TPC-1 sequences, and 3697 (26.5%) had >2x coverage by KG-1 sequences. The differences in coverage between the two cell lines can be explained partly by the greater number of sequences recovered from capture of KG-1 DNA (Table 1).
The recovered sequences were analyzed for fusions using two novel independently derived computational algorithms (Supplementary Figure S2). Both BLAT-based methods separated target-containing sequences from non-target containing sequences as an initial filter. This was achieved by aligning the sequences to a library of known human genes or the entire human genome. Each algorithm generated a list of potential candidate fusions from the target-containing sequences using slightly different alignment and stringency criteria (‘Materials and Methods’ section). An entire sequence read had to be completely ‘mappable’ with small (2–5bp) gaps or overlapping regions. To reduce the number of false positives, fusion sequences also had to be recovered at least twice. Both methods produced similar results, and a combined list of fusion candidates was compiled for each sample (Supplementary Table S3).
Approximately 60000 reads were recovered from captured TPC-1 DNA (Table 1). Around 22% of sequences mapped to baited target regions, translating to a 772-fold enrichment of kinase-containing DNA over other genomic sequences. Computational analyses identified 12 potential kinase rearrangements (Supplementary Table S3), including a fusion sequence recovered twice that mapped to intronic regions from CCDC6 and RET (Figure 3A). This CCDC6-RET fusion was the only candidate validated by PCR and direct sequencing (Figure 3B–D). The 12nt sequence surrounding the fusion point was subsequently queried against the entire pool of TPC-1 454 sequences. One additional fusion sequence was found; it was likely missed by the automated algorithms, because it contained only a short (15bp) fragment of CCDC6. In total, all three sequences contained the same fusion point (Figure 3A). These data demonstrate the feasibility of our platform to detect TK fusions from genomic DNA.
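The manual re-query step can be sketched as a plain substring search for the junction sequence on both strands of every read. This is a minimal illustration with hypothetical function names; it assumes exact matching, which would miss reads carrying a sequencing error within the junction itself.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def reads_with_junction(reads, junction):
    """Return IDs of reads containing the junction on either strand.

    `reads` maps read IDs to sequences; matching is exact.
    """
    fwd = junction.upper()
    rev = reverse_complement(fwd)
    return [rid for rid, seq in reads.items()
            if fwd in seq.upper() or rev in seq.upper()]
```

Because the search does not depend on a minimum aligned length for either partner, it can recover reads in which one side of the fusion is too short to be aligned on its own.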
We next investigated the ‘bait’ and ‘catch’ coverage across RET kinase by simultaneously mapping these coordinates within the UCSC browser (26) (Figure 3E). Approximately 97% of RET was covered with baits, and RET-containing sequences covered most of the target region (1053-fold enrichment; 3.6-fold average coverage; Supplementary Table S2). Areas with few or no recovered sequences contained mostly repetitive regions, which did not have corresponding baits. The CCDC6-RET fusion sequences could have been captured by four separate baits within intron 11-12 of RET.
Reads recovered from captured KG-1 DNA were enriched ~794-fold for TK-containing sequences (Table 1). From the 22% of sequences that mapped to baited target regions, we identified 30 potential TK alterations, one of which involved two non-contiguous portions of FGFR1 (Figure 4A, Supplementary Table S3). Only the fusion involving FGFR1 was confirmed by PCR using fusion-specific primers and direct sequencing of the PCR products (Figure 4B and C). No additional reads were found after querying the FGFR1 fusion sequence against the pool of KG-1 454 sequences. The four sequences containing this fusion mapped to the forward strand of a portion of exon 9 and to the reverse complement of a portion of the intron between exons 9 and 10, suggesting an FGFR1 rearrangement occurring 5′ to exon 10 (Figure 4D). However, from the 454 reads alone, we were not able to deduce the exact upstream partner. Subsequent application of 5′-RACE using primers specific to exons 10 and 11 of FGFR1 found only the FGFR1OP2-FGFR1 fusion previously reported (data not shown).
To elucidate further how the recovered 454 fusion sequences were related to the FGFR1OP2-FGFR1 alteration, we performed sequencing of long-range PCR products obtained from amplification of the DNA sequence between exon 4 of FGFR1OP2 and exon 10 of FGFR1. This ~5kb region contained sequences that matched 100% to our 454 fusion reads, but we found that the genomic structure was much more complex, involving elements from intron 4-5 of FGFR1OP2, the inverted truncated exon 9, and intron 9-10 of FGFR1 (Figure 4D). Collectively, these data illustrate two points. First, data from our capture-sequence approach suggested a breakpoint occurring at a specific region in the kinase and were sufficient to allow us to find readily the upstream partner using 5′-RACE. Second, fusion breakpoints can be more complex than just the juxtaposition of two intronic elements from two different genes.
Approximately 88% of the FGFR1 capture region was covered with baits (Figure 4E; Supplementary Table S2), and FGFR1-containing KG-1 sequences were enriched 1922-fold with an 11-fold average coverage of the target region. Simultaneous mapping of the baits and sequences revealed a similar pattern to that observed for TPC-1 reads (Figure 4E). A large repeating region between exons 9 and 10 contained no sequence coverage due to the lack of baits. The four FGFR1 sequences containing the rearrangement could have been captured by 8 baits.
Three well-characterized lung cancer cell lines without known fusions were also included as controls: NCI-H820, harboring mutant EGFR (exon 19 deletion, T790M) and amplification of MET (33), NCI-H1703, with amplification of PDGFRβ (9) and NCI-H3255, containing amplified mutant EGFR (L858R) (34). Analysis of the sequences from H820, H3255 and H1703 identified 8, 12 and 24 candidate fusions, respectively (data not shown). None of the fusions were confirmed by PCR with breakpoint spanning primers. Enrichment of TK-containing sequences for these lines was similar to that of the fusion lines (data not shown). These data are consistent with a modest false positive rate also observed with other fusion discovery efforts that have used 454 sequencing (13,17). We hypothesize that the false positive sequences are likely an artifact of the ligation step used to add sequencing adaptors onto DNA fragments. We expect this number to decrease in future iterations of the method as refinements are made to the sequencing preparation steps.
The success of the ABL TK inhibitor, imatinib, in CML patients with BCR-ABL translocations and of the new ALK TK inhibitor in lung cancer patients with EML4-ALK fusions illustrates that cancer-driving TK fusions serve as excellent therapeutic targets. Exactly how many TK fusions exist in cancers, though, is currently unknown, because identification of TK fusions has traditionally required highly laborious techniques. As an example, BCR-ABL rearrangements were identified using conventional karyotyping, which first revealed the Philadelphia chromosome (35), followed by identification of the t(9;22) translocation (36), and eventual cloning of the ABL translocation (37). EML4-ALK fusions were discovered by (i) application of a tumor-derived cDNA expression library to a mouse 3T3 fibroblast focus formation assay (10) and (ii) immunoaffinity phosphoproteomic profiling by mass spectrometry (9).
Based upon an in silico analysis, we found that known TK rearrangements display conserved fusion point properties that might make them amenable to systematic screening using new technologies. This analysis led us to design a novel DNA-based approach to identify TK fusions in a selectively targeted high-throughput manner. Regions of genomic DNA likely to be involved in TK fusion events are captured by hybridization to custom designed RNA baits. Captured DNA is eluted and sequenced using next-generation 454 sequencing. We chose 454 sequencing over other deep sequencing platforms specifically because of its ability to achieve longer read lengths (~200nt; Table 1, Supplementary Figure S1), which could facilitate direct identification of fusion points that lie far upstream from the captured region (Figure 3A). Novel computational algorithms then allow for rapid identification of candidate fusions, which are validated by simple direct PCR or 5′-RACE methods. As proof that this approach is feasible and robust, we used it to map previously unknown genomic breakpoints in two human cancer cell lines (including both solid and hematologic malignancies) harboring fusions with RET and FGFR1, respectively, using only 1.5µg of DNA from each line. In both cases, we identified novel genomic fusion sequences and structures of the breakpoints. Importantly, with an established workflow for the platform, identification of candidate fusion sequences in a given tumor sample could theoretically be completed in ~2–3 weeks, encompassing DNA isolation and preparation, DNA capture, 454 sequencing and computational analysis.
The advent of high throughput next-generation sequencing technologies has recently facilitated the discovery of multiple types of translocation events in cancer cells (Table 2). One approach involves whole genome sequencing, which requires adequate starting material and large numbers of sequences for adequate coverage. As a result, genome-wide next-generation sequencing efforts are often coupled with copy number and karyotype data to prioritize regions where translocations may have occurred (15,16). This strategy requires complex computational algorithms that integrate sequencing and chromosomal analyses.
A number of RNA-based approaches (RNA-seq) also have been used to detect multiple types of fusion events (Table 2). These include whole transcriptome sequencing (13,14,18), and screening for disparate levels of expression between 5′ and 3′-ends of kinase genes on exon arrays (with expression higher in the 3′-end) (38). Whole transcriptome sequencing provides meaningful expression level data while eliminating background from non-coding genomic elements. However, this approach also requires analysis of a large number of sequences due to the abundance of housekeeping, ribosomal and mitochondrial transcripts, and is often coupled with copy number data as well. Recently, selective capture of 467 cancer related genes was applied to a cDNA library, and analysis of the subsequent massively parallel sequencing found multiple fusion events, demonstrating the use of this capture-sequence approach to detect translocations (20). However, the sequencing analysis demands are high. Most importantly, the sample requirements (20µg DNA, up to 100µg of total RNA, or 1µg of mRNA) for the above DNA- and RNA-based platforms are prohibitive for most standard patient samples, thus limiting the broad utility of these approaches. A DNA-based capture platform avoids the use of RNA, which is inherently less stable than DNA and more difficult to extract without degradation from clinical specimens.
While the fusion events identified through other methods are important for understanding the pathogenesis of cancer cells, their therapeutic potential remains unknown. Therefore, we focused specifically on identification of TK fusions which can be potentially targeted with specific kinase inhibitors. Based on the genomic properties of known TK fusions, we designed a highly focused capture platform that should detect 92% of known TK fusions. From ~160 000 sequences, we recovered only one sequence that fell within the housekeeping gene category. Furthermore, high density SNP analysis (Affymetrix 6.0 arrays) of DNA from both TPC-1 and KG-1 cells revealed that only the CCDC6-RET fusion was apparent from copy number studies (data not shown). Although this approach was validated using only cell line DNA, we do not anticipate problems using DNA from more heterogeneous tumor samples. Given the increased efficiency of capture methods developed by Agilent since our pilot study, we expect a greater percentage of the ~100000 recovered sequences from each run to map back to our target regions. Even from this pilot study, we know that fusions are still detectable with as few as ~13000 sequences (the number of TPC-1 sequences that mapped back to kinase targets; Table 1).
While our capture-sequence platform has distinct advantages, there are some limitations. First, if fusions occur outside of the targeted regions or within large repetitive elements, they would not be identified with this method. However, based on our in silico analysis, the majority of fusions should occur within the targeted regions, and our original design attempted to balance feasibility with the effort required to sift through hundreds of thousands of wild-type sequences. Other groups have been unable to detect a conserved motif at translocation breakpoints (16), indicating that fusions may occur in broadly defined regions versus at specific sequence motifs. Many of the genomic fusion sequences that we have analyzed occur outside of repeat elements, and those that occur within masked regions are still amenable to capture due to the length of the recovered sequences extending into regions without baits (Chmielecki and Pao, unpublished observations). Second, we detected a modest number of false positive fusion events (~0.025% of total sequences). However, this is a common problem for all fusion discovery platforms (with DNA and RNA), and the number of candidate fusions identified with our method is comparable to other similar methods. Third, the percentage of total sequences mapping to our baits is much lower than similar capture reports (22). However, our study used an early ‘beta’ version of the SureSelect technology, and capture has improved significantly since the official launch of the product. Additionally, SureSelect was not optimized for 454 sequencing at the time of this study. Fourth, other types of fusion events (e.g. ETS gene fusions) were omitted from our design, because targeted therapies for these types of translocations have yet to be developed. However, in theory, a similar capture platform could be designed based on the genomic properties of transcription factor fusions. 
Fifth, gaps in sequence coverage may be unavoidable due to deletions in tumor cell DNA, GC-richness, and regions not amenable to PCR amplification. This is demonstrated in our pilot project, where sequence gaps across regions are not the same for each sample (see custom UCSC Genome Browser). Finally, we note that strategies for fusion identification (Table 2) are rapidly evolving, and this platform represents just one approach towards kinase fusion discovery. The capture strategy we designed could theoretically be paired with any sequencing technology (e.g. SOLiD and Illumina GA, for which SureSelect is already optimized). As computational algorithms are refined further for such applications, these different sequencing approaches may represent alternatives that offer more cost-effective analysis.
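To give a sense of scale for the false positive rate noted above, the following minimal Python sketch converts a percentage of total sequences into an approximate read count. The function name is hypothetical, and the text does not state which total the ~0.025% figure refers to, so the two read counts used here (~160 000 and ~13 000, both quoted in this Discussion) are illustrative assumptions only, not reported results.

```python
def candidate_reads(total_reads: int, rate_percent: float) -> float:
    """Convert a per-read rate given in percent into an expected read count.

    Hypothetical helper for illustration; not part of the study's analysis.
    """
    if total_reads < 0 or rate_percent < 0:
        raise ValueError("total_reads and rate_percent must be non-negative")
    return total_reads * rate_percent / 100.0


# Assumed denominators: read counts quoted in the text, used only as examples.
FALSE_POSITIVE_PERCENT = 0.025
for total in (160_000, 13_000):
    estimate = candidate_reads(total, FALSE_POSITIVE_PERCENT)
    print(f"{total} reads -> about {estimate:.2f} false positive candidates")
```

Under these assumptions, a rate of ~0.025% corresponds to roughly 40 reads in a pool of ~160 000 sequences and about 3 reads in a pool of ~13 000, which is a small enough number of candidates to review individually.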
In the future, we plan to increase bait coverage from 2x to 5x, allowing for greater enrichment of our target regions. Baits with poor capture efficiency are also being redesigned to better capture some genomic regions. One redesign strategy includes creating baits against the opposite DNA strand in the event that the opposite sequence is more amenable to capture. Captured fragments will then be sequenced using the 454 Titanium platform, which allows for much longer reads (~500 bp). This longer read length will enhance sequence coverage across the target regions and allow for sequencing into areas not covered directly with baits. Additionally, the improved bait design paired with longer sequencing reads should further decrease the number of false positive candidate fusions by allowing the recovered sequences to be mapped more precisely.
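The opposite-strand redesign described above amounts to taking the reverse complement of an existing bait sequence. The short Python sketch below illustrates that operation only; the function name and the example sequence are hypothetical, and it is not the bait design software used in this study.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
_VALID_BASES = set("ACGTNacgtn")


def opposite_strand_bait(bait: str) -> str:
    """Return the reverse complement of a bait sequence written 5'->3'.

    Hypothetical helper for illustration; assumes a plain DNA alphabet.
    """
    invalid = set(bait) - _VALID_BASES
    if invalid:
        raise ValueError(f"unexpected characters in bait: {sorted(invalid)}")
    # Complement every base, then reverse so the result also reads 5'->3'.
    return bait.translate(_COMPLEMENT)[::-1]


# Made-up 10-base sequence used purely as an example.
example_bait = "ATGGCCTTGA"
print(opposite_strand_bait(example_bait))
```

Applying the function twice returns the original sequence, which is a convenient sanity check when regenerating a bait set for the opposite strand.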
In summary, we have devised a DNA-based platform using highly focused targeted capture and 454 sequencing for rapid and systematic discovery of TK fusions in cancers. Our data demonstrate that this novel method is feasible, rapid and applicable to routine tumor samples. Using this platform, we identified for the first time genomic fusion sequences of two TK fusion proteins (CCDC6-RET and FGFR1OP2-FGFR1), including the unique genomic structure of FGFR1OP2-FGFR1. We now plan to screen tumor sets for novel kinase rearrangements. To further extend its utility, we will also determine whether this method can be used on DNA extracted from formalin-fixed paraffin-embedded tissue. Ultimately, we hope to accelerate discovery of multiple novel TK fusions and facilitate clinical development of targeted anti-cancer therapies.
Supplementary Data are available at NAR Online.
National Institutes of Health National Cancer Institute (NCI) [grants R01-CA121210 (to W.P.); P01-CA129243 (to W.P.)]; Stand Up To Cancer-American Association for Cancer Research Innovative Research Grant [SU2C-AACR-IR60109 (to W.P.)]. The MSKCC Genomics core is supported by an NCI CCSG award to MSKCC (P30-CA008748). W.P. received additional support from Vanderbilt’s Specialized Program of Research Excellence in Lung Cancer grant (CA90949) and the VICC Cancer Center Core grant (P30-CA68485). R.K.T. is supported by the German Federal Ministry of Science and Education (BMBF) as part of the German National Genome Research Network (NGFNplus) program (grant 01GS08100). Funding for open access charge: SU2C-AACR Innovative Research Grant (SU2C-AACR-IR60109).
Conflict of interest statement. None declared.
We thank Juan Li of the MSKCC Genomics Core Laboratory for assistance with sample preparation and 454 sequencing; Ross Levine and Jim Fagin (MSKCC) for providing KG-1 and TPC-1 cells, respectively; Alex Lash and Caitlin Byrne of the MSKCC computational biology core for assistance with genomic mapping; Paula Woods (VICC) for technical assistance and Jennifer Pietenpol (VICC) for critical reading of the manuscript.