We demonstrated that the targeted sequencing approach with an optimized analysis setting can be used to identify oncogenic mutations. This approach could be of particular interest for the detection of point mutations in a set of important oncogenes and tumor suppressors or other disease related genes for diagnosis, prognosis prediction or therapy choice. Such information could be generated in a relatively short timeframe and with unprecedented detail. One of the major advantages over classical Sanger sequencing is the higher throughput of this method allowing that all exons of a gene set of this size can easily be sequenced. As such, full information is provided and rare variants or even previously undiscovered mutations in a particular gene can be detected. Indeed, of the 160 exonic and splice site variants (excluding the 61 synonymous variations) detected in the cell lines and patient samples across our panel of cancer genes, only 40 are found in the COSMIC database
[16], of which 24 are associated specifically with T-ALL. Although for some genes mutation hotspots exist (e.g., the KRAS G12, G13, Q61 mutations), the function of most cancer genes can be affected by mutations at different positions. Therefore, for most cancer genes the entire coding sequence needs to be re-sequenced, and for this the Roche/454 technology is particularly suitable.
To detect mutations using next-generation sequencing - either to replace or complement molecular diagnosis - standardized bioinformatics analysis pipelines with very high accuracy are required. Such a pipeline consists of a mapping algorithm to align the sequence reads to the reference genome, a variation calling algorithm to identify differences between the sample and the reference, and a variation filtering algorithm.
We compared multiple combinations of mapping and variation calling algorithms, and found that combining two mappers, namely SSAHA-2 and BWA-SW, followed by Atlas-SNP2 yields the most accurate variation detection results. Adding two mapping algorithms filters out false positive variant predictions due to erronous mapping, and the error model of Atlas-SNP2 enables the elimination of reads that have multiple best matches in the reference genome. We also found that additional data filters on depth of coverage and on variant allele frequency further increased both the sensitivity and specifity of variation detection.
We encountered several technical limitations during data analysis. First, we had to remove duplicate reads introduced by PCR amplification steps during sample preparation since we noticed these were causing false positive SNV predicitons. Second, we could only predict SNVs, while indels (small insertions and deletions) had to be ignored since our work (data not shown) and previous studies indicate that 454 reads are not suited for indel detection due to the large amount of false positive results
[4]. In a diagnostic setting, where 100% specificity is pursued, it is critical to identify genes or regions in genes that are prone to acquisition of indels and to design alternative assays to investigate them. Likewise, genomic rearrangements are important causes of T-ALL but require complementary detection technologies.
We believe that using a long read sequencing technology, such as Roche/454 or the more recent Pacific Bioscience, provides particular advantages with regards to both sensitivity and specificity of variation detection. First, long read alignment allows better distinction between highly similar genes in the genome. For example, one of the genes we re-sequenced was NOTCH1, a gene with multiple homologs (namely NOTCH2, NOTCH2Nl, NOTCH3 and NOTCH4). However, we observed no reads mapping to any of these homologs, even though we mapped the reads to the entire genome. This indicates that both the sequence capture and the mapping were specific. On the other hand, we also encountered an example where the sequence capture was not specific. Namely, the PMS2 gene is one of the targeted genes in our study, yet we observed reads mapping to the PMS2 pseudogene, PMS2CL, which contains the first six exons of PMS2 gene. Thanks to the use of long reads, this causes no problems for variation detection because for each gene the respective reads mapped uniquely to the correct gene, either PMS2 or PMS2CL. Note that the capture technology provides additional cues to achieve higher specificity because not only the exons are covered in the capture but also the flanking intronic regions. Therefore, the alignment is ‘aided’ by the intronic regions, where sequence similarity between homologs is lower, allowing for the reads to be correctly attributed to their origins in the genome.
Second, mapping long reads to a reference genome is more robust towards extensive local variation, which can be present in particular genomic regions, or can be higher when samples are sequenced from a different ethnicity compared to the reference genome
[28]. We indeed found several regions that contain a high number of SNPs within a short sequence window. For example, there are 22 SNP clusters across all samples in a window of 200 bp with at least 3 SNPs, and 5 distinct clusters in a window of 100 bp with at least 3 SNPs.
Figure S3 shows several examples, such as cluster of three SNPs within 100 bp in the SUMF1 gene in the ALLSIL cell line, and a cluster of 4 SNPs in a 200 bp window in the PTPRM gene in CCRF-CEM cell line. Nevertheless, in both cases a high coverage is obtained (36x and 46x respectively). These examples show that long reads enable a correct alignment and variation discovery, in contrast to short read sequencing technologies for which the mapping algorithms usually allow for a maximum of two mismathes per read.
We applied our analysis strategy to T-ALL by sequencing a set of 97 genes. This set consists of 58 known oncogenes and tumor suppressors in T-ALL and other cancers, and 39 genes selected via a candidate approach. Regarding the identification of variations in these genes using 454 sequencing and our optimal optimized analysis pipeline, we reached 95% sensitivity and 93% specificity on a confirmation set of 210 variants validated by capillary sequencing. Furthermore, we detected 85.7% of the mutations reported in 11 cell lines that were also sequenced in the Cancer Cell Line project. High performance of our resequencing approach is also illustrated by the fact that we identified mutations in known candidate drivers in T-ALL that were included in the collection of known cancer genes such as
NOTCH1
[29],
FBXW7
[30],
PTEN
[31],
PHF6
[14],
WT1
[32],
[33] and
PIK3CA
[34].
We detected mutations in several known cancer genes where a link to T-ALL has not been established yet, such as JAK3. Interestingly, a recent article confirmed the mutation status of this gene in the context of T-ALL
[20]. We also identified novel mutations in genes that were not previously associated with T-ALL tumorigenesis such as TET1, SPRY3 and SPRY4.
It is remarkable that more novel sequence variants are found per cell line sample than per patient, and that genes were in general more frequently mutated in cell lines than in patients. Excessive gene mutations can be explained by potential genomic instability of cells in culture, or can be caused by in vitro cell culture conditions. This hypothesis could be confirmed for TYK2, a very striking example for which 7/18 (38%) T-ALL cell lines contain novel TYK2 sequence variants as opposed to only 2/93 (2%) T-ALL patients. Interestingly, we could demonstrate that several TYK2 variants in cell lines had been acquired during culture. It remains to be determined what is promoting the frequent acquisition of TYK2 variants in these T-ALL cell lines as opposed to T-ALL patients. The most obvious explanation are differences between the in vitro cell culture conditions and the physiological environment of T-ALL cells. As several cytokine signaling pathways depend on TYK2, presence of different cytokines and/or different concentrations of cytokines that use TYK2 signaling might be critical. These observations underscore once more that data obtained from cell culture models should be interpreted with care, especially when extrapolating these data to patient samples.
It is nevertheless interesting to note that this tendency of higher mutation frequency in cell lines compared to patient samples does not extend to all analyzed genes. The most evident example is TET1, showing novel variants in only 1/18 cell lines (KARPAS-45) versus 5/37 (13.5%) patients.
In conclusion, we describe a method for fast re-sequencing of a moderate size gene set of 97 genes using 454 next generation sequencing equipment that would be suitable for implementation into the clinic. Our results show that this setting is useful to identify (i) known mutations in known driver genes; (ii) new mutations in known drivers; and (iii) oncogenes or tumor suppressors that had not previously been associated with a specific subtype of cancer based on a candidate gene approach.
The optimized data analysis pipeline, which was assembled from publicly available tools, slightly exceeded the performance of the Roche gsMapper software with 95% sensitivity and 93% specificity for SNV detection, and subsequent analysis of the Roche/454 data from the T-ALL cell lines and patient samples confirmed previously known oncogenes and tumor suppressors in T-ALL and identified previously unrecognized rare somatic mutations in TET1 and SPRY4 in T-ALL patients. Screening a larger patient series should reveal the exact mutation frequency of these genes in T-ALL and whether mutations in these tumor suppressors also play a role in other types of hematopoietic malignancies.