Somatically acquired structural variations (SVs) can induce alterations in genes that directly contribute to cellular transformation1. The application of transcriptome2 and whole genome sequence analysis3,4
of tumor and matched germ line samples have led to a marked improvement in our ability to identify SVs in cancer. Nevertheless, the accurate identification of SVs using next generation sequencing (NGS) remains challenging. Typically in these analyses, 30–100 bp reads from the two ends of a sequence fragment are obtained and mapped to the reference human genome, and discordances in distance, orientation, and/or mapping status (e.g., whether a read is mapped or unmapped to the reference genome) are used to identify structural variations5–9. These approaches infer only the approximate genomic location of an SV and fail to pinpoint its exact breakpoints at the nucleotide level. Moreover, the available methods tend to generate a high frequency of false positives when applied to experimental data, owing to PCR and/or sequencing artifacts and the inherent difficulty of accurately mapping sequences in repetitive regions.
To overcome some of these deficiencies, we explored an alternative approach for SV discovery that is based on direct mapping of SV breakpoints at the nucleotide level, without relying on the discordant mapping of paired-end reads. By definition, a sequence read that spans a bona fide structural variation will have a partial alignment to each of the two sides of the junction. Current NGS mapping algorithms like BWA10 compute a local alignment (that is, a partial alignment) for a read either automatically when its mate is globally aligned to the genome or by user request. The unaligned portion is masked by a process termed “soft-clipping”, so called because the unaligned subsequence is retained rather than trimmed even though it does not map to the current genomic location. With longer NGS read lengths (≥75 bp), these soft-clipped subsequences can be of sufficient length to map unambiguously to a different genomic location, thus identifying the second breakpoint of a putative structural variation.
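As a minimal sketch of the soft-clipping idea (not CREST's own code), the clipped subsequences of a read can be recovered from its CIGAR string as defined in the SAM format, where `S` marks soft-clipped bases that remain in the read sequence. The function names `parse_cigar`, `soft_clips`, and `last_aligned_base` are hypothetical helpers introduced only for this illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases.
REF_CONSUMING = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs, dropping hard clips.

    Hard-clipped bases (H) are absent from the read sequence, so they are
    irrelevant when slicing the sequence.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return [(n, op) for n, op in ops if op != "H"]

def soft_clips(cigar, seq):
    """Return (left_clip, right_clip); an empty string means no clip."""
    ops = parse_cigar(cigar)
    left = seq[:ops[0][0]] if ops and ops[0][1] == "S" else ""
    right = seq[-ops[-1][0]:] if len(ops) > 1 and ops[-1][1] == "S" else ""
    return left, right

def last_aligned_base(pos, cigar):
    """1-based reference coordinate of the last aligned base.

    `pos` is the 1-based leftmost aligned position (the SAM POS field).
    """
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1
```

For a 100 bp read aligned at position 1000 with CIGAR `60M40S`, the right clip is the final 40 bases and the last aligned base is at 1059, so the candidate breakpoint lies immediately after that coordinate. The 40-base clip is the part that would then be mapped elsewhere to find the partner breakpoint.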
Figure 1 Mapping SV breakpoints using soft-clipped reads. (a) Illustration of SV analysis using discordantly mapped paired-end reads versus mapping using soft-clipped reads. Red and blue segments represent two discontinuous genomic regions. (b) An example of ...
Based on this concept, we developed CREST (Clipping REveals STructure), a software tool that uses soft-clipped reads to directly map the breakpoints of structural variations. For each SV, the first breakpoint is identified by the presence of soft-clipped reads, while its partner is found by an assembly-mapping-searching-assembly-alignment procedure (Online Methods). The identified SVs are then classified into the following five subtypes based on the location and orientation of the breakpoints: (1) inter-chromosomal translocations (CTX), (2) intra-chromosomal translocations (ITX), (3) inversions (INV), (4) deletions (DEL), and (5) insertions (INS) (Supplementary Figs. 1 and 2).
We applied CREST to whole genome DNA sequence data obtained from five cases of childhood T-lineage acute lymphoblastic leukemia (T-ALL) with matched tumor and normal samples that were sequenced as part of the St. Jude Children’s Research Hospital, Washington University Pediatric Cancer Genome Project. This analysis identified a total of 110 SVs (Supplementary Table 1), including 36 CTX, 25 ITX, 1 INV, 26 DEL, and 22 INS. PCR primers were designed successfully for 107 (97%) of the predicted SVs, and Sanger sequencing of the generated amplicons from the respective tumors confirmed the predicted SV breakpoints in 89 (82% validation rate; representative results are shown in Figure 2). Across the five samples, the validated SVs include 31 CTX, 19 ITX, 1 INV, 22 DEL, and 16 INS. The validated translocations detected through CREST ranged from balanced translocations to highly complex rearrangements that involved multiple chromosomes. In one representative example, a complex rearrangement involving chromosomes 1, 4, 5, and 10 was defined in a single sample.
Figure 2 SV validation result for one T-ALL sample (SJTALL003). (a) PCR amplification of 28 SV breakpoints predicted by CREST. All putative SVs except for those tested in lanes marked in blue were validated by Sanger sequencing. Lanes marked in red point to amplicons ...
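The T-ALL counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative. It shows that the per-subtype counts sum to the reported totals, and that the overall validation rate depends on the denominator: about 83% of the 107 SVs with primers and about 81% of all 110 predictions, which bracket the 82% quoted.

```python
# Predicted and Sanger-validated SV counts per subtype, as quoted in the text.
predicted = {"CTX": 36, "ITX": 25, "INV": 1, "DEL": 26, "INS": 22}
validated = {"CTX": 31, "ITX": 19, "INV": 1, "DEL": 22, "INS": 16}
primers_designed = 107  # SVs for which PCR primers could be designed

total_predicted = sum(predicted.values())
total_validated = sum(validated.values())

# Fraction of predictions confirmed, per subtype.
per_type_rate = {sv: validated[sv] / predicted[sv] for sv in predicted}

# Overall rates under the two natural denominators.
primer_success = primers_designed / total_predicted
rate_vs_primers = total_validated / primers_designed
rate_vs_predicted = total_validated / total_predicted

for sv, rate in per_type_rate.items():
    print(f"{sv}: {validated[sv]}/{predicted[sv]} = {rate:.1%}")
print(f"primer design success: {primer_success:.1%}")
print(f"validated / primers designed: {rate_vs_primers:.1%}")
print(f"validated / all predictions: {rate_vs_predicted:.1%}")
```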
To compare the performance of CREST to other available algorithms, we first reanalyzed this data set using BreakDancer5, a commonly used tool that implements a paired-end discordance mapping (PEM) algorithm. BreakDancer identified only 27 of the 89 validated SVs that were defined by CREST. Moreover, although BreakDancer identified another 1,037 putative SVs, none of these survived a post-processing quality check, and thus they represented false positive predictions. A second PEM algorithm, GASV, detected 76 (85%) of the validated SVs among a total of 5,880,492 predictions, demonstrating that this relatively low false negative rate was achieved at the cost of an extremely high false positive rate. Re-analysis using Pindel11, a program that uses unmapped reads across insertion/deletion (indel) breakpoints, detected only five of the 89 validated SVs found by CREST, suggesting that different methods are required for finding gross structural variations and indels. Details of the superior performance of CREST compared to these algorithms are provided in Supplementary Data 1.
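To put the comparison on a common scale, the following sketch computes, for each tool, the fraction of the 89 CREST-validated SVs it recovered, using only the counts quoted above. Two caveats apply: the reference set is defined by CREST's own validated calls, so these figures describe overlap with that set rather than absolute sensitivity, and the variable names are illustrative.

```python
validated_svs = 89  # Sanger-validated SVs defined by CREST

# Number of those validated SVs that each tool also detected.
detected = {"BreakDancer": 27, "GASV": 76, "Pindel": 5}

# Fraction of the validated set recovered by each tool.
recovered = {tool: n / validated_svs for tool, n in detected.items()}

# For the second PEM tool, the text also gives the total number of predictions,
# so the share of its output that corresponds to a validated SV can be computed.
gasv_predictions = 5_880_492
gasv_validated_share = detected["GASV"] / gasv_predictions

for tool, frac in recovered.items():
    print(f"{tool}: {detected[tool]}/{validated_svs} = {frac:.1%}")
print(f"GASV validated share of predictions: {gasv_validated_share:.2e}")
```

The last figure is on the order of one validated SV per hundred thousand predictions, which is what the text means by an extremely high false positive burden.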
To further assess the performance of CREST, we applied it to a published whole-genome sequencing dataset from the metastatic melanoma cell line COLO-829 (ref. 12). Using a paired-end discordant mapping method3, the published analysis reported 37 validated SVs12. By comparison, CREST identified 76 SVs (Supplementary Data 2, Supplementary Table 2), including 26 of the 37 reported SVs. Of the 11 reported SVs that were not identified by CREST, six were found to have soft-clipped reads in the matching normal sample COLO-829BL, indicating that these six SVs represent germline polymorphisms rather than tumor-specific somatic SVs. Of the five remaining SVs, three lacked soft-clipped reads, one had soft-clipped reads that mapped to multiple genomic locations, and one had low-quality soft-clipped reads across the breakpoints.
CREST identified 50 additional SVs that were not reported previously12. We selected 20 to directly validate by PCR amplification of DNA extracted from the COLO-829 cell line (Supplementary Table 3). Eighteen of the 20 novel SVs, including 7 CTX, 9 DEL, and 2 INS, were validated by Sanger sequencing (Supplementary Figs. 3 and 4).
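The COLO-829 bookkeeping can be followed step by step in a short sketch. All numbers come from the text; the dictionary keys are paraphrased labels rather than terms used by the authors.

```python
crest_calls = 76     # SVs identified by CREST in COLO-829
reported = 37        # validated SVs in the published analysis
shared = 26          # reported SVs that CREST also identified

missed = reported - shared      # reported SVs not called by CREST
novel = crest_calls - shared    # CREST SVs absent from the published set

# Stated reasons for the reported SVs that CREST did not call.
missed_breakdown = {
    "soft-clipped reads also in matched normal (germline)": 6,
    "no soft-clipped reads": 3,
    "soft-clipped reads map to multiple locations": 1,
    "low-quality soft-clipped reads": 1,
}

# Validation of a subset of the novel calls.
tested = 20
confirmed_by_type = {"CTX": 7, "DEL": 9, "INS": 2}
confirmed = sum(confirmed_by_type.values())
novel_validation_rate = confirmed / tested

print(f"missed: {missed}, novel: {novel}")
print(f"novel SVs confirmed: {confirmed}/{tested} = {novel_validation_rate:.0%}")
```

Setting aside the six germline events, five reported somatic SVs remain unaccounted for by CREST, against 50 calls that the earlier analysis did not report.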
To assess the false negative rate of CREST in identifying germline structural variations, we simulated whole-genome sequencing data for the 887 copy number variations (CNVs) in NA12878, one of the individuals characterized by the 1000 Genomes Project through the application of 19 different SV detection methods to high-coverage sequencing data generated by 3 different platforms13. The false negative rate of CREST was 22–27% with 3% false positive calls, demonstrating its superior performance in both sensitivity and accuracy compared with BreakDancer and Pindel (Supplementary Table 4). Fifty-two percent of the CNVs missed by CREST are in regions of segmental duplication, where germline CNVs are frequent (26% of NA12878) but somatically acquired copy number alterations (CNAs) are rare (6% of the 5 T-ALLs) based on the data analyzed in this study, suggesting that the false negative rate for somatic CNAs could be lower than that for germline CNVs. The results of this analysis are presented in more detail in Supplementary Data 3.
Although the concept of using sequences that span breakpoints has been previously explored for finding chimeric mRNAs2, for mapping viral integration sites by targeted sequencing14, and for identifying indels11, CREST is the first use of this approach for mapping structural variations at the level of the whole genome. CREST is particularly well suited for identifying somatically acquired structural variations in paired tumor-normal samples, where its precision in finding the breakpoints, coupled with its integrated ability to subtract common variations present in both germline and tumor samples, also allows the removal of false lesions caused by artifacts generated during library construction and by the difficulties inherent in accurately mapping short sequence reads. Although other computational methods for detecting SVs have been developed, none outperformed CREST in our comparative analysis (see Supplementary Data 1). Moreover, methods specifically designed for the identification of germline deletions15 are not capable of finding inter- and intra-chromosomal rearrangements, which are key mechanisms for creating oncogenic fusion proteins in cancer. The entire CREST package can be downloaded from http://www.stjuderesearch.org/site/lab/zhang with user manual and test data.
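The tumor-normal subtraction described above can be illustrated with a small sketch. It is an assumption-laden simplification, not CREST's actual filter: candidate tumor breakpoints are dropped when the matched normal sample shows a soft-clip site on the same chromosome within a tolerance window. The function name `subtract_germline`, the `(chromosome, position)` tuple representation, and the default 10 bp window are all hypothetical choices made for illustration.

```python
from bisect import bisect_left

def subtract_germline(tumor_bps, normal_bps, window=10):
    """Keep tumor breakpoints with no nearby soft-clip site in the normal.

    tumor_bps, normal_bps: iterables of (chromosome, position) tuples.
    window: tolerance in bp (a hypothetical value, not taken from CREST).
    """
    # Index the normal-sample sites by chromosome, sorted for binary search.
    normal_by_chrom = {}
    for chrom, pos in normal_bps:
        normal_by_chrom.setdefault(chrom, []).append(pos)
    for positions in normal_by_chrom.values():
        positions.sort()

    somatic = []
    for chrom, pos in tumor_bps:
        positions = normal_by_chrom.get(chrom, [])
        # First normal site at or beyond the left edge of the window.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            continue  # also seen in the normal: germline event or artifact
        somatic.append((chrom, pos))
    return somatic

tumor = [("chr1", 1000), ("chr1", 5000), ("chr4", 200)]
normal = [("chr1", 1003), ("chr4", 900)]
print(subtract_germline(tumor, normal))
```

Because the comparison is made at breakpoint coordinates, precise breakpoint calls are what make a tight window workable; an approach that only localizes an SV approximately would need a much looser match.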
Although CREST provides a significant improvement over standard paired-end approaches for identifying SVs, it continues to have difficulty with repetitive DNA sequence regions, rearrangements that occur within or adjacent to germline polymorphic structural variations, and rearrangements that contain non-template DNA sequences that are inserted at the breakpoints and are of similar or longer length than the NGS reads. In addition, CREST, like all mapping methods, demands high quality DNA reads of sufficient coverage to accurately define the DNA sequence (details in Supplementary Discussion). The method provides base-pair level resolution of breakpoints and can therefore be used not only to identify the number and type of SVs within a tumor genome but also to analyze the breakpoint DNA sequence as a way to gain insight into the mechanism responsible for generating the structural rearrangement.