This study demonstrates the potential of massively parallel sequencing technologies for the investigation of cancer genomes. By using a paired-end strategy, we were able to identify and characterize to the base-pair level acquired deletions, tandem duplications, inverted duplications, inversions and interchromosomal rearrangements, as well as obtain high-resolution copy-number information. The data show that the patterns of somatic structural variation encountered in the two cancers studied differ markedly from those found in the germline. They also allowed resolution of rearrangements previously detected by cytogenetic approaches, yielded previously unknown fusion transcripts, showed that many rearrangements occur within or between amplicons, and uncovered a distinctive pattern of somatic tandem duplication operating outside amplified regions. The results therefore illustrate the substantial amount of information on somatic structural change that will emerge from applying such approaches to the genomes of many cancer classes.
We used relatively small insert sizes in this study (200–500 bp), a strategy that has the advantages of tight size selection of DNA fragments (and therefore greater sensitivity for small intrachromosomal rearrangements) and straightforward PCR confirmation of variants. In contrast, using larger insert sizes, as in a recently published study of germline structural variants [10], has the advantage of greater genomic coverage per sequenced fragment, though with the drawback of increased difficulty with sequence annotation of breakpoints. It is likely that complete characterization of all rearrangements in a cancer cell line will require paired-end sequencing of several libraries of different insert sizes, allowing reads to fall outside any repetitive elements at the breakpoints, while maintaining the capacity to identify small insertions, deletions and genomic shards. Approaches using cDNA, such as paired-end diTags [16], show promise for identifying fusion genes and could readily be adapted to new sequencing technologies and used in combination with our protocol to annotate genomic and transcriptional consequences of rearrangements.
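To make the role of the insert-size bound concrete, the following is a minimal sketch of how a single anomalously mapped read pair might be assigned to a rearrangement class from its chromosomes, strands and mapped distance. It assumes a standard inward-facing paired-end library; the `ReadPair` type, the `classify_pair` function, the class labels and the default `max_insert` of 500 bp are hypothetical choices for illustration, not the pipeline used in this study. A single pair cannot separate an inversion from an inverted duplication, so both fall under one label here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapped positions of the two ends of one sequenced fragment (hypothetical type)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair: ReadPair, max_insert: int = 500) -> str:
    """Assign a read pair to a candidate rearrangement class.

    Assumes an inward-facing library: a normal pair has the lower-coordinate
    end on '+' and the higher-coordinate end on '-', no further apart than
    max_insert (an assumed upper bound on the library insert size).
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the two ends by genomic coordinate.
    (lo_pos, lo_strand), (hi_pos, hi_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    span = hi_pos - lo_pos

    if lo_strand == hi_strand:
        # Same-strand pairs: inversion or inverted duplication.
        return "inversion-type"
    if lo_strand == "+" and hi_strand == "-":
        # Correct orientation; too far apart suggests lost intervening sequence.
        return "concordant" if span <= max_insert else "deletion"
    # Outward-facing ends: signature of a tandem duplication.
    return "tandem-duplication-type"
```

Under these assumptions, the smallest deletion that can be called is set by how tightly `max_insert` can be drawn around the true fragment-size distribution, which is why tight size selection improves sensitivity for small intrachromosomal events. In practice, several independent pairs supporting the same junction would be required before a call is made.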
The digital read-out of copy number predicted aberrations as small as 30 kb, which were confirmed as genuine by mapping the actual breakpoints from paired sequence reads. This sensitivity is comparable to that of the current generation of array-CGH platforms, and the paired-end strategy has the additional potential to identify the actual breakpoints underlying a given copy number change. This additional information can be important for determining transcriptional effects associated with copy number changes. From the copy number data alone, it would be impossible to distinguish the genomic arrangement of the 11 acquired tandem duplications from that of the inverted duplication. Moreover, copy number analyses are blind to rearrangements such as balanced translocations and inversions. The capacity of massively parallel sequencing to reconstruct such rearrangements is important for the identification of oncogenic fusion genes.
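The idea of a digital copy-number read-out can be illustrated with a minimal sketch: reads are counted in fixed genomic windows, and the per-window ratio against a reference sample is rescaled so that the median window corresponds to the assumed baseline ploidy. The function names, the use of a reference sample, the median normalization and the default window size are all assumptions made for illustration; the source does not specify them, and a real analysis would add segmentation and mappability correction.

```python
from collections import Counter
from statistics import median


def binned_counts(positions, bin_size):
    """Count mapped read start positions per fixed-size window."""
    return Counter(pos // bin_size for pos in positions)


def copy_number_estimates(sample_positions, reference_positions,
                          bin_size=10_000, ploidy=2):
    """Estimate copy number per window from read counts (illustrative only).

    Windows with no reference reads are skipped. Ratios are rescaled so the
    median window equals `ploidy`, which assumes that most of the genome is
    at the baseline copy number.
    """
    sample = binned_counts(sample_positions, bin_size)
    reference = binned_counts(reference_positions, bin_size)

    # Per-window ratio of sample reads to reference reads.
    ratios = {b: sample.get(b, 0) / n for b, n in reference.items()}
    baseline = median(ratios.values())
    if baseline == 0:
        raise ValueError("median ratio is zero; cannot normalize")

    return {b: ploidy * r / baseline for b, r in sorted(ratios.items())}
```

Such a profile reports only how much of each window is present. A tandem duplication and an inverted duplication of the same segment would give the same counts, which is why the breakpoint-spanning read pairs are needed to tell them apart.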
Moreover, in contrast to other strategies for studying copy number such as array CGH, the paired-end strategy allows resolution to be improved simply by increasing the amount of sequence generated. This is effectively the situation in amplicons, permitting detailed annotation of copy number changes across the amplified regions that can be correlated with breakpoint locations. The complexity that emerges from the analysis of the NCI-H2171 amplicons implies that amplification involved an iterative process during which aberrant sister chromatid exchange to repair double-stranded DNA breaks led to progressive reorganization and expansion of the amplicons under selection pressure. It can be argued that the earliest rearrangements in the genesis of the amplicon will be those breakpoints that are themselves most amplified and that demarcate the greatest changes in copy number, as exemplified by both the PVT1-CHD7 fusion in the MYC amplicon of NCI-H2171 and the tandem insertion evident in the MYCN amplicon of NCI-H1770. The ability to extract quantitative read-out of breakpoint frequency and correlate this with copy number changes will be a powerful application of new sequencing technologies to explore the evolution of cancer amplicons.
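The dependence of copy-number resolution on sequencing depth can be expressed as simple arithmetic. The sketch below assumes that reads are sampled uniformly along the genome in proportion to local copy number, and asks how large a window must be to contain a chosen number of reads on average. The function name, its parameters and the uniform-sampling assumption are illustrative and are not taken from the study.

```python
def window_size_for_target_reads(genome_size, n_reads, target_reads,
                                 copy_number=2, ploidy=2):
    """Window length (bp) expected to contain `target_reads` reads.

    Assumes uniform sampling: the expected read density at baseline ploidy is
    n_reads / genome_size, scaled by copy_number / ploidy in altered regions.
    """
    if n_reads <= 0 or copy_number <= 0:
        raise ValueError("n_reads and copy_number must be positive")
    return target_reads * genome_size * ploidy / (n_reads * copy_number)


# Doubling the number of reads halves the window needed for the same count,
# and a tenfold amplified region reaches that count in a tenfold smaller window.
baseline = window_size_for_target_reads(3_000_000_000, 3_000_000, 100)
deeper = window_size_for_target_reads(3_000_000_000, 6_000_000, 100)
amplified = window_size_for_target_reads(3_000_000_000, 3_000_000, 100,
                                         copy_number=20)
```

Under this model an amplified region behaves as if it had been sequenced more deeply, which is the sense in which amplicons already show the finer resolution that additional sequence would give elsewhere in the genome.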
Our screen has identified four previously unknown rearrangements leading to structural alterations in mRNA transcripts, including two fusion genes and two tandem exon duplications. None of the genes involved has previously been implicated in cancer development, with the exception of PVT1, and it is difficult to ascertain the role these aberrations play here. There is precedent, for example, for partial tandem duplications of exons to produce oncogenic proteins, most convincingly with MLL in acute leukemia [19,20]. However, it is notable that one of the tandem duplications observed here affected a gene found in a chromosomal fragile site, GRID2, raising the possibility that its occurrence is more a reflection of genome instability and abnormal DNA repair pathways than oncogenicity. It may well be that a proportion of acquired genomic rearrangements, including those that generate abnormal or fusion transcripts, are ‘passenger’ events not associated with cancer development, as has been observed for point mutations [23].
This study demonstrates the potential of massively parallel sequencing technologies to annotate large numbers of somatically acquired genome-wide rearrangements in cancer to the base-pair level. In addition to the insights this provides into the diversity of aberrant processes sculpting the genome that underlie the evolution and development of cancer, it is anticipated that these technologies will lead to the identification of previously unknown fusion genes and other rearrangements that may be future therapeutic targets.