The importance ascribed to different types of genome aberrations in cancer is frequently coupled directly to the technology available to measure them; classic cytogenetics demonstrated the functional significance of translocations in tumors with simple karyotypes, whereas loss of heterozygosity, CGH, and array-CGH studies have led to an explosion of interest in recurrent copy-number aberrations. More recently, targeted [32] and whole genome exon resequencing [31] have demonstrated the importance of coding mutations. The Cancer Genome Atlas project [36] promises to increase drastically the number of known coding somatic mutations. However, it is likely that structural rearrangements in tumor genomes are as important to tumor biology and the development of biomarkers and therapeutics as are coding point mutations [37]. We have demonstrated that ESP provides direct access to the structural complexity of tumor genomes by identifying and cloning all classes of structural rearrangements, including fusion genes and their transcripts. ESP also proved to be a powerful tool for analysis of structural polymorphism present in the normal human genome [39]. Moreover, identification of the HYDIN gene fusion by ESP reveals that duplicon-mediated genome rearrangements can result in expression of structurally novel genes. Using this approach, it is also possible to survey the spectrum of mutations and/or SNPs present in a tumor genome in an unbiased manner.
Many of the recurrent breakpoints that we identified arise from micro-rearrangements of less than 2 Mb (Figure ). Although some of these rearrangements are likely to be novel structural polymorphisms, micro-rearrangements have also been observed in evolution [41] and in some tumors [43]. Because micro-rearrangements are largely invisible to cytogenetic techniques, the collection of the breakpoints reported in this paper provides an excellent resource for future studies of the mechanisms, prevalence, and consequences of these micro-rearrangements in tumorigenesis.
We sequenced BAC clones identified by ESP to localize and validate about 90 breakpoints in this and in a previous study [7]. To our knowledge, this is currently the largest collection of sequenced rearrangement breakpoints in cancer. Importantly, this collection can be easily extended as needed, because ESP also created the largest collection to date of sequence-ready breakpoint-spanning BAC clones, numbering in the hundreds. Most breakpoint-spanning BAC clones, including all BAC clones sequenced from primary tumors, contain single breakpoints. However, in the three cell lines, 17 clones containing multiple breakpoints were identified and confirmed by PCR. These observations were supported by DNA fingerprinting (Marra M, personal communication) [21]. The observed differences between the primary tumors and cell lines may be due to genomic heterogeneity (and consequently lower sequence coverage) of tumor samples, differences in tumor type and/or stage, or intrinsic differences in genomic organization between cell lines and primary tumors. It will be informative to perform ESP on primary breast tumors with copy-number profiles very similar to those of the cell lines studied here [10] and to establish the degree of structural similarity among samples with similar copy-number and expression profiles.
Our analyses of breakpoint junction sequences revealed that the overwhelming majority of identified rearrangements (96%) are consistent with aberrant NHEJ repair. This observation agrees with the previously reported predominant role of nonhomologous recombination in the generation of pathologic translocations [44] and in frequent rearrangements at chromosomal ends [45]. Although there are reports of associations between locations of cancer breakpoints and evolutionary breakpoints [46], ESP data did not reveal a significant association in our samples (data not shown).
We used sequenced breakpoints to refine the mapping of amplicon structures in MCF7 using PCR in seven independent BES clusters. This process identified breakpoint heterogeneity in five clusters (Figure and Additional data file 2 [Figure S3]). One explanation for this phenomenon is variability in the location of breakpoints in multiple fusions of the same loci, analogous to the variability of breakpoints in fusion genes in hematopoietic malignancies. Alternatively, the heterogeneity might reflect early events present in a minority of cells in the population. To our knowledge, this is the first example of structural heterogeneity observed at the molecular level in tumor genomes.
Analysis of SNPs in BAC end sequences identified elevated rates of SNPs in each tumor sample compared with the normal sample, with the ovarian tumor exhibiting a rate significantly above those of the other samples. Although the ability to distinguish somatic mutations from sequencing errors or germline mutations is limited in the present study, there is no reason to suspect that these confounding factors vary enough between samples to explain the observed differences. The mutational spectra of SNPs in these samples share some features with those from exon resequencing studies [31], but there are also many differences. These differences might be due to different mutational biases in coding regions, but further study is needed to support this hypothesis. Given that the BES arise from a genome-wide survey, it is not surprising that we identify few candidate mutations in coding regions. However, it is intriguing that even the relatively small numbers of putative mutations are enriched for zinc finger genes, including the known breast cancer oncogene ZNF217.
Using ESP it is possible to reconstruct tumor genome structure and evolution [4]. ESP data from the three breast cancer cell lines identify clones that fuse noncontiguous amplified loci, possibly suggesting functional coupling of co-amplified genes. The discovery of recurrent breakpoints and regularly spaced breakpoints in the cell-line data could be a molecular signature of breakage/fusion/bridge (B/F/B) cycles [7]. In some cases, ESP data suggest a specific temporal progression in which amplification follows translocations or deletions. For example, a cluster of 19 clones spans a 17;20 translocation in MCF7. This coverage is highly unlikely (P ) for a nonamplified locus, and PCR mapping confirmed identical breakpoints in these clones. The most parsimonious explanation is that the translocation preceded the amplification. In a second example, a cluster of six BT474 clones spans a deletion. Once again the simplest explanation is that the deletion preceded amplification of the surrounding locus, because a cluster of six clones is highly unlikely (P ) in a nonamplified locus. Interestingly, this deletion may truncate the THRA1 gene, as reported by Futreal and coworkers [25], and fuse it to the SCAP1 gene. Amplification of a breakpoint might occur because the fused genomic region encodes a fusion gene that confers a selective growth advantage. Alternatively, amplification might be a random byproduct of genomic instability near the rearrangement breakpoint. Regardless, the breakpoint information is valuable for determining the temporal evolution of tumor genome organization.
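The argument that large clone clusters are improbable at a nonamplified locus can be illustrated with a simple calculation. The sketch below is not the test used in this study; it assumes that the number of clones spanning a fixed genomic position follows a Poisson distribution whose mean equals the clonal coverage. The function names are hypothetical, and the mean of 0.7 is only the upper end of the coverage range reported for the libraries in this study, used here as a conservative illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k spanning clones when the mean is lam (lam > 0)."""
    # Work in log space so that large k does not overflow the factorial.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_upper_tail(k, lam, max_terms=1000):
    """Probability of k or more spanning clones, P(X >= k).

    The tail is summed directly rather than computed as 1 - CDF,
    which would lose all precision for very small probabilities.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, k + max_terms):
        term = poisson_pmf(i, lam)
        total += term
        if term < total * 1e-17:  # remaining terms no longer change the sum
            break
    return total


# Assumed mean: 0.7-fold clonal coverage (upper end of the reported range).
ASSUMED_COVERAGE = 0.7
for cluster_size in (6, 19):
    p = poisson_upper_tail(cluster_size, ASSUMED_COVERAGE)
    print(f"P(at least {cluster_size} clones | coverage {ASSUMED_COVERAGE}) = {p:.3g}")
```

Under these assumptions both cluster sizes are far in the tail of the distribution, which is the qualitative point made above; the exact values depend on the coverage of each library and on the model chosen.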
The identification of TMPRSS2 translocations in about 50% of prostate tumors [3] underscores the significance of structural rearrangements in solid tumors. Although our prostate sample does not contain the TMPRSS2 translocation (Rubin M, personal communication), ESP mapping and breakpoint sequencing provide numerous examples of possible gene fusions, including the previously published BCAS4/3 fusion in MCF7. Moreover, integration of public EST data with ESP data demonstrates that this approach can identify fusion transcripts en masse. We identified a fusion transcript that results from an evolutionarily recent rearrangement of the normal genome and obtained evidence for the first recurrent fusion transcript in breast cancer. In this study the clonal coverage of tumor genomes ranged from only 0.15-fold to 0.7-fold redundancy. It is probable that many additional gene fusions will be identified upon deeper paired-end analysis of both normal and tumor genomes and transcriptomes.
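The expectation that deeper coverage will uncover additional fusions can be made concrete with a rough estimate. The sketch below assumes a simple Poisson model of clone coverage, in which a breakpoint is detected if at least one clone spans it; the function name and the `carrier_fraction` parameter (the share of sampled DNA molecules that carry the rearrangement) are hypothetical and introduced only for illustration. Mapping ambiguity and insert-size effects are ignored.

```python
import math


def detection_probability(coverage, carrier_fraction=1.0):
    """Chance that at least one clone spans a given breakpoint.

    coverage         -- clonal coverage of the genome (fold redundancy)
    carrier_fraction -- assumed share of molecules carrying the breakpoint
    """
    effective_coverage = coverage * carrier_fraction
    return 1.0 - math.exp(-effective_coverage)


# 0.15 and 0.7 are the coverage extremes reported in this study;
# 2.0 and 5.0 are hypothetical deeper surveys.
for fold in (0.15, 0.7, 2.0, 5.0):
    print(f"{fold:>4}-fold coverage: {detection_probability(fold):.2f}")
```

Under this model, roughly 14% of breakpoints would be spanned at 0.15-fold coverage and about 50% at 0.7-fold, which suggests that a substantial share of rearrangements remains unsampled at the depths used here.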
The extension of ESP to multiple tumor types demonstrates that its application is not restricted to specific tumor types and that ESP functions well even with small tumor specimens. This is important because advances in diagnostics have resulted in a reduction in the average volume of many surgically excised tumors. For example, the average size of breast tumors excised before 1985 was 25 mm, whereas after 1985 it decreased to 21 mm [49], a 1.6-fold decrease in the volume of excised breast tumors. Moreover, tumor heterogeneity and normal cell admixture necessitate dissection, further reducing subsequent yields of tumor cell DNA. Finally, clinically annotated tumor specimens are an extremely valuable resource and should be used as sparingly as possible. Therefore, it is significant that we were able to construct a tumor BAC library from less than 20 mg of a frozen and partially necrotic tumor (B421).
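The volume comparison follows from the cubic scaling of volume with linear size. As a back-of-the-envelope check, assuming roughly spherical tumors whose diameters are the quoted average sizes (an idealization that is not stated in the text):

```latex
\frac{V_{\text{before 1985}}}{V_{\text{after 1985}}}
  = \left(\frac{d_{\text{before}}}{d_{\text{after}}}\right)^{3}
  = \left(\frac{25\ \text{mm}}{21\ \text{mm}}\right)^{3}
  \approx 1.7
```

This is close to the 1.6-fold figure quoted above; the exact value depends on the unrounded diameters and on the shape assumed.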
DNA yields from the tumors suggest that libraries comprising 200,000 to 400,000 clones are possible, meaning that the genomes of these tumors can be immortalized and made widely available. This study demonstrates the utility of ESP for whole genome screening of SNPs/mutations. The immortalization of the tumor genome in a clone library is important, because some studies report underestimation of the mutation load because of heterogeneity in tumors [50], and overcoming this problem will require either development of novel software or implementation of novel sequencing technologies that allow analysis of single DNA molecules [51]. Because clone libraries can be duplicated and their DNA pooled, it becomes feasible to perform large exon resequencing projects on small tumors, such as those of the breast and prostate. In addition, because BAC clones contain DNA from a single tumor cell, identification of rare SNPs/mutations in heterogeneous tumors is theoretically possible in a manner analogous to the identification of breakpoint heterogeneity in tumor amplicons reported here. Finally, the ability to rapidly identify sequence variants in DNA pools and to then recover the physical clone means that studies aimed at determining the biologic relevance of the variants are possible using established in vivo and in vitro assays.
ESP is less impeded by tumor heterogeneity or contamination by normal cells than is aCGH, because each end-sequenced clone originates from a single DNA molecule from a single cell. Deep sequencing of many clones allows one to overcome normal tissue admixture and enables direct measurements of heterogeneity and detection of rare events. Eventually it will be possible to apply techniques from metagenomics [52] to study the heterogeneous pool of cells that are present in early stage tumors, with the goal of identifying the earliest informative biomarkers and therapeutic targets. At present, the relatively high cost of ESP limits its application to a small number of tumors, but advances in massively parallel sequencing technologies capable of paired-end sequencing (for review, see [9]) will permit large-scale ESP studies at a fraction of the current cost. However, much of the cost savings realized by the current crop of next-generation sequencing technologies results from skipping the immortalization of the tumor genome as a clone library. Such cloning enables further sequencing of breakpoints and evaluation of their functional significance via in vitro and in vivo assays. Combining ESP with such assays will enable tumor progression studies aimed at identification of events linked to initiation, progression, and metastasis. Thus, although the selection of a particular implementation of ESP will be driven by the cost/benefit analysis for the specific goals of the project, paired-end sequencing approaches promise to revolutionize our understanding of the complex organization of the genomes of solid tumors.